About Me

My photo
Mumbai, Maharastra, India
He has more than 7.6 years of experience in the software development. He has spent most of the times in web/desktop application development. He has sound knowledge in various database concepts. You can reach him at viki.keshari@gmail.com https://www.linkedin.com/in/vikrammahapatra/ https://twitter.com/VikramMahapatra http://www.facebook.com/viki.keshari

Search This Blog

Tuesday, October 29, 2019

Comparing Two Dataframe and Showing Difference Using Stack

It is quite simple to compare two dataframe and show the difference using stack method of Pandas, lets directly jump into the solution

Below are the steps to be taken to compare two dataframe
  • ·          First define dataframe df1 and df2 with some difference
  • ·          Create a new boolean dataframe with the result of df1!=df2, this will store true at the place where there is difference
  • ·          Then self multiply df1 and df2 with newly Boolean dataframe  
  • ·          Then stack the dataframe df1 and df2 in long format output
  • ·          Then concat stacked df1 and df2 with axis=1


Here below we are creating two dataframe with some difference in 2nd and 3rd row


import pandas as pd
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import glob

# first dataframe
df1 = pd.DataFrame({"A":[1,3,5,7],
                    "B":[2,4,6,8],
                    "C":[0,2,4,6]})

# Second dataframe
df2 = pd.DataFrame({"A":[1,3,9,7],
                    "B":[2,2,6,8],
                    "C":[0,2,5,6]})
df1

Output:
         A        B        C
0        1        2        0
1        3        4        2
2        5        6        4
3        7        8        6       

df2
         A        B        C
0        1        2        0
1        3        2        2
2        9        6        5
3        7        8        6

Here below we are comparing two dataframe df1!=df2 and stores the Boolean in a new dataframe df3, df3 hold True value where there is difference in two dataframe, and false where there is same value

Then we are multiplying df1 with df3, i.e. df1[df3] , which will result in new dataframe with values only in the position where there is TRUE value in df3  

df3 = df1!=df2
df1[df3]

Output
A        B        C
0        NaN      NaN      NaN
1        NaN      4.0      NaN
2        5.0      NaN      4.0
3        NaN      NaN      NaN

The result of df1[df3] and df2[df3] are stacked and concatenated with axis=1 to form a difference dataframe.

pd_diff=pd.concat([df1[df3].stack().to_frame(),df2[df3].stack().to_frame()],axis=1)
pd_diff.columns = ["df1_values","df2_values"]
pd_diff

Output

df1_values       df2_values
1        B        4.0      2.0
2        A        5.0      9.0
C        4.0      5.03    

The above code can be re-written as

pd_d=pd.concat([df1[df1!=df2].stack().to_frame(),df2[df1!=df2].stack().to_frame()],axis=1)
pd_d.columns = ["df1_values","df2_values"]
pd_d

Output

df1_values       df2_values
1        B        4.0      2.0
2        A        5.0      9.0
C        4.0      5.03    


Data Science with…Python J
Post Reference: Vikram Aristocratic Elfin Share

No comments:

Post a Comment