Aristocratic Elfin Share

Tuesday, October 29, 2019

Comparing Two Dataframes and Showing the Differences Using Stack

Comparing two dataframes and showing where they differ is straightforward with the pandas stack method, so let's jump straight to the solution.

The steps to compare two dataframes are:

· Define dataframes df1 and df2 with a few differing values
· Create a boolean dataframe from df1 != df2; it holds True wherever the two differ
· Use that boolean dataframe as a mask on df1 and on df2, which keeps the differing values and replaces everything else with NaN
· Stack each masked dataframe into long format
· Concatenate the stacked df1 and df2 with axis=1

Below we create two dataframes that differ in the 2nd and 3rd rows.

import pandas as pd

# First dataframe
df1 = pd.DataFrame({"A": [1, 3, 5, 7],
                    "B": [2, 4, 6, 8],
                    "C": [0, 2, 4, 6]})

# Second dataframe
df2 = pd.DataFrame({"A": [1, 3, 9, 7],
                    "B": [2, 2, 6, 8],
                    "C": [0, 2, 5, 6]})

df1

Output:

A B C

0 1 2 0

1 3 4 2

2 5 6 4

3 7 8 6

df2

A B C

0 1 2 0

1 3 2 2

2 9 6 5

3 7 8 6

Below we compare the two dataframes with df1 != df2 and store the boolean result in a new dataframe, df3. df3 holds True wherever the two dataframes differ and False wherever the values are the same.

We then use df3 as a boolean mask on df1, i.e. df1[df3]. This is indexing rather than multiplication: the result keeps the df1 values at the positions where df3 is True and holds NaN everywhere else. Because NaN is a float, the integer values are shown as floats (4.0, 5.0).

df3 = df1!=df2

df1[df3]

Output

A B C

0 NaN NaN NaN

1 NaN 4.0 NaN

2 5.0 NaN 4.0

3 NaN NaN NaN
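The masking step can also be written with DataFrame.where, which some readers find more explicit. The sketch below rebuilds the two dataframes from this post and checks that both spellings give the same result; the names mask, masked_by_index and masked_by_where are illustrative and do not appear in the original code.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 3, 5, 7], "B": [2, 4, 6, 8], "C": [0, 2, 4, 6]})
df2 = pd.DataFrame({"A": [1, 3, 9, 7], "B": [2, 2, 6, 8], "C": [0, 2, 5, 6]})

# True wherever the two dataframes disagree
mask = df1 != df2

# Boolean indexing keeps values where the mask is True, NaN elsewhere
masked_by_index = df1[mask]

# DataFrame.where spells out the same condition explicitly
masked_by_where = df1.where(mask)

print(masked_by_index.equals(masked_by_where))
print(int(mask.sum().sum()), "cells differ")
```

Either form works for the steps that follow; where simply makes the "keep if True" rule visible in the code.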

df1[df3] and df2[df3] are each stacked into long format and then concatenated with axis=1 to form the difference dataframe. In the output below, only the three differing cells remain.

pd_diff=pd.concat([df1[df3].stack().to_frame(),df2[df3].stack().to_frame()],axis=1)

pd_diff.columns = ["df1_values","df2_values"]

pd_diff

Output

     df1_values  df2_values
1 B         4.0         2.0
2 A         5.0         9.0
  C         4.0         5.0

The above code can be rewritten without the intermediate df3:

pd_d=pd.concat([df1[df1!=df2].stack().to_frame(),df2[df1!=df2].stack().to_frame()],axis=1)

pd_d.columns = ["df1_values","df2_values"]

pd_d

Output

     df1_values  df2_values
1 B         4.0         2.0
2 A         5.0         9.0
  C         4.0         5.0
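If the comparison is needed more than once, the steps can be wrapped in a small function. This is a minimal sketch, not part of the original post: diff_frames is a hypothetical name, and it assumes both dataframes have identical index and column labels. The explicit dropna() is an added safeguard, since whether stack() discards NaN cells by itself depends on the installed pandas version.

```python
import pandas as pd


def diff_frames(left, right):
    """Return a long-format frame of the cells where left and right differ.

    Assumes both frames share the same index and column labels.
    """
    mask = left != right
    # stack() turns each frame into a Series indexed by (row, column);
    # dropna() removes the masked-out cells whether or not the installed
    # pandas version already drops NaN while stacking.
    left_long = left[mask].stack().dropna()
    right_long = right[mask].stack().dropna()
    out = pd.concat([left_long, right_long], axis=1)
    out.columns = ["df1_values", "df2_values"]
    return out


df1 = pd.DataFrame({"A": [1, 3, 5, 7], "B": [2, 4, 6, 8], "C": [0, 2, 4, 6]})
df2 = pd.DataFrame({"A": [1, 3, 9, 7], "B": [2, 2, 6, 8], "C": [0, 2, 5, 6]})

pd_diff = diff_frames(df1, df2)
print(pd_diff)
```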

Data Science with…Python

Post Reference: Vikram

Monday, October 21, 2019

Using Generator Expression to Read multiple files dynamically and store data in a single Pandas dataframe

In the previous two posts we read multiple files the traditional way, with a loop, and stored the data in one dataframe with the filename as an additional column. In this post we do the same with the help of a generator expression.

Let's look at how to read multiple files: we call the pandas concat function and pass it a generator expression that calls read_csv once for each filename in the list.

import pandas as pd

import glob

read_file = glob.glob('emp*.csv')

read_file

type(read_file)

Output:

list

df_all=pd.concat(pd.read_csv(file) for file in read_file)

df_all.reset_index(drop = True)

Output:

emp_no emp_name emp_sal

0 E1001 Aayansh 1000

1 E1002 Prayansh 2000

2 E1003 Rishika 1500

3 E1004 Mishty 900

4 E2001 Sidhika 1000

5 E2002 Kavita 2000

6 E2003 Happy 1500

7 E2004 Sandeep 900
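For readers who want to try this without the original CSV files, here is a self-contained sketch. It writes two small files into a temporary directory (the file contents reuse a few rows from the output above; the temporary directory and the names samples, tmp_dir and paths are assumptions for illustration) and passes ignore_index=True so the index is renumbered inside the concat call itself.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample files, reusing a few rows from the output above
samples = {
    "emp1.csv": "emp_no,emp_name,emp_sal\nE1001,Aayansh,1000\nE1002,Prayansh,2000\n",
    "emp2.csv": "emp_no,emp_name,emp_sal\nE2001,Sidhika,1000\nE2002,Kavita,2000\n",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in samples.items():
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write(text)

    # Sort the matches so the files are always read in the same order
    paths = sorted(glob.glob(os.path.join(tmp_dir, "emp*.csv")))

    # The generator expression yields one DataFrame per file;
    # ignore_index=True renumbers the rows 0..n-1 in the same call
    df_all = pd.concat((pd.read_csv(path) for path in paths), ignore_index=True)

print(df_all)
```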

Let’s add the filename to the dataframe by calling the assign function and passing the new column assignment to it. Below is the code:

df_all=pd.concat(pd.read_csv(file).assign(filename=file) for file in read_file)

df_all.reset_index(drop = True)

Output

emp_no emp_name emp_sal filename

0 E1001 Aayansh 1000 emp1.csv

1 E1002 Prayansh 2000 emp1.csv

2 E1003 Rishika 1500 emp1.csv

3 E1004 Mishty 900 emp1.csv

4 E2001 Sidhika 1000 emp2.csv

5 E2002 Kavita 2000 emp2.csv

6 E2003 Happy 1500 emp2.csv

7 E2004 Sandeep 900 emp2.csv
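When the glob pattern includes a directory, the value stored by assign is the whole path rather than just emp1.csv. One possible way to keep only the file name is os.path.basename, sketched below with a temporary directory and one-row files that are made up for the example.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical one-row files standing in for emp1.csv and emp2.csv
    rows = [("emp1.csv", "E1001,Aayansh,1000"), ("emp2.csv", "E2001,Sidhika,1000")]
    for name, row in rows:
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write("emp_no,emp_name,emp_sal\n" + row + "\n")

    paths = sorted(glob.glob(os.path.join(tmp_dir, "emp*.csv")))

    # os.path.basename strips the directory part, leaving e.g. "emp1.csv"
    df_all = pd.concat(
        (pd.read_csv(path).assign(filename=os.path.basename(path)) for path in paths),
        ignore_index=True,
    )

print(df_all["filename"].tolist())
```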

Previous posts:

· Keeping track of which data comes from which file

· Reading multiple files dynamically and storing data in a single Pandas dataframe


Sunday, October 20, 2019

Python: Keeping track of which data comes from which file

If a dataframe holds data from multiple files and you want a column recording which file each row came from, you can use the DataFrame assign function to create that column while reading each file.

We first create an empty dataframe with the column names.

To find the files to read in a directory we use the glob module.

Then, while reading each file into a dataframe, we use the assign function to create a new column that stores the filename alongside the data.

We have two files, emp1.csv and emp2.csv, in our Python working directory. Let's read the file names with the glob module.

import pandas as pd

import glob

read_file = glob.glob('emp*.csv')

read_file

Output: ['emp1.csv', 'emp2.csv']

type(read_file)

Output: list

As you can see, read_file is a plain Python list holding the names of the files in the current working directory that match the pattern emp*.csv.
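To make the matching behaviour concrete, the sketch below creates a temporary directory with two matching files and one that does not match, then lists what glob returns. The directory, the file notes.txt and the variable names are invented for the example. Because glob does not promise any particular order, the result is sorted.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical directory contents: two matching files and one that is not
    for name in ["emp2.csv", "emp1.csv", "notes.txt"]:
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write("")

    matches = glob.glob(os.path.join(tmp_dir, "emp*.csv"))
    # glob does not guarantee any ordering, so sort for repeatable results
    names = sorted(os.path.basename(path) for path in matches)

print(names)  # ['emp1.csv', 'emp2.csv']
```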

Now we need to read the contents of all the matched files and keep the data in a single dataframe.

First we create an empty dataframe with the column names plus an additional column to store the name of the file each row was read from.

Then we loop through the list of filenames (read_file) and pass each filename to the pd.read_csv function.

While reading each file into a dataframe, we use the assign function to add one more field holding the filename:

df_file = pd.read_csv(files).assign(filename=files)

After that we concatenate the dataframes into a single dataframe. In the output you can see which file each row came from.

df = pd.DataFrame(columns=['emp_no','emp_name','emp_sal','filename'])

for files in read_file:
    df_file = pd.read_csv(files).assign(filename=files)
    df = pd.concat([df, df_file], axis=0)

df

Output

emp_no emp_name emp_sal filename

0 E1001 Aayansh 1000 emp1.csv

1 E1002 Prayansh 2000 emp1.csv

2 E1003 Rishika 1500 emp1.csv

3 E1004 Mishty 900 emp1.csv

0 E2001 Sidhika 1000 emp2.csv

1 E2002 Kavita 2000 emp2.csv

2 E2003 Happy 1500 emp2.csv

3 E2004 Sandeep 900 emp2.csv
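In the output above the index restarts at 0 for each file. A common variation, sketched here as an alternative rather than as the method used in this post, is to collect the per-file dataframes in a list and concatenate once at the end with ignore_index=True. To keep the example self-contained it reads from in-memory strings instead of files on disk; the samples dictionary and the frames list are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for emp1.csv and emp2.csv
samples = {
    "emp1.csv": "emp_no,emp_name,emp_sal\nE1001,Aayansh,1000\nE1002,Prayansh,2000\n",
    "emp2.csv": "emp_no,emp_name,emp_sal\nE2001,Sidhika,1000\nE2002,Kavita,2000\n",
}

frames = []
for name, text in samples.items():
    # read_csv accepts a file-like object as well as a path
    frames.append(pd.read_csv(io.StringIO(text)).assign(filename=name))

# One concat at the end; ignore_index=True gives a single 0..n-1 index
df = pd.concat(frames, ignore_index=True)
print(df)
```

With real files, the loop would iterate over read_file and pass each filename to pd.read_csv, exactly as in the code above.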

df.groupby(['filename']).count()

          emp_no  emp_name  emp_sal
filename
emp1.csv       4         4        4
emp2.csv       4         4        4
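Note that count() reports the number of non-missing values in each column, so the figures can differ between columns when data is missing. If the goal is simply the number of rows per file, groupby(...).size() is one option. The small dataframe below is made up for the illustration, including the missing salary.

```python
import pandas as pd

# Hypothetical data with one missing salary to show the difference
df = pd.DataFrame({
    "emp_no": ["E1001", "E1002", "E2001"],
    "emp_sal": [1000, None, 1000],
    "filename": ["emp1.csv", "emp1.csv", "emp2.csv"],
})

# size() counts rows per group, regardless of missing values
rows_per_file = df.groupby("filename").size()

# count() counts non-missing values per column within each group
non_null_per_file = df.groupby("filename").count()

print(rows_per_file)
print(non_null_per_file)
```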


Reading multiple files dynamically and storing data in a single Pandas dataframe

Here we will make use of the glob module, which returns the names of the files matching a pattern as a list.

We have two files, emp1.csv and emp2.csv, in our Python working directory. Let's read the file names with the glob module.

import pandas as pd

import glob

read_file = glob.glob('emp*.csv')

read_file

Output: ['emp1.csv', 'emp2.csv']

type(read_file)

Output: list

As you can see, read_file is a plain Python list holding the names of the files in the current working directory that match the pattern emp*.csv.

Now we need to read the contents of all the matched files and keep the data in a single dataframe.

First we create an empty dataframe with the column names; after that we use the pandas concat function inside a loop to append each file's contents to what has already been read.

df = pd.DataFrame(columns=['emp_no','emp_name','emp_sal'])

for files in read_file:
    df_file = pd.read_csv(files)
    df = pd.concat([df, df_file], axis=0)

df

Output

emp_no emp_name emp_sal

0 E1001 Aayansh 1000

1 E1002 Prayansh 2000

2 E1003 Rishika 1500

3 E1004 Mishty 900

0 E2001 Sidhika 1000

1 E2002 Kavita 2000

2 E2003 Happy 1500

3 E2004 Sandeep 900
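One case worth handling is a pattern that matches no files: pd.concat raises a ValueError when it is given nothing to concatenate. The sketch below shows one possible guard. read_all is a hypothetical helper, not something from pandas or from this post, and the temporary directory exists only to make the example runnable anywhere.

```python
import glob
import os
import tempfile

import pandas as pd

COLUMNS = ["emp_no", "emp_name", "emp_sal"]


def read_all(pattern, columns):
    """Read every CSV matching pattern into one frame (hypothetical helper)."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        # pd.concat raises ValueError when given nothing to concatenate,
        # so return an empty frame with the expected columns instead
        return pd.DataFrame(columns=columns)
    return pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    pattern = os.path.join(tmp_dir, "emp*.csv")

    # Nothing matches yet, so the guard returns an empty frame
    no_files = read_all(pattern, COLUMNS)

    with open(os.path.join(tmp_dir, "emp1.csv"), "w") as handle:
        handle.write("emp_no,emp_name,emp_sal\nE1001,Aayansh,1000\n")

    # Now one file matches and is read normally
    one_file = read_all(pattern, COLUMNS)

print(len(no_files), len(one_file))
```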

