Thursday, October 31, 2019

Performance of Slicing Dataframe through Boolean Index with numpy.array vs pandas.series

There are multiple ways to slice a dataframe with a given filter criterion, but my preferred way is Boolean indexing. Within Boolean indexing itself, you still need to make a choice: build the matching Boolean mask as a numpy.array or as a pandas.Series.

Let's jump straight into practice to find the faster of these two options. Below I create a dataframe of shape (8, 4), and suppose we need to slice it on the condition that column "A" equals 'aaa'.


import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df
Output:
         A        B        C        D
0        aaa      one      0        0
1        bbb      two      1        2
2        xxx      two      2        4
3        zzz      three    3        6
4        aaa      two      4        8
5        aaa      two      5        10
6        kkk      one      6        12
7        aaa      three    7        14      


Now we form our matching Boolean mask in two ways:
numpy.array and
pandas.Series

Each holds True at every index position where there is a match, i.e. A == 'aaa'.

matchSeries = df['A'] == 'aaa'
matchNumpyArry = df['A'].values == 'aaa'

display(type(matchSeries),type(matchNumpyArry))

Output
pandas.core.series.Series
numpy.ndarray

Let's find out how long it takes to build the Boolean mask for the conditional match. Here we check the performance by running the same operation 7 × 1000 times:

%timeit -n 1000 matchNumpyArry = df['A'].values == 'aaa'
%timeit -n 1000 matchSeries = df['A'] == 'aaa'

Output
7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)       

You can see from the output above that numpy.array outperforms: its mean of 7 microseconds is ~40 times faster than pandas.Series, whose mean is 284 microseconds.
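Outside a notebook, the same comparison can be reproduced with the standard-library timeit module; a minimal sketch (the absolute numbers will vary by machine):

```python
import timeit

import pandas as pd

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split()})

# Comparing against .values works on the raw numpy array, skipping the
# Series construction and index bookkeeping that df['A'] == 'aaa' pays.
t_numpy = timeit.timeit(lambda: df['A'].values == 'aaa', number=1000)
t_series = timeit.timeit(lambda: df['A'] == 'aaa', number=1000)

print(f"numpy.array:   {t_numpy:.5f} s per 1000 loops")
print(f"pandas.Series: {t_series:.5f} s per 1000 loops")
```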

Let's try a negative test, matching a value that does not occur in the column. Here too, the results show numpy.array outperforming pandas.Series:

%timeit -n 1000 matchNumpyArry = df['A'].values == 'xyz'
%timeit -n 1000 matchSeries = df['A'] == 'xyz'

Output

17 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        

Let's now use these Boolean masks to slice the dataframe and compare the cost of slicing with numpy.array vs pandas.Series.
Here we can see that the time saving over the 1000 × 7 runs is much smaller, but numpy is still leading:

%timeit -n 1000 df[matchNumpyArry]
%timeit -n 1000 df[matchSeries]

Output

448 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
510 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)       

The dataframe sliced both ways gives identical results:

display(df[matchNumpyArry],df[matchSeries])

Output

         A        B        C        D
0        aaa      one      0        0
4        aaa      two      4        8
5        aaa      two      5        10
7        aaa      three    7        14
        
         A        B        C        D
0        aaa      one      0        0
4        aaa      two      4        8
5        aaa      two      5        10
7        aaa      three    7        14

Conclusion: building the Boolean mask with numpy.array makes the overall slicing experience outperform pandas.Series.
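Putting it together, the mask can be built and applied in a single expression; a small self-contained sketch of the pattern:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Build the boolean mask from the underlying numpy array and slice in
# one step; the result is identical to Series-based Boolean indexing.
sliced = df[df['A'].values == 'aaa']
print(sliced)
```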


Data Science with…Python :)
Post Reference: Vikram Aristocratic Elfin

Tuesday, October 29, 2019

Comparing Two Dataframe and Showing Difference Using Stack

It is quite simple to compare two dataframes and show the difference using the stack method of Pandas; let's jump straight into the solution.

Below are the steps to compare two dataframes:
  • First, define dataframes df1 and df2 with some differences
  • Create a new Boolean dataframe as the result of df1 != df2; it stores True wherever the values differ
  • Index df1 and df2 with the new Boolean dataframe
  • Stack the indexed df1 and df2 into long-format output
  • Concat the stacked df1 and df2 with axis=1


Here below we create two dataframes that differ in the 2nd and 3rd rows:


import pandas as pd

# first dataframe
df1 = pd.DataFrame({"A":[1,3,5,7],
                    "B":[2,4,6,8],
                    "C":[0,2,4,6]})

# Second dataframe
df2 = pd.DataFrame({"A":[1,3,9,7],
                    "B":[2,2,6,8],
                    "C":[0,2,5,6]})
df1

Output:
         A        B        C
0        1        2        0
1        3        4        2
2        5        6        4
3        7        8        6       

df2
         A        B        C
0        1        2        0
1        3        2        2
2        9        6        5
3        7        8        6

Here below we compare the two dataframes with df1 != df2 and store the Boolean result in a new dataframe df3. df3 holds True where the two dataframes differ and False where the values are the same.

Then we index df1 with df3, i.e. df1[df3], which yields a new dataframe with values only in the positions where df3 is True:

df3 = df1!=df2
df1[df3]

Output
         A        B        C
0        NaN      NaN      NaN
1        NaN      4.0      NaN
2        5.0      NaN      4.0
3        NaN      NaN      NaN

The result of df1[df3] and df2[df3] are stacked and concatenated with axis=1 to form a difference dataframe.

pd_diff=pd.concat([df1[df3].stack().to_frame(),df2[df3].stack().to_frame()],axis=1)
pd_diff.columns = ["df1_values","df2_values"]
pd_diff

Output

                  df1_values       df2_values
1        B        4.0              2.0
2        A        5.0              9.0
         C        4.0              5.0

The above code can be rewritten more compactly as:

pd_d=pd.concat([df1[df1!=df2].stack().to_frame(),df2[df1!=df2].stack().to_frame()],axis=1)
pd_d.columns = ["df1_values","df2_values"]
pd_d

Output

                  df1_values       df2_values
1        B        4.0              2.0
2        A        5.0              9.0
         C        4.0              5.0
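As an aside, newer pandas releases (1.1+) ship a built-in DataFrame.compare that produces a similar long-format diff; a minimal sketch (note its column labels are the fixed names 'self' and 'other' rather than custom ones):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 3, 5, 7], "B": [2, 4, 6, 8], "C": [0, 2, 4, 6]})
df2 = pd.DataFrame({"A": [1, 3, 9, 7], "B": [2, 2, 6, 8], "C": [0, 2, 5, 6]})

# compare() keeps only the cells that differ; rows and columns with no
# differences are dropped from the result by default.
diff = df1.compare(df2)
print(diff)
```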



Monday, October 21, 2019

Using a Generator Expression to Read Multiple Files Dynamically and Store the Data in a Single Pandas Dataframe

In the previous two posts we used the traditional way to read multiple files and store them dynamically, with the filename as an additional column in the dataframe. In this post we will do the same, but with the help of a generator expression.

Let's look at how we can read multiple files. Here we call pandas' concat function with an iterative read_csv as its parameter, which takes file names from a for loop, i.e. a generator expression.


import pandas as pd
import glob

read_file = glob.glob('emp*.csv')
read_file
type(read_file)

Output: 
list

df_all=pd.concat(pd.read_csv(file) for file in read_file)
df_all.reset_index(drop = True)

Output: 
         emp_no   emp_name emp_sal
0        E1001    Aayansh           1000
1        E1002    Prayansh          2000
2        E1003    Rishika           1500
3        E1004    Mishty            900
4        E2001    Sidhika           1000
5        E2002    Kavita            2000
6        E2003    Happy             1500
7        E2004    Sandeep           900
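The separate reset_index call can also be folded into the concat itself via ignore_index=True. A self-contained sketch of the same pattern, using in-memory StringIO buffers as stand-ins for the emp*.csv files (the file contents below are hypothetical):

```python
import io

import pandas as pd

# Stand-ins for the emp*.csv files on disk (hypothetical contents).
files = {
    'emp1.csv': "emp_no,emp_name,emp_sal\nE1001,Aayansh,1000\n",
    'emp2.csv': "emp_no,emp_name,emp_sal\nE2001,Sidhika,1000\n",
}

# The generator expression feeds read_csv one buffer at a time;
# ignore_index=True renumbers the combined rows 0..n-1 in one step.
df_all = pd.concat(
    (pd.read_csv(io.StringIO(text)) for text in files.values()),
    ignore_index=True)
print(df_all)
```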
        
Let's add the file name to the dataframe, which can be done by calling the assign function and passing the new column assignment to it. Below is the code:

df_all=pd.concat(pd.read_csv(file).assign(filename=file) for file in read_file)
df_all.reset_index(drop = True)


Output
emp_no   emp_name emp_sal  filename
0        E1001    Aayansh  1000     emp1.csv
1        E1002    Prayansh 2000     emp1.csv
2        E1003    Rishika  1500     emp1.csv
3        E1004    Mishty   900      emp1.csv
4        E2001    Sidhika  1000     emp2.csv
5        E2002    Kavita   2000     emp2.csv
6        E2003    Happy    1500     emp2.csv
7        E2004    Sandeep  900      emp2.csv

