About Me

Mumbai, Maharashtra, India
He has more than 7.6 years of experience in software development, most of it spent in web/desktop application development. He has sound knowledge of various database concepts. You can reach him at viki.keshari@gmail.com https://www.linkedin.com/in/vikrammahapatra/ https://twitter.com/VikramMahapatra http://www.facebook.com/viki.keshari

Thursday, October 31, 2019

Performance of Slicing a Dataframe through Boolean Index with numpy.array vs pandas.Series

There are multiple ways to slice a dataframe on a given filter criterion, but my preferred way is the Boolean index. Even with a Boolean index, you still have to choose how to form the matching Boolean set: as a numpy.array or as a pandas.Series.

Let's jump straight into a practical comparison to find the better of these two options. Below I create a dataframe of shape (8, 4); suppose we need to slice it on the condition that column "A" equals 'aaa'.


import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df
Output:
         A        B        C        D
0        aaa      one      0        0
1        bbb      two      1        2
2        xxx      two      2        4
3        zzz      three    3        6
4        aaa      two      4        8
5        aaa      two      5        10
6        kkk      one      6        12
7        aaa      three    7        14      


Now we form the matching Boolean set in two ways, as a
numpy.array and
pandas.Series

Each holds True at the positions where there is a match, i.e. A == 'aaa'.

matchSeries = df['A'] == 'aaa'
matchNumpyArry = df['A'].values == 'aaa'

display(type(matchSeries),type(matchNumpyArry))

Output
pandas.core.series.Series
numpy.ndarray
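As a quick sanity check, the sketch below confirms that the two Boolean sets carry the same values and differ only in their container type. It rebuilds the same dataframe so it runs on its own. The names `match_series` and `match_array` are illustrative, and `.to_numpy()` is used in place of `.values` because the pandas documentation recommends it for getting a NumPy array; that substitution is mine, not the post's.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

match_series = df['A'] == 'aaa'            # pandas.Series, carries the index
match_array = df['A'].to_numpy() == 'aaa'  # plain numpy.ndarray, no index

# Same True/False values in both containers
same_values = np.array_equal(match_series.to_numpy(), match_array)
print(same_values)
print(match_array.tolist())
```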

Let's find out how long it takes to form the Boolean set for the match. We check the performance by running the same operation 1000 times per run, over 7 runs.

%timeit -n 1000 matchNumpyArry = df['A'].values == 'aaa'
%timeit -n 1000 matchSeries = df['A'] == 'aaa'

Output
7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)       

You can see from the above output that numpy.array wins: its mean is 7 microseconds per loop, roughly 40 times faster than pandas.Series, which takes a mean of 284 microseconds.
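The `%timeit` magic only works inside IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard library's `timeit` module. The helper `time_per_call` is a hypothetical name, and it reports the best of 7 runs rather than the mean ± std. dev. that `%timeit` prints, so the numbers will not match the ones above exactly and will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})


def time_per_call(func, number=1000, repeat=7):
    """Return the best per-call time in microseconds over `repeat` runs."""
    runs = timeit.repeat(func, number=number, repeat=repeat)
    return min(runs) / number * 1e6


numpy_us = time_per_call(lambda: df['A'].to_numpy() == 'aaa')
series_us = time_per_call(lambda: df['A'] == 'aaa')

print(f"numpy mask:  {numpy_us:.1f} µs per call")
print(f"Series mask: {series_us:.1f} µs per call")
```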

Let's try a negative test, matching a value that does not occur in the column. Here too, the result shows numpy.array outperforming pandas.Series.

%timeit -n 1000 matchNumpyArry = df['A'].values == 'xyz'
%timeit -n 1000 matchSeries = df['A'] == 'xyz'

Output

17 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
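For reference, here is a small sketch of what the negative test produces, assuming the same dataframe as above. The variable name `no_match` is illustrative. Since 'xyz' never occurs in column "A", the Boolean set is all False and slicing with it gives an empty dataframe that still has the original columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

no_match = df['A'].to_numpy() == 'xyz'  # no row has A == 'xyz'
empty_slice = df[no_match]

print(no_match.any())             # whether any row matched
print(empty_slice.empty)          # whether the slice has no rows
print(list(empty_slice.columns))  # columns survive the slice
```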
        

Let's now use these Boolean sets to slice the dataframe and compare the cost of slicing with numpy.array vs pandas.Series.
Below we can see that the saving is much smaller over 7 runs of 1000 loops (about 448 µs vs 510 µs per loop), but numpy still leads.

%timeit -n 1000 df[matchNumpyArry]
%timeit -n 1000 df[matchSeries]

Output

448 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
510 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)       

Result of slicing the dataframe both ways:

display(df[matchNumpyArry],df[matchSeries])

Output

         A        B        C        D
0        aaa      one      0        0
4        aaa      two      4        8
5        aaa      two      5        10
7        aaa      three    7        14
        
         A        B        C        D
0        aaa      one      0        0
4        aaa      two      4        8
5        aaa      two      5        10
7        aaa      three    7        14
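Rather than comparing the two printed tables by eye, the equality can be checked in code. The sketch below uses `DataFrame.equals` on the two slices; the variable names are illustrative. One point worth keeping in mind: a Boolean array is applied by position, while a Boolean Series is aligned on its index, so the two agree here because both were built from the same dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

match_series = df['A'] == 'aaa'
match_array = df['A'].to_numpy() == 'aaa'

by_array = df[match_array]    # mask applied by position
by_series = df[match_series]  # mask aligned on the index

print(by_array.equals(by_series))
print(by_array.index.tolist())
```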

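Conclusion: In these measurements, using a numpy.array as the Boolean index outperforms using a pandas.Series, by a wide margin when forming the Boolean set and by a small one when slicing.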


Data Science with…Python :)
Post Reference: Vikram Aristocratic Elfin
