There are multiple ways to slice a dataframe on a given filter criterion,
but my preferred way is Boolean indexing. Within Boolean indexing itself,
you have a choice to make: build the matching Boolean set as a numpy.ndarray,
or build it as a pandas.Series.
Let's jump directly into a practical example to find the better of these two
options. Below I am creating a dataframe of shape (8, 4), and let's suppose we
need to slice it on the condition that column "A" equals 'aaa'.
import pandas as pd
import numpy as np

# Small example dataframe: 8 rows, 4 columns
df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})
df
Output:
     A      B  C   D
0  aaa    one  0   0
1  bbb    two  1   2
2  xxx    two  2   4
3  zzz  three  3   6
4  aaa    two  4   8
5  aaa    two  5  10
6  kkk    one  6  12
7  aaa  three  7  14
Now we form our matching Boolean set in two ways: as a numpy.ndarray and as a
pandas.Series. Each holds True at the positions where there is a match, i.e.
A == 'aaa'.
matchSeries = df['A'] == 'aaa'
matchNumpyArry = df['A'].values == 'aaa'
display(type(matchSeries), type(matchNumpyArry))
Output
pandas.core.series.Series
numpy.ndarray
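To make the point concrete, here is a minimal sketch showing that the two masks hold the same Boolean values and differ only in their container. It is an illustration, not part of the original notebook: the names `match_series`, `match_array`, and `same_values` are my own, and I use `to_numpy()` instead of `.values` because the pandas documentation recommends it when you want a guaranteed NumPy array.

```python
import numpy as np
import pandas as pd

# Same example dataframe as in the post
df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

# Mask as a pandas Series (keeps the dataframe's index)
match_series = df['A'] == 'aaa'

# Mask as a plain NumPy array (positional, no index attached)
match_array = df['A'].to_numpy() == 'aaa'

# Compare the underlying Boolean values element by element
same_values = np.array_equal(match_series.to_numpy(), match_array)
print(type(match_series).__name__, type(match_array).__name__, same_values)
```

The Series carries the index labels along with the values, while the array is just the raw Booleans, which is one plausible reason the array is cheaper to build.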
Let's find out how long it takes to form the Boolean set for the conditional
match. Here we check the performance by running the same operation 7 × 1000
times (7 runs of 1000 loops each).
%timeit -n 1000 matchNumpyArry = df['A'].values == 'aaa'
%timeit -n 1000 matchSeries = df['A'] == 'aaa'
Output
7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can see from the above output that the numpy.ndarray outperforms: its mean
time is 7 microseconds, which is ~40 times faster than the pandas.Series, which
takes a mean time of 284 microseconds.
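If you are not working in IPython or Jupyter, the `%timeit` magic is not available. The sketch below shows one way to run a similar comparison with the standard-library `timeit` module. The variable names `n`, `array_time`, and `series_time` are my own, and the numbers you get will depend on your machine and your pandas/NumPy versions, so treat it as a rough check rather than a reproduction of the figures above.

```python
import timeit

import numpy as np
import pandas as pd

# Same example dataframe as in the post
df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

n = 1000  # loops per measurement, as in `%timeit -n 1000`

# Average seconds per loop for building each kind of mask
array_time = timeit.timeit(lambda: df['A'].values == 'aaa', number=n) / n
series_time = timeit.timeit(lambda: df['A'] == 'aaa', number=n) / n

print(f"ndarray mask: {array_time * 1e6:.1f} us per loop")
print(f"Series mask:  {series_time * 1e6:.1f} us per loop")
```

Unlike `%timeit`, this takes a single measurement of 1000 loops instead of 7 runs, so it reports no standard deviation.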
Let's try a negative test, where we match against a value that does not occur
in the column. Here too you can see from the result that the numpy.ndarray
outperforms the pandas.Series.
%timeit -n 1000 matchNumpyArry = df['A'].values == 'xyz'
%timeit -n 1000 matchSeries = df['A'] == 'xyz'
Output
17 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
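For the negative case it helps to see what the mask and the resulting slice look like. The following is a small sketch under my own naming (`no_match` and `empty_slice` are not from the original notebook): since 'xyz' never occurs in column A, every entry of the mask is False and the slice has no rows.

```python
import numpy as np
import pandas as pd

# Same example dataframe as in the post
df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

# Mask for a value that does not occur in column A
no_match = df['A'].to_numpy() == 'xyz'

# Slicing with an all-False mask keeps the columns but no rows
empty_slice = df[no_match]

print(no_match.any())      # is there any match at all?
print(empty_slice.shape)   # rows and columns of the slice
```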
Let's now use this Boolean set to slice the dataframe and check the cost of
slicing with the numpy.ndarray vs the pandas.Series.
As shown below, there is not much time saved over the 7 × 1000 runs, but numpy
still leads.
%timeit -n 1000 df[matchNumpyArry]
%timeit -n 1000 df[matchSeries]
Output
448 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
510 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Result of the dataframe after slicing both ways:
display(df[matchNumpyArry],df[matchSeries])
Output
     A      B  C   D
0  aaa    one  0   0
4  aaa    two  4   8
5  aaa    two  5  10
7  aaa  three  7  14

     A      B  C   D
0  aaa    one  0   0
4  aaa    two  4   8
5  aaa    two  5  10
7  aaa  three  7  14
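Rather than comparing the two printed tables by eye, you can check programmatically that both slices are identical. This is a minimal sketch with my own variable names (`by_array`, `by_series`), using `DataFrame.equals`, which compares shape, values, and dtypes.

```python
import numpy as np
import pandas as pd

# Same example dataframe as in the post
df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

match_array = df['A'].to_numpy() == 'aaa'   # NumPy mask
match_series = df['A'] == 'aaa'             # pandas Series mask

by_array = df[match_array]
by_series = df[match_series]

# True when both slices hold the same rows, values, and dtypes
print(by_array.equals(by_series))
# Original row labels that survived the filter
print(by_array.index.tolist())
```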
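Conclusion: On this small 8-row dataframe, building the Boolean mask as a
numpy.ndarray outperforms the pandas.Series approach: it is roughly 40 times
faster when forming the mask, and slightly faster (448 µs vs 510 µs) when
slicing.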
Data Science with…Python :)
Post Reference: Vikram Aristocratic Elfin