

Saturday, November 2, 2019

Performance of Various Slicing Method of Pandas Dataframe

There are multiple ways to slice your data, but let's see which one is the most efficient. Let's take a DataFrame and compare the various slicing methods.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df
Output:
         A        B        C        D
0        aaa      one      0        0
1        bbb      two      1        2
2        xxx      two      2        4
3        zzz      three    3        6
4        aaa      two      4        8
5        aaa      two      5        10
6        kkk      one      6        12
7        aaa      three    7        14               

Now let's slice the data with the condition A == 'aaa' using various methods:

%timeit -n 1000 df[df['A'].values == 'aaa']
%timeit -n 1000 df[df['A'] == 'aaa']
%timeit -n 1000 df.query('A == "aaa"')
%timeit -n 1000 df[df["A"]=='aaa']
%timeit -n 1000 df[df["A"].isin(['aaa'])]
%timeit -n 1000 df.set_index('A', append=True, drop=False).xs('aaa', level=1,drop_level=True)
%timeit -n 1000 df.iloc[np.where(df['A']=='aaa')]


Output

454 µs ± 7.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 4.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.8 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
710 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.58 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
855 µs ± 6.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here you can see that the best performance is achieved by Boolean indexing with the raw NumPy array (df['A'].values == 'aaa'), followed by isin and the plain Series comparisons. The slowest of all are the query method and the cross-section (set_index + xs) method.
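These timings come from a tiny 8-row frame, where fixed per-call overhead dominates. As a rough sanity check (a minimal sketch of my own, with a made-up one-million-row frame; the column names and sizes are assumptions, not from the original post), you can rerun the two leading approaches on larger data and see whether the NumPy-array mask still wins:

import numpy as np
import pandas as pd
from timeit import timeit

# Hypothetical larger frame: one million rows with a few repeated labels in column A
big = pd.DataFrame({'A': np.random.choice(['aaa', 'bbb', 'xxx', 'zzz'], size=1_000_000),
                    'C': np.arange(1_000_000)})

# Boolean mask built from the raw NumPy array vs. from the pandas Series
t_numpy = timeit(lambda: big[big['A'].values == 'aaa'], number=100)
t_series = timeit(lambda: big[big['A'] == 'aaa'], number=100)

print(f"numpy mask : {t_numpy:.3f} s for 100 loops")
print(f"series mask: {t_series:.3f} s for 100 loops")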

Data Science with…Python :)
Post Reference: Vikram Aristocratic Elfin Share

Thursday, October 31, 2019

Performance of Slicing Dataframe through Boolean Index with numpy.array vs pandas.series

There are multiple ways to slice your DataFrame for a given filter criterion, but my preferred way is Boolean indexing. Within Boolean indexing itself, though, you need to choose between forming the matching Boolean mask with numpy.array and forming it with pandas.Series.

Let's jump straight into a practical example to find the optimal of these two options. Below I create a DataFrame of shape (8, 4), and suppose we need to slice it on the condition that column 'A' equals 'aaa'.


import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df
Output:
         A        B        C        D
0        aaa      one      0        0
1        bbb      two      1        2
2        xxx      two      2        4
3        zzz      three    3        6
4        aaa      two      4        8
5        aaa      two      5        10
6        kkk      one      6        12
7        aaa      three    7        14      


Now we form the matching Boolean mask in two ways, with numpy.array and with pandas.Series. Each holds True at the index positions where there is a match, i.e. A == 'aaa'.

matchSeries = df['A'] == 'aaa'
matchNumpyArry = df['A'].values == 'aaa'

display(type(matchSeries),type(matchNumpyArry))

Output
pandas.core.series.Series
numpy.ndarray
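
As an aside (not from the original post): df['A'].values hands back the column's underlying NumPy array, so comparing it with == is a plain vectorized NumPy operation, while df['A'] == 'aaa' builds a brand-new pandas Series complete with the row index. A quick way to see the difference:

arr_mask = df['A'].values == 'aaa'   # bare numpy.ndarray of booleans, no index
ser_mask = df['A'] == 'aaa'          # pandas.Series: the same booleans plus the row index

print(arr_mask)        # [ True False False False  True  True False  True]
print(ser_mask.index)  # RangeIndex(start=0, stop=8, step=1)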

Let's find out how long it takes to build the Boolean mask for the conditional match. Here we check the performance by running the same operation 7 × 1000 times.

%timeit -n 1000 matchNumpyArry = df['A'].values == 'aaa'
%timeit -n 1000 matchSeries = df['A'] == 'aaa'

Output
7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)       

You can see from the above output that numpy.array outperforms: its mean time is 7 microseconds, roughly 40 times faster than pandas.Series, whose mean time is 284 microseconds.

Let's try a negative test, matching a criterion that does not occur in the data. Here too, you can see from the result that numpy.array outperforms pandas.Series.

%timeit -n 1000 matchNumpyArry = df['A'].values == 'xyz'
%timeit -n 1000 matchSeries = df['A'] == 'xyz'

Output

17 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
284 µs ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        

Now let's use these Boolean masks to slice the DataFrame and check the cost of slicing with numpy.array versus pandas.Series.
As shown below, there is not much time saved over the 7 × 1000 runs, but NumPy is still leading.

%timeit -n 1000 df[matchNumpyArry]
%timeit -n 1000 df[matchSeries]

Output

448 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
510 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)       

Result of the DataFrame after slicing both ways:

display(df[matchNumpyArry],df[matchSeries])

Output

         A        B        C        D
0        aaa      one      0        0
4        aaa      two      4        8
5        aaa      two      5        10
7        aaa      three    7        14
        
         A        B        C        D
0        aaa      one      0        0
4        aaa      two      4        8
5        aaa      two      5        10
7        aaa      three    7        14

Conclusion: using a numpy.array mask, the overall slicing performance outperforms pandas.Series.
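
As a small follow-up (my own example, not from the original post), the same idea extends to compound filters: raw NumPy masks can be combined with & and | just like Series comparisons.

# Rows where A == 'aaa' AND B == 'two', built from raw NumPy masks
mask = (df['A'].values == 'aaa') & (df['B'].values == 'two')
print(df[mask])   # returns rows 4 and 5 of the frame used above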


Data Science with…Python :)
Post Reference: Vikram Aristocratic Elfin Share

Sunday, April 21, 2019

Fetch rows on the basis of condition in Pandas Dataframe

I have a CSV file (NullFilterExample.csv) with the records shown below.
Here we are trying to implement various filter criteria:
·         Implementing a value search
·         Implementing a like condition
·         Implementing a not-like condition


While doing this, we will ignore the NULL rows with the help of the na=False parameter. Let's code it:

import pandas as pd
import numpy as np

df1 = pd.read_csv("NullFilterExample.csv")

print('Original DF rows \n',df1 , '\n')

#implementing value search
df2=df1[df1.customer_name == 'Rishika']
print('Rows where customer Name like Rishika \n',df2)

#implementing like condition
df3 = df1[df1.customer_name.str.contains('ika', na=False)]
print('Rows where customer Name contain ika \n',df3)

#implementing not like condition
df4 = df1[~df1.customer_name.str.contains('ika', na=False)]
print('Rows where customer Name not contain ika \n',df4)

Output:
Original DF rows
    account_no  branch  city_code customer_name  amount
0        2112  3212.0      321.0       Sidhika   19000
1        2119     NaN      215.0      Prayansh   12000
2        2115  4321.0      212.0       Rishika   15000
3        2435  2312.0        NaN      Sagarika   13000
4        2356  7548.0      256.0           NaN   15000

Rows where customer Name like Rishika
    account_no  branch  city_code customer_name  amount
2        2115  4321.0      212.0       Rishika   15000

Rows where customer Name contain ika
    account_no  branch  city_code customer_name  amount
0        2112  3212.0      321.0       Sidhika   19000
2        2115  4321.0      212.0       Rishika   15000
3        2435  2312.0        NaN      Sagarika   13000

Rows where customer Name not contain ika
    account_no  branch  city_code customer_name  amount
1        2119     NaN      215.0      Prayansh   12000
4        2356  7548.0      256.0           NaN   15000
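
To make the role of na=False concrete, here is a minimal sketch (with a small inline Series instead of the CSV, so the names are mine): without the parameter, the missing name produces NaN in the mask, which cannot be used directly as a Boolean filter; with na=False, that row is simply treated as a non-match.

import numpy as np
import pandas as pd

# Hypothetical mini-column mirroring customer_name, with one missing value
names = pd.Series(['Sidhika', 'Prayansh', np.nan, 'Rishika'])

print(names.str.contains('ika'))             # the NaN stays NaN in the mask
print(names.str.contains('ika', na=False))   # NaN is treated as "no match" (False)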

Data Science with…Python :) 

Post Reference: Vikram Aristocratic Elfin Share