About Me

Mumbai, Maharashtra, India
He has more than 7.6 years of experience in software development and has spent most of that time on web and desktop application development. He has sound knowledge of various database concepts. You can reach him at viki.keshari@gmail.com https://www.linkedin.com/in/vikrammahapatra/ https://twitter.com/VikramMahapatra http://www.facebook.com/viki.keshari

Saturday, November 2, 2019

Performance of Various Slicing Methods on a Pandas DataFrame

There are multiple ways to slice your data, so let’s see which one is the most efficient. Let’s take a DataFrame and time the various slicing methods.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df
Output:
         A        B        C        D
0        aaa      one      0        0
1        bbb      two      1        2
2        xxx      two      2        4
3        zzz      three    3        6
4        aaa      two      4        8
5        aaa      two      5        10
6        kkk      one      6        12
7        aaa      three    7        14

Now let’s slice the data with the condition A == 'aaa' using various methods:

%timeit -n 1000 df[df['A'].values == 'aaa']
%timeit -n 1000 df[df['A'] == 'aaa']
%timeit -n 1000 df.query('A == "aaa"')
%timeit -n 1000 df[df["A"]=='aaa']
%timeit -n 1000 df[df["A"].isin(['aaa'])]
%timeit -n 1000 df.set_index('A', append=True, drop=False).xs('aaa', level=1,drop_level=True)
%timeit -n 1000 df.iloc[np.where(df['A']=='aaa')]


Output

454 µs ± 7.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 4.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.8 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
710 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.58 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
855 µs ± 6.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
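
Timings are only comparable if every expression selects the same rows. The following is a minimal sketch of such a check on the same DataFrame. The dictionary labels are made up for illustration, and `np.where(...)[0]` is used here instead of passing the whole tuple to `iloc`, which is an assumption made to keep the indexing explicit.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the post.
df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# One entry per distinct slicing expression timed above (labels are illustrative).
results = {
    'values mask': df[df['A'].values == 'aaa'],
    'series mask': df[df['A'] == 'aaa'],
    'query': df.query('A == "aaa"'),
    'isin': df[df['A'].isin(['aaa'])],
    'set_index + xs': df.set_index('A', append=True, drop=False)
                        .xs('aaa', level=1, drop_level=True),
    # np.where returns a tuple of arrays; take the row positions explicitly.
    'iloc + np.where': df.iloc[np.where(df['A'] == 'aaa')[0]],
}

# Compare row labels and cell values of every result against the first one.
expected = results['values mask']
same_rows = {
    name: (list(res.index) == list(expected.index)
           and res.values.tolist() == expected.values.tolist())
    for name, res in results.items()
}
print(same_rows)
```

If any entry comes back False, the corresponding timing is measuring a different operation and should not be compared with the rest.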

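Here you can see that, on this small DataFrame, the best performance is achieved by the boolean mask built on the underlying NumPy array (df['A'].values == 'aaa', 454 µs), and second place is taken by the isin method (710 µs).
The slowest of all is the index cross-section method (set_index followed by xs, 2.58 ms), followed by query (1.8 ms).
In this test, comparing against the NumPy array outperforms comparing against the pandas.Series.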
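
The %timeit magic only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run from a plain Python script with the standard timeit module. The helper name `benchmark` and the labels are hypothetical, and the sketch reports the best run per call rather than the mean ± std. dev. that %timeit prints, so the numbers will not match the ones above exactly.

```python
import timeit

import numpy as np
import pandas as pd

# Same DataFrame as in the post.
df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Label -> zero-argument callable wrapping one slicing expression (labels are illustrative).
candidates = {
    'values mask': lambda: df[df['A'].values == 'aaa'],
    'series mask': lambda: df[df['A'] == 'aaa'],
    'query': lambda: df.query('A == "aaa"'),
    'isin': lambda: df[df['A'].isin(['aaa'])],
    'set_index + xs': lambda: df.set_index('A', append=True, drop=False)
                                .xs('aaa', level=1, drop_level=True),
    'iloc + np.where': lambda: df.iloc[np.where(df['A'] == 'aaa')[0]],
}


def benchmark(funcs, number=1000, repeat=7):
    """Return {label: best seconds per call} measured with timeit.repeat."""
    timings = {}
    for label, func in funcs.items():
        runs = timeit.repeat(func, number=number, repeat=repeat)
        timings[label] = min(runs) / number
    return timings


# The post used 1000 loops and 7 runs; smaller values keep this sketch quick.
timings = benchmark(candidates, number=100, repeat=3)
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{label:<18} {seconds * 1e6:10.1f} µs per call")
```

Absolute numbers depend on the machine and the pandas version, and with only 8 rows they mostly reflect per-call overhead, so it is worth rerunning the script on a DataFrame of realistic size before drawing conclusions.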

Data Science with…Python :)
Post Reference: Vikram Aristocratic Elfin
