There are multiple ways to slice your data, so let's see which one is the most
efficient. Let's take a DataFrame and compare the various ways of slicing it.
Data Science with…Python
Post Reference: Vikram Aristocratic Elfin Share
import numpy as np
import pandas as pd

# Small sample DataFrame used for all the timings in this post
df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})
df
Output:
     A      B  C   D
0  aaa    one  0   0
1  bbb    two  1   2
2  xxx    two  2   4
3  zzz  three  3   6
4  aaa    two  4   8
5  aaa    two  5  10
6  kkk    one  6  12
7  aaa  three  7  14
Now let's slice the data on the condition A == 'aaa' using various methods:
%timeit -n 1000 df[df['A'].values == 'aaa']
%timeit -n 1000 df[df['A'] == 'aaa']
%timeit -n 1000 df.query('A == "aaa"')
%timeit -n 1000 df[df["A"] == 'aaa']
%timeit -n 1000 df[df["A"].isin(['aaa'])]
%timeit -n 1000 df.set_index('A', append=True, drop=False).xs('aaa', level=1, drop_level=True)
%timeit -n 1000 df.iloc[np.where(df['A'] == 'aaa')]
Output:
454 µs ± 7.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 4.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.8 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
710 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.58 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
855 µs ± 6.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
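The %timeit magic only works inside IPython or Jupyter. As a rough sketch of how a similar comparison could be run from a plain Python script, the standard-library timeit module can be used instead. The labels and the time_candidates helper below are illustrative names, not part of the original post, and the loop counts are arbitrary assumptions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

# Label -> callable. The labels are made up for this sketch.
candidates = {
    'values ==': lambda: df[df['A'].values == 'aaa'],
    'Series ==': lambda: df[df['A'] == 'aaa'],
    'query': lambda: df.query('A == "aaa"'),
    'isin': lambda: df[df['A'].isin(['aaa'])],
}


def time_candidates(funcs, number=200, repeat=3):
    """Return {label: best observed seconds per call} using timeit.repeat."""
    results = {}
    for label, func in funcs.items():
        totals = timeit.repeat(func, number=number, repeat=repeat)
        results[label] = min(totals) / number
    return results


timings = time_candidates(candidates)
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{label:<10} {seconds * 1e6:8.1f} us per call')
```

Absolute numbers will differ from the ones shown here depending on the machine and the pandas version, so only the relative ordering is worth comparing.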
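Here you can see that, on this 8-row DataFrame, the best performance is achieved
by comparing against the underlying NumPy array (df['A'].values == 'aaa', 454 µs),
and second place is taken by the isin method (710 µs).
The slowest of all was the index cross-section method (set_index followed by xs,
2.58 ms), with query the next slowest at 1.8 ms.
In this test, comparing on the NumPy array outperforms comparing on the pandas Series.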
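A timing comparison is only meaningful if every method selects the same rows. The following is a minimal sketch of such a check, assuming the same sample DataFrame as above. The dictionary keys are illustrative names, and the np.where variant takes element [0] of the returned tuple explicitly, which is a small deviation from the timed expression.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
    'B': 'one two two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

# Each entry applies one of the slicing methods compared in this post.
results = {
    'values': df[df['A'].values == 'aaa'],
    'series': df[df['A'] == 'aaa'],
    'query': df.query('A == "aaa"'),
    'isin': df[df['A'].isin(['aaa'])],
    'xs': df.set_index('A', append=True, drop=False)
            .xs('aaa', level=1, drop_level=True),
    'where': df.iloc[np.where(df['A'] == 'aaa')[0]],
}

# Row labels selected by each method; they should all agree.
selected = {name: result.index.tolist() for name, result in results.items()}
for name, labels in selected.items():
    print(name, labels)
```

If any method printed a different list of row labels, its timing would not be comparable with the others.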