About Me

Mumbai, Maharashtra, India
He has more than 7.6 years of experience in software development and has spent most of that time on web and desktop application development. He has sound knowledge of various database concepts. You can reach him at viki.keshari@gmail.com https://www.linkedin.com/in/vikrammahapatra/ https://twitter.com/VikramMahapatra http://www.facebook.com/viki.keshari


Tuesday, April 14, 2020

COVID19 India Data Analysis, Predicting Total Case on 4th of May (by end of lockdown Version-02)


Here we focus on what the confirmed case count will be on the last day of lockdown version-02 in India. The entire analysis is based on the growth-rate technique. Let's import the required modules:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import folium
import os
import datetime


Let's try to find the growth rate, considering the data from 30th Jan:

confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/'
                           'COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
                           'time_series_covid19_confirmed_global.csv')

# Keep India's row and the columns up to 13 April 2020;
# .loc[:, :'4/13/20'] slices columns, whereas .loc[:'4/13/20'] would slice rows
india_sel = confirmed_df[confirmed_df['Country/Region'] == 'India'].loc[:, :'4/13/20']
india_confirmed_list = india_sel.values.tolist()[0]
india_confirmed_list[4]  # first daily count; positions 0-3 hold province, country, lat and long
growth_diff = []

for i in range(4, len(india_confirmed_list)):
    # No ratio can be computed for the first day or after a zero count,
    # so the raw count is stored instead
    if (i == 4) or india_confirmed_list[i-1] == 0:
        growth_diff.append(india_confirmed_list[i])
    else:
        growth_diff.append(india_confirmed_list[i] / india_confirmed_list[i-1])

growth_factor = sum(growth_diff) / len(growth_diff)
print('Average growth factor', growth_factor)

#OUTPUT: GROWTH RATE
Average growth factor 1.0637553331032963
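
To make the averaging step concrete, here is a minimal sketch on a made-up series of cumulative counts. The numbers and the function name average_growth_factor are hypothetical and are not taken from the JHU data. Unlike the loop in this post, the sketch skips days whose previous count is zero instead of adding the raw count to the average; that is an assumption about the intended calculation, not the method used to get the figure printed here.

```python
def average_growth_factor(cumulative_counts):
    """Mean of day-over-day ratios, skipping days whose previous count is zero."""
    ratios = [
        today / yesterday
        for yesterday, today in zip(cumulative_counts, cumulative_counts[1:])
        if yesterday > 0
    ]
    if not ratios:
        raise ValueError("need at least one day with a non-zero previous count")
    return sum(ratios) / len(ratios)


# Hypothetical cumulative counts, for illustration only
sample_counts = [0, 0, 1, 2, 3, 6, 9]
factor = average_growth_factor(sample_counts)
print("Average growth factor", factor)  # ratios 2, 1.5, 2, 1.5 -> 1.75
```

A factor above 1 means the cumulative count is still growing; a factor of exactly 1 would mean no new cases.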


Let's now calculate the case count for the next 21 days and plot it in a chart:

x_axis_prediction_dt = []

dates = list(confirmed_df.columns[4:])
dates = list(pd.to_datetime(dates))

# Add one day at a time, starting from the last day for which we have data
start_date = dates[len(dates) - 1]
for i in range(21):
    date = start_date + datetime.timedelta(days=1)
    x_axis_prediction_dt.append(date)
    start_date = date

# Get the total for the last available day
previous_day_cases = confirmed_df[confirmed_df['Country/Region']=='India'].iloc[:,-1]
# Convert the one-element Series to a scalar
previous_day_cases = previous_day_cases.iloc[0]
y_axis_predicted_next21days_cases = []

for i in range(21):
    predicted_value = previous_day_cases * growth_factor
    y_axis_predicted_next21days_cases.append(predicted_value)
    previous_day_cases = predicted_value
# print(previous_day_cases)

# Add the graph
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=x_axis_prediction_dt,
                          y=y_axis_predicted_next21days_cases,
                          name='India'))

fig1.layout.update(title_text='COVID-19 next twenty one days prediction',
                   xaxis_showgrid=False, yaxis_showgrid=False,
                   width=800, height=500,
                   font=dict(
                       # family="Courier New, monospace",
                       size=12,
                       color="white"))
fig1.layout.plot_bgcolor = 'Black'
fig1.layout.paper_bgcolor = 'Black'
fig1.show()

At this growth rate, the projection predicts that cases will jump over 35k by 3rd of May.
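
The projection loop in this post is compound growth, so the value after n days equals the last known count multiplied by growth_factor ** n. A minimal sketch, assuming a hypothetical starting count of 10,000 (not the actual JHU figure) together with the average growth factor printed earlier; project_cases is an illustrative name:

```python
def project_cases(last_count, growth_factor, days):
    """Return the projected count for each of the next `days` days."""
    projections = []
    current = last_count
    for _ in range(days):
        current = current * growth_factor  # compound once per day
        projections.append(current)
    return projections


growth_factor = 1.0637553331032963  # average growth factor reported in this post
last_count = 10_000                 # hypothetical starting count, for illustration

projected = project_cases(last_count, growth_factor, 21)

# The day-by-day loop and the closed form agree up to floating-point rounding
closed_form = last_count * growth_factor ** 21
print(round(projected[-1]), round(closed_form))
```

Because the same factor is applied every day, a small change in the average growth factor compounds into a large change in the 21-day figure, so the projection is best read as a rough estimate.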



Post Reference: Vikram Aristocratic Elfin

Sunday, November 3, 2019

Python: Multiprocessing with 4 Core CPU

We have some processing to perform on an existing dataframe: we will add a few columns based on the values of existing columns. We will first do this serially and then compare the performance with multiprocessing.

Below is a function that takes a dataframe as input and adds further columns based on the existing column x and some mathematical expressions. It makes nine row-wise apply calls; because column d is assigned twice, the result has eight new columns.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import multiprocessing as mp
import time


def operationonDFs(dfinput):
    dfinput['z'] = dfinput.apply(lambda newCol: 2 * newCol['x'] , axis = 1)
    dfinput['a'] = dfinput.apply(lambda newCol: 3.4 * newCol['x'] , axis = 1)
    dfinput['b'] = dfinput.apply(lambda newCol: 2.1 * newCol['x'] , axis = 1)
    dfinput['c'] = dfinput.apply(lambda newCol: 5.9 * newCol['x'] , axis = 1)
    dfinput['d'] = dfinput.apply(lambda newCol: 7.3 * newCol['x'] , axis = 1)
    dfinput['d'] = dfinput.apply(lambda newCol: 3.3 * newCol['x'] , axis = 1)
    dfinput['e'] = dfinput.apply(lambda newCol: 7.1 * newCol['x'] , axis = 1)
    dfinput['f'] = dfinput.apply(lambda newCol: 4.3 * newCol['x'] , axis = 1)
    dfinput['g'] = dfinput.apply(lambda newCol: 5.3 * newCol['x'] , axis = 1)
    return dfinput

In the second part, I create a dataframe and call operationonDFs with it. Note that we use the time package to record the total execution time of the entire program.

if __name__ ==  '__main__':
    start_time = time.time()   

    dfinput = pd.DataFrame({'x':range(1,100000),
                       'y':range(1,100000)})

    df = operationonDFs(dfinput)
   
    print(df)
    print("--- %s seconds ---" % (time.time() - start_time))

Let's see the output:

C:\Users\Atoshi\mypy>python operationonDFWOMP.py
           x      y       z         a         b         c         d         e         f         g
0          1      1       2       3.4       2.1       5.9       3.3       7.1       4.3       5.3
1          2      2       4       6.8       4.2      11.8       6.6      14.2       8.6      10.6
2          3      3       6      10.2       6.3      17.7       9.9      21.3      12.9      15.9
3          4      4       8      13.6       8.4      23.6      13.2      28.4      17.2      21.2
4          5      5      10      17.0      10.5      29.5      16.5      35.5      21.5      26.5
...      ...    ...     ...       ...       ...       ...       ...       ...       ...       ...
99994  99995  99995  199990  339983.0  209989.5  589970.5  329983.5  709964.5  429978.5  529973.5
99995  99996  99996  199992  339986.4  209991.6  589976.4  329986.8  709971.6  429982.8  529978.8
99996  99997  99997  199994  339989.8  209993.7  589982.3  329990.1  709978.7  429987.1  529984.1
99997  99998  99998  199996  339993.2  209995.8  589988.2  329993.4  709985.8  429991.4  529989.4
99998  99999  99999  199998  339996.6  209997.9  589994.1  329996.7  709992.9  429995.7  529994.7

[99999 rows x 10 columns]
--- 24.10855269432068 seconds ---
So the program took about 24 seconds to execute.

Let's rewrite the program to use multiprocessing. We first import the multiprocessing package and check the number of CPU cores available on the system with its cpu_count function.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import multiprocessing as mp
import time

def operationonDFs(dfinput):
    dfinput['z'] = dfinput.apply(lambda newCol: 2 * newCol['x'] , axis = 1)
    dfinput['a'] = dfinput.apply(lambda newCol: 3.4 * newCol['x'] , axis = 1)
    dfinput['b'] = dfinput.apply(lambda newCol: 2.1 * newCol['x'] , axis = 1)
    dfinput['c'] = dfinput.apply(lambda newCol: 5.9 * newCol['x'] , axis = 1)
    dfinput['d'] = dfinput.apply(lambda newCol: 7.3 * newCol['x'] , axis = 1)
    dfinput['d'] = dfinput.apply(lambda newCol: 3.3 * newCol['x'] , axis = 1)
    dfinput['e'] = dfinput.apply(lambda newCol: 7.1 * newCol['x'] , axis = 1)
    dfinput['f'] = dfinput.apply(lambda newCol: 4.3 * newCol['x'] , axis = 1)
    dfinput['g'] = dfinput.apply(lambda newCol: 5.3 * newCol['x'] , axis = 1)
    return dfinput


if __name__ ==  '__main__':
    start_time = time.time()
   
    cpu_count = mp.cpu_count()
    no_of_split = 50

    dfinput = pd.DataFrame({'x':range(1,100000),
                       'y':range(1,100000)})

    dfinput_split = np.array_split(dfinput, no_of_split)
    pool = mp.Pool(cpu_count)
    df = pd.concat(pool.map(operationonDFs, dfinput_split))
    pool.close()
    pool.join()

    print(df)
    print("--- %s seconds ---" % (time.time() - start_time))

We have used the Pool class. The pool distributes the tasks to the available worker processes using FIFO scheduling. It works like a map-reduce architecture: it maps the input chunks to the different processes and collects the output from all of them.
The inputs to pool.map are the function we want to execute in parallel and the list of split dataframes.
Once execution is finished, pd.concat joins all the outputs to form a single dataframe.
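
The split, map and concatenate steps can be sketched on a small dataframe. This is a minimal illustration, not the program timed in this post: add_double_column and parallel_apply are hypothetical names, the chunk work is a single column operation, and the frame is split by row positions with iloc rather than by passing the dataframe to np.array_split.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def add_double_column(chunk):
    """Work done on each chunk: add column z = 2 * x."""
    chunk = chunk.copy()
    chunk["z"] = 2 * chunk["x"]
    return chunk


def parallel_apply(df, func, n_splits, n_workers):
    """Split df into row chunks, run func on each chunk in a pool, and concatenate."""
    # Split the row positions, then slice the frame with iloc
    index_chunks = np.array_split(np.arange(len(df)), n_splits)
    chunks = [df.iloc[idx] for idx in index_chunks]
    # map blocks until every chunk is processed; leaving the with-block stops the pool
    with mp.Pool(n_workers) as pool:
        results = pool.map(func, chunks)  # results come back in input order
    return pd.concat(results)


if __name__ == "__main__":
    df = pd.DataFrame({"x": range(1, 11)})
    out = parallel_apply(df, add_double_column, n_splits=4, n_workers=2)
    print(out)
```

The function passed to pool.map has to be defined at module level so that worker processes can import it, and the `if __name__ == "__main__":` guard keeps child processes from re-running the driver code on platforms that start workers by re-importing the script, such as Windows.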

Let’s see how much we save with this architecture:

 C:\Users\Atoshi\mypy>python operationonDF.py
           x      y       z         a         b         c         d         e         f         g
0          1      1       2       3.4       2.1       5.9       3.3       7.1       4.3       5.3
1          2      2       4       6.8       4.2      11.8       6.6      14.2       8.6      10.6
2          3      3       6      10.2       6.3      17.7       9.9      21.3      12.9      15.9
3          4      4       8      13.6       8.4      23.6      13.2      28.4      17.2      21.2
4          5      5      10      17.0      10.5      29.5      16.5      35.5      21.5      26.5
...      ...    ...     ...       ...       ...       ...       ...       ...       ...       ...
99994  99995  99995  199990  339983.0  209989.5  589970.5  329983.5  709964.5  429978.5  529973.5
99995  99996  99996  199992  339986.4  209991.6  589976.4  329986.8  709971.6  429982.8  529978.8
99996  99997  99997  199994  339989.8  209993.7  589982.3  329990.1  709978.7  429987.1  529984.1
99997  99998  99998  199996  339993.2  209995.8  589988.2  329993.4  709985.8  429991.4  529989.4
99998  99999  99999  199998  339996.6  209997.9  589994.1  329996.7  709992.9  429995.7  529994.7

[99999 rows x 10 columns]
--- 13.999533653259277 seconds ---

That's a good saving compared to the serial method.
Serial method: about 24 sec
With multiprocessing: about 14 sec


Data Science with…Python :)

Saturday, November 2, 2019

Performance of Various Slicing Method of Pandas Dataframe

There are multiple ways to slice your data, so let's see which is the most efficient. Let's take a dataframe and time the various slicing methods.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import glob

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df
Output:
         A        B        C        D
0        aaa      one      0        0
1        bbb      two      1        2
2        xxx      two      2        4
3        zzz      three    3        6
4        aaa      two      4        8
5        aaa      two      5        10
6        kkk      one      6        12
7        aaa      three    7        14               

Now let's select the rows where A == 'aaa' using various methods:

%timeit -n 1000 df[df['A'].values == 'aaa']
%timeit -n 1000 df[df['A'] == 'aaa']
%timeit -n 1000 df.query('A == "aaa"')
%timeit -n 1000 df[df["A"]=='aaa']
%timeit -n 1000 df[df["A"].isin(['aaa'])]
%timeit -n 1000 df.set_index('A', append=True, drop=False).xs('aaa', level=1,drop_level=True)
%timeit -n 1000 df.iloc[np.where(df['A']=='aaa')]


Output

454 µs ± 7.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 4.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.8 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
787 µs ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
710 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.58 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
855 µs ± 6.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

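Here you can see that the best performance is achieved by the boolean mask built from the NumPy array, df['A'].values == 'aaa' (454 µs), and second place is taken by the isin method (710 µs).
The slowest of all was the set_index with xs (cross-section of the index) method, at 2.58 ms.
In this test, comparing against the NumPy array outperforms comparing against the pandas Series (454 µs versus 787 µs).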
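
The %timeit magic only works in IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The labels, the choice of four candidates and the loop count of 200 are assumptions made for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'aaa bbb xxx zzz aaa aaa kkk aaa'.split(),
                   'B': 'one two two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Candidate slicing expressions, keyed by a short label
candidates = {
    "values mask": lambda: df[df['A'].values == 'aaa'],
    "series mask": lambda: df[df['A'] == 'aaa'],
    "query": lambda: df.query('A == "aaa"'),
    "isin": lambda: df[df['A'].isin(['aaa'])],
}

# Every candidate should select the same rows before timings are compared
reference = candidates["series mask"]()
for label, func in candidates.items():
    assert func().equals(reference), label

# Average seconds per call over 200 calls
timings = {label: timeit.timeit(func, number=200) / 200
           for label, func in candidates.items()}
for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{label:12s} {seconds * 1e6:8.1f} µs")
```

Checking that all candidates return the same rows first guards against comparing the speed of expressions that do not actually compute the same result.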
