Aristocratic Elfin Share

Tuesday, April 14, 2020

COVID19 India Data Analysis, Predicting Total Case on 4th of May (by end of lockdown Version-02)

Do check prev post: Covid19_StateWise_Analysis_India_Part1

Here we trying to focus on what will be the confirmed case count on the last day of lockdown version-02 in India, the entire analysis is based on growth rate technique. Let’s import required modules

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import folium
import seaborn as sns
import os

import datetime

Let try to find out growth rate, considering the data from 30^th Jan

confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/'+
                           'COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
                           +'time_series_covid19_confirmed_global.csv')

india_sel = confirmed_df[confirmed_df['Country/Region']=='India'].loc[:'4/13/20']
india_confirmed_list = india_sel.values.tolist()[0]
india_confirmed_list[4]
growth_diff = []

for i in range(4,len(india_confirmed_list)):
    if (i == 4) or india_confirmed_list[i-1] == 0 :
       growth_diff.append(india_confirmed_list[i])
    else:
        growth_diff.append(india_confirmed_list[i] / india_confirmed_list[i-1])

growth_factor = sum(growth_diff)/len(growth_diff)
print('Average growth factor',growth_factor)

#OUTPUT: GROWTH RATE

Average growth factor 1.0637553331032963

Lets now calculate the next twenty 21 days case count and plot it in chart

x_axis_prediction_dt = []

dates = list(confirmed_df.columns[4:])
dates = list(pd.to_datetime(dates))

#we will add one day to the last day till which we have data
start_date = dates[len(dates) - 1]
for i in range(21):
    date = start_date + datetime.timedelta(days=1)
    x_axis_prediction_dt.append(date)
    start_date = date

# Get the last available day total number
previous_day_cases = confirmed_df[confirmed_df['Country/Region']=='India'].iloc[:,-1]
# Converting series to float value
previous_day_cases = previous_day_cases.iloc[0]
y_axis_predicted_next21days_cases = []

for i in range(21):
    predicted_value = previous_day_cases * growth_factor
    y_axis_predicted_next21days_cases.append(predicted_value)
    previous_day_cases = predicted_value
# print(previous_day_cases)

#add Graph
fig1=go.Figure()
fig1.add_trace(go.Scatter(x=x_axis_prediction_dt,
                          y=y_axis_predicted_next21days_cases,
                          name='India'
                               ))

fig1.layout.update(title_text='COVID-19 next twenty one prediction',xaxis_showgrid=False, yaxis_showgrid=False, width=800,
        height=500,font=dict(
#         family="Courier New, monospace",
        size=12,
        color="white"
    ))
fig1.layout.plot_bgcolor = 'Black'
fig1.layout.paper_bgcolor = 'Black'
fig1.show()

Growth rate predict cases will jump over 35k by 3rd of May

Post Reference: Vikram Aristocratic Elfin Share

Monday, April 13, 2020

Analysis of top 5 Indian state Covid19 confirmed case till March month - Part1

I am using Kaggle dataset "covid19-corona-virus-india-dataset/complete.csv" for my analysis.

We will first try to find out top five states with most number of cases, and then will try to plot the data on day on day basis.

Lets first import relevant module

Now in second part I have created a dataframe and call the def operationDfs by passing the newly created dataframe, here if you see the we have used time package to record the total execution time to run the entire program.

df_complete = pd.read_csv('../input/covid19-corona-virus-india-dataset/complete.csv')
df_patient_wise = pd.read_csv('../input/covid19-corona-virus-india-dataset/patients_data.csv')

#date and state wise total
df = pd.DataFrame(df_complete.groupby(['Date','Name of State / UT'])['Total Confirmed cases (Indian National)'].sum()).reset_index()
df[df['Name of State / UT']=='Maharashtra']

#State wise Total till 29th March
df_stateWiseTot = pd.DataFrame(df.groupby(['Name of State / UT'])['Total Confirmed cases (Indian National)'].sum()).reset_index()
df_stateWiseTot.sort_values('Total Confirmed cases (Indian National)', axis = 0, ascending = False, inplace = True, na_position ='last')
df_stateWiseTot.nlargest(5,'Total Confirmed cases (Indian National)')

#OUTPUT
Name of State / UT Total Confirmed cases (Indian National)
Maharashtra           1294
Kerala           1264
Uttar Pradesh     512
Karnataka         480
Delhi            390

Lets plot the data state wise with confirmed case on day on day basis

fig1=go.Figure()
fig1.add_trace(go.Scatter(x=df[(df['Name of State / UT']=='Maharashtra') & (df['Date'] < '2020-03-29') ]['Date'],
                                 y=df[df['Name of State / UT']=='Maharashtra']['Total Confirmed cases (Indian National)'],
                          name='Maharashtra'
                               ))
fig1.add_trace(go.Scatter(x=df[(df['Name of State / UT']=='Kerala') & (df['Date'] < '2020-03-29') ]['Date'],
                                 y=df[df['Name of State / UT']=='Kerala']['Total Confirmed cases (Indian National)'],
                          name='Kerala'
                               ))
fig1.add_trace(go.Scatter(x=df[(df['Name of State / UT']=='Uttar Pradesh') & (df['Date'] < '2020-03-29') ]['Date'],
                                 y=df[df['Name of State / UT']=='Uttar Pradesh']['Total Confirmed cases (Indian National)'],
                          name='Uttar Pradesh'
                               ))
fig1.add_trace(go.Scatter(x=df[(df['Name of State / UT']=='Karnataka') & (df['Date'] < '2020-03-29') ]['Date'],
                                 y=df[df['Name of State / UT']=='Karnataka']['Total Confirmed cases (Indian National)'],
                          name='Karnataka'
                               ))
fig1.add_trace(go.Scatter(x=df[(df['Name of State / UT']=='Delhi') & (df['Date'] < '2020-03-29') ]['Date'],
                                 y=df[df['Name of State / UT']=='Delhi']['Total Confirmed cases (Indian National)'],
                          name='Delhi'
                               ))

fig1.layout.update(title_text='COVID-19 Top 4 State Wise Data in India',xaxis_showgrid=False, yaxis_showgrid=False, width=1100,
        height=500,font=dict(
#         family="Courier New, monospace",
        size=12,
        color="white"
    ))
fig1.layout.plot_bgcolor = 'Black'
fig1.layout.paper_bgcolor = 'Black'
fig1.show()

covid19-india-data-analysis-vkm

Data Science with…Python J

Post Reference: Vikram Aristocratic Elfin Share

Sunday, November 3, 2019

Python: Multiprocessing with 4 Core CPU

We have some processing to perform on existing Dataframe, where will try to add few columns on the bases of existing columns values, this we will try to do it serially and then compare the output performance with multiprocessing

Below we have a definition in the file, which takes a dataframe as an input and add 9 further columns in it based on existing column value and some mathematical expression.

import pandas as pd

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import multiprocessing as mp

import time

def operationonDFs(dfinput):

dfinput['z'] = dfinput.apply(lambda newCol: 2 * newCol['x'] , axis = 1)

dfinput['a'] = dfinput.apply(lambda newCol: 3.4 * newCol['x'] , axis = 1)

dfinput['b'] = dfinput.apply(lambda newCol: 2.1 * newCol['x'] , axis = 1)

dfinput['c'] = dfinput.apply(lambda newCol: 5.9 * newCol['x'] , axis = 1)

dfinput['d'] = dfinput.apply(lambda newCol: 7.3 * newCol['x'] , axis = 1)

dfinput['d'] = dfinput.apply(lambda newCol: 3.3 * newCol['x'] , axis = 1)

dfinput['e'] = dfinput.apply(lambda newCol: 7.1 * newCol['x'] , axis = 1)

dfinput['f'] = dfinput.apply(lambda newCol: 4.3 * newCol['x'] , axis = 1)

dfinput['g'] = dfinput.apply(lambda newCol: 5.3 * newCol['x'] , axis = 1)

return dfinput

if __name__ == '__main__':

start_time = time.time()

dfinput = pd.DataFrame({'x':range(1,100000),

'y':range(1,100000)})

df = operationonDFs(dfinput)

print(df)

print("--- %s seconds ---" % (time.time() - start_time))

Lets see the output

C:\Users\Atoshi\mypy>python operationonDFWOMP.py

x y z a b c d e f g

0 1 1 2 3.4 2.1 5.9 3.3 7.1 4.3 5.3

1 2 2 4 6.8 4.2 11.8 6.6 14.2 8.6 10.6

2 3 3 6 10.2 6.3 17.7 9.9 21.3 12.9 15.9

3 4 4 8 13.6 8.4 23.6 13.2 28.4 17.2 21.2

4 5 5 10 17.0 10.5 29.5 16.5 35.5 21.5 26.5

... ... ... ... ... ... ... ... ... ... ...

99994 99995 99995 199990 339983.0 209989.5 589970.5 329983.5 709964.5 429978.5 529973.5

99995 99996 99996 199992 339986.4 209991.6 589976.4 329986.8 709971.6 429982.8 529978.8

99996 99997 99997 199994 339989.8 209993.7 589982.3 329990.1 709978.7 429987.1 529984.1

99997 99998 99998 199996 339993.2 209995.8 589988.2 329993.4 709985.8 429991.4 529989.4

99998 99999 99999 199998 339996.6 209997.9 589994.1 329996.7 709992.9 429995.7 529994.7

[99999 rows x 10 columns]

--- 24.10855269432068 seconds ---

So it took 24 second to execute the program.

Lets re-write the program to use multiprocessing, first importing multiprocessing package in our program and checking the number of cpu core available in our system using cpu_count method of multiprocessing.