About Me

Mumbai, Maharashtra, India
He has more than 7.6 years of experience in software development, spent mostly on web/desktop application development. He has sound knowledge of various database concepts. You can reach him at viki.keshari@gmail.com https://www.linkedin.com/in/vikrammahapatra/ https://twitter.com/VikramMahapatra http://www.facebook.com/viki.keshari

Sunday, April 21, 2019

Reindexing a DataFrame after Filtering Rows in Pandas, and Preserving the Previous Index

I have an Excel sheet with the records below, exported as NullFilterExample.csv.

Let's filter out all those rows where the customer name contains “ika”.

import pandas as pd

df1 = pd.read_csv("NullFilterExample.csv")

print('Original DF rows \n', df1, '\n')

# implementing a NOT LIKE filter condition
df2 = df1[~df1.customer_name.str.contains('ika', na=False)]
print('Rows where customer name does not contain ika \n', df2)

Output
Original DF rows
    account_no  branch  city_code customer_name  amount
0        2112  3212.0      321.0       Sidhika   19000
1        2119     NaN      215.0      Prayansh   12000
2        2115  4321.0      212.0       Rishika   15000
3        2435  2312.0        NaN      Sagarika   13000
4        2356  7548.0      256.0           NaN   15000

Rows where customer name does not contain ika
    account_no  branch  city_code customer_name  amount
1        2119     NaN      215.0      Prayansh   12000
4        2356  7548.0      256.0           NaN   15000

Now, if you look at the output of df2, you will see there are two rows, with indexes 1 and 4, which simply indicates that it requires reindexing. Let's put in the reindexing logic.

# reindexing the df2 dataframe
df3 = df2.reset_index(drop=True)
print('After reindexing of DF2 \n', df3)

Output:
After reindexing of DF2
    account_no  branch  city_code customer_name  amount
0        2119     NaN      215.0      Prayansh   12000
1        2356  7548.0      256.0           NaN   15000

Now, if you look at the output, the indexes are correct and in sequence, i.e. 0 and 1.

Now let's check the syntax:
df2.reset_index(drop=True)

There is a parameter “drop=True”; it drops the existing index on the rows and creates a new one starting at 0.

But what if we want to preserve the actual index of the rows… simply remove the optional “drop=True” parameter.

# what if we remove the "drop=True" parameter of reset_index
df4 = df2.reset_index()
print('After reindexing of DF2 (index preserved) \n', df4)

Output
After reindexing of DF2 (index preserved)
    index  account_no  branch  city_code customer_name  amount
0      1        2119     NaN      215.0      Prayansh   12000
1      4        2356  7548.0      256.0           NaN   15000

So here you can see, it creates a new column called “index” and preserves the existing index numbering.
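
If you would rather give that preserved column a more descriptive name, you can rename it right after the reset. A minimal sketch, continuing from df2 above (the name orig_index is just an illustrative choice, not anything pandas requires):

# preserve the old index under a descriptive column name
df5 = df2.reset_index().rename(columns={'index': 'orig_index'})
print('Old index preserved as orig_index \n', df5)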

Full code:
import pandas as pd

df1 = pd.read_csv("NullFilterExample.csv")

print('Original DF rows \n', df1, '\n')

# implementing a NOT LIKE filter condition
df2 = df1[~df1.customer_name.str.contains('ika', na=False)]
print('Rows where customer name does not contain ika \n', df2)

# reindexing the df2 dataframe
df3 = df2.reset_index(drop=True)
print('After reindexing of DF2 \n', df3)

# what if we remove the "drop=True" parameter of reset_index
df4 = df2.reset_index()
print('After reindexing of DF2 (index preserved) \n', df4)



Data Science with…Python :)
Post Reference: Vikram Aristocratic Elfin Share

Fetching Rows on the Basis of a Condition in a Pandas DataFrame

I have an Excel sheet with the records below, exported as NullFilterExample.csv.
Here we are trying to implement various filter criteria:
·         Implementing a value search
·         Implementing a LIKE condition
·         Implementing a NOT LIKE condition

And while doing this we will try to ignore the NULL rows with the help of the “na=False” parameter. Let's code it:

import pandas as pd

df1 = pd.read_csv("NullFilterExample.csv")

print('Original DF rows \n', df1, '\n')

# implementing a value search (exact match)
df2 = df1[df1.customer_name == 'Rishika']
print('Rows where customer name is Rishika \n', df2)

# implementing a LIKE condition
df3 = df1[df1.customer_name.str.contains('ika', na=False)]
print('Rows where customer name contains ika \n', df3)

# implementing a NOT LIKE condition
df4 = df1[~df1.customer_name.str.contains('ika', na=False)]
print('Rows where customer name does not contain ika \n', df4)

Output:
Original DF rows
    account_no  branch  city_code customer_name  amount
0        2112  3212.0      321.0       Sidhika   19000
1        2119     NaN      215.0      Prayansh   12000
2        2115  4321.0      212.0       Rishika   15000
3        2435  2312.0        NaN      Sagarika   13000
4        2356  7548.0      256.0           NaN   15000

Rows where customer name is Rishika
    account_no  branch  city_code customer_name  amount
2        2115  4321.0      212.0       Rishika   15000

Rows where customer name contains ika
    account_no  branch  city_code customer_name  amount
0        2112  3212.0      321.0       Sidhika   19000
2        2115  4321.0      212.0       Rishika   15000
3        2435  2312.0        NaN      Sagarika   13000

Rows where customer name does not contain ika
    account_no  branch  city_code customer_name  amount
1        2119     NaN      215.0      Prayansh   12000
4        2356  7548.0      256.0           NaN   15000
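
A quick note on the “na=False” parameter used above: on a column containing NaN, str.contains returns NaN for the missing entries, and such a result cannot be used directly as a boolean row filter. A minimal sketch of the difference, using a toy Series rather than the CSV above:

import pandas as pd

s = pd.Series(['Sidhika', None, 'Rishika'])

# with na=False, the missing entry simply evaluates to False
print(s.str.contains('ika', na=False))  # True, False, True

# without na=..., the missing entry propagates as NaN
print(s.str.contains('ika'))            # True, NaN, True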

Data Science with…Python :) 

Post Reference: Vikram Aristocratic Elfin Share

Saturday, April 20, 2019

Removing Null-Value Rows from a DataFrame in Pandas

I have an Excel sheet with the records below, exported as NullFilterExample.csv.

Here the second and fourth rows have null values, so our objective is to remove these rows from processing. Let's see how we can do this with the pandas package in Python.

import pandas as pd

df1 = pd.read_csv("NullFilterExample.csv")

print('Original DF with NULL value in rows \n', df1, '\n')

# keep only the rows where every column is non-null,
# filtering column by column
for col in df1.columns:
    df1 = df1[df1[col].notnull()]

print('After removing rows with NULL Value \n', df1)

Output:

Original DF with NULL value in rows
    account_no  branch  city_code customer_name  amount
0        2112  3212.0      321.0       Sidhika   19000
1        2119     NaN      215.0      Prayansh   12000
2        2115  4321.0      212.0       Rishika   15000
3        2435  2312.0        NaN      Sagarika   13000

After removing rows with NULL Value
    account_no  branch  city_code customer_name  amount
0        2112  3212.0      321.0       Sidhika   19000
2        2115  4321.0      212.0       Rishika   15000
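
As a side note, pandas ships a shortcut that achieves the same result: dropna() removes every row containing at least one null value, so the column-by-column loop above collapses to a single call. A minimal sketch assuming the same CSV file:

import pandas as pd

df1 = pd.read_csv("NullFilterExample.csv")

# drop every row that has at least one NaN in any column
df_clean = df1.dropna()
print('After removing rows with NULL Value \n', df_clean)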


Data Science with…Python :)

Post Reference: Vikram Aristocratic Elfin Share

Thursday, November 22, 2018

Performance of Read-Ahead Reads with Trace Flag 652

When a user submits a query to fetch data from SQL Server, the database engine first checks whether the requested data pages are present in the buffer cache; if so, it performs a logical read and sends the data back to the user. But if the requested data pages are not present in the buffer cache, it performs a physical read, i.e. a read from disk, which is an expensive operation involving high IO and waits.

To avoid physical reads, SQL Server has a mechanism known as Read-Ahead Read, which brings data pages into the buffer even before they are requested by the query. Read-ahead is the default behavior of SQL Server.
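
As an aside, server-wide read-ahead activity can be observed through the Buffer Manager performance counters; a minimal sketch using sys.dm_os_performance_counters (the counter is cumulative, so sample it twice and take the difference to get a rate):

select object_name, counter_name, cntr_value
from sys.dm_os_performance_counters
where counter_name = 'Readahead pages/sec'
  and object_name like '%Buffer Manager%'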

In this post we will check the performance of read-ahead reads compared to physical reads with the help of Trace Flag 652.

I have created two sets of queries: one that uses read-ahead reads to fetch the data, and one that relies on physical reads.

With Read-Ahead Read: here I have turned statistics IO on to capture the reads, and cleared the plan cache and buffers first.

dbcc traceoff(652,-1)    -- make sure read-ahead is enabled
dbcc freeproccache       -- do not run this on prod
dbcc dropcleanbuffers    -- do not run this on prod
go
set statistics io on
set statistics time on
  --select * from dbo.person
  select * from person.address
set statistics io off
set statistics time off
go

Let's check what the IO info is saying:
(19614 row(s) affected)
Table 'Address'. Scan count 1, logical reads 346, physical reads 1, read-ahead reads 344, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 93 ms,  elapsed time = 855 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Here we can see the read-ahead read count is 344, which means that to fetch the 19,614 records, the storage engine brought 344 8-KB pages into cache before the plan executed. Now let's check without read-ahead reads.

dbcc traceon(652,-1)     -- disable read-ahead reads
dbcc dropcleanbuffers    -- do not run this on prod
dbcc freeproccache       -- do not run this on prod
go
set statistics io on
set statistics time on
       --select * from dbo.person
       select * from person.address
set statistics io off
set statistics time off
go

Let's check what the IO info is saying:

(19614 row(s) affected)
Table 'Address'. Scan count 1, logical reads 345, physical reads 233, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 141 ms,  elapsed time = 3041 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Here we can clearly see the difference: with read-ahead reads the query finished in 855 ms elapsed, while relying on physical reads it took 3041 ms.

Conclusion: read-ahead reads performed better than physical reads in our case.

Enjoy coding…SQL :)

Post Reference: Vikram Aristocratic Elfin Share

Monday, November 19, 2018

Logical Reads, Physical Reads and Buffer Cache Hit

Logical Reads:
This is also known as a cache hit, which means reading pages from cache memory instead of from disk.
The logical read count specifies the total number of data pages that need to be accessed from the data cache to process a query. It is quite possible that a logical read accesses the same data pages many times, so the logical read count may be higher than the actual number of pages in the table. Usually the best way to reduce logical reads is to apply the correct index or to rewrite the query.

Physical Reads
A physical read indicates the total number of data pages read from disk. When no data is in the data cache, the number of physical reads equals the number of logical reads; this usually happens for the first query request. For subsequent identical requests the number drops substantially because the data pages are already in the data cache.

Buffer Cache Hit Ratio
The higher the logical read count relative to physical reads, the better the cache hit ratio:
(logical reads – physical reads) / logical reads * 100%. A high buffer cache hit ratio (as near to 100% as possible) indicates good database performance.
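
As an illustration, the server-wide ratio can be read from the Buffer Manager counters; a minimal sketch using sys.dm_os_performance_counters (the ratio counter has to be divided by its companion base counter):

select (a.cntr_value * 100.0 / b.cntr_value) as buffer_cache_hit_ratio
from sys.dm_os_performance_counters a
join sys.dm_os_performance_counters b
  on a.object_name = b.object_name
where a.counter_name = 'Buffer cache hit ratio'
  and b.counter_name = 'Buffer cache hit ratio base'
  and a.object_name like '%Buffer Manager%'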

Some warnings on a high number of logical reads:
A higher number of logical reads tends to mean high memory usage, but there are various ways to reduce it:
  1. Remove improper/useless/insufficient indexes: indexes should be built on the basis of the data access or retrieval pattern; an index built on columns that are not used in any query leads to high logical reads and degrades performance while reading and writing the data.
  2. Poor fill factor/page density: page fill should not be too low, otherwise a large number of pages will be used for a small amount of data, which also leads to high logical reads.
  3. Wide indexes: indexing a large number of columns leads to high logical reads.
  4. Index scanning: if a query leads to an index scan on the table, logical reads will be high.


The logical read count can be obtained in the following ways (a sketch of option 2 follows this list):
  1. set statistics io on
  2. sys.dm_exec_query_stats
  3. SQL Profiler: by running SQL Profiler against the database we can find the logical reads.
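
For option 2, a minimal sketch of a DMV query that surfaces the most read-heavy cached statements (columns as exposed by sys.dm_exec_query_stats):

-- top 10 cached statements by accumulated logical reads
select top 10
    qs.total_logical_reads,
    qs.execution_count,
    qs.total_logical_reads / qs.execution_count as avg_logical_reads,
    substring(st.text, 1, 100) as statement_text
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) st
order by qs.total_logical_reads desc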


Example:
set statistics io on
set statistics time on
    -- same query twice: the first run reads pages into cache,
    -- the second run is served entirely from cache
    select * from dbo.person

    select * from dbo.person
set statistics io off
set statistics time off
go
(19972 row(s) affected)
Table 'Person'. Scan count 1, logical reads 150, physical reads 0, read-ahead reads 7, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 963 ms.

(19972 row(s) affected)
Table 'Person'. Scan count 1, logical reads 150, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 564 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
Enjoy coding…SQL :)


Post Reference: Vikram Aristocratic Elfin Share