Sunday, October 20, 2019

Python: Keeping track of which data comes from which file

If a DataFrame is built from several files and you want a column that records which file each row came from, you can use the DataFrame assign function to add that column while reading each file.

We first create an empty DataFrame with the expected columns.
To find the files in a directory we use the glob module.
Then, while reading each file into a DataFrame, we use the assign function to add a new column that stores the file name alongside the data.

We have two files, emp1.csv and emp2.csv, in our Python working directory. Let's first list the file names with the glob module.


import pandas as pd
import glob

read_file = glob.glob('emp*.csv')
read_file

Output: ['emp1.csv', 'emp2.csv']

type(read_file)

Output: list
        
Here you can see that glob.glob returns a list of the files in the current working directory whose names match the pattern emp*.csv.
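The pattern matching described above can be tried in isolation. The following is a minimal sketch, assuming a temporary directory and three made-up files (emp1.csv, emp2.csv, dept.csv) that are not part of the original post. It also sorts the result, since glob does not guarantee any particular order.

```python
import glob
import os
import tempfile

# Create a few empty-ish CSV files in a temporary directory.
# The file names and header line are hypothetical sample data.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("emp1.csv", "emp2.csv", "dept.csv"):
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write("emp_no,emp_name,emp_sal\n")

    # glob.glob returns every path matching the pattern, in arbitrary order,
    # so sort the list to get a stable result.
    pattern = os.path.join(tmp_dir, "emp*.csv")
    matched = sorted(glob.glob(pattern))

    # Keep only the file names, without the temporary directory prefix.
    file_names = [os.path.basename(path) for path in matched]

print(file_names)
```

Only the names that start with emp and end with .csv are returned; dept.csv is skipped because it does not match the pattern.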

Now we need to read the content of every matched file and keep all the data in a single DataFrame.
First we create an empty DataFrame with the column names, plus an additional column to store the name of the file each row was read from.

Then we loop through the list of file names (read_file) and pass each file name to the pd.read_csv function.
While reading each file into a DataFrame, we use the assign function to add one more field that holds the file name:
df_file = pd.read_csv(files).assign(filename=files)
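To see what assign does on its own, here is a small sketch. It assumes the CSV content is held in an in-memory string rather than a file on disk, and the name source_name is only a label chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content, read from memory instead of from emp1.csv.
csv_text = (
    "emp_no,emp_name,emp_sal\n"
    "E1001,Aayansh,1000\n"
    "E1002,Prayansh,2000\n"
)
source_name = "emp1.csv"  # label only; nothing is read from disk

original = pd.read_csv(io.StringIO(csv_text))

# assign returns a new DataFrame; the scalar is repeated on every row.
tagged = original.assign(filename=source_name)

print(list(original.columns))  # the original frame is left unchanged
print(list(tagged.columns))    # the new frame has the extra column
```

Because assign returns a new DataFrame instead of modifying the one it is called on, it can be chained directly after pd.read_csv.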

Thereafter we concatenate the DataFrames into a single DataFrame. In the output you can see which file each row came from.

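# Start with an empty DataFrame that already has the extra filename column.
df = pd.DataFrame(columns=['emp_no','emp_name','emp_sal','filename'])

for files in read_file:
    # Read one file and tag every row with the name of the file it came from.
    df_file = pd.read_csv(files).assign(filename=files)
    # Append the tagged rows to the combined DataFrame.
    df = pd.concat([df, df_file], axis=0)

df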

Output

  emp_no  emp_name emp_sal  filename
0  E1001   Aayansh    1000  emp1.csv
1  E1002  Prayansh    2000  emp1.csv
2  E1003   Rishika    1500  emp1.csv
3  E1004    Mishty     900  emp1.csv
0  E2001   Sidhika    1000  emp2.csv
1  E2002    Kavita    2000  emp2.csv
2  E2003     Happy    1500  emp2.csv
3  E2004   Sandeep     900  emp2.csv
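Notice that the row index restarts at 0 for each file. As an alternative, the frames can be collected in a list and concatenated once with ignore_index=True. This is a sketch of that variation, not the approach used in the loop above, and it assumes in-memory stand-ins for the two files (the sources dictionary is hypothetical).

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for emp1.csv and emp2.csv.
sources = {
    "emp1.csv": "emp_no,emp_name,emp_sal\nE1001,Aayansh,1000\nE1002,Prayansh,2000\n",
    "emp2.csv": "emp_no,emp_name,emp_sal\nE2001,Sidhika,1000\nE2002,Kavita,2000\n",
}

# Read each source and tag its rows with the name it came from.
frames = [
    pd.read_csv(io.StringIO(text)).assign(filename=name)
    for name, text in sources.items()
]

# Concatenate once; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Concatenating once also avoids copying the growing DataFrame on every pass through the loop, which can matter when there are many files.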

df.groupby(['filename']).count()

          emp_no  emp_name  emp_sal
filename
emp1.csv       4         4        4
emp2.csv       4         4        4
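One detail worth knowing: count() reports the number of non-missing values in each column, so it only equals the number of rows when nothing is missing. The sketch below uses a small made-up DataFrame with one missing name (an assumption for illustration, not data from the files above) and compares count() with size().

```python
import pandas as pd

# Hypothetical data: one emp_name is missing in emp2.csv.
df_demo = pd.DataFrame(
    {
        "emp_no": ["E1001", "E1002", "E2001", "E2002"],
        "emp_name": ["Aayansh", "Prayansh", "Sidhika", None],
        "filename": ["emp1.csv", "emp1.csv", "emp2.csv", "emp2.csv"],
    }
)

# count() gives non-missing values per column for each file.
counts = df_demo.groupby("filename").count()

# size() gives the number of rows for each file, missing values included.
sizes = df_demo.groupby("filename").size()

print(counts)
print(sizes)
```

If the goal is simply the number of rows loaded from each file, size() is the safer choice.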



Data Science with…Python
Post Reference: Vikram Aristocratic Elfin
