If a dataframe is built from the data of multiple files, it is often useful to
keep a column that records which file each row came from, alongside the content
of the file. We can use the DataFrame assign function to create that column
while reading each file into a dataframe.
We first create an empty dataframe that already has the column names.
To find multiple files in a directory we use the glob module.
Then, while reading the content of each file into a dataframe, we use the assign
function to create a new column that stores the filename along with the data.
We have two files, emp1.csv and emp2.csv, in our Python working directory. Let's
first get the file names through the glob module.
import glob
import pandas as pd
read_file = glob.glob('emp*.csv')
read_file
Output: ['emp1.csv', 'emp2.csv']
type(read_file)
Output: list
Here you can see that read_file, the list returned by glob.glob, contains every
file in the current working directory whose name matches the pattern emp*.csv.
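To make the pattern matching concrete, here is a minimal sketch that can be run anywhere. It assumes a scratch directory created with tempfile and an extra file, notes.txt, which is made up here only to show that non-matching names are skipped. Since glob.glob returns matches in arbitrary order, the sketch sorts the result.

```python
import glob
import os
import tempfile

# Hypothetical scratch directory so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
for name in ("emp1.csv", "emp2.csv", "notes.txt"):
    with open(os.path.join(workdir, name), "w") as fh:
        fh.write("emp_no,emp_name,emp_sal\n")

# Only names matching emp*.csv are returned; notes.txt is ignored.
# glob.glob gives no ordering guarantee, so sort for a stable result.
pattern = os.path.join(workdir, "emp*.csv")
matched = sorted(glob.glob(pattern))
names = [os.path.basename(path) for path in matched]
print(names)  # ['emp1.csv', 'emp2.csv']
```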
Now we need to read the content of every matched file and keep all the data in
a single dataframe.
First we create an empty dataframe with the data column names plus an additional
filename column that records which file each row was read from.
Then we loop through the list of filenames (read_file) and pass each file name
to the pd.read_csv function.
While reading a file into a dataframe, we use the assign function to create one
more field holding the filename:
df_file = pd.read_csv(files).assign(filename=files)
Thereafter we concatenate the dataframes to make a single dataframe. In the
output you can see that every row carries the name of the file it came from.
df = pd.DataFrame(columns=['emp_no','emp_name','emp_sal','filename'])
for files in read_file:
    df_file = pd.read_csv(files).assign(filename=files)
    df = pd.concat([df, df_file], axis=0)
df
Output
   emp_no  emp_name  emp_sal  filename
0   E1001   Aayansh     1000  emp1.csv
1   E1002  Prayansh     2000  emp1.csv
2   E1003   Rishika     1500  emp1.csv
3   E1004    Mishty      900  emp1.csv
0   E2001   Sidhika     1000  emp2.csv
1   E2002    Kavita     2000  emp2.csv
2   E2003     Happy     1500  emp2.csv
3   E2004   Sandeep      900  emp2.csv
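The empty starter dataframe and the concat inside the loop are not the only way to get this result. Here is a minimal sketch of an alternative, assuming the same emp*.csv layout: build one dataframe per file in a list, then call pd.concat once with ignore_index=True so the row index does not restart for each file. The scratch directory and the shortened sample rows are assumptions made so the sketch runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Recreate shortened versions of the two sample files in a scratch directory.
workdir = tempfile.mkdtemp()
samples = {
    "emp1.csv": "emp_no,emp_name,emp_sal\nE1001,Aayansh,1000\nE1002,Prayansh,2000\n",
    "emp2.csv": "emp_no,emp_name,emp_sal\nE2001,Sidhika,1000\nE2002,Kavita,2000\n",
}
for name, text in samples.items():
    with open(os.path.join(workdir, name), "w") as fh:
        fh.write(text)

paths = sorted(glob.glob(os.path.join(workdir, "emp*.csv")))

# One dataframe per file, each tagged with the name of its source file.
frames = [pd.read_csv(p).assign(filename=os.path.basename(p)) for p in paths]

# A single concat; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating once at the end avoids copying the growing dataframe on every pass through the loop, which matters when there are many files.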
df.groupby(['filename']).count()
          emp_no  emp_name  emp_sal
filename
emp1.csv       4         4        4
emp2.csv       4         4        4
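Keep in mind that count() reports the non-missing values in each column, so the numbers can differ between columns when a file has blanks. As a small sketch (the tiny dataframe and its missing salary are made up for illustration), groupby(...).size() gives the number of rows per file whether or not values are missing.

```python
import pandas as pd

# Hypothetical combined data; one salary is deliberately missing.
combined = pd.DataFrame(
    {
        "emp_no": ["E1001", "E1002", "E2001", "E2002"],
        "emp_sal": [1000, None, 1000, 2000],
        "filename": ["emp1.csv", "emp1.csv", "emp2.csv", "emp2.csv"],
    }
)

# count() skips missing values, column by column.
non_missing = combined.groupby("filename").count()

# size() counts rows per group, missing values included.
rows_per_file = combined.groupby("filename").size()

print(non_missing)
print(rows_per_file)
```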
Post Reference: Vikram