I intend to read csv file with specific pattern from nested file directories where each subdirectories has multiple csv files and I only want to read the one end with specific pattern. I already figured out the way of doing this in R, but wanted to do this in pandas. I found couple of useful post but can't able to read the file that I wanted to read in pandas.
current attempt
here is file structure that I have and wanted to read file start with Z_19_xx.csv. for instance:
import pandas as pd
dir1="demo2020/p1 pop/csv/Z_19_master.csv"
f1=pd.read_csv(dir1)
this is hard coded and I want to avoid of doing this. Below is the file structure:
demo2020
- p1 pop
-csv
- A_17_master.csv
- A_18_master.csv
- B_18_master.csv
- C_19_master.csv
- Z_19_master.csv
- p2 cop
-csv
- A_17_cop.csv
- A_18_cop.csv
- B_18_cop.csv
- C_19_cop.csv
- Z_19_cop.csv
- p3 res
-csv
- A_17_res.csv
- A_18_res.csv
- B_18_res.csv
- C_19_res.csv
- Z_19_res.csv
- p4 nac
-csv
- A_17_nac.csv
- A_18_nac.csv
- B_18_nac.csv
- C_19_nac.csv
- Z_19_nac.csv
my current attempt in R:
here is my R code to do this in handy:
yr=19
dir="demo2020/"
files <-c(f1 = paste0("p1 pop/csv/Z_", yr, "_master.csv") ,
f2 = paste0('p2 cop/csv/Z_', yr,'_cop.csv') ,
f3 = paste0('p3 res/csv/Z_', yr,'_res.csv') ,
f4 = paste0('p4 nac/csv/Z_', yr,'_nac.csv')
)
path=(paste0(dir,files))
> path
[1] "demo2020/p1 pop/csv/Z_19_master.csv"
[2] "demo2020/p2 cop/csv/Z_19_cop.csv"
[3] "demo2020/p3 res/csv/Z_19_res.csv"
[4] "demo2020/p4 nac/csv/Z_19_nac.csv"
# read them
for(i in 1:length(files))
{
f <- assign(names(files[i]), read.csv(paste0(dir, files[i]),stringsAsFactors = FALSE,skip = 1))
}
python objective - pandas
I want to do this in python without hard coding, and simply want to use above R code logic in python and use pandas to read csv files. So far, here is my attempt:
import pandas
import os
parent_dir = 'demo2020/'
subject_dirs = [os.path.join(parent_dir, dir) for dir in os.listdir(parent_dir) if os.path.isdir(os.path.join(parent_dir, dir))]
filelist = []
for dir in subject_dirs:
csv_files = [os.path.join(dir, csv) for csv in os.listdir(dir) if os.path.isfile(os.path.join(dir, csv)) and and csv.startswith('Z_') and csv.endswith('.csv')]
for file in csv_files:
df=pd.read_csv(file)
filelist.append(df)
but still not getting this right, I only want to read Z_19_xx.csv from each subfolder and concatenate them. How can we do this nicely in python? Can anyone point me out of making this right in python? Any idea?