How to read csv files with specific pattern from nested file directories in pandas?

Question

I intend to read csv file with specific pattern from nested file directories where each subdirectories has multiple csv files and I only want to read the one end with specific pattern. I already figured out the way of doing this in R, but wanted to do this in pandas. I found couple of useful post but can't able to read the file that I wanted to read in pandas.

current attempt

here is file structure that I have and wanted to read file start with Z_19_xx.csv. for instance:

import pandas as pd

dir1="demo2020/p1 pop/csv/Z_19_master.csv"
f1=pd.read_csv(dir1)

this is hard coded and I want to avoid of doing this. Below is the file structure:

demo2020
    - p1 pop
        -csv
            - A_17_master.csv
            - A_18_master.csv
            - B_18_master.csv
            - C_19_master.csv
            - Z_19_master.csv
    - p2 cop
        -csv
            - A_17_cop.csv
            - A_18_cop.csv
            - B_18_cop.csv
            - C_19_cop.csv
            - Z_19_cop.csv
    - p3 res
        -csv
            - A_17_res.csv
            - A_18_res.csv
            - B_18_res.csv
            - C_19_res.csv
            - Z_19_res.csv
    - p4 nac
        -csv
            - A_17_nac.csv
            - A_18_nac.csv
            - B_18_nac.csv
            - C_19_nac.csv
            - Z_19_nac.csv

my current attempt in R:

here is my R code to do this in handy:

yr=19
dir="demo2020/"
files <-c(f1  = paste0("p1 pop/csv/Z_", yr, "_master.csv") , 
                    f2 = paste0('p2 cop/csv/Z_', yr,'_cop.csv') ,
                    f3 = paste0('p3 res/csv/Z_', yr,'_res.csv') , 
                    f4  = paste0('p4 nac/csv/Z_', yr,'_nac.csv') 
)

path=(paste0(dir,files))
> path
[1] "demo2020/p1 pop/csv/Z_19_master.csv"
[2] "demo2020/p2 cop/csv/Z_19_cop.csv"   
[3] "demo2020/p3 res/csv/Z_19_res.csv"   
[4] "demo2020/p4 nac/csv/Z_19_nac.csv"

# read them

for(i in 1:length(files))
{
    f <- assign(names(files[i]), read.csv(paste0(dir, files[i]),stringsAsFactors = FALSE,skip = 1))
}

python objective - pandas

I want to do this in python without hard coding, and simply want to use above R code logic in python and use pandas to read csv files. So far, here is my attempt:

import pandas
import os

parent_dir = 'demo2020/'
subject_dirs = [os.path.join(parent_dir, dir) for dir in os.listdir(parent_dir) if os.path.isdir(os.path.join(parent_dir, dir))]

filelist = []
for dir in subject_dirs:
    csv_files = [os.path.join(dir, csv) for csv in os.listdir(dir) if os.path.isfile(os.path.join(dir, csv)) and and csv.startswith('Z_') and csv.endswith('.csv')]
    for file in csv_files:
        df=pd.read_csv(file)
        filelist.append(df)

but still not getting this right, I only want to read Z_19_xx.csv from each subfolder and concatenate them. How can we do this nicely in python? Can anyone point me out of making this right in python? Any idea?

akuiper · Accepted Answer · 2022-02-23 19:01:54Z

1

You can use glob pattern to match the files: demo2020/p*/csv/Z_*.csv

import glob

csv_files = glob.glob('demo2020/p*/csv/Z_*.csv')

filelist = []
for file in csv_files:
    df = pd.read_csv(file)
    filelist.append(df)

answered Feb 23, 2022 at 19:01

akuiper

216k33 gold badges362 silver badges379 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Hamilton Over a year ago

what about there is multiple Z_ file like Z_17_xx, Z_18_xx, Z_19_xx and I only want to read one of them? Could you point me out that can we take above R code logic in your input here? if yes, how? Thank you

akuiper Over a year ago

You can always make * more specific. For instance to read Z_19_xx, demo2020/p*/csv/Z_19_*.csv

Hamilton Over a year ago

how about I only want to read the file from p1, p2, p3 only? how should I do it?

akuiper Over a year ago

Didn't test, but I believe you can use demo2020/p[123]*/csv/Z_19_*.csv

akuiper Over a year ago

I think in that case, you need write another pattern to match k directory: demo2020/k[4l]*/csv/Z_19_*.csv

|

Jane Luo · Accepted Answer · 2022-02-23 19:17:56Z

0

You can use Glob() function to find files recursively in Python. With glob, we can also use wildcards ("*, ?, [ranges]) apart from exact string search to make path retrieval more simple and convenient.

If you want to match the files: Z_19_xx.csv instead of Z_xx.csv, you can use the following:

import glob

csv_files = glob.glob('demo2020/p*/csv/Z_19_*.csv')

filelist = []
for file in csv_files:
    df = pd.read_csv(file)
    filelist.append(df)

answered Feb 23, 2022 at 19:17

Jane Luo

11 bronze badge

Collectives™ on Stack Overflow

How to read csv files with specific pattern from nested file directories in pandas?

2 Answers 2

6 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related