0

I intend to read csv file with specific pattern from nested file directories where each subdirectories has multiple csv files and I only want to read the one end with specific pattern. I already figured out the way of doing this in R, but wanted to do this in pandas. I found couple of useful post but can't able to read the file that I wanted to read in pandas.

current attempt

here is file structure that I have and wanted to read file start with Z_19_xx.csv. for instance:

import pandas as pd

dir1="demo2020/p1 pop/csv/Z_19_master.csv"
f1=pd.read_csv(dir1)

this is hard coded and I want to avoid of doing this. Below is the file structure:

demo2020
    - p1 pop
        -csv
            - A_17_master.csv
            - A_18_master.csv
            - B_18_master.csv
            - C_19_master.csv
            - Z_19_master.csv
    - p2 cop
        -csv
            - A_17_cop.csv
            - A_18_cop.csv
            - B_18_cop.csv
            - C_19_cop.csv
            - Z_19_cop.csv
    - p3 res
        -csv
            - A_17_res.csv
            - A_18_res.csv
            - B_18_res.csv
            - C_19_res.csv
            - Z_19_res.csv
    - p4 nac
        -csv
            - A_17_nac.csv
            - A_18_nac.csv
            - B_18_nac.csv
            - C_19_nac.csv
            - Z_19_nac.csv

my current attempt in R:

here is my R code to do this in handy:

yr=19
dir="demo2020/"
files <-c(f1  = paste0("p1 pop/csv/Z_", yr, "_master.csv") , 
                    f2 = paste0('p2 cop/csv/Z_', yr,'_cop.csv') ,
                    f3 = paste0('p3 res/csv/Z_', yr,'_res.csv') , 
                    f4  = paste0('p4 nac/csv/Z_', yr,'_nac.csv') 
)

path=(paste0(dir,files))
> path
[1] "demo2020/p1 pop/csv/Z_19_master.csv"
[2] "demo2020/p2 cop/csv/Z_19_cop.csv"   
[3] "demo2020/p3 res/csv/Z_19_res.csv"   
[4] "demo2020/p4 nac/csv/Z_19_nac.csv"

# read them

for(i in 1:length(files))
{
    f <- assign(names(files[i]), read.csv(paste0(dir, files[i]),stringsAsFactors = FALSE,skip = 1))
}

python objective - pandas

I want to do this in python without hard coding, and simply want to use above R code logic in python and use pandas to read csv files. So far, here is my attempt:

import pandas
import os

parent_dir = 'demo2020/'
subject_dirs = [os.path.join(parent_dir, dir) for dir in os.listdir(parent_dir) if os.path.isdir(os.path.join(parent_dir, dir))]

filelist = []
for dir in subject_dirs:
    csv_files = [os.path.join(dir, csv) for csv in os.listdir(dir) if os.path.isfile(os.path.join(dir, csv)) and and csv.startswith('Z_') and csv.endswith('.csv')]
    for file in csv_files:
        df=pd.read_csv(file)
        filelist.append(df)

but still not getting this right, I only want to read Z_19_xx.csv from each subfolder and concatenate them. How can we do this nicely in python? Can anyone point me out of making this right in python? Any idea?

2 Answers 2

1

You can use glob pattern to match the files: demo2020/p*/csv/Z_*.csv

import glob

csv_files = glob.glob('demo2020/p*/csv/Z_*.csv')

filelist = []
for file in csv_files:
    df = pd.read_csv(file)
    filelist.append(df)
Sign up to request clarification or add additional context in comments.

6 Comments

what about there is multiple Z_ file like Z_17_xx, Z_18_xx, Z_19_xx and I only want to read one of them? Could you point me out that can we take above R code logic in your input here? if yes, how? Thank you
You can always make * more specific. For instance to read Z_19_xx, demo2020/p*/csv/Z_19_*.csv
how about I only want to read the file from p1, p2, p3 only? how should I do it?
Didn't test, but I believe you can use demo2020/p[123]*/csv/Z_19_*.csv
I think in that case, you need write another pattern to match k directory: demo2020/k[4l]*/csv/Z_19_*.csv
|
0

You can use Glob() function to find files recursively in Python. With glob, we can also use wildcards ("*, ?, [ranges]) apart from exact string search to make path retrieval more simple and convenient.

If you want to match the files: Z_19_xx.csv instead of Z_xx.csv, you can use the following:

import glob

csv_files = glob.glob('demo2020/p*/csv/Z_19_*.csv')

filelist = []
for file in csv_files:
    df = pd.read_csv(file)
    filelist.append(df)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.