I have the following files in the folder C:\Downloads:

Mango001-003.csv
Mango004-006.csv
Mango007-100.csv
Applefruit.csv
Banana001-003.csv
Banana004-006.csv

How can I import the files for each fruit separately and then join the files for the same fruit into a single file?

The expected result is one output for Mango, one for Apple and one for Banana.

import os
import re
data_files = os.listdir(r'C:\Downloads')
def load_files(filenames):
    # Pre-compile regex for code readability
    regex = re.compile(r'Mango.*?.csv')
    
    # Map filenames to match objects, filter out not matching names
    matches = [m for m in map(regex.match, filenames) if m is not None]
    
    li = []
    for match in matches:
                
        df = pd.read_csv(match, index_col=None, header=0, dtype=object)
        li.append(df)
        
    #Concatenating the data
    frame = pd.concat(li, axis=0, ignore_index=True)
    return (frame)
    
df  = load_files(data_files)
print(df.shape)
df.head(2)

I am getting errors. It also seems more complicated than it should be, so I must be doing something wrong.

3 Answers

I think the easiest way to do this is to use glob.glob to get a list of all files that start with a particular fruit name (here I used mango) and concatenate them all together using pd.concat.

import glob
import os
import pandas as pd

data_files = r"path\to\folder\containing\csv"
# Read every CSV whose name starts with "mango" and stack them into one DataFrame
df_mango = pd.concat(map(pd.read_csv, glob.glob(os.path.join(data_files, 'mango*.csv'))), ignore_index=True)
df_mango.to_csv('mango.csv')

Here is the example I tried:

mango0110.csv
   A  B  C
0  1  2  3
mango01220.csv
   A  B  C
0  4  5  6
To get:
   A  B  C
0  1  2  3
1  4  5  6

6 Comments

It does the job, but I am unable to pass a separator, as in df = pd.read_csv(filename, sep=",").
Why do you need the separator? I'm assuming all the file names start with the name of a fruit, so glob.glob(os.path.join(data_files,'mango*.csv')) gets the files that start with mango, after which they are all concatenated at once.
The * matches anything that follows mango, as in my example files mango0110.csv etc.
Assume the values in my files are separated by "~!" and I want to import the datasets.
Wow, that's interesting. I didn't really consider anything other than comma-separated values (because I found only csv in your example). Let me go try this out for such cases :)
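
For the "~!" case raised in the comments above, one option is to bind the separator with functools.partial before mapping, since map(pd.read_csv, ...) only passes the file name. This is a minimal sketch, assuming every file uses the same separator; the helper name concat_csvs and its parameters are made up for illustration. pandas treats separators longer than one character as regular expressions, which requires engine="python".

```python
import glob
import os
from functools import partial

import pandas as pd


def concat_csvs(folder, pattern, sep=","):
    """Read every file in `folder` matching `pattern` and stack the rows.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Multi-character separators are parsed as regular expressions,
    # which only the Python parsing engine supports.
    engine = "python" if len(sep) > 1 else "c"
    reader = partial(pd.read_csv, sep=sep, engine=engine)
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    return pd.concat(map(reader, files), ignore_index=True)
```

It could then be called as concat_csvs(r"C:\Downloads", "Mango*.csv", sep="~!"). A separator containing regex metacharacters (for example "|") would need escaping first.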

Perhaps not the greatest way to do it but, for the file names given...

Try:

import pandas as pd
import glob
import re

path = r'./files' # use your path
all_files = glob.glob(path + "/*.csv")

fruits = []

# for all files in the folder get the fruit name
# this could be where things go wrong if the regex does not
# account for all filename types.  Pattern may need tweaking
# example https://regex101.com/r/E69LWa/1
for file in all_files:
    # normalise Windows backslashes so the pattern below also works there
    cleanFile = file.replace('\\', '/').replace('fruit', '')
    match = re.match(r'^.*/([A-Za-z]+)',cleanFile)
    fruits.append(match.group(1))

# There will be one output for Mango, one for Apple & one for Banana hence three...
dfs_man = []
dfs_ban = []
dfs_app = []

# for all files create a df and append to the correct list holding other dfs of the same fruit
for i, file in enumerate(all_files):
    df = pd.read_csv(file)
    if fruits[i] == 'Mango':
        dfs_man.append(df)
    elif fruits[i] == 'Banana':
        dfs_ban.append(df)
    elif fruits[i] == 'Apple':
        dfs_app.append(df)

# concatenate if more than one df in list, else just get the df out of list
if len(dfs_man) > 1:
    df_mango = pd.concat(dfs_man, ignore_index=True)
elif len(dfs_man) == 1:
    df_mango = dfs_man[0]
if len(dfs_ban) > 1:
    df_banana = pd.concat(dfs_ban, ignore_index=True)
elif len(dfs_ban) == 1:
    df_banana = dfs_ban[0]
if len(dfs_app) > 1:
    df_apple = pd.concat(dfs_app, ignore_index=True)
elif len(dfs_app) == 1:
    df_apple = dfs_app[0]
    
print(df_mango.shape, df_banana.shape, df_apple.shape)
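
The same grouping can be sketched without one list per fruit by collecting the DataFrames in a dictionary keyed by fruit name. This is only an illustration under the same file-name assumptions as above; load_by_fruit is a hypothetical helper name, and like the code above it strips the literal word "fruit" from names.

```python
import glob
import os
import re
from collections import defaultdict

import pandas as pd


def load_by_fruit(folder):
    """Return {fruit_name: DataFrame} for every CSV in `folder`.

    Hypothetical helper; assumes each file name starts with the fruit
    name, optionally followed by the word "fruit" or by digits.
    """
    groups = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        stem = os.path.basename(path).replace("fruit", "")
        match = re.match(r"[A-Za-z]+", stem)
        if match is None:
            continue  # skip names that do not start with letters
        groups[match.group(0)].append(pd.read_csv(path))
    # pd.concat also accepts a one-element list, so no special case is needed
    return {fruit: pd.concat(dfs, ignore_index=True)
            for fruit, dfs in groups.items()}
```

Each result can then be written out in a loop, for example with df.to_csv(fruit + ".csv") for each fruit, df pair in load_by_fruit(path).items(). Fruits with no files simply do not appear in the dictionary.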

1 Comment

@VidyaGanesh what's unsafe?

Thank you @Vidya Ganesh

import glob
import os
import pandas as pd

data_files = r'C:\Downloads'
list_file_names = ['Mango', 'Apple', 'Banana']
for name in list_file_names:
    # Concatenate every CSV whose name starts with this fruit
    df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(data_files, name + '*.csv'))), ignore_index=True)
    df = df.loc[:1000, :]  # keep rows labelled 0 through 1000 (label slicing is inclusive)
    print(name)
    print(df.shape)
    df.to_csv(name + ".csv")
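
One thing to watch in a loop like this: if no file matches a prefix, pd.concat raises a ValueError because it receives nothing to concatenate. A small guard, sketched here with a hypothetical helper name (concat_prefix), returns None for such prefixes so the caller can skip them.

```python
import glob
import os

import pandas as pd


def concat_prefix(folder, prefix):
    """Concatenate all `<prefix>*.csv` files in `folder`; None if there are none.

    Hypothetical helper for illustration.
    """
    files = sorted(glob.glob(os.path.join(folder, prefix + "*.csv")))
    if not files:
        return None  # pd.concat would raise ValueError on an empty list
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
```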
