Importing multiple csv files into pandas and merge them into one DataFrame

Question

I have multiple CSV files. Each file contains N rows (e.g., 1000 rows) and 43 columns.

I would like to read several CSV files from a folder into pandas and merge them into one DataFrame.

I have not been able to figure it out, though.

The problem is that the final DataFrame (i.e., frame = pd.concat(li, axis=0, ignore_index=True)) merges all 43 columns into one column (see the attached image).

An example of selected rows and columns (file one)

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9

An example of selected rows and columns (file two)

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9

The expected output would look like this: the rows of all files stacked into one DataFrame. The attached data is just an example; the actual CSV files might contain thousands of rows and more than 45 columns each.

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9

To download the two CSV files, click here (dummy data).

Here is what I have done so far:

import pandas as pd
import glob
path = r'C:\Users\alnaffakh\Desktop\doc\Data\data2\Test'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, sep='delimiter', index_col=None, header=0)
  # df = pd.read_csv(filename, sep='\t', index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)

Get rid of sep='delimiter'. As written, the code reads each DataFrame as a single column.
– Quang Hoang, Commented Oct 7, 2019 at 17:27
@QuangHoang, thanks for your reply, but if I remove it, I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 8: invalid continuation byte
– Wisam hasan, Commented Oct 7, 2019 at 17:29
Please share some dummy data. I support what @QuangHoang mentioned: you need to either get rid of sep='delimiter' or use the actual delimiter used in the file. This is why I suggest you share some dummy data (maybe 4 lines with only 5 columns) so we can test against it.
– CypherX, Commented Oct 7, 2019 at 17:53

CypherX · Accepted Answer · 2019-10-07 22:03:02Z


Solution

You could use pandas.concat to concatenate the contents of the .csv files.
In fact, I see that you already use it, and your application of concat looks fine to me. Try inspecting the individual dataframes that you read. The only way your columns could merge into a single column is if you did not specify the correct delimiter.

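import glob
import pandas as pd

# Collect the CSV paths; adjust the pattern to your own folder.
filenames = glob.glob("*.csv")

dfs = []
for filename in filenames:
    # Pass sep=... here if the files are not comma-delimited.
    df = pd.read_csv(filename)
    dfs.append(df)
frame = pd.concat(dfs, axis=0, ignore_index=True)
frame.head()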
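
To check whether the delimiter is the culprit, here is a minimal sketch. It assumes a made-up semicolon-separated sample (the real files may use something else) and uses csv.Sniffer from the standard library to guess the separator. A frame with a single column after read_csv is the symptom described above.

```python
import csv
from io import StringIO

import pandas as pd

# Made-up sample text; the semicolon separator is an assumption for illustration.
raw = (
    "Client_ID;Client_Name;Weight\n"
    "C0000001;POLYGONE;0.5\n"
    "C0000001;ALDI;0.7\n"
)

# With the default comma separator, each whole line lands in a single column.
wrong = pd.read_csv(StringIO(raw))
print(wrong.shape)

# csv.Sniffer guesses the delimiter from a sample of the text.
# The candidate list passed here is a choice made for this sketch.
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t|")
right = pd.read_csv(StringIO(raw), sep=dialect.delimiter)
print(right.shape)
```

For a file on disk, the same idea should work by reading the first few lines of the file and passing them to sniff. The guess can be wrong on unusual data, so it is worth opening one file in a text editor as well.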

Example with Dummy Data

Since the dummy data is not yet available in text format, I am using some dummy data I made up.

import pandas as pd
from io import StringIO # needed for string to dataframe conversion

file1 = """
Col1    Col2    Col3    Col4    Col5
1   ABCDE   AE10    CD11    BC101F
2   GHJKL   GL20    JK22    HJ202M
3   MNPKU   MU30    PK33    NP303V
4   OPGHD   OD40    GH44    PG404E
5   BHZKL   BL50    ZK55    HZ505M
"""

file2 = """
Col1    Col2    Col3    Col4    Col5
1   AZYDE   AE10    CD11    BC100F
2   GUFKL   GL24    JK22    HJ207M
3   MHPRU   MU77    PK39    NP309V
4   OPGBB   OE90    GH41    PG405N
5   BHTGK   BL70    ZK53    HZ508Z
"""

Load data as individual dataframes and then concatenate them.

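# The sample text above is whitespace-separated, so split on runs of whitespace.
# Use sep='\t' instead if your files are strictly tab-delimited.
df1 = pd.read_csv(StringIO(file1), sep=r'\s+')
df2 = pd.read_csv(StringIO(file2), sep=r'\s+')
print(pd.concat([df1, df2], ignore_index=True))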

Output:

   Col1   Col2  Col3  Col4    Col5
0     1  ABCDE  AE10  CD11  BC101F
1     2  GHJKL  GL20  JK22  HJ202M
2     3  MNPKU  MU30  PK33  NP303V
3     4  OPGHD  OD40  GH44  PG404E
4     5  BHZKL  BL50  ZK55  HZ505M
5     1  AZYDE  AE10  CD11  BC100F
6     2  GUFKL  GL24  JK22  HJ207M
7     3  MHPRU  MU77  PK39  NP309V
8     4  OPGBB  OE90  GH41  PG405N
9     5  BHTGK  BL70  ZK53  HZ508Z
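
The UnicodeDecodeError quoted in the question's comments usually means the file is not UTF-8 encoded. The real encoding of those files is not known from the thread, so the following is only a sketch: read_csv_with_fallback is a hypothetical helper, and the candidate list ("utf-8", then "cp1252") is an assumption you would replace with whatever encoding your files actually use. The demo writes a small cp1252 file to a temporary folder so that the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252"), **kwargs):
    """Hypothetical helper: try each candidate encoding until one decodes."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs)
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo with a made-up file: the byte 0xC7 (the letter C with cedilla in cp1252)
# is not valid UTF-8 in this position, so the first attempt fails.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(
        "Client_ID,Client_Name\nC0000001,FRAN\u00c7OIS\n".encode("cp1252")
    )
    demo = read_csv_with_fallback(sample)

print(demo)
```

Note that a fallback only proves the bytes can be decoded, not that the text is right. If names come out garbled, the true encoding is probably a different one.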

edited Oct 7, 2019 at 22:03

answered Oct 7, 2019 at 17:29

CypherX


9 Comments

CypherX Over a year ago

@Wisamhasan Thank you for making the data available. However, please paste the first 5 columns and 4 rows of each of the two csv files into your problem statement as sample data, and also show the output you expect. Your data needs to be minimal and reproducible. It’s best not to share data files.

CypherX Over a year ago

@Wisamhasan Thank you for the rows and columns. However, I asked for the data to be pasted into your problem description as text. This makes your problem easily replicable. Please make a code block and paste your data-columns (subset) from file-1 and file-2 into that code block.

Wisam hasan Over a year ago

Thanks, but again the attached code doesn't solve the problem.

CypherX Over a year ago

The code answers the problem you mentioned. I left another comment about checking the actual delimiter used. It looks like your problem lies in the data. Please check which delimiter has been used and then use that.

CypherX Over a year ago

@Meet Yes, you can use multi-index with a file-source identifier. But I would advise against using filenames as part of the multi index. Filenames could be long and when they are named you may not have any control over their nomenclature logic. Rather, if you want to track just the source of the data, I would suggest you add another column “Source” and fill in the file name there. You can always conditionally extract file-specific data this way. But consider keeping your index singular as long as you can, unless it is absolutely necessary.
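
The "Source" column suggestion above can be sketched roughly like this. The file names and CSV text are made up and held in memory for the example; with real files you would loop over the paths instead and could store the path's base name.

```python
from io import StringIO

import pandas as pd

# Hypothetical file names mapped to made-up CSV text, standing in for files on disk.
files = {
    "file_one.csv": "Client_ID,Weight\nC0000001,0.5\nC0000001,0.6\n",
    "file_two.csv": "Client_ID,Weight\nC0000001,1.5\n",
}

# Tag every row with the name of the file it came from, then stack the frames.
frames = [
    pd.read_csv(StringIO(text)).assign(Source=name)
    for name, text in files.items()
]
combined = pd.concat(frames, ignore_index=True)

# File-specific rows can be pulled out later with a plain boolean filter.
only_two = combined[combined["Source"] == "file_two.csv"]
print(combined)
```

This keeps a single integer index, as recommended in the comment, while still recording where each row came from.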
