
I have multiple CSV files. Each file contains N rows (e.g., 1000 rows) and 43 columns.

I would like to read several CSV files from a folder into pandas and merge them into one DataFrame.

I have not been able to figure it out, though.

The problem is that the final DataFrame (i.e., frame = pd.concat(li, axis=0, ignore_index=True)) merges all 43 columns into a single column (see the attached screenshot of the code).

An example of selected rows and columns (file one):

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9

An example of selected rows and columns (file two):

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9

The expected output would look like this, with the rows of all files stacked into one DataFrame. The data shown here is just an example; the actual CSV files might contain thousands of rows and more than 45 columns each.

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             

To download the two CSV files, click here (dummy data).

Here is what I have done so far:

import pandas as pd
import glob
path = r'C:\Users\alnaffakh\Desktop\doc\Data\data2\Test'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, sep='delimiter', index_col=None, header=0)
  # df = pd.read_csv(filename, sep='\t', index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)

  • Get rid of sep='delimiter'. As it is now, the code reads each DataFrame as a single column. Commented Oct 7, 2019 at 17:27
  • @QuangHoang, thanks for your reply, but if I remove it, I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 8: invalid continuation byte. Commented Oct 7, 2019 at 17:29
  • Please share some dummy data. I support what @QuangHoang mentioned: you need to either get rid of sep='delimiter' or use the actual delimiter that is used in the file. This is why I suggest you share some dummy data (maybe 4 lines with only 5 columns) so we can test against it. Commented Oct 7, 2019 at 17:53
  • You might consider using dask. Commented Oct 7, 2019 at 19:18

1 Answer

Solution

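You can use pandas.concat to stack the contents of the .csv files into one DataFrame.
In fact, you already use it, and your call to concat looks fine to me. Try inspecting the individual DataFrames that you read. The only way your columns could merge into a single column is if the delimiter passed to read_csv does not match the one used in the files. With sep='delimiter', pandas looks for the literal text "delimiter" between fields, never finds it, and so reads each line as one column.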

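import glob
import pandas as pd

# Folder from the question; adjust to your own location
path = r'C:\Users\alnaffakh\Desktop\doc\Data\data2\Test'
filenames = glob.glob(path + "/*.csv")

dfs = list()
for filename in filenames:
    # Default sep=','; pass sep=... if the files use another delimiter
    df = pd.read_csv(filename)
    dfs.append(df)
frame = pd.concat(dfs, axis=0, ignore_index=True)
frame.head()  # inspect the combined frame, not the last file read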
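
To see the delimiter effect in isolation, here is a minimal sketch using in-memory text instead of a file. The two-line sample and the comma separator are assumptions made up for illustration; the real files may use a different separator.

```python
import pandas as pd
from io import StringIO

# Hypothetical two-line CSV standing in for one of the real files
raw = "Client_ID,Client_Name,Weight\nC0000001,POLYGONE,0.5\n"

# A separator longer than one character is treated as a regular expression.
# The literal text "delimiter" never occurs in the data, so nothing is split
# and every line ends up in a single column.
wrong = pd.read_csv(StringIO(raw), sep="delimiter", engine="python")

# With the separator that the data actually uses, the columns split.
right = pd.read_csv(StringIO(raw), sep=",")

print(wrong.shape)
print(right.shape)
```

Checking df.shape or df.columns right after each read_csv call is a quick way to confirm that the separator was right before concatenating.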

Example with Dummy Data

Since the dummy data is not yet available as text, I am using some dummy data that I made up.

import pandas as pd
from io import StringIO # needed for string to dataframe conversion

file1 = """
Col1    Col2    Col3    Col4    Col5
1   ABCDE   AE10    CD11    BC101F
2   GHJKL   GL20    JK22    HJ202M
3   MNPKU   MU30    PK33    NP303V
4   OPGHD   OD40    GH44    PG404E
5   BHZKL   BL50    ZK55    HZ505M
"""

file2 = """
Col1    Col2    Col3    Col4    Col5
1   AZYDE   AE10    CD11    BC100F
2   GUFKL   GL24    JK22    HJ207M
3   MHPRU   MU77    PK39    NP309V
4   OPGBB   OE90    GH41    PG405N
5   BHTGK   BL70    ZK53    HZ508Z
"""

Load data as individual dataframes and then concatenate them.

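# The columns in the strings above are separated by whitespace, so split on
# runs of whitespace (use sep='\t' if your data is strictly tab-separated)
df1 = pd.read_csv(StringIO(file1), sep=r'\s+')
df2 = pd.read_csv(StringIO(file2), sep=r'\s+')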
print(pd.concat([df1, df2], ignore_index=True))

Output:

   Col1   Col2  Col3  Col4    Col5
0     1  ABCDE  AE10  CD11  BC101F
1     2  GHJKL  GL20  JK22  HJ202M
2     3  MNPKU  MU30  PK33  NP303V
3     4  OPGHD  OD40  GH44  PG404E
4     5  BHZKL  BL50  ZK55  HZ505M
5     1  AZYDE  AE10  CD11  BC100F
6     2  GUFKL  GL24  JK22  HJ207M
7     3  MHPRU  MU77  PK39  NP309V
8     4  OPGBB  OE90  GH41  PG405N
9     5  BHTGK  BL70  ZK53  HZ508Z
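
The UnicodeDecodeError quoted in the comments on the question suggests that the files may not be UTF-8 encoded. The sketch below shows one possible way to handle that by passing encoding to read_csv. The sample bytes, the semicolon separator, and the cp1252 encoding are all assumptions for illustration; the actual encoding of the asker's files is not known from the post.

```python
import pandas as pd
from io import BytesIO

# Hypothetical file content. The byte 0xC7 (here "Ç" in cp1252) followed by
# an ASCII letter is not valid UTF-8.
raw = "Client_ID;Client_Name\nC0000001;FRANÇOIS\n".encode("cp1252")

# Reading with the default UTF-8 decoding is expected to fail on such bytes.
try:
    pd.read_csv(BytesIO(raw), sep=";")
except UnicodeDecodeError as exc:
    print("Default decoding failed:", exc)

# Passing the encoding the file was written with lets pandas decode it.
df = pd.read_csv(BytesIO(raw), sep=";", encoding="cp1252")
print(df)
```

If the encoding is unknown, opening the file in a text editor that reports the encoding, or asking whoever exported the data, is usually more reliable than guessing.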

9 Comments

@Wisamhasan Thank you for making the data available. However, please paste the first 5 columns and 4 rows of each of the two csv files into your problem statement as your sample data. Then also provide what you expect as output. Your data needs to be minimal and reproducible. It’s best not to share data files.
@Wisamhasan Thank you for the rows and columns. However, I asked for the data to be pasted into your problem description as text. This makes your problem easily replicable. Please make a code block and paste your data columns (a subset) from file-1 and file-2 into that code block.
Thanks, but again, the attached code doesn't solve the problem.
The code answers the problem you described. I left another comment about checking the actual delimiter used. It looks like your problem lies in the data. Please check which delimiter has been used and then pass that one.
@Meet Yes, you can use multi-index with a file-source identifier. But I would advise against using filenames as part of the multi index. Filenames could be long and when they are named you may not have any control over their nomenclature logic. Rather, if you want to track just the source of the data, I would suggest you add another column “Source” and fill in the file name there. You can always conditionally extract file-specific data this way. But consider keeping your index singular as long as you can, unless it is absolutely necessary.
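
The "Source" column suggested in the comment above could look roughly like this. It is a minimal sketch: the file names and their contents are hypothetical in-memory stand-ins, and with real files you would loop over the paths and call read_csv on each path instead.

```python
import pandas as pd
from io import StringIO

# Hypothetical in-memory stand-ins for two CSV files on disk
files = {
    "file1.csv": "Client_ID,Weight\nC0000001,0.5\nC0000001,0.6\n",
    "file2.csv": "Client_ID,Weight\nC0000001,1.5\n",
}

dfs = []
for name, text in files.items():
    df = pd.read_csv(StringIO(text))
    df["Source"] = name  # record which file each row came from
    dfs.append(df)

# One flat integer index, with the origin kept as an ordinary column
frame = pd.concat(dfs, ignore_index=True)

# Conditionally extract the rows that came from one file
only_file2 = frame[frame["Source"] == "file2.csv"]
print(frame)
```

Keeping the origin in a regular column leaves the index simple, and a boolean filter on "Source" recovers the rows of any single file.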