Read in csv files in a folder and output to one single csv

123 Views Asked by At

I have a folder with many csv files. I want to read them in and depending on certain criteria output the records to specific output files. So in my case I have 3 different output files.

So I have many of csv files. Let's consider one file looking like:

Column1;Column2
90ABCDE;AB
80BDESD;CD

And another looking like:

Column1;Column2
80ABCDE;AB
80BDESD;CD
80BCCDE;AB
70BDESD;CD

Each csv file has an header. The header is always the same. In the final csv files I would like to have the header once in the beginning, but not in between the data.

I want to have one file where every record is stored. In another I would like to have only those records where Column1 begins with '80B'. In the third file I would like to have those records where Column1 does not begin with '80B' and the fourth character is not equal to 'D'.

So the output should be:

file 'all.csv'

Column1;Column2
90ABCDE;AB
80BDESD;CD
80ABCDE;AB
80BDESD;CD
80BCCDE;AB
70BDESD;CD

file 'subset_1'

Column1;Column2
80BDESD;CD
80BDESD;CD
80BCCDE;AB

file 'subset_2'

Column1;Column2
80BCCDE;AB

I tried the following code:

import glob
import csv
import os


path = r'C:\myfolder\test'

all_files=glob.glob(os.path.join(path, "*.csv"))

with open(r'C:\myfolder\all.csv', "w", newline='') as dall, \
open(r'C:\myfolder\subset_1.csv', "w", newline='') as \
subset_1, open(r'C:\myfolder\subset_2.csv', "w", newline='') as subset_2:
    
    cw_all = csv.writer(dall, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw_subset_1 = csv.writer(subset_1, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw_subset_2 = csv.writer(subset_2, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    
    cw_all.writerow(['Column1','Column2'])
    cw_subset_1.writerow(['Column1','Column2'])
    cw_subset_2.writerow(['Column1','Column2'])
    
    for filename in all_files:
        with open(filename) as infile:
            cr = csv.reader(infile, delimiter=";")
            #next(cr)
            for line in cr:
                cw_all.writerow(line)
            if (
                (line[0][:3] !="80B")
                ): cw_subset_1.writerow(line)
            if (
                (line[0][:3] =="80B") and
                (line[0][3:4] =="D")
                ): cw_subset_2.writerow(line)

For the first try I also ignored the problem with the header and commented out the next(cr). But it is not working. Somehow the records are not properly stored into the corresponding files. The line pointer is not putting each record properly into the files. Where is my mistake?

I would like to do it on a csv level. Without pandas.

(I want to write it "on the fly" while reading the files, so I do not want to first create a large file with everything, then read this once to create the first subset and then read the large file a second time to create the second subset. This is quite inefficient as I have to read the large file several times.)

1

There are 1 best solutions below

4
On BEST ANSWER

There are three problems I see:

  1. Uncomment next(cr) so the headers aren't copied into the new files.
  2. The if statements should be indented under the for line in cr: line.
  3. line[0][3:4] == "D" should be be line[0][3:4] != "D".

Note that line[0][3:4] != "D" can be just line[0][3] != "D" when checking a single character in a string.

You description of the 3rd file does not match the desired output. I went with the description below. Comments are from the OP requirements.

for filename in all_files:
    with open(filename) as infile:
        cr = csv.reader(infile, delimiter=';')
        next(cr)  # skip the header in each input file
        for line in cr:
            # one file where every record is stored.
            cw_all.writerow(line)
            # only those records where Column1 begins with '80B'.
            if line[0][:3] == '80B':
                cw_subset_1.writerow(line)
            # those records where Column1 does not begin with '80B'
            # and the fourth character is not equal to 'D'.            
            if line[0][:3] != '80B' and line[0][3] != 'D':
                cw_subset_2.writerow(line)