Pandas Combine Excel Files from Multiple Directories into a third directory

260 Views Asked by At

I am trying to combine multiple Excel spreadsheets from two different directories into one directory using Pandas.

I created a function that will change the Census population data from a .csv data into an excel spreadsheet.

import requests, re, logging, sys
from os import path
import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from pathlib import Path
import shutil
def pop_data_clean():
    pop_direc = 'Path A'
    for pop_file in os.listdir(pop_direc):
        pop_filename = os.fsdecode(pop_file)
        if pop_filename.endswith('.csv'):
            population = pd.read_csv(pop_file)
            population = population[[ 'STATE', 'COUNTY', 'STNAME', 'CTYNAME' ,'YEAR' ,'POPESTIMATE',
                                     'POPEST_MALE','POPEST_FEM' ,'AGE18PLUS_TOT', 'AGE18PLUS_MALE',
                                     'AGE18PLUS_FEM','AGE65PLUS_TOT', 'AGE65PLUS_MALE','AGE65PLUS_FEM',
                                     'MEDIAN_AGE_TOT', 'MEDIAN_AGE_MALE','MEDIAN_AGE_FEM']]
            population = population.rename(columns={ 
            'STNAME':'STATE NAME',
            'CTYNAME':'CITY NAME',
            'POPESTIMATE':'POPULATION ESTIMATE',
            'POPEST_MALE':'TOTAL MALE POPULATION',
            'POPEST_FEM':'TOTAL FEMALE POPULATION',
            'AGE18PLUS_TOT':'TOTAL POPULATION ABOVE 18',
            'AGE18PLUS_MALE':'TOTAL MALE POPULATION ABOVE 18',
            'AGE18PLUS_FEM':'TOTAL FEMALE POPULATION ABOVE 18',
            'AGE65PLUS_TOT':'TOTAL POPULATION ABOVE 65',
            'AGE65PLUS_MALE':'TOTAL MALE POPULATION ABOVE 65',
            'AGE65PLUS_FEM':'TOTAL FEMALE POPULATION ABOVE 65',
            'MEDIAN_AGE_TOT':'MEDIAN AGE',
            'MEDIAN_AGE_MALE':'MEDIAN MALE AGE',
            'MEDIAN_AGE_FEM':'MEDIAN FEMALE AGE'})
            population.to_excel(pop_filename[0:6] + '.xlsx', index=False)

I then did the same function for the ethnicity data and saved it into a different folder.

def ethnic_data_clean():
    ethnic_direc = 'Path B'
    for ethn_file in os.listdir(ethnic_direc):
        ethn_filename = os.fsdecode(ethn_file)
        if ethn_filename.endswith('.csv'):
            Ethnic = pd.read_csv(ethn_file)
            for i in range(0,18):
                Ethnic = Ethnic[Ethnic.AGEGRP != i+1]
            Ethnic = Ethnic[['WA_MALE', 'WA_FEMALE', 'BA_MALE', 'BA_FEMALE', 'IA_MALE', 'IA_FEMALE', 'AA_MALE', 'AA_FEMALE', 
            'NA_MALE', 'NA_FEMALE', 'NH_MALE', 'NH_FEMALE', 'H_MALE', 'H_FEMALE', ]]
            Ethnic = Ethnic.rename(columns={ 
            'WA_MALE': 'White Alone Male Population',
            'WA_FEMALE':'White Alone Female Population',
            'BA_MALE': 'Black or African American Male Population',
            'BA_FEMALE':'Black or African American Female Population',
            'IA_MALE':'American Indian and Alaska Native Male Population',
            'IA_FEMALE':'American Indian and Alaska Native Female Population',
            'AA_MALE':'Asian Male Population',
            'AA_FEMALE':'Asian Female Population',
            'NA_MALE':'Native Hawaiian and Other Pacific Islander Male Population',
            'NA_FEMALE':'Native Hawaiian and Other Pacific Islander Female Population',
            'NH_MALE': 'Non-Hispanic Male population',
            'NH_FEMALE': 'Non-Hispanic Female population',
            'H_MALE': 'Hispanic Male population',
            'H_FEMALE':'Hispanic Female population'})
            Ethnic.to_excel(ethn_filename[0:9] + '.xlsx', index=False)

I am using the same naming conventions between folder A "Path A" and folder B "Path B". The file starts with the 2 state name Abbreviation then an underscore then what it is. I want to combine the excel sheet from Path A and Path B into one excel sheet into Path C and loop through without having to code each individual state.

Example:

Path A has: AL_CensusData SC_CensusData

Path B has: AL_Ethnic SC_Ethnic

I want to combine AL_CensusData with AL_Ethnic data into AL_Master and save it in Path C and do that for each file in each folder. The cleaning process ensures that each file has the same number of rows so there won't be any Pandas errors. I can't figure out how to alternately loop between two separate directories and combining a file, saving it to a third folder, then moving to the next alternate file in the two folders.

1

There are 1 best solutions below

3
piterbarg On BEST ANSWER

I would first get the file names from both pathA and pathB directories, index them by state abbreviation, and then loop over states reading/writing appropriate files

Here is the code to do it, you may need to tweak it for your requirements but hopefully the idea is clear. I assume the direcotiries are './test/A', etc as defined in the code

from pathlib import Path,PurePath


# these are our paths
pathA = './test/A'
pathB = './test/B'
pathC = './test/C'

# wrap into Path objects
pathA_ = Path(pathA)
pathB_ = Path(pathB)

# we will map filenames by state abbrev in pathA. assume file names are  XX_blah
a_map = {file.name.split('_')[0]:str(file) for file in pathA_.glob('*')}
b_map = {file.name.split('_')[0]:str(file) for file in pathB_.glob('*')}

# now loop over states, loading pandas from pathA and pathB from files that correspond to each state
# and saving to pathC
for state in a_map.keys():
    c_file = PurePath(pathC, state + '_Master.csv')
    print(f'pathA file: {a_map[state]}, pathB file:{b_map[state]}, pathC file: {c_file}')
    
    # Here you put your df merging logic, this is a stub
    dfa = pd.read_csv(a_map[state])
    dfb = pd.read_csv(b_map[state])
    dfc = pd.concat([dfa,dfb], axis=1, sort = False)
    dfc.to_csv(c_file)

Note the debugging print statement -- it produces, for me,

pathA file: test\A\AK_C.txt, pathB file:test\B\AK_E.txt, pathC file: test\C\AK_Master.csv
pathA file: test\A\AL_C.txt, pathB file:test\B\AL_E.txt, pathC file: test\C\AL_Master.csv
pathA file: test\A\AZ_C.txt, pathB file:test\B\AZ_E.txt, pathC file: test\C\AZ_Master.csv

(I only created three states' worth of files)