Append processed data to a CSV file and keep a record of the last processed row


I have 4000 CSV files in a folder on Windows 10. Each file has around 500 rows, from which I read a text column and a few identity columns. Each file is processed in a loop and saved. Because of system limitations, the process sometimes gets interrupted. So instead of saving the whole file after processing, I want to keep appending the processed individual records to an output CSV file. If the process is interrupted at any time, for example because the Python process is closed or the system restarts, the script should restart itself, resume from the last file and last record, and begin appending again. I don't have admin access.

Please suggest an efficient way. The processing code does NLP cleaning, a lot of custom regex, and custom processing. It is a very busy process.

Sample code:

import glob
import logging

import pandas as pd

def clean_process(df):
    ...  # some code

def read_save_csv():
    logging.debug("start reading files")
    files_path = "some path"
    for file in glob.glob(files_path):
        df = pd.read_csv(file)
        logging.debug("ended file read")
        try:
            df_process = clean_process(df)
            logging.debug("start saving file " + file)
        except Exception as e:
            logging.debug("error " + str(e))

read_save_csv()

There is 1 best solution below

Patrick Pichler On

If you want to persist the state (last file and record) when the program is closed and restarted, then you have to save the state into a file. If the file exists, then the loop should continue at the last file and record, otherwise start from the beginning.
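As a rough illustration of "continue at the last file and record", here is a minimal sketch. The function name `remaining_work` and the `row_counts` mapping are hypothetical and only serve the example; it assumes the files are always visited in sorted order so that a saved position means the same thing after a restart.

```python
def remaining_work(files, row_counts, last_file=None, last_record=-1):
    """Yield (file, row_index) pairs that still need processing.

    files       -- iterable of file names (hypothetical input)
    row_counts  -- dict mapping each file name to its number of rows
    last_file   -- file recorded in the saved state, or None on a fresh start
    last_record -- index of the last row that was fully processed
    """
    files = sorted(files)  # stable order keeps the saved position meaningful
    # Resume at the saved file if it is known, otherwise start from the beginning.
    start = files.index(last_file) if last_file in files else 0
    for name in files[start:]:
        # Only the interrupted file is resumed mid-way; later files start at row 0.
        first_row = last_record + 1 if name == last_file else 0
        for row in range(first_row, row_counts[name]):
            yield name, row
```

With no saved state every row of every file is yielded; with a saved state, everything up to and including the saved record is skipped.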

To save the state when the program exits, the following code can be used:

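import atexit
import os
import signal
import sys

STATE_FILE = "state.txt"

# Updated by the processing loop as it advances.
last_file = None
last_record = None

def init_state():
  # Restore the last processed file and record, if a state file exists.
  global last_file, last_record
  if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "r") as f:
      lines = f.read().splitlines()
    if len(lines) >= 2:
      last_file = lines[0]
      last_record = int(lines[1])

def handle_exit():
  # Save state (last file and record) to file.
  if last_file is not None and last_record is not None:
    with open(STATE_FILE, "w") as f:
      f.write(f"{last_file}\n{last_record}\n")

def handle_signal(signum, frame):
  # Turn the signal into a normal exit so the atexit hook runs.
  sys.exit(0)

atexit.register(handle_exit)
signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGINT, handle_signal)

init_state()
# iterate over loops
# start at last_file and last_record if they are not None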

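It should work with graceful exits and with Ctrl+C or SIGTERM. A forced kill (for example ending the task in Task Manager) or a power loss gives Python no chance to run these handlers, so covering those cases means also saving the state from inside the loop as records are processed.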
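For interruptions where no exit handler can run, one option is to append each processed record to the output file and write the checkpoint right after it. The sketch below is only an illustration under assumptions: `state.json`, `output.csv`, `process_file` and the `clean_row` callback are hypothetical names, and rows are treated as plain lists rather than DataFrame rows.

```python
import csv
import json
import os

STATE_PATH = "state.json"    # hypothetical checkpoint file
OUTPUT_PATH = "output.csv"   # hypothetical combined output file

def load_state(path=STATE_PATH):
    """Return the saved position, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"file": None, "record": -1}

def save_state(file_name, record, path=STATE_PATH):
    """Write the checkpoint to a temp file, then swap it in with os.replace."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"file": file_name, "record": record}, f)
    os.replace(tmp, path)  # avoids leaving a half-written state file

def process_file(file_name, rows, clean_row, start=0,
                 output_path=OUTPUT_PATH, state_path=STATE_PATH):
    """Append each cleaned row to the output, checkpointing after every row."""
    with open(output_path, "a", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for i, row in enumerate(rows):
            if i < start:
                continue  # already written before the interruption
            writer.writerow(clean_row(row))
            out.flush()  # push the row out of Python's buffer
            save_state(file_name, i, state_path)
```

On restart, `load_state()` gives the file and record to resume from, and `start` would be the saved record plus one. If the process dies between writing a row and saving the checkpoint, that one row is written again on the next run, so the output may need de-duplication on the identity columns. Checkpointing every N rows instead of every row trades a little repeated work for fewer state writes.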