I have 4,000 CSV files in a folder on Windows 10. Each file has around 500 rows, from which I read a text column and a few identity columns. Each file is processed in a loop and then saved. Because of a system limitation, the process sometimes gets interrupted. So instead of saving the whole file after processing, I want to keep appending the processed individual records to an output CSV file. If the process is interrupted at any time (for example the Python process is closed or the system restarts), the script should restart itself, resume from the last file and last record, and begin appending again. I don't have admin access.
Please suggest an efficient way to do this. The processing code does NLP cleaning, a lot of custom regex, and other custom processing; it is a very busy process.
sample code:
import glob
import logging

import pandas as pd

def clean_process(df):
    # some code: NLP cleaning, custom regex, other processing
    return df

def read_save_csv():
    logging.debug('start reading files')
    files_path = "some path"
    for file in glob.glob(files_path):
        df = pd.read_csv(file)
        logging.debug('ended file read')
        try:
            df_process = clean_process(df)
            logging.debug("start saving file " + file)
            # save df_process here
        except Exception as e:
            logging.debug("error " + str(e))

read_save_csv()
If you want to persist the state (the last file and record) across the program being closed and restarted, you have to save that state to a file. If the state file exists at startup, the loop should resume from the last file and record; otherwise it should start from the beginning.
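A minimal sketch of that idea, keeping a JSON state file next to the output and appending one processed record at a time, as the question asks. The names state.json and output.csv, the "some path" glob pattern, and the process_record() helper are illustrative placeholders, not part of your original code:

    import csv
    import glob
    import json
    import os

    import pandas as pd

    STATE_FILE = "state.json"    # illustrative name
    OUTPUT_FILE = "output.csv"   # illustrative name

    def load_state():
        # Resume point: which file and row index were processed last.
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)
        return {"file": None, "row": -1}

    def save_state(state):
        # Write to a temp file and rename, so a crash mid-write cannot corrupt the state.
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, STATE_FILE)

    def process_record(row):
        # Replace with your NLP cleaning / regex logic; must return a list of output fields.
        return list(row)

    def read_save_csv():
        state = load_state()
        reached_resume_point = state["file"] is None
        for file in sorted(glob.glob("some path/*.csv")):
            if not reached_resume_point and file != state["file"]:
                continue  # this file was finished in an earlier run
            start_row = state["row"] + 1 if file == state["file"] else 0
            reached_resume_point = True
            df = pd.read_csv(file)
            with open(OUTPUT_FILE, "a", newline="", encoding="utf-8") as out:
                writer = csv.writer(out)
                for i, row in df.iloc[start_row:].iterrows():
                    writer.writerow(process_record(row))
                    out.flush()  # push the record to disk before recording progress
                    save_state({"file": file, "row": int(i)})

    read_save_csv()

Updating the state after every record is slower than saving once per file, but it means a restart loses at most the record that was in flight.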
To save the state when the program exits, the following code can be used:
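One possible sketch, assuming the save_state() helper from the block above and a module-level current_state dict that the processing loop keeps updated (both names are illustrative):

    import atexit
    import signal
    import sys

    # Assumed to be kept up to date by the processing loop,
    # e.g. {"file": ..., "row": ...} as in the sketch above.
    current_state = {"file": None, "row": -1}

    def persist_state():
        # Runs on normal interpreter shutdown.
        save_state(current_state)

    def handle_termination(signum, frame):
        # Turn a termination signal into a normal exit so the atexit hook runs.
        sys.exit(0)

    atexit.register(persist_state)
    signal.signal(signal.SIGTERM, handle_termination)
    if hasattr(signal, "SIGBREAK"):  # Windows-only Ctrl+Break signal
        signal.signal(signal.SIGBREAK, handle_termination)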
It should work for graceful exits and for termination signals that can be caught; a forced kill cannot be intercepted, which is why saving the state after every record, as above, still matters.