Request for help to speed up batch program for 17,000 TXT files


I have over 17,000 scanned pages (for a local history archive), which I have OCRed using Tesseract into individual TXT files. I want to be able to locate every page containing a search word of more than 3 lower-case letters. So for each TXT file I need to:

  1. Delete all rubbish from the OCR text i.e. non-alphanumeric characters - jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -
  2. Remove 1, 2 and 3 letter words - jrepl "\b\w{1,3}\b" "" /x /f %%G /O -
  3. Change all characters to lower case - jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -
  4. To be able to sort the remaining words they need to be on separate new lines - jrepl "\s" "\n" /x /f %%G /O -
  5. Finally sort all unique words into alphabetic order and create the modified TXT file - sort /UNIQUE %%G /O %%G
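For comparison, the five steps above can be collapsed into a single pass over each file, so that each file is read and written once instead of being rewritten by five separate commands. The following is only a sketch in Python, not a tested replacement for the JREPL batch; the function names `clean_text` and `process_file` are hypothetical, and it assumes the OCR output is UTF-8 text. Unlike the step-by-step version, it leaves no blank lines in the output.

```python
import re
from pathlib import Path


def clean_text(text: str) -> list[str]:
    """Return the sorted unique lower-case words longer than 3 characters."""
    # Step 1: delete everything that is not a letter, digit or whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Steps 3 and 4: lower-case, then split on whitespace (one word per item).
    words = text.lower().split()
    # Steps 2 and 5: drop words of 1 to 3 characters, de-duplicate and sort.
    return sorted({w for w in words if len(w) > 3})


def process_file(path: Path) -> None:
    """Overwrite one TXT file with its cleaned word list, one word per line."""
    # Assumption: files are UTF-8; undecodable bytes are simply dropped.
    text = path.read_text(encoding="utf-8", errors="ignore")
    path.write_text("\n".join(clean_text(text)) + "\n", encoding="utf-8")
```

Calling `process_file` on every `*.txt` file in a copy of the archive folder would produce word lists of the same shape as the batch file does; it is worth comparing a few outputs against the existing results before trusting it.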

I have a batch file that does the above using JREPL, but it is very slow. It has been running for over 100 hours and I'm not even halfway. Any suggestions for speeding up the processing? I am running Windows 10. Thanks.

Magoo On

Solution?

Since your existing batch does what you want, no doubt testing a replacement will occupy some hours - so:

Split the 17,000 files (or those that remain unprocessed) into as many separate directories as you have cores, then start your existing batch on each directory. Since it's the weekend, leave the process running overnight. With 8 cores, it should be done in 15 hours or so, while you catch up on sleep or gardening or whatever.
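One way to do the split itself is sketched below in Python. This is an illustration only: the function name `split_into_dirs` and the `part1`, `part2`, ... directory names are made up for the example, and it assumes all unprocessed files sit directly in one folder with a `.txt` extension. Files are dealt out round-robin so each directory gets roughly the same number.

```python
import shutil
from pathlib import Path


def split_into_dirs(src: Path, n_dirs: int) -> list[Path]:
    """Move the *.txt files in src into n_dirs subdirectories, round-robin."""
    files = sorted(src.glob("*.txt"))
    # Hypothetical directory names: part1, part2, ... inside the source folder.
    targets = [src / f"part{i + 1}" for i in range(n_dirs)]
    for target in targets:
        target.mkdir(exist_ok=True)
    for index, file in enumerate(files):
        # File 0 goes to part1, file 1 to part2, and so on, wrapping around.
        shutil.move(str(file), str(targets[index % n_dirs] / file.name))
    return targets
```

For example, `split_into_dirs(Path(r"C:\archive\todo"), 8)` (the path is a placeholder) would spread the files over eight directories; `os.cpu_count()` reports the number of logical processors if you are unsure how many to use. The existing batch can then be started once per directory, each in its own command window.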