How can I remove a list of substrings, then find and delete the first TAB and after in a CSV with multiple lines?


Below is an example of my dataset, "doc name with spaces.csv" (with anonymized data). The file has multiple lines, and its length varies from day to day because it is produced by an export.

Patient Full Name   Order Date Of Service   Order Accession Number  Day of Patient Birth Date   Procedure Description   Facility Name   
AAAAA, Ms Joan  10/11/2022  xx.1111111  1 November 2000 Ultrasound Obstetric 22+ Weeks  Facility 1  
BBBBB, Mr John  10/11/2022  xx.2222222  2 July 2000 Ultrasound Left Calf    Facility 2  
CCCCC, Mrs Anne 10/11/2022  xx.3333333  3 July 2000 X-ray Chest Facility 3  
DDDDD, Master Jack  10/11/2022  xx.4444444  4 July 2000 Ultrasound Left Ankle   Facility 4
....

I am trying to create a batch script to:

  1. Read each line of "doc name with spaces.csv"
  2. Delete all occurrences of strings matching lines found in "titles.txt" (located in the same directory)
  3. Delete the first TAB (\t) found on each line, and everything after it on that line
  4. Copy the results to the Windows clipboard

Example:

AAAAA, Ms Joan  10/11/2022  xx.1111111  1 November 2000 Ultrasound Obstetric 22+ Weeks  Facility 1
BBBBB, Mr John  10/11/2022  xx.2222222  2 July 2000 Ultrasound Left Calf    Facility 2  

to

AAAAA, Joan
BBBBB, John

NB: The title is always followed by a white space, so no risk of removing Dr or Mr etc from a name, if the white space is accounted for in the find/delete. Content of "titles.txt" below:

Mrs 
Mr 
Miss 
Ms 
Dr 
Prof 
A/Prof 

I have looked at other scripts online, but none quite match what I am doing. They are also a bit advanced for where I currently am, but the need for this has arisen regardless.

Magoo On BEST ANSWER
@ECHO OFF
SETLOCAL
rem The following settings for the directories and filename are names
rem that I use for testing and deliberately include names which include spaces to make sure
rem that the process works using such names. These will need to be changed to suit your situation.

SET "sourcedir=u:\your files"
SET "filename1=%sourcedir%\q74397743.txt"
SET "filename2=%sourcedir%\q74397743_2.txt"
SET "destdir=u:\your results"
SET "outfile=%destdir%\outfile.txt"

(
FOR /f "usebackq skip=1 delims=" %%e IN ("%filename1%") DO @CALL :process %%e
)>"%outfile%"
TYPE "%outfile%"|clip

GOTO :EOF

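:process
:: First parameter = patient_id (the surname, i.e. the text before the comma)
SET "patient_id=%~1"
SET "patient_name="
SHIFT
:: Second parameter = title
:: Skip it if it is on the titles list (/x = whole-line match, /i = ignore case)
FINDSTR /i /x "%1" "%filename2%">NUL
IF NOT ERRORLEVEL 1 SHIFT
:: Build the name until a parameter begins with a digit
:nameloop
SET "nextpart=%~1"
:: Parameters exhausted without finding a digit: output what we have instead of looping forever
IF NOT DEFINED nextpart ECHO %patient_id%,%patient_name%&GOTO :eof
SET "firstchar=%nextpart:~0,1%"
FOR /L %%z IN (0,1,9) DO IF "%firstchar%"=="%%z" ECHO %patient_id%,%patient_name%&GOTO :eof
SET "patient_name=%patient_name% %nextpart%"
SHIFT
GOTO nameloop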

Always verify against a test directory before applying to real data.

Note that if the filename does not contain separators like spaces, then both usebackq and the quotes around %filename1% can be omitted.
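
As a minimal sketch of the difference, the two forms look like this. Both paths are hypothetical and only illustrate the quoting; they are not taken from the question.

```batch
@ECHO OFF
SETLOCAL
rem Hypothetical paths, for illustration only.
SET "withspaces=u:\your files\data.txt"
SET "nospaces=u:\data.txt"

rem Path contains spaces: quote it, and add usebackq so that the
rem quoted string is read as a filename rather than as literal text.
FOR /f "usebackq delims=" %%e IN ("%withspaces%") DO ECHO %%e

rem Path contains no separators: no quotes and no usebackq needed.
FOR /f "delims=" %%e IN (%nospaces%) DO ECHO %%e
```

Keeping usebackq and the quotes everywhere is harmless, so the quoted form is the safer default if the path may change.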

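Your sample does not show where the tabs are. "Master" is missing from the titles file. The trailing spaces were removed from the end of each line in the titles file, because FINDSTR /x only matches when the whole line equals the search string.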

I assumed that, since the date follows the name, a field that starts with a digit is enough to mark the end of the name.
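
The end-of-name test can be tried on its own with a short sketch like the one below. The :check label and the field and kind variables are made up for this illustration; the sample fields come from the first data row in the question.

```batch
@ECHO OFF
SETLOCAL
rem Sample fields taken from the question's first data row.
CALL :check Joan
CALL :check 10/11/2022
CALL :check xx.1111111
GOTO :EOF

:check
rem Take the first character of the field and compare it with 0..9.
SET "field=%~1"
SET "firstchar=%field:~0,1%"
SET "kind=part of the name"
FOR /L %%z IN (0,1,9) DO IF "%firstchar%"=="%%z" SET "kind=starts with a digit, so the name has ended"
ECHO %field%: %kind%
GOTO :EOF
```

A name part that itself begins with a digit would end the name early, so this only holds while the export keeps the date directly after the name.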

Surnames missing despite column name "full name"

The approach: read each line, take the first token, skip the second if it is a title, then join the following tokens until one starts with a digit. The comma, spaces and tabs all act as separators for the subroutine parameters.
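
To see how a line is split into subroutine parameters, here is a small sketch. The :show label is hypothetical, and the sample line is copied from the question, where the wide gaps may be tabs or spaces.

```batch
@ECHO OFF
SETLOCAL
rem Pass the line unquoted so that CALL splits it into parameters.
CALL :show AAAAA, Ms Joan  10/11/2022  xx.1111111
GOTO :EOF

:show
rem Commas, spaces and tabs are all delimiters, and runs of them
rem collapse, so the comma does not become part of the first parameter.
ECHO 1=[%1] 2=[%2] 3=[%3] 4=[%4] 5=[%5]
GOTO :EOF
```

This should print 1=[AAAAA] 2=[Ms] 3=[Joan] 4=[10/11/2022] 5=[xx.1111111], which is why the main script can treat the first parameter as the surname and test the second against the titles list.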