Regular Expressions to extract certain information


I have a very large text file, roughly 200k lines long. It was built by merging thousands of pages of PDF text (extracted via OCR, obviously). The content is 'meeting minutes' from a medical board. Within it is a recurring pattern of critical information that looks like this:

##-##  (this is a numbered designation of the 'case')

ACTION: [.....]  (this is a sentence that describes what procedure or action is being taken with this 'case')

DECISION [.....] (this is a sentence that describes the outcome or decision of a medical board about this specific case and action)

Here is a live example (with some data scrambled for obvious medical information reasons)


06-02    Cancer and bubblegum trials                                                                                                                                                                                                    Primary Investigator:                                                                                                                                                               
                                                                                                                                                                                                    "Dr. Strangelove, Ph.D."                                                                                                                                                                
"ACTION:  At the January 4, 2015 meeting, request for review and approval of the Application for Initial Review"                                                                                                                                                                                                                                                                                                                                                                    
and attachments for the above-referenced study.                                                                                                                                                                                                                                                                                                                                                                 

"DECISION:  After discussing the risks and safety of the human subjects that will take part in this study, the Board"                                                                                                                                                                                                                                                                                                                                                                   
approved the submitted documents and initiation of the study.  Waiver of Consent granted.                                                                                                                                                                                                                                                                                                                                                                   
"Approval Period:  January 4, 2015 – January 3, 2016"                                                                                                                                                                                                                                                                                                                                                                   
"Total = 6.  Vote:  For = 6, Against = 0, Abstain = 0"

My need is to extract very simple key information that would end up looking like:

##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board

So the key criteria are the ##-## field and whatever sentence follows the keywords ACTION and DECISION.

So far, using regular expressions in TextWrangler, I am able to match

(\d\d-\d\d) or (ACTION) or (DECISION). What I am having a hard time with is figuring out how to select all the other text and delete it, or simply copy these matches and put them into another file.

I plan to use regular expressions and anything else needed in a Bash script that is run inside TextWrangler. Any help is greatly appreciated, as I am a noob with regular expressions and a novice at Bash scripting.

There is 1 best solution below

Assuming there is a minor mistake in your input file (DECISION: ... instead of DECISION ...), you could easily achieve this using awk. All we have to do is check whether a line starts with DECISION, ACTION or ##-##. A regular expression for this is /^(##-##)|^(ACTION)|^(DECISION)/. The resulting one-liner is as follows:

$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' /path/to/file

Example usage:

$ head -n7 file
##-##

ACTION: Initial Application for Review

DECISION: Initial Application Approved by Board

Here is a live example (with some data scrambled for obvious medical 
information reasons)
$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' file
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
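
Note that the pattern above matches the literal text ##-##. For real case numbers such as 06-02, and for lines that the OCR step wrapped in double quotes (as in the question's sample), a variation along these lines might work. This is only a sketch: the file names minutes.txt and extracted.txt are hypothetical, and the sample input is a shortened imitation of the question's data.

```shell
# Hypothetical sample input imitating the OCR output shown in the question.
cat > minutes.txt <<'EOF'
06-02    Cancer and bubblegum trials
"ACTION:  At the January 4, 2015 meeting, request for review and approval"
and attachments for the above-referenced study.

"DECISION:  After discussing the risks, the Board"
approved the submitted documents.
"Total = 6.  Vote:  For = 6, Against = 0, Abstain = 0"
EOF

# Keep lines that start with a two-digit/two-digit case number, or with
# ACTION: / DECISION: optionally preceded by a double quote, and write
# them to a separate file.
awk '/^[0-9][0-9]-[0-9][0-9]/ || /^"?(ACTION|DECISION):/ { print }' minutes.txt > extracted.txt

cat extracted.txt
```

This keeps only the first physical line of each ACTION or DECISION entry; sentences that wrap onto the following line are cut off, so the continuation lines would need extra handling.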

If the action and decision text is enclosed in square brackets, you'll need another regex to extract the information; in that case, leave a comment.
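
For the square-bracket case, one possible approach is awk's match() function, which sets RSTART and RLENGTH so that the bracketed text can be cut out with substr(). The following is a rough sketch under the assumption that each ACTION or DECISION line holds exactly one [...] group; the file name bracketed.txt and its contents are made up for illustration.

```shell
# Hypothetical input where the text of interest sits inside [ ... ].
printf '%s\n' \
  '12-34' \
  'ACTION: [Initial Application for Review]' \
  'DECISION: [Approved by Board]' \
  'unrelated line' > bracketed.txt

result=$(awk '
  # Case number lines are printed unchanged.
  /^[0-9][0-9]-[0-9][0-9]/ { print; next }
  /^(ACTION|DECISION):/ {
      label = $1    # first field: "ACTION:" or "DECISION:"
      # match() locates the first [...] span; substr() drops the brackets.
      if (match($0, /\[[^]]*]/))
          print label, substr($0, RSTART + 1, RLENGTH - 2)
  }
' bracketed.txt)

printf '%s\n' "$result"
```

Lines that match neither pattern, and ACTION or DECISION lines without brackets, are silently dropped.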