Extract multiple strings from a multiple lines text file - an extra rule and larger test file


How can I extract multiple strings from a multi-line text file according to these rules? The search strings are "String server", " pac " and "String method". Each may be absent from an enclosing "{}" block, and appears at most once when present. If the value of "String server" is not enclosed in double quotes, ignore it. Once a search string is matched, extract its value, dropping the surrounding double quotes and any "()". The value of " pac " takes precedence over the value of "String server". Only one of these two values is output per block, with no duplication, and it is printed before the value of "String method". Sample input text file:

{{{{{
public AResponse retrieveA(ARequest req){
    String server = "AAA";
    String method =  "retrieveA()";
    log.info(method,
            server,
            req);
    return res;
}

public BResponse retrieveB(BRequest req){
    String method =  "retrieveB()";
    BBB pac = new BBB();
    log.info(method,
            pac,
            req);
    return res;
}

public CResponse retrieveC(CRequest req) {
    String server = "CCC";
    log.info(server,
            req);
    return res;
}

public DResponse retrieveD(DRequest req) {
    String method = "retrieveD()";
    log.info(method,req);
    return res;
}

public EResponse retrieveE(ERequest req){
    EEE pac = new EEE();
    String method =  "retrieveE()";
    String server = "EEE";
    log.info(method,
            server,
            pac,
            req);
    return res;
}

public FResponse callretrieveF(FRequest req) throws InvalidDataException {
        String server = "FFFFF";
        //retrieveF
        String method =  "retrieveF()";
        try {
            log.info(method,
                     server,
                     req);

            FFFFF pac = new FFFFF();
        }
}

/**
 * callgetG
* getG
*/
public GResponse callgetG(GRequest req) throws InvalidDataException {
        //getG
        String method =  "getG()";
        String server = "GGGGGG";
        try {
            try {
                GGGGGG pac = new GGGGGG();
                log.info(method,
                     server,
                     req);
            }
        }
}

/**
 * getH
*/
    public HResponse getH(HRequest req) 
                                throws InvalidDataException {

        //getH
        String method =  "getH()";
        String server = "HHHHHHH";
        String calledMethod =  "getH2()";

        ARequest aReq = new ARequest(req.getH(),
                                     req.getR());
        ProgramAccountInformationResponse resp = null;
        try {
            log.info(LogMessages.msgInfoMethodStartPrivate(method,
                                                           server,
                                                           calledMethod,
                                                           req));
            return resp;
        }catch(InvalidDataException ide){
            log.error(method);
            throw ide;
        }
    }

    public IResponse determineI(IRequest req){
        String method = "determineI()";
    }
    
    private IResponse callI(IRequest req){
        String method = "callI()";
        IIIII pac = new IIIII();
        String server = pac.getName();
    }
}}}}}

Expected output:

AAA retrieveA
BBB retrieveB
CCC 
    retrieveD
EEE retrieveE
FFFFF   retrieveF
GGGGGG  getG
HHHHHHH getH
    determineI
IIIII callI

The solution from "Extract multiple strings from a multiple lines text file - larger test file" does not produce the last two output lines correctly: determineI and IIIII callI.


There are 2 best solutions below

anubhava On BEST ANSWER

You may use this refactored awk script for this data. Note that the start-of-block marker has to match either the private or the public keyword. The !col1[n] and !col2[n] checks are also required, so that a value already stored for the current block in either array is not overwritten by a later match.

cat parse.awk

BEGIN { OFS="\t"; FS="=" }
/^[[:blank:]]*(public|private) / {++n}
NF==2 && /^[[:blank:]]*String | pac *=/ {
   gsub(/^[[:blank:]]*("|new +)|[()";]+$/, "", $2)
   if (!col1[n] && $1 ~ / (server|pac) *$/)
      col1[n] = $2
   else if (!col2[n] && $1 ~ / method *$/)
      col2[n] = $2
}
END {
   for (i=1; i<=n; ++i)
      print col1[i], col2[i]
}

Then call this script as:

awk -f parse.awk file

AAA     retrieveA
BBB     retrieveB
CCC
        retrieveD
EEE     retrieveE
FFFFF   retrieveF
GGGGGG  getG
HHHHHHH getH
        determineI
IIIII   callI
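
For readers who want to follow the same idea outside awk, here is a minimal Python sketch. It is not part of the original answer: the function name extract and the three regular expressions are assumptions made for illustration. It keeps the two ingredients described above (a new block starts at a public or private line, and the first value found for a slot wins), and it tests for the double quotes around the server value explicitly instead of depending on the order of the lines within a block.

```python
import re

# A new block starts at a line that begins with "public" or "private".
BLOCK_START = re.compile(r'^\s*(?:public|private)\s')
# String server = "AAA";  (only double-quoted values count)
SERVER = re.compile(r'^\s*String\s+server\s*=\s*"([^"]*)"')
# String method = "retrieveA()";  (the trailing "()" is dropped)
METHOD = re.compile(r'^\s*String\s+method\s*=\s*"([^"()]*)(?:\(\))?"')
# BBB pac = new BBB();
PAC = re.compile(r'^\s*\w+\s+pac\s*=\s*new\s+(\w+)')

PATTERNS = (("server", SERVER), ("pac", PAC), ("method", METHOD))


def extract(lines):
    """Return one (first_column, method) pair per public/private block."""
    rows = []
    for line in lines:
        if BLOCK_START.match(line):
            rows.append({"server": None, "pac": None, "method": None})
            continue
        if not rows:
            continue  # text before the first block is ignored
        row = rows[-1]
        for key, pattern in PATTERNS:
            m = pattern.match(line)
            # First match wins, like the !col1[n] / !col2[n] guards.
            if m and row[key] is None:
                row[key] = m.group(1)
    # pac takes precedence over server, as the question requires.
    return [(r["pac"] or r["server"] or "", r["method"] or "") for r in rows]
```

Calling extract(open("file")) yields the pairs in block order; joining each pair with a tab gives output in the same shape as the awk script above.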
Kaz On

Solution in TXR, almost precisely matching the required output format. At the end of the answer, an attempt is shown to precisely match the format:

$ txr extract4.txr even-longer-data
AAA retrieveA
BBB retrieveB
CCC
    retrieveD
EEE     retrieveE
FFFFF   retrieveF
GGGGGG  getG
HHHHHHH getH
    determineI
IIIII callI

What is not exactly clear is why EEE retrieveE is to be printed without the column padding. Is the rule that the first item of the group doesn't participate in the column calculation? Or does the first item contribute to the column calculation, but is simply not printed with the padding? Or is it just a mistake in the specification?

Code:

@(collect)
@/.*(public|private).*/
@  (gather :vars ((server nil) (meth nil) (pac nil)))
 String server = "@server";
 String method = "@meth()";
 @pac pac = new @pac();
@  (until)
@/.*(public|private).*/
@  (end)
@  (bind tuple (server meth pac))
@(end)
@(do (let ((parts (partition-by (ado not (and @2 (or @1 @3))) tuple)))
       (each ((part parts))
         (let ((maxlen (find-max-key part : [chain [orf first third] len])))
           (each ((tuple part))
             (match (@server @meth @pac) tuple
               (put-line
                 (cond
                   ((and server meth) `@{server maxlen} @meth`)
                   ((and meth pac) `@{pac maxlen} @meth`)
                   (server server)
                   (meth `    @meth`)))))))))

I don't like the inconsistency of access, namely that: in the partition-by call, we use op syntax where we reference the tuples with lambda parameters @1, @2 and @3; in the calculation of maxlen, we refer to the tuple using the accessors first and third; and our output block uses pattern matching to give them the names server, meth and pac (as in previous iterations of the code).

This is about the level of complexity at which we should think about using structs.

First cut at introducing a tuple structure with slots server, meth and pac:

@(do (defstruct tuple () server meth pac))
@(collect)
@/.*(public|private).*/
@  (gather :vars ((server nil) (meth nil) (pac nil)))
 String server = "@server";
 String method = "@meth()";
 @pac pac = new @pac();
@  (until)
@/.*(public|private).*/
@  (end)
@  (bind tuple @(new tuple server server meth meth pac pac))
@(end)
@(do (let ((parts (partition-by [notf [andf .meth [orf .server .pac]]]
                                tuple)))
       (each ((part parts))
         (let ((maxlen (find-max-key part : [chain [orf .server .pac] len])))
           (each ((tup part))
             (match @(struct tuple server @server meth @meth pac @pac) tup
               (put-line
                 (cond
                   ((and server meth) `@{server maxlen} @meth`)
                   ((and meth pac) `@{pac maxlen} @meth`)
                   (server server)
                   (meth `    @meth`)))))))))

Now we can see that maxlen is calculated over a group by maximizing the length of either the server or the pac string, with priority given to server, since that datum will be the first column when there are two.

In previous answers, we put the printing inside the TXR @(repeat), printing the data as it is being extracted. Now we have a @(collect) which stores it all up, so we can make another pass over it, for the sake of grouping it into the subgroups that have the first column adjusted.

The printing logic is the same as before, except that when only a meth occurs, we print it indented by four spaces.
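
As a rough illustration of this second pass for readers who do not know TXR, here is a Python sketch of the grouping and padding. It is an approximation under stated assumptions, not a translation verified against TXR: itertools.groupby stands in for partition-by, rows are plain (server, meth, pac) tuples with None for a missing value, and the names is_complete and render are made up for this example.

```python
from itertools import groupby


def is_complete(row):
    """True for rows that have a method plus a server or a pac (two columns)."""
    server, meth, pac = row
    return bool(meth and (server or pac))


def render(rows):
    """Pad the first column to the widest entry within each consecutive group."""
    out = []
    # groupby splits the sequence into runs of consecutive rows that share
    # the same key, which is the role partition-by plays in the TXR code.
    for complete, group in groupby(rows, key=is_complete):
        group = list(group)
        maxlen = max(len(server or pac or "") for server, _, pac in group)
        for server, meth, pac in group:
            if complete:
                first = server or pac  # server has priority, as in the cond
                out.append(first.ljust(maxlen) + " " + meth)
            elif server:
                out.append(server)
            elif meth:
                out.append("    " + meth)  # lone method, indented four spaces
    return out
```

With the ten rows extracted from the sample data, the EEE row falls into the same group as HHHHHHH and is therefore padded to seven characters, which is the behaviour shown in the first run above.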

Precisely matching output

We introduce a hack whereby any multi-column item whose first column is three characters or fewer is printed without the group padding, so that it occupies four columns including the separating space. So the alignment applies only to the rows with longer items:

@(do (defstruct tuple () server meth pac))
@(collect)
@/.*(public|private).*/
@  (gather :vars ((server nil) (meth nil) (pac nil)))
 String server = "@server";
 String method = "@meth()";
 @pac pac = new @pac();
@  (until)
@/.*(public|private).*/
@  (end)
@  (bind tuple @(new tuple server server meth meth pac pac))
@(end)
@(do (let ((parts (partition-by [notf [andf .meth [orf .server .pac]]]
                                tuple)))
       (each ((part parts))
         (let ((maxlen (find-max-key part : [chain [orf .server .pac] len])))
           (each ((tup part))
             (match @(struct tuple server @server meth @meth pac @pac) tup
               (let ((width (if (< (len (or server pac)) 4) 0 maxlen)))
                 (put-line
                   (cond
                     ((and server meth) `@{server width} @meth`)
                     ((and meth pac) `@{pac width} @meth`)
                     (server server)
                     (meth `    @meth`))))))))))

Run:

$ txr extract6.txr even-longer-data
AAA retrieveA
BBB retrieveB
CCC
    retrieveD
EEE retrieveE
FFFFF   retrieveF
GGGGGG  getG
HHHHHHH getH
    determineI
IIIII callI

Now the EEE line is printed with a narrow column.
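
The width rule on its own is small enough to show in isolation. The following Python sketch is only an illustration of that rule, not the TXR code: format_row and its parameters are hypothetical names, and maxlen is assumed to have been computed for the row's group beforehand.

```python
def format_row(first, meth, maxlen):
    """Format a two-column row; short first columns opt out of the padding."""
    # Three characters or fewer: no padding, so only the single separating
    # space follows. Longer items are padded to the group's maximum width.
    width = 0 if len(first) < 4 else maxlen
    return first.ljust(width) + " " + meth


# The group EEE, FFFFF, GGGGGG, HHHHHHH has a maximum width of 7.
for first, meth in [("EEE", "retrieveE"), ("FFFFF", "retrieveF"),
                    ("GGGGGG", "getG"), ("HHHHHHH", "getH")]:
    print(format_row(first, meth, 7))
```

Only the EEE row is affected in this group; the other three rows keep the seven-character column.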

Note that the output format almost looks as if the requirement might be just for aligning to four space tab stops; but the IIIII callI line doesn't follow that; the callI is not on a four space tab stop.