Extract multiple strings from a multi-line text file


How can I extract multiple strings from a multi-line text file according to the following rules? The search strings are "String server", " pac " and "String method". Each of them may appear at most once within an enclosing "{}" block, or not at all. When a search string matches, extract its value, without the surrounding "" and without any "()". The value of "String server" or " pac " appears only once per block (no duplication), and it is printed before the value of "String method". For example, given the sample text file in:

public AResponse retrieveA(ARequest req){
    String server = "AAA";
    String method =  "retrieveA()";
    log.info(method,
            server,
            req);
    return res;
}

public BResponse retrieveB(BRequest req){
    String method =  "retrieveB()";
    BBB pac = new BBB();
    log.info(method,
            pac,
            req);
    return res;
}

public CResponse retrieveC(CRequest req) {
    String server = "CCC";
    log.info(server,
            req);
    return res;
}

public DResponse retrieveD(DRequest req) {
    String method = "retrieveD()";
    log.info(method,req);
    return res;
}

public EResponse retrieveE(ERequest req){
    EEE pac = new EEE();
    String method =  "retrieveE()";
    String server = "EEE";
    log.info(method,
            server,
            pac,
            req);
    return res;
}

Expected output:

AAA retrieveA
BBB retrieveB
CCC 
retrieveD
EEE retrieveE

I tried GNU Awk 5.0.1:

awk '{
if ($0 ~ /String method/ || ($0 ~ /String server/) ) 
{
str=$0; 
sub("String", "", str);
sub(")", "", str);
sub("=", "", str);
gsub(/\(/, "", str);
gsub(/"/, "", str);
gsub(/;/, "", str);
if (str ~ /method/) 
{
    method = str; 
    gsub(/[[:blank:]]/, "", method); 
    gsub(/method/, "", method); 
    arr[i][1] = method
    count++
} else if (str ~ /server/) 
{
    server = str; 
    gsub(/[[:blank:]]/, "", server); 
    gsub(/server/, "", server); 
    arr[i][0] = server
    count++
}
}
if (count > 1 || $0 ~ /log./) {
    count = 0
    i++
}
}
END {
for (i in arr) {
    printf "%s %s\n", arr[i][0], arr[i][1];
}
}' in

There are 4 best solutions below

anubhava On BEST ANSWER

This awk solution should work for you:

awk -v OFS='\t' -F= '
/\{[[:blank:]]*$/ {++n}
NF==2 && /String | pac/ {
   gsub(/^[[:blank:]]*("|new +)|[()";]+$/, "", $2)
   if ($1 ~ / (server|pac)/)
      col1[n] = $2
   else if ($1 ~ / method/)
      col2[n] = $2
}
END {
   for (i=1; i<=n; ++i)
      print col1[i], col2[i]
}' file

AAA     retrieveA
BBB     retrieveB
CCC
        retrieveD
EEE     retrieveE
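
To see what the gsub call in this answer does, here is a small sketch that applies the same regular expression to three right-hand sides taken from the sample data. The function name strip_rhs is only an illustration and is not part of the answer. With -F= the text after the "=" lands in $2; the first alternative of the regex removes leading blanks plus either an opening quote or "new ", and the second alternative removes the trailing run of ( ) " ; characters.

```shell
#!/bin/sh
# Sketch only: strip_rhs is a hypothetical helper name.
# It splits each line on "=" and cleans the right-hand side
# with the same regex used in the answer above.
strip_rhs() {
  awk -F= 'NF == 2 {
    gsub(/^[[:blank:]]*("|new +)|[()";]+$/, "", $2)
    print $2
  }' "$@"
}

printf '%s\n' \
  '    String server = "AAA";' \
  '    BBB pac = new BBB();' \
  '    String method =  "retrieveB()";' | strip_rhs
# prints:
# AAA
# BBB
# retrieveB
```

Lines without exactly one "=" (such as the log.info lines) have NF different from 2 and are skipped, which is why the answer tests NF==2 first.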
Kaz On

Solution 1 using TXR:

The sample data is in the file data:

$ txr extract.txr data
AAA retrieveA
BBB retrieveB
CCC
retrieveD
EEE retrieveE

In extract.txr we have:

@(repeat)
@  (cases)
 String server = "@x";
 String method = "@y()";
@    (bind out `@x @y`)
@  (or)
 String method = "@y()";
 @x pac = new @x();
@    (bind out `@x @y`)
@  (or)
 @x pac = new @x();
 String method = "@y()";
@    (bind out `@x @y`)
@  (or)
 String server = "@out";
@  (or)
 String method = "@out()";
@  (end)
@  (skip)
}
@  (do (put-line out))
@(end)

In TXR, a single space in a pattern implicitly matches one or more spaces, so the patterns need only one space of indentation and do not have to reproduce the extra whitespace in method =  "retrieveB()". The data is used exactly as given, extra whitespace included.

sqrt-1 On

This one should also work with your file (I am not sure I have covered every possible variant):

$ cat test.awk 
$1 == "String" {
  if ($2 == "server") {
    first = $4" "
  } else if ($2 == "method") {
    second = $4
  }
}

$2 == "pac" {
    first   = $1" "
}

$1 == "return" {
  print first second
  first=""; second=""
}

$ awk -f test.awk file | tr -d '"();'
AAA retrieveA
BBB retrieveB
CCC 
retrieveD
EEE retrieveE

When the awk script finds "String" in the 1st field, it checks the 2nd field: if that is "server", it saves the 4th field in the variable first; if it is "method", it saves the 4th field in the variable second. When the 2nd field is "pac", it saves the 1st field in first as well, so only one of "pac" and "String server" is kept. Finally, when it finds "return" in the 1st field (hoping it is always at the end of each "public" block), it prints the two variables, whether set or empty, and resets them.
The final tr in the pipeline deletes any remaining ", (, ) and ; characters (here too, I don't know whether those characters can be part of values that should be kept).

I know my answer comes with quite a few caveats. I hope it helps anyway.
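
If the two caveats above are a concern, a possible variant is sketched below. It is not sqrt-1's script: it assumes every block ends with a "}" in the first column (true for the sample data) and prints there instead of at "return", and it strips the punctuation inside awk, only from the captured field, instead of piping the whole output through tr. The name extract_pairs is made up for this example.

```shell
#!/bin/sh
# Sketch only: extract_pairs is a hypothetical helper name.
# Assumes every block ends with a "}" in the first column.
extract_pairs() {
  awk '
    # Remove quotes, parentheses and semicolons from a captured field.
    function clean(s) { gsub(/[";()]/, "", s); return s }

    $1 == "String" && $2 == "server" { first  = clean($4) }
    $1 == "String" && $2 == "method" { second = clean($4) }
    $2 == "pac"                      { first  = $1 }

    # End of a block: print what was collected, then reset.
    /^}/ {
      sep = (first != "" && second != "") ? " " : ""
      print first sep second
      first = second = ""
    }
  ' "$@"
}

extract_pairs <<'EOF'
public BResponse retrieveB(BRequest req){
    String method =  "retrieveB()";
    BBB pac = new BBB();
    return res;
}
public DResponse retrieveD(DRequest req) {
    String method = "retrieveD()";
    return res;
}
EOF
# prints:
# BBB retrieveB
# retrieveD
```

One difference from the script above: a block with only a server value prints "CCC" without the trailing space, because the separator is added only when both values are present.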

Kaz On

Solution 2 using TXR:

The sample data is in the file data:

$ txr extract2.txr data
AAA retrieveA
BBB retrieveB
CCC
retrieveD
EEE retrieveE

In extract2.txr we have:

@(repeat)
public @nil{
@  (gather :vars ((server nil) (meth nil) (pac nil)))
 String server = "@server";
 String method = "@meth()";
 @pac pac = new @pac();
@  (until)
}
@  (end)
@  (do
     (put-line
       (cond
         ((and server meth) `@server @meth`)
         ((and meth pac) `@pac @meth`)
         (server)
         (meth))))
@(end)

In solution 2 we take advantage of the @(gather) directive, which collects matches that may be out of order. In each function block we gather up to three items. They may be missing, so we default them to nil via :vars; that allows gather to succeed even if some of the items are missing, with those default values. Note nil is false in TXR Lisp; the missing items look false.

After the gather block, we do a bit of Lisp to test which variables we have in a cond form, to calculate an output string that is output via put-line.

We take advantage of a somewhat rarely used feature of cond (also present in ANSI Common Lisp): if a cond clause contains only one expression and that expression is true (so that cond stops there), the value of that expression is the result of cond. For example, (cond (1)) yields 1: here 1 is both the true expression that terminates the conditional and the result value.
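
For readers more at home in awk, the same idea (collect the items of a block in any order, then choose the output by precedence) can be sketched roughly as follows. This is only an approximation of the TXR logic, not a translation of it: the empty string plays the role of nil, the name gather_block is made up, and a block that matches none of the four cases prints nothing here, which is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch only: gather_block is a hypothetical helper name.
# Items are collected per block in any order; the output is
# chosen at the closing "}" using the same precedence as the cond form.
gather_block() {
  awk '
    function clean(s) { gsub(/[";()]/, "", s); return s }

    # A new block starts: forget everything gathered so far.
    /^public / { server = meth = pac = "" }

    $1 == "String" && $2 == "server" { server = clean($4) }
    $1 == "String" && $2 == "method" { meth   = clean($4) }
    $2 == "pac" && $3 == "="         { pac    = $1 }

    /^}/ {
      if      (server != "" && meth != "") print server, meth
      else if (meth != "" && pac != "")    print pac, meth
      else if (server != "")               print server
      else if (meth != "")                 print meth
    }
  ' "$@"
}

gather_block <<'EOF'
public EResponse retrieveE(ERequest req){
    EEE pac = new EEE();
    String method =  "retrieveE()";
    String server = "EEE";
    return res;
}
public BResponse retrieveB(BRequest req){
    String method =  "retrieveB()";
    BBB pac = new BBB();
    return res;
}
EOF
# prints:
# EEE retrieveE
# BBB retrieveB
```

Because the decision is made only at the closing brace, the order of the three declarations inside a block does not matter, which is the property that @(gather) provides in the TXR version.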