Extract multiple values using regex


Can you please help me figure this regex out? I have an output that looks something like this:

Wed Aug 30 14:47:11.435 EDT 

  Interface : p16, Value Count : 9 
  References : 1, Internal : 0x1 
  Values : 148, 365, 366, 367, 371 
        120577, 120578, 120631, 120632 

I need to extract all the numbers from that output. There can be more or fewer values than are shown here. So far I have this (but it only extracts the last value):

\s+Values\s+:\s+((\d+)(?:,?)(?:\s+))+

Thank you

EDIT: added the full output.

There are 5 best solutions below

1
On BEST ANSWER

Assuming the string is in the variable s:

% regexp -inline -all {\d+} [regexp -inline {[^:]+$} $s]
148 365 366 367 371 120577 120578 120631 120632

That is: pick all the text between the last colon and the end of the string (strictly: the longest sequence of non-colon characters that is anchored at the end of the string). From this text, match all groups of digits. This is similar to Wiktor's solution, but uses a somewhat less intricate pattern for the first step. There is no problem if the first step finds no match, since that only means you get an empty list of numbers in the second step.
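The two steps can also be run separately to inspect the intermediate result. This is a minimal sketch, assuming `s` holds the sample output from the question; the variable names `tail` and `numbers` are illustrative only and not part of the original answer.

```tcl
# Sample output from the question (assumed to be stored in s).
set s {Wed Aug 30 14:47:11.435 EDT

  Interface : p16, Value Count : 9
  References : 1, Internal : 0x1
  Values : 148, 365, 366, 367, 371
        120577, 120578, 120631, 120632}

# Step 1: everything after the last colon, up to the end of the string.
# regexp -inline returns a list; lindex 0 picks the full match.
set tail [lindex [regexp -inline {[^:]+$} $s] 0]

# Step 2: every run of digits in that tail.
set numbers [regexp -inline -all {\d+} $tail]
puts $numbers
```

Splitting the steps makes it easy to print `tail` when the result is not what you expect.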

Documentation: regexp, Syntax of Tcl regular expressions

0
On

Assuming that you are searching for all numbers after the string "Values :", and that there is nothing else after those numbers, you can do it with the usual string commands. This returns a list containing the numbers (each element keeps the whitespace that surrounded it in the input):

set result [split [string map {\n ","} [string range $text [string first "Values :" $text ]+8 end] ] ","]

Reading it from the inside out, you search for the index of the "Values :" string. You then grab the string from that index plus 8, until the end of the string. Then you use string map to replace any newlines with a comma. Finally you use split to convert the string to a list, using the comma as a delimiter.
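Because split only cuts at the commas, each element still carries the spaces around it, and a missing "Values :" would make string first return -1. A hedged variant that trims each element and guards against that case might look like this; the variable names `start`, `tail` and `item` are illustrative assumptions, not part of the original answer.

```tcl
# Sample output from the question (assumed to be stored in text).
set text {Wed Aug 30 14:47:11.435 EDT

  Interface : p16, Value Count : 9
  References : 1, Internal : 0x1
  Values : 148, 365, 366, 367, 371
        120577, 120578, 120631, 120632}

set result {}
set start [string first "Values :" $text]
if {$start >= 0} {
    # Skip the 8 characters of "Values :" and take the rest of the string.
    set tail [string range $text [expr {$start + 8}] end]
    # Turn newlines into commas, then split on commas.
    foreach item [split [string map {\n ,} $tail] ,] {
        # Trim the surrounding whitespace and skip empty pieces.
        set item [string trim $item]
        if {$item ne ""} {
            lappend result $item
        }
    }
}
puts $result
```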

0
On
[0-9]

This regex matches a single digit. Applied repeatedly, it matches every digit in the string one at a time; to get whole numbers, use [0-9]+ instead.

5
On

Why not just match \d+ (each set of one or more digits)?
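One caveat: applied to the full output from the edit, a bare \d+ also picks up digits from the timestamp and the other fields. A small sketch, assuming `s` holds that full output (the name `all` is illustrative):

```tcl
# Full sample output from the question (assumed to be stored in s).
set s {Wed Aug 30 14:47:11.435 EDT

  Interface : p16, Value Count : 9
  References : 1, Internal : 0x1
  Values : 148, 365, 366, 367, 371
        120577, 120578, 120631, 120632}

# Every run of digits anywhere in the string, not only after "Values :".
set all [regexp -all -inline {\d+} $s]
puts $all
puts [llength $all]
```

With this sample the list begins with 30 14 47 11 435 from the timestamp, so this approach only fits if the input has already been narrowed to the Values section.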

2
On

As @dawg mentions, you need a 2 step approach in Tcl, since its regex does not allow storing multiple captures in one and the same group, and it does not support \G operator.
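The capture limitation can be seen on a shortened input. This is only an illustrative sketch: the string `line` and the simplified pattern are assumptions made for the demonstration, not the pattern from the question.

```tcl
# Shortened, hypothetical input for the demonstration.
set line "148, 365, 366"

# One capture group inside a repetition: only the last iteration is kept.
regexp {(?:(\d+)(?:,\s*|$))+} $line whole last
puts $last

# With -all -inline, every match is returned as a list element instead.
set every [regexp -all -inline {\d+} $line]
puts $every
```

The first puts is expected to print 366 and the second 148 365 366, which is why a second pass with -all -inline is needed.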

Here is a final solution:

set text {Wed Aug 30 14:47:11.435
EDT Interface : p16,
Value Count : 9 References : 1, Internal : 0x1
Values : 148, 365, 366, 367, 371
         120577, 120578, 120631, 120632}

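# Step 1: grab the "Values : n, n, ..." substring.
set pattern {\sValues\s*:\s*\d+(?:[\s,]*\d+)*}
# regexp returns 1 on a match, so test its result directly; checking
# [info exists match] would also pass if match was set earlier.
if {[regexp $pattern $text match]} {
    # Step 2: pull every run of digits out of that substring.
    set results [regexp -all -inline {\d+} $match]
    puts $results
} else {
    puts "No match"
}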

See the Tcl demo printing 148 365 366 367 371 120577 120578 120631 120632.

Details

The first matching operation extracts the substring that starts with Values and continues with comma- or whitespace-separated numbers:

  • \s - a whitespace
  • Values - a Values word
  • \s*:\s* - a colon enclosed with 0+ whitespaces
  • \d+ - 1 or more digits
  • (?:[\s,]*\d+)* - 0+ sequences of 0+ whitespaces or commas followed by 1+ digits.

The second step is extracting all chunks of 1+ digits with regexp -all -inline {\d+} $match.