Recursive regex with garbled text surrounding? Getting "ArrayArray"


I asked a similar question, but it was closed for being too broad. Basically, I have a bunch of questions like this. I'm hoping just asking one will be easier. I've tried some different ways to solve this, but none of them actually work.

I have a text file with a lot of data. The only data that I'm interested in falls between two brackets, "(" ")". I'm wondering how to get each instance of info that lies between brackets into an array.

The code I'm using right now returns ArrayArray:

function get_between($startString, $endString, $myFile){
  preg_match_all('/\$startString([^$endString]+)\}/', $myFile, $matches);
  return $matches;
}
$myFile = file_get_contents('explode.txt');
$list = get_between("&nbsp(", ")", $myFile);
foreach($list as $list){
  echo $list;
}

There are 2 best solutions below

Best answer
<?php
function get_between($startString, $endString, $myFile){
  //Escape start and end strings.
  $startStringSafe = preg_quote($startString, '/');
  $endStringSafe = preg_quote($endString, '/');
  //non-greedy match any character between start and end strings. 
  //s modifier should make it also match newlines.
  preg_match_all("/$startStringSafe(.*?)$endStringSafe/s", $myFile, $matches);
  return $matches;
}
$myFile = 'fkdhkvdf(mat(((ch1)vdsf b(match2) dhdughfdgs (match3)';
$list = get_between("(", ")", $myFile);
foreach($list[1] as $match){
  echo $match."\n";
}

I did this and it seems to work. (Obviously, you'll need to replace my $myFile assignment line with your file_get_contents statement.) A few things:

A: Variable interpolation doesn't occur inside single quotes, so your preg_match_all regular expression won't work: it contains the literal text $startString instead of (. (I also removed the check for } at the end of the matched string. If you need it, add it back as \\} just before the ending delimiter.)

B: $list will be an array of arrays, which is why echoing its elements prints ArrayArray. With the default flag (PREG_PATTERN_ORDER), index 0 holds all the full matches and index 1 holds the matches for the first subpattern.

C: This only works as long as $endString never appears inside the text you are trying to capture. If you expect (matc(fF)) to give you matc(fF), it won't; it'll give you matc(fF. You'll need a more powerful parser if you want the former result.
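
For point C, here is a minimal sketch of what a "more powerful parser" could look like: a depth counter that collects the text of each outermost bracket group. The function name get_between_nested is made up for illustration, and it assumes single-character delimiters, so it is not a drop-in replacement for get_between.

```php
<?php
// Hypothetical helper (not from the original answer): returns the content of
// each outermost $open...$close group, keeping nested brackets intact.
// Assumes $open and $close are single characters.
function get_between_nested($text, $open = '(', $close = ')') {
    $results = [];
    $depth = 0;      // how many brackets are currently open
    $current = '';   // content of the group being collected
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];
        if ($ch === $open) {
            if ($depth > 0) {
                $current .= $ch;       // nested opener is part of the content
            }
            $depth++;
        } elseif ($ch === $close && $depth > 0) {
            $depth--;
            if ($depth === 0) {
                $results[] = $current; // outermost group just closed
                $current = '';
            } else {
                $current .= $ch;       // nested closer is part of the content
            }
        } elseif ($depth > 0) {
            $current .= $ch;           // ordinary character inside a group
        }
    }
    return $results;                   // unclosed groups are silently dropped
}

print_r(get_between_nested('x (matc(fF)) y (match2)'));
```

With that input the result should contain matc(fF) and match2. Stray closing brackets outside any group are ignored; whether that is the right behaviour depends on your data.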

Edit: The get_between function here should work with &nbsp;( and )} as well, or whatever else you'd want.
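
To make the edit above concrete, here is a small sketch that calls the same function with &nbsp;( and )} and shows which index of the result holds what. The sample string is invented for the example; it is not the asker's data.

```php
<?php
// Same function as in the answer, repeated so the sketch runs on its own.
function get_between($startString, $endString, $myFile){
  $startStringSafe = preg_quote($startString, '/');
  $endStringSafe = preg_quote($endString, '/');
  preg_match_all("/$startStringSafe(.*?)$endStringSafe/s", $myFile, $matches);
  return $matches;
}

// Made-up sample input.
$sample = 'junk&nbsp;(first)} more junk &nbsp;(second)} tail';
$matches = get_between('&nbsp;(', ')}', $sample);

// $matches[0]: the full matches, delimiters included.
// $matches[1]: the text captured by the first (and only) group.
foreach ($matches[1] as $inner) {
  echo $inner . "\n";
}
```

This should print first and second on separate lines. Looping over $matches itself instead of $matches[1] is what produces the ArrayArray output.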

Second answer

Your regex is completely misguided.

First: [^...] is a complemented (negated) character class. It is a single atom, and whatever appears in place of ... is the set of characters that are not allowed at that position. I.e., [^ab] matches any one character except a and b, so [^$endString] excludes individual characters, not the end string as a whole.

Second: you seem to want to be able to capture between parens. But a paren (open or closing) is a special character in a regex. So, in your example, if $startString is &nbsp(, the paren will be interpreted as a regex metacharacter.

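Third: unfortunately, nested $startString and $endString pairs cannot be matched by a regular expression in the classical sense. (Engines with recursion extensions can do it, Perl among them and the PCRE library behind PHP's preg functions too, but perl is perl.)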
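
As a side note on the third point: PHP's preg functions are built on PCRE, which has a recursion construct, (?R). For the plain ( and ) case only, a sketch might look like the following. The delimiters are hard-coded and the sample string is made up, so treat it as an illustration and not as a general get_between replacement.

```php
<?php
// Made-up sample input.
$text = 'a (matc(fF)) b (match2) c';

// \(           literal opening paren
// (?: ... )*   repeated: either a run of non-paren characters,
//              or (?R), which re-enters the whole pattern for a nested group
// \)           literal closing paren
preg_match_all('/\((?:[^()]++|(?R))*\)/', $text, $matches);

// Each full match still carries its outer parens; strip one from each end.
$inner = array_map(function ($m) {
    return substr($m, 1, -1);
}, $matches[0]);

print_r($inner);
```

Here $inner should end up holding matc(fF) and match2. Multi-character delimiters such as &nbsp;( would need a different pattern.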

The closest you can get to what you really want is rewriting your regex to use with preg_match_all as follows:

$start = preg_quote($startString, '/');
$end = preg_quote($endString, '/');
$re = '/' . $start              # literal $startString
    . '('                       # capture...
    . '(?:(?!' . $end . ').)'   # any character, as long as $end is not found at this position,
    . '+)'                      # one or more times
    . $end . '/';               # literal $endString

and then use $re as your first argument to preg_match_all.

preg_quote escapes every regex metacharacter in its argument (plus the / delimiter passed as its second argument), hence the paren in &nbsp( will be treated literally, and not as the group opening metacharacter. The \Q and \E escape sequences would do the same job on the raw string, but do not combine the two: inside \Q...\E the backslashes added by preg_quote are themselves literal, so the pattern would look for \( instead of (.
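
For completeness, a self-contained sketch of the lookahead approach, relying on preg_quote alone for escaping. The wrapper name get_between_lookahead, the s modifier and the sample string are additions for illustration, not part of the answer.

```php
<?php
// Hypothetical wrapper around the lookahead pattern; the name is illustrative.
function get_between_lookahead($startString, $endString, $text) {
    $start = preg_quote($startString, '/');
    $end = preg_quote($endString, '/');
    // (?:(?!END).)+ consumes one character at a time, but only while END
    // does not begin at the current position. The s modifier (an addition
    // here) lets . match newlines as well.
    $re = '/' . $start . '((?:(?!' . $end . ').)+)' . $end . '/s';
    preg_match_all($re, $text, $matches);
    return $matches[1];   // captured text only, without the delimiters
}

// Made-up sample input with a two-character end delimiter.
print_r(get_between_lookahead('&nbsp(', ')}', 'x&nbsp(a)b)} y&nbsp(c)} z'));
```

The first capture should be a)b: a lone ) does not end the match because the lookahead tests for the whole end string )}, which a [^...] character class cannot do.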