Locating tags in a string in PHP (with respect to the string with tags removed)


I want to create a function that labels the locations of certain HTML tags (e.g., italics tags) in a string relative to the character positions in a tagless version of that string. (I intend to use this label data to train a neural network to recover tags from data that has had them stripped out.) The function I want to create is label_italics() in the code below.

$string = 'Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>.';
$string_all_tags_stripped_but_italics = strip_tags($string, '<i>'); // same as $string in this example
$string_all_tags_stripped = strip_tags($string); // 'Disney movies: Aladdin, Beauty and the Beast.'
$featr_string = $string_all_tags_stripped.' '; // Add a single space at the end
$label_string = label_italics($string_all_tags_stripped_but_italics);
echo $featr_string; // 'Disney movies: Aladdin, Beauty and the Beast. '
echo $label_string; // '0000000000000001000000101000000000000000000010'

If a character is supposed to have an <i> or </i> tag immediately preceding it, it is labeled with a 1 in $label_string; otherwise, it is labeled with a 0 in $label_string. (I'm thinking I don't need to worry about the difference between <i> and </i> because the recoverer will simply alternate between <i> and </i> so as to maintain well-formed markup, but I'm open to reasons as to why I'm wrong about this.)
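
The alternation idea can be checked by inverting the labels. The following is a minimal sketch, assuming well-formed markup and UTF-8 text; restore_italics() is a hypothetical helper name used only for illustration, and the label string is built with str_repeat() so that its length is easy to verify.

```php
<?php
// Hypothetical helper (an assumption, not part of the original code):
// rebuilds <i>/</i> markup from the tagless text and its 0/1 label string
// by alternating between opening and closing tags.
function restore_italics(string $text, string $labels): string
{
    // Split into code points so multibyte characters stay aligned with labels.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    $open = false; // true while an <i> is still waiting for its </i>
    foreach ($chars as $k => $char) {
        if (($labels[$k] ?? '0') === '1') {
            $out .= $open ? '</i>' : '<i>';
            $open = !$open;
        }
        $out .= $char;
    }
    return $out;
}

$text   = 'Disney movies: Aladdin, Beauty and the Beast. ';
// Same labels as in the example above: 15 + 1 + 6 + 3 + 19 + 2 = 46 characters.
$labels = str_repeat('0', 15) . '1' . str_repeat('0', 6) . '101'
        . str_repeat('0', 19) . '10';

echo restore_italics($text, $labels), "\n";
// Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>. (plus the trailing space)
```

One caveat this makes visible: a closing tag immediately followed by an opening tag (</i><i>) falls on a single character, so with only 0 and 1 it cannot be told apart from a single tag.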

I'm just not sure what the best way to create label_italics() is.

I wrote this function that seems to work in most cases, but it also seems a little clunky and I'm posting here in hopes that there is a better way. (If this turns out to be the best way, the below function would be easily generalizable to any HTML tag passed in as a second argument to the function, which could be renamed label_tag().)

function label_italics($stripped) {
  $output = ''; // initialize so that .= does not act on an undefined variable
  while ((stripos($stripped, '<i>') || stripos($stripped, '</i>')) !== FALSE) {
    $position = stripos($stripped, '<i>');
    if (is_numeric($position)) {
      for ($c = 0; $c < $position; $c++) {
        $output .= '0';
      }
      $output .= '1';
    }
    $stripped = substr($stripped, $position + 4, NULL);
    $position = stripos($stripped, '</i>');
    if (is_numeric($position)) {
      for ($c = 0; $c < $position; $c++) {
        $output .= '0';
      }
      $output .= '1';
    }
    $stripped = substr($stripped, $position + 5, NULL);
  }
  for ($c = 0; $c <= strlen($stripped); $c++) {
    $output .= '0';
  }
  return $output;
}

The function produces bad output if the input contains surplus tags or malformed markup. For example, for the following input:

$string = 'Disney movies: <i><i>Aladdin</i>, <i>Beauty and the Beast</i>.';

The function gives the following misaligned output:

Disney movies: Aladdin, Beauty and the Beast.
0000000000000001000000000101000000000000000000010

(I'm also open to reasons why I'm going about the creation of the label data all wrong.)


There are 2 best solutions below

1
On

After some additional experimentation, this is what I arrived at:

$label_string = mb_ereg_replace('#0', '1', mb_ereg_replace('(#)\1+0', '1', mb_ereg_replace('\/', '0', mb_ereg_replace('i', '0', mb_ereg_replace('<\/i>', '#', mb_ereg_replace('<i>', '#', mb_ereg_replace('[^<\/i\>]', '0', mb_strtolower($string_all_tags_stripped_but_italics . ' '))))))));

I couldn't get @KIKO Software's preg_replace()-based solution to work with multibyte strings, so I switched to this slightly ungainly, but working, mb_ereg_replace()-based solution instead.
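
One possible cause (an assumption, since the post does not say what went wrong) is that without the u modifier PCRE works byte by byte, so a two-byte UTF-8 character such as é becomes two 0s and the labels drift out of alignment. The following is a minimal sketch of a UTF-8-aware preg_replace() variant; label_italics_utf8() is a hypothetical name, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical UTF-8-aware variant of the preg_replace() approach.
// The /u modifier makes PCRE treat pattern and subject as UTF-8, so the
// negated character class consumes one code point per replacement.
function label_italics_utf8(string $string): string
{
    $result = preg_replace(
        ['/[^<\/i>]/u', '/<i>/', '/<\/i>/', '/i/', '/##0/', '/#0/'],
        ['0', '#', '#', '0', '2', '1'],
        $string . ' ' // trailing space so a tag at the very end has a character to land on
    );
    if ($result === null) {
        // With /u, preg_replace() returns null on malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $result;
}

// "\u{e9}" is é, written as an escape so the file encoding does not matter.
echo label_italics_utf8("Caf\u{e9} <i>Am\u{e9}lie</i>!"), "\n";
// 0000010000010
```

The result has 13 labels, one per character of the tagless text plus the trailing space. Unlike the mb_strtolower() call above, this sketch does not handle uppercase <I> tags.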

6
On

I think I've got something. How about this:

function label_italics($string) {
    return preg_replace(['/<i>/', '/<\/i>/', '/[^#]/', '/##0/', '/#0/'], 
                        ['#', '#', '0', '2', '1'], $string);
}

See: https://3v4l.org/cKG46

Note that you need to supply the string with the tags in it.

How does it work?

I use preg_replace() because it accepts regular expressions, which one of the replacements needs. Given two arrays, it executes each replacement in order, each one operating on the result of the previous one. First it replaces all occurrences of <i> and </i> with # and every other character with 0. Then it replaces ##0 with 2 and #0 with 1. The 2 is there to handle an empty pair such as <i></i>. If you don't need that case, you can remove it and simplify the function.
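
To make the order of the replacements visible, here is a small sketch that applies the same patterns one at a time and records the intermediate strings. trace_label_steps() is a hypothetical helper written only for this illustration; it should behave like the array form of preg_replace(), which also applies the patterns in sequence.

```php
<?php
// Hypothetical tracing helper: applies each pattern of the first
// label_italics() version in order and keeps every intermediate result.
function trace_label_steps(string $string): array
{
    $patterns     = ['/<i>/', '/<\/i>/', '/[^#]/', '/##0/', '/#0/'];
    $replacements = ['#', '#', '0', '2', '1'];

    $steps = [];
    foreach ($patterns as $k => $pattern) {
        $string = preg_replace($pattern, $replacements[$k], $string);
        $steps[$pattern] = $string; // result after this pattern
    }
    return $steps;
}

foreach (trace_label_steps('See <i>Up</i>, ok') as $pattern => $result) {
    echo str_pad($pattern, 10), $result, "\n";
}
// /<i>/     See #Up</i>, ok
// /<\/i>/   See #Up#, ok
// /[^#]/    0000#00#0000
// /##0/     0000#00#0000
// /#0/      0000101000
```

The final string has ten labels, matching the ten characters of the tagless text 'See Up, ok'.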

The choice of # is arbitrary. Use any character that doesn't occur in your string.


Here's an updated version. It copes with tags at the end of the line and it ignores any # characters in the line.

function label_italics($string) {
    return preg_replace(['/[^<\/i\>]/', '/<i>/', '/<\/i>/', '/i/', '/##0/', '/#0/'], 
                        ['0', '#', '#', '0', '2', '1'], $string . ' ');
}

See: https://3v4l.org/BTnLc
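
As a quick check of the two claims above, the sketch below runs the updated function on an input that contains a # and ends with a tag, and on an input with a doubled opening tag. The sample strings are made up for illustration and are ASCII only.

```php
<?php
// The updated function from above, repeated so the example is self-contained.
function label_italics($string) {
    return preg_replace(['/[^<\/i\>]/', '/<i>/', '/<\/i>/', '/i/', '/##0/', '/#0/'],
                        ['0', '#', '#', '0', '2', '1'], $string . ' ');
}

$samples = [
    'Track #1: <i>Intro</i>', // contains a # and ends with a closing tag
    '<i><i>Up</i>',           // surplus opening tag
];

foreach ($samples as $sample) {
    $text   = strip_tags($sample) . ' '; // tagless text plus the trailing space
    $labels = label_italics($sample);
    // Text and labels should have the same length if the alignment holds.
    echo $text, "\n", $labels, "\n", strlen($text) === strlen($labels) ? 'aligned' : 'misaligned', "\n";
}
// Labels printed: 0000000000100001 for the first sample, 201 for the second.
```

In the second sample the doubled tag is folded into a single 2, so the labels stay aligned with the text.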