How to handle "invalid character" errors with PHP xml_parse?


When parsing a file received from a trusted third party, I get the error Invalid character (error code: 9), caused by an invalid HTML entity (see the weird things below).

How can I handle this kind of problem so that the file is still parsed?
For example, by simply deleting the invalid character.

As I need to handle large files (50 MB), I read them in chunks with fread(). So it's not safe to check or clean each chunk ($data) before passing it to the XML parser, since an entity could be split across two chunks.

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, FALSE);
xml_set_element_handler($parser, "startElement", "endElement");
xml_set_character_data_handler($parser, "characterData");

$fp = fopen($file, "rb");
while ($data = fread($fp, 4096)) {
  if (!xml_parse($parser, $data, feof($fp))) {
    $errorCode = xml_get_error_code($parser);
    printf(
      "XML Parser: %s (error code: %d). File: %s, line %d, column %d.",
      xml_error_string($errorCode),
      $errorCode,
      $file,
      xml_get_current_line_number($parser),
      xml_get_current_column_number($parser),
    );
  }
}
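
For illustration, the chunk-boundary concern can be worked around by holding back a trailing, possibly incomplete reference and prepending it to the next chunk. The sketch below assumes the offending entity is a numeric character reference to a character that XML 1.0 does not allow (such as &#x1F;); the actual entity is not shown here, and stripInvalidCharRefs() and cleanChunks() are hypothetical helper names, not part of PHP. The 12-byte look-back is an arbitrary assumption, and the cleaning also applies inside CDATA sections, where such a sequence would be literal text.

```php
<?php
// Hypothetical helper: removes numeric character references that point to
// characters outside the XML 1.0 Char range (e.g. "&#x1F;" or "&#11;").
function stripInvalidCharRefs(string $xml): string
{
    $cleaned = preg_replace_callback(
        '/&#(x[0-9A-Fa-f]+|[0-9]+);/',
        function (array $m): string {
            $ref = $m[1];
            $code = ($ref[0] === 'x') ? hexdec(substr($ref, 1)) : (int) $ref;
            $valid = $code === 0x9 || $code === 0xA || $code === 0xD
                || ($code >= 0x20 && $code <= 0xD7FF)
                || ($code >= 0xE000 && $code <= 0xFFFD)
                || ($code >= 0x10000 && $code <= 0x10FFFF);
            return $valid ? $m[0] : '';
        },
        $xml
    );
    return $cleaned ?? $xml; // fall back to the input if the regex fails
}

// Hypothetical helper: cleans a stream of chunks. A trailing "&..." without
// a closing ";" is held back and prepended to the next chunk, so a reference
// split across two reads is still seen whole.
function cleanChunks(iterable $chunks): Generator
{
    $carry = '';
    foreach ($chunks as $chunk) {
        $buffer = $carry . $chunk;
        $carry = '';
        $amp = strrpos($buffer, '&');
        if ($amp !== false
            && strpos($buffer, ';', $amp) === false
            && strlen($buffer) - $amp < 12 // assumed maximum length of a partial reference
        ) {
            $carry = substr($buffer, $amp);
            $buffer = substr($buffer, 0, $amp);
        }
        if ($buffer !== '') {
            yield stripInvalidCharRefs($buffer);
        }
    }
    if ($carry !== '') {
        yield stripInvalidCharRefs($carry); // flush whatever is left at end of input
    }
}
```

Each yielded piece could then be passed to xml_parse($parser, $piece, false), followed by one closing call xml_parse($parser, '', true) to signal the end of the input.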

3 weird things:

  1. I have the same problem in 4 files (out of 24).
    Each of these problematic files contains the invalid character twice, on 2 different lines.
    The error message is always the same and indicates the line number of the first occurrence (the content is actually the same on both lines, so the column is also the same, which is expected but probably wrong).

  2. The same error message is displayed too many times.
    File 1: 126.38 KB, 4 times. File 2: 11.70 MB, 4 times. File 3: 36.64 MB, 70 times. File 4: 4.55 MB, 96 times.
    Note: xml_parser_free($parser); $parser = NULL; is called after each file. Anyway, I'm on PHP > 8.0.0, so that's not needed anymore.

  3. The error message indicates the wrong(?) column number.
    The relevant portion of the line is "excellent technical knowhow in design".
    The invalid character is at column 325 but the error message says 312 (which is the "e" in "technical").
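
On point 2, one possible explanation (an assumption, not something verified against these files) is that the loop keeps calling xml_parse() after the first failure: if the parser remains in its error state, every later chunk reports the same error at the same position. A minimal sketch that stops at the first failure follows; parseChunks() and formatXmlError() are hypothetical helpers, and the chunks are passed as an iterable instead of being read with fread() so that the example is self-contained.

```php
<?php
// Hypothetical helper: formats the parser's current error like the
// message in the question (without the file name).
function formatXmlError($parser): string
{
    $code = xml_get_error_code($parser);
    return sprintf(
        'XML Parser: %s (error code: %d), line %d, column %d.',
        xml_error_string($code),
        $code,
        xml_get_current_line_number($parser),
        xml_get_current_column_number($parser)
    );
}

// Hypothetical helper: feeds chunks to a fresh parser and returns at the
// first failure, so the error is reported once. Returns null on success.
function parseChunks(iterable $chunks): ?string
{
    $parser = xml_parser_create();
    foreach ($chunks as $chunk) {
        if (!xml_parse($parser, $chunk, false)) {
            return formatXmlError($parser); // stop here instead of looping on
        }
    }
    // An empty final call tells the parser that the input is complete.
    if (!xml_parse($parser, '', true)) {
        return formatXmlError($parser);
    }
    return null;
}
```

In the original loop, the equivalent change would be a break right after the printf() call.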
