Regex - the difference in \\n and \n


Sorry to add another "Regex explanation" question to the internet, but I must know the reason for this. I have run this regex through RegexBuddy and Regex101.com, and neither helped.

I came across the following regex ("%4d%[^\\n]") while debugging a time parsing function. Every now and then I would receive an 'invalid date' error but only during the months of January and June. I mocked up some code to recreate exactly what was happening but I can't figure out why removing the one slash fixes it.

<?php
$format = '%Y/%b/%d';
$random_date_strings = array(
    '2015/Jan/03',
    '1985/Feb/13',
    '2001/Mar/25',
    '1948/Apr/02',
    '1948/May/19',
    '2020/Jun/22',
    '1867/Jul/09',
    '1901/Aug/11',
    '1945/Sep/21',
    '2000/Oct/31',
    '2009/Nov/24',
    '2015/Dec/02'
    );

$year = null;
$rest_of_string = null;

echo 'Bad Regex:';
echo '<br/><br/>';
foreach ($random_date_strings as $date_string) {
    sscanf($date_string, "%4d%[^\\n]", $year, $rest_of_string);
    print_data($date_string, $year, $rest_of_string);
}

echo 'Good Regex:';
echo '<br/><br/>';
foreach ($random_date_strings as $date_string) {
    sscanf($date_string, "%4d%[^\n]", $year, $rest_of_string);
    print_data($date_string, $year, $rest_of_string);
}

function print_data($d, $y, $r) {

    echo 'Date string: ' . $d;
    echo '<br/>';
    echo 'Year: ' . $y;
    echo '<br/>';
    echo 'Rest of string: ' . $r;
    echo '<br/>';
}
?>

Feel free to run this locally but the only two outputs I'm concerned about are the months of June and January. "%4d%[^\\n]" will truncate $rest_of_string to /Ju and /Ja while "%4d%[^\n]" displays the rest of the string as expected (/Jan/03 & /Jun/22).

Here's my interpretation of the faulty regex:

  • %4d% - Get four digits.
  • [^\\n] - Look for those digits in between the beginning of the string and a new line.

Can anyone please correct my explanation and/or tell me why removing the slash gives me the result I expect?

I don't care for the HOW...I need the WHY.

Best Answer

As @LucasTrzesniewski pointed out, that's sscanf() format syntax; it has nothing to do with regex. The format is described on the sprintf() page.

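In your pattern "%4d%[^\\n]", PHP's double-quoted string parsing turns the two backslashes (\\) into a single backslash character, so sscanf() receives a backslash followed by the letter "n" rather than a newline. The correct interpretation of the "faulty" pattern is therefore: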

  • %4d - Get four digits.
  • %[^\\n] - Look for all characters that are not a backslash or the letter "n"

That's why it matches everything up until the "n" in "Jan" and "Jun".
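To see why only January and June are affected, here is a rough sketch that uses strcspn() as a stand-in for the scanset: it counts how many leading characters are not in the excluded set. The $months list and the "/01" day suffix are made up for illustration; this is not what sscanf() runs internally, only an approximation of the same stopping rule.

```php
<?php
// Hypothetical list of the abbreviations used in the question's test data.
$months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

// "\\n" in double quotes is two characters: a backslash and a lowercase "n".
$excluded = "\\n";

$truncated = [];
foreach ($months as $m) {
    $rest = "/$m/01"; // illustrative remainder after the year
    // strcspn() returns the length of the leading run with no excluded characters.
    if (strcspn($rest, $excluded) < strlen($rest)) {
        $truncated[] = $m;
    }
}

print_r($truncated); // only Jan and Jun contain a lowercase "n"
```

"Nov" is unaffected because the comparison is case-sensitive and it only contains an uppercase "N".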

The correct pattern is "%4d%[^\n]", where the \n translates to a newline character, and its interpretation is:

  • %4d - Get four digits.
  • %[^\n] - Look for all characters that are not a new line
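The difference can be checked directly. The sketch below is a minimal illustration, assuming PHP 7.1+ for the short array syntax; the variable names are invented for the example. It compares the two escape sequences and then runs both formats against one of the dates from the question, using the form of sscanf() that returns the parsed values as an array.

```php
<?php
// In a double-quoted string, "\\n" is two characters and "\n" is one.
$two_chars = "\\n"; // a backslash followed by the letter "n"
$one_char  = "\n";  // a single newline character

echo strlen($two_chars), PHP_EOL; // 2
echo strlen($one_char), PHP_EOL;  // 1

$date = '2020/Jun/22';

// Scanset excludes backslash and "n": reading stops at the "n" in "Jun".
$bad = sscanf($date, "%4d%[^\\n]");

// Scanset excludes only the newline: reading continues to the end of the string.
$good = sscanf($date, "%4d%[^\n]");

var_dump($bad);  // year 2020, rest "/Ju"
var_dump($good); // year 2020, rest "/Jun/22"
```

If the format were written in single quotes instead, '\n' would also reach sscanf() as a backslash plus "n", so the double-quoted "\n" form is the one that produces a real newline here.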