Limiting the number of words in Hebrew


The formula from "Limit the number of words in a response with a regular expression" is great. It works well in English.

But probably not so good in Hebrew

Is there anything that should be changed?

I entered the formula, but when I try to fill in the form in Hebrew, the word-limit error message already appears on the first word.

If it's in English it works great

I hope I am answering in the right place; I'm having a little trouble understanding where to answer.
This is the screenshot: https://prnt.sc/fO3OeoPeMRXI
This is the error message: https://prnt.sc/gPMzM9eVY2Ej

Answer by VonC:

I suggested before:

^(?:\b\w+\b[\s\r\n]*){1,250}$

But \b (word boundary) often does not work with non-Latin scripts (like Hebrew) in regular expressions.
\b matches a position where one side is a word character (\w) and the other side is not, so it is only as good as the engine's definition of \w.

In many engines and modes (JavaScript, for example, or PCRE without the u modifier), \w matches only ASCII letters, digits, and the underscore (_). That set does not include characters from scripts like Hebrew, Arabic, or Cyrillic. So \b finds no boundary around a word written in Hebrew, because it does not see Hebrew characters as part of the \w category, and the pattern fails on the very first word. Engines with a Unicode-aware mode can treat Hebrew letters as \w, but you cannot rely on that mode being enabled.
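The difference between an ASCII-only \w and a Unicode-aware one can be reproduced with a small sketch. Python is used here only as a stand-in: its re.ASCII flag mimics engines whose \w is ASCII-only, while its default mode for text is Unicode-aware. The function names are made up for this illustration.

```python
import re

# The original word-limit pattern; fullmatch() supplies the ^...$ anchoring.
WORD_LIMIT = r"(?:\b\w+\b[\s\r\n]*){1,250}"

def matches_ascii(text: str) -> bool:
    # re.ASCII restricts \w (and therefore \b) to [A-Za-z0-9_].
    return re.fullmatch(WORD_LIMIT, text, flags=re.ASCII) is not None

def matches_unicode(text: str) -> bool:
    # Default for str patterns in Python 3: \w also covers Hebrew letters.
    return re.fullmatch(WORD_LIMIT, text) is not None

print(matches_ascii("hello world"))    # True
print(matches_ascii("שלום עולם"))      # False: no \b is found before the first Hebrew letter
print(matches_unicode("שלום עולם"))    # True
```

The second call fails on the first word, which matches the behavior described in the question.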

To work with other non-Latin scripts (like Hebrew for instance), you would need to define your own word boundaries, typically by directly specifying the range of characters in the script (like [\u0590-\u05FF] for Hebrew) and using other means to detect spaces or separators between words. That is why custom solutions are necessary for regex operations in non-Latin scripts.

^(?:[\u0590-\u05FF]+(?:\s+|$)){0,250}$

In the regex pattern ^(?:[\u0590-\u05FF]+(?:\s+|$)){0,250}$ designed for Hebrew text, the detection of spaces or separators between words is handled by the part (?:\s+|$).
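As a rough sketch of how the Hebrew pattern behaves, the following Python snippet builds it for a configurable limit. Python's re module happens to accept the same \u0590 escape; the helper names within_limit and hebrew_word_limit are hypothetical and not part of any form plugin.

```python
import re

def hebrew_word_limit(max_words: int) -> re.Pattern:
    # One "word" is a run of characters from the Hebrew block (U+0590-U+05FF),
    # followed by whitespace or the end of the input.
    return re.compile(r"(?:[\u0590-\u05FF]+(?:\s+|$)){0," + str(max_words) + "}")

def within_limit(text: str, max_words: int) -> bool:
    # fullmatch() plays the role of the ^...$ anchors.
    return hebrew_word_limit(max_words).fullmatch(text) is not None

print(within_limit("שלום עולם", 4))   # True: two Hebrew words
print(within_limit("א ב ג ד ה", 4))   # False: five words exceed the limit
print(within_limit("hello", 4))       # False: Latin letters are outside the range
```

Note that the character class accepts only the Hebrew block, so digits, Latin letters, and punctuation such as commas make the whole input fail. Widen the class if mixed text should be allowed.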


I tried this formula ^(?:[\u0590-\u05FF]+(?:\s+|$)){0,4}$, and it already seems to work well when I try to write. But there is an error when sending the form. I'm trying to add this to a Jet form. Why?

The screenshot indicates that the error comes from PCRE2 (Perl Compatible Regular Expressions, version 2), the regex engine PHP uses: PCRE2 does not support the \u or \U escapes inside a pattern. That would explain why the pattern passes while you type (presumably checked by the browser's JavaScript engine, which understands \u0590) but fails when the form is submitted and validated on the server.

To fix this, give PCRE the code points in a form it accepts. One option in PHP 7 or later is to write \u{0590} inside a double-quoted string (" "): PHP itself, not PCRE, replaces that escape with the actual character before the pattern is compiled. Single-quoted strings do not process \u{...}, so the double quotes matter here.

"/^(?:[\u{0590}-\u{05FF}]+(?:\s+|$)){0,4}$/u"

With:

  • \u{0590}-\u{05FF} is PHP's double-quoted string syntax for Unicode code points, so PCRE receives the literal Hebrew characters. PCRE's own equivalent is \x{0590}-\x{05FF}.
  • the u modifier at the end of the regex pattern, necessary to treat the pattern and the input as UTF-8.

The error upon form submission could also be due to several factors unrelated to the regex itself, such as:

  • Incorrect handling of character encoding in the form submission process.
  • Server-side validation that does not correctly handle Unicode or multi-byte characters.
  • Issues within the JetForm plugin or the PHP environment configuration.

Make sure the server-side environment is correctly configured to handle UTF-8 encoded data, and that the form processing script is using the corrected regex pattern. If the issue persists, you might need to check the documentation of the JetForm plugin or contact its support to resolve compatibility issues with Unicode patterns in PHP.


The regex ^\s?([\u0590-\u05fe]+\s?){1,5}$ from your picture is intended to match an optional leading whitespace character followed by 1 to 5 groups of Hebrew characters, where each group is optionally followed by a single whitespace character.
That regex is anchored at the beginning (^) and end ($) of the string to match the whole input.

It might fail with a Sanitize_Value_Exception in a Jet form because of:

  • Incorrect Unicode Syntax: In PHP PCRE regex, Unicode code points should be expressed with the \x{...} syntax, together with the u modifier. PCRE does not recognize \u0590; the \u{...} form is only understood by PHP itself inside double-quoted strings. The regex provided uses the bare \u form, so the PHP engine rejects the pattern as invalid.

  • Character Range: The range \u0590-\u05fe includes almost all the characters in the Hebrew block of Unicode, so the range itself is fine; only the escape syntax needs to change for PHP.

  • Form Field Validation: The Jet form might be expecting a certain format or encoding for the input data, and if the input does not strictly conform to these expectations, it could throw a Sanitize_Value_Exception.

The corrected regex in PHP should look like this:

/^\s?([\x{0590}-\x{05fe}]+\s?){1,5}$/u

Or, if you are including it in a double-quoted PHP string, with the backslashes doubled so that PHP passes them through to PCRE unchanged:

"/^\\s?([\\x{0590}-\\x{05fe}]+\\s?){1,5}$/u"

The input might be sanitized in a way that removes or alters characters expected by the regex, causing the match to fail.
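The brace rewrite applied by hand above (\u0590 becoming \x{0590}) can also be done mechanically. The following is a minimal Python sketch, assuming the pattern only contains four-digit \uXXXX escapes; js_to_pcre_escapes is a hypothetical helper name, not something provided by PHP or the form plugin.

```python
import re

def js_to_pcre_escapes(pattern: str) -> str:
    # Rewrite each four-digit \uXXXX escape as PCRE's \x{XXXX}.
    # Caveat: an escaped backslash followed by "u" (\\u) is not special-cased.
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: "\\x{" + m.group(1) + "}",
        pattern,
    )

form_pattern = r"^\s?([\u0590-\u05fe]+\s?){1,5}$"
print(js_to_pcre_escapes(form_pattern))
# ^\s?([\x{0590}-\x{05fe}]+\s?){1,5}$
```

The converted pattern still needs delimiters and the u modifier (/.../u) before PHP can use it.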