Why does strpos return different results?


I have the following function that transforms special accent characters (like ă) into a-zA-Z characters in a string:

function tradu($sir){
    // Map each diacritic to its unaccented ASCII counterpart.
    $diacritice = array("ă"=>"a", "â"=>"a", "î"=>"i", "Î"=>"I", "ș"=>"s", "ş"=>"s", "ţ"=>"t", "ț"=>"t");

    // strtr() replaces whole multi-byte keys. A byte-by-byte loop comparing
    // $sir[$i] to each key can never match, because every diacritic above
    // occupies two bytes in UTF-8, so the loop was dropped.
    return strtr($sir, $diacritice);
}

Let's say a is the original string and a_translated is the translated string.

When I use strpos(a, string_to_find) and strpos(a_translated, string_to_find), the returned values are different. I also checked strlen(a) and strlen(a_translated), and they give different results too. Why is this happening?

I need this explanation because I want to check whether a string with accents contains a given plain string (without accents), and I must return the portion of the original string where the match starts, accents included.

What I tried: I translate the original string, find the position where searched_string starts in the translated string, and then call substr(ORIGINAL_STRING, position). This is where I noticed that the positions do not correspond.

Example:

ORIGINAL STRING: Universitatea a fost înființată în 2001 pentru a oferi...

SEARCHED STRING: infiintata

DESIRED RESULT: înființată în 2001 pentru a oferi...


There are 2 best solutions below

1
On BEST ANSWER

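The positions you get from strpos do not match because your original string is multi-byte (UTF-8), and strpos counts bytes, not characters. Each accented character such as ă takes two bytes, so every one that precedes the match shifts the byte offset by one. Try mb_strpos instead, which counts characters.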

Try:

mb_strpos(a,string_to_find,0,'UTF-8');

and

mb_strpos(a_translated,string_to_find,0,'UTF-8');

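You will see that they return the same position, provided string_to_find occurs in both strings. If it only matches the translated string (as with infiintata), take the position from a_translated and cut the original with mb_substr: both strings have the same number of characters, so the character position carries over.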

The following code demonstrates the difference between strpos (which can't handle multi-byte strings) and mb_strpos:

$original_multibyte_string       = 'țată în  HERE';
$a_non_multibyte_str_same_length = '123456789HERE';
// HERE starts at the 10th character (index 9) in both strings

echo 'strpos finds HERE in multibyte at: '.strpos($original_multibyte_string,'HERE').' '.'strpos finds HERE in non-multibyte at: '.strpos($a_non_multibyte_str_same_length,'HERE');
// OUTPUTS: strpos finds HERE in multibyte at: 12 strpos finds HERE in non-multibyte at: 9

echo "\n";
// now let's test the multi-byte-aware function:

echo 'mb_strpos finds HERE in multibyte at: '.mb_strpos($original_multibyte_string,'HERE',0,'UTF-8').' '.'mb_strpos finds HERE in non-multibyte at: '.mb_strpos($a_non_multibyte_str_same_length,'HERE',0,'UTF-8');
// OUTPUTS: mb_strpos finds HERE in multibyte at: 9 mb_strpos finds HERE in non-multibyte at: 9

http://3v4l.org/ksYal
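Applied to the example from the question, the whole search could look roughly like the sketch below. The helper name find_in_original is made up for illustration (it is not from the question), and the code assumes the mbstring extension is loaded, the strings are UTF-8, and every accented letter is a single precomposed character, so the translation keeps the character count unchanged.

```php
<?php
// Hypothetical helper, not part of the original post: searches for an
// unaccented needle inside an accented haystack and returns the tail of
// the ORIGINAL string starting at the match, or null when nothing matches.
function find_in_original(string $original, string $needle): ?string
{
    // Same mapping as in the question's tradu() function.
    $diacritice = ["ă"=>"a", "â"=>"a", "î"=>"i", "Î"=>"I", "ș"=>"s", "ş"=>"s", "ţ"=>"t", "ț"=>"t"];
    $translated = strtr($original, $diacritice);

    // Character (not byte) position of the needle in the translated copy.
    $pos = mb_strpos($translated, $needle, 0, 'UTF-8');
    if ($pos === false) {
        return null;
    }

    // Each diacritic maps to exactly one character, so the same character
    // position is valid in the original string.
    return mb_substr($original, $pos, null, 'UTF-8');
}

$original = 'Universitatea a fost înființată în 2001 pentru a oferi...';
echo find_in_original($original, 'infiintata'), "\n";
// prints: înființată în 2001 pentru a oferi...
```

The position is taken with mb_strpos and consumed with mb_substr on purpose: mixing a character position with the byte-based substr would bring back the same mismatch described in the question.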

1
On

It's because these functions do not support UTF-8 characters: they count bytes, not characters.

a = 1 byte in UTF-8, ă = 2 bytes in UTF-8

That's the answer!
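A quick way to see the byte counts for yourself is the small sketch below. It assumes PHP 7 or later (for the "\u{...}" escape) and the mbstring extension; the variable names are only for illustration.

```php
<?php
$plain    = 'a';
$accented = "\u{0103}"; // ă, written as an escape so the file encoding does not matter

$plain_bytes    = strlen($plain);                // 1: 'a' is a single byte
$accented_bytes = strlen($accented);             // 2: 'ă' is two bytes in UTF-8
$accented_chars = mb_strlen($accented, 'UTF-8'); // 1: still one character
$accented_hex   = bin2hex($accented);            // c483: the two UTF-8 bytes

echo "$plain_bytes $accented_bytes $accented_chars $accented_hex\n";
// prints: 1 2 1 c483
```

This is why strlen(a) and strlen(a_translated) differ in the question: every replaced diacritic removes one byte from the string while the number of characters stays the same.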