UTF-8 character set, 7-bit encoding, PHP adding strange characters


I'm sorry my title is not better, but I'm not even sure how to categorize this problem. I know this has to do with encoding, but I am not sure how.

I am doing a project for an ESP (email service provider). Their emails are 7-bit encoded, with a UTF-8 character set (which doesn't really make sense to me).

Exhibit A: [screenshot: encoding settings]

I get the HTML email text via an API. I then use PHP to modify some of the text (via str_replace), and then post the new HTML back via the API.

All is fine, except every time I post, I am getting some strange characters, i.e. every time I run the code it adds another funky character.

Here is the affected section of the email before I make any changes (this is in "view" mode, i.e. how a browser would see it):

[screenshot: start]

Here is the code that produces that Copyright symbol AND the A with the "acute" symbol above it:

                            &Acirc;&copy; 2012 H

What's strange is that the only way to get rid of that A with the "acute" symbol above it is to delete the copyright symbol...somehow they are related.

Every time I post to the API via PHP, I get some new funky characters, thus:

1st post: [screenshot]

2nd post: [screenshot]

3rd post: [screenshot]

It's so strange...this is the only part that is not working! Please help...this is making me crazy! :-)

EDIT:

Here's the relevant PHP:

  1. Get the html from an xml response:

    $html = (string)$data;

  2. Replace some stuff:

    $newHTML = str_replace($oldExpiresString, $newExpiresString, $html);

  3. Put the new HTML into the xml post variables:

    $input = ''.$newHTML.'';

  4. URLEncode it:

    $formatted = urlencode($input);

  5. Post via curl:

    $postVariables = array(
        'type'     => urlencode($type),
        'activity' => urlencode($activity),
        'input'    => urlencode($input)
    );

    $rawResponseString = post_url($urlBase, $postVariables);
    print $rawResponseString;


There is 1 best solution below


To elaborate on my comment:

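$screwed = '&Acirc;&copy;';

echo html_entity_decode($screwed, ENT_COMPAT, 'ISO-8859-1');

This prints "©". Each entity decodes to a single ISO-8859-1 byte (0xC2, then 0xA9), and together those two bytes are the UTF-8 encoding of ©. In other words, it turns the screwed-up one-entity-per-byte HTML encoding back into UTF-8 encoded text. So from here you just need to treat the text as if it were UTF-8 encoded (which it is now).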
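For illustration, here is a rough sketch of why the copyright symbol and the accented A travel together, and how the decode above undoes it. It assumes that some upstream step reads the UTF-8 bytes as ISO-8859-1 and writes one entity per byte; that is a guess about the ESP's pipeline, not something confirmed here. The helper `fixCopyright` is a hypothetical name, not part of any API.

```php
<?php
// UTF-8 stores the copyright sign as two bytes: 0xC2 0xA9.
$copyright = "\xC2\xA9";

// Assumption: an upstream step reads those bytes as ISO-8859-1 and emits
// one HTML entity per byte, so 0xC2 becomes &Acirc; and 0xA9 becomes &copy;.
$screwed = htmlentities($copyright, ENT_COMPAT, 'ISO-8859-1');
echo $screwed, "\n"; // &Acirc;&copy;

// Decoding back to single ISO-8859-1 bytes restores the original UTF-8 bytes.
$repaired = html_entity_decode($screwed, ENT_COMPAT, 'ISO-8859-1');
echo bin2hex($repaired), "\n"; // c2a9

// Hypothetical helper: swap the broken pair for a plain &copy; entity.
// The result is pure ASCII, so it survives 7-bit transfer unchanged.
function fixCopyright(string $html): string
{
    return str_replace('&Acirc;&copy;', '&copy;', $html);
}

echo fixCopyright('&Acirc;&copy; 2012 H'), "\n"; // &copy; 2012 H
```

Running `html_entity_decode` over a whole HTML document would also decode entities such as `&lt;` and `&amp;`, which is why the helper only targets the one broken pair. It covers a single round of damage; if the text has already been posted several times, the leftover sequence may be longer and would need its own handling.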