I just finished up a project for a client of mine that involved, among other things, migrating a blog from one system to another.
Without going into too much detail, one problem I had was that a lot of their blog entries seemed to have been copied and pasted from Microsoft Word.
This became even more of a problem during the migration, because the blog system I was migrating to displayed the Microsoft Word characters (e.g. curly quotation marks, AKA smart quotes) as messy, blocky symbols - as if it didn't know what they were.
This came down to character encoding, but the upshot was that I needed the posts to use plain characters: " instead of ” and ' instead of ’.
There were also some very mysterious spaces that I could just not get rid of. I had to deal with those too.
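For anyone trying to work out what a mysterious character actually is, dumping the raw bytes is a quick first step. This is only a minimal sketch that assumes the posts are stored as UTF-8; the sample strings and the helper name showBytes() are made up for illustration, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical helper: returns the bytes of a string as space-separated hex pairs.
function showBytes(string $text): string
{
    return implode(' ', str_split(bin2hex($text), 2));
}

// A right single smart quote is three bytes in UTF-8.
echo showBytes("\u{2019}"), "\n";   // e2 80 99

// "a", a non-breaking space, "b": the space is two bytes (c2 a0), not one.
echo showBytes("a\u{00A0}b"), "\n"; // 61 c2 a0 62
```

If the dump shows c2 a0 where a plain space (20) should be, the mysterious space is most likely a UTF-8 non-breaking space.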
Overall, this problem was frustrating for me. It cost me 2 hours of research and testing. I'm releasing this function in the hope that it helps anyone who ends up in the same situation.
PHP Code:
function fixMessyString($string) // Works exceptionally well with screwy Microsoft Word strings
{
$find[] = '“'; // left side double smart quote
$find[] = '”'; // right side double smart quote
$find[] = '‘'; // left side single smart quote
$find[] = '’'; // right side single smart quote
$find[] = '…'; // ellipsis
$find[] = '—'; // em dash
$find[] = '–'; // en dash
$find[] = html_entity_decode('<p>&nbsp;</p>'); // Messy empty paragraphs at the end
$find[] = chr(160); // Mysterious space
$find[] = '  '; // We fixed the mysterious space above, so now we will make sure there are no double spaces
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
$replace[] = "";
$replace[] = " ";
$replace[] = " ";
$string = str_replace($find, $replace, $string);
$string = str_replace(utf8_encode('Â'), '', utf8_encode($string)); // Finish removing odd double spaces (note: utf8_encode() is deprecated as of PHP 8.2)
return $string;
}
The second str_replace is what I found necessary to completely remove all traces of the mysterious space, which is apparently chr(160), the non-breaking space (in PHP, anyway). I could not figure out any other way to kill it off completely.
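One possible alternative, sketched under the assumption that the input is valid UTF-8: chr(160) is the single byte 0xA0, but in UTF-8 a non-breaking space is the two bytes 0xC2 0xA0. Replacing only 0xA0 leaves a stray 0xC2 behind, which is typically what shows up as Â. Replacing the whole two-byte sequence in one pass should make the second str_replace unnecessary, and it avoids utf8_encode(), which is deprecated as of PHP 8.2. The name fixMessyStringUtf8() is hypothetical, the "\u{...}" escapes need PHP 7 or later, and this has not been run against the original blog data.

```php
<?php
// Sketch only: assumes $text is valid UTF-8.
// fixMessyStringUtf8() is a hypothetical name, not the original function.
function fixMessyStringUtf8(string $text): string
{
    $map = [
        "\u{201C}" => '"',   // left double smart quote
        "\u{201D}" => '"',   // right double smart quote
        "\u{2018}" => "'",   // left single smart quote
        "\u{2019}" => "'",   // right single smart quote
        "\u{2026}" => '...', // ellipsis
        "\u{2014}" => '-',   // em dash
        "\u{2013}" => '-',   // en dash
        "\u{00A0}" => ' ',   // non-breaking space (bytes C2 A0 in UTF-8)
    ];

    // strtr() swaps every key for its value in a single pass.
    $text = strtr($text, $map);

    // Collapse any run of two or more spaces into one; fall back to the
    // unmodified text if preg_replace() reports an error by returning null.
    return preg_replace('/ {2,}/', ' ', $text) ?? $text;
}

echo fixMessyStringUtf8("\u{201C}Hello\u{201D}\u{00A0}world\u{2026}"), "\n"; // "Hello" world...
```

Because the non-breaking space is matched as a whole character here, no orphaned byte is left over to clean up afterwards.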
If anyone has any comments or suggestions as to how I could better deal with this scenario in future cases, then please do speak up.
G'luck all.