[PHP Function] Make Microsoft Word documents understandable
I just finished up a project for a client of mine that involved, amongst other things, migrating a blog from one area to another.
Without going into too much detail, one problem I had was that a lot of their blog entries seemed to have been copied and pasted from Microsoft Word.
This was even more problematic when migrating them, as the blog system I was migrating to displayed the Microsoft Word characters (e.g. curly quotation marks, AKA smart quotes) as messy blocky symbols, as if it didn't know what the symbols were.
This had to do with the character encoding and whatnot, but it came down to this: I needed the posts to use normal characters, " instead of ” and ' instead of ’.
There were also some very mysterious spaces that I could just not get rid of. I had to deal with those too.
Overall, this problem was frustrating for me. It cost me 2 hours of research and testing. I wanted to release this function to perhaps help anyone who ever gets into this problematic scenario.
PHP Code:
function fixMessyString($string) // Works exceptionally well with screwy Microsoft Word strings
{
$find[] = '“'; // left side double smart quote
$find[] = '”'; // right side double smart quote
$find[] = '‘'; // left side single smart quote
$find[] = '’'; // right side single smart quote
$find[] = '…'; // ellipsis
$find[] = '—'; // em dash
$find[] = '–'; // en dash
$find[] = html_entity_decode('<p>&nbsp;</p>'); // Messy lines at the end
$find[] = chr(160); // Mysterious space
$find[] = '  '; // We fixed the mysterious space above, so now we will make sure there are no double spaces
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
$replace[] = "";
$replace[] = " ";
$replace[] = " ";
$string = str_replace($find, $replace, $string);
$string = str_replace(utf8_encode('Â'), '', utf8_encode($string)); // Finish removing odd double spaces
return $string;
}
The second str_replace is what I found to be necessary to completely remove all traces of the mysterious space, which is apparently chr(160) (in PHP, anyways). I could not figure out any other way to completely kill it off.
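A possible explanation for the leftover character, offered as a sketch rather than a definitive fix: if the posts are stored as UTF-8, the non-breaking space is the two bytes 0xC2 0xA0, so replacing chr(160) on its own leaves the 0xC2 byte behind, and that byte shows up as "Â" when read as Latin-1. The version below assumes UTF-8 input and PHP 7 or later (for the "\u{...}" escapes). The name normalizeWordText is hypothetical and not part of the function above.

```php
<?php
// Sketch only: assumes $text is valid UTF-8 and PHP 7+ (for "\u{...}" escapes).
// normalizeWordText is a hypothetical name, not part of the original function.
function normalizeWordText(string $text): string
{
    // Map each Word-style character to a plain ASCII equivalent.
    $map = [
        "\u{201C}" => '"',   // left double smart quote
        "\u{201D}" => '"',   // right double smart quote
        "\u{2018}" => "'",   // left single smart quote
        "\u{2019}" => "'",   // right single smart quote
        "\u{2026}" => '...', // ellipsis
        "\u{2014}" => '-',   // em dash
        "\u{2013}" => '-',   // en dash
        "\u{00A0}" => ' ',   // non-breaking space (bytes 0xC2 0xA0 in UTF-8)
    ];
    $text = strtr($text, $map);

    // Collapse runs of two or more ordinary spaces into one.
    return preg_replace('/ {2,}/', ' ', $text);
}

echo normalizeWordText("\u{201C}Hello\u{201D}\u{00A0} world\u{2026}"), "\n";
```

Because the whole two-byte sequence is replaced in one step, no stray byte should remain to clean up afterwards. If the input is not UTF-8 (for example Windows-1252), the map keys would need to be the corresponding single bytes instead.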
If anyone has any comments or suggestions as to how I could better deal with this scenario in future cases, then please do speak up.
G'luck all.
Re: [PHP Function] Make Microsoft Word documents understandable
did you try
PHP Code:
string trim ( string $str [, string $charlist ] )
This function returns a string with whitespace stripped from the beginning and end of str. Without the second parameter, trim() will strip these characters:
" " (ASCII 32 (0x20)), an ordinary space.
"\t" (ASCII 9 (0x09)), a tab.
"\n" (ASCII 10 (0x0A)), a new line (line feed).
"\r" (ASCII 13 (0x0D)), a carriage return.
"\0" (ASCII 0 (0x00)), the NUL-byte.
"\x0B" (ASCII 11 (0x0B)), a vertical tab.
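One thing worth noting about that list: the non-breaking space (character 160, or the bytes 0xC2 0xA0 in UTF-8) is not among trim()'s defaults, which may be why trim() alone has no visible effect on text pasted from Word. A minimal sketch of a trim that also covers it, assuming the string is UTF-8; trimWithNbsp is a hypothetical helper name:

```php
<?php
// Sketch only: assumes $str is valid UTF-8. trimWithNbsp is a hypothetical helper.
function trimWithNbsp(string $str): string
{
    // \s covers ordinary whitespace; \x{00A0} adds the non-breaking space.
    // The /u modifier makes the pattern operate on UTF-8 characters, not bytes.
    $result = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $str);

    // preg_replace returns null on invalid UTF-8; fall back to the input then.
    return $result ?? $str;
}

$padded = "\u{00A0} text \u{00A0}";
var_dump(trim($padded) === $padded);        // default trim() leaves this string unchanged
var_dump(trimWithNbsp($padded) === 'text');
```

A regular expression is used here instead of passing "\xC2\xA0" as trim()'s character list, because that list is byte-based and could strip the first byte of other multibyte characters at the edges of the string.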
Re: [PHP Function] Make Microsoft Word documents understandable
Quote:
Originally Posted by kutsumo
did you try
PHP Code:
string trim ( string $str [, string $charlist ] )
Heh, yes. That was one of the first few things I tried. It did absolutely nothing ;).
Re: [PHP Function] Make Microsoft Word documents understandable
How about
PHP Code:
str_replace(" ","",$string);
??
Re: [PHP Function] Make Microsoft Word documents understandable
Quote:
Originally Posted by kutsumo
How about
PHP Code:
str_replace(" ","",$string);
??
look closer at his function