[PHP Function] Make Microsoft Word documents understandable

  1. #1
    Software Person TimeBomb
    Rank: Moderator
    Join Date: May 2008
    Location: United States
    Posts: 1,252

    [PHP Function] Make Microsoft Word documents understandable

    I just finished up a project for a client of mine that involved, among other things, migrating a blog from one area to another.

    Without going into too much detail, one problem I had was that a lot of their blog entries seemed to have been copied and pasted from Microsoft Word.
    This was even more problematic when migrating them, as the blog system I was migrating to displayed the Microsoft Word characters (e.g. curly quotation marks, AKA smart quotes) as messy, blocky symbols, as if it didn't know what each symbol was.
    This had to do with the character encoding and whatnot, but it came down to this: I needed the posts to use normal characters, " instead of ” and ' instead of ’.

    There were also some very mysterious spaces that I could just not get rid of. I had to deal with those too.

    Overall, this problem was frustrating for me. It cost me 2 hours of research and testing. I wanted to release this function to perhaps help anyone who ever gets into this problematic scenario.

    PHP Code:
    function fixMessyString($string) // Works exceptionally well with screwy Microsoft Word strings
    {
        // Characters to look for; each entry lines up with the same position in $replace
        $find[] = '“';  // left side double smart quote
        $find[] = '”';  // right side double smart quote
        $find[] = '‘';  // left side single smart quote
        $find[] = '’';  // right side single smart quote
        $find[] = '…';  // ellipsis
        $find[] = '—';  // em dash
        $find[] = '–';  // en dash
        $find[] = html_entity_decode('<p>&nbsp;</p>'); // Messy lines at the end
        $find[] = chr(160); // Mysterious space
        $find[] = '  '; // We fixed the mysterious space above, so now we will make sure there are no double spaces

        $replace[] = '"';
        $replace[] = '"';
        $replace[] = "'";
        $replace[] = "'";
        $replace[] = "...";
        $replace[] = "-";
        $replace[] = "-";
        $replace[] = "";
        $replace[] = " ";
        $replace[] = " "; // double space becomes a single space

        $string = str_replace($find, $replace, $string);
        $string = str_replace(utf8_encode('Â'), '', utf8_encode($string)); // Finish removing odd double spaces

        return $string;
    }

    The second str_replace is what I found to be necessary to completely remove all traces of the mysterious space, which is apparently chr(160) (in PHP, anyways). I could not figure out any other way to completely kill it off.
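
A possible explanation for the leftover character, assuming the posts are stored as UTF-8: the no-break space (U+00A0) is the two bytes 0xC2 0xA0 in UTF-8, and 0xC2 read on its own as ISO-8859-1 is 'Â'. chr(160) matches only the 0xA0 byte, which would leave a stray 0xC2 behind. The sketch below is not the function above. It is an alternative that matches the full UTF-8 byte sequences in one pass, the name normalizeWordPunctuation is made up for illustration, and it only applies if the input really is UTF-8.

```php
<?php
// Hypothetical helper (not from the original post); assumes $text is valid UTF-8.
function normalizeWordPunctuation(string $text): string
{
    // Keys are the UTF-8 byte sequences of the Word characters.
    $map = [
        "\xE2\x80\x9C" => '"',   // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"',   // U+201D right double quotation mark
        "\xE2\x80\x98" => "'",   // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'",   // U+2019 right single quotation mark
        "\xE2\x80\xA6" => '...', // U+2026 horizontal ellipsis
        "\xE2\x80\x94" => '-',   // U+2014 em dash
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xC2\xA0"     => ' ',   // U+00A0 no-break space (both bytes at once)
    ];

    // strtr() with an array replaces every key in a single pass.
    $text = strtr($text, $map);

    // Collapse runs of two or more ordinary spaces into one.
    return preg_replace('/ {2,}/', ' ', $text) ?? $text;
}

echo normalizeWordPunctuation("\xE2\x80\x9CHi\xE2\x80\x9D\xC2\xA0 there"), "\n";
```

Because the whole 0xC2 0xA0 pair is replaced together, no second cleanup pass is needed in this variant.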

    If anyone has any comments or suggestions as to how I could better deal with this scenario in future cases, then please do speak up.

    G'luck all.
    Last edited by TimeBomb; 29-02-12 at 11:31 AM.


  2. #2
    Custom Title Enabled George SS
    Rank: Legend
    Join Date: Oct 2005
    Posts: 3,281

    Re: [PHP Function] Make Microsoft Word documents understandable

    did you try

    PHP Code:
    string trim ( string $str [, string $charlist ] )
    This function returns a string with whitespace stripped from the beginning and end of str. Without the second parameter, trim() will strip these characters:

    " " (ASCII 32 (0x20)), an ordinary space.
    "\t" (ASCII 9 (0x09)), a tab.
    "\n" (ASCII 10 (0x0A)), a new line (line feed).
    "\r" (ASCII 13 (0x0D)), a carriage return.
    "\0" (ASCII 0 (0x00)), the NUL-byte.
    "\x0B" (ASCII 11 (0x0B)), a vertical tab.

  3. #3
    Software Person TimeBomb
    Rank: Moderator
    Join Date: May 2008
    Location: United States
    Posts: 1,252

    Re: [PHP Function] Make Microsoft Word documents understandable

    Quote: Originally Posted by kutsumo
    did you try

    PHP Code:
    string trim ( string $str [, string $charlist ] )
    This function returns a string with whitespace stripped from the beginning and end of str. Without the second parameter, trim() will strip these characters:

    " " (ASCII 32 (0x20)), an ordinary space.
    "\t" (ASCII 9 (0x09)), a tab.
    "\n" (ASCII 10 (0x0A)), a new line (line feed).
    "\r" (ASCII 13 (0x0D)), a carriage return.
    "\0" (ASCII 0 (0x00)), the NUL-byte.
    "\x0B" (ASCII 11 (0x0B)), a vertical tab.
    Heh, yes. That was one of the first few things I tried. It did absolutely nothing ;).
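
A likely reason trim() did nothing here: the no-break space is not in the default character list quoted above. The sketch below assumes the text is UTF-8, where that character is the bytes 0xC2 0xA0, and uses preg_replace with the /u modifier as one possible way to strip it from both ends. The variable names are illustrative only.

```php
<?php
$nbsp = "\xC2\xA0"; // U+00A0 as UTF-8 (assumed encoding)
$raw  = $nbsp . "Hello world" . $nbsp;

// trim() only strips its default list, so the no-break spaces survive.
$trimmed = trim($raw);

// Strip ordinary whitespace and no-break spaces from both ends.
// The /u modifier makes PCRE treat the subject and pattern as UTF-8.
$clean = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $raw);

var_dump($trimmed === $raw);
var_dump($clean);
```

Adding the raw bytes to trim()'s second parameter is tempting, but trim() works byte by byte, so it could also cut the last byte off an unrelated multibyte character such as "à" (0xC3 0xA0).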

  4. #4
    Custom Title Enabled George SS
    Rank: Legend
    Join Date: Oct 2005
    Posts: 3,281

    Re: [PHP Function] Make Microsoft Word documents understandable

    How about
    PHP Code:
    str_replace(" ","",$string); 
    ??

  5. #5
    Infraction Baɴɴed holthelper
    Rank: Member
    Join Date: Apr 2008
    Posts: 1,765

    Re: [PHP Function] Make Microsoft Word documents understandable

    Quote: Originally Posted by kutsumo
    How about
    PHP Code:
    str_replace(" ","",$string); 
    ??
    look closer at his function
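
To illustrate the point, here is a small sketch, assuming UTF-8 input and assuming the quoted call uses an ordinary ASCII space. Replacing " " removes every normal space between words and still leaves the no-break space in place. bin2hex() is one way to see which bytes a "mysterious" character really consists of.

```php
<?php
$nbsp   = "\xC2\xA0"; // U+00A0 as UTF-8 (assumed encoding)
$input  = "two words" . $nbsp . "here";

// Removes the ordinary space (0x20) only; the no-break space is untouched.
$result = str_replace(" ", "", $input);

// Inspect the raw bytes of the suspicious character.
$hex = bin2hex($nbsp);

echo $hex, "\n";
echo bin2hex($result), "\n";
```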


