1 Reply Latest reply: Dec 11, 2009 11:20 AM by user4994457

    Using sed to strip weird special characters

      Hi all! We get a daily XML file from a vendor that, after being transmitted and decrypted, contains a few funny characters at the top of the file:

      ÿ£¢<?xml version="1.0" encoding="utf-8"?>

      If you vi the file, they look like this:

      \357\273\277

      Escape codes! I know how those work. The vendor insists that nothing they're doing is putting the characters in there and won't change their process, so I just have to fix it. We already run the file through a script that runs dos2unix on it; as part of that script, I'd like to use sed to strip out the special characters. Unfortunately, I'm having difficulty getting it to match them.

      From a command line, I can type:

      sed s/ÿ£¢//g file.xml

      This works great. However, when I edit my script with vi, it automatically translates the funny characters into the escape codes:

      sed s/\357\273\277//g file.xml

      And it doesn't match them. Any ideas? Supposedly the characters will always look the same, and will always be at the head of the file, but I'd like to make the process as flexible as possible in case they are wrong. I guess I could somehow parse the individual characters of the file to just drop the first three, but that seems like it would be 10 times more difficult than using sed.
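
      For what it's worth, \357\273\277 are the octal values of the bytes EF BB BF, which is the UTF-8 byte order mark. In the unquoted script line the shell removes the backslashes before sed runs, so sed receives s/357273277//g and searches for those nine digits. Below is a minimal sketch of two ways around that, assuming a POSIX shell with printf, sed, od, and tail available. The function names strip_bom_sed and strip_bom_bytes are made up for illustration and are not from the script described above.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical, not from the original script.

# Remove a leading UTF-8 byte order mark (bytes EF BB BF) using sed.
strip_bom_sed() {
    # printf turns the octal escapes into the three raw bytes, so the
    # script file itself stays plain ASCII and vi has nothing to rewrite.
    bom=$(printf '\357\273\277')
    # LC_ALL=C makes sed compare raw bytes instead of decoding characters.
    # "1" restricts the substitution to the first line, "^" to its start.
    LC_ALL=C sed "1s/^${bom}//" "$1"
}

# Same result by dropping the first three bytes, but only if they are the mark.
strip_bom_bytes() {
    # od prints the first three bytes as hex; tr removes the whitespace.
    first=$(od -A n -t x1 -N 3 "$1" | tr -d '[:space:]')
    if [ "$first" = "efbbbf" ]; then
        tail -c +4 "$1"   # start output at byte 4
    else
        cat "$1"
    fi
}
```

      Either function writes the cleaned file to standard output, for example: strip_bom_sed file.xml > clean.xml. Both only act on those exact three bytes at the very start, so a file that arrives without the mark passes through unchanged.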