Hi all! We get a daily XML file from a vendor that, after being transmitted and decrypted, contains a few funny characters at the top of the file:
ÿ£¢<?xml version="1.0" encoding="utf-8"?>
If you vi the file, they look like this:
\357\273\277
Escape codes! I know how those work. The vendor insists that nothing they're doing is putting the characters in there and won't change their process, so I just have to fix it. We already run the file through a script that "dos2unix"es it; as part of that script, I'd like to use sed to strip out the special characters. Unfortunately, I'm having difficulty getting it to match them.
From a command line, I can type:
sed s/ÿ£¢//g file.xml
This works great. However, when I edit my script with vi, it automatically translates the funny characters into the escape codes:
sed s/\357\273\277//g file.xml
And it doesn't match them. Any ideas? Supposedly the characters will always look the same, and will always be at the head of the file, but I'd like to make the process as flexible as possible in case they are wrong. I guess I could somehow parse the individual characters of the file to just drop the first three, but that seems like it would be 10 times more difficult than using sed.
The characters are a UTF-8 BOM (byte order mark): the three bytes EF BB BF, which is \357\273\277 in octal. A BOM is not recommended in UTF-8 files, but it is allowed, and Windows programs tend to add one. See [http://en.wikipedia.org/wiki/Byte_Order_Mark].
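As for why the scripted sed doesn't match: with the expression unquoted, the shell removes the backslashes before sed ever sees them, so sed ends up searching for the literal digits 357273277. Even when quoted, not every sed reads \357 as an octal byte. Here is a rough sketch of one way around that, letting printf build the three bytes and anchoring the match to the start of line 1. The function name strip_bom is made up for illustration, and this assumes a POSIX-style sh and sed:

```shell
#!/bin/sh
# strip_bom: hypothetical helper name, not part of any standard tool.
# Prints FILE to stdout with a leading UTF-8 BOM removed, if one is present.
strip_bom() {
    # Build the BOM bytes (octal 357 273 277 = hex EF BB BF) with printf,
    # so the script itself never has to contain raw non-ASCII bytes.
    bom=$(printf '\357\273\277')
    # LC_ALL=C makes sed treat its input as plain bytes.
    # "1s/^...//" only touches the very start of the first line.
    LC_ALL=C sed "1s/^${bom}//" "$1"
}
```

Usage would be something like `strip_bom file.xml > file.clean.xml`. Because the pattern is tied to the start of the first line, a file that arrives without a BOM should pass through unchanged.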
Seems to me 'tr' would be a better choice than 'sed'.
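One caution with tr: `tr -d` deletes individual bytes wherever they occur, not a three-byte sequence, so it could also eat bytes that belong to other multi-byte UTF-8 characters later in the file. Dropping the first three bytes is not as hard as it sounds, though. A minimal sketch, assuming your head supports `-c` (GNU and BSD versions do); drop_bom_bytes is a hypothetical name:

```shell
#!/bin/sh
# drop_bom_bytes: hypothetical helper name. Prints FILE without its first
# three bytes, but only when those bytes are exactly EF BB BF (the UTF-8 BOM).
drop_bom_bytes() {
    # Dump the first three bytes as hex and strip whitespace, giving "efbbbf"
    # when a BOM is present.
    first3=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \t\n')
    if [ "$first3" = "efbbbf" ]; then
        tail -c +4 "$1"    # start output at byte 4, skipping the BOM
    else
        cat "$1"           # no BOM: pass the file through unchanged
    fi
}
```

Checking the bytes first means the script does no harm on a day when the vendor's file shows up without the BOM.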