Tuesday, October 14, 2008

Sanitizing bad UTF-8

I recently hit a problem where statsvn could not generate statistics for one of my repositories. After some investigation, it turned out that the XML log file contains some raw Latin-1 byte sequences (left over from a Latin-1 to UTF-8 conversion done a long time ago), and these make the log file parsing fail. To sanitize the file, I use the following filter:

iconv -c -f UTF-8 -t UTF-8

This way, the obnoxious sequences are dropped and processing can go on (or, at least, it could go on if it did not fail just a little later with a null pointer exception...).
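As a rough illustration of what the filter does, here is a small sketch that feeds it one line containing a raw Latin-1 byte. The `sanitize_utf8` function name and the sample input are made up for this example. The `|| true` is there because, as far as I know, some iconv builds still exit with a non-zero status after dropping input, even with `-c`.

```shell
# Hypothetical helper wrapping the filter; the name is invented for this example.
sanitize_utf8() {
    # -c tells iconv to drop input bytes it cannot convert, here bytes
    # that are not valid UTF-8. The exit status is deliberately ignored
    # (assumption: some builds return non-zero after dropping bytes).
    iconv -c -f UTF-8 -t UTF-8 || true
}

# "caf" followed by a raw Latin-1 e-acute (byte 0xE9, octal 351), which
# is not valid UTF-8 on its own. The stray byte is removed.
printf 'caf\351\n' | sanitize_utf8

# The same word correctly encoded as UTF-8 (0xC3 0xA9) passes through untouched.
printf 'caf\303\251\n' | sanitize_utf8
```

On a real log it would be used as a plain filter, for instance `sanitize_utf8 < logfile.log > logfile.clean.log`, where both file names are placeholders. Keep in mind that the bad bytes are removed rather than repaired, so the accented characters they stood for are lost.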
