This is a small Python library (dependencies included) that demonstrates how to convert "documents in unknown encodings" into UTF-8. For example:

    $ python unidam.py unknownencoding.txt > utf8.txt

To check it out, do:

    $ svn co http://ruphus.com/svn/unidam unidam

If you find a file that breaks this pipeline, I'd like to hear about it (especially if you have a URL): pathall@gmail.com
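
To give a rough idea of what converting from an unknown encoding involves, here is a minimal standard-library sketch. It is not how unidam.py works internally; the function name `to_utf8` and the candidate list are assumptions made for illustration. It tries a few encodings in order and falls back to Latin-1, which accepts any byte sequence.

```python
# Hypothetical candidate list for illustration; not taken from unidam itself.
# "utf-8-sig" decodes plain UTF-8 and also strips a leading byte order mark.
CANDIDATES = ("utf-8-sig", "cp1252")


def to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes of unknown encoding as UTF-8 (best effort)."""
    for name in CANDIDATES:
        try:
            text = raw.decode(name)
        except UnicodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
        return text.encode("utf-8")
    # Latin-1 maps every byte to a code point, so this decode cannot fail,
    # although the resulting text may not be what the author intended.
    return raw.decode("latin-1").encode("utf-8")
```

The function would be called on the contents of a file opened in binary mode, with the result written to a binary stream such as `sys.stdout.buffer`. A fixed trial order like this guesses wrong on many real files, which is the kind of case a statistical detector is meant to handle.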