Files with mixed and invalid Encodings in Ruby

Recently I encountered a file which mostly contained UTF-8 characters. I could read the file and even throw it at Nokogiri and there was no problem.

When I wanted to preprocess the content of the file with gsub, Ruby raised this exception:

invalid byte sequence in UTF-8 (ArgumentError)
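The error is easy to reproduce. Here is a minimal sketch, assuming a string that carries a single stray Latin-1 byte; the sample data and the variable names are made up for illustration and are not from the original file:

```ruby
# Made-up sample: "\xE9" is "é" in ISO-8859-1, but on its own it is not
# a valid UTF-8 byte sequence.
untrusted_string = "caf\xE9 au lait"

# The string is tagged as UTF-8, yet its bytes do not form valid UTF-8.
tagged_encoding = untrusted_string.encoding
is_valid = untrusted_string.valid_encoding?

begin
  # Matching a regular expression against a string with a broken
  # encoding raises ArgumentError.
  untrusted_string.gsub(/\s+/, ' ')
rescue ArgumentError => e
  error_message = e.message
end
```

Reading the file does not validate its bytes, which is probably why the problem only showed up once gsub looked at the content.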

Now what to do?

There aren’t many attractive options. I googled around for quite some time, but nobody seemed to have a reasonable solution. So I tried wrapping a begin…rescue around each element I extracted with Nokogiri and doing the gsub cleanup there. If that failed, I would either try to guess the right encoding (ISO Latin or ASCII) or skip the element entirely. However, even if that had silenced the exceptions, the result would most likely have been incorrect.

My favorite option would be to simply ignore the bad sequences altogether, but I couldn’t find a proper way to do it in Ruby. After some more googling I found a blog post describing how this can be achieved with Iconv.

In case the blog post disappears I will write down the executive summary here:

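require 'iconv'  # a gem since Ruby 2.0
# //IGNORE drops byte sequences that are not valid in the target encoding
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)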
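On Ruby 2.1 and later, the same "ignore the bad sequences" idea can be expressed without Iconv by using the built-in String#scrub. This is a different technique from the one in the blog post, shown only as a rough sketch; the helper name drop_invalid_utf8 and the sample string are hypothetical:

```ruby
# Hypothetical helper, not part of any library.
# Assumes Ruby 2.1 or newer, where String#scrub is available.
def drop_invalid_utf8(untrusted_string)
  # Work on a copy so the caller's string is left untouched, tag it as
  # UTF-8, then replace every invalid byte sequence with nothing.
  untrusted_string.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Made-up sample with one stray Latin-1 byte.
valid_string = drop_invalid_utf8("caf\xE9 au lait")

# gsub no longer raises, because the string is now valid UTF-8.
cleaned = valid_string.gsub(/\s+/, ' ')
```

Note that this silently discards the offending bytes, just like //IGNORE does. Calling scrub without an argument would substitute the Unicode replacement character instead, which at least keeps the damage visible.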

I guess this Iconv trick is useful outside the Ruby universe as well. I hate encoding issues – I really do. Somehow, though, you end up with them in _every_ project.
