Detecting invalid encoding in CSV uploads

January 16, 2009 Pivotal Labs

We ran into an odd bug using FasterCSV to import some data. We required the CSV files to be UTF-8 encoded, but some users uploaded files in other encodings. FasterCSV choked on bytes that weren’t valid UTF-8, truncating the data from that point to the end of the line and leaving fields blank. We didn’t want to ask the user to select an encoding, because they’d probably get it wrong anyway, so we decided to reject any file containing bytes that would cause problems. The trick, then, is how to detect them.

First, the tests. We want to detect whether the input contains only valid UTF-8. And we need to deal with both strings and IO (File or StringIO) objects (more on that in a bit).

describe "::encoding_is_utf8? checks strings and IOs" do
  before do
    @utf8 = "This is a test with ç international characters"
  end

  it "returns true when all characters are valid" do
    Importer.encoding_is_utf8?(@utf8).should be_true
    Importer.encoding_is_utf8?(StringIO.new(@utf8)).should be_true
  end

  it "returns false when any characters are invalid" do
    bogus = Iconv.conv('ISO-8859-1', 'UTF-8', @utf8)
    Importer.encoding_is_utf8?(bogus).should be_false
    Importer.encoding_is_utf8?(StringIO.new(bogus)).should be_false
  end
end
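To see why the fixture in the second spec counts as invalid, it helps to look at the bytes. This is a small sketch, assuming Ruby 1.9 or later and using String#encode in place of the Iconv call from the spec; the variable names are made up for illustration.

```ruby
# Sketch only: String#encode (Ruby 1.9+) stands in for the Iconv call in the spec.
utf8   = "\u00E7"                    # "ç" as a UTF-8 string
latin1 = utf8.encode("ISO-8859-1")   # the same character in ISO-8859-1

# Render each string's raw bytes as hex so the two encodings can be compared.
utf8_hex   = utf8.bytes.map { |b| format("%02X", b) }.join(" ")
latin1_hex = latin1.bytes.map { |b| format("%02X", b) }.join(" ")

puts "UTF-8:      #{utf8_hex}"    # C3 A7
puts "ISO-8859-1: #{latin1_hex}"  # E7
```

In UTF-8, a lone E7 byte announces a three-byte sequence, so when an ordinary space follows it the line is malformed. That is the situation the "bogus" fixture creates.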

Here’s the implementation:

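require 'iconv'

class Importer
  # Accepts a String or an IO (File, StringIO); true only if every line is valid UTF-8.
  def self.encoding_is_utf8?(file_or_string)
    # Wrap a bare string so both kinds of input can be walked with #all?.
    file_or_string = [file_or_string] if file_or_string.is_a?(String)
    # //IGNORE drops invalid bytes, so any difference means the line was not clean UTF-8.
    is_utf8 = file_or_string.all? { |line| Iconv.conv('UTF-8//IGNORE', 'UTF-8', line) == line }
    # Reset the read position for the next reader; an Array does not respond to #rewind.
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
    is_utf8
  end
#...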

The meat of the check is the Iconv call. We convert from an assumed UTF-8 to UTF-8, dropping any byte sequences that aren’t valid UTF-8. If the output and input aren’t identical, there were bogus bytes in the input and the uploaded file should be rejected.
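The same round-trip idea can be sketched without Iconv. This is not the code we shipped: it assumes Ruby 2.1 or later, where String#scrub is available, and round_trips_as_utf8? is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper (not from the Importer class). Assumes Ruby 2.1+ for String#scrub.
def round_trips_as_utf8?(line)
  # Label a copy of the bytes as UTF-8 without changing them.
  labeled = line.dup.force_encoding(Encoding::UTF_8)
  # scrub("") removes invalid byte sequences; a clean line comes back unchanged.
  labeled.scrub("") == labeled
end

good = "caf\u00E9 au lait"   # valid UTF-8
bad  = "caf\xE9 au lait"     # an ISO-8859-1 byte inside a UTF-8 string

puts round_trips_as_utf8?(good)   # true
puts round_trips_as_utf8?(bad)    # false
puts bad.scrub("")                # caf au lait
```

The last line shows what the comparison is really catching: the cleaned copy is shorter than the input, so the two can’t be equal.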

The #rewind is needed to reset the read position in the file so FasterCSV can start over from the beginning. Specs for that aren’t included here.
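A quick illustration of why, using a StringIO in place of an uploaded file. The sample data and variable names here are invented for the sketch.

```ruby
require "stringio"

# Stand-in for an uploaded file; the contents are made up.
io = StringIO.new("a,b\n1,2\n")

# Walking every line (as #all? does) leaves the read position at the end.
io.all? { |line| !line.empty? }
at_end     = io.eof?    # true
left_over  = io.read    # "" because nothing remains to read

# After #rewind, the next reader sees the whole input again.
io.rewind
whole_file = io.read    # "a,b\n1,2\n"
```

Without the rewind, a parser handed this object after the check would see an empty input rather than an error, which is easy to miss.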

Then in our controller, we ensure the CSV doesn’t have any bad characters before we give it to FasterCSV. We extracted that check into its own method, shown here:

def require_utf8!(csv_content)
  unless Importer.encoding_is_utf8?(csv_content)
    raise "Import file must be UTF-8 only. You can paste non-UTF-8 CSV directly into the CSV Text field for automatic conversion."
  end
end

As you can read in the exception message (which ends up in the flash), the user can work around the encoding issue by pasting the CSV into a textarea input in the browser, which automatically transcodes the data into UTF-8. Aren’t browsers awesome? The other option would be to transcode the CSV file, but the textarea is easier if the files aren’t gigundous. Anyway, since we can input CSV as either a file or a textarea string, #encoding_is_utf8? needs to check both files and strings.

This approach and implementation seem fine to me. I get the feeling there might be a much simpler way, though. Anyone got a better idea?
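One candidate, offered as a sketch and not as a drop-in replacement: on Ruby 1.9 and later, strings carry an encoding and String#valid_encoding? can do the check directly, with no Iconv involved. The method name utf8_input? is hypothetical, and the behavior with real File objects opened in other modes is untested here.

```ruby
require "stringio"

# Hypothetical alternative to Importer.encoding_is_utf8?; assumes Ruby 1.9+.
def utf8_input?(file_or_string)
  # Treat a bare string as a one-element list so both inputs share one code path.
  lines = file_or_string.is_a?(String) ? [file_or_string] : file_or_string
  # Label each line's bytes as UTF-8 (on a copy) and ask Ruby whether they are well formed.
  lines.all? { |line| line.dup.force_encoding(Encoding::UTF_8).valid_encoding? }
ensure
  # Reset the read position even when #all? stops early on a bad line.
  file_or_string.rewind if file_or_string.respond_to?(:rewind)
end

utf8  = "This is a test with \u00E7 international characters"
bogus = utf8.encode("ISO-8859-1")

puts utf8_input?(utf8)                  # true
puts utf8_input?(StringIO.new(utf8))    # true
puts utf8_input?(bogus)                 # false
puts utf8_input?(StringIO.new(bogus))   # false
```

The ensure clause does the same job as the explicit #rewind in our version, and it does not change the method’s return value.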
