Wednesday, May 4, 2011

Ruby's String#unpack and Array#pack and Unicode

At work, I'm exporting some data for use by a system that doesn't understand Unicode. I ran across an Obie Fernandez post explaining his approach.

One method he showed was this:

def to_ascii_iconv
  converter = Iconv.new('ASCII//IGNORE//TRANSLIT', 'UTF-8')
  converter.iconv(self).unpack('U*').select{ |cp| cp < 127 }.pack('U*')
end
Essentially, Iconv first converts whatever characters it can from Unicode to ASCII. For example, á will be converted to 'a. (I don't know where to find a comprehensive list of these transliterations.) After the transliteration step, any remaining non-ASCII characters (for example, Japanese characters, which have no ASCII equivalent) are discarded using the pack and unpack methods. To see how this works, try this in irb:
"abcdefg".unpack('U*')
Your string is converted into an array of numbers, one per character, each being that character's Unicode code point, as we requested by using "U*". (I'm still a bit fuzzy on exactly how Unicode values work, but let's keep rolling.) Ruby's Array#pack method can do the opposite conversion: numbers to string values.
(1..127).to_a.pack('U*')
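To make the round trip concrete, here is a small sketch. The sample strings and variable names are mine, and it assumes a Ruby where string literals are UTF-8 (the default since Ruby 2.0):

```ruby
# unpack('U*') turns each character into its Unicode code point.
ascii_points = "abcdefg".unpack('U*')
p ascii_points   # prints [97, 98, 99, 100, 101, 102, 103]

# A non-ASCII character simply yields a larger number
# (233 is U+00E9, e with an acute accent).
accented_points = "caf\u00e9".unpack('U*')
p accented_points   # prints [99, 97, 102, 233]

# pack('U*') goes the other way and rebuilds the UTF-8 string.
rebuilt = accented_points.pack('U*')
puts rebuilt == "caf\u00e9"   # prints true
```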
The (1..127) example should give you a string of ASCII characters. ASCII code points run from 0 to 127, so that range skips only 0, the NUL character. Knowing this, it's easy to see how you can throw away non-ASCII values in a string:
some_string.unpack('U*').select{|character_value| character_value < 127}.pack('U*')
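Here is the same filter applied to a concrete string, as a rough sketch. The method name strip_non_ascii and the sample text are made up for illustration, and this version compares against 128 rather than 127, so it also keeps code point 127 (DEL), which is technically part of ASCII:

```ruby
# Hypothetical helper: keep only code points in the ASCII range (0..127).
def strip_non_ascii(text)
  text.unpack('U*').select { |code_point| code_point < 128 }.pack('U*')
end

# Sample input: "caf" + e-acute, a space, two Japanese characters, then " ok".
sample = "caf\u00e9 \u65e5\u672c ok"
cleaned = strip_non_ascii(sample)

p cleaned               # prints "caf  ok" (the two spaces around the dropped characters remain)
p cleaned.ascii_only?   # prints true
```

Note that nothing is transliterated here: the accented e is simply dropped, which is why Obie's version runs Iconv first.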
And that's what Obie's code does after its initial conversions with Iconv. (Strictly speaking, comparing with < 127 also discards code point 127, the DEL control character, which is part of ASCII; use < 128 if you want to keep it.) Now, if you're curious, you might want to see which Unicode characters some other numbers map to. No problem: just change the range from our earlier example, and write the resulting string to a file (in my case, the Unicode characters don't show correctly in irb):
string = (1..300).to_a.pack('U*')
File.open('unicodes_hooray.txt','w'){|f| f.write(string)}

Open that up in an editor that can display Unicode to see what you got.
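If you don't have a Unicode-capable editor handy, one way to check the result is to read the file back and print the code points as U+XXXX labels. This is only a sketch: the file name is my own invention, it writes to the system temp directory, and it uses binary mode so that newline characters in the range are not translated on Windows.

```ruby
require 'tmpdir'

string = (1..300).to_a.pack('U*')

# Hypothetical scratch file in the system temp directory.
path = File.join(Dir.tmpdir, 'unicodes_hooray_check.txt')
File.open(path, 'wb') { |f| f.write(string) }

# Read the raw bytes back and tell Ruby they are UTF-8.
read_back = File.open(path, 'rb') { |f| f.read }.force_encoding('UTF-8')

# Format the last few code points as U+XXXX labels.
labels = read_back.unpack('U*').last(3).map { |cp| format('U+%04X', cp) }
p labels                # prints ["U+012A", "U+012B", "U+012C"]
p read_back == string   # prints true
```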