Friday, June 3, 2011

Ruby's Set class

Yesterday at work, we ran into an interesting problem. We're creating the new version of an application and discarding the old, ugly code. But we need to migrate some data: the old system has (let's say) widgets, and the new system has widgets, too. The old system uses 5 different databases (see how ugly?) with weird row schemas, but it does reliably have widget color, size, and shape. The new system uses one database and has a nice row schema, and it also has widget color, size, and shape.

We need to know: which widgets are only in the old system? Which widgets are only in the new? Which are in both?

Enter Sets


Asking these questions tripped a switch in my mind. "I know about this!" I thought. "This is a job for sets! And Ruby has a Set class."

I'd never used them before, but sets are made for this kind of thing. Sets are often illustrated with Venn diagrams: overlapping circles, where you ask "which things are only in the left circle? What's in the overlap?", etc.

For instance:

[Image: Venn diagram]

A set is a collection of items where no item is repeated. If you have more than one set, you can compare them and answer the kinds of questions we've been asking. Here's a demo I just threw together:

  require 'set'

  def sets_demo

    # Sets ignore duplicate values (the order elements print in may vary by Ruby version)
    game_words = Set.new(['duck','duck','duck','goose'])
    puts "Unique game words                 : #{game_words}\n\n"
    #=>   Unique game words                 : goose, duck
     

    # Here are two sets with one thing in common
    fast  = Set.new(['bullet', 'cheetah'])
    round = Set.new(['bullet', 'beach ball'])

    # All the ways we can compare them
    puts "Round                             : #{round}"
    #=>   Round                             : bullet, beach ball
    
    puts "Fast                              : #{fast}"
    #=>   Fast                              : cheetah, bullet
    puts ''


    puts "Round and Fast (&)                : #{(fast & round)}"
    #=>   Round and Fast (&)                : bullet

    puts "Round but not Fast (-)            : #{(round - fast)}"
    #=>   Round but not Fast (-)            : beach ball
    
    puts "Fast but not Round (-)            : #{(fast - round)}"
    #=>   Fast but not Round (-)            : cheetah

    puts "Round OR Fast (|)                 : #{(round | fast)}"
    #=>   Round OR Fast (|)                 : cheetah, bullet, beach ball

    puts "Round OR Fast, but NOT both (XOR) : #{((round | fast) - (fast & round))}"
    #=>   Round OR Fast, but NOT both (XOR) : cheetah, beach ball

  end

  # Formatting the way the sets print
  class Set
    def to_s
      to_a.join(', ')
    end
  end

  sets_demo
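
A side note on that last comparison: Ruby's Set also has a built-in ^ operator for "in one or the other, but not both". Here's a minimal sketch checking that it agrees with the long form used in the demo. The either_but_not_both helper is a name made up for this illustration; it is not part of Set.

```ruby
require 'set'

# Hypothetical helper (not part of Set): the long-form version from the demo
def either_but_not_both(a, b)
  (a | b) - (a & b)
end

fast  = Set.new(['bullet', 'cheetah'])
round = Set.new(['bullet', 'beach ball'])

# Set#^ is the built-in exclusive-or; set equality ignores element order
same = (round ^ fast) == either_but_not_both(round, fast)
puts same
#=> true
```

Either spelling works; the operator is just shorter.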
  

Got it?

In my examples, the items in the sets were strings, but they could be anything. In our case at work, we used hashes: a widget was represented by a hash containing its color, shape and size. So, we just had to:

  1. Connect to each of the databases in the old system, getting all the widgets, creating a hash for each one, and dropping each into an old_system_widgets set (which automatically ignores duplicates)
  2. Connect to the new system's database and make a similar set of its widgets
  3. Do the kinds of set operations illustrated above
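
The steps above can be sketched roughly like this. It's a simplified illustration, not our actual migration code: the compare_widgets method, the sample rows, and the result keys are all made up, and plain arrays stand in for the database queries.

```ruby
require 'set'

# Hypothetical helper: takes one list of widget hashes per old database,
# plus the list of widget hashes from the new database.
def compare_widgets(old_rows_per_db, new_rows)
  old_system_widgets = Set.new
  # Merging each database's rows into one set drops duplicates automatically
  old_rows_per_db.each { |rows| old_system_widgets.merge(rows) }
  new_system_widgets = Set.new(new_rows)

  {
    only_old: old_system_widgets - new_system_widgets,
    only_new: new_system_widgets - old_system_widgets,
    in_both:  old_system_widgets & new_system_widgets
  }
end

# Made-up sample data standing in for query results
old_rows_per_db = [
  [{ color: 'red',  shape: 'square', size: 'large' },
   { color: 'blue', shape: 'circle', size: 'small' }],
  [{ color: 'red',  shape: 'square', size: 'large' }]  # same widget in a second database
]
new_rows = [
  { color: 'blue',  shape: 'circle',   size: 'small' },
  { color: 'green', shape: 'triangle', size: 'medium' }
]

result = compare_widgets(old_rows_per_db, new_rows)
result.each { |label, widgets| puts "#{label}: #{widgets.size}" }
#=> only_old: 1
#=> only_new: 1
#=> in_both: 1
```

This works because Ruby compares hashes by their contents, so two hashes with the same keys and values count as the same set member. The flip side is that 'Red' and 'red' would count as different widgets, so values probably need to be normalized before they go into the sets.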

Voilà! Now we knew which widgets existed only in the new system and which ones still needed to be migrated from the old one.

In conclusion: sets are swell!

Hmmm. That's a pretty weak ending.