The Sleepless Geek

Saturday, August 7, 2010

Quick Ruby Experiment

Here's an interesting Ruby experiment:

Class.ancestors # [Class, Module, Object, Kernel]

Module.ancestors # [Module, Object, Kernel]

Object.ancestors # [Object, Kernel]

Kernel.ancestors # [Kernel]
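For anyone who wants to poke at this further, here is a rough sketch that walks the superclass chain by hand. The helper name superclass_chain is made up for illustration. On Ruby 1.9 and later the chain ends in BasicObject, which sits above Object.

```ruby
# Walk the superclass chain of a class, collecting each class along the way.
# (superclass_chain is a hypothetical helper name, not part of Ruby itself.)
def superclass_chain(klass)
  chain = []
  while klass
    chain << klass
    klass = klass.superclass
  end
  chain
end

# On Ruby 1.9 and later this prints [Class, Module, Object, BasicObject]
p superclass_chain(Class)

# Kernel is a module, not a class, so it never appears in a superclass chain.
# It shows up in ancestors lists because Object mixes it in.
p Kernel.class              # Module
p Object.include?(Kernel)   # true
```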

Ruby is to Esperanto as PHP is to English

In 1887, L.L. Zamenhof published a book called "Unua Libro," describing a brand-new language that he hoped would change the world. A language anyone could learn to speak. One without political affiliation, without a messy history. And, I have to imagine he thought: finally, one that made sense.

English speakers know that our language is a mess - comedy routines have been built around it.

Here's Brian Regan on learning plurals in school.

So she asked this kid who knew everything. Irwin. “Irwin, what’s the plural for ox?”

“Ox. Oxen. The farmer used his oxen.”

“Brian?”

“What?”

“Brian, what’s the plural for box?”

“Boxen. I bought 2 boxen of doughnuts.”

“No, Brian, no. Let’s try another one. Irwin, what’s the plural for goose?”

“Geese. I saw a flock of geese.”

“Brian?”

[Exasperated laughing]“Wha-a-at?”

“What’s the plural for moose?”

“Moosen! I saw a flock of MOOSEN! There were many of ‘em. Many much moosen. Out in the woods…in the wood-es…in the woodsen. The meese want the food in the woodesen…food is the eatenesen…the meese want the food in the woodesenes…food in the woodesenes.”

We learn English rules, like "to make a past tense, add 'ed' to a verb: sailed, repeated, succeeded, constructed, cleaned." But many common verbs don't follow this. I ate, not eated; I ran, not runned; I went, not goed; I slept, not sleeped.

Other languages are the same. Spanish, for example, has tons of irregular verbs. Like many languages, it also has genders for all its nouns. Mountains are feminine. Trees are masculine. Why? Nobody knows. Mark Twain had hilarious complaints about German, which adds another gender called "neuter," among other complications.

Every noun has a gender, and there is no sense or system in the distribution; so the gender of each must be learned separately and by heart. There is no other way. To do this one has to have a memory like a memorandum-book. In German, a young lady has no sex, while a turnip has. Think what overwrought reverence that shows for the turnip, and what callous disrespect for the girl. See how it looks in print -- I translate this from a conversation in one of the best of the German Sunday-school books:

Gretchen: "Wilhelm, where is the turnip?"
Wilhelm: "She has gone to the kitchen."
Gretchen: "Where is the accomplished and beautiful English maiden?"
Wilhelm: "It has gone to the opera."

The reason for all this messiness is simple: nobody planned this. Languages are organic, evolving things. We forget words, make up new ones, pronounce things lazily, and the mistakes become normal. The rules change, and the exceptions pile up.

But not in Esperanto. Esperanto was designed to make sense.

Esperanto is much easier to learn than other languages because:

The different letters are always pronounced in the same way and every letter in a word is pronounced. Therefore there are no difficulties with spelling and pronunciation, as one knows that the penultimate syllable is always stressed.

The grammar is simple, logical and without exceptions. The grammatical exceptions are often what make it so difficult to learn a new language.

Most of the words in Esperanto are international and are found in languages around the world.

It is easy to make new words with prefixes or suffixes. Thus, if one learns one word, ten or more usually come as part of the package.

And while I'm sure it has its peculiarities, it's fundamentally different from English or Russian or Chinese or Swahili, which basically became what they are by chance.

PHP is like English

PHP is a very useful language for web development. But it's a bit haphazard, like English. Just look at some of its string functions:

count_chars() should have a counterpart called count_words(), right? But it doesn't. The counterpart is str_word_count() .
strip_tags() is named with words separated by underscores. stripslashes() has no underscores.
hebrev() will "convert logical Hebrew text to visual text." I don't know what that means, but really: a whole function for this, in the global namespace? Could something similar be done with Korean or Arabic or Portuguese or Tagalog? If so, would we create korev() and arabiv() and portugv()? Wouldn't it make more sense to have something like langv('languageName') and be done with it? For that matter, why include such a specific function in the base language, when 99.9% of users won't need it and those who do could use a library?

This haphazard design reflects PHP's history: once called "Personal Home Page/Forms Interpreter," it was made with one vision and has been amended and revised and rewritten over and over. I get the impression that somebody said, "hey, I want to add a function for messing with Hebrew." And the PHP team said, "sure, knock yourself out. We'll put it in there."

Now don't get me wrong. I'm not smart enough to design a language myself. And these days, you can write solid, test-driven, object-oriented code in PHP. But fundamentally, it feels like a language that happened, not one that was made.

Ruby is like Esperanto

Ruby, on the other hand, was designed by Yukihiro "Matz" Matsumoto, a Japanese computer scientist who wanted to draw on the strengths of Perl and Python, and above all, to write a language for human beings. As Matz wrote in the foreword to Hal Fulton's The Ruby Way:

...Programming languages are ways to express human thought... Machines do not care whether programs are structured well; they just execute them bit by bit. Structured programming is not for machines, but for humans... So to design a human-oriented language, Ruby, I followed the Principle of Least Surprise. I consider that everything that surprises me less is good. As a result I feel a natural feeling, even a kind of joy, when programming in Ruby.

The language feels almost philosophical. It starts with principles: everything is an object. This has deep implications: classes are objects. True and false are objects. "Class" itself is an object, of the class "Class." Odd, yes, but not haphazard. Where PHP seems sloppy, Ruby seems mysterious, like real life. How can light be both a particle and a wave? How can "nil" be an object? We don't know, but we suspect there are satisfying reasons if we dig deep enough.

This logical consistency generally extends to the particulars of the language. Want to know the number of items in an array, or the number of characters in a string? .length is your method, in either case. (PHP employs count() and strlen(), respectively.)
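A quick sketch of that consistency, using made-up sample data. The hash at the end is an extra illustration that isn't mentioned above.

```ruby
items = ["apple", "banana", "cherry"]
word  = "banana"

puts items.length   # 3
puts word.length    # 6

# Hashes answer the same message.
prices = { "apple" => 1, "banana" => 2 }
puts prices.length  # 2
```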

If anything can be converted to a string, you can rely on .to_s to do it; to_a and to_i will likewise convert to an array or integer, if possible. There are some unexpected or duplicate methods: str.to_str is the same as str.to_s, but Array doesn't have .to_str. But on the whole, it seems easier to guess what a method will be called than in PHP.
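Here's a small sketch of those conversion methods in action. The values are arbitrary examples. The last lines check the to_str oddity mentioned above.

```ruby
number_text = 42.to_s
range_items = (1..3).to_a

p number_text          # "42"
p "42".to_i            # 42
p "42 apples".to_i     # 42 (leading digits are used, the rest is ignored)
p "apples".to_i        # 0  (no leading digits at all)
p range_items          # [1, 2, 3]
p nil.to_s             # ""
p nil.to_a             # []

# Strings respond to both to_s and to_str; arrays only to to_s.
p "abc".to_str == "abc".to_s       # true
p [1, 2].respond_to?(:to_str)      # false
```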

Ruby, like Esperanto, is quirky. It considers 0 to be true, for example, unlike any other language I know about. But when I encounter its quirks, I assume that they're by design; a necessary condition for some beautiful aspect of the language. In PHP, I think, "somebody didn't think that through."
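To see the rule behind that quirk, here is a tiny sketch. The helper truthy? is a made-up name. In Ruby, only nil and false count as false in a condition; everything else, including 0, counts as true.

```ruby
# truthy? is a hypothetical helper that reports how Ruby treats a value
# when it is used as a condition.
def truthy?(value)
  value ? true : false
end

p truthy?(0)      # true
p truthy?("")     # true
p truthy?([])     # true
p truthy?(nil)    # false
p truthy?(false)  # false
```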

But perhaps I should stop there. Who am I to judge? I'm still a beginner in this field, with plenty to learn in both PHP and Ruby. Judgment can wait.

For now, I should go back to my studies. I should open my copy of "The Well-Grounded Rubyist" with a curious and open mind. And with the enjoyable expectation that, if I read and think and tinker and practice and ask questions, it will all make sense in the end.

Saturday, June 19, 2010

AT&T U-Verse

I just learned AT&T will probably roll out "U-Verse" fiber service in our area soon and went to look at the price and speeds.

Their naming scheme is pretty funny: Pro, Elite, Max, Max Plus, and Max Turbo. It makes me imagine this conversation.

Me: I'm not a heavy bandwidth user. What's your base package?
Them: That would be Pro.
Me: As in Professional?
Them: Yes.
Me: So if I were a professional internet user, I'd want the worst connection.
Them: No, you'd probably want at least Max.
Me: There is more than Max?
Them: Oh yes, there are two more levels past Max. But you probably don't need that. Why don't you try Elite?
Me: How good is that?
Them: It's the second-to-slowest one.
Me: I see. Is your marketing department staffed by native English speakers?

I don't need high speed, but I like where this is going. Maybe I should hold off until they offer Max Xtreme Warp Nuclear.

Thursday, March 25, 2010

Cache busting in PHP: Part 2

In my previous post, I showed how to borrow a technique from Ruby on Rails for busting the browser cache for a particular file.

If you haven't read that, please check it out and come back. It's OK, I'll wait here. I've got a snack.

Improving on cachedFile()

Back? OK, well I've made some improvements to cachedFile() and thought I'd share them. Here are the new capabilities:

1) The function now extracts the file type from the extension
2) It handles images, and specifies their dimensions for faster and smoother rendering by the browser
3) It caches all the information it calculates about a file for faster performance on subsequent requests

#1 and #2 are pretty straightforward: you can use the function like this:

cachedFile('foo.png');
cachedFile('subdirectory/bar.png','class="buz"');

...and it outputs something like this:

I put a cache in your cache so you can cache while you cache

But what about #3? What's this caching business? How can we add caching to a caching function?

Let's back up a bit.

First off, cachedFile() was a bit of a misnomer. This function is really for BUSTING a cache.

1) First, we configured our web server to tell the browser "you can cache these types of files for a whole year - don't ask for them again."
2) Second, we made sure that the browser saw each filename as the combination of the ACTUAL filename, like 'foo.png', and the file's time stamp, resulting in 'foo.png?1241452378' (or something like that). Those numbers represent the last time the file was changed; they're the same time stamp you see on any file on your computer.
3) Third, since the time stamp is automatically pulled from the file, we verified that we can update the file, which will update the time stamp, which will trick the browser into thinking it's never seen that file before, and therefore requesting it again.

The end result: the browser asks for a file once, then never again (at least for a year) - until the moment you change the file. As soon as it's updated, the browser asks for a new copy; until then, it uses the one it cached.

So, instead of cachedFile(), we could have called the function browserCacheBuster(). (But we won't, because I think that sounds cheesy.)

Now, this is all great, but the server is doing a bit of work for each file. Like before, each time you ask our function for a file, it has to go and determine the time stamp. In addition, my new features mean that for image files, it has to compute the width and height of the image.

This is all very fast in human terms, but how will it scale? What if you're using cachedFile() to spit out the same image tag several hundred times on the same page?

In that case, it might be nice to remember what you calculated last time. "Foo.png? Oh yeah, I remember him. I wrote down his dimensions and time stamp right here. No need to calculate them again."[1]

Memoization

To make this happen, we're going to use a design pattern called memoization. It works like this:

1) Before you calculate a result or pull it from the database, see if you've already got that result stored in a cache
2) If not, figure out your result and store it in your cache. If so, skip this step.
3) Now you've verified that you've got it in your cache, so return it from there.

For a given input, the first time the function runs, it will check the cache, find nothing, calculate a result from the input, store the result in cache, and return. Every time after that, it will just check the cache, find a result for that input, and return.
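The three steps can be sketched on their own, apart from any file handling. This version is in Ruby rather than PHP, and the class name Memoizer and the stand-in calculation are made up for illustration.

```ruby
# A minimal memoization sketch. Memoizer is a hypothetical class, and the
# block passed to it stands in for any expensive calculation.
class Memoizer
  attr_reader :misses

  def initialize(&calculation)
    @calculation = calculation
    @cache = {}
    @misses = 0   # how many times the real work had to be done
  end

  def fetch(input)
    # Steps 1 and 2: only calculate and store when the cache has no entry.
    unless @cache.key?(input)
      @misses += 1
      @cache[input] = @calculation.call(input)
    end
    # Step 3: the answer is now guaranteed to be in the cache.
    @cache[input]
  end
end

lengths = Memoizer.new { |name| name.length }   # pretend this is slow
300.times { lengths.fetch("foo.png") }
puts lengths.fetch("foo.png")   # 7
puts lengths.misses             # 1
```

However many times fetch is called for the same input, the block only runs once.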

Does it matter?

But is there any point in doing this? Are we prematurely optimizing? Maybe. Let's see how much performance gain this really gets us.

I did a little not-very-scientific testing: added some caching to cachedFile(), called it from a loop a few hundred times, and timed the results using PHP's microtime(). I tried this with js, css, and image files, and did five or ten iterations of each.

Not a great sample size, but here's what I found: for .js files, having a cache made the function 2.72 times faster. For .css files, it made it 3.18 times faster. But for image files, having a cache made the function 119.63 times faster!

Clearly, computing those image dimensions is a bit expensive for the server, and we don't want to do it more than necessary.[2] Caching cuts the workload considerably.

Enough talk - code time

OK, let's see how our function looks with these changes. (The cache is stored in a global variable so it will persist between function calls. To offset this minor sin, I have labeled it clearly and awkwardly to prevent accidental meddling from elsewhere.)

$GLOBAL_cachedFile_cache = array();
function cachedFile($name, $attr=null){
 global $GLOBAL_cachedFile_cache;
 // Key on the attributes too, so the same file with different attributes gets its own tag
 $key = $name . '|' . $attr;
 if (!isset($GLOBAL_cachedFile_cache[$key])){
  $root = $_SERVER['DOCUMENT_ROOT'];
  // Everything after the last dot, lowercased ('Foo.PNG' gives 'png')
  $filetype = strtolower(substr($name,strrpos($name,'.')+1));

  /* Configuration options */
  $imgpath = '/images/';
  $csspath = '/stylesheets/';
  $jspath = '/scripts/';

  switch ($filetype){
   case 'css':
    $output = '<link rel="stylesheet" type="text/css" href="' . $csspath;
    $output .= $name;
    $output .= '?' . filemtime($root . $csspath . $name) . '" ';
    if($attr){
     $output .= $attr . ' ';
    }
    $output .= '/>' . "\n";
    break;
   case 'js':
    $output = '<script type="text/javascript" src="' . $jspath;
    $output .= $name;
    $output .= '?' . filemtime($root . $jspath . $name) . '">';
    $output .= '</script>' . "\n";
    break;
   case 'jpg':
   case 'gif':
   case 'png':
    //This code will get run in any of the three cases above
    $output = '<img src="' . $imgpath . $name;
    $output .= '?' . filemtime($root . $imgpath . $name) . '"';
    //Index 3 of getimagesize() is a ready-made 'width="x" height="y"' string
    $imgsize = getimagesize($root . $imgpath . $name);
    $output .= ' ' . $imgsize[3];
    if($attr){
     $output .= ' ' . $attr;
    }
    $output .= ' />';
    break;
   default:
    //Unknown file type: output nothing rather than an undefined variable
    $output = '';
  }
  $GLOBAL_cachedFile_cache[$key] = $output;
 }
 echo $GLOBAL_cachedFile_cache[$key];
}

Magnanimousness

What's that? Want to use this code somewhere? Well, sure. No, you don't have to thank me, or license it, or anything. Just name your kid after me or send me a solid gold pickle.

Humility

And of course, perhaps I did something very stupid here. Well, that's what comments are for.

[1]You might worry that this will create problems. After all, if we cache the time stamp, won't we miss the fact that the file has been updated and defeat our purpose? No worries: the cache only lasts as long as the page script is running. So if you update a file while a user is loading the page, they won't see the change on that load. But on the next reload, they will.

[2]In fact, it would be reasonable not to do it at all; there are lots of factors in how fast a site performs and seems, but how quickly it renders is certainly one of them. This is meant to help with that, but costs processor time. You'll have to decide what works best for your site.

Saturday, March 20, 2010

Rails caching and cache busting in PHP

Ever wondered how to use browser caching to speed up your page loads?

I was working on a Rails project recently, and noticed something interesting in the documentation:

Using asset timestamps

By default, Rails appends asset's timestamps to all asset paths[1]. This allows you to set a cache-expiration date for the asset far into the future, but still be able to instantly invalidate it by simply updating the file (and hence updating the timestamp, which then updates the URL as the timestamp is part of that, which in turn busts the cache).

It's the responsibility of the web server you use to set the far-future expiration date on cache assets that you need to take advantage of this feature. Here's an example for Apache:

# Asset Expiration
ExpiresActive On
<FilesMatch "\.(ico|gif|jpe?g|png|js|css)$">
ExpiresDefault "access plus 1 year"
</FilesMatch>

As I explained on Stackoverflow (more on that in a moment):

If you look at the source for a Rails page, you'll see what they mean: the path to a stylesheet might be "/stylesheets/scaffold.css?1268228124", where the numbers at the end are the timestamp when the file was last updated.

So it should work like this:

1. The browser says 'give me this page'
2. The server says 'here, and by the way, this stylesheet called scaffold.css?1268228124 can be cached for a year - it's not gonna change.'
3. On reloads, the browser says 'I'm not asking for that css file, because my local copy is still good.'
4. A month later, you edit and save the file, which changes the timestamp, which means that the file is no longer called scaffold.css?1268228124 because the numbers change.
5. When the browser sees that, it says 'I've never seen that file! Give me a copy, please.' The cache is 'busted.'
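The timestamp trick in steps 4 and 5 can be sketched in a few lines of Ruby. This is not Rails' actual implementation. The helper name timestamped_path and its parameters are made up, and a temporary file stands in for a real stylesheet.

```ruby
require "tempfile"

# Append a file's last-modified time (seconds since the epoch) to its URL.
# timestamped_path, url_path and file_path are hypothetical names.
def timestamped_path(url_path, file_path)
  "#{url_path}?#{File.mtime(file_path).to_i}"
end

Tempfile.create(["scaffold", ".css"]) do |file|
  file.write("body { color: black; }")
  file.flush
  before = timestamped_path("/stylesheets/scaffold.css", file.path)

  # Simulate an edit a month later by moving the modification time forward.
  later = File.mtime(file.path) + 30 * 24 * 60 * 60
  File.utime(later, later, file.path)
  after = timestamped_path("/stylesheets/scaffold.css", file.path)

  puts before
  puts after
  puts before == after   # false: to the browser, this is a brand-new file
end
```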

Bringing it to PHP

Clever! Now how can we borrow that idea in a PHP app?

The first step, of course, is to set the server to tell browsers 'cache these files.' The example config above worked for me[2].

The second step is to append timestamps to your filenames. Here's a first-pass attempt at that:

That basically works - the time stamp is appended to the file name. But it's not nearly as streamlined as the Rails way, for a couple of reasons.

1) You have to type the file name twice - don't repeat yourself!
2) Come to think of it, all your stylesheet links are going to be the same format. Why keep typing in the boilerplate stuff?

In Rails, you'd just do this:

<%= stylesheet_link_tag 'main' %>

Slick! Helper tags like these take a lot of the drudgery out of HTML when you're using Rails.

A loose approximation in PHP could be generalized to handle different file types. For example, you might write a function like this:

...which could then be used like this:

cachedFile('css','jquery-ui-1.7.1.custom.css');
cachedFile('css','main.css','title="Default"');
cachedFile('js','jquery-1.4.min.js');

Notice that this function assumes something - that your javascript files and stylesheets will always be in a particular folder. That's part of Rails' "convention over configuration" mentality: if you always do something the same way, you only have to specify it once.

Now, there's still room for improvement. For example, the type could be extracted from the filename, so that's one less argument to pass in. And more file types could be added. But this function already accomplishes several good things:

1) It gets your files to be cached by the browser and to bust the cache when necessary
2) It cuts down on code repetition
3) Naming the function cachedFile makes its purpose obvious

Now - how can you verify that this is working? I had the same question myself.

As Andy on Stackoverflow pointed out, you can load your page in Firefox, use the Firebug add-on, and look in the "Net" panel as you load the page. For any file that's cached, you should see a status message of 304 Not Modified. For anything that's pulled from the server, you should see 200 OK.

Try it:

1) Load the page to request everything once
2) Reload it to verify that things are being cached
3) Make a trivial change to a cached file, so that its timestamp will change
4) Reload the page again and verify that it was requested
5) Reload the page one last time and verify that it's cached again
6) Set up an elaborate Rube Goldberg machine to pat yourself on the back

(Step 6 is optional.)

Great ideas are worth borrowing

One reason that Rails has become so popular is that it codifies a lot of clever ideas and best practices into easy-to-use shortcuts. You can make a whole app with Rails without ever realizing that it's pulling the trick shown here on your behalf.

You don't have to use Rails, but if you see a great idea, it's always worth asking: "can I borrow this?"

Now go bust some caches!

[1]There's a danger here: notice that the Rails docs say all asset paths. If you set Apache to tell the browser to cache all images, style sheets and scripts for a year, and you only use a cache busting strategy for some of those things, then your visitors won't see updated versions of the others unless they clear their own browser cache manually or do a hard refresh with Ctrl+F5.

[2]I put this information into Apache's main config file, httpd.conf. If you're using a web host, they probably don't give you access to that, but they may have configured Apache to look for .htaccess files in your project folders. If so, you can set caching rules there.

Monday, March 1, 2010

It's not magic

A while back, I had a small epiphany. I'd been asked to create a web form that could send emails with attachments. I already had forms that sent email, but attachments? What were they? In my mind, attachments were the little icons above the email. I had no idea how they were created or sent.

At the same time, I was reading The Code Book, an entertaining and fascinating look at cryptography - the art of sending scrambled messages, to be unscrambled only by the intended recipient.

The simplest, most brain-dead kind of encryption is a Caesar cipher, where you take all the letters in a message and shift them by the same amount. For example, with a Caesar cipher of 1, this:

HELLO WORLD

becomes this:

IFMMP XPSME

A modern cryptanalyst would pee his pants laughing if you used this method for anything serious. But one thing about it works: you can encode, and you can decode, as long as you know the rules.
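As a rough sketch of "encode and decode with the same rules," here is a Caesar shift in Ruby. The function name caesar is made up for illustration. It only shifts the letters A to Z and leaves everything else alone.

```ruby
ALPHABET = ("A".."Z").to_a

# Shift every letter by `shift` places, wrapping around at the end of the
# alphabet. A negative shift undoes a positive one.
def caesar(text, shift)
  text.upcase.chars.map do |ch|
    index = ALPHABET.index(ch)
    index ? ALPHABET[(index + shift) % 26] : ch
  end.join
end

encoded = caesar("HELLO WORLD", 1)
puts encoded               # IFMMP XPSME
puts caesar(encoded, -1)   # HELLO WORLD
```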

Now, the Code Book traces the development of encryption methods so complex they'll make your head spin, but all of them are systematic: whatever is encoded can be decoded. You just need to know how.

This applies to any method of encoding information, even if it's not cryptography. For example, Morse Code encodes letters as electrical pulses - not to hide the message, but just to transmit it. How cool - you can encode actual language as beeps!

Which shows us another important idea: you can encode anything as anything else. You can encode the weather forecast with colored socks. You can encode the Constitution with duck calls. As long as there are consistent rules for encoding and decoding, it will work.

Going back to my email attachments, I soon discovered how attachments work: the file data is encoded as text in your email. It ends up looking like gibberish, but fortunately, no human has to read it, because the email program does that for you.

How does it know which parts of the text are text and which parts are files? You tell it. Say you want to attach an image. First, you choose some arbitrary string of characters which will probably never occur in an actual email text. Let's say it's "Woo_hamburgers_for_mayor_in_2050_yeeeeeha." Whenever you use that phrase, it means "I'm about to put in some different content." Each section of content also gets a label about what "MIME type" it is, like "image/jpeg" or "application/pdf". Your email text goes in one section, and your image data goes in another, after being encoded as text.

(In PHP, you can encode the data as text like this: base64_encode($fileContents).**)
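To make the idea concrete, here is a simplified sketch in Ruby of how such a message might be laid out. The function build_message and its parameters are made up, the attachment bytes are fake, and real mail libraries handle many details this skips. Array#pack("m") does the same job as PHP's base64_encode(), with the output wrapped into short lines.

```ruby
# build_message is a hypothetical helper: it lays out a two-part message,
# one part plain text and one part a Base64-encoded attachment.
def build_message(boundary, body_text, filename, mime_type, file_bytes)
  # Base64-encode the file; pack("m") wraps the result into short lines.
  encoded = [file_bytes].pack("m").chomp.gsub("\n", "\r\n")
  [
    "MIME-Version: 1.0",
    "Content-Type: multipart/mixed; boundary=\"#{boundary}\"",
    "",
    "--#{boundary}",
    "Content-Type: text/plain; charset=UTF-8",
    "",
    body_text,
    "--#{boundary}",
    "Content-Type: #{mime_type}; name=\"#{filename}\"",
    "Content-Transfer-Encoding: base64",
    "Content-Disposition: attachment; filename=\"#{filename}\"",
    "",
    encoded,
    "--#{boundary}--"   # the trailing dashes mark the end of the last part
  ].join("\r\n")
end

message = build_message(
  "Woo_hamburgers_for_mayor_in_2050_yeeeeeha",
  "The photo is attached.",
  "photo.jpg",
  "image/jpeg",
  "fake image bytes"   # stand-in for the real file contents
)
puts message
```

The receiving program splits the message on the boundary, reads each part's label, and decodes the Base64 text back into the original bytes.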

On the other end, when someone opens your email, their program is smart enough not to show all the scrambled-looking letters, but instead to say, 'hey, he said this part was an image - let's show it like that.' And it gets decoded. The little icons show up. It works!

The main thing is, it's not magic. For me, this turned on a light in my head. I stood next to a fax machine, and pointed my finger at it. "I know what you're doing with your crazy noises," I said. "You're encoding image data as sound!"

And that's how computers work, all the way down. Image information is encoded as characters, which are encoded as ones and zeros, which are encoded as magnetic charges on a disk platter, which are transmitted as electrical pulses in a circuit board.

It's hard to understand. It's hard to actually believe sometimes that everything I'm doing on screen can be represented as ones and zeros. But there's no magic. And if there's no magic, there's nothing to be scared of. If I work hard enough, I can understand a little piece of it - enough to get something done.

----
**For a detailed walkthrough of my attachment code, see this Stackoverflow post.

The Bing Button

From a NY Times article about upcoming Windows 7 phones:

In addition, Microsoft is requiring phone makers to keep basic elements of its user interface, including a physical button to start Web searches on Bing.

Microsoft. Listen. Nobody wants a button for Bing, or Google either, for that matter. This violates three principles about what I want in a smartphone:

Customizable. If a button goes to a web site, or opens a calculator, or gives me a voice prompt, and I can't change that, I'm going to be frustrated.
Neutral. It's my phone, not yours. I'll use Bing or Google or Jeeves or Big Larry's Virus-Laden Search Emporium if I want to. Don't force your product on me.
General purpose. I don't have a "word processor" key on my computer keyboard. I use on-screen menus for that. There are a million programs I could install, and a billion web sites I could visit. Smartphones are smartphones precisely because they share this characteristic. My flip phone has a single calendar program, take it or leave it. With a smartphone, I could install or write my own, or use one on the web. Having a button that does one thing makes this less like a smartphone and more like a calculator - a single-purpose device. I want a browser, not a Binger.

If you can't put your customers' desires above your need to cross-brand, you're going to make lousy products. And your market share will continue to drop.