Monday, April 4, 2011

DNA and the information revolution

When I was a boy, Carl Sagan swept me and so many others away with the “Cosmos” TV series. And looking at it again, now thirty years later, I am reminded of his contagious enthusiasm and not least his cross-disciplinary cosmic vision where everything is connected, from his “billions upon billions of stars” and the mathematics of ancient Greece, to molecular biology.

In the second episode, titled (with characteristic baroque grandeur) “One voice in the cosmic fugue”, Sagan muses on the information content of DNA. I remember it vividly. Standing between shelves of books representing the human genome, he demonstrates how incredibly much information you can get into that little molecule.

That year, 1980, was also when I bought my first computer, a Sinclair ZX80. I was twelve. It was a marvellous toy – booted in a millisecond, never crashed, and so easy to program (in BASIC) that I felt like a master coder after a few hours.

The ZX80 had one kilobyte of random access memory. You read that correctly. 1024 bytes. This included the display buffer, meaning that as your program grew beyond a few lines, the image on the TV screen had to shrink. Admittedly, there were larger computers around at the time, and soon the IBM PC would be released with a whopping 128 KB or something, but still, it gives an impression of the zeitgeist.

Your computer now has, I don’t know, maybe a gigabyte of RAM – one million times one kilobyte. The colour information contained in a 20x20 pixel square on your computer screen (the size of a single, largish letter) would not fit into the entire memory of the ZX80.
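For the sceptical reader, here is a small sketch of that comparison. It assumes 24-bit colour (3 bytes per pixel) and binary units (1 KB = 1024 bytes); both are my assumptions for illustration, and the variable names are made up.

```python
# Assumption: 24-bit colour, i.e. 3 bytes per pixel.
BYTES_PER_PIXEL = 3
# Assumption: binary units, so 1 KB = 2**10 bytes and 1 GB = 2**30 bytes.
KILOBYTE = 2**10
GIGABYTE = 2**30

ZX80_RAM_BYTES = 1 * KILOBYTE

# Colour data for a 20x20 pixel square.
square_bytes = 20 * 20 * BYTES_PER_PIXEL

# How many ZX80 memories fit into one gigabyte of RAM.
ram_ratio = GIGABYTE // KILOBYTE

print(f"20x20 square: {square_bytes} bytes, ZX80 RAM: {ZX80_RAM_BYTES} bytes")
print(f"Fits in the ZX80? {square_bytes <= ZX80_RAM_BYTES}")
print(f"One gigabyte is {ram_ratio:,} kilobytes")
```

With these assumptions the square needs 1200 bytes, a little more than the ZX80 could hold even before it spent anything on the program or the display.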

I believe this unbelievable explosion in the capacity of computer memory has radically changed humanity’s perception of information.

Sagan mentioned five billion bits in human DNA. Not too far off: the current number is about 2.9 billion base pairs in the haploid human genome. Since it takes two bits to code for a C, A, G or T, that would be about six billion bits, or 725 megabytes. Back then, people were awed. A third event of note in 1980 was the release of the IBM 3380, the first hard disk with gigabyte capacity. It cost $98,000 for 2.52 gigabytes, and was the size of a double wardrobe. You could put three complete human genomes onto one of those:

IBM 3380 hard disk assembly (Nik Clayton, Creative Commons)
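The genome arithmetic is easy to check. The sketch below uses the figures quoted in the text (2.9 billion base pairs, two bits per base, 2.52 gigabytes on the IBM 3380) and assumes decimal megabytes (1 MB = 1,000,000 bytes); the variable names are just for illustration.

```python
# Figures taken from the text.
BASE_PAIRS = 2_900_000_000   # haploid human genome
BITS_PER_BASE = 2            # four letters (C, A, G, T) need two bits
IBM_3380_MEGABYTES = 2520    # 2.52 gigabytes

# Assumption: decimal units, 1 MB = 1,000,000 bytes.
BYTES_PER_MEGABYTE = 1_000_000

genome_bits = BASE_PAIRS * BITS_PER_BASE
genome_megabytes = genome_bits / 8 / BYTES_PER_MEGABYTE

# Number of genomes that fit on one IBM 3380.
genomes_per_disk = IBM_3380_MEGABYTES / genome_megabytes

print(f"Genome: {genome_bits:,} bits = {genome_megabytes:.0f} MB")
print(f"Genomes per IBM 3380: {genomes_per_disk:.2f}")
```

That gives 5.8 billion bits, 725 megabytes, and room for three whole genomes with some space to spare.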

Today, you can store the entire genome on a memory stick. Taking this further, perhaps only about 3% of the genome consists of protein-coding regions (exons) and regulatory sequences. Although we don’t know yet, it is possible that the rest is “junk DNA” with no critical purpose. That leaves about 22 megabytes of useful information. In 1980, that would cost you perhaps $6,000 (four Seagate ST-506 drives, each holding 5 megabytes). Today, it corresponds to four good digital images on the memory card of your camera.
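Again as a rough check, under the same assumptions as the text: a 725 megabyte genome, of which 3% is taken to be useful, and 5 megabytes per ST-506 drive. The names are mine, and the 3% is of course the shaky part.

```python
# Figures taken from the text; the 3% share is itself only an estimate.
GENOME_MEGABYTES = 725
USEFUL_FRACTION = 0.03
ST506_MEGABYTES = 5

useful_megabytes = GENOME_MEGABYTES * USEFUL_FRACTION

# Capacity of the four drives mentioned in the text.
four_drives_megabytes = 4 * ST506_MEGABYTES

print(f"Useful information: about {useful_megabytes:.2f} MB")
print(f"Four ST-506 drives hold {four_drives_megabytes} MB")
```

So the "useful" part comes to just under 22 megabytes, in the same ballpark as what four of those 5 megabyte drives could hold.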

This technological revolution is perhaps why, in 1980, when Cosmos was shown, not many scientists asked the question: How on earth is it possible that you can code for a complete human being, with billions upon billions of cells, and a reasonable brain on top, with such a ridiculously small genome? As a boy, I thought: Is it not incredible that DNA contains so much information? Today, I think: Is it not incredible that DNA contains so little information?

The moral of the story: Always remember that we see nature from a subjective standpoint, liable to change. This applies not only to information content: Bacteria are only very small relative to us. The ice age was a long time ago only relative to our lifetimes. And, perhaps more profoundly, complexity is relative to the size of our brain. How is it possible that evolution produced such complex organisms? Well, to the Alpha Centaurian with a brain the size of a house, a fruit fly is presumably not complex at all, and that it could evolve is no surprise whatsoever.

1 comment:

  1. That the human genome contains only junk DNA is not true anymore. Much of the so-called “junk” appears to code for non-coding RNAs (bad name), which regulate the expression of the protein-coding genes via epigenetic mechanisms. It turns out that the human genome is not as void of information as it was thought to be.

    A nice and informative book on this topic can be found here: