Working from the Ground Up - Genome Research

Back in the 1980s, DNA sequencing was carried out 250 base pairs at a time, and each read was often a 3-4 day process. The genome (the complete DNA) of even a simple species such as Escherichia coli contains around 5 million base pairs, so you can imagine how long it would take to read the entire genome. The genomes of much larger organisms such as ourselves are hundreds of times larger: the human genome contains around 3 billion base pairs!
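To put a number on "how long it would take", here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes a single worker running one read after another, with no overlapping or repeated reads, which is a simplification rather than a description of how any real laboratory operated.

```python
# Back-of-the-envelope estimate using the figures quoted in the text.
GENOME_BP = 5_000_000    # approximate E. coli genome size in base pairs
READ_BP = 250            # base pairs read per sequencing run in the 1980s
DAYS_PER_READ = (3, 4)   # each run took roughly 3-4 days

def years_to_sequence(genome_bp, read_bp, days_per_read):
    """Years needed if runs are done strictly one after another (an assumption)."""
    runs = -(-genome_bp // read_bp)  # ceiling division: number of runs required
    return runs * days_per_read / 365

low, high = (years_to_sequence(GENOME_BP, READ_BP, d) for d in DAYS_PER_READ)
print(f"{low:.0f} to {high:.0f} years")  # prints "164 to 219 years"
```

Under these assumptions a single E. coli genome would have taken well over a century to read.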

Being able to read the genomes of many species is so important (e.g. we need the sequence of a dangerous bacterium to know what makes it 'tick') that DNA sequencing technology has been massively developed and accelerated. We now routinely use robots to generate DNA sequences at a furious rate, sequencing as many as 5-9 million base pairs per day.

So successful has this effort been that the international repositories for DNA sequences (EMBL, NCBI & DDBJ) currently hold over 2000 billion base pairs from thousands of different species.

Clearly, reading DNA sequences is no longer the obstacle to understanding how cells work! But what about the next step? DNA codes for RNA and protein, and it is those products that are used to construct all parts of all living things. We now have the DNA sequences and can use these to predict which proteins can be made, giving us a big clue as to what makes different organisms different, including what makes them dangerous to us.

However, it isn't as simple as all that. Deciding which proteins are expressed is a difficult business at best, and with many millions of base pairs to sort through (assuming an average protein takes up about 1000 base pairs) it is a long process. Naturally computers can help us, and there are many examples of 'automated annotation' of genomes available, all of which have been carried out by computer software (see detailed explanation of the technical procedures here).
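As a very rough illustration of what software can do with raw sequence, the sketch below first applies the 1000-base-pairs-per-protein estimate to a 5 million base pair genome, then scans a short made-up DNA string for simple open reading frames (a start codon ATG followed in the same reading frame by a stop codon). This is a toy, not the method used by any real annotation pipeline: it looks at one strand only, ignores introns and every other complication, and the function name and example sequence are hypothetical.

```python
# Rough count of protein-coding regions that could fit in a 5 million bp genome,
# assuming about 1000 bp per protein as in the text.
GENOME_BP = 5_000_000
BP_PER_PROTEIN = 1000
print(GENOME_BP // BP_PER_PROTEIN)  # prints 5000

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) positions of simple ORFs on the forward strand.

    An ORF here is an ATG followed, in the same frame, by a stop codon.
    min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # remember the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF
    return sorted(orfs)

example = "CCATGAAATTTGGGTAACC"   # made-up sequence for illustration only
for start, end in find_orfs(example):
    print(start, end, example[start:end])      # prints "2 17 ATGAAATTTGGGTAA"
```

Even this toy shows why prediction is hard: finding a candidate reading frame says nothing about whether the cell actually expresses it.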

Unfortunately computers can only work from known information, so to get a full and accurate appreciation of a genome, and to discover new information from that sequence, humans must also annotate manually. As you might imagine, that is a slow process akin to sequencing 250 base pairs at a time back in the 1980s.

There seem to be 3 complete Aspergillus genomes available at NCBI, and there are two specialist Aspergillus resources that host another 11 genomes: CADRE hosts 9 with mainly automated annotation and the Aspergillus Genome Database hosts 2 which are fairly extensively manually annotated.

A recent project funded by NCBI (Gnomon) has attempted to take automated annotation to a higher level of detail. The genome information known about 8 Aspergillus genomes has been pooled and compared with other similar organisms about which a lot more is known (e.g. the yeasts S. cerevisiae and S. pombe). Gradually more information is gleaned when similarities between areas of each genome are found and compared - effectively a shortcut to maximising the information known about all fungal species by using the information already established from one or two species.
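The idea of borrowing knowledge from a well-studied species can be sketched in a few lines. The example below is not how Gnomon or any other real pipeline works; it uses a deliberately crude similarity measure (the fraction of shared 4-letter subsequences, or k-mers) as a stand-in for proper sequence alignment, and the sequences, labels, threshold and function names are all invented for illustration.

```python
def kmers(seq, k=4):
    """Set of all overlapping substrings of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=4):
    """Shared k-mers divided by all distinct k-mers (0 = nothing shared, 1 = same set)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def transfer_annotation(query, annotated, threshold=0.5, k=4):
    """Return the label of the most similar annotated sequence, or None if too dissimilar."""
    best_label, best_score = None, 0.0
    for label, seq in annotated.items():
        score = jaccard(query, seq, k)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None

# Made-up "well-studied" sequences with made-up labels.
annotated = {
    "example label A": "ATGGCTAAAGGTCCTGAACTG",
    "example label B": "ATGTTTCCCGGGAAATTTCCC",
}
# A made-up unannotated sequence that differs from the first one by a single base.
query = "ATGGCTAAAGGTCCTGAGCTG"

print(transfer_annotation(query, annotated))  # prints "example label A"
```

A sequence with no close relative in the annotated set gets no label at all, which is exactly the gap that manual annotation has to fill.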

This research will ultimately be able to tell us about all of the genes in our bodies, when they are expressed and how they are expressed. There is a long way to go, but back in the 1980s it was barely believable that we would be able to read the DNA sequence of one complex organism in our lifetime. That problem was solved very impressively, so perhaps in 30 years' time we will look back and wonder why we ever doubted it could all be done - possibly while running computer models of entire cells!