The Institute for Genomic Research (TIGR), Rockville, MD, 15th and 16th November 2000
Meeting attendees:
Michael Anderson (MA), University of Manchester, UK: postdoctoral researcher:
working on Af; meeting secretary.
Javier Arroyo, Complutense University, Spain: working on Saccharomyces cerevisiae
(Sc); head of sequencing facility.
Terri Attwood (TA), University of Manchester, UK: senior lecturer: working on
protein fingerprint database.
Joan Bennett (JB), Tulane University, USA: professor: working on A. flavus
and A. parasiticus.
Fiona Brinkman (FB), University of British Columbia, Canada: postdoctoral
researcher: working on Pseudomonas genome project.
John Clutterbuck, University of Glasgow, UK: senior lecturer: working on A.
nidulans (An): compiler of linkage map.
Maria Costanzo, Proteome Inc., USA: editor involved with S. cerevisiae,
Schizosaccharomyces pombe (Sp) and filamentous fungal protein databases.
David Denning (DWD), University of Manchester, UK: senior lecturer and consultant
in infectious diseases: working on Af; meeting chair.
Dennis Dixon, National Institute of Allergy and Infectious Diseases (NIAID),
USA: chief of Bacteriology and Mycology Branch.
Rory Duncan, National Institute of Allergy and Infectious Diseases, USA: assistant
program officer, Bacteriology and Mycology Branch.
Tamara Feldblyum (TF), The Institute for Genomic Research, USA: head of the
sequencing facility.
James Galagan, Whitehead Institute, USA: participating in the sequencing of
Neurospora crassa (Nc).
Maria Giovanni, National Institute of Allergy and Infectious Diseases, USA:
associate director, Microbial Genomics.
Rhian Gwilliam, Sanger Centre, UK: formally in charge of physical mapping in
the Pathogen Sequencing Unit (PSU).
Neil Hall (NH), Sanger Centre, UK: project manager in Pathogen Sequencing Unit:
Af and parasites.
Phillip Harriman, National Science Foundation, USA.
Maryanna Henkart, National Science Foundation, USA: head of molecular and cellular
biology.
Frank Kunst (FK), Institut Pasteur, France: head of microbial genomes unit.
Victoria McGovern, Burroughs Wellcome Fund, USA: program officer Emerging Infectious
Diseases.
Paul Magee (PM), University of Minnesota, USA: professor: working on Candida
albicans (Ca).
Ron Morris (RM), University of Medicine and Dentistry of New Jersey, USA: professor:
head of A. nidulans consortium involved in trying to get genome sequenced.
William Nierman (WN), The Institute for Genomic Research, USA: vice-president,
PI of Af project.
Michael Quail (MQ), Sanger Centre, UK: project leader, pathogen subcloning in
the Pathogen Sequencing Unit.
Marie-Adele Rajandream, Sanger Centre, UK: project manager in Pathogen Sequencing
Unit: S. pombe.
Matthew Sachs, Oregon Graduate Institute of Science and Technology, USA: working
on N. crassa genome project.
Charles Staben (CS), University of Kentucky, USA: professor: working on Pneumocystis
carinii (Pc) and N. crassa genome projects.
Geoffrey Turner (GT), University of Sheffield, UK: professor of genetics: working
on A. nidulans.
Carlos Vazquez de Aldana, University of Salamanca, Spain: working on S. cerevisiae.
Owen White (OW), The Institute for Genomic Research, USA: director of annotation.
The Aspergillus fumigatus genome sequencing project.
Progress reports and proposed plans about the genome sequencing
1) Welcome and Introduction: David Denning: stated the aims of the meeting which were to arrive at an overall sequencing plan for the genome and to agree on a nomenclatural system for Af's genes and proteins. Introductions were made.
2) Progress report on the pilot project: Michael Quail: introduced the work of the PSU (head Bart Barrell), which uses 10 % of the sequencing capacity of the Sanger Centre. As well as the program managers in charge of various projects, the unit consists of nine annotators and five bioinformaticians. A BAC library (B28) has been constructed using pBACe3.6 as the vector and size-selected, gel-purified fragments from a partial Sau3AI digest. The library has an average insert size of 80 kb, with a range of 15-160 kb. From a total of ~24,000 clones, 3456 have been picked into replicate microtitre plates and gridded onto membranes, and end-sequencing has been initiated (6x96 plates). These end-sequences are searchable on the Sanger Centre website. The clones have also been screened using a probe generated from the niaD gene, and three positives have been identified. Different restriction enzymes are currently being tested to identify one that generates 20-50 fragments per clone, so that the clones can be used to generate a physical map of overlapping fingerprint contigs. It will be possible to track the fingerprinting on the website. An EcoRI BAC library and a chromosome-specific library of the smallest chromosome will also be made. He stated that they have been able to obtain chromosome-specific libraries of 80-85 % purity from Plasmodium and 50 % from Dictyostelium.
FK: is it worth the effort of making chromosome specific libraries?
MQ: Yes: it provides separate projects for the different sequencing centres. In addition to shotgun cloning and sequencing the gel-purified chromosomal DNA, BAC clones assigned to that chromosome will be sequenced to 2 x coverage (skimmed), which will assist in positioning sequencing contigs.
PM: instead of probing individual chromosomal shotgun clones onto chromosomal DNA strips to establish purity, the purified chromosomal DNA band could be probed back to the strip.
WN: what fraction of the two BAC end sequences were successful? His experience is that the T7 primer is more successful than the SP6 primer.
MQ: 406 out of 536, but stated that failures would be repeated.
RM: why was this strain of Af chosen (AF293)?
DWD: provided an explanation (see page about the isolate)
3) Progress report from the Pasteur: Frank Kunst: His unit has experience of sequencing bacterial genomes up to 6 Mb in size. A BAC library has been constructed by Alain Billaut at the Centre d'Etude du Polymorphisme Humain using DNA provided by Sophie Paris at the Pasteur. The vector used was pBeloBAC11 and the enzyme HindIII. The mean insert size is 77 kb, with a range of 25-190 kb. 8640 clones have been picked and end-sequencing has been initiated: 2000 sequencing reads have been performed from 1000 clones. Of these, 900 are unique and 440 belong to a contig. The high number of sequences that are part of a contig probably reflects preferential cutting by the restriction enzyme. They propose to end-sequence all the picked clones. In addition, they have generated approx. 8000 sequences from a 1 kb library (see Jan 2000 minutes for details). The Pasteur would be able to assist with the sequencing (up to 10 clones) and library construction.
4) Karyotype progress report: Pete Magee: reported the following technical improvements to their (PM and Jo-Anne von Burik) attempts to generate an electrophoretic karyotype (EK). They obtain Novozyme from Rolf Prade and embed conidia rather than protoplasts in the agarose plugs. Generally 1 x TAE gives better results than 0.5 x TBE, but he also suggested that TPE buffer should be tested. He displayed their best karyotype so far, with the following assignments: 1 chr at 5.7; 2 at 5.0; 1 at 4.0; 2 at 3.5 and 2 at 1.5 Mb. If funding is received, they propose to assign 50 Af genes, 25 An genes and 25 BAC end sequences to the EK and restriction enzyme digested DNA blots. The digests will be performed with 8 bp cutters with the aim of generating 60 fragments of approx. 50 kb. The digested DNA will also be probed with a telomeric probe.
Points raised from the floor: CS: wondered if the long time spent in S phase would affect the quality of the EK.
The meeting discussed other forms of physical mapping, such as optical mapping, which was dismissed as too expensive. HAPPY mapping, which costs £25 / marker, was also mentioned. It was suggested that the number of chromosomes could be determined directly by microscopy using a fluorescent label.
PM: stated that probing a Southern with a telomeric probe gave five distinct bands and a smear. He stated that, from his experience with Ca, using agarose plugs rather than strings made no difference to the resolution of the EK. There is also less contamination in larger chromosomal bands, whereas the purity of the smallest Ca chromosome was only 50 %. This background results from random shearing caused by non-specific nucleases and physical stress and, as a consequence, does not build up into contigs during shotgun chromosomal sequencing.
The suggestion that DNA could be re-run after purification was not recommended since too much DNA would be lost.
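As a rough cross-check on the figures reported in sections 2 to 4, the karyotype band assignments can be summed to give a provisional genome size, and the picked clones of the two BAC libraries expressed as fold clone coverage of that size. This is a minimal sketch only: the band sizes, clone counts and mean insert sizes are those quoted in the progress reports, the function and variable names are illustrative, and the result assumes that the current band assignments are correct.

```python
# Provisional genome size from the electrophoretic karyotype, and the
# fold clone coverage of each BAC library. All names are illustrative.

# (band size in Mb, number of chromosomes assigned to that band)
KARYOTYPE_BANDS = [(5.7, 1), (5.0, 2), (4.0, 1), (3.5, 2), (1.5, 2)]


def genome_size_mb(bands):
    """Sum band size x chromosome count over all bands."""
    return sum(size * count for size, count in bands)


def clone_coverage(n_clones, mean_insert_kb, genome_mb):
    """Fold coverage = total cloned DNA (Mb) / genome size (Mb)."""
    return (n_clones * mean_insert_kb / 1000.0) / genome_mb


chromosomes = sum(count for _, count in KARYOTYPE_BANDS)
genome = genome_size_mb(KARYOTYPE_BANDS)

print(f"chromosomes: {chromosomes}, estimated genome: {genome:.1f} Mb")
# Picked clones and mean insert sizes as reported by each centre.
print(f"Sanger library:  {clone_coverage(3456, 80, genome):.1f} x")
print(f"Pasteur library: {clone_coverage(8640, 77, genome):.1f} x")
```

On these figures the karyotype gives eight chromosomes and about 29.7 Mb, so the picked clones represent roughly 9-fold (Sanger) and 22-fold (Pasteur) clone coverage.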
5) The NIH funded TIGR project: Dennis Dixon: introduced the program of funding sequencing projects using U01 co-operative grants and was pleased to announce that the $3 million grant over two years to DWD and WN was the first to receive funding. Tamara Feldblyum: outlined the approach to sequence 12 Mb, which will involve sequencing the 3 mid-sized chromosomes, each of approximately 3.5 Mb. These chromosomes are inseparable on pulsed-field gels and so they will be sequenced together using a shotgun approach. This will consist of constructing 2 and 10 kb insert libraries and a 15-20 kb insert BAC library using DNA isolated from pulsed-field gels, and sequencing clones from the three libraries to a total of 8 x coverage: 100,000 sequences will be read from the 2 kb library to give 5 x coverage, 40,000 sequences from the 10 kb library to give 2 x coverage and 20,000 sequences from the BAC library to give 1 x coverage. At this stage (end of year 1), based on previous projects using this approach, the estimate is that there will be 30-40 physical gaps and ~300 sequencing gaps. The sequence of these chromosomes will be finished using standard approaches and annotated during year 2. No specific attempt will be made to sequence the telomeres and centromeres, since it is not known if they can be cloned and sequenced. (Point from the floor: it has not been possible to walk across the centromeres using cosmids or YACs in An.)
CS: have you considered the possibility that the DNA is methylated?
WN: we will use host/vector systems developed for Drosophila and human and therefore methylated DNA should not cause any problems.
RM: what are the relative costs?
TF: 50 % for the high throughput stage and 50 % for finishing.
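The coverage figures in the TIGR outline follow from the usual relation coverage = (number of reads x average read length) / target size. The sketch below works that relation backwards to show the read length the plan implies. The read counts and coverages are taken from the minutes; the average read length is not stated there and is derived here, and the two target sizes tried (3 x 3.5 Mb and the 12 Mb quoted) as well as all names are illustrative assumptions.

```python
# Shotgun coverage arithmetic for the plan outlined in section 5.
# Read counts and target coverages are from the minutes; the average
# read length is NOT stated there and is derived as an implied value.

PLAN = {  # library: (number of reads, intended fold coverage)
    "2 kb": (100_000, 5),
    "10 kb": (40_000, 2),
    "BAC": (20_000, 1),
}


def implied_read_length(n_reads, coverage, target_bp):
    """Average usable bases per read needed to reach the stated coverage."""
    return coverage * target_bp / n_reads


def fold_coverage(n_reads, read_len_bp, target_bp):
    """Coverage = total bases sequenced / size of the target."""
    return n_reads * read_len_bp / target_bp


for target_mb in (10.5, 12.0):  # 3 x 3.5 Mb, and the 12 Mb quoted
    target_bp = target_mb * 1_000_000
    lengths = {
        lib: implied_read_length(n, cov, target_bp)
        for lib, (n, cov) in PLAN.items()
    }
    total = sum(
        fold_coverage(n, lengths[lib], target_bp)
        for lib, (n, _) in PLAN.items()
    )
    print(f"{target_mb} Mb target: implied read lengths {lengths}, total {total:.1f} x")
```

All three libraries imply the same average read length (525 bases for a 10.5 Mb target, 600 bases for 12 Mb), which amounts to 20,000 reads per 1 x coverage.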
6) The Spanish project: Carlos Vazquez and Javier Arroyo: stated that the Spanish government is prepared to give them sufficient money to sequence at least 2 Mb. Three centres are collaborating: Salamanca University (Miguel Sanchez, Carlos Vazquez, Fernando del Rey) where they have access to one 377 and one 3700 machine and in Madrid at Complutense University (Javier Arroyo, C. Nombela) and the Centro de Investigaciones Biológicas (Miguel Penalva, S. Rodriguez) where they have access to two 377 and one 3700. Previously these groups have been involved in the Sc and Sp sequencing projects and in EUROFAN I and II.
7) General discussion: CS: asked about data release policy.
NH, replying for the Sanger Centre, stated that raw data (1 kb contigs and BAC end sequence) are released immediately and placed on their website. Finished BAC clones and larger units (e.g. chromosomes) are submitted to the databases. However, it was pointed out that when to release data resulting from shotgun approaches was less clear, and the example of Ca was given. The 10 x WGS is being annotated by Stewart Scherer, but he is being pushed to release the data so that micro-arrays can be built.
A point was raised on the intellectual property rights over the sequence before its publication. NH: stated that the Sanger position is that permission should be sought before any unpublished sequence is used in a publication. The consensus was that a gentlemen’s agreement existed that no global analyses should be published on the completed genome sequence before it had been analysed and published by the sequencing centres.
FK: wished to emphasize his view that a whole-genome shotgun strategy would be more efficient than either a clone-by-clone or chromosome-by-chromosome approach. This was the view of his colleagues at the Pasteur as well.
8) Closed session of the sequencers: The following sequencing plan was put together during the closed session. A chromosome-by-chromosome approach will be taken using gel-purified bands of single or multiple chromosomes from the EK. The chromosomes are numbered from the smallest to the largest and there are assumed to be eight.
a) chromosome I (1.7 Mb): Sanger: a 2 kb randomly sheared insert library will be constructed and shotgun sequence generated up to 8 x coverage. BAC clones assigned to the chromosome will be sequenced to 2 x coverage. Finishing will be done using standard approaches.
b) chromosome II (1.8 Mb): Madrid / Salamanca and Sanger: either a BAC-by-BAC clone approach using assigned contigs or a whole chromosomal shotgun approach using a 2 kb library made at the Sanger (as for chromosome I). The Spanish centres will send their traces to the Sanger where the sequence will be assembled and finished.
c) TIGR section: chromosomes III, IV and V (approx. 10 Mb): the 3.5 Mb band will be gel-purified and the DNA used to construct 3 libraries (see section 5). These chromosomes will be shotgun sequenced, finished and annotated solely at TIGR. Chromosome VI (4.0 Mb) will be sequenced as a later project if any sequencing capacity remains. In addition, TIGR will generate end-sequence from 10,000 BAC clones. These clones will be from a library generated using randomly sheared DNA.
d) chromosomes VII and VIII (5.0 to 5.7 Mb): Sanger with assistance from the Pasteur: the DNA in the largest band(s) will be randomly sheared and 2 kb fragments used to construct a library. Sequencing will be performed as for chromosome I. The Pasteur contribution will consist of BAC end sequencing and otherwise utilising their 1 Mb capacity.
Annotation: fungi
1) The Sanger S. pombe project: Marie-Adele Rajandream: The Sanger Centre leads the consortium to sequence Sp and has done one third of the sequencing, which is 98 % complete and consists of <10 physical gaps. A clone-by-clone approach was adopted using a minimal tile of overlapping cosmid and P1 clones. There are approx. 5000 genes, of which 4817 have been analysed and annotated. 43 % of the genes are spliced and one gene lies every 2.4 kb. There is 1.2 Mb of rDNA and 0.4 Mb consisting of telomeres and centromeres. Genefinder is used to predict ORFs, making use of the following data: a species-specific codon table and splice recognition sites, including that of the branch point. The programme Halfwise locates protein domains and therefore positions exons; in addition, splice junctions are located using Fasta. 20 % of the predictions are manually adjusted to increase the confidence level to 95 %. Exons are finally verified using cDNAs where available. Blast searches and InterPro are being used to assign predicted function, and GO terms are being applied.
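The S. pombe figures quoted above can be put through a quick back-of-envelope check. The sketch below is illustrative only: the input numbers are those reported in the talk, the variable names are invented, and it assumes that the 2.4 kb gene spacing refers to sequence outside the rDNA, telomeres and centromeres, which the minutes do not state.

```python
# Back-of-envelope check of the S. pombe annotation figures.
GENES_TOTAL = 5000        # "approx. 5000 genes"
GENES_ANNOTATED = 4817
KB_PER_GENE = 2.4         # "one gene lies every 2.4 kb"
SPLICED_FRACTION = 0.43   # "43 % of the genes are spliced"

# Sequence implied by the gene count and spacing (assumption: this
# excludes the rDNA, telomeres and centromeres).
gene_space_mb = GENES_TOTAL * KB_PER_GENE / 1000
annotated_pct = 100 * GENES_ANNOTATED / GENES_TOTAL
spliced_genes = round(GENES_TOTAL * SPLICED_FRACTION)

print(f"gene-bearing sequence: ~{gene_space_mb:.1f} Mb")
print(f"genes annotated so far: {annotated_pct:.1f} %")
print(f"genes with introns: ~{spliced_genes}")
```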
2) Data release policy at the Sanger: Neil Hall: Data such as BAC end-sequence and contigs over 1 kb are placed on the various Sanger FTP and Blast server sites every day. Other data such as that associated with the physical mapping are also released. Finished and annotated sequence is submitted to EMBL. Researchers are encouraged to use these data, but the Sanger likes to be acknowledged. However, whole genome analysis is discouraged before publication of the complete sequence. The clone-by-clone approach, of course, has the advantage that finished sequence is released in discrete packages, whereas with shotgun approaches, annotation is not done until the sequence has been finished. However, automatic annotation is performed on contigs over 2 kb which can be searched by keyword.
3) Comments about annotation: Owen White: explained TIGR’s data release policy. Previously they did not release all data immediately, but they do now. Preliminary data are available by FTP and the TIGR databases can be searched with your own sequence. There is a disclaimer regarding the use of these data before they are published. Completed sequence is submitted to GenBank. He made the case for all annotation of the Af genome to be done using an agreed standard operating procedure. A group of programmes should always be run and new ones can be plugged in. All preliminary analysis needs to be done automatically and run in the background so that new contigs can be checked every day. To help with manual annotation, a set of rules needs to be established to ensure consistency. However, we will have to accept that individuals will still assign the same sequences differently.
4) The Neurospora crassa project: Matthew Sachs: introduced the NSF funded project ($5.25 million) where the sequencing will be done at the Whitehead Institute in Boston and the annotation and community interaction at Oregon Graduate Institute. A policy committee has been elected. He referred to the new edition of the Neurospora Compendium, which lists all of the genes, and to the Fungal Genetics Stock Center, which has 13,000 strains including 4,700 wild-types. 1000 markers have been mapped into 7 linkage groups and 247 genes have been cloned and sequenced. There is little repetitive DNA and few highly similar duplicated genes. In Germany they have sequenced chromosomes II and V using ordered BAC and cosmid clones. At the University of Georgia, they are ordering cosmid clones and the mapped tile will be used for mutant analysis. The cosmid clones are also being end-sequenced. A database of ESTs has been generated. The genome project is relatively inexpensive and only key features will be annotated.
James Galagan: introduced the sequencing by stating that high throughput data collection is here: sequencing capacity is no longer limiting, with a cost of $0.13 per base for finished sequence. The problem now is how to handle the data. The goal for this project is to generate finished sequence. The shotgun phase involved 10 x coverage of paired reads from 4 kb inserts of random clones. This sequencing was done in one week in September 2000 (1.2 million reads). Paired reads from large inserts will also be generated. The assembly will be verified using the genetic map and paired reads from BAC clones which will not have been included in the assembly. At the finishing stage, the estimate is that there will be 800 gaps. PCR will be used to close gaps.
Chuck Staben: introduced the annotation, which will only be done once, on the finished sequence. It will involve automatic annotation and they will apply for additional money to perform manual annotation and maintain the sequence. Genefinding programmes will need to be retrained. The genes in GenBank have already been analysed and 200 whole cDNAs will be sequenced as well. These data will be used to help locate exons.
5) The Proteome protein databases: Maria Costanzo: stated that they receive funding from companies and from the government. NIAID have funded projects on Sp, Ca and on fungal pathogens. The idea is to incorporate all biological knowledge on proteins and the genes that encode them. They do not include genes that don’t encode proteins. This knowledge is obtained from functional genomics data and from the biological literature. Entries include constantly updated descriptive title lines, controlled vocabulary description lines, Blast hits which are updated every week, free-text annotations and a complete reference list. Proteome have been using their own classification system, but will be using the GO system. The free-text annotations are written by annotators who have a PhD on that organism and are then edited. The annotators keep up with the current literature and try to cover the older literature as well. For instance, the entire Sp protein literature has been covered. Presently with the CalPD database, sequences from GenBank are being annotated. They have just received funding to set up a MycoPath database which will cover the pathogenic fungi such as Candida, Aspergillus, Pneumocystis carinii, Histoplasma, Blastomyces, Coccidioides and Cryptococcus.
Annotation: general aspects
1) An approach to community-based genome annotation: the Pseudomonas example: Fiona Brinkman: Pseudomonas aeruginosa (Pa) is the third most cited bacterium in the literature after E. coli and Staphylococcus aureus. The genome sequence of Pseudomonas was published in 2000. There are ~5600 ORFs in the 6.3 Mb genome, which is the largest bacterial genome sequenced to date. All the annotation was manually curated by two to three annotators, references were checked and a confidence level assigned: 1 indicates that the gene/protein has been studied in Pa; 2 that there is a high level of similarity to a gene studied in E. coli; 3 that the protein may have this function and 4 that the sequence is hypothetical. Expert knowledge was incorporated into the automated programs; for instance on the start positions of genes. The community-based annotation of Pseudomonas was begun in 1998 and was called the Pseudomonas aeruginosa community annotation project (PseudoCAP). Its chair is Bob Hancock and FB is its co-ordinator. There is a committee which acts as an aid in decision making. They recruited 61 researchers in 10 countries and the whole project was done electronically and involved no paper. The researchers were provided with automated analyses such as Blast and had access to a password protected area of the website. They were given a deadline for submission, which in hindsight should have been set earlier since the work was not one night’s effort. The researchers were permitted to submit whatever they wanted, in order to establish what researchers would be willing to submit on a volunteer basis. However, the requirement could have been set higher and used a pre-set format. This would be recommended for subsequent projects. Greater than 90 % of those recruited made submissions. All submitted annotations were reviewed by FB. However, the final call on the annotation of ORFs was made by the core annotators of the genome project team.
Problems encountered were that the quality varied and the formats differed. However, these were a reflection of the prototypical nature of this particular project. Based on PseudoCAP’s analysis of what types of annotations researchers are willing to submit, subsequent projects could utilize more defined submission criteria. One final lesson from this project is to be conservative.
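The four confidence levels described for the Pseudomonas annotation can be written down as a small data structure, which may help when drafting an equivalent set of rules for Af. This is a sketch under stated assumptions rather than PseudoCAP's actual procedure (which was manual curation): the class names, the helper function and its three yes/no inputs are hypothetical, and the default to the lowest level reflects the advice to be conservative.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Confidence classes as described in the talk (1 = strongest evidence)."""
    STUDIED_IN_PA = 1        # gene/protein has been studied in P. aeruginosa
    SIMILAR_TO_STUDIED = 2   # high similarity to a studied gene (E. coli in the minutes)
    POSSIBLE_FUNCTION = 3    # the protein may have this function
    HYPOTHETICAL = 4         # the sequence is hypothetical


def assign_confidence(studied_in_pa, high_similarity, functional_hint):
    """Return the strongest class supported by the evidence (hypothetical helper).

    With no supporting evidence the ORF stays 'hypothetical', which is
    the conservative choice.
    """
    if studied_in_pa:
        return Confidence.STUDIED_IN_PA
    if high_similarity:
        return Confidence.SIMILAR_TO_STUDIED
    if functional_hint:
        return Confidence.POSSIBLE_FUNCTION
    return Confidence.HYPOTHETICAL


# Example: an ORF with only a weak functional hint.
level = assign_confidence(False, False, True)
print(level.name, int(level))
```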
2) Tools for function prediction: Terri Attwood: reviewed the protein domain and fingerprint databases - PROSITE, Pfam and PRINTS. These databases are available as part of the InterPro package which includes a graphical output. She emphasised the importance of using an integrated approach that exploits several overlapping family databases (e.g., InterPro). However, InterPro is not fully comprehensive (there are many databases it doesn't include) and so annotation strategies should not rely solely on it, but should utilise other resources. Care should always be taken - so-called "expert systems" have been developed that integrate virtually the same limited selection of databases (thus they are neither comprehensive, nor expert). Using results from the PRINTS fingerprint database, TA gave examples where incorrect diagnoses had been made using current naive expert systems. The bottom line with function prediction is to develop integrated strategies with care, always bearing in mind the strengths and weaknesses of the different approaches.
The naming of new genes in A. fumigatus: Michael Anderson and Joan Bennett
JB: introduced the session by summarising the various nomenclatural systems which are often organised by one individual. This contrasts with the commissions that exist for the naming of chemical compounds and organisms. She also made the point that journals need to act as gate-keepers as well.
MA: continued by demonstrating that there is no consistency in the current naming of genes in Af and proposed that a genetic name be assigned to those ORFs which had either been studied in Af or for which the orthologue had been studied in another Aspergillus species. However, the meeting consensus was that it certainly would not be possible to assign orthologues with confidence during the genome project and that therefore the genetic names of homologous genes were not to be assigned to ORFs in Af.
CS: stated that the time to evaluate orthologues and paralogues was at the end of the project. OW: stated that even though orthologues might not be identifiable during the sequencing phase of the project, the annotation will have to be explicit regarding similar sequences and will have to record the four character (genetic) names of these hits within the entry.
OW: reviewed the problem regarding the assignment of names to genes/ORFs where a unique identifier is required in four characters. He stated that generally researchers prefer to use this name rather than the systematic name which generally consists of a seemingly random mixture of letters and numbers and is far less memorable. He outlined how a gene codes for a protein which might have an EC number and a common name (e.g. catalase). The gene (ORF) will have a systematic name, which acts as a unique license plate, and a genetic name.
MA: summarised the results of a web survey that was carried out to canvass opinion on which nomenclatural system should be adopted. 88 replies were evaluated and 58 % were in favour of using the An system versus 31 % in favour of the Sc system. Amongst published Af researchers the figures were 62 % for An and 38 % for Sc. On the use of species prefixes, the survey showed that 57 % were for using them only when required to distinguish between species in a discussion and 43 % were for using them all the time. The meeting carried the majority opinion and so we will recommend that a species prefix be used only when required. The meeting also agreed on the use of a three letter prefix consisting of the initial letter of the genus, with the second and third letters taken from the species name.
The following points were raised regarding which nomenclatural system to use. What’s the point of a model which is a close evolutionary relative if we don’t relate Af to An?
GT: highlighted a comment from one of the web responses: "Why would we want Aspergilli genes confused or mistaken for S. cerevisiae genes? The more different and distinct we can be from yeast work the better!" Criticisms regarding the use of the An system from the web questionnaire responses were addressed: 1) distinguishing between dominant and recessive alleles is not as clear as with the Sc system - as dominance is a property of individual alleles rather than of the gene, making this distinction is usually not helpful, but can be indicated if necessary using an optional superscript; 2) the number of genes named after the same biological feature is limited by the alphabet (e.g. stcA-Z) - this is a trivial problem which can be solved by using double letters (e.g. stcAB) or a related name (e.g. stdA).
The meeting consensus was to recommend the use of the An convention for Af.
A proposal to be published will be prepared on behalf of all of those present
recommending the adoption of this convention.
The meeting closed with a vote of thanks to all for coming and contributing.
Michael J. Anderson
January 30, 2001