Original document: http://www.mccme.ru/albio/slides/christen.pdf (last modified Thu Oct 9 22:25:13 2008)

The Massive Parallel Sequencing era: "Global sequencing"

Richard Christen, CNRS UMR 6543 & Université de Nice, christen@unice.fr, http://bioinfo.unice.fr



By the end of 2007, three next-generation sequencing platforms had appeared: Roche/454's Genome Sequencer FLX (which succeeded a first model), Illumina's Genome Analyzer, and Applied Biosystems' SOLiD sequencer. In many applications they will replace the "old" Sanger technology (ABI 3730XL).


"The capacity and throughput of the 454 FLX system is quite similar to the Solexa system, if one can afford to run it twice a day".

If run at maximum capacity, per year:
· consumes about 5.3 million [currency missing],
· generates about 75 gigabases of data.

· Lower the cost of sequencing DNA.
· Simplify the sequencing process (no cloning).
· Produce hundreds of thousands or millions of sequences at once.



Tasks and problems
· Genomes
  - Resequencing genomes.
  - De novo sequencing a genome.
· Transcriptomes.
· Biodiversity.
  - SSU rRNA sequences
  - Metagenomes



Resequencing a genome
· 454: less than 1 million US $, 7.4-fold redundancy in two months. 234 runs of 454 produced over 105 million bases per run.
· Sanger: approximately 100 million US $.
· 3.3 million mutations, of which 10,654 cause changes in proteins.


Resequencing genomes
454

Two four-hour runs generated a total of ~800 thousand sequences with an average length of about 100 bases, giving more than 20X coverage of the whole genome of the strain. Functional analysis of the differences revealed a total of 24 genes that may be associated with the loss of virulence.



Sequencing new genomes
454 & Sanger

· 454: in total, 12.5 million reads corresponding to 2.1 billion bases were produced.
· Sanger: 6.2 million reads, for a total of 3.5 billion bases, were produced from 43 libraries.
· The genome size of V. vinifera is 504.6 Mb.
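
The figures on this slide can be turned into mean read lengths and fold coverage with simple arithmetic (bases / reads, and bases / genome size). The following is a minimal Python sketch using only the numbers quoted above; the `ReadSet` class and its field names are illustrative assumptions, not part of the original work.

```python
from dataclasses import dataclass

# V. vinifera genome size quoted on the slide (504.6 Mb).
GENOME_SIZE_BP = 504_600_000


@dataclass(frozen=True)
class ReadSet:
    """A batch of reads from one technology (hypothetical helper type)."""
    name: str
    n_reads: int
    n_bases: int

    @property
    def mean_read_length(self) -> float:
        # Average number of bases per read.
        return self.n_bases / self.n_reads

    def coverage(self, genome_size_bp: int) -> float:
        # Fold coverage: how many times the genome is sequenced on average.
        return self.n_bases / genome_size_bp


READ_SETS = [
    ReadSet("454", 12_500_000, 2_100_000_000),
    ReadSet("Sanger", 6_200_000, 3_500_000_000),
]

for rs in READ_SETS:
    print(f"{rs.name}: mean read length {rs.mean_read_length:.0f} bases, "
          f"coverage {rs.coverage(GENOME_SIZE_BP):.1f}X")
```

With these rounded inputs the 454 reads average 168 bases and contribute roughly 4X coverage, while the Sanger reads contribute roughly 7X.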



Problems
· Genomes
  - Resequencing genomes.
    · Assemble fragments with the help of the known reference genome. Easy & known.
  - De novo sequencing a genome.
    · Assemble fragments without the help of a known reference genome. More difficult & known.
  - Identification of genes, regulatory regions, mutations, ...
    · Difficult but known.

A flood of data to come


Genomes : assembling the tags
2008
· Zerbino, D. R., and E. Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:821-829.
· Butler, J., I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe. 2008. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820.
· Hernandez, D., P. Francois, L. Farinelli, M. Osteras, and J. Schrenzel. 2008. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18:802-809.
· Chaisson, M. J., and P. A. Pevzner. 2008. Short read fragment assembly of bacterial genomes. Genome Res. 18:324-330.
2007
· Dohm, J. C., C. Lottaz, T. Borodina, and H. Himmelbauer. 2007. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 17:1697-1706.

Conclusions:
· The work is "as before", except that the sequences to assemble are shorter and far more abundant.
· Judging by the publications, this is a very active field.
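
The Velvet reference above names de Bruijn graphs as the basis for short-read assembly. The sketch below illustrates the idea on a toy example: every k-mer in a read becomes an edge between its two overlapping (k-1)-mers, and a contig is read off by walking the unbranched path. It is a simplified illustration under strong assumptions (error-free reads, one strand only, no repeats), not the algorithm of any of the cited tools; all function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble_linear(reads, k):
    """Rebuild one contig by walking the unique unbranched path of the graph."""
    graph = build_de_bruijn(reads, k)

    # Count incoming edges to find the node where the path starts.
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    starts = [node for node in graph if indegree[node] == 0]
    if len(starts) != 1:
        raise ValueError("toy assembler expects exactly one start node")

    node = starts[0]
    contig = node
    visited = {node}
    # Extend while the current node has exactly one successor.
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # guard against cycles (repeats)
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GCGTGCA", "GTGCAATC"]
print(assemble_linear(reads, k=5))
```

Real data add sequencing errors, both strands and repeated regions, which is where most of the effort in the cited papers goes.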

A flood of data to come



Gene expression analyses
454

· Over 30 million bases of cDNA from first larval stage worms. Approximately 14% of the newly sequenced expressed sequence tags do not map to annotated genes: these are novel genetic structures.
· Approximately 15 million cDNA sequence reads with lengths of 105 bp each: rapid and efficient analysis of gene expression in tumors.



Gene expression analyses
These new data sets are very similar to those of earlier technologies such as EST (Expressed Sequence Tags), except that:
· Sequences are shorter (but not by much with 454 technology).
· There are many more sequences (in the range of 100 to 1000 fold).

Remarks: most labs use bioinformatic tools that are not well adapted, in particular Blast (or Blat); Blast was written in 1990 with far fewer sequences in mind.

Biologists need tools to:
· Assemble tags into a cDNA (not always).
· Map the tags onto a reference genome.
· Make sense of the data (compare samples, cluster tags & samples, link to knowledge databases).

Some tools simply need to be improved from earlier ones developed for EST, SAGE and DNA chip technologies.
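
One way to cope with very many short tags when mapping them onto a reference is to index the reference once by its k-mers, then look up each tag through a short seed and verify the full match. The sketch below shows that idea in plain Python. It is an assumption-laden illustration (forward strand only, no gaps, seed taken from the start of the tag) and does not reproduce how Blast or Blat work; the function names are hypothetical.

```python
from collections import defaultdict


def index_reference(reference, k):
    """Record every start position of every k-mer in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_tag(tag, reference, index, k, max_mismatches=0):
    """Return reference positions where the tag aligns without gaps.

    The first k bases of the tag are used as an exact seed; each seed hit
    is then checked base by base against the reference.
    """
    hits = []
    for pos in index.get(tag[:k], []):
        window = reference[pos:pos + len(tag)]
        if len(window) < len(tag):
            continue  # tag would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(tag, window))
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


reference = "ACGTACGTTAGCCGATAGCTAGGCT"
k = 4
index = index_reference(reference, k)
print(map_tag("TAGCCGAT", reference, index, k))
print(map_tag("TAGCTAGC", reference, index, k, max_mismatches=1))
```

The index is built once and reused for every tag, so the cost per tag depends on the number of seed hits rather than on the length of the reference.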

A flood of data to come



Studying biodiversity, why?
· Most of the earth's biomass is not visible to the naked eye.
· These prokaryotes or protists are very difficult (impossible) to identify under a microscope.
· They produce more than 50% of the oxygen, and almost entirely recycle the inorganic matter on earth (nitrogen, phosphates, ...).
· They could play a significant role in the process of "Global Warming".
· But: we have almost no idea of how many species there are, or of which is doing what and when...


The "Loop"

[Figure: diagram of the loop. Labels: light, CO2, primary production, bacteria (10^8 cells / ml), protist grazers, larger grazers, detritus.]

Primary production: mostly in oceans, mostly microbes.
The loop has been near equilibrium for a long time.


CO2 in atmosphere

Greenhouse gases like CO2 are increasing in the atmosphere

[Figure: atmospheric CO2 plotted against year.]


The "Loop"

[Figure: diagram of the loop. Labels: light, CO2, primary production, bacteria (10^8 cells / ml), protist grazers, larger grazers, detritus.]

How will the loop react to increased CO2?


The identification of microbes
· Culture them: not possible.
· Sequence their genomes: not feasible.
· Use a gene present in the genome of every cell.
  - First done in 1977.
  - Now the procedure of choice in every lab in the world.
    · Human gut, mouth, wounds, ...
    · Sea water, field soil, deep earth, ice, very hot waters (>100 °C), ...
    · Industry & agriculture.
  - They are many, everywhere.
  - The gene used codes for the ribosomal RNAs (which structure the machinery that makes proteins).


Studying biodiversity, the "classic" approach

1. Purify the DNA.
2. Extract all the ribosomal gene sequences.
3. Clone the ribosomal RNA genes of every cell.
4. Randomly sequence as many clones as possible.
5. Analyse results, compare samples.
6. Publish your results.

Genome Res. 2006 16: 316-322



Biodiversity analyses - classic

PCR - clone - sequence: too tedious for most labs!


[Figure: the "Clone & sequence" step is crossed out.]

Sequence every gene isolated: > 400,000 sequences per day


Biodiversity, case studies
· Huber, J. A., D. B. Mark Welch, et al. (2007). "Microbial population structures in the deep marine biosphere." Science 318(5847): 97-100.
· Sogin, M. L., H. G. Morrison, et al. (2006). "Microbial diversity in the deep sea and the underexplored "rare biosphere"." Proc. Natl. Acad. Sci. USA 103(32): 12115-20.
· Roesch, L. F., R. R. Fulthorpe, et al. (2007). "Pyrosequencing enumerates and contrasts soil microbial diversity." ISME J. 1(4): 283-90.


Tag dereplication

Problems:
· Strict dereplication?
· Loose dereplication?

[Figure: log-scale plot for samples FS396 and FS312; vertical axis from 1 to 100,000, horizontal axis from 1 to 19,691.]
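
Strict dereplication can be read as collapsing tags that are exactly identical and keeping a count for each distinct sequence. The sketch below shows that reading in Python; treating upper and lower case as the same base is an assumption made here, and loose dereplication (merging tags that are merely similar) would need a similarity threshold that the slide does not specify, so it is not shown. The function names are hypothetical.

```python
from collections import Counter


def dereplicate_strict(tags):
    """Collapse identical tags; return (sequence, count) pairs, most abundant first."""
    counts = Counter(tag.upper() for tag in tags)
    return counts.most_common()


def rank_abundance(dereplicated):
    """Counts ordered by rank, i.e. the values behind a rank-abundance plot."""
    return [count for _, count in dereplicated]


tags = ["ACGT", "ACGT", "acgt", "TTGA", "TTGA", "CCCC"]
unique_tags = dereplicate_strict(tags)
print(unique_tags)
print(rank_abundance(unique_tags))
```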



Clustering tags into OTUs
Operational Taxonomic Unit (OTU): cluster together tags that are similar.
· How to define similarity, i.e. how to calculate distances?
· How to cluster?

Usual manner for a few long sequences:
· Do a multiple alignment.
· Compute phylogenetic distances.
· Build a phylogeny or use various clustering methods.

But:
· Too many sequences to align.
· Domains are too divergent for present multiple alignment methods.

· Cluster according to word frequencies (e.g. words of 5 nt)?
· No alignment, much faster, much better?

We need cleaned experimental data sets to evaluate methods & algorithms.
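
The word-frequency idea raised on this slide can be sketched without any alignment: each tag is summarised by the relative frequencies of its 5-nt words, two tags are compared through the distance between those profiles, and tags are grouped greedily. The Euclidean distance, the greedy first-fit strategy and the threshold value below are all assumptions chosen for illustration; the slide leaves the choice of distance and clustering method open.

```python
from collections import Counter
from math import sqrt


def word_profile(seq, k=5):
    """Relative frequency of every overlapping k-nt word in a sequence."""
    words = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(words.values())
    if total == 0:
        raise ValueError("sequence is shorter than the word length")
    return {word: count / total for word, count in words.items()}


def profile_distance(p, q):
    """Euclidean distance between two word-frequency profiles."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


def greedy_cluster(tags, k=5, threshold=0.1):
    """Put each tag in the first cluster whose seed profile is close enough,
    otherwise open a new cluster seeded by that tag."""
    seeds = []     # profile of the first tag of each cluster
    clusters = []  # list of lists of tags
    for tag in tags:
        profile = word_profile(tag, k)
        for i, seed in enumerate(seeds):
            if profile_distance(profile, seed) <= threshold:
                clusters[i].append(tag)
                break
        else:
            seeds.append(profile)
            clusters.append([tag])
    return clusters


a = "ACGTACGTACGTACGTACGT"
c = "TTTTTGGGGGCCCCCAAAAA"
print(greedy_cluster([a, a, c]))
```

Whether such distances track taxonomy well enough is exactly the open question; answering it requires reference data sets of known composition.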


Assign each tag to a taxon
Clustering may be fine for comparing samples, but it provides no hint about:
· Which species are present?
· What do they do?
· What is the significance of a change in composition over time or space?

We need to assign each tag or each OTU to a name. Ideally, assign as many as possible:
1. To a known species (which is in culture somewhere).
2. To an unknown but sequenced species (genome sequenced, but no culture).
3. To a sequence found elsewhere.

Assignments are done by similarity to the public sequence databases (Blast).


Assign each tag to a taxon

BMC Microbiology 2007, 7:108


Assign each tag to a taxon

Simulated resolution at increasing read-lengths

BMC Microbiology 2007, 7:108


Numbers of 16S rRNA sequences per species

· Only 8,000 species in culture!
· Most species are known from a single sequence!
· The taxonomic specificity of tags is overestimated.
· Most species have not been sequenced at all.


Main taxa that were not amplified

Primers need to be better designed !


New tags as a function of sequencing effort (saturation curve)

[Figure: saturation curve; one axis from 0 to 25,000, the other from 0 to 500,000.]

Even when sequencing 400,000 tags, we were not able to sequence every species present. We are still missing the rare ones.
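
A saturation curve of this kind can be computed by drawing the tags in random order and recording how many distinct tags have been seen after each increment of sequencing effort. The sketch below is one simple way to do that in Python; the single random shuffle, the fixed seed and the function name are assumptions for illustration, and published analyses often average over many random orders.

```python
import random


def saturation_curve(tags, step, seed=0):
    """Return (tags sequenced, distinct tags seen) points along one random order."""
    rng = random.Random(seed)
    shuffled = list(tags)
    rng.shuffle(shuffled)

    seen = set()
    curve = []
    for n, tag in enumerate(shuffled, start=1):
        seen.add(tag)
        # Record a point every `step` tags, and always at the very end.
        if n % step == 0 or n == len(shuffled):
            curve.append((n, len(seen)))
    return curve


tags = ["A"] * 6 + ["B"] * 3 + ["C"]
for effort, distinct in saturation_curve(tags, step=5):
    print(effort, distinct)
```

A curve that is still rising at the last point, as on this slide, indicates that further sequencing would keep uncovering new tags.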


The singletons!
· A singleton is a sequence that was found only once.
· How many singletons in these experiments?

Experiment   Total tags   Unique tags   Singleton tags   % Singletons
Il                31745          9486             7337             23
Br                26115          7683             5598             21
Ca                53245         14885            11638             22
Fl                28247          8779             6792             24

Experiment   Total tags   Unique tags   Singleton tags   % Singletons
53R                4999          2655             2297             46
55R               13901          7186             6217             45
112R               9281          5751             5040             54
115R              11004          5776             5009             46
138               14373          7167             6237             43
FS396             17665          8699             7587             43
FS312              4834          2769             2396             50
FS396            247825         10613             7185              3
FS312            442061         21529            13251              3
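
The "% Singletons" row is consistent with the number of singleton tags divided by the total number of tags, rounded to the nearest percent. The Python sketch below shows that calculation, both from raw tags and from the counts reported on the slide; the function names are hypothetical.

```python
from collections import Counter


def percent_singletons(total_tags, singleton_tags):
    """Singleton tags as a rounded percentage of all tags sequenced."""
    return round(100 * singleton_tags / total_tags)


def singleton_stats(tags):
    """Return (total tags, unique tags, singleton tags, % singletons)."""
    counts = Counter(tags)
    total = sum(counts.values())
    unique = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return total, unique, singletons, percent_singletons(total, singletons)


# Counts taken from the slide: (total tags, singleton tags).
print(percent_singletons(31745, 7337))    # experiment "Il"
print(percent_singletons(442061, 13251))  # deepest run in the table
print(singleton_stats(["a", "a", "b", "c"]))
```

The two deepest runs (247,825 and 442,061 tags) show how the share of singletons among all tags falls to about 3% as sequencing depth increases, even though thousands of singleton sequences remain.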



Many genomes are now sequenced. More than 200 marine microbes are now being sequenced.

[Figure: sequenced genomes over time up to 2007, with the draft of the human genome marked.]


What is a metagenome?
· Metagenome experiments consist of:
1. Extract the DNA from a given sample. Status: no problem.
2. Sequence it all. Status: now almost feasible.
3. Try to assemble these pieces to reconstitute the different genomes that were present in the sample. Status: works only for samples with few different genomes (presently fewer than 10).
4. Try to make sense of this assembly. Status: presently impossible.

NOTE: the first metagenome (Sargasso Sea sample) provided more protein sequences than were already known. This required building a new division for storage in the public database.


Technical problems
- Lack of complete sequences to evaluate primers.
- A single sequence available for a majority of species.
- Most sequences have a poorly annotated taxonomy.
  · Only 112,509 (16.8 %) of the 670,401 bacterial 16S rRNA gene sequences of length >100 nt presently deposited have a taxonomic description down to the genus level, while 383,570 sequences (57 %) have "environmental samples" as their sole description.
- MPS technologies have not been validated against samples of known composition.
- MPS machines are not calibrated before, during or after a run.
- MPS experiments to estimate diversity are not reproduced (duplicated)!


Conclusions in Biology
· The term "post-genomics" has been prematurely coined; we are in fact at the beginning of a global sequencing era, which opens a long journey that will occupy a broad spectrum of the scientific community for decades.
· Global sequencing can now be done in a single operation using benchtop instruments.
· Global sequencing will soon replace any other method for estimating biodiversity and in transcriptome studies.
· A wide and generalized sequencing effort of well-identified strains deposited in collections worldwide is required to form the basis of derived annotations of environmental sequences.
· Developing ecosystem predictive models is fundamental, but this is still a long-term objective, as the connection of taxonomy to functions is still missing in most cases.


Conclusions in Bioinformatics
· A wide and generalized sequencing and ontology-building effort on well-identified strains deposited in collections worldwide is required to form the basis of derived annotations of environmental sequences.
· New formats need to be developed to store the flood of data soon to come. How to store efficiently:
  - The raw data.
  - Data with final annotations.
  - Intermediate calculations and results.
· New tools are required to efficiently query these huge data sets.
  - Entrez is nearly unusable.
  - SRS is problematic.
  - ACNUC works quite well but is not widely supported.


Conclusions in Informatics
· Efficient algorithms (computer clusters?) to assemble genomes.
  - Already a blooming field!
· Efficient algorithms to analyse transcriptomic data.
  - Already a blooming field!
  - Most developments are derivatives of earlier methods.
· A query system linking knowledge databases (ontologies) and sequence annotations needs to be developed.
· New methods to classify short & divergent sequences are needed.
· New methods to search sequences by similarity?
· Is there a better solution than simply flat files or SQL databases to store these huge data sets?