
EMBL Outstation - The European Bioinformatics Institute

EMBL Nucleotide Sequence Database

Release Notes

Release 73 December 2002


EMBL Outstation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: +44-1223-494400
Telefax : +44-1223-494468

Electronic mail: datalib@ebi.ac.uk
URL: http://www.ebi.ac.uk


CONTENTS

* 1 RELEASE 73

o 1.1 27.9 Billion Nucleotides
o 1.2 Recently Completed Genomes
o 1.3 New Cross-References to GOA (GO Annotation)
o 1.4 EMBL Sequence Version Archive (SVA)
o 1.5 Third-Party Annotation Dataset (TPA)
o 1.6 CON(struct) Division
o 1.7 Whole Genome Shotgun Sequences (WGS)
o 1.8 New Feature Table Definition Document v5
o 1.8.1 New Qualifier: /segment
o 1.8.2 New Qualifier: /mol_type
o 1.8.3 New Qualifier: /locus_tag
o 1.9 Database Files
o 1.9.1 Naming Conventions
o 1.9.2 EST Database Files
o 1.9.3 GSS Database Files
o 1.9.4 INV Database Files
o 1.9.5 HUM Database Files
o 1.9.6 HTG Database Files
o 1.9.6.1 Base Quality Values
o 1.9.7 PAT Database Files
o 1.9.8 CON Database Files
o 1.9.9 CRC Values for Distributed Files
o 1.10 Cross-Reference Information
o 1.11 Sequence Retrieval System (SRS)
o 1.12 EMBL Database FAQ
o 1.13 Disclaimer

* 2 FORTHCOMING CHANGES
o 2.1 Sequence Length Limit
o 2.2 Molecule Type Information
o 2.3 E-mail Submission Form Discontinued
o 2.4 New reference line-type 'RG' (Reference Group)

* 3 SEQUENCE SUBMISSION SYSTEMS
o 3.1 Checking Sequence Data For Vector Contamination
o 3.2 Webin - WWW Sequence Submission System
o 3.2.1 Webin-Bulk Submissions
o 3.2.2 Webin-TPA (Third Party Annotation) Submissions
o 3.2.3 Webin-Align - WWW Alignment Submission System
o 3.3 SEQUIN - Stand-alone Submission Program
o 3.4 Further Submission Information
o 3.4.1 Annotation Guides

* 4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE

* 5 EBI NETWORK SERVICES
o 5.1 Electronic Mail Server
o 5.2 Anonymous FTP Server
o 5.3 World Wide Web (WWW) Server
o 5.4 Sequence Similarity Search Servers

* 6 DISTRIBUTION FILES
o 6.1 Release 73 Files


* APPENDIX A DATABASE GROWTH TABLE


1 RELEASE 73

The EMBL Nucleotide Sequence Database was frozen to make Release 73 on
29-NOV-2002. The release contains 20,857,746 sequence entries comprising
27,903,283,528 nucleotides. This represents an increase of about 20% over
Release 72. A breakdown of Release 73 by division is shown below:

Division              Entries     Nucleotides
-----------------  ----------  --------------
Constructed               146     451,710,729
ESTs               14,278,034   7,122,871,284
Fungi                  69,690     110,593,809
GSSs                4,181,290   2,343,305,324
HTC                    41,184      52,445,825
HTG                    63,671  10,664,843,325
Human                 233,004   3,725,671,508
Invertebrates         108,405     590,984,276
Other Mammals          42,976      70,886,183
Mus musculus           63,784     792,364,491
Organelles            173,371     143,776,002
Patents               899,597     470,683,009
Bacteriophage           2,239       7,141,342
Plants                134,576     479,339,789
Prokaryotes           166,584     512,495,745
Rodents                23,789      37,005,931
STSs                  156,147      66,054,417
Synthetic               7,059      13,434,531
Unclassified            1,500       2,487,640
Viruses               170,355     151,858,111
Other Vertebrates      40,345      93,330,257
-----------------  ----------  --------------
Total              20,857,746  27,903,283,528


1.1 27.9 Billion Nucleotides

When data were frozen for this release, the number of nucleotides in the
database had nearly reached the 28 billion mark. EMBL database statistics
are available at URL: http://www3.ebi.ac.uk/Services/DBStats/


1.2 Recently Completed Genomes

Plasmodium falciparum
The analysis of the genome sequence of Plasmodium falciparum strain 3D7
has been published in Nature 419, 498-511 (2002). The nuclear genome
consists of 14 chromosomes; the genome size is 23 megabases, encoding about
5,300 genes.

Plasmodium falciparum strain 3D7 chromosome 1 AL844501
Plasmodium falciparum strain 3D7 chromosome 2 AE001362
Plasmodium falciparum strain 3D7 chromosome 3 AL844502
Plasmodium falciparum strain 3D7 chromosome 4 AL844503
Plasmodium falciparum strain 3D7 chromosome 5 AL844504
Plasmodium falciparum strain 3D7 chromosome 6 AL844505
Plasmodium falciparum strain 3D7 chromosome 7 AL844506
Plasmodium falciparum strain 3D7 chromosome 8 AL844507
Plasmodium falciparum strain 3D7 chromosome 9 AL844508
Plasmodium falciparum strain 3D7 chromosome 10 AE014185
Plasmodium falciparum strain 3D7 chromosome 11 AE014186
Plasmodium falciparum strain 3D7 chromosome 12 AE014188
Plasmodium falciparum strain 3D7 chromosome 13 AL844509
Plasmodium falciparum strain 3D7 chromosome 14 AE014187

Anopheles gambiae WGS
In a separate project, an international consortium of researchers has
sequenced the genome of the Anopheles gambiae mosquito, which transmits
the parasite to humans. The International Anopheles Genome project is
a collaboration between Celera Genomics, Genoscope, University of
Notre Dame, EBI/Sanger Institute, EMBL, Institut Pasteur, IMBB and TIGR.
The genome sequence of the malaria mosquito Anopheles gambiae has been
published in Science 298, 129 (2002). The Whole Genome Shotgun (WGS)
assembly of the mosquito genome is available from the EBI at
ftp://ftp.ebi.ac.uk/pub/databases/embl/wgs.
Both genome sequences provide the foundation for future studies of
these organisms, and are being exploited in the search for new drugs and
vaccines to fight malaria.

Other recently completed genomes include:

Bifidobacterium longum NCC2705 AE014295
Brucella suis 1330 chromosome I AE014291
Brucella suis 1330 chromosome II AE014292
Corynebacterium efficiens YS-314 BA000035
Leptospira interrogans serovar lai str. 56601 chromosome I AE010300
Leptospira interrogans serovar lai str. 56601 chromosome II AE010301
Oceanobacillus iheyensis BA000028
Shewanella oneidensis MR-1 AE014299
Shigella flexneri 2a str. 301 AE005674
Streptococcus agalactiae 2603V/R AE009948
Streptococcus mutans UA159 AE014133
Wigglesworthia brevipalpis BA000021
Trypanosoma brucei DNA chromosome 1 AL929608
Shewanella oneidensis MR-1 megaplasmid AE014300

Direct access to hundreds of completed genome sequences is available via
EBI's WWW Genomes server at URL http://www.ebi.ac.uk/genomes/


1.3 New Cross-References to GOA (GO Annotation)

GOA is a project at the European Bioinformatics Institute to provide
assignments of gene products to the Gene Ontology (GO) resource. The goal
of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary
that can be applied to all organisms. In the GOA project, this vocabulary
will be applied to a non-redundant set of proteins described in the
SWISS-PROT, TrEMBL and Ensembl databases that collectively provide complete
proteomes for Homo sapiens and other organisms.
EMBL Release 73 includes more than 730,000 new cross-references to GOA.

Example:

ID HSFOS standard; DNA; HUM; 6210 BP.
XX
AC K00650; M16287;
...
DR EPD; EP11145; HS_FOS.
DR GDB; 119917; FOS.
DR GOA; P01100; P01100.
DR SWISS-PROT; P01100; FOS_HUMAN.
...
FT CDS join(889..1029,1783..2034,2466..2573,2688..3329)
FT /codon_start=1
FT /db_xref="GOA:P01100"
FT /db_xref="SWISS-PROT:P01100"
...

Hyperlinks from e.g. /db_xref="GOA:P01100" use the QuickGO browser
at www.ebi.ac.uk/ego/QuickGO?, e.g.
http://www.ebi.ac.uk/ego/QuickGO?mode=search&querytype=protein&query=P01100
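
The cross-references in the example above can be pulled out of a flat-file
entry with a few lines of code. The following is a minimal sketch, not an
official parser: the column spacing of the sample lines is an assumption
(it is not preserved in the listing above), the function names are
hypothetical, and the URL pattern is simply the QuickGO address quoted in
this section.

```python
import re

# Fragment of the HSFOS example above; elided lines ("...") are left out
# and the exact column spacing is assumed.
ENTRY = """\
ID   HSFOS      standard; DNA; HUM; 6210 BP.
AC   K00650; M16287;
DR   EPD; EP11145; HS_FOS.
DR   GDB; 119917; FOS.
DR   GOA; P01100; P01100.
DR   SWISS-PROT; P01100; FOS_HUMAN.
FT   CDS             join(889..1029,1783..2034,2466..2573,2688..3329)
FT                   /codon_start=1
FT                   /db_xref="GOA:P01100"
FT                   /db_xref="SWISS-PROT:P01100"
"""

DB_XREF = re.compile(r'/db_xref="([^:"]+):([^"]+)"')


def dr_cross_references(text):
    """Return (database, primary id, secondary id) for every DR line."""
    refs = []
    for line in text.splitlines():
        if line.startswith("DR"):
            # Drop the line code and the trailing full stop, then split on ';'.
            fields = [f.strip() for f in line[2:].rstrip(". \t").split(";")]
            if len(fields) == 3:
                refs.append(tuple(fields))
    return refs


def feature_db_xrefs(text):
    """Return (database, id) pairs from /db_xref qualifiers on FT lines."""
    pairs = []
    for line in text.splitlines():
        if line.startswith("FT"):
            pairs.extend(match.groups() for match in DB_XREF.finditer(line))
    return pairs


def quickgo_url(protein_id):
    """Build the QuickGO search URL in the form quoted in these notes."""
    return ("http://www.ebi.ac.uk/ego/QuickGO?mode=search"
            "&querytype=protein&query=" + protein_id)


if __name__ == "__main__":
    for database, identifier in feature_db_xrefs(ENTRY):
        if database == "GOA":
            print(quickgo_url(identifier))
```

Run on the fragment, the script prints the QuickGO URL for P01100 shown
above.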


1.4 EMBL Sequence Version Archive (SVA)

In order to provide access to previous versions of database records,
the EMBL database has established a Sequence Version Archive (SVA).
Data in the EMBL nucleotide sequence database change over time
for a number of reasons, e.g. due to updates/corrections or extensions
based on new findings from more recent experiments. Each time data in
an entry are modified, the entry is assigned a new entry version
number. Entries can change their appearance even while the data
included remain unchanged, due to general flat-file format changes or
when the taxonomic classification of the source organisms changes,
e.g. when an organism is assigned a new place in the hierarchy.
Following these types of changes, the entry will retain its original
sequence version number.

Querying the EMBL Sequence Version Archive

The EMBL Sequence Version Archive is available from the EBI
web-servers at URL http://www.ebi.ac.uk/embl/sva/.

Entries can be viewed by accession number, nucleotide sequence
identifier or protein identifier.

Query result options allow the user to
a) show the complete history of an entry, i.e. all recorded flat
files matching the query criterion in chronological order,
b) show a snapshot of the entry at a particular date. Query results
are presented in a table; listed EMBL entries can be 'Viewed on
Screen' or 'Saved to File'.
c) show differences between entry versions.


1.5 Third-Party Annotation Dataset (TPA)

Following a decision taken at the 2002 Collaborative Meeting, DDBJ/EMBL/GenBank
have been creating a Third Party Annotation (TPA) dataset. The TPA data-
collection is a complement to the existing DDBJ/EMBL/GenBank comprehensive
database of primary nucleotide sequences, which typically result from direct
sequencing of cDNAs, ESTs, genomic DNAs etc.

Primary data are defined to be data for which the submitting group has done
the sequencing and annotation, and as 'owner' of these data has privileges
to submit updates/corrections etc. In contrast, non-primary sequences are
defined as sequences which
a) consist exclusively of DNA from one or several already existing entries
'owned' by other groups or
b) consist of a mixture of new & already existing sequences.


TPA categories and requirements

Users can submit re-annotations/re-assemblies of sequences already
present in DDBJ/EMBL/GenBank and owned by other groups to be included in the
Third Party Annotation (TPA) data-collection.

Categories of data submissions accepted for TPA include:

a. re-annotation/analysis of sequence(s) from DDBJ/EMBL/GenBank
b. mixed primary/non-primary TPA sequence including regions of new and
existing sequence (e.g. filling the gaps with HTG or EST or newly
sequenced data)
c. TPA sequences based on trace archive data
d. TPA sequences based on Whole Genome Shotgun (WGS) sequences

Not accepted are consensus sequences from multiple organisms.

For TPA submissions, the database requires information on the composition of
the TPA sequence to show which spans in a TPA sequence originated from which
contributing database entries. In order to assure that the sequence annotation
is of high quality, we require that the study be published in a peer reviewed
journal before we release the data to the public.

EBI's submission system WEBIN has been customised to allow submissions
of TPA sequences to the EMBL Nucleotide Sequence Database. WEBIN is available
at URL http://www.ebi.ac.uk/embl/Submission/webin.html

TPA sequences are exchanged amongst the DDBJ/EMBL/GenBank database
collaboration. The TPA data-collection is available via the
EBI FTP server at ftp://ftp.ebi.ac.uk/pub/databases/embl/tpa and also
via EBI's Sequence Retrieval System (SRS) at http://srs.ebi.ac.uk.


1.6 CON(struct) Division

Con(structed) or Con(tig) sequences in the CON division represent complete
genomes and other long sequences constructed from segment entries.
Nucleotide sequence records in EMBL (DDBJ and GenBank) currently have a
size restriction of 350 kb. Sequences >350 kb are split into smaller segment
entries before inclusion in the database. Segment entries are assigned
individual accession numbers, include sequence data and are distributed
in the appropriate taxonomic divisions. In contrast, CON division entries do
not contain sequence data per se, but rather the assembly information on
all accession.versions and sequence locations relevant in building the
contig genome. CON sequence entries follow the daily data exchange mechanism
between DDBJ/EMBL/GenBank. CON entries are available at
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/ in file 'embl.con' and
from EBI's WWW Genomes server at URL
http://www.ebi.ac.uk/genomes/.

For more detailed information on specific aspects of CON(structed)
sequences (e.g. examples, CO line-type, gap representation etc.),
please see the previous EMBL release notes (Release 72, Sep 2002).
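
To make the 350 kb limit concrete, the sketch below computes how a long
sequence could be cut into consecutive segments that each fit the limit.
It is purely illustrative: the function name and the simple
non-overlapping split rule are assumptions, and real segment entries may
be divided differently (for example with overlaps).

```python
MAX_ENTRY_LENGTH = 350_000  # 350 kb limit quoted in these notes


def segment_spans(total_length, limit=MAX_ENTRY_LENGTH):
    """Return 1-based inclusive (start, end) spans that cover a sequence.

    Hypothetical rule: consecutive, non-overlapping pieces of at most
    `limit` bases each.
    """
    if total_length < 1 or limit < 1:
        raise ValueError("lengths must be positive")
    spans = []
    start = 1
    while start <= total_length:
        end = min(start + limit - 1, total_length)
        spans.append((start, end))
        start = end + 1
    return spans


if __name__ == "__main__":
    # An 800 kb sequence needs three segments under this rule.
    print(segment_spans(800_000))
    # [(1, 350000), (350001, 700000), (700001, 800000)]
```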

1.7 Whole Genome Shotgun Sequences (WGS)

Methods using whole genome shotgun data are used to gain a large amount of
genome coverage for an organism. WGS data for a growing number of organisms
are being submitted to DDBJ/EMBL/GenBank. WGS sequences are exchanged
amongst the DDBJ/EMBL/GenBank database collaboration.
EMBL's WGS data-collection is made available via the EBI FTP server at
ftp://ftp.ebi.ac.uk/pub/databases/embl/wgs and also via EBI's Sequence
Retrieval System (SRS) at http://srs.ebi.ac.uk

Detailed information on specific aspects of Whole Genome Shotgun Sequences
(e.g. accession-number format, examples, lists etc.) is available from the
previous EMBL release notes (Release 72, Sep 2002).



1.8 Feature Table Definition Document v5

The new version of the Feature Table Definition Document (FTv5)
is implemented on 15-DEC-2002. The document is available from the EBI
servers at:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/


1.8.1 New SOURCE feature qualifier: /segment

/segment is a qualifier to the 'Source' feature key providing information
on the name or number of a viral or phage segment.

Qualifier /segment=
Definition name of viral or phage segment sequenced
Value format "text"
Example /segment="6"


1.8.2 New SOURCE feature qualifier: /mol_type

/mol_type is a qualifier to the 'Source' feature key recording the
biological state (in vivo molecule type) of the sequence.

Qualifier /mol_type=
Definition in vivo molecule type
Value format "text"
Example /mol_type="genomic DNA"
Comment text limited to "genomic DNA", "genomic RNA", "mRNA"
(incl. EST), "tRNA", "rRNA", "snoRNA", "snRNA", "scRNA",
"pre-mRNA", "other RNA" (incl. synthetic),"other DNA"
(incl. synthetic), "unassigned DNA" (incl. unknown),
"unassigned RNA" (incl. unknown);
/mol_type will become a mandatory qualifier to the SOURCE
feature key from 01-JUL-2003.
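
Because /mol_type takes its value from a fixed list, a submission tool can
check it mechanically. A minimal sketch follows, assuming the qualifier is
available as a single string; the function name is hypothetical and the
value list is copied from the comment above.

```python
# Controlled vocabulary for /mol_type as listed in this section.
MOL_TYPES = frozenset({
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA", "snoRNA",
    "snRNA", "scRNA", "pre-mRNA", "other RNA", "other DNA",
    "unassigned DNA", "unassigned RNA",
})

PREFIX = '/mol_type="'


def is_valid_mol_type(qualifier):
    """Check a qualifier string such as '/mol_type="genomic DNA"'."""
    if not (qualifier.startswith(PREFIX) and qualifier.endswith('"')):
        return False
    # Compare the quoted text against the controlled vocabulary.
    return qualifier[len(PREFIX):-1] in MOL_TYPES


if __name__ == "__main__":
    print(is_valid_mol_type('/mol_type="genomic DNA"'))  # True
    print(is_valid_mol_type('/mol_type="cDNA"'))         # False
```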


1.8.3 New qualifier: /locus_tag

/locus_tag allows assignment of systematic tags for tracking purposes.

Qualifier: /locus_tag
Definition: feature tag assigned for tracking purposes
Value Format: "text"(single token)
Example: /locus_tag="RSc0382"
/locus_tag="YPO0002"
Comment: /locus_tag can be used with any feature where /gene
is valid;


1.9 Database Files

In order to keep the size of the data files within reasonable limits for
handling purposes, additional division files will be added in subsequent
releases as appropriate.


1.9.1 Naming Conventions

When a division is split into several files, these are named so that
they sort sequentially, e.g. est_hum01.dat, est_hum02.dat,......,
est_hum22.dat, est_hum23.dat etc
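
The zero-padded numbers are what make a plain alphabetical listing match
the numerical order. A short sketch, assuming a two-digit counter as in
the examples above (the function name is hypothetical, and unsplit
divisions such as est_pro.dat carry no number):

```python
def division_file_names(prefix, count):
    """Return the file names of a division split into `count` files."""
    return [f"{prefix}{number:02d}.dat" for number in range(1, count + 1)]


if __name__ == "__main__":
    names = division_file_names("est_hum", 49)
    print(names[0], names[-1])        # est_hum01.dat est_hum49.dat
    print(names == sorted(names))     # True

    # Without padding, alphabetical order no longer follows the numbers.
    print(sorted(["est_hum2.dat", "est_hum10.dat"]))
    # ['est_hum10.dat', 'est_hum2.dat']
```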


1.9.2 EST Database Files

ESTs (single pass cDNA reads) constitute a major source of sequence records
- the vast majority originating from human and mouse. In addition to the EST
division files in the EMBL database release, EBI's ESTLIB provides
further information about the libraries from which EST sequences were derived.
The corresponding EST division entries in EMBL are cross-referenced to ESTLIB
with a /db_xref qualifier on the source feature, e.g. /db_xref="ESTLIB:863".
ESTLIB is available from ftp://ftp.ebi.ac.uk/pub/databases/embl/estlib/

EST files are split according to taxonomic subdivisions following
the model of the taxonomic split of all other EMBL database divisions.


est_fun01.dat - est_fun02.dat Fungi ESTs
est_hum01.dat - est_hum49.dat Human ESTs
est_inv01.dat - est_inv17.dat Invertebrate ESTs
est_mam01.dat - est_mam04.dat Other Mammal ESTs
est_mus01.dat - est_mus29.dat Mouse ESTs
est_pln01.dat - est_pln29.dat Plant ESTs
est_pro.dat Prokaryote ESTs
est_rod01.dat - est_rod04.dat Other Rodent ESTs
est_unc.dat Unclassified ESTs
est_vrt01.dat - est_vrt12.dat Vertebrate ESTs


1.9.3 GSS Database Files

Genome Survey Sequences (GSS) are of similar nature to EST data, except
that sequences are genomic rather than cDNA (mRNA). The GSS division
contains e.g. random `single pass read' genome survey sequences, single
pass reads from cosmid/BAC/YAC ends, exon trapped genomic sequences and
Alu PCR sequences.

GSS division files are also split according to taxonomic subdivisions.
Mouse GSSs are now included in files gss_mus01.dat - gss_mus10.dat.

gss_fun.dat Fungi GSSs
gss_hum01.dat - gss_hum09.dat Human GSSs
gss_inv01.dat - gss_inv06.dat Invertebrate GSSs
gss_mam01.dat - gss_mam02.dat Other Mammal GSSs
gss_mus01.dat - gss_mus10.dat Mouse GSSs
gss_phg.dat Phage GSSs
gss_pln01.dat - gss_pln11.dat Plant GSSs
gss_pro.dat Prokaryote GSSs
gss_rod01.dat - gss_rod04.dat Rodent GSSs
gss_vrl.dat Viral GSSs
gss_vrt01.dat - gss_vrt03.dat Vertebrate GSSs


1.9.4 INV Database Files

The INV division has been split into 3 files (inv01.dat - inv03.dat).


1.9.5 HUM Database Files

The HUM division has been split into 24 files (hum01.dat-hum24.dat).


1.9.6 HTG Database Files

'Unfinished' DNA sequences generated by the high-throughput sequencing
centers are represented in the HTG division - and are rapidly made available
to the scientific community for homology searches. Entries in this
division all contain keywords to indicate the status of the sequencing
(e.g., HTGS_PHASE1). A single accession number is assigned to one clone,
and as sequencing progresses and the entry passes from one phase to
another, it will retain the same accession number. Once 'finished', HTG
sequences are moved into the appropriate primary EMBL taxonomic division.

HTG division files have also been split according to taxonomic subdivisions.
Mouse HTGs are included in files htgo_mus.dat and htg_mus01.dat - htg_mus04.dat.

htgo_hum.dat Human HTGs phase0
htgo_mus.dat Mouse HTGs phase0
htgo_other.dat Other HTGs phase0
htg_hum01.dat - htg_hum05.dat Human HTGs
htg_inv01.dat - htg_inv02.dat Invertebrate HTGs
htg_mam.dat Other Mammal HTGs
htg_mus01.dat - htg_mus04.dat Mouse HTGs
htg_other.dat Other HTGs
htg_pln.dat Plant HTGs
htg_rod01.dat - htg_rod07.dat Rodent HTGs
htg_vrt.dat Other Vertebrate HTGs

HTGS_PHASE0 entries typically consist of one-to-few pass reads of a single
clone, have not been assembled into contigs and are unoriented, unordered,
unannotated and contain gaps of unknown length.
Low-pass sequence sampling is useful for identifying clones that may be gene-
rich. Phase0 sequences are used to check whether another center is already
sequencing this clone. If not, it will be sequenced through phase 1 and
phase 2. When records are updated, the accession numbers will be preserved.


1.9.6.1 Base Quality Values

Quality scores from draft HTG data are available on the EBI FTP server. The
compressed (gzip) files contain base quality values for unfinished human
sequences from Japanese, US and European sequencing centres. The fasta-type
headers contain the EMBL sequence identifier of the corresponding
database entries.

Example: >AL157822.2 Phrap Quality (Length:158745, Min: 1, Max: 99)

In order to keep the size of the files within reasonable limits for handling
purposes, files which in uncompressed form are bigger than 1 Gb, are split
into smaller files.

Directory: ftp://ftp.ebi.ac.uk/pub/databases/embl/quality_scores

Quality score files are updated on a daily basis.
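
The fasta-type header shown in the example above can be split into its
parts with a regular expression. This is a sketch only: it assumes every
header follows exactly the layout of that single example, and the function
name is hypothetical.

```python
import re

# Pattern modelled on the one example header given in these notes.
HEADER = re.compile(
    r"^>(?P<accession>[A-Z]+\d+)\.(?P<version>\d+)\s+Phrap Quality\s+"
    r"\(Length:\s*(?P<length>\d+),\s*Min:\s*(?P<min>\d+),"
    r"\s*Max:\s*(?P<max>\d+)\)\s*$"
)


def parse_quality_header(line):
    """Return accession, version, length, min and max from a header line."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("unrecognised quality header: %r" % line)
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "length": int(match.group("length")),
        "min": int(match.group("min")),
        "max": int(match.group("max")),
    }


if __name__ == "__main__":
    example = ">AL157822.2 Phrap Quality (Length:158745, Min: 1, Max: 99)"
    print(parse_quality_header(example))
```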


1.9.7 PAT Database Files

PAT files include sequence data incorporated from the European patent
literature (EPO) and complemented by American and Japanese patent
data integrated from NCBI (USA) and DDBJ (Japan).
The Patent division has been split into 5 files (pat01.dat - pat05.dat).

1.9.8 CON Database File

CON files include construct information for building contig sequences
of chromosomes, genomes and other long DNA sequences. CON entries in
file 'embl.con' do not contain sequence data per se. The CON division
data is included in 1 file (embl.con).

1.9.9 CRC values for distributed files

To help users verify the integrity of release data files, we supply files
containing 32-bit checksum Cyclic Redundancy Check (CRC) values, plus byte
counts, for both compressed and uncompressed release files.

These CRC values are calculated based on the IEEE Std 1003.2-1992 (POSIX 1003.2)
and X/Open CAE specifications. These values are generated by default by the
'cksum' command on Irix, RedHat Linux, SunOS, Solaris. On Tru64 unix, the
environment variable CMD_ENV needs to be set to xpg4.

File: crc_gz.txt for compressed data files
File: crc.txt for uncompressed data files

Example from crc.txt: 1553759899 195684609 est_fun.dat

This output shows that the checksum of the file est_fun.dat is 1553759899
and the file contains 195684609 bytes.
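
Where no 'cksum' command is at hand, the same check can be reproduced in a
script. The sketch below implements the POSIX cksum algorithm (CRC
polynomial 0x04C11DB7, with the byte count appended to the data) and
compares a file against one line of crc.txt. The function names are
hypothetical, and the layout of crc.txt is assumed to match the single
example line above.

```python
POLYNOMIAL = 0x04C11DB7  # generator polynomial used by POSIX cksum


def _make_table():
    """Build the 256-entry lookup table for the MSB-first CRC."""
    table = []
    for index in range(256):
        crc = index << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ POLYNOMIAL) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
        table.append(crc)
    return table


_TABLE = _make_table()


def _update(crc, byte):
    return ((crc << 8) & 0xFFFFFFFF) ^ _TABLE[(crc >> 24) ^ byte]


def posix_cksum(chunks):
    """Return (checksum, byte count) for an iterable of bytes objects."""
    crc = 0
    length = 0
    for chunk in chunks:
        length += len(chunk)
        for byte in chunk:
            crc = _update(crc, byte)
    # POSIX appends the length, least significant byte first.
    remaining = length
    while remaining:
        crc = _update(crc, remaining & 0xFF)
        remaining >>= 8
    return crc ^ 0xFFFFFFFF, length


def parse_crc_line(line):
    """Split a crc.txt line into (checksum, byte count, file name)."""
    checksum, size, name = line.split(None, 2)
    return int(checksum), int(size), name.strip()


def file_matches(path, crc_line, chunk_size=1 << 20):
    """Compare a local file with the values given on a crc.txt line."""
    expected_crc, expected_size, _name = parse_crc_line(crc_line)
    with open(path, "rb") as handle:
        actual = posix_cksum(iter(lambda: handle.read(chunk_size), b""))
    return actual == (expected_crc, expected_size)
```

For example, file_matches("est_fun.dat", "1553759899 195684609 est_fun.dat")
would return True only if both the checksum and the byte count agree. The
pure-Python loop is slow on files of this size; it is meant to show the
calculation rather than to replace 'cksum'.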


1.10 Cross-Reference Information

Links to external databases allow integration with specialised data
collections, such as protein databases, species-specific databases,
taxonomy databases etc. The WWW-based sequence retrieval system SRS
enables users to easily navigate between cross-referenced database entries.

The total number of links in EMBL Release 73 is 19,586,912. More than 1.89
million of these also refer to individual features, e.g. CDS
(coding sequences), via the /db_xref feature qualifier in EMBL entries.

Database    Nr of links
----------  -----------
UNILIB         13410818
RZPD            3439996
TrEMBL           864467
GOA              734286
GrainGenes       441708
SWISS-PROT       216658
MaizeDB          210299
RemTrEMBL         87243
IMGT/LIGM         63782
MGD               38275
FLYBASE           25345
MENDEL            21033
SGD               10991
GDB                8430
TRANSFAC           6620
IMGT/HLA           3577
EPD                3384
-----------------------
Total          19586912


1.11 Sequence Retrieval System (SRS)

EBI's SRS server is available at URL http://srs.ebi.ac.uk
All external services are available from the 'Toolbox' button on EBI's Web
pages. If you have any comments and/or suggestions please send these to:

support@ebi.ac.uk


1.12 EMBL Database FAQ

An EMBL Database FAQ is available from the EBI at URL
http://www.ebi.ac.uk/embl/Documentation/FAQ/

This document includes information on:

General questions about EMBL and other databases
Submission procedure
Updating database entries
Webin-specific questions
Navigation guide


1.13 Disclaimer

No guarantee is given and no legal liability or responsibility is assumed for
the completeness and accuracy of the database entries, in particular the
conformity of sequence data in the database with the journal publication
where the sequence is also disclosed.


2 FORTHCOMING CHANGES

2.1 Sequence Length Limit

Currently database records are limited in length to 350 kb. At the recent
collaborative meeting, DDBJ/EMBL/GenBank discussed relaxing
the maximum sequence length limit. The plan is to remove the size restriction
on database records in two years' time. We will announce this to the community,
especially developers, and we will review this proposal in 12 months' time.

2.2 Molecule type information

/mol_type qualifier:
The new /mol_type qualifier described above in 1.8.2 will be implemented
with the new FTv5 document on 16-Dec-2002. Initially, /mol_type
will be an optional qualifier to the SOURCE feature key to allow
some time for retrofitting of existing data. From 01-JUL-2003
/mol_type will be a mandatory qualifier to the SOURCE feature key
and will consistently display the in vivo molecule type of the sequence.

Molecule type in ID lines:
At the same time, molecule type information in the flat-file entry
ID-line will display the corresponding value from the /mol_type value
list.


2.3 E-mail submission form discontinued

From 01-Jan-2003, we will discontinue support of the E-mail submission
form and will not be accepting e-mail submissions.
EBI's preferred submission medium is WEBIN (for details see 3.2).

2.4 New reference line-type 'RG' (Reference Group)

A new reference line-type 'RG' will be introduced in June 2003 to
list the consortium name associated with a given citation.

Examples:

RG The C. elegans Sequencing Consortium;
RG The Brazilian Network for HIV Isolation and Characterization;


3 SEQUENCE SUBMISSION SYSTEMS

3.1 Checking Sequence Data For Vector Contamination

We urge submitters to remove vector contamination from sequence data before
submitting to the database. To assist submitters the EBI is providing a
Vector Screening Service using the latest implementation of the BLAST
algorithm and a special sequence databank known as EMVEC. EMVEC is an
extraction of sequences from the SYNthetic division of EMBL containing
more than 2000 sequences commonly used in cloning and sequencing experiments.
EMVEC is by no means a complete vector databank but EBI believes it is
representative of the kind of material used in modern sequencing and should
be useful to submitters. The databank will be updated with each release of
EMBL and made publicly available on the EBI's ftp server.

The interactive WWW service can be found at:

http://www.ebi.ac.uk/embl/Submission/webin.html
http://www.ebi.ac.uk/blastall/vectors.html

The results will list sequences producing significant alignments and
associated information such as vector name, score and alignment.


3.2 Webin - WWW Sequence Submission System

Webin is the preferred WWW Sequence Submission System for submitting
nucleotide sequence data and associated biological information to the
EMBL Nucleotide Sequence Database.

To access Webin please use the following URL:

http://www.ebi.ac.uk/embl/Submission/webin.html

Database entries submitted to the EMBL Nucleotide Sequence Database at the
EBI will be exchanged and shared among the International Collaboration of
Nucleotide Sequence Databases (DDBJ/EMBL/GenBank).

Webin guides the user through a sequence of WWW forms allowing the submission
of sequence data and descriptive information in an interactive and easy way.
All the information required to create a database entry will be collected
during this process.

EBI staff will process data submissions within 2 working days and send the
database accession number(s) assigned to your data to your e-mail address.

3.2.1 Webin Bulk Submissions

To make bulk sequence submission less time consuming for submitters, a
web-based bulk submission path can be accessed from the Webin page. Authors
planning to submit a large number of similar sequences (i.e. >25) are
presented with an option for "Bulk Webin Submission". When choosing the bulk
path, submitters follow the usual Webin submission procedure for a first
single representative sequence, which will be processed by database staff to
create templates for the other sequences, thus saving the author the time
and effort required to complete numerous submission events individually.
Please contact database staff if you require further information.

e-mail: datasubs@ebi.ac.uk

Tel: +44-1223-494499
Fax: +44-1223-494472


3.2.2 Webin-TPA submissions

Users can submit re-annotations/re-assemblies of sequences already
present in DDBJ/EMBL/GenBank and owned by other groups to be included
in the Third Party Annotation (TPA) data-collection.


3.2.3 Webin-Align Sequence Alignment Submissions

The EBI accepts submissions of sequence alignment data (from phylogenetic
and population analysis etc.) via Webin-Align, the EBI's WWW-based
submission tool. After approval by EMBL staff, a unique identifier (an
accession number with the format ALIGN_000001) is assigned to the alignment.
This identifier is then communicated to the submitter, and should be quoted
in publications.

Details on how to access Webin-Align, related help documentation and
annotation example pages are available at:
http://www.ebi.ac.uk/embl/Submission/align_top.html

New users are advised to read the help documentation and FAQ prior to
submitting their data.

Alignment data is available in EMBL-Align and ClustalW file format
from the EBI FTP server. New alignments in EMBL-Align format can be
retrieved via SRS.
http://srs.ebi.ac.uk/ - Sequence Retrieval System
http://www3.ebi.ac.uk/Services/align/listali.html - HTML alignment list

Submitters unable to access Webin-Align should contact the data submissions
staff (email: datasubs@ebi.ac.uk) for advice.

3.3 SEQUIN - Stand-alone Submission Program

Sequin is the multi-platform (Mac/PC/Unix) stand-alone software tool
developed by the NCBI for submitting entries to the EMBL, GenBank, or DDBJ
sequence databases. The Sequin program, along with detailed downloading and
installation instructions plus general information are available from the
EBI via WWW and anonymous FTP.

http://www3.ebi.ac.uk/Services/Sequin/
ftp://ftp.ebi.ac.uk/pub/software/sequin/


3.4 Further Submission Information

3.4.1 Annotation Guides

To help and guide submitters in annotating their sequences, two online guides
are available via hyperlinks from within Webin:

EMBL Annotation Examples (http://www3.ebi.ac.uk/Services/Standards/web/) and
EMBL Features and Qualifiers (http://www3.ebi.ac.uk/Services/WebFeat/).

The annotation examples consist of a list of EMBL approved feature table
annotations for common biological sequences. The EMBL Features and Qualifiers
is a complete list of feature table key and qualifier definitions providing
detailed descriptions and usage examples.

For further information on submission of sequence data to the EMBL Nucleotide
Sequence Database please access:

http://www.ebi.ac.uk/embl/Submission/

or contact database staff at:

EMBL Nucleotide Sequence Submissions
e-mail: datasubs@ebi.ac.uk
telephone: +44-1223-494499
telefax: +44-1223-494472


4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE

We encourage authors to include a reference to the EMBL Database in
publications related to their research.

When citing data in the EMBL Database, we suggest giving the corresponding
primary accession number and the publication in which the sequence first
appeared. For unpublished data, we suggest contacting the original
submitters for recent publication information or revisions of the data.

We also suggest providing a reference for the EMBL Database itself. Our
recent publication describing the EMBL database should be cited:

Stoesser G., Baker W., van den Broek A., Camon E., Garcia-Pastor M., Kanz C.,
Kulikova T., Leinonen R., Lin Q., Lombard V., Lopez R., Redaschi N., Stoehr
P., Tuli M.A, Tzouvara K. and Vaughan R.
'The EMBL Nucleotide Sequence Database'
Nucleic Acids Res 30:21-26(2002)

Example: The numbers in parentheses refer to the reference citation in the
EMBL database entry, and to the EMBL citation above.

"Sequence entry X56734 (1) has been retrieved from the EMBL Database (2) and
showed significant sequence similarity to ..."

(1) Oxtoby, E., et al., Plant Mol. Biol. 17:209-219(1991).
(2) Stoesser G. et al., Nucleic Acids Res 30:21-26(2002).


5 EBI NETWORK SERVICES

5.1 Electronic Mail Server

Users with access to electronic mail and the internet can obtain copies of
database entries, documentation or the data submission form by sending
commands to a file server running at EBI. New and updated EMBL nucleotide
sequence entries are made available on the server on a daily basis.

To use this facility, send file server commands to the address
netserv@ebi.ac.uk. Each line of the mail message should consist of a
single file server request.

The most important file server request, to get started, is:

HELP

If the file server receives this command, it will return a helpfile to the
sender, explaining in some detail how to use the facility. For example, to
request a copy of the nucleotide sequence with accession number X55652, use
the command:

GET NUC:X55652

The file server offers various other services (e.g. access to nucleotide and
protein sequence data, protein structure data, software), details of which
are provided in the HELP file.
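
Requests for several entries can be put together by a script, one command
per line as described above. The sketch below only composes the message
and does not send it; the function name and the sender address are
placeholders, and only the HELP and GET NUC: commands shown in this
section are used.

```python
from email.message import EmailMessage

FILE_SERVER = "netserv@ebi.ac.uk"  # address given in these notes


def build_request(sender, accessions):
    """Compose a file server message with one GET request per line."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = FILE_SERVER
    lines = [f"GET NUC:{accession}" for accession in accessions]
    message.set_content("\n".join(lines) + "\n")
    return message


if __name__ == "__main__":
    # "user@example.org" is a placeholder sender address.
    print(build_request("user@example.org", ["X55652", "X56734"]))
```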


5.2 Anonymous FTP Server

An alternative method of accessing the EBI archives is to use the file
transfer protocol (ftp). Researchers with direct access to the Internet can
use the FTP program on their local machine to connect to the host
FTP.EBI.AC.UK and enter the username "anonymous" and their email address
as password.
The directory pub/help contains detailed information about the data available
from the EBI anonymous FTP server, which includes the complete EMBL
Nucleotide Sequence Database releases as well as daily and weekly updates
and a cumulative update file (gzip compressed format) in the following
directories:

EMBL quarterly release: pub/databases/embl/release
EMBL updates: pub/databases/embl/new
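
For scripted access, the anonymous login described above can be sketched with
Python's standard ftplib module. The sketch assumes that the host name and
directories are exactly as listed in this section; the function names
list_directory and fetch_file, and the file name in the usage comment, are
hypothetical.

```python
from ftplib import FTP

# Host and directories as named in this section.
FTP_HOST = "ftp.ebi.ac.uk"
RELEASE_DIR = "pub/databases/embl/release"
UPDATES_DIR = "pub/databases/embl/new"


def list_directory(directory: str, email: str) -> list[str]:
    """Log in as "anonymous" (email address as password) and list a directory."""
    with FTP(FTP_HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        return ftp.nlst()


def fetch_file(directory: str, name: str, email: str, target: str) -> None:
    """Download one file in binary mode (needed for gzip compressed files)."""
    with FTP(FTP_HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        with open(target, "wb") as out:
            ftp.retrbinary(f"RETR {name}", out.write)


# Usage (requires network access; the file name is only an assumption):
#   names = list_directory(RELEASE_DIR, "user@example.org")
#   fetch_file(RELEASE_DIR, "relnotes.txt", "user@example.org", "relnotes.txt")
```

Binary transfer mode is used because the release and update files are
distributed in gzip compressed format.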


5.3 World Wide Web (WWW) Server

The EBI operates a WWW server at URL http://www.ebi.ac.uk/ providing
information about the EBI and its products and services.

Data Retrieval: Nucleotide sequences can be retrieved by a simple query by
accession number, or more complex queries can be constructed using the SRS
databank browser at http://srs.ebi.ac.uk

Data Submission: Nucleotide sequences can be submitted to the database using
the interactive submission system Webin at
http://www.ebi.ac.uk/embl/Submission/webin.html


5.4 Sequence Similarity Search Servers

The EBI offers three network servers for sequence similarity searches via
electronic mail or interactive WWW forms:

FASTA based on W. Pearson's FASTA algorithm. Allows local similarity searches
of protein and nucleotide sequence databases.
Send "help" to fasta@ebi.ac.uk or use URL http://www.ebi.ac.uk/fasta3/

BLAST based on the NCBI and WU-Blast software. Send "help" to
blast@ebi.ac.uk or use URL http://www.ebi.ac.uk/blast2/

BLITZ allows very fast searches of protein sequence databases for local
similarities. The software used by the blitz service is based on MPsrch
and Scanps. Send "help" to blitz@ebi.ac.uk or use URLs
http://www.ebi.ac.uk/MPsrch/ and http://www.ebi.ac.uk/scanps/

6 DISTRIBUTION FILES

6.1 Release 73 Files

The release contains the files shown below. File sizes are given as numbers
of records.

File Number File Name Description Number of Records

1 CRC.TXT Checksum CRC uncompressed files 278
2 CRC_GZ.TXT Checksum CRC compressed files 278
3 DELETEAC.TXT Deleted accession numbers 187325
4 EMBL.CON Constructed Sequences 13437
5 FTABLE.TXT Feature Table Documentation 474
6 RELNOTES.TXT Release Notes (this document) 1327
7 SUBINFO.TXT Data Submission Documentation 390
8 UPDATE.TXT Data Update Form 107
9 USRMAN.TXT User Manual 1596
10 ACNUMBER.NDX Accession Number Index 20894252
11 CITATION.NDX Citation Index 3015185
12 DIVISION.NDX Division Index 25
13 KEYWORD.NDX Keyword Index 7539949
14 SHORTDIR.NDX Short Directory Index 50666308
15 SPECIES.NDX Species Index 7143659
16 EST_FUN01.DAT EST Sequences 6375143
17 EST_FUN02.DAT EST Sequences 2420352
18 EST_HUM01.DAT EST Sequences 7333014
19 EST_HUM02.DAT EST Sequences 7505165
20 EST_HUM03.DAT EST Sequences 7204627
21 EST_HUM04.DAT EST Sequences 7094572
22 EST_HUM05.DAT EST Sequences 7217072
23 EST_HUM06.DAT EST Sequences 7233645
24 EST_HUM07.DAT EST Sequences 7272291
25 EST_HUM08.DAT EST Sequences 6955801
26 EST_HUM09.DAT EST Sequences 6786550
27 EST_HUM10.DAT EST Sequences 7319274
28 EST_HUM11.DAT EST Sequences 7206978
29 EST_HUM12.DAT EST Sequences 7015107
30 EST_HUM13.DAT EST Sequences 7050431
31 EST_HUM14.DAT EST Sequences 6608002
32 EST_HUM15.DAT EST Sequences 7179386
33 EST_HUM16.DAT EST Sequences 7309111
34 EST_HUM17.DAT EST Sequences 7297241
35 EST_HUM18.DAT EST Sequences 7343197
36 EST_HUM19.DAT EST Sequences 7509606
37 EST_HUM20.DAT EST Sequences 7391010
38 EST_HUM21.DAT EST Sequences 7460626
39 EST_HUM22.DAT EST Sequences 7109359
40 EST_HUM23.DAT EST Sequences 7201641
41 EST_HUM24.DAT EST Sequences 6963596
42 EST_HUM25.DAT EST Sequences 7409104
43 EST_HUM26.DAT EST Sequences 7649761
44 EST_HUM27.DAT EST Sequences 7301340
45 EST_HUM28.DAT EST Sequences 7095929
46 EST_HUM29.DAT EST Sequences 7293403
47 EST_HUM30.DAT EST Sequences 7388960
48 EST_HUM31.DAT EST Sequences 7622217
49 EST_HUM32.DAT EST Sequences 8189837
50 EST_HUM33.DAT EST Sequences 7659621
51 EST_HUM34.DAT EST Sequences 7969240
52 EST_HUM35.DAT EST Sequences 7314376
53 EST_HUM36.DAT EST Sequences 7411549
54 EST_HUM37.DAT EST Sequences 7641701
55 EST_HUM38.DAT EST Sequences 7575333
56 EST_HUM39.DAT EST Sequences 8111579
57 EST_HUM40.DAT EST Sequences 7369420
58 EST_HUM41.DAT EST Sequences 7179631
59 EST_HUM42.DAT EST Sequences 7391983
60 EST_HUM43.DAT EST Sequences 7369009
61 EST_HUM44.DAT EST Sequences 7459552
62 EST_HUM45.DAT EST Sequences 7541066
63 EST_HUM46.DAT EST Sequences 7532994
64 EST_HUM47.DAT EST Sequences 6755787
65 EST_HUM48.DAT EST Sequences 6795182
66 EST_HUM49.DAT EST Sequences 6158759
67 EST_INV01.DAT EST Sequences 6605844
68 EST_INV02.DAT EST Sequences 6176252
69 EST_INV03.DAT EST Sequences 5705260
70 EST_INV04.DAT EST Sequences 5878111
71 EST_INV05.DAT EST Sequences 7174229
72 EST_INV06.DAT EST Sequences 7037426
73 EST_INV07.DAT EST Sequences 7363144
74 EST_INV08.DAT EST Sequences 6276630
75 EST_INV09.DAT EST Sequences 5797544
76 EST_INV10.DAT EST Sequences 7242011
77 EST_INV11.DAT EST Sequences 6765000
78 EST_INV12.DAT EST Sequences 7006928
79 EST_INV13.DAT EST Sequences 5742327
80 EST_INV14.DAT EST Sequences 5466496
81 EST_INV15.DAT EST Sequences 5411826
82 EST_INV16.DAT EST Sequences 5633017
83 EST_INV17.DAT EST Sequences 3111886
84 EST_MAM01.DAT EST Sequences 6414515
85 EST_MAM02.DAT EST Sequences 6783574
86 EST_MAM03.DAT EST Sequences 6957822
87 EST_MAM04.DAT EST Sequences 5077330
88 EST_MUS01.DAT EST Sequences 7588064
89 EST_MUS02.DAT EST Sequences 7833536
90 EST_MUS03.DAT EST Sequences 7487534
91 EST_MUS04.DAT EST Sequences 7051477
92 EST_MUS05.DAT EST Sequences 7659033
93 EST_MUS06.DAT EST Sequences 10008794
94 EST_MUS07.DAT EST Sequences 9133953
95 EST_MUS08.DAT EST Sequences 8597297
96 EST_MUS09.DAT EST Sequences 9964469
97 EST_MUS10.DAT EST Sequences 9962907
98 EST_MUS11.DAT EST Sequences 9933001
99 EST_MUS12.DAT EST Sequences 9833960
100 EST_MUS13.DAT EST Sequences 9752033
101 EST_MUS14.DAT EST Sequences 10006172
102 EST_MUS15.DAT EST Sequences 10746131
103 EST_MUS16.DAT EST Sequences 10873733
104 EST_MUS17.DAT EST Sequences 8640494
105 EST_MUS18.DAT EST Sequences 7605320
106 EST_MUS19.DAT EST Sequences 7435947
107 EST_MUS20.DAT EST Sequences 7572154
108 EST_MUS21.DAT EST Sequences 7076778
109 EST_MUS22.DAT EST Sequences 7430196
110 EST_MUS23.DAT EST Sequences 8246377
111 EST_MUS24.DAT EST Sequences 8048316
112 EST_MUS25.DAT EST Sequences 7560721
113 EST_MUS26.DAT EST Sequences 7274202
114 EST_MUS27.DAT EST Sequences 7938601
115 EST_MUS28.DAT EST Sequences 7723407
116 EST_MUS29.DAT EST Sequences 4204936
117 EST_PLN01.DAT EST Sequences 6849381
118 EST_PLN02.DAT EST Sequences 6406930
119 EST_PLN03.DAT EST Sequences 5983878
120 EST_PLN04.DAT EST Sequences 6321347
121 EST_PLN05.DAT EST Sequences 6714535
122 EST_PLN06.DAT EST Sequences 7390691
123 EST_PLN07.DAT EST Sequences 7089983
124 EST_PLN08.DAT EST Sequences 7045235
125 EST_PLN09.DAT EST Sequences 7227826
126 EST_PLN10.DAT EST Sequences 7462989
127 EST_PLN11.DAT EST Sequences 7296558
128 EST_PLN12.DAT EST Sequences 7182412
129 EST_PLN13.DAT EST Sequences 7183016
130 EST_PLN14.DAT EST Sequences 6710038
131 EST_PLN15.DAT EST Sequences 7211895
132 EST_PLN16.DAT EST Sequences 7675607
133 EST_PLN17.DAT EST Sequences 5991886
134 EST_PLN18.DAT EST Sequences 6690866
135 EST_PLN19.DAT EST Sequences 7896631
136 EST_PLN20.DAT EST Sequences 7315504
137 EST_PLN21.DAT EST Sequences 7186888
138 EST_PLN22.DAT EST Sequences 7118885
139 EST_PLN23.DAT EST Sequences 7362420
140 EST_PLN24.DAT EST Sequences 7211537
141 EST_PLN25.DAT EST Sequences 6207235
142 EST_PLN26.DAT EST Sequences 6276389
143 EST_PLN27.DAT EST Sequences 5788069
144 EST_PLN28.DAT EST Sequences 6383395
145 EST_PLN29.DAT EST Sequences 1934164
146 EST_PRO.DAT EST Sequences 41264
147 EST_ROD01.DAT EST Sequences 7213611
148 EST_ROD02.DAT EST Sequences 7349462
149 EST_ROD03.DAT EST Sequences 7883067
150 EST_ROD04.DAT EST Sequences 6220402
151 EST_UNC.DAT EST Sequences 723
152 EST_VRT01.DAT EST Sequences 6982845
153 EST_VRT02.DAT EST Sequences 5992420
154 EST_VRT03.DAT EST Sequences 7323311
155 EST_VRT04.DAT EST Sequences 7179850
156 EST_VRT05.DAT EST Sequences 7314440
157 EST_VRT06.DAT EST Sequences 5818079
158 EST_VRT07.DAT EST Sequences 6752343
159 EST_VRT08.DAT EST Sequences 7436496
160 EST_VRT09.DAT EST Sequences 7379526
161 EST_VRT10.DAT EST Sequences 7522403
162 EST_VRT11.DAT EST Sequences 6654927
163 EST_VRT12.DAT EST Sequences 4459845
164 FUN.DAT Fungi Sequences 6010827
165 GSS_FUN.DAT Genome Survey Sequences 6610633
166 GSS_HUM01.DAT Genome Survey Sequences 6016393
167 GSS_HUM02.DAT Genome Survey Sequences 6000629
168 GSS_HUM03.DAT Genome Survey Sequences 6148434
169 GSS_HUM04.DAT Genome Survey Sequences 6434740
170 GSS_HUM05.DAT Genome Survey Sequences 6510575
171 GSS_HUM06.DAT Genome Survey Sequences 6481559
172 GSS_HUM07.DAT Genome Survey Sequences 6580378
173 GSS_HUM08.DAT Genome Survey Sequences 6408229
174 GSS_HUM09.DAT Genome Survey Sequences 4350709
175 GSS_INV01.DAT Genome Survey Sequences 6846270
176 GSS_INV02.DAT Genome Survey Sequences 6946885
177 GSS_INV03.DAT Genome Survey Sequences 7373386
178 GSS_INV04.DAT Genome Survey Sequences 6498946
179 GSS_INV05.DAT Genome Survey Sequences 6298586
180 GSS_INV06.DAT Genome Survey Sequences 1218861
181 GSS_MAM01.DAT Genome Survey Sequences 7074821
182 GSS_MAM02.DAT Genome Survey Sequences 4162326
183 GSS_MUS01.DAT Genome Survey Sequences 7158644
184 GSS_MUS02.DAT Genome Survey Sequences 7182266
185 GSS_MUS03.DAT Genome Survey Sequences 7774907
186 GSS_MUS04.DAT Genome Survey Sequences 8163492
187 GSS_MUS05.DAT Genome Survey Sequences 7951661
188 GSS_MUS06.DAT Genome Survey Sequences 7608860
189 GSS_MUS07.DAT Genome Survey Sequences 7852442
190 GSS_MUS08.DAT Genome Survey Sequences 7739116
191 GSS_MUS09.DAT Genome Survey Sequences 7487387
192 GSS_MUS10.DAT Genome Survey Sequences 3065646
193 GSS_PHG.DAT Genome Survey Sequences 7930
194 GSS_PLN01.DAT Genome Survey Sequences 7452483
195 GSS_PLN02.DAT Genome Survey Sequences 7063977
196 GSS_PLN03.DAT Genome Survey Sequences 6371278
197 GSS_PLN04.DAT Genome Survey Sequences 5950592
198 GSS_PLN05.DAT Genome Survey Sequences 6141289
199 GSS_PLN06.DAT Genome Survey Sequences 5915003
200 GSS_PLN07.DAT Genome Survey Sequences 6375699
201 GSS_PLN08.DAT Genome Survey Sequences 6338531
202 GSS_PLN09.DAT Genome Survey Sequences 6339093
203 GSS_PLN10.DAT Genome Survey Sequences 6459290
204 GSS_PLN11.DAT Genome Survey Sequences 37654
205 GSS_PRO.DAT Genome Survey Sequences 788718
206 GSS_ROD01.DAT Genome Survey Sequences 6918296
207 GSS_ROD02.DAT Genome Survey Sequences 7113805
208 GSS_ROD03.DAT Genome Survey Sequences 7137747
209 GSS_ROD04.DAT Genome Survey Sequences 517234
210 GSS_VRL.DAT Genome Survey Sequences 126246
211 GSS_VRT01.DAT Genome Survey Sequences 7438011
212 GSS_VRT02.DAT Genome Survey Sequences 7328881
213 GSS_VRT03.DAT Genome Survey Sequences 3847978
214 HTC.DAT High throughput cDNAs 4413137
215 HTG_HUM01.DAT High Throughput Genome Sequences 8690973
216 HTG_HUM02.DAT High Throughput Genome Sequences 8880540
217 HTG_HUM03.DAT High Throughput Genome Sequences 8693938
218 HTG_HUM04.DAT High Throughput Genome Sequences 8317314
219 HTG_HUM05.DAT High Throughput Genome Sequences 3361981
220 HTG_INV01.DAT High Throughput Genome Sequences 2007807
221 HTG_INV02.DAT High Throughput Genome Sequences 2239775
222 HTG_MAM.DAT High Throughput Genome Sequences 2999564
223 HTG_MUS01.DAT High Throughput Genome Sequences 9650351
224 HTG_MUS02.DAT High Throughput Genome Sequences 9979200
225 HTG_MUS03.DAT High Throughput Genome Sequences 10534863
226 HTG_MUS04.DAT High Throughput Genome Sequences 2551549
227 HTG_OTHER.DAT High Throughput Genome Sequences 103342
228 HTG_PLN.DAT High Throughput Genome Sequences 6323749
229 HTG_ROD01.DAT High Throughput Genome Sequences 13698142
230 HTG_ROD02.DAT High Throughput Genome Sequences 13569850
231 HTG_ROD03.DAT High Throughput Genome Sequences 13228869
232 HTG_ROD04.DAT High Throughput Genome Sequences 12573760
233 HTG_ROD05.DAT High Throughput Genome Sequences 12301822
234 HTG_ROD06.DAT High Throughput Genome Sequences 12357597
235 HTG_ROD07.DAT High Throughput Genome Sequences 9011240
236 HTG_VRT.DAT High Throughput Genome Sequences 2854326
237 HTGO_HUM.DAT High Throughput Genome Sequences phase 0 6633121
238 HTGO_MUS.DAT High Throughput Genome Sequences phase 0 6707942
239 HTGO_OTHER.DAT High Throughput Genome Sequences phase 0 42757
240 HUM01.DAT Human Sequences 5029177
241 HUM02.DAT Human Sequences 28372672
242 HUM03.DAT Human Sequences 11106654
243 HUM04.DAT Human Sequences 1045837
244 HUM05.DAT Human Sequences 1348766
245 HUM06.DAT Human Sequences 1094217
246 HUM07.DAT Human Sequences 1029938
247 HUM08.DAT Human Sequences 1039696
248 HUM09.DAT Human Sequences 9859535
249 HUM10.DAT Human Sequences 7897860
250 HUM11.DAT Human Sequences 1033036
251 HUM12.DAT Human Sequences 3064424
252 HUM13.DAT Human Sequences 1762084
253 HUM14.DAT Human Sequences 2064733
254 HUM15.DAT Human Sequences 809376
255 HUM16.DAT Human Sequences 564115
256 HUM17.DAT Human Sequences 625646
257 HUM18.DAT Human Sequences 1485653
258 HUM19.DAT Human Sequences 1435561
259 HUM20.DAT Human Sequences 1633138
260 HUM21.DAT Human Sequences 892976
261 HUM22.DAT Human Sequences 943399
262 HUM23.DAT Human Sequences 797317
263 HUM24.DAT Human Sequences 222725
264 INV01.DAT Invertebrate Sequences 10663300
265 INV02.DAT Invertebrate Sequences 6379812
266 INV03.DAT Invertebrate Sequences 650577
267 MAM.DAT Other Mammal Sequences 3628316
268 MUS.DAT Mus musculus Sequences 17438873
269 ORG.DAT Organelle Sequences 12425338
270 PAT01.DAT Patent Sequences 7435654
271 PAT02.DAT Patent Sequences 8392712
272 PAT03.DAT Patent Sequences 9484539
273 PAT04.DAT Patent Sequences 10411529
274 PAT05.DAT Patent Sequences 4141311
275 PHG.DAT Bacteriophage Sequences 348251
276 PLN.DAT Plant Sequences 17404353
277 PRO01.DAT Prokaryote Sequences 9125660
278 PRO02.DAT Prokaryote Sequences 6540736
279 PRO03.DAT Prokaryote Sequences 6329704
280 PRO04.DAT Prokaryote Sequences 1425835
281 ROD.DAT Rodent Sequences 1981634
282 STS.DAT STS Sequences 10660715
283 SYN.DAT Synthetic Sequences 647744
284 UNC.DAT Unclassified Sequences 145719
285 VRL.DAT Viral Sequences 13030513
286 VRT.DAT Other Vertebrate Sequences 3901972
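
Each row of the table above consists of a file number, a file name, a
free-text description and a record count. A minimal Python sketch for reading
such rows and totalling the record counts per description is given below. It
assumes the whitespace-separated layout shown here; the names FileEntry,
parse_row and records_by_description are hypothetical, and the sample rows
are copied from the table.

```python
import re
from collections import defaultdict
from typing import Iterable, NamedTuple

# number, file name, description (may contain spaces), number of records
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.*\S)\s+(\d+)\s*$")


class FileEntry(NamedTuple):
    number: int
    name: str
    description: str
    records: int


def parse_row(line: str) -> FileEntry:
    """Split one table row; the last token is always the record count."""
    match = ROW.match(line)
    if match is None:
        raise ValueError(f"unrecognised table row: {line!r}")
    number, name, description, records = match.groups()
    return FileEntry(int(number), name, description, int(records))


def records_by_description(lines: Iterable[str]) -> dict[str, int]:
    """Sum the record counts of all rows sharing the same description."""
    totals: dict[str, int] = defaultdict(int)
    for line in lines:
        if line.strip():
            entry = parse_row(line)
            totals[entry.description] += entry.records
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "146 EST_PRO.DAT EST Sequences 41264",
        "151 EST_UNC.DAT EST Sequences 723",
        "193 GSS_PHG.DAT Genome Survey Sequences 7930",
        "237 HTGO_HUM.DAT High Throughput Genome Sequences phase 0 6633121",
    ]
    for description, total in records_by_description(sample).items():
        print(f"{description}: {total}")
```

Because the description may itself end in a digit (as in "phase 0"), the
pattern treats only the final whitespace-separated token as the record count.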

APPENDIX A

DATABASE GROWTH TABLE

The following table shows the growth of the EMBL Nucleotide Sequence Database
at each release.

Release Month Entries Nucleotides

1 06/1982 568 585433
2 04/1983 811 1114447
3 12/1983 1481 1654863
4 08/1984 1698 2147205
5 04/1985 2378 2874493
6 08/1985 4835 4567592
7 12/1985 5789 5622638
8 04/1986 6395 6353040
9 09/1986 7630 7813214
10 12/1986 8817 9766948
11 04/1987 11621 12189783
12 07/1987 12706 13638061
13 10/1987 14397 16023478
14 01/1988 15344 17272160
15 05/1988 17961 20318442
16 08/1988 19592 22625941
17 11/1988 20695 24211054
18 02/1989 22938 27249830
19 05/1989 24365 29066676
20 08/1989 26223 31240948
21 11/1989 28679 34748087
22 02/1990 31508 38165786
23 05/1990 34902 42923803
24 08/1990 37784 47354438
25 11/1990 41580 52900354
26 02/1991 43745 55859549
27 05/1991 46871 59915244
28 09/1991 54558 70448052
29 12/1991 57655 75400487
30 03/1992 63378 83574342
31 06/1992 72481 94390065
32 09/1992 79377 101292310
33 12/1992 89100 111413979
34 03/1993 99591 121420828
35 06/1993 108973 131880111
36 09/1993 127933 145401156
37 12/1993 146576 158171400
38 03/1994 167777 177550115
39 06/1994 182615 192195819
40 09/1994 209352 211017104
41 12/1994 230950 226259607
42 03/1995 303206 262559786
43 06/1995 420111 315840053
44 09/1995 506190 363273777
45 12/1995 622566 427620278
46 03/1996 701246 473691480
47 06/1996 827174 550739395
48 09/1996 928067 608931850
49 12/1996 1047263 696183789
50 03/1997 1187455 789755858
51 06/1997 1432941 931351601
52 10/1997 1787004 1181167498
53 12/1997 1917868 1281391651
54 03/1998 2125225 1427634373
55 06/1998 2330040 1607673907
56 09/1998 2689618 1904091473
57 12/1998 3046471 2164718256
58 03/1999 3272064 2355200790
59 06/1999 3952878 2924568545
60 09/1999 4719266 3543553093
61 12/1999 5303436 4508169737
62 03/2000 5865742 6120908677
63 06/2000 6760113 8255674441
64 09/2000 8344436 9650223037
65 12/2000 9549382 10710321435
66 03/2001 11169673 11916112872
67 06/2001 12044420 12821742622
68 09/2001 12964797 13727100206
69 12/2001 14366182 15383451165
70 03/2002 15851373 17807926047
71 06/2002 17226422 20020556107
72 09/2002 18324246 23090186146
73 12/2002 20857746 27903283528
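
The growth between two releases and the mean entry length can be derived
directly from the rows of this table. The following Python sketch does so for
Releases 72 and 73; the names Release, parse_release, growth_factor and
mean_entry_length are hypothetical, and the two sample rows are copied from
the table above.

```python
from typing import NamedTuple


class Release(NamedTuple):
    number: int
    month: str        # MM/YYYY, as printed in the table
    entries: int
    nucleotides: int


def parse_release(line: str) -> Release:
    """Parse one row: release number, month, entries, nucleotides."""
    number, month, entries, nucleotides = line.split()
    return Release(int(number), month, int(entries), int(nucleotides))


def growth_factor(prev: Release, curr: Release) -> tuple[float, float]:
    """Return (entries ratio, nucleotides ratio) of curr relative to prev."""
    return curr.entries / prev.entries, curr.nucleotides / prev.nucleotides


def mean_entry_length(release: Release) -> float:
    """Average number of nucleotides per entry in a release."""
    return release.nucleotides / release.entries


if __name__ == "__main__":
    r72 = parse_release("72 09/2002 18324246 23090186146")
    r73 = parse_release("73 12/2002 20857746 27903283528")
    entries_ratio, nucleotides_ratio = growth_factor(r72, r73)
    print(f"entries grew by a factor of {entries_ratio:.3f}")
    print(f"nucleotides grew by a factor of {nucleotides_ratio:.3f}")
    print(f"mean entry length in release 73: {mean_entry_length(r73):.0f}")
```

For these two rows the nucleotide count grows by roughly a fifth between
Release 72 and Release 73, while the number of entries grows more slowly.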