This document was retrieved from a search engine cache. Address of the original document: http://mccmb.belozersky.msu.ru/2013/abstracts/abstracts/222.pdf
Last modified: Mon Jun 3 22:43:20 2013
Indexed: Thu Feb 27 21:12:32 2014
The database of triplet periodicity change points
Y.M. Suvorova, E.V. Korotkov
Centre of Bioengineering, Russian Academy of Sciences, 117312, prospect 60-tya Oktyabrya 7/1, Moscow, Russia, suvorovay@gmail.com

It is well known that triplet periodicity (TP) is a distinguishing feature of the protein-coding sequences of the majority of living organisms [1]. It is also known that, at some stages of evolution, mutations such as fusions play an important role in the formation of new protein-coding sequences [2]. If two genes with different triplet periodicity types are fused, the resulting sequence consists of two successive parts with different TP types, so a triplet periodicity change point can be found between these parts [3]. To make this phenomenon available for study, we collected triplet periodicity change point events in a database.

TPCPDB (Triplet Periodicity Change Point Database) is an online database of triplet periodicity change points found in protein-coding sequences of prokaryotic genomes from GenBank.

To find triplet periodicity change points in protein-coding sequences we used a method based on the maximum difference of TP between adjacent subsequences. A sliding window was used to find the position of maximal TP difference in a sequence, and the Monte-Carlo method was used to estimate the statistical significance of the change points found. Moving a sliding pointer x along a sequence S of length L, we considered adjacent regions of length l (from 60 to 600 nt) to the left and to the right of x. To study and compare triplet periodicity in these regions, 4×3 frequency matrices were used. An element of such a matrix is the number of nucleotides of type i (i=1 for 'a', i=2 for 't', i=3 for 'g' and i=4 for 'c') found at position j of a codon (j=1,2,3) in the considered region. To eliminate the influence of enrichment in certain nucleotide types on the difference, we used the following element-wise transformation:

\[
n_k(i,j) = \frac{m_k(i,j) - l\,p_k(i,j)}{\sqrt{l\,p_k(i,j)\,\bigl(1 - p_k(i,j)\bigr)}}, \qquad i = 1,2,3,4;\ j = 1,2,3,
\]
and
\[
p_k(i,j) = \frac{\Bigl(\sum_{i=1}^{4} m_k(i,j)\Bigr)\Bigl(\sum_{j=1}^{3} m_k(i,j)\Bigr)}{l^{2}}.
\]
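
The transformation above can be illustrated with a minimal Python sketch. It assumes that m(i, j) is the raw 4×3 count matrix of a region and that l is the number of counted nucleotides. The function names, the choice to skip ambiguous symbols, and the handling of cells with zero expected variance are assumptions made for this illustration and are not taken from the abstract.

```python
import math

NUCLEOTIDES = "atgc"  # i = 1..4 in the text: a, t, g, c


def count_matrix(region):
    """Return the 4x3 matrix m(i, j): count of nucleotide i at codon position j."""
    m = [[0] * 3 for _ in NUCLEOTIDES]
    for pos, base in enumerate(region.lower()):
        i = NUCLEOTIDES.find(base)
        if i >= 0:  # assumption: ambiguous symbols such as 'n' are skipped
            m[i][pos % 3] += 1
    return m


def normalize(m):
    """Apply n = (m - l*p) / sqrt(l*p*(1 - p)) with p = row_sum * col_sum / l^2."""
    l = sum(map(sum, m))
    if l == 0:
        raise ValueError("empty count matrix")
    row = [sum(r) for r in m]                                 # totals per nucleotide
    col = [sum(m[i][j] for i in range(4)) for j in range(3)]  # totals per codon position
    n = [[0.0] * 3 for _ in range(4)]
    for i in range(4):
        for j in range(3):
            p = row[i] * col[j] / l ** 2
            var = l * p * (1 - p)
            # Assumption: a cell with zero expected variance is left at 0.
            n[i][j] = (m[i][j] - l * p) / math.sqrt(var) if var > 0 else 0.0
    return n


if __name__ == "__main__":
    for row in normalize(count_matrix("atgatgatgatc")):
        print(["%.2f" % value for value in row])
```

A region with perfect periodicity, such as repeated "atg", gives large positive values on the cells that are occupied and negative values elsewhere, while a region in which every codon position has the same composition gives values near zero.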


To take into account possible reading frame shifts near the position x, we considered all three reading frames (and the corresponding matrices) after the position [4]. We therefore compared the three matrices of the right-side subsequence (W_k, k=1,2,3) with the one to the left of x (denoted V):
\[
d = \min_{k=1,2,3} D_k(x,l), \qquad D_k(x,l) = \sum_{i=1}^{4}\sum_{j=1}^{3} \frac{\bigl(v(i,j) - w_k(i,j)\bigr)^{2}}{2}.
\]
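
As a rough illustration of this comparison, the sketch below computes D_k for each right-side matrix and takes the minimum. It operates on 4×3 matrices that are assumed to be already transformed. The helper `rotate_columns` reflects an assumption of this sketch, namely that the three reading-frame matrices of one region differ only by a cyclic shift of the codon-position columns. All function names are hypothetical.

```python
def squared_difference(v, w):
    """D = sum over i, j of (v(i,j) - w(i,j))^2 / 2 for two 4x3 matrices."""
    return sum((v[i][j] - w[i][j]) ** 2
               for i in range(4) for j in range(3)) / 2


def rotate_columns(w, shift):
    """Cyclically shift codon-position columns (assumed model of a frame change)."""
    return [[row[(j + shift) % 3] for j in range(3)] for row in w]


def tp_difference(v, ws):
    """Return (d, k): the minimum of D_k over the frames and the 1-based best k."""
    scores = [squared_difference(v, w) for w in ws]
    best = min(range(len(scores)), key=scores.__getitem__)
    return scores[best], best + 1


if __name__ == "__main__":
    v = [[1.0, -1.0, 0.0],
         [0.5, 0.0, -0.5],
         [0.0, 2.0, -2.0],
         [-1.0, 0.0, 1.0]]
    w = rotate_columns(v, 1)  # same periodicity as v, read in a shifted frame
    frames = [rotate_columns(w, s) for s in range(3)]
    print(tp_difference(v, frames))  # the frame that undoes the shift gives d = 0
```

Taking the minimum over the three frames means that a pure frame shift with unchanged periodicity yields a small d, so only a genuine change of TP type produces a large value.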

Moving the sliding pointer x along the sequence, we looked for the position of the maximum TP difference (d_max). The Monte-Carlo method was used to assess the statistical significance of each case found. For each sequence considered, a set of N=1000 random sequences was generated by trio-shuffling of the sequence S. The mean value and the deviation of d_max were determined on this set, and the final value for the sequence was defined as

\[
Z = \frac{d_{\max}(S) - \overline{d_{\max}(S)}}{\sqrt{D\bigl(d_{\max}(S)\bigr)}}.
\]
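
The Monte-Carlo estimate can be sketched in Python as follows. The sketch assumes that trio-shuffling means randomly permuting the sequence in units of three bases and that the deviation is the population standard deviation over the random set. The `statistic` argument is a hypothetical callable standing in for the d_max computation, and the toy statistic in the usage example is not the TP difference from the abstract.

```python
import random
import statistics


def trio_shuffle(seq, rng):
    """Permute the sequence in units of three bases; trailing bases stay in place."""
    usable = len(seq) - len(seq) % 3
    triplets = [seq[p:p + 3] for p in range(0, usable, 3)]
    rng.shuffle(triplets)
    return "".join(triplets) + seq[usable:]


def z_score(seq, statistic, n_random=1000, seed=0):
    """Z = (statistic(S) - mean over shuffles) / standard deviation over shuffles."""
    rng = random.Random(seed)
    observed = statistic(seq)
    null = [statistic(trio_shuffle(seq, rng)) for _ in range(n_random)]
    spread = statistics.pstdev(null)  # assumption: population standard deviation
    if spread == 0:
        raise ValueError("statistic is constant under shuffling")
    return (observed - statistics.mean(null)) / spread


def toy_statistic(seq):
    """Hypothetical stand-in for d_max: imbalance of 'a' between the two halves."""
    half = len(seq) // 2
    return abs(seq[:half].count("a") - seq[half:].count("a"))


if __name__ == "__main__":
    fused = "aaa" * 20 + "ccc" * 20  # two halves with different composition
    print(z_score(fused, toy_statistic, n_random=200, seed=1))
```

Shuffling whole triplets keeps the overall codon composition of the sequence while destroying the order of the regions, which is why a sequence with a real change point scores far above its shuffled copies.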

We included in the database those sequences with triplet periodicity change points for which the Z value exceeds the threshold, Z > Z_0 = 4.0 (which corresponds to a 5% probability of a type I error). The current version of the database includes 179 238 records of protein-coding sequences with triplet periodicity change points.

The database can be searched by genome sequence ID or name (optionally restricted to a specific region of the sequence) and/or by gene identifier (GenBank). Searching for change points by protein description is also supported. Results are returned as a table of change points. Each record includes: internal change point number; gene identifier; genome name; product description; change point position; window size; Z-value. The database URL: http://victoria.biengi.ac.ru/tpcpdb/.

1. E.N. Trifonov (1998) 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A, 249: 511-516.
2. S. Pasek, J.L. Risler, P. Brezellec (2006) Gene fusion/fission is a major contributor to evolution of multi-domain bacterial proteins. Bioinformatics, 22: 1418-1423.
3. Y.M. Suvorova, V.M. Rudenko, E.V. Korotkov (2012) Detection change points of triplet periodicity of gene. Gene, 491: 58-64.
4. V. Rudenko, Y. Suvorova, E. Korotkov (2011) Detection of Possible Reading Frame Shifts in Genes Using Triplet Frequencies Homogeneity. Austrian Journal of Statistics, 40: 137-146.