Original document: http://kodomo.fbb.msu.ru/~dendik/data/python/2012-04-28/allpy.pdf

Introducing allpy,
yet another library for dealing with alignments in Python.

Daniil Alexeyevsky
Moscow State University, Faculty of Bioengineering and Bioinformatics

April 28, 2012

Outline
What's wrong with the existing libraries?
Our definitions
Examples
What works
Projects using allpy
Future

What is a biological sequence alignment?
What it is

A polymer is a molecule consisting of very similar monomers.
We may know as little about a polymer as its primary structure, as much as its tertiary structure, or even more (e.g. secondary structure!).
We may refer to a polymer as a sequence when most of our knowledge is about its primary or secondary structure.
Some parts of different sequences have something in common -- it's called homology.
An alignment is a means to store data about monomer homology.

What is a biological sequence alignment?
What it isn't

Alignment is not a table.
Sequence in alignment is not a line with letters.
There are no gaps in sequences.
Monomer is not a letter.
Obvious? Not for the developers of alignment libraries!

Our definitions
Sequence and monomer

A monomer is a collection of attributes. Among them are:
  code1 -- one-letter code
  code3 -- three-letter code (for PDB)
  is modified
  ...

A sequence is a list (array) of monomers. It may also contain some data specific to the sequence as a whole:
  name -- sequence identifier
  description -- comment, often embedded in alignment files
  source
  ...
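
A minimal sketch of these two definitions in plain Python. The class layout, the `extra` slot and the small `CODE3` table are assumptions made for illustration; they are not allpy's actual classes. The sequence name and description are borrowed from the markup example file in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class Monomer:
    """A bag of attributes; the one-letter code is only one of them."""
    code1: str
    code3: str
    extra: dict = field(default_factory=dict)  # hypothetical slot for other data


@dataclass
class Sequence:
    """A list of monomers plus data about the sequence as a whole."""
    name: str
    description: str = ""
    monomers: list = field(default_factory=list)

    def __len__(self):
        return len(self.monomers)

    def to_string(self):
        # The familiar "line with letters" is only a view of the monomers.
        return "".join(monomer.code1 for monomer in self.monomers)


# Tiny illustrative table of three-letter codes.
CODE3 = {"G": "GLY", "D": "ASP", "F": "PHE"}

seq = Sequence(
    name="2k6o_A/3:30",
    description="CAMP_HUMAN/136-163",
    monomers=[Monomer(letter, CODE3[letter]) for letter in "GDFF"],
)
```

The point of the design is that the string of letters is derived from the monomers, not the other way round.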

Our definitions
Column and alignment

Monomer
Sequence
Column maps a sequence to a monomer. When we have a sequence we may ask a column: do you have this sequence? Which monomer in this sequence belongs to you?
Alignment is a list of sequences and a list of columns. It may also store some metadata.
It is definitely not a table with letters. It is a if we leave sequences unmodified and have these bendy columns. It is still simpler to represent it as a table with gaps and minuses when we want to look at it.
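
A minimal sketch of the column idea, assuming a column can be modelled as a plain dict from sequence to monomer. The helper names `columns_from_rows` and `render` are hypothetical and not part of allpy. Sequences hold no gaps; a gap appears only when the alignment is rendered as a table.

```python
class Monomer:
    def __init__(self, code1):
        self.code1 = code1


class Sequence:
    def __init__(self, name, letters):
        self.name = name
        # Only real monomers are stored; there are no gaps in a sequence.
        self.monomers = [Monomer(letter) for letter in letters]


def columns_from_rows(sequences, rows):
    """Build columns (dicts: sequence -> monomer) from gapped text rows."""
    columns = [dict() for _ in range(len(rows[0]))]
    for sequence, row in zip(sequences, rows):
        monomers = iter(sequence.monomers)
        for column, letter in zip(columns, row):
            if letter != "-":
                column[sequence] = next(monomers)
    return columns


def render(sequences, columns):
    """Show the alignment as a table; a gap is a column lacking the sequence."""
    return [
        "".join(col[seq].code1 if seq in col else "-" for col in columns)
        for seq in sequences
    ]


rows = ["GDFFRK", "G-FF-K"]
sequences = [Sequence("a", "GDFFRK"), Sequence("b", "GFFK")]
columns = columns_from_rows(sequences, rows)
```

Asking a column "do you have this sequence?" is then `sequence in column`, and "which monomer belongs to you?" is `column[sequence]`.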

Block in alignment

A block is a means of selecting part of an alignment. You may think of it as a rectangle in the alignment.

Each group of same-colored monomers in the picture forms a block.

Definition
A block is an intersection of an arbitrary (sub)set of sequences in the alignment and a (sub)set of adjacent columns in the alignment.
Block and alignment are the same thing!
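
A minimal sketch of why a block can be "the same thing" as an alignment: it is built from the same two lists, only shorter. For brevity, sequences are represented by their names and monomers by letters here, which is a simplification; the `Alignment` class and helpers below are hypothetical stand-ins, not allpy's API.

```python
class Alignment:
    """Minimal stand-in: a list of sequence names and a list of columns."""

    def __init__(self, sequences, columns):
        self.sequences = list(sequences)
        self.columns = list(columns)


def from_rows(rows):
    """Build an alignment from gapped rows given as {name: text}."""
    names = list(rows)
    width = len(next(iter(rows.values())))
    columns = [
        {name: rows[name][i] for name in names if rows[name][i] != "-"}
        for i in range(width)
    ]
    return Alignment(names, columns)


def render(alignment):
    """Table view restricted to the sequences the alignment knows about."""
    return {
        name: "".join(col.get(name, "-") for col in alignment.columns)
        for name in alignment.sequences
    }


alignment = from_rows({"a": "GDFFRKSK", "b": "G-FFRK-K", "c": "GDF--KSK"})

# A block: a subset of sequences and a run of adjacent columns.
# It shares the column objects with the parent alignment.
block = Alignment(["a", "c"], alignment.columns[2:6])
```

Because the block reuses the parent's column objects, anything that works on an alignment works on a block without copying data.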

Markup
We sometimes (very often) want to add marks on an alignment. Usually, the marks fall into one of three categories:
  they mark a region of the alignment (e.g. "this is a well-defined part") -- this is not markup, this is a block!
  they mark one monomer (e.g. monomer number in sequence; secondary structure it belongs to; sequence read quality; PDB residue number ...). We already have it! It's the data stored in the monomers.
  they mark one column (e.g. column number in alignment; consensus; percent of conserved residues ...). We define it as a mapping of column to the data we need.
Many people actually reinvent this thing over and over by themselves, and yet most libraries and formats have little or no support for it!
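
A minimal sketch of a column markup as a mapping of column to data, using "percent of conserved residues" as the example. The `Column` class and `conservation_markup` function are hypothetical; the percentage here is taken over the sequences present in the column, which is one possible convention among several.

```python
from collections import Counter


class Column:
    """Maps sequence name -> one-letter code; hashable by identity."""

    def __init__(self, letters):
        self.letters = dict(letters)


def conservation_markup(columns):
    """Column markup: column -> percent of its letters equal to the commonest one."""
    markup = {}
    for column in columns:
        counts = Counter(column.letters.values())
        top_count = counts.most_common(1)[0][1]
        markup[column] = 100.0 * top_count / len(column.letters)
    return markup


columns = [
    Column({"a": "G", "b": "G", "c": "G", "d": "G"}),
    Column({"a": "D", "b": "E", "c": "D", "d": "D"}),
    Column({"a": "F", "c": "L"}),
]
percent_conserved = conservation_markup(columns)
```

Keying the markup by the column object, not by its index, means the marks stay attached when columns are removed or reordered.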

Summary
monomer is an object with arbitrary data
sequence is a list of monomers (but can have some of its own arbitrary data)
column is a mapping of sequences to monomers
alignment is a list of sequences plus a list of columns (plus some of its own data)
block is the same as alignment (except that we know that it does not have all the data)
alignment/block markup is a mapping of column to markup data
sequence, alignment and block have some helpers for dealing with corresponding markups

Storing markup
No existing format can store alignment markup whose values do not fit in one letter. We had to invent one ourselves! This is an example file:
sequence_markup
name: pdb_resi
markup: 3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
sequence_description: CAMP_HUMAN/136-163
class: SequencePdbResiMarkup
sequence_name: 2k6o_A/3:30
io-class: IntMarkup

sequence_markup
name: pdb_resi
markup: 3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26
sequence_description: CAP18_RABIT/137-164
class: SequencePdbResiMarkup
sequence_name: 1lyp_A/3:30
io-class: IntMarkup

alignment
format: fasta

>2k6o_A/3:30 CAMP_HUMAN/136-163
GDFFRKSKEKIGKEFKRIVQRIKDFLRN
>1lyp_A/3:30 CAP18_RABIT/137-164
RKRLRKFRNKIKEKLKKIGQKIQG----
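
A small check on the example file, assuming the `markup` field of an `IntMarkup` is a comma-separated list of integers (the file suggests this, but the parsing rule is an assumption). It shows that the markup carries one value per monomer, not one per alignment column: the second sequence has 24 residues and 24 values, even though its row is 28 characters wide.

```python
# Values copied from the second record of the example file.
row = "RKRLRKFRNKIKEKLKKIGQKIQG----"
markup = "3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26"

# Gaps are not monomers, so they carry no markup.
residues = [letter for letter in row if letter != "-"]
pdb_resi = [int(value) for value in markup.split(",")]

# Pair each residue with its PDB residue number.
numbered = list(zip(pdb_resi, residues))
```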

Kinds of polymers

There are three classes of biologically significant polymers (you may know more):
  proteins
  DNA
  RNA
We believe that these have rather few things in common. So we require the user to specify which kind they want to use.

Converting alignment format

alignment = protein.Alignment().append_file(
    input_file, format="genbank"
)
alignment.to_file(output_file, format="fasta")

Converting alignment format -- the uncensored version

from allpy import protein

input_file = open("input_file.fasta")
output_file = open("output_file.fasta", "w")
alignment = protein.Alignment().append_file(
    input_file, format="genbank"
)
alignment.to_file(output_file, format="fasta")

Realigning a little part of a protein alignment

alignment = protein.Alignment().append_file(input_file)
block = protein.Block().from_alignment(
    alignment, columns=alignment.columns[20:30]
)
block.realign(processors.Muscle())
alignment.to_file(output_file)

Remove unimportant columns from alignment
alignment = protein.Alignment().append_file(input_file)
for sequence in alignment.sequences:
    sequence.add_markup('case')

def important(column):
    for monomer in column.values():
        if monomer.case == "upper":
            return True
    return False

alignment.columns = filter(important, alignment.columns)
alignment.to_file(output_file, format="markup")
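
The same filtering idea as a self-contained sketch that runs without allpy. Columns are modelled as plain dicts from sequence name to letter, and letter case stands in for the `case` markup; `to_columns` and `render` are hypothetical helpers. Under Python 3, `filter` returns a lazy iterator, not a list, so a list comprehension is used here.

```python
def to_columns(rows):
    """Turn gapped rows into columns: dicts of sequence name -> letter."""
    width = len(next(iter(rows.values())))
    return [
        {name: row[i] for name, row in rows.items() if row[i] != "-"}
        for i in range(width)
    ]


def important(column):
    """A column matters if at least one of its letters is upper case."""
    return any(letter.isupper() for letter in column.values())


def render(names, columns):
    """Table view: a sequence missing from a column is shown as a gap."""
    return {
        name: "".join(column.get(name, "-") for column in columns)
        for name in names
    }


rows = {"a": "mkGDFFra", "b": "--GEFF-s"}
columns = to_columns(rows)
kept = [column for column in columns if important(column)]
trimmed = render(rows, kept)
```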

What works

All the core operations with alignments, blocks, sequences, monomers.
File IO.
Realigners; a few come pre-canned: Left, Right, Muscle, Needle.
Markups, with a few automatic ones shipped: case, number.
Proteins have had the most attention.
A few complex operations with 3D structures attached to an alignment.

What is lacking

Documentation is on the weak side.
DNA- or RNA-specific code.
Work with databases over the net? Make a wrapper for every program out there? ... -- that's not our goal.

Reliable block detection in alignments.

This is an alignment from Pfam. What do we know about parts that are not colored?

Reliable block detection by flexible superposition of 3D structures.
Implemented in three separate tools: blocks3d, geometrical-core and pair-cores. The approach in each of them:
  Associate 3D structure with monomers in the alignment.
  For each sequence, calculate distances between every pair of Cα atoms.
  In the resulting tables of distances, find the similar pieces and mark them as blocks in the alignment.
blocks3d is located at http://kodomo.fbb.msu.ru/blocks3d/
A manual check shows that the program mostly produces very good results but is susceptible to producing some false positives on helices.
Developed by Boris Nagaev, one of the students participating in the LUMC summer school.
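
A rough sketch of the distance-table step, not the actual blocks3d algorithm. It assumes the two structures are already aligned index by index, uses made-up coordinates, and calls two positions "similar" when their distances differ by at most a hypothetical `tolerance`. It needs Python 3.8 or later for `math.dist`.

```python
import math


def distance_table(coords):
    """All pairwise distances between the given 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def similar_pairs(table_a, table_b, tolerance=1.0):
    """Index pairs (i, j), i < j, whose distances agree within `tolerance`."""
    size = min(len(table_a), len(table_b))
    return [
        (i, j)
        for i in range(size)
        for j in range(i + 1, size)
        if abs(table_a[i][j] - table_b[i][j]) <= tolerance
    ]


# Made-up coordinates: a straight chain, and the same chain with a bent end.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0)]

pairs = similar_pairs(distance_table(chain_a), distance_table(chain_b))
```

In this toy case the first three positions keep all their mutual distances, so they would be candidates for one block, while the bent last position agrees only with its immediate neighbour.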

Reliable block detection by peering at the alignment.

Implemented in two separate tools with different algorithms. (And neither of the tools has a usable name yet.)
The approach is to find small similar pieces in the sequences and then join them if they really are similar enough. When there is nothing more to join, we call the result a block.
One of the tools is located at http://mouse.belozersky.msu.ru/tools/malakite.html
Developed by Boris Burkov, PhD student at Moscow State University, one of the former students of the LUMC summer schools.
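
A toy illustration of the "find small pieces, then join them" idea for two sequences. This is not the algorithm of either tool: it uses exact k-letter matches as the small pieces and joins only those that overlap or touch on the same diagonal. The function names and the value of `k` are assumptions.

```python
def seeds(a, b, k=3):
    """All (i, j) such that a[i:i+k] == b[j:j+k]."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    return [
        (i, j)
        for i in range(len(a) - k + 1)
        for j in index.get(a[i:i + k], [])
    ]


def join(seed_list, k=3):
    """Merge seeds on one diagonal that overlap or touch.

    Returns segments as (start_in_a, start_in_b, length).
    """
    segments = []
    # Sort by diagonal (i - j), then by position, so neighbours are adjacent.
    for i, j in sorted(seed_list, key=lambda s: (s[0] - s[1], s[0])):
        if segments:
            start_a, start_b, length = segments[-1]
            same_diagonal = (start_a - start_b) == (i - j)
            if same_diagonal and i <= start_a + length:
                new_length = max(length, i + k - start_a)
                segments[-1] = (start_a, start_b, new_length)
                continue
        segments.append((i, j, k))
    return segments


a = "GDFFRKSKEK"
b = "AAGDFFRKAA"
segments = join(seeds(a, b))
```

When nothing more can be joined, each remaining segment is a candidate block; a real tool would also tolerate mismatches and handle more than two sequences.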

The Future

Have good documentation.
Somewhat simpler ways to work with alignments.
Have the full library tidied up, fully tested, perfectly working.
Our own alignment editor with a Python command line.
Hard hacks to optimize for speed and memory.

Thank you!

The End.

Other libraries

Biopython
nwalign

What I've been doing in LUMC

Many small fixes and many small tools needed to make the poster at MCCMB'2011.
Finalized the concept of markups.
General cleanup and user documentation.