printlogo
http://www.ethz.ch/index_EN
CBRG - Computational Biochemistry Research Group
 
print
  

Progressive MSA Reconstruction using Consensus Sequences - Solution

This is the solution to the Progressive MSA Reconstruction exercise

First, we enter our 5 homologous sequences and initialize the Dayhoff matrices:

s1 := 'RYEDGVGNVIMQGRVMLAWRSVIALNTEKFLDVGPEDEQALEIKRNGTKAQLHLLVLMTVLAADNPAEENRKAVEMYAFLTEFDFRADGMVRVRGNLFPFMGLGLKR';
s2 := 'RYEDSIENLLLQSRLVLAYRTAVALNTEKFLGVPEDDEKELQIQRNGTKAQLKDWVLMHVLAAPAEENREAFLTPLDFHADAMVEVRGLIKQGLHR';
s3 := 'RYKRPVVNATLQSRLILAFRSVVALNTEEFLEVRPDDEAGEIERNGTKAELNLTVQMHVLAAPLEENRQAFLTPVDFHAAGMILVRGNGFPLIKLGLKR';
s4 := 'DPADGLDNKREENRVDLCCVPVVALNTPQFVNIGPDPAALTIKDNGSHVALEVLRWSLAAPTEENFQRFDLTKDFAVCMVAVRMNEFPKPGKDLLR';
s5 := 'DEADGLDNNREENRVKLCCVPVYAINTPQFIRAGPDSAALAIKANGSSGHVLLQVLRFSLAAPSEENFLRFDLTRDFAICMVAMRGNEFPKPGSDLKR';

CreateDayMatrices();

Task 1

The main idea is to reduce the alignment of two MSAs to the alignment of two sequences. Therefore, we need to compress each MSA into a consensus sequence, align those sequences using global alignment, and expand each character in the resulting pairwise alignment to the corresponding column to get the resulting MSA.

First, let us compute a consensus sequence from a MSA, which is represented as a matrix of characters. The easiest way to achieve this is by sorting each column of the MSA and determining the most frequent character in a linear pass. Gap characters are not allowed in a consensus sequence, therefore we skip them:

ConsensusSequence := proc(m:matrix)
        sequence := '':
        # transpose matrix to get columns in an easy way
        mt := transpose(m):
        # loop over all positions
        for position to length(mt) do
                # get array with sites at this position
                # sort it alphabetically (all same characters
                # are now clustered... )
                sites := sort(mt[position]):
                # what is the best site and its frequency?
                bestSite := '$':
                bestFreq := 0:
                # what is the current site and its frequency?
                currSite := '$':
                currFreq := 1:
                # loop over all sites (i.e. all sequences)
                for j to length(sites) do
                        # if its a gap, skip it - no gaps in consensus sequences
                        if sites[j] = '_' or sites[j] = '-' then next fi:
                        # if its the same character as currSite, increase frequency
                        if sites[j] = currSite then
                                currFreq := currFreq + 1:
                        else
                                # if frequency of current character is higher than 
                                # current highest frequency, update 'bestSite' and
                                # 'bestFreq'
                                if currFreq > bestFreq then
                                        bestFreq := currFreq:
                                        bestSite := sites[j]:
                                fi;
                                # in any case, update current site and reset its
                                # frequency
                                currSite := sites[j]:
                                currFreq := 1:
                        fi:
                od:
                # attach best site to consensus sequence
                sequence := sequence.best:
        od:
        return(sequence):
end:

Then, we need to invert the compression of columns into single characters. This is done in the MergeMSA procedure, which takes a pairwise sequence alignment and expands it to a MSA:

MergeMSA := proc(m1:matrix, m2:matrix, a:Alignment)
        alignedStrings := DynProgStrings(a)[2..3]:
        len := length(alignedStrings[1]):
        # create a new empty MSA (i.e. list of lists)
        msa := CreateArray(1..length(m1)+length(m2), 1..len, '_'):
        pos1 := pos2 := globalPos := 1:
        # run through every position in the aligned consensus seqs
        for globalPos to len do
                # check first cons.seq: if current position is not a gap,
                if alignedStrings[1,globalPos] <> '_' then
                        # loop over sequences in first MSA and attach characters at
                        # position 'pos1' to new MSA and increase pos1
                        for k to length(m1) do msa[k,globalPos] := m1[k,pos1] od:
                        pos1 := pos1 + 1:
                fi:
                # same for second cons.seq and second MSA
                if alignedStrings[2,globalPos] <> '_' then
                        for k to length(m2) do msa[k+length(m1),globalPos] := m2[k,pos2] od:
                        pos2 := pos2 + 1:
                fi:
        od:
        return(msa):
end:

At the leaves of the tree we have to convert our sequences into simple 'MSAs' to be able to apply the AlignProgressive() procedure consistently:

SequenceToMSA := proc(s:string)
        [[seq(s[i],i=1..length(s))]]:
end:

Finally, we rewrite our progressive alignment method with Darwin code:

AlignProgressive := proc(m1:matrix, m2:matrix, dist:numeric)
        s1 := ConsensusSequence(m1):
        s2 := ConsensusSequence(m2):
        al12 := Align(s1, s2, DayMatrix(dist), Global):
        MergeMSA(m1,m2,al12):
end:

Now we can compute the MSA progressively along the given guide tree:

m1 := SequenceToMSA(s1):
m2 := SequenceToMSA(s2):
m3 := SequenceToMSA(s3):
m4 := SequenceToMSA(s4):
m5 := SequenceToMSA(s5):

m23 := AlignProgressive(m2,m3,30+50):
m45 := AlignProgressive(m4,m5,20+10):
m13 := AlignProgressive(m23,m1,10+30):
res := AlignProgressive(m13,m45,30+80):

One last thing we have to do is define a method for pretty-printing the MSA:

PrintMSA := proc(m:matrix)
        for i in m do
                s := '';
                for j in i do
                        s := s.j:
                od:
                print(s):
        od:
end:


PrintMSA(res):

Task 2 (optional)

AlignProgressiveTree := proc(sequences:list(string), labels:list(string), tree:Tree)
        if type(tree,Leaf) then
                i := SearchArray(tree[Label],labels);
                m := SequenceToMSA(sequences[i]);
        else
                m1 := AlignProgressiveTree(sequences,labels,tree[Right]):
                m2 := AlignProgressiveTree(sequences,labels,tree[Left]):
                m := AlignProgressive(m1,m2,2*tree[Height]-tree[Right,Height]-tree[Left,Height]):
        fi;
        return(m):
end:

tree := Tree(Tree(Leaf('s1',-60),-30,Tree(Leaf('s2',-70),-40,Leaf('s3',-90))),0,Tree(Leaf('s4',-100),-80,Leaf('s5',
sequences := [s1,s2,s3,s4,s5]:
labels := ['s1','s2','s3','s4','s5']:

res := AlignProgressiveTree(sequences,labels,tree):
PrintMSA(res):

Task 3 (to complete your own MSA package)

FullAlignProgressive := proc(sequences:list(string), labels:list(string))
        tree := PhylogeneticTree(sequences,labels,DISTANCE):
        print(tree);
        m := AlignProgressiveTree(sequences,labels,tree):
        return(m):
end:

res := FullAlignProgressive(sequences,labels):
PrintMSA(res):
 

Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne graphische Elemente dargestellt. Die Funktionalität der Website ist aber trotzdem gewährleistet. Wenn Sie diese Website regelmässig benutzen, empfehlen wir Ihnen, auf Ihrem Computer einen aktuellen Browser zu installieren. Weitere Informationen finden Sie auf
folgender Seite.

Important Note:
The content in this site is accessible to any browser or Internet device, however, some graphics will display correctly only in the newer versions of Netscape. To get the most out of our site we suggest you upgrade to a newer browser.
More information

© 2012 ETH Zurich | Imprint | Disclaimer | 24 November 2011
top