printlogo
http://www.ethz.ch/index_EN
CBRG - Computational Biochemistry Research Group
 
print
  

Dayhoff Matrix Exercise - Solution

The only part of the bio-recipe that has to be changed is the code for creating the count matrix:

CountMatrix := CreateArray(1..20,1..20,0):
for i to 333 do
   # simulate two sequences at distance 40 PAM
   s1 := Rand(Protein(500)):
   s2 := Mutate(s1,40):
   for j to length(s1) do
     # we convert the one letter code to integers
     k1 := AToInt(s1[j]); k2 := AToInt(s2[j]);
     # if k = 0, we have most likely a gap, if k = 21, we 
     # have an missing AA in the sequence (an "X")
     if k1 > 0 and k1 < 21 and k2 > 0 and k2 < 21 then
       if k1 = k2 then
         CountMatrix[k1,k2] := CountMatrix[k1,k2] + 1;
       else
         CountMatrix[k1,k2] := CountMatrix[k1,k2] + 1/2;
         CountMatrix[k2,k1] := CountMatrix[k2,k1] + 1/2;
       fi
     fi
   od
od:
print(CountMatrix);

This leads to a mutation probabilities matrix that looks similar to this one:

0.65 0.02 0.02 0.02 0.04 0.03 0.03 0.04 0.02 0.01 0.02 0.02 0.03 0.01 0.03 0.08 0.04 0.00 0.01 0.05
0.01 0.70 0.02 0.01 0.01 0.05 0.02 0.01 0.03 0.01 0.01 0.08 0.01 0.00 0.01 0.02 0.02 0.01 0.01 0.01
0.01 0.02 0.64 0.05 0.01 0.03 0.02 0.02 0.04 0.00 0.00 0.03 0.01 0.00 0.01 0.04 0.03 0.00 0.01 0.00
0.02 0.01 0.06 0.69 0.00 0.03 0.08 0.02 0.02 0.00 0.00 0.02 0.00 0.00 0.01 0.02 0.02 0.00 0.01 0.00
[16 more lines]

If we subtract the theoretical 40 PAM matrix that was used to do the simulations from our result...

delta := MutationMatrix-exp(40*logPAM1);

... we see that the obtained mutation matrix is similar, but not exactly the same as the original matrix:

max(delta); min(delta);

0.01081167
-0.01374335

The reason for the difference is that the simulations are stochastic. One would obtain exactly the same matrix only when always using the expected values. The more alignments are used, the closer we get to the original matrix. If we use for instance 10,000 alignments instead of only 333, the largest positive and negative differences become:

0.00174802
-0.00183669

The PAM value estimated from the mutation matrix is very close to 40:

After 1 iteration(s), k=31.784384, RateMutation=0.012532
After 2 iteration(s), k=39.832371, RateMutation=0.010016
After 3 iteration(s), k=39.894793, RateMutation=0.010000
After 4 iteration(s), k=39.895180, RateMutation=0.010000
After 5 iteration(s), k=39.895182, RateMutation=0.010000
After 6 iteration(s), k=39.895182, RateMutation=0.010000
After 7 iteration(s), k=39.895182, RateMutation=0.010000

If the simulations are perfomed for a much higher distance, say 500 PAM, then the estimation becomes more difficult. 500 PAM means that on average, each amino acid has mutated 5 times. The obtained mutation matrix is then very close to the "infinite distance" matrix where the probability to mutate to a particular amino acid depends only on the natural frequency of the amino acid, independent from what amino acid it was before (all columns are the same - the frequency vector).

If then only a limited number of alignments are used, the error in the mutation probabilities can be so large that the estimated matrix no longer corresponds to a Markovian process. Therefore, no rate matrix (logarithm of the mutation matrix) can be computed:

> lnPAM1 := ln(MutationMatrix):
Error, Cannot compute log of matrix

It is thus important that the alignment for estimating a Dayhoff matrix are chosen under the consideration of the following points:

 

Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne graphische Elemente dargestellt. Die Funktionalität der Website ist aber trotzdem gewährleistet. Wenn Sie diese Website regelmässig benutzen, empfehlen wir Ihnen, auf Ihrem Computer einen aktuellen Browser zu installieren. Weitere Informationen finden Sie auf
folgender Seite.

Important Note:
The content in this site is accessible to any browser or Internet device, however, some graphics will display correctly only in the newer versions of Netscape. To get the most out of our site we suggest you upgrade to a newer browser.
More information

© 2012 ETH Zurich | Imprint | Disclaimer | 5 October 2011
top