|
|||||||||||
Open Positions
Research
The only part of the bio-recipe that has to be changed is the code for creating the count matrix:
CountMatrix := CreateArray(1..20,1..20,0):
for i to 333 do
# simulate two sequences at distance 40 PAM
s1 := Rand(Protein(500)):
s2 := Mutate(s1,40):
for j to length(s1) do
# we convert the one letter code to integers
k1 := AToInt(s1[j]); k2 := AToInt(s2[j]);
# if k = 0, we have most likely a gap, if k = 21, we
# have an missing AA in the sequence (an "X")
if k1 > 0 and k1 < 21 and k2 > 0 and k2 < 21 then
if k1 = k2 then
CountMatrix[k1,k2] := CountMatrix[k1,k2] + 1;
else
CountMatrix[k1,k2] := CountMatrix[k1,k2] + 1/2;
CountMatrix[k2,k1] := CountMatrix[k2,k1] + 1/2;
fi
fi
od
od:
print(CountMatrix);
This leads to a mutation probabilities matrix that looks similar to this one:
0.65 0.02 0.02 0.02 0.04 0.03 0.03 0.04 0.02 0.01 0.02 0.02 0.03 0.01 0.03 0.08 0.04 0.00 0.01 0.05 0.01 0.70 0.02 0.01 0.01 0.05 0.02 0.01 0.03 0.01 0.01 0.08 0.01 0.00 0.01 0.02 0.02 0.01 0.01 0.01 0.01 0.02 0.64 0.05 0.01 0.03 0.02 0.02 0.04 0.00 0.00 0.03 0.01 0.00 0.01 0.04 0.03 0.00 0.01 0.00 0.02 0.01 0.06 0.69 0.00 0.03 0.08 0.02 0.02 0.00 0.00 0.02 0.00 0.00 0.01 0.02 0.02 0.00 0.01 0.00 [16 more lines]
If we subtract the theoretical 40 PAM matrix that was used to do the simulations from our result...
delta := MutationMatrix-exp(40*logPAM1);
... we see that the obtained mutation matrix is similar, but not exactly the same as the original matrix:
max(delta); min(delta); 0.01081167 -0.01374335
The reason for the difference is that the simulations are stochastic. One would obtain exactly the same matrix only when always using the expected values. The more alignments are used, the closer we get to the original matrix. If we use for instance 10,000 alignments instead of only 333, the largest positive and negative differences become:
0.00174802 -0.00183669
The PAM value estimated from the mutation matrix is very close to 40:
After 1 iteration(s), k=31.784384, RateMutation=0.012532 After 2 iteration(s), k=39.832371, RateMutation=0.010016 After 3 iteration(s), k=39.894793, RateMutation=0.010000 After 4 iteration(s), k=39.895180, RateMutation=0.010000 After 5 iteration(s), k=39.895182, RateMutation=0.010000 After 6 iteration(s), k=39.895182, RateMutation=0.010000 After 7 iteration(s), k=39.895182, RateMutation=0.010000
If the simulations are perfomed for a much higher distance, say 500 PAM, then the estimation becomes more difficult. 500 PAM means that on average, each amino acid has mutated 5 times. The obtained mutation matrix is then very close to the "infinite distance" matrix where the probability to mutate to a particular amino acid depends only on the natural frequency of the amino acid, independent from what amino acid it was before (all columns are the same - the frequency vector).
If then only a limited number of alignments are used, the error in the mutation probabilities can be so large that the estimated matrix no longer corresponds to a Markovian process. Therefore, no rate matrix (logarithm of the mutation matrix) can be computed:
> lnPAM1 := ln(MutationMatrix): Error, Cannot compute log of matrix
It is thus important that the alignment for estimating a Dayhoff matrix are chosen under the consideration of the following points:
Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne
graphische Elemente dargestellt. Die Funktionalität der
Website ist aber trotzdem gewährleistet. Wenn Sie diese
Website regelmässig benutzen, empfehlen wir Ihnen, auf
Ihrem Computer einen aktuellen Browser zu installieren. Weitere
Informationen finden Sie auf
folgender
Seite.
Important Note:
The content in this site is accessible to any browser or
Internet device, however, some graphics will display correctly
only in the newer versions of Netscape. To get the most out of
our site we suggest you upgrade to a newer browser.
More
information