printlogo
http://www.ethz.ch/index_EN
CBRG - Computational Biochemistry Research Group
 
print
  

ML Tree Exercise

This exercise is about the computation of the likelihood of a tree with 4 leaves. We consider the tree shown below with sequences A, B, C and D at the leaves. The sequences X and Y at the internal nodes are normally not known. The tree has branches b1, b2, b3, b4 and b5 (in PAM):

Example tree
Example tree

Introduction: Computing the likelihood

The position of the root of the tree has no influence on the result (why?). Therefore we can arbitrarily set it to X.

First we look at the likelihood of only one character of the sequences, say at position i. For simplicity, we assume for now that the internal characters Xi and Yi are known. The likelihood of the tree T at position i is then computed as:
L(T,Ai,Bi,Ci,Di,Xi,Yi) = P(Xi)*P(Xi->Ai,b1)*P(Xi->Bi,b2)*P(Xi->Yi,b5)*P(Yi->Ci,b3)*P(Yi->Di,b4)
with P(Xi) being the probability of observing character Xi at node X and P(Xi->Ai,b1) being the probability of amino acid Xi changing to amino acid Ai over distance b1.

In reality, however, the internal nodes X and Y are not known. Therefore, the likelihood is computed by summing over all possible states of Xi and Yi:
L(T,Ai,Bi,Ci,Di) = sum_Xi sum_Yi L(T,Ai,Bi,Ci,Di,Xi,Yi)

The likelihood of the whole sequences is the product of the likelihoods at the individual positions. However, since the likelihoods are products of probabilities, they can become very small and thus unsuited for computation. Thus, normally the logarithms of the likelihoods are used. Instead of the product, the sum can then used to compute the likelihood using the whole sequences. For sequences of length N this becomes:
logL(T,A,B,C,D) = sum_i=1..N log(L(T,Ai,Bi,Ci,Di))

Task 1: Compute the likelihood

Compute the log-likelihood of the tree shown above for the following sequences and branch lengths:

b1 := 10;
b2 := 15;
b3 := 15;
b4 := 20;
b5 := 10;
A := 'TGEELPNDLQTIANELFGTDQPFTNFNGIGRFATPGFSPFGAALDNMSVGGLNHIAIKHSGEIEDSVRSN':
B := 'TGEELPFDIETIANELFGTDQPFTNFNGLGQMATPGFNPFGAAADNLATGGVNHIAIEHSGEIEPSVRSN':
C := 'TGEELPNDLQTISEEKFGTDQPFTNFNGIGRFAAPGFNPFGAALDNLSSGGTNHVVIEHSGDIQQSVRSN':
D := 'TGEELPNDAQAITEQLFGTDKPFTNFNGIGRFATPGPDPMGAELDNLEVGGVNHIAVQSTGEIEPSVRSN':

The probabilities needed for the computations come from the substitution probabilities from the Dayhoff matrix. After CreateDayMatrices(), the array AF contains the equilibrium frequencies P(X) for the amino acids.
The mutation matrix for PAM distance b1 is computed as

M1 := exp(b1*logPAM1):

and the probability P(i->j,b1) is then found in M1[j,i]. (yes, j,i not i,j).

We will use this computation many times, therefore write a function that takes four sequences and 5 branch lengths as arguments and returns the log likelihood of T. In Darwin, a procedure looks like this:

ComputeLogL := proc(S1, S2, S3 ,S4, d1, d2, d3 ,d4 ,d5)
  # here do some computations, e.g.
  logL := 99;
  return(logL);
end:

Inside the procedure, S1 to S4 and d1 to d5 can be used to do the computations. It can then be called for A,B,C,D and b1,b2,b3,b4,b5 as follows:

t := ComputeLogL(A,B,C,D,b1,b2,b3,b4,b5);

You might want to use the length function of Darwin to determine the length of the sequences and the AToInt function to convert the amino acids to integer numbers. Furthermore, try to make the implementation as efficient as you can, since it will be used a lot.

For the log-likelihood you should get -402.5570. Make sure your program works correctly before moving to Task 2.

Task 2: Optimize the middle branch

The most time-consuming task in computing an ML tree is optimizing the branch lengths. For distance trees, this can be solved very efficiently by solving a system of linear equations. For ML trees, numerical optimization has to be performed. Since the the branches of the tree depend on each other, one should in principle optimize all branches simultaneously or at least iterate over all branches. It is also possible to compute first and second derivatives of the likelihoods, which allows to use more efficient optimization procedures.

But that would be beyond the scope of this course. For the purpose of this his exercise, it is enough to optimize only one branch, the middle branch b5. (You are of course welcome to optimize all branches, if you like...). We use a very simple (but inefficient) optimization procedure: compute the log-likelihood for all possible d5 between 0 and 15 in steps of 0.1 and determine the best (the highest likelihood).

The Darwin for-loop can be "enhanced" to simplify this search:

for d5 from 0 to 15 by 0.1 do
  # ....
od:

If you attended "Introduction to Computational Science", you might remember that there are more efficient ways to numerically find a maximum. In that case you could try to implement a Golden Section search.

If you are interested in some of the optimization methods in darwin, you could write a small method that wraps your ComputeLogL function and then see how some of the darwin methods perform in optimizing the parameters. Most optimize functions in darwin do minimization and expect the function to optimize to take its parameters as a list; something like this should work fine:

testFunc := proc(val:{list,numeric})
  if type(val, numeric) then
    res := ComputeLogL(A, B, C, D, b1, b2, b3, b4, val):
  else
    res := ComputeLogL(A, B, C, D, b1, b2, b3, b4, val[1]):
  fi:
  # or, if you want to optimize more than one distance, rather
  # like this:
  # res := ComputeLogL(A, B, C, D, val[1], val[2], val[3], val[4], val[5]):
  return( abs( res ) ):
end:

Possible methods to evaluate are MinimizeFunc, DisconMinimize, MinimizeBrent, ... .

 

Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne graphische Elemente dargestellt. Die Funktionalität der Website ist aber trotzdem gewährleistet. Wenn Sie diese Website regelmässig benutzen, empfehlen wir Ihnen, auf Ihrem Computer einen aktuellen Browser zu installieren. Weitere Informationen finden Sie auf
folgender Seite.

Important Note:
The content in this site is accessible to any browser or Internet device, however, some graphics will display correctly only in the newer versions of Netscape. To get the most out of our site we suggest you upgrade to a newer browser.
More information

© 2012 ETH Zurich | Imprint | Disclaimer | 17 November 2011
top