|
|||||||||||
Open Positions
Research
This exercise guides you through the necessary steps to create a least-squares distance tree. We are using sequences from the Swiss-Prot database:
ReadDb('/home/darwin/DB/SwissProt.db');
The construction of least-squares trees is also described in the script in Section 8.4.3 (p. 146).
Step 1: Getting the sequence
Use the SearchDb function to find sequences of interest. In this exercise we want to look at the red and green opsins in mammals (photoreceptor pigments in the eyes).
Swiss-Prot IDs are made of two parts: the name of the protein (a 3 or 4 letter code) and the name of the species (typically a 5 letter code), the red opsin of human is for example OPSR_HUMAN. Thus, you need to search the DB for entries that contain ('OPSG_' OR 'OPSR_') AND 'Mammalia'. See the SearchDb help to find out how to do searches with ANDs and ORs. You should find 13 sequences matching this criteria.
Note that the result is a list without the surrounding brackets ([ and ]). It is a good idea to enclose the result in brackets for the following steps. Store the obtained sequences in a list call 'entries'.
Step 2: All vs All alignments
As the name implies, distance trees are based on (PAM) distances that are estimated between the sequences. Therefore, we need to create two matrices, the distance matrix D and the variance matrix V. The element D[a,b] is the PAM distance between sequence a and sequence b and V[a,b] is the variance of this distance estimate.
For this task, we need a 'double-loop' (two nested for-to-do-od loops) to align each sequence against each other sequence. We want Local alignments and the DMS array of Dayhoff matrices to optimize the distances:
al := Align(entries[a],entries[b], Local, DMS);
Of the resulting Alignment data structure, you need the selectors PamDistance and PamVariance:
al[PamDistance]; al[PamVariance];
Hints: don't align a sequence against itself. Also, the matrix is symmetric, D[a,b]=D[b,a]. This property can be used to save computation time. Use CreateArray to set up the matrices.
Look at the resulting matrices by printing them:
print(D); print(V);
Step 3: Creating the least-squares tree
The MinSquareTree function of Darwin creates a 'weighted least-squares' distance tree. It needs as input a distance and a variance matrix and a list of labels, identifying the sequences.
As labels, we will use the Swiss-Prot IDs. Thus, we need to create a list of the ID tags of the entries. Use the SearchTag function for this.
Then call MinSquareTree and save the resulting tree:
tree := MinSquareTree(D,V,labels);
Use the DrawTree function to draw the tree (it has many options, you can try different settings. This function creates a postscript file called temp.ps. You can look at it using the linux program 'gv' or just type View() in Darwin.
Analysis: How to compute the least-squares fit value
Finding the optimal least-squares tree corresponds to finding the tree (topology and branch lengths) such that the sum of the squares of the differences between the distances in the matrix and the distances on the tree (between two leaves) is mininmal. In Darwin, the matrix of the tree distances can be obtained with the Tree_matrix command. T[i,j] is then the distance on the tree between leaf i and leaf j.
T := Tree_matrix(tree);
Compare T and D. Why are they not the same?
Mathematically, the 'sum of squared errors', is defined as the sum of (D[i,j]-T[i,j])^2 over all (i,j), j>i. Compute this value from T and D. Remember to count only entries (i,j) with j>i,
Darwin goes one step further by doing weighted least-squares. The idea is that not all distance estimates are equally precise, their uncertainty is expressed by the variance V[i,j] of the distance. This variance is taken into account by dividing the squared-error of (i,j) by the variance V[i,j]: the terms in the sum thus become (D[i,j]-T[i,j])^2/V[i,j]. Compute this weighted sum of squared errors.
As a last step, the sum of the weighted squared errors must be divided by the 'degrees of freedom'. This is the difference between the number of constraints (given distances) and the number of free parameters (branches of the tree). For N leaves, the number of branches in a tree is 2N-3. The number of pairwise distances is 'N chosen 2', which is N*(N-1)/2.
The sum of the weighted squared errors divided by the degrees of freedom is what is optimized by the MinSquareTree function. It is saved in the global variable MST_Qual.
Compare your computation of the tree quality to MST_Qual.
Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne
graphische Elemente dargestellt. Die Funktionalität der
Website ist aber trotzdem gewährleistet. Wenn Sie diese
Website regelmässig benutzen, empfehlen wir Ihnen, auf
Ihrem Computer einen aktuellen Browser zu installieren. Weitere
Informationen finden Sie auf
folgender
Seite.
Important Note:
The content in this site is accessible to any browser or
Internet device, however, some graphics will display correctly
only in the newer versions of Netscape. To get the most out of
our site we suggest you upgrade to a newer browser.
More
information