printlogo
http://www.ethz.ch/index_EN
CBRG - Computational Biochemistry Research Group
 
print
  

Significance of Alignments and Approximate Search String Exercise

In a previous exercise you have already seen how we can align two sequences in an optimal way. Its score gives us an idea about how similar the sequences are to each other. This week, we try to assess the significance of such a result.

One possibility how to assess the significance of alignments is presented in this bio-recipe. Have a look at it.

There is also a graphical method to separate spurious matches from homologue sequences. Collins et al. (1988) proposed to draw a histogram with a logarithmic scale of the scores obtained from aligning a query sequence against a large sequence database. As unrelated sequence scores follow a Gumbel distribution, their occurrence frequencies should form a decaying, straight line on a logarithmic scale histogram. This line can be used to estimate a cutoff value.

Collins1988 - LogHistogram of scores
Collins1988 - LogHistogram of scores

Problem 1:

Get the SwissProt database from the following url: http://www.biorecipes.com/Dayhoff/SwissProt.Z. Store it in your home-directory and load it afterwards into Darwin using the ReadDb command. Next, we need to specify our query sequence. We choose as an example the ADHA_HUMAN protein. Tip: Use ?Sequence to see an example how to load the sequence of a given id.


To speed up the upcoming computations, you may want to take only a substring of 100 amino acids length of your query sequence. Tip: string[x..y] returns the substring from x to y of string.

Now we build the histogram of the scores. For that, you need to compute the alignment score of our query sequence against every other sequence of the database and collect them in a histogram. To compute the score, we suggest that you use the Dayhoff Matrix for 250PAM with the command DynProgScore. What is the difference between DynProgScore and Align? Tip: An example how to collect data for a histogram is provided in the above mentioned Bio-recipe.

Now, we need to plot the histogram on a log-scale. The following function does this for you.

# DrawLogHistogram requires as input data either a numeric list or a list of
#  numeric tuples. They represent the frequency for each bin. In case the
#  argument is only a list, the bins are thought to be of size 1 starting at 1.
#  Otherwise, use the tuple form, where the first number corresponds to the
#  center of the bin and the second one contains the count value for the bin.
#
#  As an optional argument, a string can also be passed. It corresponds to the
#  filename of the plot. By default, loghist.ps is used.
DrawLogHistogram := proc(d:{list(numeric),list([numeric, numeric])} ;
        (fn='loghist.ps'):string)
    tmpOut := Set(plotoutput=fn);

    data := If( type(d, list(numeric)), [seq([i, d[i]],i=1..length(d))], d);
    maxx := max(seq(If(z[2]>0,z[1],NULL),z=data));
    maxy := max(seq(z[2],z=data));

    maxLn  := log10(maxy);
    pd := [];
    for z in data do if z[2]>0 then
        pd := append(pd, CIRCLE(z[1],log10(z[2]),2,color=[1,0,0]));
    fi od:
    xtics := [seq( floor(maxx/10)*i, i=0..10)];
    tlenx := maxLn/10/10; tleny := maxx/10/10;
    ytics := [seq(seq(z*10^c,z=[1,5]),c=0..floor(maxLn)) ];
    pd := append(pd, LINE(0,0,maxx,0), LINE(0,0,0,maxLn),
          seq( LINE(x,0,x,tlenx), x=xtics),
          seq( CTEXT(x,-3*tlenx, string(x)), x=xtics ),
          seq( LINE(0,log10(y),tleny,log10(y)), y=ytics ),
          seq( RTEXT(-tleny, log10(y), string(y)), y=ytics ));
    DrawPlot(pd);
    Set(plotoutput=tmpOut):
end:

Which score cutoff would you use according to this graphical method to distinguish between homologous and unrelated sequences? How many homologs do you get? Try also to verify them by looking at their description tags. Did you get many false positives?

Problem 2:

To compute the alignment score for every sequence in the database against a query sequence takes quite some time. Often, we are only interested in the high-scoring matches and would like to improve on the execution time of our homology search function. The dynamic programming step was clearly the bottleneck - it takes O(n2) to compute - in the approach above. On the other hand, given the right data structures, exact sequence matches can be found extremely fast in a database, i.e. O( log(n) ).

Let us construct a heuristic to find similar sequences using the idea of exact sequence matches. Chances are high that a similar sequence matches our query sequence at some places exactly. Thus, we can look for exact matches in the database of subsequences of our query. For instance, we can take a sliding window approach by shifting our subsequence from left to right by one position at the time. Then, we count how often each entry in the database had an exact match of such a window. The more often an entry matched, the more likely it is a homolog.

Try to implement such a function. As a window size (length of the subsequence) you can use 6. To search for exact sequence matches in a database, use SearchSeqDb. This function returns you a range in the patindex (Morrison, 1968), which is the most suitable data structure to query for exact string matches. Each value in this range corresponds to an Entry in the database. The following code snippet demonstrates, how you can get the numbers of those entries which contain the query sequence:

pat := SearchSeqDb('SGPPRI');
entries  := [ Entry(pat) ];
eNumbers := [ seq( GetEntryNumber(z), z=entries) ];

Other commands you may want to use are table, Indices and sort.

How many homologs do you obtain with this approach? What about the number of false positives?

References

Collins et al. The significance of protein sequence similarities. Comput Appl Biosci (1988) vol. 4 (1) pp. 67-71 Bioinformatics

DR Morrison, PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric. Journal of the ACM (1968) vol. 15 (4) pp. 514 - 534 ACM

 

Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne graphische Elemente dargestellt. Die Funktionalität der Website ist aber trotzdem gewährleistet. Wenn Sie diese Website regelmässig benutzen, empfehlen wir Ihnen, auf Ihrem Computer einen aktuellen Browser zu installieren. Weitere Informationen finden Sie auf
folgender Seite.

Important Note:
The content in this site is accessible to any browser or Internet device, however, some graphics will display correctly only in the newer versions of Netscape. To get the most out of our site we suggest you upgrade to a newer browser.
More information

© 2012 ETH Zurich | Imprint | Disclaimer | 26 October 2011
top