|
|||||||||||
Open Positions
Research
In a previous exercise you have already seen how we can align two sequences in an optimal way. Its score gives us an idea about how similar the sequences are to each other. This week, we try to assess the significance of such a result.
One possibility how to assess the significance of alignments is presented in this bio-recipe. Have a look at it.
There is also a graphical method to separate spurious matches from homologue sequences. Collins et al. (1988) proposed to draw a histogram with a logarithmic scale of the scores obtained from aligning a query sequence against a large sequence database. As unrelated sequence scores follow a Gumbel distribution, their occurrence frequencies should form a decaying, straight line on a logarithmic scale histogram. This line can be used to estimate a cutoff value.

Get the SwissProt database from the following url: http://www.biorecipes.com/Dayhoff/SwissProt.Z. Store it in your home-directory and load it afterwards into Darwin using the ReadDb command. Next, we need to specify our query sequence. We choose as an example the ADHA_HUMAN protein. Tip: Use ?Sequence to see an example how to load the sequence of a given id.
To speed up the upcoming computations, you may want to take only a substring of 100 amino acids length of your query sequence. Tip: string[x..y] returns the substring from x to y of string.
Now we build the histogram of the scores. For that, you need to compute the alignment score of our query sequence against every other sequence of the database and collect them in a histogram. To compute the score, we suggest that you use the Dayhoff Matrix for 250PAM with the command DynProgScore. What is the difference between DynProgScore and Align? Tip: An example how to collect data for a histogram is provided in the above mentioned Bio-recipe.
Now, we need to plot the histogram on a log-scale. The following function does this for you.
# DrawLogHistogram requires as input data either a numeric list or a list of
# numeric tuples. They represent the frequency for each bin. In case the
# argument is only a list, the bins are thought to be of size 1 starting at 1.
# Otherwise, use the tuple form, where the first number corresponds to the
# center of the bin and the second one contains the count value for the bin.
#
# As an optional argument, a string can also be passed. It corresponds to the
# filename of the plot. By default, loghist.ps is used.
DrawLogHistogram := proc(d:{list(numeric),list([numeric, numeric])} ;
(fn='loghist.ps'):string)
tmpOut := Set(plotoutput=fn);
data := If( type(d, list(numeric)), [seq([i, d[i]],i=1..length(d))], d);
maxx := max(seq(If(z[2]>0,z[1],NULL),z=data));
maxy := max(seq(z[2],z=data));
maxLn := log10(maxy);
pd := [];
for z in data do if z[2]>0 then
pd := append(pd, CIRCLE(z[1],log10(z[2]),2,color=[1,0,0]));
fi od:
xtics := [seq( floor(maxx/10)*i, i=0..10)];
tlenx := maxLn/10/10; tleny := maxx/10/10;
ytics := [seq(seq(z*10^c,z=[1,5]),c=0..floor(maxLn)) ];
pd := append(pd, LINE(0,0,maxx,0), LINE(0,0,0,maxLn),
seq( LINE(x,0,x,tlenx), x=xtics),
seq( CTEXT(x,-3*tlenx, string(x)), x=xtics ),
seq( LINE(0,log10(y),tleny,log10(y)), y=ytics ),
seq( RTEXT(-tleny, log10(y), string(y)), y=ytics ));
DrawPlot(pd);
Set(plotoutput=tmpOut):
end:
Which score cutoff would you use according to this graphical method to distinguish between homologous and unrelated sequences? How many homologs do you get? Try also to verify them by looking at their description tags. Did you get many false positives?
To compute the alignment score for every sequence in the database against a query sequence takes quite some time. Often, we are only interested in the high-scoring matches and would like to improve on the execution time of our homology search function. The dynamic programming step was clearly the bottleneck - it takes O(n2) to compute - in the approach above. On the other hand, given the right data structures, exact sequence matches can be found extremely fast in a database, i.e. O( log(n) ).
Let us construct a heuristic to find similar sequences using the idea of exact sequence matches. Chances are high that a similar sequence matches our query sequence at some places exactly. Thus, we can look for exact matches in the database of subsequences of our query. For instance, we can take a sliding window approach by shifting our subsequence from left to right by one position at the time. Then, we count how often each entry in the database had an exact match of such a window. The more often an entry matched, the more likely it is a homolog.
Try to implement such a function. As a window size (length of the subsequence) you can use 6. To search for exact sequence matches in a database, use SearchSeqDb. This function returns you a range in the patindex (Morrison, 1968), which is the most suitable data structure to query for exact string matches. Each value in this range corresponds to an Entry in the database. The following code snippet demonstrates, how you can get the numbers of those entries which contain the query sequence:
pat := SearchSeqDb('SGPPRI');
entries := [ Entry(pat) ];
eNumbers := [ seq( GetEntryNumber(z), z=entries) ];
Other commands you may want to use are table, Indices and sort.
How many homologs do you obtain with this approach? What about the number of false positives?
Collins et al. The significance of protein sequence similarities. Comput Appl Biosci (1988) vol. 4 (1) pp. 67-71 Bioinformatics
DR Morrison, PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric. Journal of the ACM (1968) vol. 15 (4) pp. 514 - 534 ACM
Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne
graphische Elemente dargestellt. Die Funktionalität der
Website ist aber trotzdem gewährleistet. Wenn Sie diese
Website regelmässig benutzen, empfehlen wir Ihnen, auf
Ihrem Computer einen aktuellen Browser zu installieren. Weitere
Informationen finden Sie auf
folgender
Seite.
Important Note:
The content in this site is accessible to any browser or
Internet device, however, some graphics will display correctly
only in the newer versions of Netscape. To get the most out of
our site we suggest you upgrade to a newer browser.
More
information