|
|||||||||||
Open Positions
Research
# load DB
ReadDb('/home/darwin/DB/SwissProt.Z');
# load sequence of ADHA_HUMAN protein
e := Entry(ID('ADHA_HUMAN'));
s := Sequence( e );
# select random subsequence of length 'len'
len := 100:
off := Rand( 0..length(s)-len );
query := s[off..off+len];
Here we present a slightly different method to collect our data for the histogram than the way presented in the bio-recipe. We keep the bin width variable. For that, we need some converter functions to get from a score to the index of the histogram (the bin) and back. Otherwise, everything remains the same.
# define some helper function to be able to vary bin-width of histogram
binSize := 2;
MAXBIN := ceil( 300/binSize );
score2bin := s -> min( floor(s/binSize)+1, MAXBIN);
bin2score := b -> (b-.5)*binSize;
# Create an Array to hold histogram counts
Hist := CreateArray(1..MAXBIN);
# load Dayhoff Matrix 250 PAM
DM := DayMatrix( 250 );
# we compute alignment score of each database sequence against query
# to compare with ApproximateSearchString (Problem 2) we also measure
# the amount of time we need to compute
t0 := time();
for i to DB[TotEntries] do
# we only need scores. DynProgScore is faster than Align,
# because if given 'JustScore'-flag, only forward phase of
# dynamic programming needs to be done.
score := DynProgScore(query, Sequence(Entry(i)), DM, JustScore)[1];
bin := score2bin(score);
Hist[bin] := Hist[bin] + 1;
od:
timeDynProg := time() - t0;
Computing all the scores and collecting them in the histogram takes quite some time - on my machine more than a minute:
We include the provided function to draw the logarithmic scale histogram
# DrawLogHistogram requires as input data either a numeric list or a list of
# numeric tuples. They represent the frequency for each bin. In case the
# argument is only a list, the bins are thought to be of size 1 starting at 1.
# Otherwise, use the tuple form, where the first number corresponds to the
# center of the bin and the second one contains the count value for the bin.
#
# As an optional argument, a string can also be passed. It corresponds to the
# filename of the plot. By default, loghist.ps is used.
DrawLogHistogram := proc(d:{list(numeric),list([numeric, numeric])} ;
(fn='loghist.ps'):string)
tmpOut := Set(plotoutput=fn);
data := If( type(d, list(numeric)), [seq([i, d[i]],i=1..length(d))], d);
maxx := max(seq(If(z[2]>0,z[1],NULL),z=data));
maxy := max(seq(z[2],z=data));
maxLn := log10(maxy);
pd := [];
for z in data do if z[2]>0 then
pd := append(pd, CIRCLE(z[1],log10(z[2]),2,color=[1,0,0]));
fi od:
xtics := [seq( floor(maxx/10)*i, i=0..10)];
tlenx := maxLn/10/10; tleny := maxx/10/10;
ytics := [seq(seq(z*10^c,z=[1,5]),c=0..floor(maxLn)) ];
pd := append(pd, LINE(0,0,maxx,0), LINE(0,0,0,maxLn),
seq( LINE(x,0,x,tlenx), x=xtics),
seq( CTEXT(x,-3*tlenx, string(x)), x=xtics ),
seq( LINE(0,log10(y),tleny,log10(y)), y=ytics ),
seq( RTEXT(-tleny, log10(y), string(y)), y=ytics ));
DrawPlot(pd);
Set(plotoutput=tmpOut):
end:
and call it with our collected data.
# as our bins can vary in width, we need to pass the data using the # tuple variant. The used seq-statement is a short and efficient way # to iterate over all values i from 1 to MAXBIN DrawLogHistogram( [seq( [bin2score(i), Hist[i]], i=1..MAXBIN)] );
With the View('loghisto.ps') command, we can inspect the histogram

The decaying straight line crosses the x-axis at a score of roughly 85. Therefore, we set our cutoff to separate homologs from unrelated sequences to 85.
# from the graphics, we estimate the cutoff to be around 85
cutoff := 85;
# let us now collect all the entries in the database, that have a
# sequence similarity > cutoff with our query sequence.
homologs := [];
for i to DB[TotEntries] do
e := Entry(i);
score := DynProgScore(query, Sequence(e), DM, JustScore)[1];
if score > cutoff then
homologs := append(homologs, [SearchTag('ID',e), score, SearchTag('DE',e)]);
fi:
od:
# in the end, we sort them according to the score in desc order
homologs := sort( homologs, x->-x[2] ):
# and print them
printf('We found %d homologs\n', length(homologs));
for homo in homologs do
printf('%s\t%.2f\t%s\n', homo[1], homo[2], homo[3]);
od:
From the description of our query protein, we see that ADHA_HUMAN is an "alcohol dehydrogenase" protein, one of the essential enzymes involved in the pathway for processing alcohol. From the description we conclude that proteins which have also either 'alcohol' or 'dehydrogenase' in their description tag, are most likely true homologs. To estimate the number of false positive predictions, we simply count the number of predicted homologs, where non of those keywords appear.
# look for possible false positives
fp := []:
for homo in homologs do
if SearchString('alcohol',homo[3])<0 and
SearchString('dehydrogenase',homo[3])<0 then
fp := append(fp, homo);
fi:
od:
printf('Number of false positives: %d\n', length(fp);
for z in fp do printf('%s\t%.2f\t%s\n', z[1], z[2], z[3]) od:
The following function searches for similar sequences by looking for subsequences that match exactly
# This function searches for similar sequences in an approximate
# way. It returns a list of entry numbers, sorted (desc) according
# to the number of exact subsequence matches of length winSize
ApproximateSearchString := proc(s:string ; (winSize=6):posint)
# a table to count nr of times a subsequence (sliding window)
# matches an entry exactly. key is entry nr of database.
matches := table(0):
# for any possible position of the window, we
# search for exact matches in the database.
for i to length(s)-winSize do
pat := SearchSeqDb(s[i..i+winSize]);
eNrs := [seq( GetEntryNumber(z), z=[Entry(pat)] )];
for z in eNrs do matches[z] := matches[z]+1 od:
od:
# we convert the table into a list and return it
# ordered (descent) according to the number of times the entry
# matched a subsequence.
top := [];
for eNr in Indices(matches) do
top := append(top, [eNr, matches[eNr]])
od:
return( sort(top,x->-x[2]) );
end:
Now, we use it and look for homologs of ADHA_HUMAN in the database. Again, we measure the amount of time it takes.
t0 := time(); homologsApprox := ApproximateSearchString( query ); timeApprox := time() - t0;
This way, we get fewer homologs. On the other hand, the method is much faster. Again, we inspect their description and look for false positives.
printf('We found %d homologs\n', length(homologs));
fpApprox := []:
for tup in homologsApprox do
eNr := tup[1];
cnts := tup[2];
id := SearchTag('ID', Entry(eNr));
desc := SearchTag('DE', Entry(eNr));
if SearchString('alcohol',desc)<0 and
SearchString('dehydrogenase',desc)<0 then
fpApprox := append(fpApprox, [id, cnts, desc]);
fi:
printf('%s\t%d\t%s\n', id, cnts, desc);
od:
printf('\n\nNumer of false positives: %d\n', length(fpApprox));
for z in fpApprox do printf('%s\t%d\t%s\n', op(z)) od:
Clearly, there are many more false positives. On the other hand it is striking that all false positives have only one single exact match (column 2). Hence, only keeping the ones with several exact matches is certainly a good idea. It would be also very feasible to compute the exact similarities for the candidate homologs obtained by ApproximateSearchString and only keep the onces above a certain score.
Wichtiger Hinweis:
Diese Website wird in älteren Versionen von Netscape ohne
graphische Elemente dargestellt. Die Funktionalität der
Website ist aber trotzdem gewährleistet. Wenn Sie diese
Website regelmässig benutzen, empfehlen wir Ihnen, auf
Ihrem Computer einen aktuellen Browser zu installieren. Weitere
Informationen finden Sie auf
folgender
Seite.
Important Note:
The content in this site is accessible to any browser or
Internet device, however, some graphics will display correctly
only in the newer versions of Netscape. To get the most out of
our site we suggest you upgrade to a newer browser.
More
information