Take a moment to look at the format of a Darwin Database. Use more (or zmore if the database is compressed) to look at the first page.
more /home/darwin/DB/SwissProt.db
Notice that it is formatted in SGML, with <E> and </E> tags surrounding each entry. Other information inside the entry is held in further tags, such as <SEQ> for the sequence and <DE> for the description.
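To make the entry layout concrete, here is a minimal sketch in Python (not Darwin) that pulls the <DE> and <SEQ> fields out of <E>...</E> entries. The two sample entries are hypothetical and much shorter than real database entries, and the regular expressions assume that tags are not nested inside a field; Darwin's own ReadDb does not work this way.

```python
import re

# Hypothetical two-entry sample in the tag style described above;
# real Darwin databases contain more tags and much longer sequences.
sample = (
    "<E><DE>Hypothetical protein one</DE><SEQ>NMTTSRQLLF</SEQ></E>\n"
    "<E><DE>Hypothetical protein two</DE><SEQ>NQLLFTFF</SEQ></E>\n"
)

def parse_entries(text):
    """Return one dict per <E>...</E> entry, mapping tag name to content."""
    entries = []
    for body in re.findall(r"<E>(.*?)</E>", text, flags=re.DOTALL):
        # The backreference \1 makes each closing tag match its opening tag.
        pairs = re.findall(r"<(\w+)>(.*?)</\1>", body, flags=re.DOTALL)
        entries.append(dict(pairs))
    return entries

entries = parse_entries(sample)
for entry in entries:
    print(entry["DE"], len(entry["SEQ"]))
```

Each entry becomes a dictionary keyed by tag name, so the description and sequence can be looked up directly.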
Type ?Align and ?Alignment at the command line and read the help files for these two commands. An Alignment is a Darwin data structure that holds pairwise alignment information and is created by the Align method. To create an alignment object, we have to give Align some information, as can be seen in the Calling Sequence section of the Align help file. It requires two sequences, a method and a scoring (Dayhoff) matrix. The easiest way to call Align is with sequence strings, although, as the examples in the help file show, Align can also be called with database accession numbers or IDs. Let's first define some strings to align:
s := 'NMTTSRQLLFTFFFTTTFFFFFFQARGLPCSPTWC'; t := 'NQLLFTFFTTTFFFFQAGLRSAA':
Create an alignment with the Align procedure:
a := Align(s,t,Local, DMS);
For this call to work, DMS must first be assigned. DMS is a system variable that holds a list of Dayhoff matrices. Likewise, DM is a system variable that holds the 250 PAM Dayhoff matrix. Both global variables are set with the command:
CreateDayMatrices();
This function automatically assigns the system variable DM to the PAM 250 Dayhoff Matrix and assigns the system variable DMS to a list of 1266 Dayhoff matrices from 0.049 to 1000 PAM. Darwin uses system variables to hold data that are often used for bioinformatic manipulations. The variable names DM and DMS are reserved for system variables.
CreateDayMatrices is a resource-intensive procedure. Sometimes, when a lot of CPU time or memory is used, Darwin prints garbage collection information that reports how much memory was allocated and how much CPU time was used. The printing of this information can be turned off with:
Set(printgc=false);
Try the following:
print(DM);
This is the 250 PAM Dayhoff matrix. It contains scores for every pair of amino acids and costs for gaps. Similar amino acids should have higher scores than less similar ones. Identities (no change of amino acid) have the highest scores.
Now do the alignment again:
a := Align(s,t,Local, DMS);
This creates a local alignment. A global alignment can be created by calling Align with the option Global. Print both alignments with the print command:
print(a);
and compare the differences. The printing will display the alignment in a graphical form along with information about the lengths, the similarity score, the percent sequence identity and the PAM distance and variance.
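The difference between the two alignment modes can be sketched outside Darwin. The Python function below is a simplified illustration, not Darwin's Align: it assumes a toy scoring scheme (match +2, mismatch -1, linear gap cost -2) instead of a Dayhoff matrix, and it returns only the score. A global alignment must span both sequences end to end, while a local alignment may start and stop anywhere, so it never scores below zero.

```python
def alignment_score(s, t, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap cost.

    local=True  -> Smith-Waterman: the best-scoring pair of substrings.
    local=False -> Needleman-Wunsch: both sequences aligned end to end.
    The scoring values are toy assumptions, not Dayhoff scores.
    """
    # prev[j] holds the best score for the previous row of s against t[:j].
    prev = [0 if local else j * gap for j in range(len(t) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]

s = 'NMTTSRQLLFTFFFTTTFFFFFFQARGLPCSPTWC'
t = 'NQLLFTFFTTTFFFFQAGLRSAA'
local_score = alignment_score(s, t, local=True)
global_score = alignment_score(s, t, local=False)
print('local:', local_score, 'global:', global_score)
```

Because every global alignment is also a candidate local alignment, the local score can never be lower than the global score for the same scoring scheme.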
Now create global and local alignments again with the variable DM instead of DMS. If DMS is used to create the alignment, Darwin will automatically find the Dayhoff matrix that produces the highest score and use this one. If DM is used then Align finds the highest scoring alignment possible using the 250 PAM Dayhoff matrix. Compare the similarity scores of the alignments made with DM and those made with DMS.
Align the two sequences with a global alignment at PAM distances from 1 to 250, incrementing by 5. Print a list of the PAM distances and the alignment score for each distance.
You can use the DayMatrix command to get Dayhoff matrices for a specific PAM distance.
To get the score of an alignment, use the selector "Score" on the alignment data structure:
sc := a[Score];
The syntax of a simple for loop in Darwin that starts at 1 and increments by 5 until 250 is as follows (the "from 1" is optional):
for i from 1 to 250 by 5 do
    # body of the loop
od;
Syntax of a printf:
printf('PAM dist = %5.2f, Score = %5.2f\n', i, a[Score]):
What happens to the similarity score as a function of the PAM distance? Compare these scores to the score of the global alignment made with the option DMS. What is the connection to the maximum-likelihood estimation of PAM distances?
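One way to think about the last question is with a toy model, sketched here in Python rather than Darwin. It assumes that all 20 amino acids are equally frequent and that a residue changes to any other residue with equal probability. This is not how the real Dayhoff matrices are built, but each aligned pair is scored with the same log-odds form, 10*log10 of the probability of the pair at a given distance divided by its probability by chance. The alignment counts are hypothetical and gaps are ignored.

```python
import math

F = 1 / 20  # toy assumption: all 20 amino acids are equally frequent
# Decay rate chosen so that 1 PAM changes 1% of positions on average.
RATE = -math.log(1 - 0.01 / (1 - F))

def toy_score(n_same, n_diff, pam):
    """Log-odds score of a gap-free alignment at a given PAM distance."""
    decay = math.exp(-RATE * pam)
    p_same = F + (1 - F) * decay   # residue is unchanged at this distance
    p_diff = F * (1 - decay)       # residue became one specific other residue
    return 10 * (n_same * math.log10(p_same / F)
                 + n_diff * math.log10(p_diff / F))

n_same, n_diff = 80, 20  # hypothetical alignment: 80 identities, 20 mismatches
scores = {pam: toy_score(n_same, n_diff, pam) for pam in range(1, 251)}
best_pam = max(scores, key=scores.get)
print('best PAM:', best_pam, 'score: %.2f' % scores[best_pam])
```

The chance term in the log-odds does not depend on the PAM distance, so in this toy model the distance with the highest score is also the distance that maximizes the likelihood of the observed pairs.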
Now we want to do alignments of real protein sequences. For this problem, we load a relatively small database that contains 2159 sequences from the Swiss-Prot database.
Load this DB using the ReadDb command. This assigns the database to the global variable "DB":
ReadDb('/home/darwin/DB/Problem2.db');
Database entries can be accessed using the Entry command. Try the following:
e := Entry(651); # get entry number 651
print(e);
From which organism has this protein been sequenced? The task is now to find all similar proteins in the database. This is done by the following steps:
Look at this list of protein descriptions and compare them to the description (DE tag) of Entry 651. Do the descriptions imply a similar function?
The syntax of an if statement is as follows:
if condition then
    commands;
else
    commands;
fi;
The else is optional. The comparison operators in Darwin are: >, <, >=, <=, = and <> (not equal).