top3d foo_1.pdb foo_2.pdb
TOPP can be run directly with the command topp and keyworded input, or via the script top3d, which takes two file names as arguments and reads program parameters from the file $CLIBD/TOP.PARM (see the examples section). A search with one file against a database of structures can be done with the script topsearch, which takes one file name as its argument and reads program parameters from the file $CLIBD/SEARCH.PARM (see the examples section).
Use of the browser facility to search a Protein Data Bank site requires two commands to be on the user's path, namely wget and pdbhtf. The latter is part of the CCP4 suite and should have been compiled and installed. On the other hand, wget is not part of CCP4, but is a GNU program available via internet from the usual GNU sites.
TOP is designed to be user friendly. Once the program is properly set up on a Unix computer, a simple command such as top3d file1 file2 automatically superimposes the coordinates in file2 onto those in file1. The program also recognizes Protein Data Bank (PDB) entry codes. For example, if the second molecule is PDB entry 2cnd, the user can type top3d file1 2cnd@pdb; the program will then fetch the coordinates of 2cnd to the local disk and perform the comparison. To find out whether the structure in a file is similar to any structure in the PDB, type topsearch file.pdb; the program outputs a list of PDB codes ranked by 3D structural similarity. The user can then type top3d file.pdb code@pdb to superimpose the coordinates of interest onto the probe model. The program can detect sequence permutations and can be used for special purposes such as motif searching.
The program runs two steps in each structure comparison. In the first step, the topologies of the secondary structures in the two proteins are compared. The program represents each secondary structure element (SSE; alpha helix or beta strand) by two points, then systematically searches all possible superpositions of these elements between the two protein structures. Once a pair of elements in the two structures fit each other in 3D space (that is, the r.m.s., the angle between the two lines formed by the two points, and the line-line distance are all smaller than the given values), the program checks whether more secondary structure elements fit under the same superposition operation. If the number of fitting secondary structure elements exceeds a given number, the program declares the two structures similar, outputs the names of the corresponding secondary structure elements in the two proteins, and outputs the superimposed coordinates. It also outputs a matrix with which one molecule can be rotated and translated onto the other. Finally, the program outputs a comparison score called "Topological Diversity", which takes into account both the fraction of matching SSEs and the structural difference of the representing points. In database searching, this score can be used to rank the topological similarity of SSEs.
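The two geometric criteria mentioned above (the angle between the lines through the two representing points, and the line-line distance) can be illustrated with a short sketch. This is not TOP's own code: the function names are hypothetical, and it assumes that the "line-line distance" is the shortest distance between the two infinite lines, which the text does not state explicitly.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def sse_angle_deg(p1, p2, q1, q2):
    """Angle in degrees between the SSE lines p1->p2 and q1->q2."""
    u, v = _sub(p2, p1), _sub(q2, q1)
    c = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def line_line_distance(p1, p2, q1, q2):
    """Shortest distance between the infinite lines through each point pair
    (assumed definition of the line-line distance)."""
    u, v, w = _sub(p2, p1), _sub(q2, q1), _sub(q1, p1)
    n = _cross(u, v)
    if _norm(n) < 1e-12:
        # Parallel lines: distance from q1 to the line through p1 and p2.
        return _norm(_cross(w, u)) / _norm(u)
    return abs(_dot(w, n)) / _norm(n)
```

Two elements would then be treated as fitting when both values (together with the r.m.s.) fall below the limits given by the user.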
When Ca atoms are available, the program can run the second step, which
finds the alignment based on the Ca atoms of all residues, starting from the
initial comparison matrix, and improves the comparison matrix based on
the superposition of the newly aligned Ca atoms. The procedure is iterated
until the number of matching residues converges. The program is able
to handle sequence permutation in the superpositions. From
both the r.m.s. deviations and the numbers of matching residues, the program
calculates a score called "Structure Diversity",
which can be used to rank the structure difference of homologous
Use of a SSE database
The optimized way of database searching in TOP is to use
a library of Secondary Structure Elements (SSEs). This can
be created from a set of PDB files with the command MAKEVEC (see
The compact SSE library is updated automatically at the Karolinska Institute every week. It includes not only the currently released structures in the Protein Data Bank, but also compact SSE databases of independent families, super-families and structures classified in the SCOP database, for efficient similarity searching. It can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z . After you have fetched this TAR file by FTP and saved it to your local disk as, for example, /dir/sndlib.tar.Z, use the following commands:
cd $TOPHOME
zcat /dir/sndlib.tar.Z | tar -xvf -
You will then have the most recently updated SSE databases.
TOP can read 3D coordinates of protein structures in "Brookhaven" (PDB) format from the user's local disk, from CD-ROM, or via the internet. For structure similarity searching there are several ways to read the data. The recommended setup is to search an automatically updated secondary structure element (SSE) library (see automatic updating of SSE library and MOLVEC). In this way the program searches the most recent database through the compact SSE library and fetches detailed coordinates only for those structures found to be similar to molecule 1. This is considerably faster and requires no regular database maintenance after setup.
If you do not have the coordinates on your local disk and wish to read them directly from a Web site by giving a PDB entry code, give the file name in the form code@pdb in this command, for example: MOL1 2cnd@pdb. The program will then use the code to fetch the coordinates from a PDB mirror site or another Web site, whose URL is specified by the PDBSITE or WEBSITE command.
MOL2 Coordinate_file_name or @List_file_name or @URL_address [zone]
This command determines whether the program compares two structures or performs a similarity search in the Protein Data Bank. If the file name is something like 2cnd.pdb or 2cnd@pdb, the program will simply superimpose the two structures and give sequence comparisons.
If the second text string in the command starts with @ and the rest of the string does not start with http: or ftp:, the rest of the string is taken to be the name of a List_file which lists the names of a number of coordinate files, such as:
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200l.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb300d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb100d.ent
....
This can be used for searching for structural similarities in the Protein Data Bank.
If the command PDBSITE or WEBSITE is given before this command, or the LIBDIR command specifies a directory name which contains "current_release", the List_file can be a list of PDB entry codes, such as
200d |    | pdb200d.ent
200l | or | pdb200l.ent
300d |    | pdb300d.ent
.... |    | ....
The program will fetch the coordinates of these PDB entries from a Web site, local disk or CDs.
This list of PDB codes can be obtained from the "3DB browser" of the Protein Data Bank or from other bioinformatics tools outside the program. It makes it possible for TOP to search a particular group of structures for a special purpose.
LIBDIR directory_name
If the program is searching a number of coordinate files (see MOL2) and those files are all in the same directory, the user can indicate the directory in which the coordinate files are located. For example, if the files pdb200d.ent pdb3001.ent ... are in the /nfs/pdb/all_entries/uncompressed_files/ directory, the user can run the UNIX command: ls -1 /nfs/pdb/all_entries/uncompressed_files/ > allpdb.lis. This file will contain something like
pdb100d.ent
pdb101d.ent
pdb101m.ent
pdb102d.ent
pdb102l.ent
...
then use
libdir /nfs/pdb/all_entries/uncompressed_files/
mol2 @allpdb.lis
so that the program compares all the files in the directory /nfs/pdb/all_entries/uncompressed_files/ whose names appear in allpdb.lis, and lists those that are similar to the structure specified in the MOL1 command.
Alternatively, one can use UNIX command
find /directory_name/ -name "*.ent" -print > pdball.lis
instead of the ls command. The LIBDIR command is not necessary in this case. This is usually used when users have the whole Protein Data Bank on their local disk or CD-ROM.
If the directory name in the LIBDIR command contains the substring ".../current_release/uncompressed_files", the program will assume that this directory is organised like the "current_release" directory of the Protein Data Bank, i.e. that PDB entries are distributed under subdirectories whose names correspond to the 2 middle characters of the PDB id code, e.g.
...pub/pdb_data/current_release/uncompressed_files/00
...pub/pdb_data/current_release/uncompressed_files/zy
and the program will assume that each line in List_file is a PDB entry code such as
100d |    | pdb100d.ent
100e | or | pdb100e.ent
.... |    | ....
Please note that the local PDB must contain the coordinates of the structures whose ID codes are in the file.
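The "current_release" layout described above can be sketched in a few lines. This is only an illustration of the naming rule (subdirectory named after the 2 middle characters of the code, file named pdbXXXX.ent); the function name is hypothetical and is not part of TOP.

```python
def current_release_path(root, code):
    """Build the expected file path for a 4-character PDB entry code.

    Assumes the layout described in the text: <root>/<2 middle characters>/pdb<code>.ent
    """
    code = code.strip().lower()
    if len(code) != 4:
        raise ValueError("a PDB entry code has 4 characters: %r" % code)
    return "%s/%s/pdb%s.ent" % (root.rstrip("/"), code[1:3], code)
```

For example, with the root /nfs/protein/pdb/current_release/uncompressed_files the code 200d maps to the 00 subdirectory, in agreement with the list of file names shown for the MOL2 command.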
If the text after the first character "@" starts with "http:", the program will assume that there is a 3DB browser at this URL and will try to get a list of the currently released entries. (This command is not necessary if the PDBSITE command is present.)
If the text after the first character "@" starts with "ftp:", the program will list all the files under the directory. This can be used for an anonymous FTP site in which one directory contains all the coordinate entries (such as the old PDB directory .../all_release/compressed_files/*.pdb ). In this form, however, all the PDB files must be in one directory and not distributed over sub-directories.
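The rules for interpreting the MOL2 argument can be summarised in a small sketch. The function and the returned labels are hypothetical names chosen for illustration; only the prefix and suffix rules come from the text.

```python
def classify_mol2(arg):
    """Classify a MOL2 argument according to the rules described above."""
    if arg.startswith("@"):
        rest = arg[1:]
        if rest.startswith("http:"):
            return "3db_browser"      # URL of a 3DB browser providing the entry list
        if rest.startswith("ftp:"):
            return "ftp_directory"    # FTP directory holding all the coordinate files
        return "list_file"            # file listing coordinate files or PDB codes
    if arg.endswith("@pdb"):
        return "pdb_entry"            # PDB entry code, e.g. 2cnd@pdb
    return "coordinate_file"          # local coordinate file, e.g. 2cnd.pdb
```

The first two cases without "@" lead to a pairwise comparison, the remaining cases to a search over many structures.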
This command specifies the URL of one of the official mirror sites of the Protein Data Bank. Given a recognized mirror site, the program can fetch the most recent data in the PDB. A collection of URLs which have been tested with the program is listed at http://gamma.mbb.ki.se/~guoguang/webtop/pdb_url_collect.html. For efficient and fast data transfer, users should choose a site in or close to their own country.
If this command is given, the commands WEBSITE, PDBSITE and LIBDIR do not need to be present.
WEBsite URL_address (or SITE or SERVER)
Sometimes users prefer to read data from a Web site other than a standard PDB site (for example, a laboratory on the same campus or in the same city). In that case WEBsite can be used instead of PDBSITE, for example:
WEBSITE http://pdb.pdb.bnl.gov/ or http://www.rcsb.org/pdb/
WEBSITE ftp://pdb.pdb.bnl.gov/pub/pdb/all_entries/compressed_files
WEBSITE ftp://gamma.mbb.ki.se/pub/pdb/current_release/uncompressed_files
This command gives the URL of the Web server. If the address is given correctly, the program can fetch coordinates from any site which provides Protein Data Bank data by either HTTP or FTP service, in compressed or uncompressed form. Each issue of the Protein Data Bank Quarterly Newsletter lists the laboratories which might provide this service (most likely in the form of an FTP server). A current collection of the URLs of these sites is listed at http://gamma.mbb.ki.se/~guoguang/webtop/url_collect.html
If the site is an FTP site and the directory name contains the sub-string "current_release", the program can automatically find the PDB entries in sub-directories. Otherwise, it will assume that all the files are in the single directory given as the argument of this command.
Instead of reading all the PDB files in the Protein Data Bank, the TOP program can use a compact database, which is a library of the secondary structures of each protein. This command gives the file name of the database so that the program can perform the topological comparisons based on secondary structures. If the WEBSITE or LIBDIR command is also present, the program will first perform the rapid topological search in the compact database. Once a similar structure with a PDB entry code is found in the database, TOP will fetch the PDB file from the internet or local disk and perform the comparison based on Ca atoms. For users who search the database repeatedly, this command is the fast and efficient way, because it saves the time otherwise spent on repeatedly fetching files and assigning secondary structures.
The MAKEVEC command can help to update the compact database in order to follow the most recent changes in the Protein Data Bank. The updated database can also be obtained via the Web (see example 3).
MAKEVEC output_database_filename pdb_list_file_name [format]
If this command is present, the SSE library mentioned above is made. The program can read coordinates either from local disk/CD, as specified by LIBDIR, or via the internet, as specified by PDBSITE or WEBSITE. The first argument of this command is the name of the output SSE library file. The second argument is the name of a List_file (as in the MOL2 command), which can contain either a list of file names or a list of PDB entry codes. If the third (format) argument is ZONE or SCOP, the program will assume that the second column in the pdb_list_file specifies the residue range (see the ZONE1 and ZONE2 keywords), while the first column specifies the PDB code or file name of the structure.
example:
MAKEVEC sndnew.vec pdb.list
If you have the Protein Data Bank on disk, the TOP program can make a compact database file so that those who do not have the Protein Data Bank on disk can also perform similarity searches. The pdb_list_file_name contains something like
101l.pdb
102l.pdb
103l.pdb
104l.pdb
....
Using this list together with the LIBDIR command, one can make a compact SSE library, sndnew.vec
example:
PDBSITE http://www2.ebi.ac.uk
MAKEVEC sndnew.vec
example:
MAKEVEC snd.vec ftp://pdb.pdb.bnl.gov/pub/pdb/all_entries/compressed_files/
If the file name starts with "ftp://" and ends with "/", the program will check which PDB files are in that FTP directory and fetch all the coordinates in it. In this case the files must be in that directory itself, not in sub-directories. If the second argument starts with "ftp://" the program will request a 3DB server from the URL address to provide a list of all the entries in PDB.
example:
MAKEVEC snd.vec ftp://gamma.mbb.ki.se/pub/guoguang/scop_family.lis scop
If the second argument starts with "ftp://" or "http://" and ends with a file name, the program will assume that the URL points to a file which contains the PDB list. This example shows how to get an updated list for the SCOP database, which contains the PDB code and residue range of a representative structure in each family or super-family. (The format of the TOP/SCOP list is as follows.)
3sdh a: 1.001.001.001.001.001 d3sdha_
1phn a: 1.001.001.001.002.001 d1phna_
1grj 2-79 1.001.002.001.001.001 d1grj_1
....
example: makevec.com
# for PDB on local disk
$LUEXE/top << 'end-top'
LIBDIR /nfs/protein/pdb/current_release/
MAKEVEC sndlib.vec pdblist.txt
'end-top'
#
The pdblist.txt could be made in this way:
cd /nfs/pdb/full/
ls -1 *.pdb > /nfs/ylgs/guoguang/pdblist.txt
If LIBDIR is replaced by PDBSITE, the program will read updated data from the PDB via the Web.
In fact, the keywords 3DBBEFore and 3DBAFTer together with MAKEVEC make it possible to build automatically an SSE library of newly released structures, which can be appended to the old one. This should be very quick.
example:
MATCH RATE 0.35 0.8
MATCH auto [DEFAULT]
MATCH 5
If RATE appears as a subcommand, the program will read two more parameters, RAT1 and RAT2.
RAT1 is the minimum matching rate of secondary structures. The program takes the smaller number of secondary structures in the two molecules (comparing mode) or the number of secondary structures in mol1 (searching mode) and multiplies it by rat1. If the number of matching secondary structures in the two compared proteins reaches this value, the program considers the two structures similar. For example, if mol1 has 12 secondary structures, mol2 has 10, and rat1 is 0.5, the program considers the two structures similar when 5 secondary structures match each other in comparing mode (or 6 in searching mode).
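The RAT1 arithmetic in the example above can be written out as a short sketch. The function name is hypothetical, and rounding a fractional product up to the next integer is an assumption; the text only gives cases where the product is a whole number.

```python
import math

def min_matching_sses(n_mol1, n_mol2, rat1, searching=False):
    """Number of matching SSEs needed for two structures to count as similar.

    Comparing mode uses the smaller SSE count of the two molecules,
    searching mode uses the SSE count of mol1.
    Rounding up with ceil is an assumption, not documented behaviour.
    """
    base = n_mol1 if searching else min(n_mol1, n_mol2)
    return math.ceil(base * rat1)
```

With 12 SSEs in mol1, 10 in mol2 and rat1 = 0.5 this gives 5 in comparing mode and 6 in searching mode, as in the text.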
AUTO is equivalent to RATE 0.35 0.8
Alternatively, before running the program users can give this number directly by estimating the minimum number of secondary structures that will match each other. It has to be lower than the real number. If the number is overestimated, the program will fail to superimpose the two similar structures. Underestimating is usually acceptable. However, if the user gives too low a value (for example 3), the program might superimpose motifs instead of the overall structures. This can give many alternative superpositions, most of which are of no real interest to the user. In database searching, a severely underestimated value can also slow the search down unnecessarily.
If the user has no idea how to set this parameter, he/she can start either with 5 or with 30%-50% of the number of secondary structures in molecule 1 (use RATE). This will be successful in 95% of cases. If the comparison fails, look at the Hint section to see how to fix the problem.
LSTRES is the minimum number of residues in a consecutive fragment of protein. The default is 3. If lstres is smaller than or equal to 0, the program compares the structures based on SSEs only. In this case, no superimposed coordinates will be output. If lstres is larger than 0, the program will improve the comparison based on Ca atoms. When all the Ca atoms in a fragment with more than LSTRES (usually 3) consecutive residues in one protein are closest to a fragment in the other protein and all the distances are smaller than DSTMIN, all the Ca atoms in these two corresponding fragments will be included in the superposition calculations. The rms and sequence comparison will be presented.
This value represents the maximum distance between the Ca atoms of matched residues (see RESIDUE). If dstmin is more than 3.0, the exact value is not very important because of the rule that the Ca atoms of matched residues must be closest to each other. A value between 3 and 7 usually does not change which residues are matched in the comparison.
If this statement is present and Ca comparison is carried out, the program will write out superimposed coordinates from Mol2 to Mol1. The file name will be something like mol2_mol1.xxx. For example if name of mol1 is sfv.pdb, name of mol2 is sin.pdb, the output name will be sin_sfv.pdb
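The naming rule for the superimposed coordinate file can be sketched as follows. The helper is hypothetical; taking the extension from the Mol2 file name is an assumption (the text only shows cases where both files end in .pdb), and the _2, _3 suffixes for alternative superpositions follow the file names quoted in the examples section.

```python
import os

def superposed_names(mol1, mol2, n_solutions=1):
    """File names for Mol2 superimposed onto Mol1, e.g. sin_sfv.pdb.

    Assumption: the output keeps the extension of the Mol2 file name.
    """
    stem1, _ = os.path.splitext(os.path.basename(mol1))
    stem2, ext = os.path.splitext(os.path.basename(mol2))
    first = "%s_%s%s" % (stem2, stem1, ext)
    # Alternative superpositions get a numeric suffix starting at _2.
    return [first] + ["%s_%d" % (first, k) for k in range(2, n_solutions + 1)]
```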
If the input is yes and there are no secondary structure assignments in the input coordinates file, the program will append the assignment at the end of the coordinate file. [Default: NO]
If rms between an alpha helix and standard helix is higher than this value, this helix will not be used for the comparisons.
If rms between a beta strand and a straight line formed by the two representing points is higher than this value, this strand will not be used for comparisons.
If rms value of certain helix or sheet is higher than this value, this helix or sheet is not considered to be similar.
ERRANG errang_alpha, errang_beta
If the direction difference of a certain helix in the two structures is higher than errang_alpha, this helix is not considered to be similar.
If the direction difference of a certain sheet in the two structures is higher than errang_beta, this sheet is not considered to be similar.
ERRDLL errdll_alpha, errdll_beta
If the line-line distance of a certain helix in the two structures is higher than errdll_alpha, this helix is not considered to be similar.
If the line-line distance of a certain strand in the two structures is higher than errdll_beta, this strand is not considered to be similar.
When expanding the search for similar secondary structures, if the maximum direction difference exceeds this angle, the last expand is rejected.
When expanding the search for similar secondary structures, if the maximum line-line distance exceeds this number, the last expand is rejected.
SINGLE/NOSIngle (or MULTiple)
If the SINGLE statement appears, comparison is only carried out on one polypeptide chain. If NOSINGLE or MULTIPLE appears, the program can compare protein structures with multiple chains.
When this option is chosen, if a helix disturbs the match of a beta strand, the program will delete the first helix and re-search for the match.
Weight of the direction in the refinement.
REFWEIGHT refwalpha, refwbeta
In the least squares refinement, the weight of alpha helix and beta strand.
SND1 Yes/No [CA]
If the input is yes, the program will not read the secondary structure assignment in the coordinate file of Mol1 but will make the assignment itself, using an algorithm defined by Smith/Laskowski (the SECSTR program from PROCHECK). If the input is no, the program will first try to use the secondary structure assigned in the coordinate file. If that does not exist or does not work, the program will make the assignment itself. If CA is present in the second input column after the keyword, the program will assign the secondary structures based on Ca atoms only.
SND2 Yes/No [CA]
same as SND1 but for Mol2
AMPLify ampl ampltop [default: 1.5 2.0]
ampl is the amplification order for structure diversity
ampltop is the amplification order for topological diversity
The values of Structural Diversity and Topological Diversity are used in TOP to describe the structural difference between the two compared structures, based on both the r.m.s. deviation and the number of matched residues or SSEs. The "amplification order" controls the influence of the number of matched residues or SSEs (see the conventions for more details).
Example: 3dbsite http://www.pdb.bnl.gov
If users wish to read data from their local disk/CD or a nearby Web site but use the 3DB browser to choose the search range, they can use LIBDIR or WEBsite to specify the location of the coordinates and use this command to specify the URL of the 3DB server. The URL of the 3DB server must be one of the mirror sites of the Protein Data Bank. This 3DB server site does not have to be the same as the one in the WEBSITE command. The program can obtain the PDB entry list from the 3DB server and fetch the coordinates from another URL. If the WEBSITE, LIBDIR and PDBSITE commands are not given, the program will use this 3DB address for fetching coordinates. If this address is not given, the default server is the one at BNL. However, I strongly recommend choosing a PDB mirror site close to the user's local lab.
example:
3DBKEYWORD FAD + FMN + FLAVIN
3DBKEYWORD NITRATE REDUCTASE
3DBKEYWORD FAD .or. FMN .or. FLAVIN
Equivalent to the "Keyword" column in 3DB. If this command appears, the TOP program searches only those structures with the words appearing in the HEADER, TITLE, KEYWDS and COMPND fields. If two keywords are separated by a space, the relation between them is "AND". If they are separated by ".or." or "+", the relation between them is "OR".
example:
3DBTEXT FAD + FMN + FLAVIN
3DBTEXT REDUCTASE
Equivalent to the "Text query" column in 3DB. If this command appears, the TOP program searches only those structures with the word in the complete PDB text. If two keywords are separated by a space, the relation between them is "AND". If they are separated by "+" or ".or.", the relation between them is "OR".
3DBSEQ (or 3DBFASTA) cutoff sequence (or cutoff @seq_file_name)
Example:
3DBSEQ 0.02 GXGXTGGTX
or
3DBSEQ 0.02 @zm.seq
Equivalent to the "FASTA" column in 3DB. If this command appears, the TOP program will request the 3DB server running the FASTA program to provide a list of structures with homology to the given sequence. It then searches for structural similarity among those structures only and outputs superimposed coordinates if the WRITE command is present. The sequence must be in 1-letter code. It must be given either on 1 line or in a file such as the following example:
SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
NCGFSAEGYARAKGAAAAVVTYSVGALSAFDAIGGAYAENLPVILISGAP
NNNDHAAGHVLHHALGKTDYHYQLEMAKNITAAAEAIY
The format is free, but the sequence cannot exceed 5000 residues. For a detailed description of the cutoff value, see the 3DB Browser Help File (for TOP, this value should be between 0.01 and 0.02). This command is good for finding structures with a short sequence fingerprint, or structures in a sequence family, and superimposing them together. This allows TOP to be used as a simple modeling program.
3DBRESOlution res1-res2 or RESO res1 res2
example:
3DBRESOLUTION 0.1-3.0
or
3DBRESOLUTION 0.1 3.0
Equivalent to the "Resolution" column in 3DB. If this command appears, the TOP program searches only those structures whose resolution lies between the two cutoffs (between 0.1 A and 3.0 A in this example).
3DBBEFore (or 3DBUPPer) date
Example: 3dbbefore 12/3/1998
Equivalent to the "Date (upper)" column in 3DB. If this command appears, the TOP program searches only those structures which were deposited before this date.
3DBAFTer (or 3DBLOWer) date
Example: 3dbafter 12/1/1998
Equivalent to the "Date (lower)" column in 3DB. If this command appears, the TOP program searches only those structures which were deposited after this date. This allows users to keep track of new structures which are similar to a certain family. The procedure can be made fully automatic with a simple Unix script file.
Example: 3dbhet FMN
Equivalent to the "Associated group" column in 3DB. If this command appears, the TOP program searches only those structures containing this hetero compound.
HELIX 1 F1 LEU 96 SER 103
HELIX 2 N1 ILE 148 ARG 160
HELIX 3 N2 ARG 184 GLU 193
HELIX 4 N3 GLU 223 HIS 229
HELIX 5 N4A PRO 245 GLN 249
HELIX 6 N4B SER 253 GLU 257
HELIX 7 N5 MET 263 SER 266
SHEET 1 FB 6 LYS 58 TYR 64 0
SHEET 2 FB 6 HIS 48 ILE 55 -1
SHEET 3 FB 6 TYR 109 LEU 116 -1
SHEET 4 FB 6 ILE 13 SER 24 -1
SHEET 5 FB 6 VAL 27 SER 33 -1
SHEET 6 FB 6 HIS 75 LYS 81 -1
If there are no SSE assignments in the coordinate file, the program will take some CPU time to calculate them. If the file contains the coordinates of all mainchain atoms, the program will use the "Smith-Laskowski method" as in the PROCHECK package. If the file contains only Ca coordinates, or many mainchain atoms are missing, the program can still assign the secondary structures automatically using another method, but some elements, especially beta strands, might not be as accurate as when all the mainchain atoms are provided. However, this does not influence the structure comparisons in most cases.
- 1) There is a consecutive fragment of at least a certain number of residues in which the distances between the Ca atoms of the two superimposed structures are less than a certain distance. The distance is defined in the DISTANCE command (default 3.8 angstrom), while the number of consecutive residues is defined in the RESIDUE command (default 3).
- 2) The Ca atoms of the matched residues in the two superimposed structures must be closest to each other.
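The two matching criteria can be illustrated with a minimal sketch that works on two lists of already superimposed Ca coordinates. It is not TOP's algorithm: the function name is hypothetical, "closest to each other" is interpreted as mutual nearest neighbours, and a fragment is kept when it has at least `lstres` residues that are consecutive in both chains.

```python
import math

def matched_fragments(ca1, ca2, dstmin=3.8, lstres=3):
    """Return (i, j) index pairs of matched Ca atoms in two superimposed chains."""
    def nearest(p, pts):
        return min(range(len(pts)), key=lambda k: math.dist(p, pts[k]))

    near12 = [nearest(p, ca2) for p in ca1]
    near21 = [nearest(q, ca1) for q in ca2]

    # Criterion 2: mutually closest atoms, closer than dstmin.
    partner = {}
    for i, j in enumerate(near12):
        if near21[j] == i and math.dist(ca1[i], ca2[j]) < dstmin:
            partner[i] = j

    # Criterion 1: keep only runs that are consecutive in both chains
    # and contain at least lstres residues.
    kept, run = [], []
    for i in range(len(ca1) + 1):      # one extra step flushes the last run
        j = partner.get(i)
        if run and j is not None and run[-1] == (i - 1, j - 1):
            run.append((i, j))
            continue
        if len(run) >= lstres:
            kept.extend(run)
        run = [(i, j)] if j is not None else []
    return kept
```

Because a run only needs to be consecutive locally, fragments may pair up in a different order along the two chains, which is how a sequence permutation would appear in such a list.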
$$\mathrm{r.m.s.} = \left( \frac{1}{N} \sum_{i=1}^{N} d_i^{2} \right)^{1/2}$$
$$d_{\mathrm{mean}} = \frac{1}{N} \sum_{i=1}^{N} d_i$$
$N$ is the number of matchable Ca atoms
$d_i$ is the distance between the i'th matched atoms of the 1st molecule and the 2nd molecule
Usually, if the differences are distributed homogeneously over the two structures, the values of dmean and r.m.s. are close. If some parts of the two structures differ much more than the other parts, r.m.s. is usually significantly higher than dmean. In my opinion, dmean reflects the distance between the two structures in a comparison better than r.m.s.
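A small numerical sketch shows how the two measures behave. The function name and the example distances are made up for illustration only.

```python
import math

def rms_and_dmean(distances):
    """Return (r.m.s., dmean) for a list of Ca-Ca distances of matched atoms."""
    n = len(distances)
    rms = math.sqrt(sum(d * d for d in distances) / n)
    dmean = sum(distances) / n
    return rms, dmean
```

For four equal distances of 1.0 both values are 1.0, whereas for the distances 0, 0, 0 and 4 the mean is still 1.0 but the r.m.s. rises to 2.0, because the single large deviation is squared before averaging.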
$$\text{Structure Diversity} = (\mathrm{r.m.s.}) \times \left( \frac{N_{\mathrm{mol1}}}{N_{\mathrm{fit}}} \right)^{A}$$
Nfit is the number of matched residues (Ca atoms)
Nmol1 is the total number of residues in the 1st molecule.
A is the amplification order for the number of matched residues (defined in the AMPLIFY command, default 2.0). The higher this value is, the more the structure diversity is influenced by the number of matched residues rather than by the r.m.s. deviation.
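The Structure Diversity formula is easy to evaluate by hand or with a few lines of code. The sketch below uses a hypothetical function name and takes the amplification order as an explicit argument rather than assuming a default.

```python
def structure_diversity(rms, n_mol1, n_fit, ampl):
    """Structure Diversity = r.m.s. * (Nmol1 / Nfit) ** A."""
    if n_fit <= 0:
        raise ValueError("at least one residue must be matched")
    return rms * (n_mol1 / n_fit) ** ampl
```

For example, an r.m.s. of 1.5 with 50 of 100 residues matched and A = 2 gives 1.5 * 2^2 = 6.0, while matching all 100 residues leaves the score equal to the r.m.s.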
$$\text{Topological Diversity} = (\mathrm{Angle} + \mathrm{RMS}) \times \left( \frac{M_{\mathrm{mol1}}}{M_{\mathrm{fit}}} \right)^{A}$$
For these purposes, simply type top3d file1 file2, or type top3d and answer the questions. For example, if you type top3d mol1.pdb mol2.pdb (and the two structures are similar), the program will output a sequence alignment of the two proteins and a coordinate file mol2_mol1.pdb in which mol2.pdb is superimposed onto mol1.pdb.
If one or both of the molecules have been deposited in the Protein Data Bank and the entry code is known, tell the program using the special format code@pdb. For example, to compare PDB entries 1KXD and 1VCP, type top3d 1kxd@pdb 1vcp@pdb; the program will output a file 1vcp_1kxd.pdb in which 1VCP is superimposed onto 1KXD.
To change the parameters of the TOP program, edit the file TOP.PARM in the directory.
The file strdiv_name.lis is a list of similar structures ranked by "Structure Diversity" (based on Ca atoms). The file todiv_name.lis is a list of similar structures ranked by "Topological Diversity" (based on Secondary Structure Elements). If users wish to have detailed comparisons, one can pick up the code from one of these two lists and use the command top3d for further information.
| Name | PDB data from | Function |
| top.com | local disk or internet | Superimposing two protein structures and comparing them |
| pdbscan.com | local disk | Searching similar structures in Protein Data Bank |
| pdbsearch.com | local disk | Searching similar structures in a compact database |
| top3db.com | internet | Searching similar structures with 3DB restraints |
| makevec.com | local disk | Making SSE library |
#
rm fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$LUEXE/top << 'end-top'
MOL1 1kxd.pdb
MOL2 1vcp.pdb
RESIDUE 3
WRITE
'end-top'
#
Type "top.com > top.log"; the program will output which secondary structure elements correspond to each other in the two structures. Optionally, the program also superimposes the two structures based on the Ca atoms and outputs the sequence comparison (see the instructions for the keyword RESIDUE). The rms deviation is output. When the WRITE statement appears, the program will write a file in which molecule 2 is superimposed onto molecule 1. In this case the output file name is 1vcp_1kxd.pdb. Sometimes there is more than one way to superimpose the two structures (e.g. when the two structures are dimers AB, the program can superimpose AB onto A'B' and AB onto B'A'). In this case the program will output several superimposed coordinate files, called 1vcp_1kxd.pdb, 1vcp_1kxd.pdb_2, 1vcp_1kxd.pdb_3, .... One can use any graphics program (such as O, Insight or Frodo) to display the superimposed coordinates together with 1kxd.pdb. Look at top.log for more information.
There are other commands controlling the parameters for different kinds of comparison. For details, please see "Keyworded Input".
The TOP software can directly fetch coordinates from the Protein Data Bank (PDB)
if the URL of a PDB mirror site is provided. In this example, if you
know that the PDB entry code of one of the structures is 1vcp, you can do the following:
1) add a command to indicate from which site you want to browse
2) use xxxx@pdb in the MOL2 command: MOL2 1vcp@pdb
The program will then read 1vcp directly from Brookhaven National Laboratory
Example 2: Searching similar structures in Protein Data Bank TOP can be used to see whether a protein is similar with certain structures in Protein Data Bank. Regarding how to obtaining the data from database, TOP may have two ways to run database searching.
In this way all the file names are stored in current.lis which will
be read by the MOL2 command in the TOP program.
In fact, one can search not only the whole protein data bank, but also a
group of selected structures, for example, structures represent
independent folding in the SCOP
The recommended way run TOP is first searching a compact library of Secondary Structure Elements (SSEs) . If SSEs constructions of some proteins are found to be similar to the studied structure, the program can do the further comparisons based on Ca atoms (as shown in pdbsearch.com and topsearch.com). This ways requires a regularly updated SSEs library which can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z It can also be made and updated automatically (see instructions for " Automatic updating of SSE library"
If users choose not to use compact SSE library, one can use pdbscan.com or topscan.com instead of pdbsearch.com or topsearch.com for searching PDB in local disk or via internet.
In pdbscan.com, it is assumed that user have all the Protein Data Bank files under directory /nfs/protein/pdb/current_release/uncompressed_files and all the files are called *.ent. In this example file, the command find $pdbdir -name "*.ent" -print > current.lis find all the PDB entries and write into the file current.lis which has contents like:
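The file-listing step used by pdbscan.com can be tried in isolation. The following is only a sketch under stated assumptions: it uses a temporary directory with empty placeholder files (the names pdb1kxd.ent and pdb1vcp.ent are made up for illustration) instead of a real PDB mirror, and it does not run TOP itself.

```shell
#!/bin/sh
# Sketch only: a throwaway directory stands in for the real PDB mirror.
pdbdir=$(mktemp -d)
mkdir -p "$pdbdir/sub"
# Hypothetical placeholder files; README must not end up in the list.
touch "$pdbdir/pdb1kxd.ent" "$pdbdir/sub/pdb1vcp.ent" "$pdbdir/README"

# Same find command as in the text: every *.ent file, one path per line.
find "$pdbdir" -name "*.ent" -print > "$pdbdir/current.lis"

# Count the entries that were written to the list.
n_entries=$(wc -l < "$pdbdir/current.lis" | tr -d ' ')
echo "$n_entries entries listed in current.lis"
```

Because find descends into subdirectories, entries stored in nested directories are listed as well. Restricting a search to a subset of structures then only requires editing current.lis.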
Taking pdbscan.com as the example again: to run the database search, type "pdbscan.com &". After some hours, all the information will be in pdbscan.log, which users usually do not need to read. Instead, look at the summary files "strdiv.lis" or "topdiv.log". (If the program crashes, you can still inspect the intermediate results by typing "grep Str pdbscan.log | sort +3 -4" or "grep Top pdbscan.log | sort +3 -4".)
The content of strdiv.lis is the following:
1692 structures are found to be similar under the given criteria
Best Structure Diversity 7.67 with 52 matched residues to 2cnd
Best Structure Diversity 7.68 with 56 matched residues to 1azz
Best Structure Diversity 8.13 with 57 matched residues to 1epa
Best Structure Diversity 8.33 with 48 matched residues to 1cnf
Best Structure Diversity 8.48 with 54 matched residues to 1ave
Best Structure Diversity 8.70 with 54 matched residues to 1hav
Best Structure Diversity 8.70 with 54 matched residues to 2pia
Best Structure Diversity 9.28 with 51 matched residues to 1avd
............

The structures listed here (2cnd, 1azz, 1epa and so on) were found to be similar to the search model; 2cnd is ranked by the program as the most similar structure. Users can take the command file of example 1 and pick the coordinates of interest to run an individual comparison, which gives the superimposed structure and details of the comparison such as the r.m.s. deviation and the sequence alignment. (This information is also in pdbscan.log; run nicelist.com or toplist.com to get a more readable output.)
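The "sort +3 -4" form used in these commands is the obsolete key syntax of sort, which recent implementations may no longer accept by default. A minimal sketch of the equivalent modern form follows, assuming the log line layout shown in the listing; the sample lines are copied from that listing and deliberately shuffled, and the -n flag (numeric comparison) is an addition that the original commands do not use.

```shell
#!/bin/sh
# Sample lines taken from the strdiv.lis listing, in shuffled order.
log_lines='Best Structure Diversity 8.70 with 54 matched residues to 1hav
Best Structure Diversity 7.68 with 56 matched residues to 1azz
Best Structure Diversity 9.28 with 51 matched residues to 1avd
Best Structure Diversity 7.67 with 52 matched residues to 2cnd'

# "+3 -4" means "skip three fields, stop before the fifth", i.e. field 4.
# -k4,4n sorts on that field numerically; LC_ALL=C fixes the decimal point.
ranked=$(printf '%s\n' "$log_lines" | LC_ALL=C sort -k4,4n)

# The last field of the first ranked line is the most similar entry.
best=$(printf '%s\n' "$ranked" | head -n 1 | awk '{print $NF}')
echo "$ranked"
echo "most similar: $best"
```

A lower diversity value ranks higher, so the first line after sorting is the closest match.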
The following is an example of how to use the SSE library for similarity searching. It is similar to example 2, but with one additional command, MOLVEC.
rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
cat > topsearch.inp << EOF
MATCH auto
PDBSITE http://www2.ebi.ac.uk
!LIBDIR /nfs/pdb/current_release/uncompressed_files/
MOL1 kinA.pdb
MOLVEC $TOPHOME/lib/sndlib.vec
EOF
$TOPBIN/top < topsearch.inp > topsearch.log
grep Top topsearch.log | sort +3 -4 >> topdiv.lis
grep similar topsearch.log > strdiv.lis
grep Str topsearch.log | sort +3 -4 >> strdiv.lis

The running and analysis procedure is similar to that of example 2.
In this example, if you use LIBDIR /nfs/pdb/current_release/uncompressed_files/ instead of PDBSITE http://www2.ebi.ac.uk, the program will read the coordinates from local disk instead of from the internet.
If you use another SSE database, for example MOLVEC $TOPHOME/lib/scop_structure.vec, you search only about 2000 independent domain structures selected from the SCOP database instead of the 8000 in the Protein Data Bank. The search is much faster (it takes only 1/10 to 1/5 of the time). For the same reason, you can use $TOPHOME/lib/scop_family.vec (about 900 domain structures) or $TOPHOME/lib/scop_superfamily.vec (about 600 domain structures) for an even shorter search. The SCOP database is not updated as frequently as the PDB (so far, once a year). The SSE database for the most recent SCOP release is always kept on our FTP distribution site.
The TOP Web server offers another way to search all the structures: the program searches one classification unit of SCOP (independent domain structures, families or super-families). Once it finds a similarity, it can optionally go on to search the other structures in the same classification unit. Searching in this way is very fast, although it does not cover the most recent data in the Protein Data Bank. Please have a look at: http://alfa.mbb.ki.se:8000/TOP/search_SCOP_new.html
#!/bin/csh
rm fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$TOPBIN/top << 'end-top'
MOL1 zmA.pdb
MOLVEC snd1.vec
pdbsite http://www2.ebi.ac.uk
3dbseq 0.02 @zm.seq
MATCH auto
WRITE yes
'end-top'

In this example zmA.pdb contains the PDB coordinates of the probe structure, and zm.seq is the file that contains the sequence in 1-letter code:
SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
TLKFIANRDKVAVLVGSKLRAAGAEEAAVKFTDALGGAVATMAAAKSFFP
EENALYIGTSWGEVSYPGVEKTMKEADAVIALAPVFN
....

The file names of the superimposed coordinates will be 1pyd_zmA.pdb, 1pvd_zmA.pdb, 1pox_zmA.pdb, ...
|SCOP||http://scop.mrc-lmb.cam.ac.uk/scop||Structure Classification of Proteins||Chothia, Murzin...|
|CATH||http://www.biochem.ucl.ac.uk/bsm/cath||Class Architecture Topology Homology||Thornton...|
When searching for similar structures in the whole Protein Data Bank, a lot of time is usually wasted on tens of lysozyme mutants or other closely related homologous proteins. It is possible to make a file list containing only structures with independent folds or super-families (see example 2), if such information can be obtained from other sources. So far, no such effort has been made by the author.
In a database search, too high a value in this command will result in no or too few similar structures being found. Users can find a suitable value by typing: grep "Maxminun match" pdbscan.log | sort +10 -11 (assuming that the log file is called pdbscan.log). For example, if you give 5 in the MATCH command and get no hits, you will see something like
......
... No way to align in 1abj.pdb Maxminun match : 3 Minumun Align: 5
... No way to align in 1abn.pdb Maxminun match : 3 Minumun Align: 5
... No way to align in 1abo.pdb Maxminun match : 3 Minumun Align: 5
... No way to align in 12ca.pdb Maxminun match : 4 Minumun Align: 5
... No way to align in 1aag.pdb Maxminun match : 4 Minumun Align: 5
... No way to align in 1aao.pdb Maxminun match : 4 Minumun Align: 5

In this example, you can get 3 more matched similar structures if you use 4 in the MATCH command.
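Reading off a suitable MATCH value from these lines can be automated. The sketch below is an illustration only: it assumes the exact line layout of the listing (the value after "Maxminun match :" is field 11, the same field addressed by "sort +10 -11") and works on sample lines copied from the listing instead of a real pdbscan.log.

```shell
#!/bin/sh
# Sample "no hit" lines in the layout shown above (MATCH was set to 5).
log_lines='... No way to align in 1abj.pdb Maxminun match : 3 Minumun Align: 5
... No way to align in 1abn.pdb Maxminun match : 3 Minumun Align: 5
... No way to align in 1abo.pdb Maxminun match : 3 Minumun Align: 5
... No way to align in 12ca.pdb Maxminun match : 4 Minumun Align: 5
... No way to align in 1aag.pdb Maxminun match : 4 Minumun Align: 5
... No way to align in 1aao.pdb Maxminun match : 4 Minumun Align: 5'

# Field 11 is the value printed after "Maxminun match :".
# Count how many entries reached each value, highest value first.
tally=$(printf '%s\n' "$log_lines" |
    grep "Maxminun match" |
    awk '{count[$11]++} END {for (v in count) print v, count[v]}' |
    LC_ALL=C sort -k1,1nr)

# First line of the tally: highest value reached and number of entries.
best_value=$(printf '%s\n' "$tally" | head -n 1 | awk '{print $1}')
gained=$(printf '%s\n' "$tally" | head -n 1 | awk '{print $2}')
echo "MATCH $best_value would add $gained structures"
```

With a real log, replace the printf of the sample lines by "cat pdbscan.log" and check first that the value really sits in field 11.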
Under-estimation: Under-estimating this number is usually harmless. The program will find many structures that are of no interest to you, but you can always rank the hits by "Structure Diversity" or "Topological Diversity" and look only at the structures at the top of the ranking. If the search is too slow because the value of this parameter is too low, you can estimate a suitable number long before the search has finished. For example, suppose you give 5 in the MATCH command. After the program has run for a while, you can type grep "Max Align" pdbscan.log | sort +3 -4 and you will get
.......
...(too many hits)...
......
1cax.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cwa.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cwb.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cwc.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cxf.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cyn.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1dlc.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cnd.pdb<->mol1.pdb Max Align: 7 Max Match: 7
1cne.pdb<->mol1.pdb Max Align: 7 Max Match: 7
1cnf.pdb<->mol1.pdb Max Align: 7 Max Match: 7

If you find that only the last 3 structures meet your "similarity" criterion, you can give "MATCH 6" (or 7) when you re-scan the database.
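Before re-scanning, a candidate MATCH value can be checked against the partial log. This is a hedged sketch, not part of the TOP distribution: it assumes the "Max Align" line layout shown in the listing (value in field 4), uses a few sample lines copied from it, and the variable name threshold is introduced here only for illustration.

```shell
#!/bin/sh
# Sample lines in the "Max Align" layout shown above.
log_lines='1cax.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cnd.pdb<->mol1.pdb Max Align: 7 Max Match: 7
1cwa.pdb<->mol1.pdb Max Align: 5 Max Match: 5
1cne.pdb<->mol1.pdb Max Align: 7 Max Match: 7
1cnf.pdb<->mol1.pdb Max Align: 7 Max Match: 7'

threshold=6   # hypothetical candidate value for the MATCH command

# Keep entries whose Max Align value (field 4) reaches the threshold and
# print only the file name in front of "<->".
kept=$(printf '%s\n' "$log_lines" |
    grep "Max Align" |
    awk -v t="$threshold" '$4 >= t { sub(/<->.*/, "", $1); print $1 }')

n_kept=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')
echo "MATCH $threshold keeps $n_kept structures:"
echo "$kept"
```

If the kept entries are exactly the structures of interest, the candidate value is a reasonable choice for the next run.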