AutoMotif Server: A Computational Protocol for Identification of Post-Translational Modifications in Protein Sequences.
Lab/Group: Plewczynski Lab (University of Warsaw)
Introduction
The rapid increase in genomic information requires new automatic techniques for investigating protein function. The function of a protein is partially determined by short sequence segments. For example, phosphorylation by protein kinases is an important mechanism for controlling intracellular processes. Many kinases are known, but the identification of their potential biological targets is still ongoing research. The high substrate specificity of protein kinases ensures correct transmission of signals in cells. This specificity is largely determined by the primary sequence of the target site, but general, efficient and reliable tools for identifying these sites are still lacking. Most methods designed to predict functional motifs process local sequence information around post-translational modification sites.
We present here an advanced computational protocol for rapid identification of post-translational modifications (PTMs) in proteins on the whole-genome scale. The AutoMotif Server (AMS) identifies various types of post-translational modifications in protein sequences. A query protein sequence is dissected into overlapping short segments. Each segment is projected into an abstract space of sequence fragments by 10 different representations. These projections are compared with a database of representations of known, experimentally confirmed post-translational modification sites using the support vector machine (SVM) approach 1, 2. This supervised machine learning approach is able to predict most post-translational modification sites in proteins. It is based on the classification of biological functional information acquired from the Swiss-Prot database version 4.2. The classification models are then used to predict new modification sites in proteins. Users can access a list of protein sites annotated in the Swiss-Prot database as able to undergo a certain post-translational modification, and can add new annotated sequence segments from proteins (positive instances).
The AMS server was demonstrated 3, 4 to achieve high accuracy in distinguishing short sequence fragments that are post-translationally modified from those that are not. The efficiency of the classification for each type of modification and the predictive power of several versions of the method are estimated using standardized leave-one-out tests. The sensitivities of the protocol for all types of modifications are in the range of 70%.
The AutoMotif Server is freely available at http://automotif.bioinfo.pl/. The local version of the software is available on request from the authors. The parameters (the search type, the number of top models, and the PTM type) are optional and can be easily modified. The following protocol describes how to use the AMS server to detect various types of post-translational modifications and how to interpret the resulting score for a given prediction.
Materials
Reagents
Equipment
1. A typical personal computer with a Linux, Apple Mac OS X or Windows operating system
2. An input sequence or set of sequences in FASTA file format, from experimental data or sequence databases
3. A web browser. We suggest using Firefox, but Apple Safari, Microsoft Internet Explorer or the Mozilla suite can also be used.
Procedure
1. The AutoMotif Server (AMS) dissects a query protein sequence into overlapping short sequence segments and identifies selected types of post-translational modification sites. We use supervised SVM classification trained on experimental knowledge to identify PTM sites. Each sequence segment is assigned a real number calculated by the cost function of the SVM classification model. Residues whose cost function value, i.e. the score, is larger than a given cut-off value are identified as possible modification sites. This means that the point representing the sequence segment lies in the region of the multidimensional space classified as “positive” by the SVM model’s hyperplane, within the given cut-off value. The AMS web server uses only a single, most effective type of kernel, i.e. the polynomial kernel. The one-vote-wins method is used to annotate segments that are predicted as positives by at least one classification model.
2. The AMS server accepts input sequences in the one-letter code in capital letters: 'ACDEFGHIKLMNPQRSTVWY', with the additional letter X for marking empty or unknown positions in a protein sequence, or the extension of a sequence segment. Users can input sequences by submitting a text file in FASTA file format (for details see http://en.wikipedia.org/wiki/Fasta_format ), by providing the SWISS-PROT/TrEMBL identifier or accession number in the text box, or by simply pasting the amino acid sequences.
3. By default the server predicts all types of post-translational modification sites that were precalculated by the authors and for which enough instances are available in the Swiss-Prot database. The list presently includes acetylation, amidation, hydroxylation, methylation, sulfation and phosphorylation (by PKC, PKA, CK, CK2 and CDC2 protein kinases). The search can be limited by selecting a particular type of functional motif from the drop-down menu on the server’s web page (for example phosphorylation sites in general or by specific kinases).
4. Two types of search procedure are available on the server: the identity search and the scan based on SVM classification. The first method identifies 9-residue segments of the query protein that are identical in sequence to entries in the database of positives for the selected type of modification. The second method runs several versions of SVM predictions that use different projection methods. Registered users (registration is available by following the “User Site” link from the main web page) can submit their own list of training instances as a text file containing a set of segments dissected from multiple proteins known to perform a certain function. The AMS server then trains the SVM for the new type of functional motif and uses it to scan any query protein sequence for potential substrates. This method allows the introduction of new types of biochemical processes that are not yet publicly known or that are not contained in the Swiss-Prot database.
5. The output web page for a query protein contains two sections. The first section displays the prediction results for each selected model, i.e. the parent protein information (the sequence number in the query set), the local segment sequences predicted as modified sites, their positions (start, modified central residue, end position, and size of the segment) and the output scores. The second section of the output web page describes each type of post-translational modification used, its protein agent, and the best SVM method used to classify known instances. Each SVM model is described by the number of positive and negative instances used in training and by the precision and recall errors of the classification model.
6. The accuracy of the SVM classification models is described by two numbers: the recall R and the precision P. The recall R measures the percentage of correct predictions (the probability of correct prediction), whereas the precision P gives the percentage of observed positives that are correctly predicted (a measure of the reliability of positive instance prediction). The accuracy measures are calculated separately for each type of PTM using the leave-one-out procedure. The typical recall value is around 30%, and the precision is over 70% for the majority of PTM types.
7. For a single query protein, the computational protocol gives, for each type of PTM, the list of predicted modifications for that sequence. When a set of sequences is used as input, the protocol returns, for each type of modification, the list of predicted modified short sequence fragments together with the parent protein number. The list of predicted modified sites is not ordered.
8. A consensus prediction is also available on the output web page when several different versions of the method predict that the same local sequence fragment undergoes a given post-translational modification.
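The segment dissection and the one-vote-wins rule described in steps 1 and 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server's implementation: the 9-residue window is taken from the identity search description, padding the chain ends with X is assumed from the input rules, and the two scoring functions are hypothetical toy stand-ins for trained SVM models.

```python
from typing import Callable, Dict, List, Tuple

WINDOW = 9            # segment length (assumed from the 9-residue identity search)
FLANK = WINDOW // 2   # 4 flanking residues on each side of the central residue


def dissect(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based centre position, 9-residue segment) for every residue.

    Positions beyond the chain ends are filled with 'X' (an assumption based
    on the input rules for empty or unknown positions).
    """
    padded = "X" * FLANK + sequence + "X" * FLANK
    return [(i + 1, padded[i:i + WINDOW]) for i in range(len(sequence))]


def one_vote_wins(sequence: str,
                  models: Dict[str, Callable[[str], float]],
                  cutoff: float) -> List[Tuple[int, str, Dict[str, float]]]:
    """Keep a segment if at least one model scores it above the cut-off."""
    hits = []
    for centre, segment in dissect(sequence):
        passed = {}
        for name, score in models.items():
            value = score(segment)
            if value > cutoff:
                passed[name] = value
        if passed:
            hits.append((centre, segment, passed))
    return hits


# Hypothetical toy scorers standing in for SVM models (illustration only).
toy_models = {
    "toy_basic": lambda seg: float(seg.count("K") + seg.count("R")),
    "toy_central_S": lambda seg: 5.0 if seg[FLANK] == "S" else 0.0,
}

for centre, segment, votes in one_vote_wins("KKFRTPSFLKK", toy_models, cutoff=3.0):
    print(centre, segment, votes)
```

Segments reported by more than one model would correspond to the consensus predictions of step 8; here a segment is kept as soon as a single model votes for it.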
Troubleshooting
Critical Steps
1. The optimal way to investigate protein function is to use the complete parent protein sequence, not short parts of it. In that case interesting non-local multiple modification sites can be identified.
2. The output score is in the range [0.000-5.000]. A higher output score indicates higher confidence in the prediction.
3. Sequence fragments predicted as modified for a certain type of post-translational modification can be repeated in the output page with different reliability scores. These variants are predicted by different methods that use various projections. If more than one method predicts a site as modified, the prediction is more reliable even if the individual scores are low.
4. For all types of post-translational modification sites, the best type of kernel is the polynomial one. Representations mixed with the LOOKUP projection (such as PROF+LOOKUP and BLOSUM+LOOKUP) are the most efficient. Other projections (such as generic BIN or PROF) have some advantages for particular types of modification sites, but they have lower overall efficiency (small recall and precision values). When the number of positive instances is large, the simple binary method BIN becomes the most accurate, whereas with fewer instances profile methods give better results. The SVM finds a proper classification scheme for the test set more easily with simple representations than with more complex ones. The linear kernel function is not efficient for the more complicated sequence signatures of post-translational modification sites. However, in some cases (PKA phosphorylation with the SPARSE+LOOKUP representation) SVM models of this type reach the efficiency of the polynomial kernel. With the radial basis kernel, the SVM frequently fails to build the model. For a large number of instances, the simple LOOKUP method is the most accurate for this type of kernel. The remarkable cases are acetylation, amidation and pyrrolidone, where the system with the LOOKUP embedding reaches the efficiency of the polynomial kernel.
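To make the terms "binary representation" and "polynomial kernel" more concrete, the following Python sketch encodes a 9-residue segment as a sparse binary vector and evaluates a polynomial kernel between two segments. It rests on assumptions: the encoding shown (20 indicator values per residue, X mapped to all zeros) is a generic one-hot scheme and may differ from the server's BIN projection, and the kernel parameters `degree`, `gamma` and `coef0` are illustrative values, not those used by AMS.

```python
from typing import List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def encode_binary(segment: str) -> List[float]:
    """One-hot encode a segment: 20 indicator values per residue.

    'X' (empty or unknown position) matches no letter and becomes all zeros.
    """
    vector: List[float] = []
    for residue in segment:
        vector.extend(1.0 if residue == aa else 0.0 for aa in ALPHABET)
    return vector


def polynomial_kernel(x: List[float], y: List[float],
                      degree: int = 3, gamma: float = 1.0,
                      coef0: float = 1.0) -> float:
    """Standard polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


a = encode_binary("FRTPSFLKK")
b = encode_binary("SKSPSKKKK")
print(len(a), polynomial_kernel(a, a), polynomial_kernel(a, b))
```

With this encoding the dot product simply counts positions at which two segments carry the same residue, so the kernel value grows with the number of shared positions.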
Anticipated Results
1. The analysis of post-translational modification sites by support vector machines allows quick and accurate (very conservative) prediction of protein function. The high overall precision of the best methods allows the user to gain deep insight into plausible functional characteristics of new, uncharacterized proteins. The recall efficiency ensures that information from previously verified sites is not lost during automatic scans of known instances. The algorithm can be applied independently of the Web interface in a pipeline. Large-scale genome analysis is also possible.
2. The main problem for some types of functional modifications is the insufficient number of experimentally verified instances. The number of support vectors for some of our classification models is very large, which is explained by the high dimensionality of the embedding space in such cases and the complicated shape of the separating hyperplane between positive and negative instances. The number of support vectors can be lowered by choosing a low-dimensional initial encoding of the amino acids into general physicochemical properties (such as hydrophobicity, hydrophilicity, polarity, volume, surface area, bulkiness or refractivity). We are now working on incorporating these features into an update of our service, which will be available within one month.
References
1. Vapnik, V. N. The nature of statistical learning theory. Springer: New York, 1995; p xv, 188.
2. Vapnik, V. N. Statistical learning theory. Wiley: New York, 1998; p xxiv, 736.
3. Plewczynski, D.; Tkacz, A.; Wyrwicz, L. S.; Godzik, A.; Kloczkowski, A.; Rychlewski, L. Support-vector-machine classification of linear functional motifs in proteins. J Mol Model 2006, 12, 453-61.
4. Plewczynski, D.; Tkacz, A.; Wyrwicz, L. S.; Rychlewski, L. AutoMotif server: prediction of single residue post-translational modifications in proteins. Bioinformatics 2005, 21, 2525-7.
5. Bairoch, A.; Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res 1999, 27, 49-54.
6. Junker, V. L.; Apweiler, R.; Bairoch, A. Representation of functional information in the SWISS-PROT data bank. Bioinformatics 1999, 15, 1066-7.
Acknowledgements
This work was supported by EC BioSapiens (LHSG-CT-2003-503265) 6FP project as well as the Polish Ministry of Education and Science (PBZ-MNiI-2/1/2005 and 2P05A00130).
Keywords
Post-translational modifications, phosphorylation, kinase substrate prediction, protein kinases, acetylation, sulfation, amidation, hydroxylation, methylation, pyrrolidone, gamma-carboxyglutamic modification, sequence similarity, database of functional sequence segments, Swiss-Prot database, support vector machine, machine learning
Figure IA
Examples of the server input Web page for a set of sequences (top and middle pictures) and for the A1A1_BUFMA protein (bottom). The middle picture shows the identity search options (scan for identical sequence segments) for PHOSPHORYLATION BY PKC active sites, and the bottom picture shows the Web server page for an SVM scan for PHOSPHORYLATION.
Figure IB
Examples of the server output Web page for a “PHOSPHORYLATION BY PKC” scan of a set of proteins and of the A1A1_BUFMA protein. The top shows the output of the identity search, the bottom that of the SVM search.
Figure IIA
The AutoMotif Server main Web pages. The top shows the main input page, and the bottom shows the Documentation section of the AMS.
Figure IIB
The AutoMotif Server links and “User Site” Web pages. The top shows all available links to functional prediction sites, databases and servers, and the bottom shows the Swiss-Prot-based set of functional motifs used in the service.
Figure IIIA
The AutoMotif Server “User Site” section of the service. The top picture shows the user’s login screen, and the middle picture shows the upload of a set of positives from a text file. The input file should include only segments of the same length, centered on the functional site, with no empty lines. The input file format uses only single upper-case letters to mark amino acids, and the file cannot include more than 1000 positives. The bottom picture shows the output page for building the user’s own model from the supplied set of positives.
Figure IIIB
The AutoMotif Server “User Site” search section of the service. The top shows the input page with a set of proteins, and the bottom picture shows the scan results for a user-supplied set of positives representing the “MySetOfPos” functional motif.
Example I
The AutoMotif Server (AMS) accepts input sequences in the one-letter code in capital letters: 'ACDEFGHIKLMNPQRSTVWY', with the additional letter X for marking empty or unknown positions in a protein chain, or positions extending a sequence chain beyond its ends. There are three methods for a user to input sequences. He or she can:
• submit a file in FASTA format with single or multiple sequences directly from local disk.
• enter SWISS-PROT/TrEMBL protein identifier (or accession number) and press a “Load sequence” button. The protein sequence will appear below in the text field.
• or simply paste a single or several sequences in FASTA format directly into the input text field.
If the user pastes more than one sequence in FASTA format, the '>' delimiter must be used (see Example IA). For a single sequence, the user can omit the description line and use only the bare sequence. It is also possible to submit short peptide sequences (see Example IB). If sequences are shorter than 9 amino acids, the user should use X to mark empty positions or positions beyond the ends of the peptide. The user must always submit all 9 positions, putting the main residue to be evaluated in the center and including 4 flanking residues on both sides. The optimal way to investigate sequences is to use the complete parent protein sequence.
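Before submitting a large set of sequences, it can be convenient to check the input locally against the rules above. The Python sketch below is a hypothetical helper, not part of AMS: it splits pasted text into records using the '>' delimiter, accepts a bare sequence without a description line, rejects letters outside the allowed alphabet, and pads a short peptide with X so that the residue of interest sits in the center of a 9-residue window. All function names are invented for illustration.

```python
from typing import List, Optional, Tuple

VALID = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_input(text: str) -> List[Tuple[str, str]]:
    """Split pasted text into (description, sequence) records."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None or chunks:
                records.append((header or "", "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None or chunks:
        records.append((header or "", "".join(chunks)))
    return records


def check_sequence(sequence: str) -> str:
    """Raise ValueError if the sequence contains letters the server does not accept."""
    bad = sorted(set(sequence) - VALID)
    if bad:
        raise ValueError("unsupported characters: " + "".join(bad))
    return sequence


def centre_on(peptide: str, index: int, flank: int = 4) -> str:
    """Return the window centred on peptide[index] (0-based), padded with X."""
    padded = "X" * flank + peptide + "X" * flank
    return padded[index:index + 2 * flank + 1]


for name, seq in parse_input("> pept 1\nEPAATSEHG\n> pept 2\nPAATSEHGG\n"):
    print(name, check_sequence(seq))
print(centre_on("TPSFL", 2))
```

The padding helper only covers the case where the residue to be evaluated is known in advance; for full-length proteins the complete sequence should be submitted instead.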
Example IA
Examples of input sequences in FASTA format. The detailed description of the FASTA format is available at NCBI web page http://www.ncbi.nlm.nih.gov/BLAST/fasta.html.
Example for four real protein sequences known to be phosphorylated by PKC kinase:
> ATHA_PIG 1033 PKC PHOSPHORYLATION P19156 PHOSPHORYLATION PHOSPHORYLATION (BY PKA AND PKC) ATHA_PIG 1033 26 PSGDMAAKMSKKKAGRGGG Potassium-transporting ATPase alpha chain 1 (EC 3 6 3 10) (Proton pump) (Gastric H+/K+ ATPase alpha subunit) 1
GKAENYELYQVELGPGPSGDMAAKMSKKKAGRGGGKRKEKLENMKKEMEINDHQLS
VAELEQKYQTSATKGLSASLAAELLLRDGPNALRPPRGTPEYVKFARQLAGGLQCL
MWVAAAICLIAFAIQASEGDLTTDDNLYLALALIAVVVVTGCFGYYQEFKSTNIIA
SFKNLVPQQATVIRDGDKFQINADQLVVGDLVEMKGGDRVPADIRILQAQGRKVDN
SSLTGESEPQTRSPECTHESPLETRNIAFFSTMCLEGTAQGLVVNTGDRTIIGRIA
SLASGVENEKTPIAIEIEHFVDIIAGLAILFGATFFIVAMCIGYTFLRAMVFFMAI
VVAYVPEGLLATVTVCLSLTAKRLASKNCVVKNLEAVETLGSTSVICSDKTGTLTQ
NRMTVSHLWFDNHIHSADTTEDQSGQTFDQSSETWRALCRVLTLCNRAAFKSGQDA
VPVPKRIVIGDASETALLKFSELTLGNAMGYRERFPKVCEIPFNSTNKFQLSIHTL
EDPRDPRHVLVMKGAPERVLERCSSILIKGQELPLDEQWREAFQTAYLSLGGLGER
VLGFCQLYLSEKDYPPGYAFDVEAMNFPTSGLSFAGLVSMIDPPRATVPDAVLKCR
TAGIRVIMVTGDHPITAKAIAASVGIISEGSETVEDIAARLRVPVDQVNRKDARAC
VINGMQLKDMDPSELVEALRTHPEMVFARTSPQQKLVIVESCQRLGAIVAVTGDGV
NDSPALKKADIGVAMGIAGSDAAKNAADMILLDDNFASIVTGVEQGRLIFDNLKKS
IAYTLTKNIPELTPYLIYITVSVPLPLGCITILFIELCTDIFPSVSLAYEKAESDI
MHLRPRNPKRDRLVNEPLAAYSYFQIGAIQSFAGFTDYFTAMAQEGWFPLLCVGLR
PQWENHHLQDLQDSYGQEWTFGQRLYQQYTCYTVFFISIEMCQIADVLIRKTRRLS
AFQQGFFRNRILVIAIVFQVCIGCFLCYCPGMPNIFNFMPIRFQWWLVPMPFGLLI
FVYDEIRKLGVRCCPGSWWDQELYY
> MYPC_CHICK 1271 PKC PHOSPHORYLATION Q90688 PHOSPHORYLATION PHOSPHORYLATION (BY PKA AND PKC) MYPC_CHICK 1271 264 DIRAAFRRTSLAGGGRRMT Myosin-binding protein C, cardiac-type (Cardiac MyBP-C) (C-protein, cardiac muscle isoform) 2
PEPAKKAVSAFTKKPKTTEVAAGSTAVFEAETEKTGIKVKWQRAGTEITDSEKYAI
KAEGNKHSLTISNVGKDDEVTYAVIAGTSKVKFELKVKEPEKSEPVAPAEASPAPA
ASELPAPPVESNQNPEVPPAETQPEEPVDPIGLFVTRPQDGEVTVGGNITFTAKVA
GESLLKKPSVKWFKGKWMDLASKVGKHLQLHDNYDRNNKVYTFEMEIIEANMTFAG
GYRCEVSTKDKFDSSNFNLIVNEAPVSGEMDIRAAFRRTSLAGGGRRMTSAFLSTE
GLEESGELNFSALLKKRDSFLRTANRGDGKSDSQPDVDVWEILRKAPPSEYEKIAF
QYGITDLRGMLKRLKRIKKEEKKSTAFLKKLDPAYQVDKGQKIKLMVEVANPDADV
KWLKNGQEIQVSGSKYIFEAIGNKRILTINHCSLADDAAYECVVAEEKSFTELFVK
EPPILITHPLEDQMVMVGERVEFECEVSEEGATVKWEKDGVELTREETFKYRFKKD
GKKQYLIINESTKEDSGHYTVKTNGGVSVAELIVQEKKLEVYQSIADLTVKARDQA
VFKCEVSDENVKGIWLKNGKEVVPDERIKISHIGRIHKLTIEDVTPGDEADYSFIP
QGFAYNLSAKLQFLEVKIDFVPREEPPKIHLDCLGQSPDTIVVVAGNKLRLDVPIS
GDPTPTVIWQKVNKKGELVHQSNEDSLTPSENSSDLSTDSKLLFESEGRVRVEKHE
DHCVFIIEGAEKEDEGVYRVIVKNPVGEDKADITVKVIDVPDPPEAPKISNIGEDY
CTVQWQPPTYDGGQPVLGYILERKKKKSYRWMRLNFDLLKELTYEAKRMIEGVVYE
MRIYAVNSIGMSRPSPASQPFMPIAPPSEPTHFTVEDVSDTTVALKWRPPERIGAG
GLDGYIVEYCKDGSAEWTPALPGLTERTSALIKDLVTGDKLYFRVKAINLAGESGA
AIIKEPVTVQEIMQRPKICVPRHLRQTLVKKVGETINIMIPFQGKPRPKISWMKDG
QTLDSKDVGIRNSSTDTILFIRKAELHHSGAYEVTLQIENMTDTVAITIQIIDKPG
PPQNIKLADVWGFNVALEWTPPQDDGNAQILGYTVQKADKKTMEWYTVYDHYRRTN
CVVSDLIMGNEYFFRVFSENLCGLSETAATTKNPAYIQKTGTTYKPPSYKEHDFSE
PPKFTHPLVNRSVIAGYNTTLSCAVRGIPKPKIFWYKNKVDLSGDAKYRMFSKQGV
LTLEIRKPTPLDGGFYTCKAVNERGEAEIECRLDVRVPQ
> ADDA_HUMAN 737 PKC PHOSPHORYLATION P35611 PHOSPHORYLATION PHOSPHORYLATION (BY PKC AND PKA) ADDA_HUMAN 737 726 KKKKKFRTPSFLKKSKKKS Alpha adducin (Erythrocyte adducin alpha subunit) 3
MNGDSRAAVVTSPPPTTAPHKERYFDRVDENNPEYLRERNMAPDLRQDFNMMEQKK
RVSMILQSPAFCEELESMIQEQFKKGKNPTGLLALQQIADFMTTNVPNVYPAAPQG
GMAALNMSLGMVTPVNDLRGSDSIAYDKGEKLLRCKLAAFYRLADLFGWSQLIYNH
ITTRVNSEQEHFLIVPFGLLYSEVTASSLVKINLQGDIVDRGSTNLGVNQAGFTLH
SAIYAARPDVKCVVHIHTPAGAAVSAMKCGLLPISPEALSLGEVAYHDYHGILVDE
EEKVLIQKNLGPKSKVLILRNHGLVSVGESVEEAFYYIHNLVVACEIQVRTLASAG
GPDNLVLLNPEKYKAKSRSPGSPVGEGTGSPPKWQIGEQEFEALMRMLDNLGYRTG
YPYRYPALREKSKKYSDVEVPASVTGYSFASDGDSGTCSPLRHSFQKQQREKTRWL
NSGRGDEASEEGQNGSSPKSKTKWTKEDGHRTSTSAVPNLFVPLNTNPKEVQEMRN
KIREQNLQDIKTAGPQSQVLCGVVMDRSLVQGELVTASKAIIEKEYQPHVIVSTTG
PNPFTTLTDRELEEYRREVERKQKGSEENLDEAREQKEKSPPDQPAVPHPPPSTPI
KLEEDLVPEPTTGDDSDAATFKPTLPDLSPDEPSEALGFPMLEKEEEAHRPPSPTE
APTEASPEPAPDPAPVAEEAAPSAVEEGAAADPGSDGSPGKSPSKKKKKFRTPSFL
KKSKKKSDS
> ADDB_HUMAN 726 PKC PHOSPHORYLATION P35612 PHOSPHORYLATION PHOSPHORYLATION (BY PKC AND PKA) ADDB_HUMAN 726 713 KKKKKFRTPSFLKKSKKKE Beta adducin (Erythrocyte adducin beta subunit) 4
MSEETVPEAASPPPPQGQPYFDRFSEDDPEYMRLRNRAADLRQDFNLMEQKKRVTM
ILQSPSFREELEGLIQEQMKKGNNSSNIWALRQIADFMASTSHAVFPTSSMNVSMM
TPINDLHTADSLNLAKGERLMRCKISSVYRLLDLYGWAQLSDTYVTLRVSKEQDHF
LISPKGVSCSEVTASSLIKVNILGEVVEKGSSCFPVDTTGFCLHSAIYAARPDVRC
IIHLHTPATAAVSAMKWGLLPVSHNALLVGDMAYYDFNGEMEQEADRINLQKCLGP
TCKILVLRNHGVVALGDTVEEAFYKIFHLQAACEIQVSALSSAGGVENLILLEQEK
HRPHEVGSVQWAGSTFGPMQKSRLGEHEFEALMRMLDNLGYRTGYTYRHPFVQEKT
KHKSEVEIPATVTAFVFEEDGAPVPALRQHAQKQQKEKTRWLNTPNTYLRVNVADE
VQRSMGSPRPKTTWMKADEVEKSSSGMPIRIENPNQFVPLYTDPQEVLEMRNKIRE
QNRQDVKSAGPQSQLLASVIAEKSRSPSTESQLMSKGDEDTKDDSEETVPNPFSQL
TDQELEEYKKEVERKKLELDGEKETAPEEPGSPAKSAPASPVQSPAKEAETKSPLV
SPSKSLEEGTKKTETSKAATTEPETTQPEGVVVNGREEEQTAEEILSKGLSQMTTS
ADTDVDTSKDKTESVTSGPMSPEGSPSKSPSKKKKKFRTPSFLKKSKKKEKVES
Example IB
A set of peptides phosphorylated by PKC kinase can also be pasted into the browser text window in the same format:
> pept 1
EPAATSEHG
> pept 2
PAATSEHGG
> pept 3
PAAVSEHGD
> pept 4
GDKKSKKAK
> pept 5
GKSPSKKKK
> pept 6
FRTPSFLKK
> pept 7
SKSPSKKKK
> pept 8
FRTPSFLKK
> pept 9
EYIKSVKGG
> pept 10
SAYGSVKAY
> pept 11
SAYATVKAY
> pept 12
SAYGSVKAY
> pept 13
SAYGSVKPY
> pept 14
SKLGSVKAA
> pept 15
AKGGTVKAA
> pept 16
NRIQTQMDV
> pept 17
AAKMSKKKA
> pept 18
AARTSPLRP
> pept 19
TKKQSFKQT
> pept 20
KTTASTRKV
Example II.
By default the server predicts all types of biochemical processes (post-translational modifications) available in the database. The user can limit the scan by choosing a particular process from the drop-down list (for example phosphorylation sites in general or by specific kinases). The functional motifs currently available include:
• ACETYLATION
• AMIDATION
• GCG_ACID
• HYDROXYLATION
• METHYLATION
• SULFATION
• PHOSPHORYLATION
• PHOSPHORYLATION BY PKC
• PHOSPHORYLATION BY PKA
• PHOSPHORYLATION BY CK
• PHOSPHORYLATION BY CK2
• PHOSPHORYLATION BY CDC2
• PHOSPHORYLATION BY ABL
The user can choose between two types of scan: an identity search or an SVM scan (see Figure IA). The identity search performs a simple search over the database of functional motif instances collected from Swiss-Prot. When it finds exact matches in terms of short (9 aa) sequence strings, it displays them. The SVM scan runs an SVM search with various embedding methods in order to scan a query sequence for a certain type of process. The maximum number of best models for each pattern is fixed at 5. A query sequence can be scanned with a smaller number of best models by choosing the preferred option from the drop-down list on the main page of the server.
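The identity search amounts to exact matching of 9-residue windows against a set of known positives, which the following Python sketch illustrates. It is a simplified stand-in for the server's search: the query is the C-terminal fragment of ADDA_HUMAN from Example IA, the positives are three peptides taken from Example IB, and the reported positions are relative to the fragment rather than to the full protein.

```python
from typing import Iterable, List, Tuple


def identity_search(sequence: str, positives: Iterable[str],
                    window: int = 9) -> List[Tuple[str, int, int, int]]:
    """Return (segment, start, centre, end) for windows found among the positives.

    Positions are 1-based and refer to the submitted sequence.
    """
    known = set(positives)
    flank = window // 2
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if segment in known:
            hits.append((segment, start + 1, start + flank + 1, start + window))
    return hits


query = "GSPGKSPSKKKKKFRTPSFLKKSKKKSDS"   # C-terminal fragment of ADDA_HUMAN
positives = ["GKSPSKKKK", "FRTPSFLKK", "SKSPSKKKK"]

for hit in identity_search(query, positives):
    print(hit)
```

Storing the positives in a set keeps each window lookup at constant time, so the scan is linear in the length of the query sequence.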
After preparing the sequence(s), the user starts a scan by pressing the button labelled 'Start'. The server works on-line, so one needs to wait a moment to see the results of the query directly in the browser window.
The output from AutoMotif contains two main parts (see examples below).
The first part is a large table with information about all types of patterns and SVM models used in the scan. Each method is constructed for a certain type of embedding, type of pattern (for example PHOSPHORYLATION) and “BY” modifier of it (for example BY PKC). The numbers of positives and negatives used in training the SVM for this particular type are provided in the last two columns. The precision and recall errors of the methods used, calculated automatically during the training phase by the leave-one-out test, are presented in the middle two columns.
The second part of the output provides, in tables, the prediction results for each type of pattern (see Figure IB). The name of the type of process (functional pattern) is displayed at the head of each table. Below it, the predictions are printed in tables with 7 columns containing:
• Column 1: The sequence position in the list (in accordance with the submitted list of sequences: the first protein or peptide is marked by 1, the second one by 2, etc.).
• Column 2: The predicted segment matching the pattern. The central residue is provided with its sequence context (shown as a 9-residue sequence string centered on the residue being analyzed).
• Column 3: The start position of the segment predicted as a functional motif.
• Column 4: The center position of the segment, that is, the position of the modified residue in the predicted functional motif.
• Column 5: The end position of the segment predicted as a functional motif in the query protein(s).
• Column 6: The size of the segment (or functional motif).
• Column 7: The output score for a fragment with value in the range [0.000-5.000].
Potential functional motifs are sometimes repeated when predicted by various methods (with different scores). Each method predicts a slightly different set of sequence segments as functional motifs. Our automatic predictor uses either the identity search or the SVM scan.
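When the output tables are post-processed outside the browser, the seven columns map naturally onto a small record type, and repeated predictions of the same site can be grouped to see how many methods agree. The Python sketch below assumes the rows have already been extracted from the output page; the record type, the field names and the example rows are hypothetical and do not reproduce real server output.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Prediction:
    sequence_no: int   # column 1: position of the sequence in the submitted list
    segment: str       # column 2: predicted 9-residue segment
    start: int         # column 3
    centre: int        # column 4: position of the modified residue
    end: int           # column 5
    size: int          # column 6
    score: float       # column 7: output score in the range 0.000-5.000


def group_repeats(rows: Iterable[Prediction]) -> Dict[Tuple[int, int], Tuple[int, float]]:
    """Map (sequence number, centre) to (number of repeats, best score)."""
    grouped: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for row in rows:
        grouped[(row.sequence_no, row.centre)].append(row.score)
    return {site: (len(scores), max(scores)) for site, scores in grouped.items()}


# Made-up rows for illustration only.
rows = [
    Prediction(1, "FRTPSFLKK", 14, 18, 22, 9, 1.2),
    Prediction(1, "FRTPSFLKK", 14, 18, 22, 9, 0.4),
    Prediction(2, "GKSPSKKKK", 4, 8, 12, 9, 2.0),
]
print(group_repeats(rows))
```

A site that appears several times in the grouped result was predicted by more than one method, which the protocol treats as a more reliable prediction even when the individual scores are low.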
Example IIA.
The results of both types of scan for a set of four proteins. All of them are known to be phosphorylated by PKC kinase (verified by experimental results and annotated in the Swiss-Prot database). A higher score indicates higher confidence in the prediction. This means that potential segments are more similar to one or more of the phosphorylation functional motifs used in training the SVM method.
The identity search gives the following results:
Example IIB
The output of the “phosphorylation by PKC kinase” scan for a set of peptides. In the SVM scan, peptides predicted to be phosphorylated are repeated because they are predicted by various methods (with different scores). Each method provides a different, although overlapping, set of peptides predicted to be phosphorylated by PKC kinase.
The identity search provides the following results:

