ConSeq Overview

Introduction
Methodology

Protein query sequence
Searching for homologous sequences
Generating the multiple sequence alignment
Calculating the amino acid conservation scores
Model of substitution for proteins
Conservation coloring scheme
Surface accessibility prediction

Output

Graphic visualization

References

The server is compatible with Internet Explorer 5.5/6 and Netscape 4.7x.

Introduction

ConSeq is a web server for the identification of structurally and functionally important residues in protein sequences. Given a set of homologous proteins in the form of a multiple sequence alignment (MSA), the evolutionary rate at each amino acid site in the MSA is calculated. The slowly evolving sites are often biologically important. To determine which of these sites are important for maintaining the protein structure and which are functionally important, the MSA is used to predict the relative solvent accessibility state of each site (i.e. buried vs. exposed). The basic assumption is that functionally important residues, which for example take part in ligand binding, DNA binding and protein-protein interactions, are often evolutionarily conserved and are most likely to be solvent accessible, whereas highly conserved residues within the protein core are likely to have an important structural role in maintaining the protein's fold. However, it should be emphasized that the discrimination between functional and structural residues ("functional" = highly conserved and exposed, whereas "structural" = highly conserved and buried) might be problematic in cases where a residue has both a functional and a structural role. Moreover, in certain cases functionally important sites might evolve faster; for example, the hypervariable peptide-binding sites in MHC molecules (see example at: http://consurf.tau.ac.il/gallery.html).

ConSeq was designed to analyze protein sequences of unknown three-dimensional (3D) structure and can provide fast and useful leads for the design and analysis of mutagenesis studies. For cases in which the 3D structure of the protein is known, we recommend the use of the ConSurf server (http://consurf.tau.ac.il/).

A ConSeq analysis of a set of 5 well-documented proteins and a comparison to other available web servers are presented in the VALIDATION and COMPARISON sections, respectively. The outcome of a ConSeq analysis of a set of 111 proteins of unknown function is presented in the PREDICTIONS section.

Methodology

Given the sequence of a protein or a domain as input, the server automatically carries out a PSI-BLAST search for close homologous sequences in either the SWISS-PROT database or the full UNI-PROT knowledgebase (SWISS-PROT + TrEMBL). It then aligns the homologues using the CLUSTALW or MUSCLE program, builds a phylogenetic tree consistent with the MSA, and calculates the conservation scores using either an empirical Bayesian or the Maximum Likelihood method. Alternatively, a user-provided MSA can be processed. In this case, the PSI-BLAST search and the CLUSTALW or MUSCLE alignment steps are skipped, and the phylogenetic tree is calculated directly from the user-provided MSA.

Based on the MSA, the server predicts the burial status of each residue, i.e., whether it is buried in the protein core or exposed to the solvent, based on a neural network algorithm. Slowly-evolving and exposed residues are marked with an "f", to indicate that they are predicted to be functional, whereas slowly-evolving and buried residues are marked with an "s", to indicate that they are predicted to have an important structural role.

Finally, the sequence, with the conservation scores color-coded onto it, the relative accessibility prediction and the indication of the residues predicted to be functionally or structurally important, can be visualized on-line (Fig. 1).

Protein query sequence

The user is requested to paste the query sequence of the protein or domain in FASTA format.
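As an illustration of the expected input, the following is a minimal sketch of a FASTA reader. It is not the server's own parser; the function name read_fasta and the example sequence are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Illustrative sketch only; not the parser used by the ConSeq server.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical query: one header line followed by wrapped sequence lines.
example = ">query example protein\nMKTAYIAKQR\nQISFVKSHFS\n"
print(read_fasta(example))
```

The sequence lines of each record are joined, so a sequence wrapped over several lines is read as a single string.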

Searching for homologous sequences

The server uses the PSI-BLAST heuristic algorithm (Altschul et al., 1997) with default parameters to collect sequences homologous to the query sequence. The search is carried out against the SWISS-PROT database (O'Donovan et al., 2002) or the full UNI-PROT knowledgebase (SWISS-PROT + TrEMBL), by default with a single iteration of PSI-BLAST and an E-value cutoff of 0.001. A profile search with up to 5 PSI-BLAST iterations can be run by changing the number of iterations on the Home Page. The E-value is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. The higher the E-value cutoff, the more hits are collected, but the pairwise distance between them and the query sequence increases. The E-value cutoff can also be changed from the server Home Page.

The homologue names extracted from the PSI-BLAST output file are the original SWISS-PROT key names found by PSI-BLAST (e.g. XXXX_YYY). If more than one match is found for the same sequence, the first one keeps its original SWISS-PROT name while the others receive a sequential number (e.g. XXXX_YYY_1).
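The naming rule for repeated matches can be sketched as follows. This is an illustration under the assumption that the sequential numbers start at 1 and increase by one per additional match; the function name uniquify_names is hypothetical and this is not the server's code.

```python
from collections import defaultdict


def uniquify_names(names):
    """Append a sequential number to repeated names (assumed scheme)."""
    seen = defaultdict(int)
    result = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get _1, _2, ...
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] += 1
    return result


print(uniquify_names(["XXXX_YYY", "AAAA_BBB", "XXXX_YYY", "XXXX_YYY"]))
# prints ['XXXX_YYY', 'AAAA_BBB', 'XXXX_YYY_1', 'XXXX_YYY_2']
```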

Generating the Multiple Sequence Alignment (MSA)

The server uses CLUSTALW (Thompson et al., 1994) with default parameters to align the homologues extracted from the PSI-BLAST output file.

In cases where the user provides an MSA, it can be in any of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF and RSF. For more information on these formats check Additional MSA Format Information and The Format Converter. When using an MSA from the HSSP database, please convert it first to the Fasta format using the HsspToFasta script.

Calculating the amino acid conservation scores

The server uses the neighbor-joining method (Saitou and Nei, 1987) to construct an evolutionary tree consistent with the available MSA, and calculates a conservation score for each site on the MSA. This is carried out using either an empirical Bayesian or the Maximum Likelihood method.

The rate of evolution is not constant among amino acid sites: some positions evolve slowly and are commonly referred to as "conserved", while others evolve rapidly and are referred to as "variable". The rate variations correspond to different levels of purifying selection acting on these sites. The purifying selection can be the result of geometrical constraints on the folding of the protein into its 3D structure, or of constraints at amino acid sites involved in enzymatic activity, in ligand binding or in protein-protein interactions. The rate of evolution at each site is calculated using either an empirical Bayesian or the Maximum Likelihood paradigm. This takes into account the stochastic process that underlies sequence evolution within protein families, as well as the topology and branch lengths of the phylogenetic tree of the protein family. The conservation score at a site corresponds to the site's evolutionary rate. A detailed description of the algorithm is provided in Bioinformatics 18 Suppl. 1 S1-S7, 2002 (PDF). Rate4Site, a stand-alone implementation of this algorithm, is also available.

The conservation scores calculated by ConSeq appear in the SCORE column in the "Amino Acid Conservation Score" output file. The scores are normalized, so that the average score for all residues is zero and the standard deviation is one. The scores are therefore a relative measure of evolutionary conservation at each site of the query sequence. The lowest score represents the most conserved position in the protein. It does not necessarily indicate 100% conservation (i.e. no mutations at all), but rather indicates that this position is the most conserved in this specific protein, as calculated from a specific MSA.
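The normalization described above is a standard z-score transformation. The sketch below illustrates it on made-up rates; the function name normalize_scores is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not say which one the server uses.

```python
import statistics


def normalize_scores(rates):
    """Rescale per-site rates to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.fmean(rates)
    sd = statistics.pstdev(rates)
    if sd == 0:
        raise ValueError("all sites have the same rate; cannot normalize")
    return [(r - mean) / sd for r in rates]


# Made-up evolutionary rates for four sites.
scores = normalize_scores([0.2, 1.0, 1.8, 3.0])
most_conserved_site = scores.index(min(scores))  # lowest score = most conserved
print(most_conserved_site)
```

After this rescaling, negative scores mark sites that evolve more slowly than the average of this protein, which is why the scores are relative and cannot be compared between runs.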

Model of substitution for proteins

The inference of evolutionary conservation relies on a specified probabilistic model of amino-acid replacements. The server supports several models of substitution for nuclear DNA-encoded proteins as well as models for non-nuclear DNA-encoded proteins. The model of substitution can be chosen from the "Model of substitution for proteins" drop-down list, which is available in the Advance Options on the server Home Page. The JTT, Dayhoff and WAG matrices are suitable for nuclear DNA-encoded proteins. The WAG matrix has been inferred from a large database of sequences comprising a broad range of protein families and is thus suitable for distantly related amino acid sequences. The mtREV and cpREV matrices are suitable for mitochondrial and chloroplast DNA-encoded proteins, respectively.

Conservation coloring scheme

The discrete conservation color scheme used for visualization is based on the continuous conservation scores. The color grades (1-9) are assigned as follows:

The conservation scores below the average (negative values, which are indicative of slowly evolving, conserved sites) are divided into 4.5 equal intervals. The same 4.5 intervals are used for the scores above the average (positive values, which are indicative of rapidly evolving, variable sites). Thus, 9 equally sized categories of conservation, or grades, are obtained. Colors are assigned to the 9 grades for graphic visualization. The coloring results of a ConSeq run do not indicate the absolute magnitudes of evolutionary distances, but rather the relative degrees of conservation for each residue. The ConSeq scaling procedure does not guarantee that grades 1-8 will always be occupied, although grade 9 is always occupied by at least one residue.
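One possible reading of this scheme is sketched below. It assumes that the interval width is the distance from the lowest (most conserved) score to the average, divided by 4.5, that grade 9 is the most conserved grade, and that scores beyond the ninth interval fall into grade 1. These are assumptions made for illustration; the function name color_grades is hypothetical and the server's exact binning may differ.

```python
def color_grades(scores):
    """Map normalized conservation scores to color grades 1-9 (assumed scheme)."""
    lowest = min(scores)
    if lowest >= 0:
        raise ValueError("expected at least one below-average (negative) score")
    # Assumption: the negative range [lowest, 0] spans 4.5 intervals.
    width = -lowest / 4.5
    grades = []
    for s in scores:
        # Interval 0 starts at the lowest score; the average (0) lies in
        # the middle of interval 4, which corresponds to grade 5.
        idx = int((s - lowest) // width)
        idx = min(idx, 8)  # very variable sites are capped at grade 1
        grades.append(9 - idx)
    return grades


# Made-up normalized scores.
print(color_grades([-1.8, -0.9, 0.0, 0.9, 3.0]))
# prints [9, 7, 5, 3, 1]
```

Under this reading, the site with the lowest score always receives grade 9, whereas the other grades are occupied only if some score falls into the corresponding interval, which matches the statement above about grades 1-8.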

Surface accessibility prediction

A feed-forward neural network system (Fariselli and Casadio, 2001) is combined with evolutionary information, using pre-made PSI-BLAST profiles, in order to predict the relative solvent accessibility of each residue in the query sequence based on the MSA. A residue is classified as buried or exposed according to whether its relative accessibility value is lower or higher than 16%. The output consists of a single logistic unit for each residue that estimates whether it is solvent-accessible or not (i.e. buried vs. exposed). The prediction of the solvent accessibility reaches an accuracy of 76%. A detailed description of the algorithm is provided in Bioinformatics, 17, 202-204, 2001 (PDF) and in Proteins 47: 142-153, 2002 (PDF).

Important note: The neural network algorithm was designed to predict the relative surface accessibility of amino acids in globular proteins. Thus, the buried/exposed prediction for the transmembrane (TM) segments of proteins is inaccurate; consequently the structural/functional classification is meaningless for these parts. Therefore it is recommended to ignore the surface accessibility predictions for these segments and to consider the evolutionary rates only.

Output

In each run, ConSeq produces an output file called "ConSeq Job Status Page". This file is automatically updated every 30 seconds, showing the input parameters uploaded to the server and messages regarding the different stages of the server activity. When the calculation finishes, ConSeq produces the message "ConSeq finished the calculation" and several links appear:

"View The Colored Results "

This is the main link of the page. It leads to the graphic visualization of the color-coded sequence, the solvent accessibility prediction for the amino acids in the sequence and an indication regarding the important functional and structural residues.

In addition, ConSeq presents several files that are generated during the calculations:

"PSI-BLAST output"

This links to the output file generated by PSI-BLAST, which includes the sequences found, their pairwise alignment with the query sequence, etc.

This file is not generated if the user provides the MSA.

"The Homologues found by PSI-BLAST (in FASTA format)"

This links to the file including the homologous sequences and their SWISS-PROT code names extracted from the PSI-BLAST output, converted to FASTA format.

This file is not generated if the user provides the MSA.

"Multiple Sequence Alignment (in Clustal format)"

This links to the output file produced by CLUSTALW.

If the user provides a non-CLUSTALW format MSA, the MSA is converted to CLUSTALW format and presented in this output file.

"Phylogenetic Tree"

This links to the textual data of the generated phylogenetic tree based on the neighbor-joining method (Saitou and Nei, 1987). To view a graphic display of the tree, several programs can be used for displaying phylogenies, for example TreeView.

"Amino Acid Conservation Scores"

This file lists the conservation scores obtained for each amino acid position of the target sequence, together with the color grade of each site. The predicted buried (b) or exposed (e) status of each residue is shown in an additional column. The "FUNCTION" column indicates the structural and functional residues according to these codes: highly conserved (color grade: 8 or 9) and exposed residues are marked with an "f", since they are predicted to be functional, whereas highly conserved (color grade: 9) and buried residues are marked with an "s", since they are predicted to have an important structural role. Since gaps are treated in the algorithm as missing data, not all sequences are taken into account at each position. Hence the "MSA data" column indicates the number of non-gapped homologues used in the calculation at each amino-acid site, out of all the homologue sequences. Sites that contain fewer than 6 non-gapped homologue sequences (or fewer than 10% of the sequences, when there are more than 60 homologues) are considered "insufficient data" sites, and their one-letter amino-acid code is colored in yellow in the graphic visualization output.
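The rules for the "FUNCTION" column and for "insufficient data" sites can be summarized in a short sketch. The function names and the use of an empty string for unmarked residues are hypothetical choices made for illustration; only the thresholds are taken from the description above.

```python
def function_code(grade, burial):
    """Return 'f', 's' or '' for a residue.

    grade:  color grade, 1-9 (9 = most conserved)
    burial: 'e' for exposed, 'b' for buried
    """
    if burial == "e" and grade >= 8:
        return "f"  # highly conserved and exposed: predicted functional
    if burial == "b" and grade == 9:
        return "s"  # highly conserved and buried: predicted structural
    return ""


def insufficient_data(non_gapped, total_homologues):
    """True if a site has too few non-gapped homologues.

    Fewer than 6 sequences, or fewer than 10% of the sequences when there
    are more than 60 homologues (integer arithmetic avoids rounding).
    """
    return non_gapped < 6 or non_gapped * 10 < total_homologues


print(function_code(8, "e"), function_code(9, "b"), repr(function_code(8, "b")))
print(insufficient_data(5, 30), insufficient_data(9, 100), insufficient_data(10, 100))
```

Note that the two thresholds coincide at exactly 60 homologues, where 10% of the sequences equals 6.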

Graphic Visualization (Figure 1)

The ConSeq results are automatically visualized on-line.

The output is arranged in batches of three rows. The uppermost row shows the sequence of the query protein with the evolutionary rates color-coded onto each site (see Legend). The residues of the query sequence are always numbered starting from 1. The middle row lists the predicted burial status of each site (i.e. "b"-buried vs. "e"-exposed). The lowermost row indicates residues predicted to be structurally and functionally important, "s" and "f", respectively.

Amino-acid sites categorized as "Insufficient data" (see above) are colored in yellow, indicating that the calculation for these sites was generated using only a few of the homologous sequences.

A legend below the result shows the conservation coloring scale and graphical description.

Figure 1: ConSeq results:

REFERENCES

1. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402, 1997.

2. Fariselli, P. and Casadio, R. RCNPRED: prediction of the residue co-ordination numbers in proteins. Bioinformatics, 17, 202-204, 2001.

3. Glaser, F., Pupko, T., Paz, I., Bell, R.E., Bechor, D., Martz, E. and Ben-Tal N. ConSurf: Identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics 19:1-3, 2002.

4. Pollastri, G., Baldi, P., Fariselli, P. and Casadio, R. Prediction of coordination number and relative solvent accessibility in proteins. Proteins 47: 142-153, 2002.

5. O'Donovan, C., Martin, M.J., Gattiker, A., Gasteiger, E., Bairoch, A. and Apweiler, R. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief. Bioinform. 3, 275-284, 2002.

6. Pupko, T., Bell, R.E., Mayrose, I., Glaser, F. and Ben-Tal N. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 18:S1-S7, 2002.

7. Saitou, N. and Nei, M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406-425, 1987.

8. Thompson, J.D., Higgins, D.G. and Gibson, T.J. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680, 1994.

9. Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797, 2004.
