Local PFAM Domain Search
Many proteins have a unique combination of domains which can be responsible for the protein function – e.g. the catalytic activities of enzymes. With CLC Protein Workbench and CLC Main Workbench you can predict such domains on unknown or uncharacterized proteins by performing local searches in the PFAM database.
About the PFAM database
The PFAM database is a large collection of multiple sequence alignments that covers approximately 8000 protein domains and protein families [Bateman et al., 2004]. Based on the individual domain alignments, profile HMMs have been developed. These profile HMMs can then be used to search for domains in unknown sequences.
Annotating unknown sequences with pairwise alignment methods, by simply transferring the annotation from a known protein to the unknown partner, does not, however, take domain organization into account [Galperin and Koonin, 1998]. Using such a method can therefore result in a wrong annotation of an unknown protein (e.g. an enzyme) if the pairwise alignment only finds one regulatory domain. The PFAM database therefore provides additional information but, as with any other prediction, the risk of a wrong answer must be taken into account.
PFAM domain searches in CLC's workbenches
When using CLC's workbenches, the PFAM database is stored on your desktop. The searches are therefore not internet-based searches, but local searches.
Using the PFAM search option, you can search for domains in sequence data which otherwise do not carry any information about annotations. The PFAM search option adds all found domains onto the protein sequence which was used for the search. If domains of no relevance are found, they can easily be removed by the user.
We have implemented our own HMM algorithm for prediction of the PFAM domains. Thus, we do not use the original HMM implementation, HMMER (http://hmmer.wustl.edu), for domain prediction. Instead, we find the most probable state path/alignment through each profile HMM with the Viterbi algorithm. Based on that path, we derive a new null model by averaging over the emission distributions of all M and I states that appear in the state path (M is a match state and I is an insert state).
From that model we arrive at an additive correction to the original bit-score, as is done in the original HMMER algorithm.
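The null-model step described above can be illustrated with a minimal Python sketch. It is not the workbench's implementation: the function names, the dictionary-based data layout, and the assumption that the Viterbi state path is already available are all hypothetical. It also leaves out any prior weighting that a full implementation such as HMMER may apply to the correction.

```python
import math

def null2_correction(path, emissions, null1):
    """Derive a path-specific null model and its bit-score correction.

    path      -- list of (state_id, residue) pairs for the M and I states
                 on the Viterbi path (hypothetical layout)
    emissions -- dict: state_id -> {residue: emission probability}
    null1     -- dict: residue -> background probability (original null model)
    """
    alphabet = list(null1)
    # New null model: average the emission distributions of all
    # M and I states that appear in the state path.
    null2 = {a: 0.0 for a in alphabet}
    for state_id, _residue in path:
        for a in alphabet:
            null2[a] += emissions[state_id][a]
    null2 = {a: p / len(path) for a, p in null2.items()}

    # Log-odds (in bits) of the new null model against the original one,
    # summed over the residues emitted along the path.
    correction = sum(math.log2(null2[res] / null1[res]) for _state, res in path)
    return null2, correction

def corrected_bit_score(raw_bits, correction):
    # Scoring against the new null model instead of the original one
    # amounts to adding the negated correction to the raw bit-score.
    return raw_bits - correction
```

In this sketch, a sequence whose composition resembles the averaged emissions of the path receives a positive correction, so its corrected score is lower than its raw score.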
You can perform the analysis on several protein sequences at a time. This will add annotations to all the sequences and open a view for each sequence.
PFAM search parameters
The following search parameters are available when searching the PFAM database:
- Choose database and search type. When searching for PFAM domains it is possible to choose different databases and specify the search for full domains or fragments of domains.
- Search full domains and fragments. This option allows you to search for both full domains and partial domains. A partial domain can occur if a domain extends beyond the ends of a sequence.
- Search full domains only. With this option, only full domains will be found.
- Search fragments only. Only partial domains will be found.
- Database. Choose between a database with the 100 most frequent domains, one with the 500 most frequent domains, and one with all domains included in the PFAM database.
- Set significance cutoff. The E-value (expectation value) of a hit is the number of hits that would be expected to have a score equal to or better than that hit's score by chance alone. This means that a good E-value, one which gives a confident prediction, is much less than 1. E-values around 1 are what is expected by chance. Thus, the lower the E-value, the more specific the search for domains will be.
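How the search type and the significance cutoff combine can be sketched in a few lines of Python. This is an illustration only, not the workbench's code: the DomainHit record, the filter_hits function, the default cutoff of 1e-3, and the rule that a hit counts as a full domain when it spans the whole profile model are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class DomainHit:
    # Hypothetical record for one predicted domain.
    name: str
    e_value: float
    model_start: int   # first model position covered by the hit (1-based)
    model_end: int     # last model position covered by the hit
    model_length: int  # total number of positions in the profile model

    @property
    def is_full(self):
        # Assumption: a full domain covers the model from end to end.
        return self.model_start == 1 and self.model_end == self.model_length

def filter_hits(hits, e_value_cutoff=1e-3, mode="both"):
    """Keep hits at or below the E-value cutoff that match the search type.

    mode -- "both", "full" or "fragments" (hypothetical option names)
    """
    if mode not in ("both", "full", "fragments"):
        raise ValueError("unknown mode: " + mode)
    kept = []
    for hit in hits:
        if hit.e_value > e_value_cutoff:
            continue  # not significant enough
        if mode == "full" and not hit.is_full:
            continue
        if mode == "fragments" and hit.is_full:
            continue
        kept.append(hit)
    # Most significant hits first.
    return sorted(kept, key=lambda h: h.e_value)
```

For example, with a cutoff of 1e-3, a hit with an E-value of 0.5 is discarded regardless of the search type, whereas a hit with an E-value of 1e-5 that covers only part of its model is kept in the "both" and "fragments" modes.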
After the search, a view is opened showing the found domains as annotations on the original sequence. If you have selected several sequences, a corresponding number of views will be opened.
Each domain found will be represented as an annotation on the sequence. More information on each found domain is available through the tool tip, including detailed information on the identity score which is the basis for the prediction.
For a more detailed description of the scores shown in the tool tip, see:
http://www.sanger.ac.uk/Software/Pfam/help/scores.shtml
Download and installation of additional PFAM databases
Additional databases can be downloaded from our download page.
The download page offers databases containing the 100 most frequent domains, the 500 most frequent domains, and the complete database of approximately 8000 domains. It also includes descriptions (PDF files) of the databases.
Each quarter, CLC Protein Workbench and CLC Main Workbench customers will receive a CD-ROM from CLC bio with updated versions of all three databases.
























