Other names: simap

Protein sequences are of utmost importance for studying the function and evolution of genes and genomes. A rich collection of methods in computational biology therefore relies on the analysis and comparison of protein sequences. Many of these intensively used methods perform sequence similarity searches (e.g. BLAST (1)) or compare protein sequences against secondary databases of protein families (e.g. InterPro (2)).

The rapidly increasing volume of publicly available protein sequences creates a computational dilemma for bioinformatics tasks that require repeated all-against-all calculations of sequence similarities or sequence features. Such conceptually straightforward but technically challenging tasks include, among others, the annotation of genomes and the clustering of the protein sequence space into protein families. The Similarity Matrix of Proteins (SIMAP) resolves this dilemma by incrementally pre-calculating the sequence similarities that form the known protein sequence space (3). Because similarity scores are symmetric, comparing each new sequence against the known ones yields scores that can also be written into the existing records. To complement the pairwise sequence similarity matrix with position-specific searches against known protein families, SIMAP additionally pre-calculates sequence-based features such as InterPro matches (2).

The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences. Access to SIMAP is freely provided through the web portal for individual users and, for programmatic access, through DAS and a Web Service.
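The incremental pre-calculation described above can be illustrated with a small sketch. It is not SIMAP's implementation: the class `SimilarityMatrix`, the function `toy_similarity` and the k-mer Jaccard score are hypothetical stand-ins chosen only to keep the example self-contained, and a real system would use an alignment-based score. The point it shows is that each new sequence is compared only against the sequences already stored, and that one symmetric score per pair serves both records.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_similarity(a, b, k=3):
    """Jaccard index of k-mer sets: a symmetric stand-in for an alignment score."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


class SimilarityMatrix:
    """Illustrative incremental store of pairwise similarity scores."""

    def __init__(self):
        self.sequences = {}   # sequence id -> sequence string
        self.scores = {}      # frozenset({id_a, id_b}) -> score
        self.comparisons = 0  # number of pairwise comparisons performed

    def add(self, seq_id, sequence):
        """Add one sequence, comparing it only against those already stored."""
        if seq_id in self.sequences:
            return  # already covered, nothing to recalculate
        for other_id, other_seq in self.sequences.items():
            # An unordered pair is the key, so the score is stored once
            # and is valid for both directions.
            pair = frozenset((seq_id, other_id))
            self.scores[pair] = toy_similarity(sequence, other_seq)
            self.comparisons += 1
        self.sequences[seq_id] = sequence

    def score(self, id_a, id_b):
        """Look up a pre-calculated score; order of the ids does not matter."""
        if id_a == id_b:
            seq = self.sequences[id_a]
            return toy_similarity(seq, seq)
        return self.scores[frozenset((id_a, id_b))]
```

With this layout, adding the n-th sequence costs n - 1 comparisons rather than a recalculation of the whole matrix, and `matrix.score("p1", "p2")` returns the same value as `matrix.score("p2", "p1")`.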




protein sequence, protein domains and classification, sequence analysis
