iProLINK (integrated Protein Literature, INformation and Knowledge) is a resource to facilitate text mining research in the areas of literature-based database curation, named entity recognition, and protein ontology development. This collection of annotated data sources can be used by computational and biological researchers to explore literature information on proteins and their features or properties (Hu et al., 2004). The data sets for bibliography mapping and feature evidence attribution include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter include ~800 abstracts and/or full-text articles in which text evidence was tagged for ~1200 experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database (PIR-PSD). The data sets for entity recognition and ontology development include protein name dictionaries, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, and a protein ontology based on PIRSF protein family names. All data sets are freely accessible and can be downloaded from http://pir.georgetown.edu/iprolink/.
protein sequence, protein properties