Protein name tagging guidelines: lessons learned

Authors: Mani Inderjeet (1); Hu Zhangzhi (1); Jang Seok Bae (1); Samuel Ken (2); Krause Matthew (1); Phillips Jon (1); Wu Cathy H. (1)

Source: Comparative and Functional Genomics, Volume 6, Numbers 1-2, February 2005, pp. 72-76(5)

Publisher: John Wiley & Sons, Ltd.

Abstract:

Interest in information extraction from the biomedical literature is motivated by the need to speed up the creation of structured databases representing the latest scientific knowledge about specific objects, such as proteins and genes. This paper addresses the issue of a lack of standard definition of the problem of protein name tagging. We describe the lessons learned in developing a set of guidelines and present the first set of inter-coder results, viewed as an upper bound on system performance. Problems coders face include: (a) the ambiguity of names that can refer to either genes or proteins; (b) the difficulty of getting the exact extents of long protein names; and (c) the complexity of the guidelines. These problems have been addressed in two ways: (a) defining the tagging targets as protein named entities used in the literature to describe proteins or protein-associated or -related objects, such as domains, pathways, expression or genes, and (b) using two types of tags, protein tags and long-form tags, with the latter being used to optionally extend the boundaries of the protein tag when the name boundary is difficult to determine. Inter-coder consistency across three annotators on protein tags on 300 MEDLINE abstracts is 0.868 F-measure. The guidelines and annotated datasets, along with automatic tools, are available for research use. Copyright © 2005 John Wiley & Sons, Ltd.
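The inter-coder figure above is reported as an F-measure. The abstract does not say how the three annotators' outputs were compared, so the following is only a minimal sketch under stated assumptions: each annotator's protein tags are represented as character spans, two tags agree only on an exact boundary match, and the overall score is the mean of the pairwise F-measures. The function names, span representation, and toy data are all hypothetical and are not taken from the paper or its tools.

```python
from itertools import combinations


def f_measure(reference, candidate):
    """Balanced F-measure between two sets of (start, end) tag spans.

    Assumption: a tag counts as matched only if both boundaries agree
    exactly. The paper may use a different matching criterion.
    """
    reference, candidate = set(reference), set(candidate)
    if not reference and not candidate:
        return 1.0  # two empty annotations agree trivially
    matched = len(reference & candidate)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_f(annotations):
    """Average the F-measure over every pair of annotators.

    Assumption: pairwise averaging is one common way to summarise
    agreement among more than two coders; it is not confirmed by the source.
    """
    pairs = list(combinations(annotations.values(), 2))
    return sum(f_measure(a, b) for a, b in pairs) / len(pairs)


# Toy data: hypothetical spans for one abstract, not real annotations.
coders = {
    "coder_a": {(0, 4), (10, 18), (25, 30)},
    "coder_b": {(0, 4), (10, 18)},
    "coder_c": {(0, 4), (10, 20), (25, 30)},
}

agreement = mean_pairwise_f(coders)
```

With exact matching, a disagreement about where a long name ends (as between `coder_a` and `coder_c` on the second span) counts as a full miss for both coders, which is the kind of boundary problem the optional long-form tag is described as addressing.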

Keywords: nomenclature; protein names; guidelines; database curation; named entity tagging; inter-coder reliability

Document Type: Miscellaneous

DOI: 10.1002/cfg.452

Affiliations: 1: Georgetown University, 37th and O Streets NW, Washington, DC 20057, USA; 2: The MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102, USA
