
GNAT - Inter-species gene mention normalization (ISGN)

Text mining in the biomedical domain aims to help researchers access information contained in scientific publications faster, more easily, and more completely. One step towards this aim is the recognition and subsequent identification of named entities. For instance, mapping mentions of genes to databases such as Entrez Gene facilitates sophisticated indexing and querying, and allows publications to be hyperlinked to more detailed information on genes.


[Figure: Mapping the gene mention 'p21' to the human genome]

The task of gene mention normalization is best described by an example. Suppose you find an article that mentions the word "p21". After having determined that this mention indeed refers to a gene or protein (and is not an abbreviation for "page 21"), you want to find out which entry in a database such as Entrez Gene the article is talking about. Entrez Gene contains 370 entries that have "p21" as a valid name or as part of a name.
The first problem to solve is which species the gene mention refers to. If you can decide that the authors are talking about a human gene, you are left with 75 candidates. In the figure, we consider three candidates, with the official symbols "CDKN1A", "HRAS", and "RHOA", respectively. Since you can rule out RHOA (a bovine gene), you are left (in this example) with two human candidate genes.
To further disambiguate between the two candidates, we consider the context in which the gene name was found (that is, the abstract, paragraph, or full text). If the gene is discussed together with "an inhibition of the activity of cyclin-CDK2 or -CDK4 complexes" or "negative regulation of cell cycle", this points towards CDKN1A; if the text mentions the chromosomal location "11p15.5", the mutation G12S in the protein, or "cell surface receptor linked signal transduction", this points towards HRAS.
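
The two steps of the 'p21' example (filter candidates by species, then compare them against the surrounding text) can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not GNAT's actual implementation: the `Candidate` class, the hand-written context terms, and the simple substring-counting score are all hypothetical and stand in for whatever gene annotations and scoring a real system would use.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one database entry that matches 'p21'."""
    symbol: str
    species: str
    context_terms: Tuple[str, ...]  # lower-case phrases taken from the example


# The three candidates from the example; term lists are illustrative only.
CANDIDATES = [
    Candidate("CDKN1A", "human",
              ("cyclin-cdk2", "cdk4", "negative regulation of cell cycle")),
    Candidate("HRAS", "human",
              ("11p15.5", "g12s",
               "cell surface receptor linked signal transduction")),
    Candidate("RHOA", "bovine", ()),
]


def filter_by_species(candidates: List[Candidate], species: str) -> List[Candidate]:
    """Step 1: keep only the candidates that belong to the given species."""
    return [c for c in candidates if c.species == species]


def context_score(candidate: Candidate, text: str) -> int:
    """Step 2: count how many of the candidate's terms occur in the text."""
    lowered = text.lower()
    return sum(1 for term in candidate.context_terms if term in lowered)


def normalize(text: str, species: str,
              candidates: List[Candidate] = CANDIDATES) -> Optional[str]:
    """Return the symbol of the single best candidate, or None if undecided."""
    remaining = filter_by_species(candidates, species)
    scored = [(context_score(c, text), c) for c in remaining]
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return None  # no supporting context found
    winners = [c for score, c in scored if score == best]
    # A tie means the context does not separate the candidates.
    return winners[0].symbol if len(winners) == 1 else None


if __name__ == "__main__":
    abstract = "p21 contributes to negative regulation of cell cycle."
    print(normalize(abstract, "human"))
```

Returning `None` on a tie or on missing evidence is a deliberate choice in this sketch: it leaves an ambiguous mention unresolved instead of guessing an identifier.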


GNAT searches text for mentions of genes and maps each mention to its Entrez Gene identifier. GNAT currently includes more than 3.5 million genes from ca. 4700 species. The online version (see "Submit query") handles a subset that consists of ca. 36,500 human genes (135,000 different gene names). Please refer to [Hakenberg et al., 2008] for further information.


Submitting queries

  • Submit a new query through the web interface
  • Download and install a local copy of GNAT from SourceForge

Benchmarks and supplementary information

  • Test collection for the ISGN task
  • Further remarks on the data set
  • Supplementary information on related publications

Developers

GNAT is being developed and hosted by the BioAI lab at Arizona State University's Computer Science Department (Jörg Hakenberg, project leader), ASU's DIEGO lab at the Biomedical Informatics Department (Prof. Graciela Gonzalez, group leader; Bob Leaman, developer, PhD student), and the Bioinformatics group of the Technical University Dresden (Prof. Michael Schroeder, group leader; Conrad Plake, developer, PhD student).


(c)2007-2010 BioAI