How to programmatically map GI numbers directly to HGNC gene names?


I have a collection of ~2000 GI numbers that I need to map to HGNC (aka HUGO) gene names.

I will have to do a similar mapping repeatedly in the future, as part of a data analysis pipeline, so I would like to do this mapping programmatically (as opposed to cutting and pasting the 2K GI numbers into some interactive tool's interface).

Furthermore, I'm constrained to work only with free software. I am most comfortable with Python and Perl, although I can work OK with R and Java and, as a last resort, with anything else (Ruby, MATLAB, Tcl, etc.).


(The remainder of this post is not essential to the question. It provides additional background info, FWIW, and gets increasingly technical towards the end; that content will be meaningful only to those familiar with NCBI's eutils interface.)

One possibility would be to scrape the HGNC id(s) from the web page for each GI number (example), but these pages use JavaScript to load their content, which puts them beyond my web scraping abilities.

Even if I could carry out such web scraping, the results are bound to be of lower quality than those obtained from a proper web service API.

Unfortunately, I have not found any "official" service to programmatically map GI numbers directly to HGNC/HUGO gene names. The best hope I had for this was NCBI's eutils interface, but I was not able to find a way to perform the direct mapping I was after. (Please correct me if I'm wrong!)

The best I could come up with was a 2-hop mapping: use eutils (or rather, the interface to eutils provided by the bioservices.eutils Python module) to map GI numbers to Entrez Gene IDs, and then use a comprehensive table downloaded from HGNC to map these Entrez Gene IDs to HGNC/HUGO gene names.
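The second hop described above can be sketched with the standard library alone. This is a minimal sketch, not the exact procedure used in the question: it assumes the HGNC table is tab-separated and has columns named `entrez_id` and `symbol` (adjust the two parameters if your download uses different headers). The helper names `load_entrez_to_symbol` and `map_gi_to_symbol` and the GI numbers in the demo are hypothetical.

```python
import csv
import io


def load_entrez_to_symbol(handle, entrez_col="entrez_id", symbol_col="symbol"):
    """Build {Entrez Gene ID: HGNC symbol} from a tab-separated table.

    The column names are assumptions; pass the headers your table uses.
    """
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        entrez_id = (row.get(entrez_col) or "").strip()
        symbol = (row.get(symbol_col) or "").strip()
        if entrez_id and symbol:
            mapping[entrez_id] = symbol
    return mapping


def map_gi_to_symbol(gi_to_entrez, entrez_to_symbol):
    """Second hop: turn {GI: [Entrez Gene IDs]} into {GI: [HGNC symbols]}.

    Returns (mapped, unmapped) so the attrition can be measured directly.
    """
    mapped, unmapped = {}, []
    for gi, entrez_ids in gi_to_entrez.items():
        symbols = sorted(
            {entrez_to_symbol[e] for e in entrez_ids if e in entrez_to_symbol}
        )
        if symbols:
            mapped[gi] = symbols
        else:
            unmapped.append(gi)
    return mapped, unmapped


# Tiny in-memory stand-in for the downloaded HGNC table.
SAMPLE_TABLE = (
    "hgnc_id\tsymbol\tentrez_id\n"
    "HGNC:1100\tBRCA1\t672\n"
    "HGNC:11998\tTP53\t7157\n"
)

entrez_to_symbol = load_entrez_to_symbol(io.StringIO(SAMPLE_TABLE))

# Hypothetical first-hop output; the GI numbers are placeholders.
gi_to_entrez = {"111": ["672"], "222": ["7157", "999999"], "333": []}
mapped, unmapped = map_gi_to_symbol(gi_to_entrez, entrez_to_symbol)
```

Keeping the unmapped GI numbers in a separate list makes it easy to see which hop loses each identifier, which matters when most of the input drops out.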

As usual, the "attrition rate" for such a multi-hop mapping is pretty bad: ~25% of all the GI numbers got mapped to some HGNC/HUGO gene name. (I have yet to estimate how many of these mappings are actually correct.)

I attempted the first hop of this mapping using Python's bioservices.eutils library, but was able to get Entrez Gene IDs for only about one quarter of the 2K GI numbers this way. More specifically, this is what I used, in essence:

from bioservices import EUtils

s = EUtils()
xml = s.ELink(db='gene', dbfrom='protein', Ids='395398606')

# ...now parse the returned xml to get the returned Entrez gene id(s)

The call to s.ELink results in an HTTP request of the form:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=gene&dbfrom=protein&id=395398606&cmd=neighbor
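For reference, a request of this form can also be built, and its response parsed, with only the standard library. The sketch below is hedged: it assumes the usual ELink XML layout (`LinkSet` elements containing `IdList/Id` and `LinkSetDb/Link/Id`), uses the `https` scheme, and repeats the `id` parameter once per GI number so that each input gets its own `LinkSet` in a batched request. The function names are hypothetical, and `SAMPLE_XML` is a hand-written stand-in with a made-up gene ID, not a real NCBI response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(gi_numbers):
    """Build an ELink URL with one id= parameter per GI number."""
    params = [("dbfrom", "protein"), ("db", "gene"), ("cmd", "neighbor")]
    params += [("id", str(gi)) for gi in gi_numbers]
    return ELINK_URL + "?" + urlencode(params)


def parse_elink_xml(xml_text):
    """Return {GI number: [Entrez Gene IDs]} from an ELink XML response."""
    root = ET.fromstring(xml_text)
    result = {}
    for linkset in root.iter("LinkSet"):
        source_ids = [e.text for e in linkset.findall("./IdList/Id")]
        gene_ids = []
        for linkset_db in linkset.findall("LinkSetDb"):
            # Keep only links that point into the gene database.
            if linkset_db.findtext("DbTo") == "gene":
                gene_ids.extend(e.text for e in linkset_db.findall("./Link/Id"))
        for gi in source_ids:
            result.setdefault(gi, []).extend(gene_ids)
    return result


# Hand-written example: the gene ID 123456 and the GI number 111 are made up.
SAMPLE_XML = """<eLinkResult>
  <LinkSet>
    <DbFrom>protein</DbFrom>
    <IdList><Id>395398606</Id></IdList>
    <LinkSetDb>
      <DbTo>gene</DbTo>
      <LinkName>protein_gene</LinkName>
      <Link><Id>123456</Id></Link>
    </LinkSetDb>
  </LinkSet>
  <LinkSet>
    <DbFrom>protein</DbFrom>
    <IdList><Id>111</Id></IdList>
  </LinkSet>
</eLinkResult>"""

url = build_elink_url(["395398606", "111"])
gi_to_entrez = parse_elink_xml(SAMPLE_XML)
```

A GI number whose `LinkSet` has no `LinkSetDb` child ends up with an empty list, so unmapped inputs stay visible instead of silently disappearing.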

If there's a better eutils command to map GI numbers to Entrez gene ids than this, please let me know. Better yet, if there's a better eutils command to map GI numbers directly to HGNC/HUGO gene names, please let me know.

There are 2 best solutions below

You may be able to get what you need from the UCSC Genome Browser, which exposes its data as MySQL tables. These can be queried from most of the languages you mention, though my preference would be Python.

If I can get the information from NCBI through web scraping, what exactly do you need from those pages? For instance, which details do you need from the example you gave?

If what you need is available on that page, I can write PHP code to retrieve it for any number of GI numbers.