Download Symbol via EFetch #59

mapauley · 2022-06-14T16:56:18Z

Given a list of accession.version numbers, is there a way to download the official gene symbol (only) of the corresponding gene using one of the EDirect utilities? If not, any thoughts on how this might be best accomplished?

vkkodali · 2022-06-14T19:36:37Z

You can use NCBI Datasets for this.
An EntrezDirect method would be to use elink first to get links to genes from RefSeq accession.version, download gene DocSum and then extract the gene names. An example is shown below:

$ cat accs.txt
NR_133910.2
NM_001318896.2
NM_001160354.2
NM_001369393.2
NM_006552.2
$ epost -db nuccore -input accs.txt \
    | elink -target gene \
    | esummary \
    | xtract -pattern DocumentSummary -element Name 
MECP2
FHL2
CXCL17
LY6K
SCGB1D1

Since the gene DocSum does not have transcript accessions, a bash for loop can be used to map acc.ver to gene symbols:

$ cat accs.txt \
    | while read -r acc ; do
        g=$(epost -db nuccore -id "$acc" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name);
        printf '%s\t%s\n' "$acc" "$g" ;
done
NR_133910.2     CXCL17
NM_001318896.2  FHL2
NM_001160354.2  LY6K
NM_001369393.2  MECP2
NM_006552.2     SCGB1D1

mapauley · 2022-06-15T13:31:06Z

Thank you. This is extremely helpful.

I'm working with a list of 177,816 accession.versions (attached), the number of reference transcripts in the GRCh38 release on the NCBI Human Genome Resources page when I downloaded it. NCBI Datasets worked fine for a small test file of ten accession.versions, but when I let the complete file run overnight, it never finished. Was I just not patient enough? I would really like this to work.

Thanks for the scripts. I'm currently running the second. When I tested it, it worked fine but slowly: it seemed like I was only getting about one result per second, meaning my full list will take over two days to complete. Is there any way to speed this up?
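As a rough check on that estimate, assuming the rate holds steady at one lookup per second for the whole list (the `estimate_runtime` helper is hypothetical, named here only for illustration):

```shell
# estimate_runtime N RATE
# Prints the time needed for N lookups at RATE lookups per second,
# assuming the rate stays constant for the whole run.
estimate_runtime() {
    awk -v n="$1" -v r="$2" 'BEGIN {
        s = n / r                      # total seconds
        printf "%.1f hours (%.1f days)\n", s / 3600, s / 86400
    }'
}

estimate_runtime 177816 1   # 177,816 accessions at about 1 result per second
```

At that rate the run comes to roughly 49 hours, which matches the "over two days" figure.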

accNosRandom.txt

kharo · 2022-06-15T13:55:21Z

Using esummary may be faster:

epost -db nuccore -id NM_001318896.2 | elink -target gene | esummary -format text

The DocSum is HUGE and is going to be slow. But if you need to preserve 1:1 linkages between input and output, I believe you have to send each request one at a time, unfortunately.

vkkodali · 2022-06-15T14:33:09Z

> Is there any way to speed this up?

Not using EntrezDirect. As I had mentioned earlier, NCBI Datasets is a good choice for this. For example, I was able to download the data for the entire list in 6 min using the following command:

$ datasets summary gene accession --inputfile accNosRandom.txt --as-json-lines > gene_summary.jsonl

mapauley · 2022-06-15T15:51:47Z

Thanks for the information.

I downloaded the datasets and dataformat command-line tools (Linux AMD64). As a test, I ran
datasets summary gene accession --inputfile accNosRandom.short.txt --as-json-lines > accNosRandom.short.jsonl
where accNosRandom.short.txt (a list of ten accession.version numbers) is attached, and got accNosRandom.short.jsonl as a result (also attached, although I added a .txt extension so I could upload it). I then ran
dataformat tsv gene --inputfile ./accNosRandom.short.jsonl --fields transcript-accession,symbol
but got an empty result (see image). What am I doing wrong?

accNosRandom.short.txt
accNosRandom.short.jsonl.txt

vkkodali · 2022-06-15T16:02:41Z

You are not doing anything wrong. Currently dataformat does not support the jsonl files generated by datasets summary gene ... commands. This feature is in the pipeline and will be added in the near future.

As a workaround, you can "download" the package without any sequence data and use dataformat as shown below:

$ datasets download gene accession --inputfile accNosRandom.short.txt --exclude-gene --exclude-protein --exclude-rna 
Downloading: ncbi_dataset.zip    57.6kB done
$ dataformat tsv gene --package ncbi_dataset.zip --fields gene-id,symbol,transcript-accession | head -n3
NCBI GeneID     Symbol  Transcript Accession
10717   AP4B1   XR_007066904.1
10717   AP4B1   XM_017000090.2

mapauley · 2022-06-15T17:26:41Z

Again, thanks.

Unfortunately, this doesn't appear to work, as there is extraneous information in the result. I need the official symbol for each accession number in the file, in the same order as the file. The result appears to contain every record for each gene that has an accession.version number in my list. For example, XM_017000093.3 is the first accession number, which is a transcript for gene AP4B1. However, in the result provided by dataformat, I get a bunch of accession numbers for that gene, including the one I supplied.

vkkodali · 2022-06-15T18:25:39Z

Ah, the details! Yes, datasets by default returns all transcripts for a given gene, not just the ones you have asked for. An additional Unix join may be needed to filter the output of dataformat.
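A minimal sketch of that filtering step, assuming the dataformat output was written with `--fields symbol,transcript-accession` to a tab-separated file with one header line. The function name, file names, and sample rows are placeholders for illustration. Because join needs sorted input, the result comes back sorted by accession rather than in the order of the original list.

```shell
# filter_by_join ACC_LIST INFO_TSV
# Keeps only the rows of INFO_TSV whose transcript accession is in ACC_LIST.
# Assumes INFO_TSV is tab-separated with a header and columns: symbol, accession.
# Output: accession<TAB>symbol, sorted by accession (not in list order).
filter_by_join() {
    tab=$(printf '\t')
    work=$(mktemp -d)
    # join requires both inputs sorted on the join field, under the same locale.
    LC_ALL=C sort -u "$1" > "$work/accs.sorted"
    tail -n +2 "$2" \
        | awk -F'\t' -v OFS='\t' '{ print $2, $1 }' \
        | LC_ALL=C sort -t "$tab" -k1,1 > "$work/info.sorted"
    LC_ALL=C join -t "$tab" "$work/accs.sorted" "$work/info.sorted"
    rm -rf "$work"
}

# Illustrative data only.
demo=$(mktemp -d)
printf '%s\n' XM_017000093.3 NM_001318896.2 > "$demo/accs.txt"
printf '%s\t%s\n' \
    Symbol 'Transcript Accession' \
    AP4B1 XR_007066904.1 \
    AP4B1 XM_017000090.2 \
    AP4B1 XM_017000093.3 \
    FHL2 NM_001318896.2 > "$demo/info.tsv"

filter_by_join "$demo/accs.txt" "$demo/info.tsv"
```

Only the two requested accessions survive the join; the other AP4B1 transcripts are dropped.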

mapauley · 2022-06-17T20:41:24Z

Thanks again for all your help.

join requires both input files to be sorted on the join field. I wanted to preserve the order of my list, so I used grep instead:

cat accNosRandom.txt \
    | while read -r acc ; do
        # -F matches the accession literally (the "." is not a wildcard); -w rejects partial matches
        grep -m 1 -F -w -- "$acc" accNosRandom.info.tab ;
done

where accNosRandom.txt is my list of accession numbers and accNosRandom.info.tab is the dataformat output for the package.
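For a list this long, the loop above starts one grep per accession and rescans the table each time. A single awk pass that loads the table into a lookup array and then streams the list keeps the input order and reads each file once. This is only a sketch, assuming the table is tab-separated with one header line and the columns symbol, transcript accession; `map_in_order`, the file names, the sample rows, and the `NA` placeholder are illustrative choices.

```shell
# map_in_order ACC_LIST INFO_TSV
# Prints accession<TAB>symbol for every accession in ACC_LIST, in list order.
# Accessions missing from INFO_TSV are reported with the placeholder NA.
map_in_order() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { file++ }                            # track which input is being read
        file == 1 { if (FNR > 1) sym[$2] = $1; next }  # table: accession -> symbol
        { print $1, (($1 in sym) ? sym[$1] : "NA") }   # list: emit in input order
    ' "$2" "$1"
}

# Illustrative data only.
demo2=$(mktemp -d)
printf '%s\n' XM_017000093.3 NR_120632.1 NM_001318896.2 > "$demo2/accs.txt"
printf '%s\t%s\n' \
    Symbol 'Transcript Accession' \
    AP4B1 XR_007066904.1 \
    AP4B1 XM_017000093.3 \
    FHL2 NM_001318896.2 > "$demo2/info.tsv"

map_in_order "$demo2/accs.txt" "$demo2/info.tsv"
```

Reporting a placeholder for unmatched accessions keeps the output line-for-line aligned with the input list, which a plain grep loop does not do when an accession has no match.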

BTW, when I ran datasets, I got the messages below. What do they mean? Note that the last message is different. There are 177,816 accession.version numbers in the list I gave to datasets.
The current accession.version will be returned.
[the line above was printed 88 times in total]
The accession (NR_120632.1) you provided is not currently in NCBI Gene or does not have an associated NCBI GeneID.
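With this many repeated lines, it can help to capture the messages in a file and count each distinct one. A small sketch, assuming the messages were saved with an ordinary shell redirect such as `2> datasets.log` (it is an assumption here that datasets writes them to standard error); `tally_messages` and the sample log are made up for illustration.

```shell
# tally_messages LOGFILE
# Prints count<TAB>message for each distinct line, most frequent first.
tally_messages() {
    awk '{ n[$0]++ } END { for (m in n) print n[m] "\t" m }' "$1" | sort -rn
}

# Illustrative log only: three repeats and one distinct message.
log=$(mktemp)
{
    printf '%s\n' 'The current accession.version will be returned.'
    printf '%s\n' 'The current accession.version will be returned.'
    printf '%s\n' 'The current accession.version will be returned.'
    printf '%s\n' 'The accession (NR_120632.1) you provided is not currently in NCBI Gene or does not have an associated NCBI GeneID.'
} > "$log"

tally_messages "$log"
```

The tally makes a one-off message, like the NR_120632.1 line, stand out from the repeated ones.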
