Why would a special character cause an R function to drop components of the search string?


I am using the R package 'easyPubMed' to investigate species and the research effort (i.e. total number of publications) on those species. Typically, I can use the function get_pubmed_ids("example string[TI]") to return information from the NCBI database on how many publications have "example string" in the title.

However, I've run into a curious effect. It seems that when I try to enter specific species names in get_pubmed_ids() there are words dropped if the species name has a dash or an apostrophe. For instance, if I want to search for publications on the Tawny-breasted Tinamou, I enter the following:

get_pubmed_ids("Tawny-breasted Tinamou[TI]")

The results are odd: several different Tinamou species all return exactly 31 publications. I investigated the returned information and isolated the problem, but can't figure out a solution. Specifically, the function does accept the species name with the special character:

$OriginalQuery
[1] "Tawny-breasted+Tinamou[TI]"

However, the function appears to modify the text, because the 'Query Translation' shows the following:

$QueryTranslation
[1] "\"Tinamou\"[Title]"

Species names without a special character (e.g. Common Raven) do not have this error. And when I search "Tawny-breasted Tinamou[TI]" in the web browser of the NCBI database it seems to work.

If anyone has suggestions or potential explanations for why specific characters within the string cause the function to drop parts of the species name, I would be very interested.

Thank you.

I have attempted the search in the original database to make sure the search string would work overall, without success. I have also tried to escape the characters with backslashes so they might be recognized as special characters, but that did not seem to work. However, I am not sure I used the backslashes correctly. In sum, I've tried to get the R function to use the correct search string, to no avail.

divibisan On

I think this is on the NCBI side. The R code isn't doing anything strange with the strings and you see the same results if you search the NCBI website directly.

If you look at the XML returned by the queries, it shows more detail about what NCBI is doing behind the scenes:

When you search for common raven[TI]: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Common+Raven[TI]

you can see the query is translated as:

<QueryTranslation>"common raven"[Title]</QueryTranslation>

It puts the whole phrase in quotes and knows that the entire thing should be a title.


When you search for Tawny-breasted Tinamou[TI]: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Tawny-breasted+Tinamou[TI]

<QueryTranslation>"Tinamou"[Title]</QueryTranslation>
<ErrorList>
 <PhraseNotFound>Tawny-breasted</PhraseNotFound>
</ErrorList>

Here NCBI does not put the query in quotes. Instead, it treats it as two separate search terms: "Tawny-breasted" and "Tinamou[TI]". Since "Tawny-breasted" is reported as a phrase that was not found, it is dropped and the search runs on "Tinamou"[Title] alone. That is why every Tinamou species returns the same count.
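If you want to check for dropped phrases programmatically, here is a minimal base R sketch. It assumes the esearch response has already been read into a single character string, and it uses a simple regular expression rather than a proper XML parser. The helper name dropped_phrases is hypothetical and not part of easyPubMed.

```r
# Hypothetical helper: extract the phrases NCBI reported as not found
# from an esearch XML response supplied as one character string.
dropped_phrases <- function(xml_text) {
  # Find every <PhraseNotFound>...</PhraseNotFound> element
  hits <- regmatches(
    xml_text,
    gregexpr("<PhraseNotFound>[^<]*</PhraseNotFound>", xml_text)
  )[[1]]
  # Strip the surrounding tags, leaving only the phrase text
  gsub("</?PhraseNotFound>", "", hits)
}

# Example using the response fragment quoted above
xml_text <- paste(
  '<QueryTranslation>"Tinamou"[Title]</QueryTranslation>',
  "<ErrorList>",
  " <PhraseNotFound>Tawny-breasted</PhraseNotFound>",
  "</ErrorList>",
  sep = "\n"
)
dropped_phrases(xml_text)
```

For a live query, the response text could be obtained with something like paste(readLines(url), collapse = "\n") on the esearch URL. An empty result from the helper means NCBI did not report any dropped phrase.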


If you want to properly search for the whole term, you need to add the quotes yourself, as shown below:

get_pubmed_ids('"Tawny-breasted Tinamou"[TI]')
$Count
[1] "0"

$RetMax
[1] "0"

$RetStart
[1] "0"

$QueryKey
[1] "1"

$WebEnv
[1] "MCID_648b50fadd6389294158ec04"

$QueryTranslation
[1] "\"Tawny-breasted Tinamou\"[TI]"

$IdList
named list()

$TranslationSet
list()

$OriginalQuery
[1] "\"Tawny-breasted+Tinamou\"[TI]"

Note that the quotes around the phrase must be double quotes ("), so either enclose the whole string in single quotes, as above, or escape the inner double quotes with \:

get_pubmed_ids("\"Tawny-breasted Tinamou\"[TI]")
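If you are looping over many species names, it may be easier to build the quoted query strings in one place. The following is a small sketch, assuming every name should be searched as an exact phrase in the title field. The helper name title_query and its field argument are hypothetical.

```r
# Hypothetical helper: wrap a name in double quotes and append a field tag,
# so NCBI treats the whole name as a single phrase.
title_query <- function(name, field = "TI") {
  sprintf('"%s"[%s]', name, field)
}

species <- c("Tawny-breasted Tinamou", "Common Raven")

# sprintf() is vectorised, so this returns one query string per species
queries <- title_query(species)
queries
```

Each element of queries can then be passed to get_pubmed_ids(). Checking $QueryTranslation in the result is a quick way to confirm that no part of the name was dropped.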