Why is Google Natural Language returning an incorrect beginOffset for the analyzed string?

Question

Asked by LastMan0nEarth on 15 February 2017 at 12:55

I am using the google-cloud/language API to make an #annotate call that analyzes entities and sentiment in a CSV of comments I have collected from various online sources.

To begin with, the string I am trying to analyze includes comment IDs, so I reformat this:

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

so that it doesn't include any comment IDs:

I just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

After sending a request for google-cloud/language to #annotate the text, I receive a response that includes various substrings with their sentiments and magnitudes. Each substring also has a beginOffset value, which is supposed to be that substring's index in the original string (the string in the request).

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

My aim is then to locate the original comment in the original string, which should be simple enough: something like originalString[beginOffset].

But this value is incorrect!

I assume that certain characters are not being counted, but I have tried a multitude of regexes and nothing works perfectly. Does anyone have any idea what might be causing the issue?

Original Q&A

There are 3 answers below

**Iliiazbek Akhmedov** · Answer 1 · 2019-10-10T03:02:08.903000

This has to do with encoding. Try the different encoding types, or simply use the approach from the example in their GitHub repo:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py

Key code block:


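import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    # Narrow builds (some Python 2 installs) index strings by UTF-16 code
    # units; wide builds and every Python 3.3+ index by code point.
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'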

This worked for me. Without it, the offsets were thrown off by characters like ’ (that is, \u2019 in Unicode).
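To see why the choice of encoding matters, here is a minimal Python 3 sketch that measures the same string in the three units the encoding types correspond to: UTF-8 bytes, UTF-16 code units, and Unicode code points. The helper name `offset_units` is hypothetical and not part of any Google library.

```python
def offset_units(text: str) -> dict:
    """Length of `text` in each unit an offset could be counted in."""
    return {
        "UTF8": len(text.encode("utf-8")),            # bytes
        "UTF16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF32": len(text),                           # code points (Python 3 str)
    }


if __name__ == "__main__":
    # U+2019 takes 3 bytes in UTF-8 but only one UTF-16 code unit.
    print(offset_units("can\u2019t"))
    # A character outside the BMP takes 4 bytes, or 2 UTF-16 code units.
    print(offset_units("a\U0001F600"))
```

Everything after such a character is shifted by the difference, so an offset counted in one unit lands on the wrong character when used as an index in another.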

**wpfwannabe** · Answer 2 · 2019-12-18T09:54:34.037000

I know this is an old question, but the problem seems to persist even today. I recently encountered the same issue and resolved it by interpreting Google's offsets as byte offsets in the chosen encoding rather than as string offsets. Works great. I hope it helps someone.

The following is C# code, but anybody should be able to interpret it and recode it in their own favorite language. Assuming that text is the text being analyzed, the following code transforms Google's offsets into correct offsets.

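using System.Text;

// Converts a UTF-8 byte offset returned by the API into a C# string index.
int TransformOffset(string text, int offset)
{
    // Decode only the first `offset` bytes and count the resulting chars.
    return Encoding.UTF8.GetString(
            Encoding.UTF8.GetBytes(text),
            0,
            offset)
        .Length;
}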
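For readers not using C#, a rough Python 3 equivalent of the same idea follows. It is only a sketch, assuming the offsets really are UTF-8 byte offsets; the function name is made up for illustration. Note that it returns an index in code points (Python's native string unit), whereas the C# version returns a count of UTF-16 code units.

```python
def utf8_offset_to_index(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into an index usable with text[...]."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset
    # happens to fall in the middle of a multi-byte sequence.
    return len(prefix.decode("utf-8", errors="ignore"))


if __name__ == "__main__":
    original = "can\u2019t be upgraded. Nice logo"
    # Byte offset at which "Nice logo" starts in the UTF-8 encoding.
    byte_offset = len("can\u2019t be upgraded. ".encode("utf-8"))
    index = utf8_offset_to_index(original, byte_offset)
    print(original[index:])
```

Slicing the original string at the converted index, rather than at the raw offset, recovers the substring the offset was meant to point to.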

**Juan Carlos Sastre** · Answer 3 · 2021-01-08T19:42:14.523000

You should set the EncodingType on the request.

Example using the Java client library with UTF-8 encoded text:

Document doc = Document.newBuilder()
        .setContent(dreamText).setType(Type.PLAIN_TEXT).build();

AnalyzeEntitiesRequest request = AnalyzeEntitiesRequest.newBuilder()
        .setEncodingType(EncodingType.UTF8).setDocument(doc).build();
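The same setting is exposed at the REST level as an encodingType field in the request body. As a hedged sketch (the helper name is hypothetical, and the body is only built here, not sent), a Python 3 function that assembles an annotateText request body with an explicit encoding type might look like this:

```python
import json


def build_annotate_request(text: str, encoding_type: str = "UTF32") -> dict:
    """Build (but do not send) a documents:annotateText request body."""
    allowed = {"NONE", "UTF8", "UTF16", "UTF32"}
    if encoding_type not in allowed:
        raise ValueError(f"encoding_type must be one of {sorted(allowed)}")
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "features": {"extractEntities": True, "extractDocumentSentiment": True},
        # Controls the unit in which beginOffset is counted.
        "encodingType": encoding_type,
    }


if __name__ == "__main__":
    print(json.dumps(build_annotate_request("Good Job Baby!"), indent=2))
```

The default of UTF32 is chosen here because Python 3 strings are indexed by code point; for JavaScript or Java, whose strings are indexed by UTF-16 code units, UTF16 would be the matching choice.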
