I am using the google-cloud/language API to make an #annotate call that analyzes entities and sentiment in a CSV of comments I have collected from various online sources.
To begin with, the string I am trying to analyze includes comment IDs, so I reformat this:
youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."
So that it doesn't include any comment IDs:
i just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!
"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's. Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."
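For illustration, the reformatting step could look roughly like this in Python. This is only a sketch: it assumes a two-column CSV (ID, comment), and `strip_comment_ids` is a made-up helper name. Note that `csv.reader` also removes the surrounding quotes from quoted comments, unlike the reformatted text shown above.

```python
import csv
import io


def strip_comment_ids(csv_text):
    """Return only the comment column of an (id, comment) CSV, one comment per line."""
    reader = csv.reader(io.StringIO(csv_text))
    # Keep the second field of every row that actually has two fields.
    return "\n".join(row[1] for row in reader if len(row) >= 2)


sample = (
    "id1,i just bot a Nostromo... ( ._.)\n"
    'id2,"You know, If you actually made this. People would actually buy it."\n'
)
print(strip_comment_ids(sample))
```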
After sending a request for google-cloud/language to #annotate the text, I receive a response that includes various substrings with their sentiments and magnitudes. Each substring is also given a beginOffset value, which should be that substring's index in the original string (the string in the request).
{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.',
beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
beginOffset: 462 }
My aim is then to locate each comment in the original string, which should be simple enough: something like originalString[beginOffset]...
But that index does not point at the start of the substring. The value is incorrect!
I am assuming that the offsets do not count certain characters, but I have tried a multitude of regexes and nothing seems to work perfectly. Does anyone have any idea what might be causing the issue?
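This has to do with encoding. The API calculates offsets in the encoding type sent with the request, so they only line up with string indices when that encoding matches how your language indexes strings. Play around with the encoding types, or simply use one of the example approaches provided in their GitHub repo: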
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py
Key code block:
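Roughly, the sample picks the encoding type that matches how Python indexes its own strings and sends it along with the request. A minimal sketch from memory, not a verbatim copy of the linked file; the request body uses the REST field names as I understand them, so adapt it to whatever client library you use:

```python
import sys


def get_native_encoding_type():
    """Return the encoding type whose offsets match Python's native string indexing."""
    if sys.maxunicode == 65535:
        # Narrow builds (some Python 2 installs) index strings by UTF-16 code unit.
        return "UTF16"
    # Python 3.3+ and wide builds index strings by code point.
    return "UTF32"


# Illustrative request body only; the content string is a placeholder.
request_body = {
    "document": {"type": "PLAIN_TEXT", "content": "some text"},
    "features": {"extractEntities": True, "extractDocumentSentiment": True},
    "encodingType": get_native_encoding_type(),
}
```

JavaScript strings are indexed by UTF-16 code unit, so from Node the matching choice would be UTF16.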
This worked for me. It was messing up characters like ’ (that is, \u2019 in Unicode).
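To see why a character like \u2019 shifts things: it is a single character in a Python string but three bytes in UTF-8, so an offset counted in UTF-8 bytes runs ahead of the string index after every such character. A small sketch, assuming the offsets you got back were counted in UTF-8 bytes (`utf8_offset_to_index` is a hypothetical helper, not part of any library):

```python
def utf8_offset_to_index(text, byte_offset):
    """Convert an offset counted in UTF-8 bytes into a Python string index."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset lands mid-character.
    return len(prefix.decode("utf-8", errors="ignore"))


text = "can\u2019t stop. Next sentence."
target = "Next sentence."

char_index = text.index(target)                        # position as a string index
byte_offset = len(text[:char_index].encode("utf-8"))   # same position in UTF-8 bytes

print(char_index, byte_offset)  # 12 14
print(text[utf8_offset_to_index(text, byte_offset):])  # Next sentence.
```

Converting offsets after the fact like this is a fallback; requesting an encoding type that matches your language's strings avoids the mismatch in the first place.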