Change substrings with multiple values using offset and length

110 Views Asked by At

I am working on pulling data from OpenCalais API and here are the details:

Input: Some paragraph (a string e.g. "Barack Obama is the President of United States." Also, what gets returned is some instance variables with offsets and lengths but not necessarily in order of occurrence.

Output (I want): Same string but with the identified entity instances with hyperlinks (which is also a string) i.e.

output="<a href="https://en.wikipedia.org/Barack_Obama"> Barack Obama </a> is the President of ""<a href="https://en.wikipedia.org/United_States"> United States. </a>"

BUT IT IS A PYTHON QUESTION REALLY.

This is what I have

#API CALLS ABOVE WHICH IS NOT RELEVANT. 

output=input
for x in range(0,result.print_entities()):
    print len(result.entities[x]["instances"])
    previdx=0
    idx=0
    for y in range(0,len(result.entities[x]["instances"])):

        try: 
            url= "https://permid.org/1-" + result.entities[x]['resolutions'][0]['permid']

        except:
            url="https://en.wikipedia.org/wiki/"+result.entities[x]    ["name"].replace(" ", "_")

        print "Generating wiki page link"
        print url+"\n"

 #THE PROBLEM STARTS HERE

         offsetstr=result.entities[x]["instances"][y]["offset"]
         lenstr=result.entities[x]["instances"][y]["length"]

         output=output[:offsetstr]+"<a href=" + url + ">" +   output[offsetstr:offsetstr+lenstr] + "</a>" + output[offsetstr+lenstr:]

print output

Now the issue is, if you read the code properly you'll know that after the first iteration, the output string changes - therefore for subsequent iterations, the offset values no longer applies in the same manner. So, I cannot make the expected change.

Basically trying to get:

input = "Barack Obama is the President of United States"

output= "<a href="https://en.wikipedia.org/Barack_Obama"> Barack Obama </a> is the President of ""<a href="https://en.wikipedia.org/United_States"> United States. </a>." 

How can it be done, I wonder. Tried splicing n dicing but string just gets garbled.

2

There are 2 best solutions below

0
On

I finally solved it. Took some major math logic to do but as my last comment with the intuition that - "Maybe a solution can be storing the {offset, length} tuples in an array and then sort it on the offset values and THEN run the loop. Any help making that structure?" - THAT DID THE TRICK.

output=input
l=[]
for x in range(0,result.print_entities()):
    print len(result.entities[x]["instances"])

    for y in range(0,len(result.entities[x]["instances"])):

        try: 
            url=r'"'+ "https://permid.org/1-" + result.entities[x]['resolutions'][0]['permid'] + r'"'

        except:
            url=r'"'+"https://en.wikipedia.org/wiki/"+result.entities[x]["name"].replace(" ", "_") + r'"'

        print "Generating wiki page link"

 #THE PROBLEM WAS HERE 

        offsetstr=result.entities[x]["instances"][y]["offset"]
        lenstr=result.entities[x]["instances"][y]["length"]

#The KEY TO THE SOLUTION IS HERE
        l.append((offsetstr,lenstr,url))
       # res.append(output[preOffsetstr:offsetstr]+"<a href=" + url + ">" +      output[offsetstr:offsetstr+lenstr] + "</a>" + output[offsetstr+lenstr:])

print l

def getKey(item):
    return item[0]

l_sorted=sorted(l, key=getKey)


a=[]
o=[]
x=0
p=0
#And then simply run a for loop

for x in range(0,len(l_sorted)):
    p=x+1
    try:
        o=output[l_sorted[x][0]+l_sorted[x][1]:l_sorted[x][0]] + "<a href=" + str(l_sorted[x][2]) + ">" +  output[l_sorted[x][0]:(l_sorted[x][0]+l_sorted[x][1])] + "</a>" + output[l_sorted[x][0]+l_sorted[x][1]:(l_sorted[p][0]-1)]
        a.append(o)
    except:
        print ""

#+ output[l_sorted[x][0]+l_sorted[x][1]:]
#a.append(output[l_sorted[len(l_sorted)][0]] + l_sorted[len(l_sorted)][1]:l_sorted[len(l_sorted)][0]] + "<a href=" + str(l_sorted[len(l_sorted)][2]) + ">" + output[l_sorted[len(l_sorted)][0]:(l_sorted[len(l_sorted)][0]+l_sorted[len(l_sorted)][1])] + "</a>" + output[l_sorted[len(l_sorted)][0]+l_sorted[len(l_sorted)][1]:]
m=output[l_sorted[len(l_sorted)-1][0]+l_sorted[len(l_sorted)-1][1]:l_sorted[len(l_sorted)-1][0]] + "<a href=" + str(l_sorted[len(l_sorted)-1][2]) + ">" +  output[l_sorted[len(l_sorted)-1][0]:(l_sorted[len(l_sorted)-1][0]+l_sorted[len(l_sorted)-1][1])] + "</a>" + output[l_sorted[len(l_sorted)-1][0]+l_sorted[len(l_sorted)-1][1]:]
a.append(m)

print " ".join(a)

And WALLAH!:) - Thanks for the help folks. Hope it helps someone someday.

5
On

try to use another var to store result

output=input
res,preOffsetstr  = [],0
for x in range(0,result.print_entities()):
    print len(result.entities[x]["instances"])
    previdx=0
    idx=0
    for y in range(0,len(result.entities[x]["instances"])):

        try: 
            url= "https://permid.org/1-" + result.entities[x]['resolutions'][0]['permid']

        except:
            url="https://en.wikipedia.org/wiki/"+result.entities[x]    ["name"].replace(" ", "_")

        print "Generating wiki page link"
        print url+"\n"

 #THE PROBLEM STARTS HERE

         offsetstr=result.entities[x]["instances"][y]["offset"]
         lenstr=result.entities[x]["instances"][y]["length"]

         res.append(output[preOffsetstr :offsetstr]+"<a href=" + url + ">" +      output[offsetstr:offsetstr+lenstr] + "</a>" + output[offsetstr+lenstr:])


         preOffsetstr = offsetstr
print '\n'.join(res)