My end goal here is to create a primitive plagiarism checker given a text file. I plan to do this by first splitting the data by sentence, searching each sentence on Google, and finally searching each of the first few URLs returned by Google for occurrences of the sentence/substrings. This last step is the one I'm having trouble with.
When running through each URL in a for-loop, I first read the contents of the URL using urlopen(), but I'm not sure what to do after. Code is attached below, with some solutions I've tried commented out. I've imported the googlesearch, urllib.request, and re libraries.
from googlesearch import search
from urllib.request import urlopen
from re import findall

def plagCheck():
    global inpFile
    with open(inpFile) as data:
        sentences = data.read().split(".")

    for sentence in sentences:
        for url in search(sentence, tld='com', lang='en', num=5, start=0, stop=5, pause=2.0):
            # read() returns bytes; decode to str so it can be
            # compared against the sentence string below
            content = urlopen(url).read().decode("utf-8", errors="ignore")
            # if sentence in content:
            #     print("yes")
            # else:
            #     print("no")
            # matches = findall(sentence, content)
            # if len(matches) == 0:
            #     print("no")
            # else:
            #     print("yes")
If I understand your code correctly, you now have a Python list of sentences, split on periods. Splitting only on "." will produce fairly large run-on "sentences" wherever the text uses other terminal punctuation (? or !).
I would consider using a similarity checker. The built-in difflib module has a SequenceMatcher class whose ratio() method scores how alike two strings are. Then decide on some percentage to flag, e.g. anything 40% the same or more. This reduces the amount of content you have to check manually.
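A minimal sketch of that idea, assuming sentence and content are the decoded strings from your loop (the 0.40 threshold and the sentence-by-sentence comparison are my own choices, not something difflib prescribes):

from difflib import SequenceMatcher

def best_match(sentence, page_text):
    # Compare the sentence against every rough sentence on the page
    # and return the highest similarity ratio (a float in [0, 1]).
    candidates = page_text.split(".")
    return max(
        (SequenceMatcher(None, sentence, c).ratio() for c in candidates),
        default=0.0,
    )

if best_match(sentence, content) >= 0.40:  # flag 40%+ matches
    print("yes")
else:
    print("no")

Comparing sentence-to-sentence matters here: ratio() scores whole strings, so matching a short sentence against an entire page would always come out near zero.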
Expanding the set of punctuation you split on might look something like this:
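This sketch uses re.split, assuming inpFile is the same filename used in your plagCheck():

from re import split

with open(inpFile) as data:
    # '.', '?', and '!' all end a sentence; drop empty/whitespace-only pieces
    sentences = [s.strip() for s in split(r"[.?!]", data.read()) if s.strip()]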
Then I would write your results for this file back to a new output file, something like this:
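A sketch that writes an HTML table, assuming you have collected your checks as (sentence, url, percent) tuples in a list called results, with percent a float in [0, 1] (the name report.html is just an example):

with open("report.html", "w") as out:
    out.write("<table>\n<tr><th>Sentence</th><th>URL</th><th>Match</th></tr>\n")
    for sentence, url, percent in results:
        out.write(f"<tr><td>{sentence}</td><td>{url}</td>"
                  f"<td>{percent:.0%}</td></tr>\n")
    out.write("</table>\n")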
You could even go as far as to highlight table rows based on the percentage, i.e.:

80% and above is red
61-79% is orange
40-60% is yellow
39% and below is green
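One way to pick the colour, assuming the same 0-to-1 percent values as above:

def row_color(percent):
    # Map a similarity fraction to the highlight colours listed above.
    if percent >= 0.80:
        return "red"
    if percent > 0.60:
        return "orange"
    if percent >= 0.40:
        return "yellow"
    return "green"

# e.g. in the report loop:
# out.write(f'<tr style="background:{row_color(percent)}">'
#           f"<td>{sentence}</td><td>{url}</td><td>{percent:.0%}</td></tr>\n")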