A few of the text files that I'm importing have mojibake, so I'm trying to fix them using the ftfy library before feeding them to spaCy (NLP). The code snippet relating to this issue:
import spacy
import classy_classification
import pandas as pd
import ftfy
with open('SID - Unknown.txt', "r", encoding="utf8") as k:
    Unknown = k.read().splitlines()
data = {}
data["Unknown"] = Unknown
# NLP model
spacy.util.fix_random_seed(0)
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "cat_type": "multi-label",
        "device": "gpu"
    }
)
print(ftfy.fix_text(Unknown))
I get the error:
AttributeError: 'list' object has no attribute 'find'
When I search for this error, lots of threads suggest using index() instead of find() for lists. But in this case, find is called inside ftfy.fix_text. How can I get past this error? I want the data to stay a list, since that's how I feed it into the spaCy model.
As you noticed, your error happens within ftfy.fix_text. When something goes wrong inside a function that we haven't written ourselves, the next thing to look at is: what are we passing into that function?

In your case, you are giving Unknown as the input. Unknown is built from k.read().splitlines(), and this is where things go wrong: Unknown is a list of strings, but the ftfy.fix_text function expects a single string, as the examples in the ftfy documentation show. The function calls string methods such as find on its argument, and a list has no such method, hence the AttributeError.

So the solution to your problem can be either:
Calling ftfy.fix_text for each different line: