Essentially, I'm looking to build a web app where the user can input n labels for a dataset and put them into a dictionary with keywords for each label. I'd like the same function to be created for each of the n labels, something like:
# labeling function for label 1
@labeling_function()
def lf_label_1(x):
    if x.label_1 in ["bag", "surfboard", "skis"]:
        return CARRY
    return ABSTAIN
So, I'd get a new function for each new label added by the user. Each function then goes into a list, which is passed to an applier. For example:
# list of labeling functions
lfs = [
    lf_ride_object,
    lf_carry_object,
    lf_carry_subject,
    lf_not_person,
    lf_ydist,
    lf_dist,
    lf_area,
]

# applying labeling functions to create the dataset
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_valid = applier.apply(df_valid)
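A minimal sketch of what I mean by generating the functions dynamically: a factory builds one closure per user-defined label from a dictionary. The dictionary contents, field names, and label ints here are made up for illustration; ABSTAIN = -1 follows snorkel's convention.

```python
from collections import namedtuple

# ABSTAIN mirrors snorkel's convention of -1 meaning "no vote"
ABSTAIN = -1

# stand-in for a dataframe row accessed by attribute (x.label_1, etc.)
Row = namedtuple("Row", ["label_1", "label_2"])

def make_keyword_lf(label_int, field, keywords):
    """Return a closure that checks a record's field against a keyword list."""
    def lf(x):
        return label_int if getattr(x, field) in keywords else ABSTAIN
    return lf

# hypothetical user input: label name -> (field to check, keywords)
user_labels = {
    "CARRY": ("label_1", ["bag", "surfboard", "skis"]),
    "RIDE": ("label_2", ["bike", "horse"]),
}

# assign each label an int and build the list of functions
lfs = [
    make_keyword_lf(i, field, kws)
    for i, (name, (field, kws)) in enumerate(user_labels.items())
]
```

If I understand snorkel's API correctly, each closure could then be wrapped with `LabelingFunction(name=..., f=...)` (instead of the `@labeling_function()` decorator, which needs a named def) before being handed to `PandasLFApplier`.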
For more detail (including a second approach):
I'm looking for advice on how to use snorkel in a particular way. I'd like to create a labeling function with spaCy's PhraseMatcher. Basically, I want a user to input all of the words (with the corresponding label) in a web app and send it to the PhraseMatcher. Then, match a paragraph of text inside the labeling function.
How would I go about creating labeling function(s) for n labels on the backend? Typically, you would write n labeling functions, one per label, but I'm trying to use snorkel in a use case where we don't know how many labels there are until the user creates them.
Is there a way around this? Basically, the Matcher would go over the input text, check which labels occur in the text, and return all the labels found. It's kind of like I'm trying to get users to use snorkel without writing the functions themselves, only inputting the (label, word(s)) combinations.
Is there a way to use the Matcher in such a way that there is only one labeling function and it uses the Matcher for all labels?
For example, there could be a single labeling function that looks like this (pseudo-code):
# labeling function for all labels
@nlp_labeling_function()  # labeling function for using spaCy
def lf_labeler(x, label_keyword_dict):
    labels = []
    doc = nlp(x)  # tokenizing the input text
    matches = matcher(doc)  # the PhraseMatcher finds all tokens that match a label's keywords
    for match_id, start, end in matches:
        labels.append(nlp.vocab.strings[match_id])  # adds every label where there was a match
    if labels:  # runs if labels is not empty
        return labels  # this would return all the labels that were matched in the text document; not sure if it's possible to return a list of labels like this
    return ABSTAIN  # abstains from using this text example in dataset creation because there was no match
So, it would basically take in a piece of text (from a dataframe), check to see if any of the users' keywords are in the text and add all of the appropriate labels as a result. Return all the matched labels or simply return "ABSTAIN" to say that there were no matches.
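The matching logic I have in mind, sketched in plain Python so the idea is concrete (the dictionary contents are invented, and a naive token check stands in for spaCy's PhraseMatcher):

```python
# string sentinel for "no match", standing in for snorkel's ABSTAIN
ABSTAIN = "ABSTAIN"

# hypothetical user input: label -> keywords
label_keyword_dict = {
    "CARRY": ["bag", "surfboard", "skis"],
    "RIDE": ["bike", "horse"],
}

def match_labels(text, keyword_dict):
    """Return every label whose keywords appear in the text, or ABSTAIN."""
    tokens = set(text.lower().split())  # PhraseMatcher would replace this tokenizing + lookup
    labels = [
        label
        for label, keywords in keyword_dict.items()
        if tokens & {kw.lower() for kw in keywords}
    ]
    return labels if labels else ABSTAIN
```

One caveat I'm aware of: snorkel's LabelModel expects each labeling function to emit a single int (or -1 to abstain), so returning a list from one function may not fit directly; splitting the same dictionary into one function per label, as above, might be the workaround.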
While typing up this question I came up with some of the ideas I wrote out here, so I'll be testing them in the meantime and having a look at the snorkel source code to see if I can come up with anything else.
Thanks!