For a personal project, I want to extract all documents from the coastalcph/multi_eurlex dataset on Hugging Face that belong to the area of consumer protection law. According to the classification used by Chalkidis et al. (2020), there is even a specific EuroVoc label for consumer protection (2836).
It turns out, however, that the dataset does not contain a single document labelled with this ID.
Did I miss something here? Or is there another way to filter this dataset efficiently?
This is the code I use to translate the labels into EuroVoc IDs:
# Assumes dataset_train (the train split), classlabel (dataset_train.features["labels"].feature)
# and eurovoc_concepts (dict mapping EuroVoc ID -> description) are defined earlier.

# Define a function that adds the EuroVoc IDs and descriptions of a document's labels as new features
def add_empty_list_feature(regulation_document):
    # Start with empty lists that are filled as the labels are translated
    regulation_document['eurovoc_labels_id'] = []
    regulation_document['eurovoc_labels_description'] = []
    # Iterate over each label_id in 'labels' for the document
    for label_id in regulation_document['labels']:
        # Get the EuroVoc ID as a string
        eurovoc_id = classlabel.int2str(label_id)
        # Find the corresponding EuroVoc description
        eurovoc_desc = eurovoc_concepts[eurovoc_id]
        # Append the EuroVoc ID and description to the new features
        regulation_document['eurovoc_labels_id'].append(eurovoc_id)
        regulation_document['eurovoc_labels_description'].append(eurovoc_desc)
    return regulation_document

# Apply the function to each document in the dataset
dataset_with_eurovoc_labels = dataset_train.map(add_empty_list_feature)

def filter_by_eurovoc_label(dataset, eurovoc_labels=('2836',)):
    """Return all documents from the dataset that have at least one matching EuroVoc label
    (2836 in the case of consumer protection).

    Parameters:
        dataset (iterable of dict): Documents from the Hugging Face dataset, each with an
            'eurovoc_labels_id' list
        eurovoc_labels (sequence of str): EuroVoc IDs to filter for, DEFAULT = 2836 (consumer protection)

    Returns:
        list: All documents that carry at least one of the given labels
    """
    filtered_regulation = []
    for document in dataset:
        # Check if any of the EuroVoc IDs in the document match the specified labels
        if any(label in eurovoc_labels for label in document['eurovoc_labels_id']):
            filtered_regulation.append(document)
    return filtered_regulation
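
To double-check a result like this, one option is to count which EuroVoc IDs actually occur across the documents and see whether 2836 is among them. The following is only a minimal sketch, assuming each document already has the eurovoc_labels_id list from the mapping step above. The names toy_documents and count_eurovoc_ids are hypothetical, and the IDs are placeholder values rather than real dataset content:

```python
from collections import Counter

# Toy stand-in for dataset_with_eurovoc_labels (placeholder IDs, not real data)
toy_documents = [
    {"celex_id": "doc-1", "eurovoc_labels_id": ["100149", "100160"]},
    {"celex_id": "doc-2", "eurovoc_labels_id": ["100160"]},
    {"celex_id": "doc-3", "eurovoc_labels_id": ["100147", "100149"]},
]


def count_eurovoc_ids(documents):
    """Count how many documents carry each EuroVoc ID."""
    counts = Counter()
    for document in documents:
        # Use a set so an ID repeated within one document is counted only once
        counts.update(set(document["eurovoc_labels_id"]))
    return counts


id_counts = count_eurovoc_ids(toy_documents)

# Which IDs occur at all, and in how many documents?
print(id_counts.most_common())

# Is the consumer protection ID present in any document?
print("2836" in id_counts)
```

On the real data, the same function could be run over dataset_with_eurovoc_labels. Comparing the target ID against classlabel.names should additionally show whether it is part of the label vocabulary of the loaded configuration at all, which would separate "no document has this label" from "this label is not in the label set".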