Filtering all documents that belong to consumer protection from the MultiEURLEX Hugging Face dataset

For a personal project, I want to extract all documents from the coastalcph/multi_eurlex dataset on Hugging Face that belong to the legal area of consumer protection. According to the classification by Chalkidis et al. (2020), there is even a specific EuroVoc label for consumer protection (2836).

However, it turns out that the dataset does not contain a single document categorized with this ID.

Did I miss something here? Or is there another way to filter this dataset efficiently?

This is the code I use to translate the integer labels into EuroVoc IDs and to filter the documents:

# NOTE: `classlabel`, `eurovoc_concepts`, and `dataset_train` are assumed to be
# defined earlier in the script (not shown here).

# Add the EuroVoc IDs and descriptions that correspond to each document's integer labels
def add_empty_list_feature(regulation_document):
    # Start from empty lists; they are filled with the translated labels below
    regulation_document['eurovoc_labels_id'] = []
    regulation_document['eurovoc_labels_description'] = []

    # Iterate over each integer label in 'labels' for this document
    for label_id in regulation_document['labels']:
        # Get the EuroVoc ID as a string
        eurovoc_id = classlabel.int2str(label_id)
        # Find the corresponding EuroVoc description
        eurovoc_desc = eurovoc_concepts[eurovoc_id]
        # Append the EuroVoc ID and description to their respective lists
        regulation_document['eurovoc_labels_id'].append(eurovoc_id)
        regulation_document['eurovoc_labels_description'].append(eurovoc_desc)

    return regulation_document

# Apply the function to each document in the dataset
dataset_with_eurovoc_labels = dataset_train.map(add_empty_list_feature)


def filter_by_eurovoc_label(dataset, eurovoc_labels=('2836',)):
    """Return all documents that carry at least one of the given EuroVoc labels.

    Parameters:
    dataset (iterable): Documents from the Hugging Face dataset, each with an 'eurovoc_labels_id' list
    eurovoc_labels (sequence of str): EuroVoc IDs to filter for, DEFAULT = ('2836',) (consumer protection)

    Returns:
    list: The documents that match at least one of the given labels
    """
    filtered_regulation = []
    for document in dataset:
        # Check if any of the EuroVoc IDs in the document match the specified labels
        if any(label in eurovoc_labels for label in document['eurovoc_labels_id']):
            filtered_regulation.append(document)
    return filtered_regulation
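
A quick way to tell whether the filter or the label set is the problem is to tally which EuroVoc IDs actually occur after the mapping step. The sketch below is a minimal, self-contained illustration that uses a few made-up documents in place of `dataset_with_eurovoc_labels`. The helper names `count_eurovoc_ids` and `has_any_label`, the toy documents, and every ID other than 2836 are hypothetical. If '2836' never shows up in the tally on the real data, the loaded label set may simply not include that concept, which is an assumption worth checking against the dataset card.

```python
from collections import Counter

# Toy stand-in for the mapped dataset (hypothetical documents and IDs).
sample_documents = [
    {"celex_id": "doc-1", "eurovoc_labels_id": ["100160", "100155"]},
    {"celex_id": "doc-2", "eurovoc_labels_id": ["2836", "100160"]},
    {"celex_id": "doc-3", "eurovoc_labels_id": []},
]


def count_eurovoc_ids(documents):
    """Count how many documents carry each EuroVoc ID."""
    counts = Counter()
    for document in documents:
        # Use a set so an ID repeated within one document is counted once.
        counts.update(set(document["eurovoc_labels_id"]))
    return counts


def has_any_label(document, wanted_labels):
    """Return True if the document carries at least one wanted EuroVoc ID."""
    return not set(wanted_labels).isdisjoint(document["eurovoc_labels_id"])


# Step 1: see whether the ID occurs at all before blaming the filter.
id_counts = count_eurovoc_ids(sample_documents)
print("2836 occurs in", id_counts["2836"], "document(s)")

# Step 2: keep only the documents that carry one of the wanted IDs.
matches = [doc for doc in sample_documents if has_any_label(doc, {"2836"})]
print([doc["celex_id"] for doc in matches])
```

Assuming the mapped data is a `datasets.Dataset`, a predicate such as `has_any_label` could also be passed to its `filter` method (for example wrapped in a lambda) instead of collecting matches in a manual loop.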