How to gather several datasets (dataset configurations) in a list with hydra?

I am using Hydra to configure deep learning projects. I want to combine several datasets for training. Since the number of datasets is not known in advance, I want to get them as a list, and I want to select them through the defaults list of the parent config.yaml file.

I have found a working solution. I am posting it here because it was hard to find and could be useful to others. I am also wondering whether there is a better solution to the problem.

After some searching, I arrived at the interpolation oc.dict.values (see this resolver, and particularly this solution).

My project structure is:

├── configs
│   ├── config.yaml
│   └── data_repository
│       ├── data1.yaml
│       ├── data2.yaml
│       └── data3.yaml
├── test_hydra.py

All my dataset configurations are in the data_repository subfolder.

I want, as an example, to use only data1 and data2, as shown in the config.yaml file:

# config.yaml
defaults:
  - _self_
  - data_repository/data1
  - data_repository/data2

hydra:
  job:
    chdir: True

data_used: ${oc.dict.values:data_repository}

data1.yaml:

# @package data_repository.data1
dataset_name: data1
number_layers: 1

data2.yaml:

# @package data_repository.data2
dataset_name: data2
number_layers: 1

data3.yaml:

# @package data_repository.data3
dataset_name: data3
number_layers: 1

test_hydra.py:

# test_hydra.py

import hydra
from omegaconf import OmegaConf

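@hydra.main(config_name="config", version_base="1.1", config_path="configs")
def train(config):
    # Hydra already passes a DictConfig, so this call is probably redundant.
    config = OmegaConf.structured(config)

    print("\n")
    print(config)

    # Dump the composed config as YAML (interpolations are left unresolved).
    print("\n" + OmegaConf.to_yaml(config) + "\n")
    # The list built by oc.dict.values; its entries are still interpolations.
    print("config.data_used = ", config["data_used"])
    # Iterating resolves each entry to the corresponding dataset config.
    for i, data in enumerate(config.data_used):
        print(f"config.data_used[{i}] = {data}")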
    

if __name__ == "__main__":
    train()

Running test_hydra.py (with hydra-core 1.3.2) gives the following output:


{'data_used': '${oc.dict.values:data_repository}', 'data_repository': {'data1': {'dataset_name': 'data1', 'number_layers': 1}, 'data2': {'dataset_name': 'data2', 'number_layers': 1}}}

data_used: ${oc.dict.values:data_repository}
data_repository:
  data1:
    dataset_name: data1
    number_layers: 1
  data2:
    dataset_name: data2
    number_layers: 1


config.data_used =  ['${data_repository.data1}', '${data_repository.data2}']
config.data_used[0] = {'dataset_name': 'data1', 'number_layers': 1}
config.data_used[1] = {'dataset_name': 'data2', 'number_layers': 1}

This is the desired output. The config contains the dictionary data_repository, which I will not use directly, and the list data_used, which holds the selected datasets.

It works, but it could be more elegant: the data appears twice (in data_repository and in data_used), and the line data_used: ${oc.dict.values:data_repository} in config.yaml is a little cryptic. Do you have any suggestions for improvement?
