Quickly read or update hundreds of millions of relations in Solr


1st post: Solr

My team and I have a problem with Solr. We have thought about it a lot but failed to find a solution, so we decided to ask here.

Problem: read and update, very quickly, the relations between an Element and its few million (possibly more) Sub-elements.

Currently we have a structure like this:

ELEMENT 
{ 
    id: 1, 
    name : element1 
}, 
{ 
    id: 2, 
    name : element2 
}, 
[...]
{ 
    id: 10000, 
    name : element10000 
}

SUBELEMENT
{ 
    id: 1, 
    ids_elements : 1117|165|27|32|4577,
    name : subelement1 
}, 
{ 
    id: 2, 
    ids_elements : 1117|416|278,
    name : subelement2 
}, 
[...]
{ 
    id: 15000000, 
    ids_elements : 1117|2428|3457|5475|32|958,
    name : subelement15000000 
}

(I use big IDs because that matches our needs.)

With this method, we can easily get every Sub-element belonging to an Element (very quickly: 10-20 milliseconds to fetch 100,000 Sub-elements). But, taking the example above, removing Element "1117" from every Sub-element takes far too long (around 10 seconds for 100,000 updates).
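For reference, the fast read is a single term query on the pipe-tokenized field; a sketch, where the host, port and core name ("subelement") are placeholders:

curl "http://localhost:8983/solr/subelement/select?q=ids_elements:1117&rows=100000&fl=id,name"

The slow path is the reverse: every Sub-element document containing "1117" has to be rewritten one by one.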

I would like a structure like this:

ELEMENT 
{ 
    id: 1, 
    name : element1,
    ids_subelements : 1|2|3|...12458765|12458766... (few millions)
}, 
{ 
    id: 2, 
    name : element2,
    ids_subelements : 1|2|3|...12458765|12458766... (few other millions)
}, 
[...]
{ 
    id: 10000, 
    name : element10000,
    ids_subelements : 1|2|3|...12458765|12458766... (few other millions)
}

But this exceeds Solr's limits.

Has anyone solved this problem before? Or does anyone have an idea for the right data structure in Solr?

Is Solr the best solution for this kind of interaction?

Thank you for your time!

Update 1: SQL

It's already denormalized in Solr. In the normalized SQL database, we have something like this (I explain it for a better understanding of the problem):

ELEMENT(id_element, name)
ELEMENT_OWN_SUBELEMENT(id_element, id_subelement)
SUBELEMENT(id_subelement, name_subelement)

So in SQL we need to update data this way:

DELETE FROM element_own_subelement 
    WHERE id_element = :id_element 
    AND id_subelement = :id_subelement; 
-- (one row concerned)

DELETE FROM element_own_subelement 
    WHERE id_element = :id_element; 
-- (potentially millions of rows concerned)

DELETE FROM element_own_subelement 
    WHERE id_subelement = :id_subelement; 
-- (potentially thousands of rows concerned)

And it's the same for the SQL insert requests: we need to insert millions of rows into element_own_subelement within a few seconds.
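Such inserts would typically be batched into multi-row statements; a generic SQL sketch (illustrative values, not our actual code):

INSERT INTO element_own_subelement (id_element, id_subelement)
VALUES (1117, 1), (1117, 2), (1117, 3); -- thousands of pairs per statement

Even batched this way, each operation still touches millions of rows.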

We turned to Solr to solve the read problem, and it did! But we haven't solved the insert/update performance problems.

Update 2: a concrete case

I'm trying to explain our problem with a concrete case:

SQL:

-- Main data
POINT_OF_INTEREST (id_poi, name, [...])

-- Datasets & filtering
DATASET(id_dataset, name, [...])
DATASET_OWN_POINT_OF_INTEREST(id_dataset, id_poi)
FILTERING(id_filtering, id_dataset)

-- Data displaying
MODULE (id_module, name, [...])
MODULE_OWN_POINT_OF_INTEREST(id_module, id_poi)

So:

  • We group our data into datasets
  • We associate datasets with a module
  • A dataset can be filtered for a module
  • We define the displayed data of a module based on the filtered datasets => only at this point do we insert the documents into the Solr core.

Example of a document from our Solr core:

Solr core "POINT_OF_INTEREST": 
{
    id_poi : 13,
    ids_modules : 1|5|8|7|24, /* Results of the filtering */
    name : "POI thirteen",
    ids_datasets : 25|5|7
}

To get the data for module "24", I query the "point_of_interest" core with ids_modules:24. It's really fast.

However, if I save a new filtering, I must perform these actions on my Solr core:

  • remove the number "24" from the "ids_modules" field of all documents that are not in the filtering results
  • add the number "24" to the "ids_modules" field of all documents that are in the filtering results. This is very slow, considering we potentially have 1-2 million documents to update many times a day (see the sketch after this list).
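For illustration: if "ids_modules" were remodeled as a stored multiValued field, this remove/add could be expressed with Solr atomic updates instead of resending whole documents. A minimal sketch under that assumption, with a placeholder host and core, and assuming "id_poi" is the uniqueKey:

curl -X POST "http://localhost:8983/solr/point_of_interest/update?commit=true" \
     -H "Content-Type: application/json" \
     -d '[
           { "id_poi": "13", "ids_modules": { "remove": "24" } },
           { "id_poi": "14", "ids_modules": { "add": "24" } }
         ]'

Note that Solr still reindexes each document internally on an atomic update, so this trims the payload rather than the per-document indexing cost.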

I want to solve these two problems:

  • Can I perform the complex filtering operation faster with Solr than with SQL?
  • Is there a way to make my data system work without systematically doing hundreds of thousands of updates on the Solr core?

Update 3: example from the Solr database

{
    "votes_moins": "",
    "id_module": "957654|1445047|1485752|1445843|1854865|1491371|1445993|1478994|1965107|1755131|1725201|1785227|1564235|1585245|1545261|1255272|1542273|1585349|1545434|1585444|1115583|1225671|1585672|1588712|1545730|1775826|1596666|1555988|1675344|1256417|16456683|1856983|1757004|12571566|1715593|1457200|1757218|1777428|172455|1845053|1058425|108594|1885677|1748751|14874647|184817|1955120|1569536|1945635|1259825|2120353|2075726|2850779|2221051|2121129|2421264|2331600|28561607|27771611|2562107|2782553|2225916|2663224|2653225|2235717|2442252|249491|2251440|265069|2365104|2687789|275048|4270620|275092|270278|65273947|257425|274451|7275509|2275811|272605|4527690|279721|2277630|2754808|278038|5280652|2080935|280599|2481710|8281161|328211145|2815958|285219|22823435|2686666|2885978|289807|294024|729044|2292156|2892216|2902128|1029256|2932089|2954401|290488|289934|306105|304509|307616|380725|3907598|3208855|3059794|3310714|311079|3151060|315536|351598",
    "adresse.altitude": 0,
    "id_objet": "1886416_0",
    "id_flux": "101|11158|32548|10365460|104686456|1024537|1024568|1054686|1844859|1986559",
    "adresse.ville": "Varangéville",
    "id_categories": "",
    "type": 5,
    "difficulte": "16|17",
    "id_circuit": "124785_0",
    "utilisateur": "u47852;PPDA Tourisme",
    "id_sous_categories": "",
    "publication_cirkwi": 1,
    "description_fr_FR": "Le Lorem Ipsum est simplement du faux texte employé dans la composition et la mise en page avant impression. Le Lorem Ipsum est le faux texte standard de l'imprimerie depuis les années 1500, quand un peintre anonyme assembla ensemble des morceaux de texte pour réaliser un livre spécimen de polices de texte. Il n'a pas fait que survivre cinq siècles, mais s'est aussi adapté à la bureautique informatique, sans que son contenu n'en soit modifié. Il a été popularisé dans les années 1960 grâce à la vente de feuilles Letraset contenant des passages du Lorem Ipsum, et, plus récemment, par son inclusion dans des applications de mise en page de texte, comme Aldus PageMaker.",
    "date_creation": 1478509817,
    "date_modification": 1473033625,
    "adresse.cp": "87456",
    "adresse.rue": "",
    "langue": "fr_FR",
    "id_locomotions": "13|48|74|18",
    "adresse.numero_rue": "",
    "votes_plus": "7852",
    "distance": 189654,
    "publication_reseau": 1,
    "pseudo_utilisateur": "PPDA1",
    "id_utilisateur_auteur": "u47852",
    "publication_prive": 0,
    "latlng": "49.1234,7.1234",
    "geo": [
      "49.1234,7.1234"
    ],
    "url_image": "[...]/mypicture.jpg",
    "stats_consultation": 20568,
    "titre_fr_FR": "Example",
    "titre_fr_FR_tri": "Example",
    "_version_": "155642045896312192"
  }

(values are fake)

The worst fields for performance are "id_module" and "id_flux". Field declarations in data-config.xml:

<field column="id_module" name="id_module" />
<field column="id_flux" name="id_flux" />

Field declarations in myschema.xml:

<field name="id_module" type="splitPipeTexte" indexed="true" />
<field name="id_flux" type="splitPipeTexte" indexed="true" />

Where "splitPipeTexte" is declared like this :

<fieldType name="splitPipeTexte" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\|"/>
    </analyzer>
</fieldType>
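As an aside, the atomic-update sketch from Update 2 would require these fields to be declared as stored multiValued fields (with values sent as arrays instead of pipe-joined strings) rather than tokenized text; a hypothetical alternative declaration, not our current schema:

<field name="id_module" type="string" indexed="true" stored="true" multiValued="true" />
<field name="id_flux" type="string" indexed="true" stored="true" multiValued="true" />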

I'm aware that this is a complex problem. With my post I'm looking for thoughts on this data system, or reactions to a possible design error, not necessarily a complete solution.

Thanks!


Best answer:

You are looking at this problem from the wrong end. You need to organize your data in Solr for search, not just map your existing structure onto it.

Most probably you will flatten (denormalize) your structures into flat records. And if you have very complex structures, you may end up storing the original information in a separate (graph?) database and returning just IDs from Solr. The more complex the original structure is, the more likely that becomes.
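One possible concrete reading of "flat records": index one document per module/POI relation in a separate core (here hypothetically named "relations"), so the mass update from Update 2 becomes a single delete-by-query plus a bulk add of the new pairs:

{ "id": "24_13", "id_module": 24, "id_poi": 13 }

curl -X POST "http://localhost:8983/solr/relations/update?commit=true" \
     -H "Content-Type: application/json" \
     -d '{ "delete": { "query": "id_module:24" } }'

The trade-off is that reads then need a second query (or a join) to fetch the POI details.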