xml/json source surrogate sequence per nesting level

80 Views Asked by XKB At 08 November 2023 at 11:07

Semi-structure xml/json files are often used to serialize relational data from different levels of entities in a master-detail cascade fashion. As xml/json files are ordered, unique keys are not necessarily provided to preserve sequencing and hierarchy.

Most ETL's offer a way to de-serialize those entities instances by dispatching them back from an xml/json file into different master-detail cascade of tables, generating a integer on-the-fly surrogate sequence PER LEVEL that can be propagated to both master records (as a unique surrogate key) and detail records (as a referential foreign key to master).

I can't seem to find such a feature in ADF transformations (be it parse, flatten, surrogate, etc.)

Grouping at master level reaches the goal to uniquely number master records but the link is lost to details
Flattening the whole xml/json source can be extremely heavy, redundant and entails a lot of re-scan & lookup in order to achieve that objective.

QUESTION: any smart idea (within ADF) to perform that multiple surrogate sequence numbering per cascading master-detail level in one pass (one scan) over the whole source data set ?

Thanks

@Rakesh : 'should have done that from the very beginning | suppose the following structure (country name could also be nested as regular xml tags) :

<geo>
  <country name="USA">
    <city>Miami</city>
    <city>Frisco</city>
  </country>
  <country name="France">
    <city>Paris</city>
    <city>Brest</city>
  </country>
</geo>

I would ideally get the following two dataflows to be generated with each an independent sequence keeping track of order and structure :

country_id	country_name
1	USA
2	France

country_id	city_id	city_name
1	1	Miami
1	2	Frisco
2	3	Paris
2	4	Brest

The following denormalized flow would be OK provided I get both sequences (even within an xml-kept structure)

country_id	country_name	city_id	city_name
1	USA	1	Miami
1	USA	2	Frisco
2	France	3	Paris
2	France	4	Brest

That following one is not ok :

city_id	country_name	city_name
1	USA	Miami
2	USA	Frisco
3	France	Paris
4	France	Brest

...because if you think of a very large and deeply nested dataset, it makes a hellish amount of reprocessing and lookups downstream to reconnect master-detail surrogates whereas the previous example only need one pass parsing/pass.

Any idea on how to ?

Thanks

Original Q&A

xml/json source surrogate sequence per nesting level

There are 0 best solutions below

Related Questions in AZURE-DATA-FACTORY

Related Questions in ETL

Related Questions in JSON-DESERIALIZATION

Related Questions in XML-DESERIALIZATION

Trending Questions

Popular # Hahtags

Popular Questions