xml/json source surrogate sequence per nesting level

80 Views Asked by At

Semi-structure xml/json files are often used to serialize relational data from different levels of entities in a master-detail cascade fashion. As xml/json files are ordered, unique keys are not necessarily provided to preserve sequencing and hierarchy.

Most ETL's offer a way to de-serialize those entities instances by dispatching them back from an xml/json file into different master-detail cascade of tables, generating a integer on-the-fly surrogate sequence PER LEVEL that can be propagated to both master records (as a unique surrogate key) and detail records (as a referential foreign key to master).

I can't seem to find such a feature in ADF transformations (be it parse, flatten, surrogate, etc.)

  • Grouping at master level reaches the goal to uniquely number master records but the link is lost to details

  • Flattening the whole xml/json source can be extremely heavy, redundant and entails a lot of re-scan & lookup in order to achieve that objective.

QUESTION: any smart idea (within ADF) to perform that multiple surrogate sequence numbering per cascading master-detail level in one pass (one scan) over the whole source data set ?

Thanks


@Rakesh : 'should have done that from the very beginning | suppose the following structure (country name could also be nested as regular xml tags) :

<geo>
  <country name="USA">
    <city>Miami</city>
    <city>Frisco</city>
  </country>
  <country name="France">
    <city>Paris</city>
    <city>Brest</city>
  </country>
</geo>

I would ideally get the following two dataflows to be generated with each an independent sequence keeping track of order and structure :

country_id country_name
1 USA
2 France
country_id city_id city_name
1 1 Miami
1 2 Frisco
2 3 Paris
2 4 Brest

The following denormalized flow would be OK provided I get both sequences (even within an xml-kept structure)

country_id country_name city_id city_name
1 USA 1 Miami
1 USA 2 Frisco
2 France 3 Paris
2 France 4 Brest

That following one is not ok :

city_id country_name city_name
1 USA Miami
2 USA Frisco
3 France Paris
4 France Brest

...because if you think of a very large and deeply nested dataset, it makes a hellish amount of reprocessing and lookups downstream to reconnect master-detail surrogates whereas the previous example only need one pass parsing/pass.

Any idea on how to ?

Thanks

0

There are 0 best solutions below