Semi-structure xml/json files are often used to serialize relational data from different levels of entities in a master-detail cascade fashion. As xml/json files are ordered, unique keys are not necessarily provided to preserve sequencing and hierarchy.
Most ETL's offer a way to de-serialize those entities instances by dispatching them back from an xml/json file into different master-detail cascade of tables, generating a integer on-the-fly surrogate sequence PER LEVEL that can be propagated to both master records (as a unique surrogate key) and detail records (as a referential foreign key to master).
I can't seem to find such a feature in ADF transformations (be it parse, flatten, surrogate, etc.)
Grouping at master level reaches the goal to uniquely number master records but the link is lost to details
Flattening the whole xml/json source can be extremely heavy, redundant and entails a lot of re-scan & lookup in order to achieve that objective.
QUESTION: any smart idea (within ADF) to perform that multiple surrogate sequence numbering per cascading master-detail level in one pass (one scan) over the whole source data set ?
Thanks
@Rakesh : 'should have done that from the very beginning | suppose the following structure (country name could also be nested as regular xml tags) :
<geo>
<country name="USA">
<city>Miami</city>
<city>Frisco</city>
</country>
<country name="France">
<city>Paris</city>
<city>Brest</city>
</country>
</geo>
I would ideally get the following two dataflows to be generated with each an independent sequence keeping track of order and structure :
| country_id | country_name |
|---|---|
| 1 | USA |
| 2 | France |
| country_id | city_id | city_name |
|---|---|---|
| 1 | 1 | Miami |
| 1 | 2 | Frisco |
| 2 | 3 | Paris |
| 2 | 4 | Brest |
The following denormalized flow would be OK provided I get both sequences (even within an xml-kept structure)
| country_id | country_name | city_id | city_name |
|---|---|---|---|
| 1 | USA | 1 | Miami |
| 1 | USA | 2 | Frisco |
| 2 | France | 3 | Paris |
| 2 | France | 4 | Brest |
That following one is not ok :
| city_id | country_name | city_name |
|---|---|---|
| 1 | USA | Miami |
| 2 | USA | Frisco |
| 3 | France | Paris |
| 4 | France | Brest |
...because if you think of a very large and deeply nested dataset, it makes a hellish amount of reprocessing and lookups downstream to reconnect master-detail surrogates whereas the previous example only need one pass parsing/pass.
Any idea on how to ?
Thanks