Generate data flow graph for ETL process

44 Views Asked by At

I would like to generate something like a data flow diagram for an ETL process.

What I want to achieve is to have (the minimal example) two tables - source and destination - both tables with their attributes and between them a process, that transforms data from the first to the second table. Within this process I want to display the relationship to input fields, output fields and description of the formula/expression.

Of course, then for real world scenario I need multiple source tables for one deatination and the whole chain, where the data flows from input table to the final destination one through multiple tables.

I tried to generate something in graphviz, but if it is only a bit more complicated than completely simple, I can't force it to keep the expected layout and it's just a mess.

Here is some mspaint mockup for what layout I am after. But there is more tables (~10), around the same amount of processes, but hundreds of attributes and derivation functions. Mockup

1

There are 1 best solutions below

1
On

Very fiddly, but also very scriptable. See the Graphviz documentation as needed, esp. https://graphviz.org/doc/info/shapes.html

digraph ETL {
  rankdir=LR
    
  subgraph clusterA {
    graph [label="Table A"]
    // unfortunately, if we let each attribute have its own node, they end up ordered last-to-first
    // not interested in fighting this, so put them all in a table that looks like individual nodes
    //
    //  see below for fix to the above mentioned last-to-first bug
    //  I am to lazy to apply it to the three Tables
    //
    node [width=2.5]
    TA [shape=none label=<
        <table border="0" cellborder="1" cellspacing="4">
        <tr><td width="130" port="A1">Attribute 1</td></tr>
        <tr><td style="invis"></td></tr>
        <tr><td port="A2">Attribute 2</td></tr>
        <tr><td style="invis"></td></tr>
        <tr><td port="A3">Attribute 3</td></tr>
        </table>>]
  }

  subgraph clusterP {
    graph [label="Process P"]
    node [shape=none]

    subgraph clusterP1 {
      graph [label="Attribute 1 Derivation"]
      {rank=same
      F1 [label=<<table border="0" cellborder="1" cellspacing="0">
          <tr>
      <td rowspan="3" port="f1" width="160" align="text" balign="left">Function 1<BR align="right"/>IF<BR align="left"/>blah blah<BR align="left"/>more<BR/>blah blah<BR/>and more<BR/>blah blah<BR/>no more</td>
          <td port="sa1">Source 1</td>
      </tr>
          <tr>  <td port="sa2">Source 2</td> </tr>
          <tr>  <td port="sa3">Source 3</td> </tr>
          </table>>]
      F2 [label=<<table border="0" cellborder="1" cellspacing="0">
          <tr>
      <td rowspan="3" port="f1" width="160" align="text" balign="left">Function 2<BR align="right"/>IF<BR align="left"/>blah blah<BR align="left"/>more<BR/>blah blah<BR/>and more<BR/>blah blah<BR/>no more</td>
          <td port="sa1">Source 1</td>
      </tr>
          <tr>  <td port="sa2">Source 2</td> </tr>
          <tr>  <td port="sa3">Source 3</td> </tr>
          </table>>]
      }
    F2->F1  [style=invis] // yes, this is backwards & stupid, but it gets the nodes inthe correct order
    }

  }

 subgraph clusterB {
    graph [label="Table B"]
    TB [shape=none label=<
        <table border="0" cellborder="1" cellspacing="4">
        <tr><td port="A1">Attribute 1</td></tr>
        <tr><td style="invis"></td></tr>
        <tr><td port="A2">Attribute 2</td></tr>
        <tr><td style="invis"></td></tr>
        <tr><td port="A3">Attribute 3</td></tr>
        </table>>]
  }
 subgraph clusterC {
    graph [label="Table C"]
    TC [shape=none label=<
        <table border="0" cellborder="1" cellspacing="4">
        <tr><td port="A1">Attribute 1</td></tr>
        <tr><td style="invis"></td></tr>
        <tr><td port="A2">Attribute 2</td></tr>
        <tr><td style="invis"></td></tr>
        <tr><td port="A3">Attribute 3</td></tr>
        </table>>]
  }

  TA:A1 ->  F1:f1
  F1:sa1 -> TB:A1
  F1:sa2 -> TB:A2
  F1:sa3 -> TC:A1

}

Giving:
enter image description here