Defining a schema that is compatible across multiple databases and different conventions


I want to define one schema that is valid across teams and platforms. This is fairly simple and can be thought of as a kind of ontology. What I need is the ability to define what a field represents and, under it, the name of that field on each platform. I'd also like the schema to be able to generate data objects for each of the languages in use, so I'd like to know whether my need can be met with Protobuf or GraphQL. Note that my naming conventions may differ from the idiomatic ones of the generated target language, since they need to be compatible with the databases. A simple example of my need:

{
  "lastName": {
    "mssqlName": "LastName",
    "oracleName": "FamilyName",
    "elasticName": "lastName",
    "cassandraName": "last_name",
    "rocksDbName": "surname"
  },
  "age": {
    ...
  }
}

As you can see, on some platforms I have totally different names than on others. I'd like to know the usual ways / technologies to solve this problem, and whether it is possible with codegen-able technologies like Protobuf and GraphQL.

A single schema as the single point of truth for all object / message definitions across databases, comms links, multiple languages and platforms? It would be nice, wouldn't it?

The closest I can think of is XSD (XML Schema), but I don't think the tooling covers everything. For example, I know of tools that will take an XSD schema and generate code that serialises / deserialises objects to / from XML (e.g. Microsoft's xsd.exe). There are even some good ones.

And then there are tools that will create SQL tables from that XSD schema. But a code generator that builds classes to access those tables isn't also building them to serialise / deserialise objects to and from an XML wire format.

Basically, I've not come across a schema language that has tooling that does everything. The ASN.1 tools are very good at creating serialisation classes, but I've never found one that also targets SQL interactions. Same with XSD.

My knowledge is of course not exhaustive, and there might be something in JSON-land that works.

Minimum Pain Compromise Approach

What I have settled on in the past is to accept that I'm having to do some manual coding around changes in schema, but probably not too much. I'd define messages fully in, say, Google Protocol Buffers, and use that for object exchange between applications / languages. Where I wanted to stash objects in a database, I'd accept a parallel definition of the object in the table columns, but only for the critical fields that I'd want to search on. The last column would be an arbitrary container, able to store the serialised object whole.

For example, suppose a GPB message had an integer ID field and a string Name field, plus a bunch of other fields. My database table would then have an ID column, a Name column, and a Bytes column.

That way I could serialise an object and push it into a row's Bytes column whilst also filling in the ID and Name columns. I could quickly search for objects because of the ID / Name columns. If I then wanted access to the other fields of an object stored in the database, I'd have to retrieve the record and deserialise the Bytes column.
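
For illustration, a minimal sketch of this in Python, with SQLite standing in for whichever database you use (person_pb2 is a hypothetical protoc-generated module whose Person message has id and name fields):

import sqlite3
from person_pb2 import Person  # hypothetical protoc-generated module

conn = sqlite3.connect("objects.db")
conn.execute("CREATE TABLE IF NOT EXISTS people "
             "(id INTEGER PRIMARY KEY, name TEXT, bytes BLOB)")

# Store: fill in the searchable key columns, stash the whole
# serialised object in the Bytes column.
p = Person(id=42, name="Smith")
conn.execute("INSERT INTO people VALUES (?, ?, ?)",
             (p.id, p.name, p.SerializeToString()))

# Retrieve: search on a key column, then deserialise the blob to
# get at all the other fields.
blob, = conn.execute("SELECT bytes FROM people WHERE name = ?",
                     ("Smith",)).fetchone()
restored = Person.FromString(blob)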

This way one is essentially taking a bet that those key columns / field names (ID, Name) won't ever be changed in the schema during development. But it's quite likely a safe bet. Generally, one can settle things like that quite easily, early on in a project; it's the rest of the schema that tends to change during development.

One small payoff is that if the reason to hunt out an object in the database is to be able to send it through a communications channel, it is already serialised in the database. No need to serialise it again before dispatch down the comms link.

So this approach can leave one with some duplication of code / points of truth, but it can be quite performant, since it avoids a serialisation step in parts of the runtime.

You can also cheat a little. If the serialisation wire format is text based (JSON, XML, some ASN.1 formats, etc.), then there's a good chance that string searches on the Bytes column will yield good results anyway. For instance, suppose a message field was MiddleName, but I'd not created a distinct table column for it. I could still find likely records for any given MiddleName by searching for the value in the Bytes column, as it's stored as text somewhere in there.
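
In SQLite, reusing the table from the sketch above, that crude search might look like this (it assumes the objects were serialised with a text wire format such as JSON; the middleName field and the value "Alice" are made up for the example):

# Only worthwhile with a text wire format; binary protobuf would not
# be reliably searchable this way.
rows = conn.execute(
    "SELECT id, name FROM people WHERE CAST(bytes AS TEXT) LIKE ?",
    ('%"middleName": "Alice"%',)).fetchall()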

Reflection Based Approach?

A potential other approach is to accept that the tooling does not exist to satisfy all needs, and adapt using language features (reflection) to exploit a common feature of code generators.

For example, consider GPB's protoc compiler. In the generated code you end up with classes whose members are named after the fields in the messages. And it'll be much the same with any code generated to access a database table whose columns have the same names.

So it is possible to use reflection to build an auto-transcriber between generated classes: iterate over the members of one class, and match each one up to the same-named member in the other generated class.

This avoids the need for code like:

Protobuf::MyClass myObj_g; // An object built using GPB
JSON::MyClass myObj_j;     // Equivalent object to be copied from myObj_g

myObj_j.Field1 = myObj_g.Field1;
myObj_j.Field2 = myObj_g.Field2;
// ...and so on, one line per field, to be updated on every schema change

Instead:

Protobuf::MyClass myObj_g; // An object built using GPB
JSON::MyClass myObj_j;     // equivalent object to be copied from myObj_g;

foreach (Protobuf::MyClass::Reflection::Field field in Protobuf::MyClass.Fields)
{
    myObj_j.Reflection.FindByName(field.Name) = myObj_g.Reflection.FindByName(field.Name);
}

There'd be a fair bit of fiddling around to get this to work between each database and serialisation technology, per language, but the point is you'd only ever have to write it once. Subsequent schema changes then require no code changes, at least not as far as exchanging objects between a serialisation technology and a database access technology is concerned.
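
As a concrete illustration, here's roughly what such a transcriber could look like in Python, using the reflection that protobuf's generated classes already carry (ListFields is the real protobuf API; the destination can be any object with matching attribute names, such as an ORM row object):

from google.protobuf.descriptor import FieldDescriptor
from google.protobuf.message import Message

def transcribe(src: Message, dst: object) -> None:
    """Copy same-named fields from a protobuf message onto any object
    with matching attribute names."""
    for field, value in src.ListFields():
        # This sketch copies only scalar fields; nested messages and
        # repeated fields would need recursion / special handling.
        if field.type != FieldDescriptor.TYPE_MESSAGE and hasattr(dst, field.name):
            setattr(dst, field.name, value)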

Obviously, reflection is easier / possible in some languages and not others.

The Fix It At Runtime Approach?

Apache Avro has the characteristic that serialised data describes its own shape. Basically, the wire-format data comes with its own schema, so a consumer can build a representation of the data automatically. In some languages that's horrid (C, C++), but libraries exist.

Basically, it forces you to write applications so that they work out what to do with the data for themselves.
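
For example, with the third-party fastavro library in Python, a consumer can read records and discover their shape at runtime, with no generated code at all (people.avro is a hypothetical Avro container file):

from fastavro import reader

with open("people.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)  # the schema that travelled with the data
    for record in avro_reader:
        print(record)                 # each record is a plain dict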