Replicating LinkedIn's Databus with SQL Server 2008


I am attempting to recreate LinkedIn's Databus architecture in my organization for a highly contentious database. My initial thought is to use SQL Server's Change Data Capture (CDC) functionality to capture all events and publish them to an external datastore, in the same way that LinkedIn does with their "bootstrap". If I write a service to constantly poll the CDC tables, I can write the changes to my bootstrap db and then publish the events with some pub/sub system.
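For reference, the CDC setup I have in mind looks roughly like this; the table name is hypothetical, but the stored procedures are the standard SQL Server 2008 ones:

```sql
-- Enable CDC at the database level (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on a table; this creates a capture instance
-- (dbo_Orders) and its change table (cdc.dbo_Orders_CT)
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',   -- hypothetical table
    @role_name     = NULL;        -- no gating role for this sketch
```

The polling service would then read from the generated `cdc.dbo_Orders_CT` change table (via the generated table-valued functions) rather than touching the base table.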

My question here is whether anybody has attempted something like this, whether my approach seems sound, and whether there is a better way to capture and publish these events.

Thanks.

Edit: [Adding details]

I have not completely decided on the secondary datastore. At least for the short term, it will likely be another SQL Server 2008 instance on another server. One of the primary purposes of this secondary store is to take load off of the main server. Our main database has grown quite large (>2.5 TB), and adding any additional load would be frowned upon. An additional gain of this architecture would be that it essentially manages replication, taking much of that responsibility off of the main server(s).

Ideally, I would like to avoid CDC altogether. I think it is a great technology, but it does store the changes locally and, from what I have read, causes the server to incur a performance hit. For the time being, however, it appears to be my best option.

Edit 2: [Further process details]

What I am attempting to achieve is indeed very similar to replication. The secondary server would start with a copy of the base database plus a set of tables for tracking changes. A service would then monitor the CDC tables, move those events to the new db, clear them from the source, and apply the changes to the secondary server's base copy.
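One polling cycle of that service could be sketched in T-SQL along these lines. The `ReplicationBookmark` table is my own hypothetical bookkeeping; the LSN functions and the generated `fn_cdc_get_all_changes_*` function are standard CDC:

```sql
-- Read all changes between the last LSN we processed and the
-- current maximum LSN in the capture instance.
DECLARE @from_lsn binary(10), @to_lsn binary(10);

SELECT @from_lsn = sys.fn_cdc_increment_lsn(LastProcessedLsn)
FROM   dbo.ReplicationBookmark;          -- hypothetical tracking table

SET @to_lsn = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');

-- After the events have been shipped to the secondary store and
-- applied, advance the bookmark; the source-side change tables can
-- then be trimmed (e.g. via sys.sp_cdc_cleanup_change_table).
UPDATE dbo.ReplicationBookmark SET LastProcessedLsn = @to_lsn;
```

Persisting the bookmark LSN is what makes the poller restartable without dropping or double-applying events.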

The next component will be a pub/sub service through which all of the change events are published; any consumer can subscribe to receive them. The full db copy will be used for "bootstrapping" new consumers as they come online, so they can get a full, up-to-date copy of the db before they begin receiving change events. The tracking table will also be used to obtain deltas in the event that a consumer goes offline and misses events.
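The delta lookup for a consumer that comes back online might look something like this, assuming a hypothetical `ChangeEvents` tracking table keyed by a monotonically increasing `EventId`:

```sql
-- A returning consumer asks for everything after the last event
-- it acknowledged (schema and column names are hypothetical).
DECLARE @LastAckedEventId bigint = 42;   -- supplied by the consumer

SELECT EventId, TableName, Operation, Payload
FROM   dbo.ChangeEvents
WHERE  EventId > @LastAckedEventId
ORDER  BY EventId;
```

If the requested events have already been aged out of the tracking table, the consumer would fall back to a full bootstrap from the db copy.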

Clients will be able to apply rules to translate the data as they see fit. The first application will be purely to create read-only copies of the original db. Future plans include de-normalizing the data and transferring it into other formats, such as MongoDB collections.

The first part of this could probably be achieved most simply if it were possible to maintain the CDC tables on a remote server; however, I do not see any way to do that.

I know this sounds convoluted, but it will solve a real problem that, as LinkedIn's solution makes apparent, is not terribly uncommon these days.

Hopefully this helps.
