Time series data representation in MarkLogic

330 Views Asked by At

Time-series data such as historical stock prices are usually stored in an RDBMS.

I am evaluating various options to use this data, possibly store it in doc store or triple store in MarkLogic, and build some use cases on this data and/or along with the other kind of data stored in the doc/triple store.

Essentially, I am looking for ways to

  1. Store time series data such as historical stock prices in a MarkLogic database.
  2. Ways to query this data (stored in ML or queried across the RDBMS), through XQuery for example.
  3. Ways to query this data, along with the other data stored in the doc/triple store.

I would appreciate any recommendations in this regard.


Added some more info...

I am trying to figure a neat way of capturing this data as triples. The idea being that it would be nice to link this data with other related data. For example, if the historical stock price we are trying to store is for HSBC listed on NYSE, then we can in some way define resources for HSBC and NYSE and also capture the stock price as literals (perhaps) and then link the resource HSBC with for example, the company information stored in dbpedia.

Essentially, I am talking about creating linked data, such that it is easy to query across data fetched from different sources and also if possible, try to use inferencing. For example, if I use this approach, it would be possible for me to run a query such as 'Get me the stock price of the companies headquartered in London, whose turnover is greater than $1billion'.

2

There are 2 best solutions below

0
On BEST ANSWER

I would recommend storing time-series data in a time-series database: https://en.wikipedia.org/wiki/Time_series_database

Update 1:

You can define HSBC as an entity, specify meta-data for the entity such as location or headcount, and then store quarterly revenue and traded tick prices as separate time-series. Then you can run queries that a) filter by meta-data tag such as Location and filter by aggregation, e.g. MAX(price). I would store headcount as series as well actually. This way I can investigate correlations between different series for research and analytics.

0
On

You have 2 alternatives. Either you have 1 big document for each series, or you have 1 document per price. The former is not recommended, as the latter let you better use the index system, especially a range index on the timestamp.

I worked on a system using MarkLogic, which was essentially a system to store time series. We used 1 document per point in the series (as well as 1 document for the series itself, for its "metadata", all information common across all the points in the series). We also put all documents relative to 1 series in 1 collection. We used a naming scheme for the document URIs based on the timestamp and a unique ID per series, so we can easily guarantee the uniqueness of the document URIs.

An important point is to have the series point documents to reference their series document (either explicitly or just by being in the same collection), instead of the other way around.

As per querying, it depends on your specific use cases, but typically you will use a search constraint on the collection to identify one (or several) series, and a range index on the timestamp to select a "slice" of points in the series. If you have use cases like selecting points based on their value (instead of their time) you can do it as efficiently as you do it based on the timestamp, by using a range index on the values themselves.