Cassandra: saving time series data from industrial sensors


I am currently developing a project and researching the best way to retrieve data from industrial factory sensors connected to PLCs (the controllers of the machinery in a factory, which for example control motors, speeds, and switches).

I will explain the objective, and I think my case could be extrapolated to many different types of industries:

  1. I have several PLCs that give me a lot of different data values. (Many of these values are just booleans, and others are analog values, of real type for example.)

  2. I will have more than 10,000 sensors in the whole factory.

  3. I want to retrieve the data at least every second for the analog values (for example motor rpm, temperature, humidity...).

  4. For the digital values, the data will be saved with a timestamp whenever an event occurs.

I want to use Cassandra with time series because it looks like the most promising and fastest technology for this.

My question is about storing analog values every second. Is it better to have a schema like:

timestamp, sensor1, sensor2, sensor3, sensor4

as one row per timestamp, grouped by parts of the factory, or is it better for every sensor to have its own table?
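To make the two options concrete, here is a rough CQL sketch of what I have in mind (all table and column names are just placeholders):

-- Option 1: one row per timestamp per part of the factory,
-- with a column per sensor (placeholder names)
CREATE TABLE area_readings (
  area text,
  ts timestamp,
  sensor1 float,
  sensor2 float,
  sensor3 float,
  sensor4 float,
  PRIMARY KEY ((area), ts)
);

-- Option 2: a separate table per sensor (this would be one of ~10,000)
CREATE TABLE sensor_0001_readings (
  ts timestamp,
  value float,
  PRIMARY KEY (ts)
);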

The whole system will be developed in Java and it will provide the data to an external company in order to analyse it.

There are 2 answers below

BEST ANSWER

It's not quite clear what your query is. You mention "I want to retrieve the data at least every second for the analog values (for example motor rpm, temperature, humidity...)".

Does that mean you're querying every second for all 10K sensors? Or for a specific sensor, or for a group of sensors? In Cassandra, it's vital to know what your queries are before looking at data models. If you're looking for 1-second granularity, one option may be to feed the incoming data streams into Spark Streaming, and have the Spark Streaming code save to a Cassandra table that suits what you want to query.

As for the options you mention, it's hard to say without knowing the exact nature of your queries. Having one key rounded to the second may be an option - that would mean 10K or so entries per partition, assuming a data rate of 1/s per sensor. Having a table per sensor would be unusual, but you could have a partition per sensor with a timestamp for each entry. It really depends on your queries.
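For example, the "key rounded to the second" option could look roughly like this (purely illustrative names, not a recommendation):

CREATE TABLE readings_by_second (
  second timestamp,   -- reading time rounded down to the whole second
  sensor_id uuid,     -- one entry per sensor within that second
  value float,
  PRIMARY KEY ((second), sensor_id)
);

The partition-per-sensor variant would instead put sensor_id in the partition key and cluster on the timestamp.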

Perhaps if you gave us an example of how you intend to retrieve the data, we could help better?

ANSWER

I suspect that in the end you'll want to query the data both by sensor and by time. There is no reason you can't have two tables and write each data point into both; a sketch of such a dual write follows the second table below. (Twitter writes each tweet into a separate timeline for every person who follows it!)

Some likely tables you would write are:

CREATE TABLE factory_status (
  date timestamp,
  hour int,
  minute int,
  second int,
  sensor_status_map map<uuid, float>,  -- sensor id -> reading taken in this second
  PRIMARY KEY ((date, hour, minute, second))
);

This table would essentially log the status of all sensors in the factory each second. Each partition would contain a snapshot of the factory. This wouldn't be very useful for queries as you couldn't effectively get a range of time (each second would be its own query), but it could be very powerful in doing analytics on the factory and developing a model for failures.
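For example, fetching the snapshot for one second means naming the entire composite partition key (the date and time values here are made up):

SELECT sensor_status_map
FROM factory_status
WHERE date = '2016-01-01' AND hour = 14 AND minute = 30 AND second = 5;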

CREATE TABLE sensor_status (
  sensor_id uuid,
  date timestamp,   -- the reading time truncated to the day
  time timestamp,   -- the full reading time
  sensor_val float,
  PRIMARY KEY ((sensor_id, date), time)
);

This table would essentially log each sensor's output. Each date would be the truncated version of the time; otherwise a sensor's once-per-second entries would quickly overwhelm Cassandra's per-partition column limit. This would make it easy to query the status of a sensor at a particular time or over a period of time.
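For example (with a made-up sensor id and day), pulling an hour of readings for one sensor hits a single partition and uses a range on the clustering column:

SELECT time, sensor_val
FROM sensor_status
WHERE sensor_id = 123e4567-e89b-12d3-a456-426655440000
  AND date = '2016-01-01'
  AND time >= '2016-01-01 08:00:00'
  AND time < '2016-01-01 09:00:00';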

If you're having trouble designing "from the roof down", feel free to take an iterative approach and add new tables as you find a new query you need to run that doesn't fit the mold of the old queries.
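To tie the two tables together, here is a rough sketch of the "write each data point into both" idea from above, with made-up values; the map append adds one sensor's reading to that second's factory snapshot:

BEGIN BATCH
  UPDATE factory_status
  SET sensor_status_map = sensor_status_map + {123e4567-e89b-12d3-a456-426655440000: 72.4}
  WHERE date = '2016-01-01' AND hour = 14 AND minute = 30 AND second = 5;
  INSERT INTO sensor_status (sensor_id, date, time, sensor_val)
  VALUES (123e4567-e89b-12d3-a456-426655440000, '2016-01-01', '2016-01-01 14:30:05', 72.4);
APPLY BATCH;

Note that a logged batch spanning two partitions costs extra coordination, so many people simply issue the two writes separately from the application.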