Cassandra Data modeling IoT best practices

I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.

The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)

I'd like to be able to query for the latest value of a sensor for a specific list of sensors and devices in more or less real-time. Also devices do not always send data and may be down for long periods of time.

After a lot of reading I came up with something like this

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

The idea behind this would be to perform one query per device and sensor such as

Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1

And run this for each device and sensor. It's not one query to return it all (Which would be ideal) but seems to be fast enough to run for potentially up to 100 sensors or so (With possibilities for parallelizing the queries) for a few devices.

However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.

I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    date TEXT

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid, date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

And then query like

Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1

Does that even make sense? It feels like it might mitigate storage issues and allow for easier archiving of old data in the future however how do I go about querying for the latest value of a specific sensor and device if that device was down for a day or more? Do I really have to query for 1 day, if nothing is found, query the previous day and so forth (Maybe limit it to only the last few days or so)?

Are there better ways to handle this in Cassandra or am I in the right direction?

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

add a comment |

I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.

The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)

After a lot of reading I came up with something like this

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

The idea behind this would be to perform one query per device and sensor such as

Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1

However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.

I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    date TEXT

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid, date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

And then query like

Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1

Are there better ways to handle this in Cassandra or am I in the right direction?

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

add a comment |

I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.

The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)

After a lot of reading I came up with something like this

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

The idea behind this would be to perform one query per device and sensor such as

Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1

However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.

I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    date TEXT

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid, date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

And then query like

Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1

Are there better ways to handle this in Cassandra or am I in the right direction?

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.

The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)

After a lot of reading I came up with something like this

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

The idea behind this would be to perform one query per device and sensor such as

Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1

However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.

I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea

CREATE TABLE "sensor_data" (

    deviceid TEXT,

    sensorid TEXT,

    date TEXT

    ts timestamp,

    value TEXT,

    PRIMARY KEY ((deviceid, sensorid, date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

And then query like

Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1

Are there better ways to handle this in Cassandra or am I in the right direction?

cassandra data-modeling iot cassandra-3.0

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

asked Nov 14 '18 at 21:54

ByteFlinger

7661729

add a comment |

1 Answer
1

active

oldest

votes

Part of the problem that you'll run into is that each sensor will be having 200k readings per day. In general, you want to keep each partition under 100k rows. So, your second idea (having date as part of the PK) may have perf issues.

Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.

To really help with this, we will need to understand a little more information:

How many devices do you have? Will that number grow or is it finite?

In plain English, what is an example of queries that you are trying to answer?

Incorporating this into the answer based on your answers (below):

Alright, here is a potential idea...

We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.

You're going to want two tables:

Lookup table

Sensor table

Lookup table will look something like:

CREATE TABLE lookup-table (

deviceid TEXT,

sensor-map MAP,

PRIMARY KEY (deviceid)

);

deviceid is the unique ID for each device

sensor-map is a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})

That way each device has a mapping of sensors that is available to it

Example query would be: SELECT * FROM lookup-table WHERE deviceid = 1234;

Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value

Sensor table will look like:

CREATE TABLE sensor_data (

sensorid TEXT,

sensor_value (whatever data type fits what you need),

ts TIMESTAMP,

reading_date date,

time_bucket int,

PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day

How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2

Within your application provide the specific sensorid and
time_bucket will come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, then time_bucket = 1) and reading_date will come from the actual day that you are querying

Since you are clustering with ts DESC then it will retrieve the
latest reading for that given sensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;

By maintaining ts as a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten

Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, then it isn't an even and we'll have to figure out another way to bucket. But, I think that you get what is going on.

To query:

There will be one query to retrieve all of the sensorid that a device has; once you have those sensorid, you can then use it for the next step

There will be n queries for each sensor_value for each sensorid

Since we are bucketing (via time_bucket), you should have an even distribution throughout all of the partitions

Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...

Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query

Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw

Create an additional Cassandra (C*) table and do some parallel writes

For the additional C* table:

CREATE TABLE sensor_by_value (

sensor-value INT,

ts TIMESTAMP,

sensorid TEXT,

reading_date DATE,

time_bucket INT,

PRIMARY KEY ((sensor-value, reading_date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

You will definitely have to do some time bucketing here:

Remember, we don't want any more than 100k rows per partition

You'll have to understand the possible values (range)

The frequency of each reading

If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...

Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it

How much you have to bucket will depend on the frequency

Good luck! :-)

Final tip

Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live

Your data seems immutable after the initial insert

TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need

A default_ttl will also help with the operational overhead of deleting data after you don't need it anymore.

Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.

To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!

edited Nov 20 '18 at 17:33

answered Nov 16 '18 at 17:25

Justin Breese

463

Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?

– ByteFlinger
Nov 17 '18 at 9:46

Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.

– ByteFlinger
Nov 17 '18 at 10:23

Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...

– Justin Breese
Nov 18 '18 at 16:15

One more question: for your second question Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)

– Justin Breese
Nov 18 '18 at 16:30

Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.

– Justin Breese
Nov 19 '18 at 3:24

|
show 9 more comments

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53309291%2fcassandra-data-modeling-iot-best-practices%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.

To really help with this, we will need to understand a little more information:

How many devices do you have? Will that number grow or is it finite?

In plain English, what is an example of queries that you are trying to answer?

Incorporating this into the answer based on your answers (below):

Alright, here is a potential idea...

We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.

You're going to want two tables:

Lookup table

Sensor table

Lookup table will look something like:

CREATE TABLE lookup-table (

deviceid TEXT,

sensor-map MAP,

PRIMARY KEY (deviceid)

);

deviceid is the unique ID for each device

sensor-map is a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})

That way each device has a mapping of sensors that is available to it

Example query would be: SELECT * FROM lookup-table WHERE deviceid = 1234;

Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value

Sensor table will look like:

CREATE TABLE sensor_data (

sensorid TEXT,

sensor_value (whatever data type fits what you need),

ts TIMESTAMP,

reading_date date,

time_bucket int,

PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day

How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2

Within your application provide the specific sensorid and
time_bucket will come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, then time_bucket = 1) and reading_date will come from the actual day that you are querying

Since you are clustering with ts DESC then it will retrieve the
latest reading for that given sensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;

By maintaining ts as a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten

To query:

There will be one query to retrieve all of the sensorid that a device has; once you have those sensorid, you can then use it for the next step

There will be n queries for each sensor_value for each sensorid

Since we are bucketing (via time_bucket), you should have an even distribution throughout all of the partitions

Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...

Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query

Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw

Create an additional Cassandra (C*) table and do some parallel writes

For the additional C* table:

CREATE TABLE sensor_by_value (

sensor-value INT,

ts TIMESTAMP,

sensorid TEXT,

reading_date DATE,

time_bucket INT,

PRIMARY KEY ((sensor-value, reading_date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

You will definitely have to do some time bucketing here:

Remember, we don't want any more than 100k rows per partition

You'll have to understand the possible values (range)

The frequency of each reading

If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...

Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it

How much you have to bucket will depend on the frequency

Good luck! :-)

Final tip

Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live

Your data seems immutable after the initial insert

TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need

A default_ttl will also help with the operational overhead of deleting data after you don't need it anymore.

Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.

To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!

edited Nov 20 '18 at 17:33

answered Nov 16 '18 at 17:25

Justin Breese

463

Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?

– ByteFlinger
Nov 17 '18 at 9:46

Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.

– ByteFlinger
Nov 17 '18 at 10:23

Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...

– Justin Breese
Nov 18 '18 at 16:15

One more question: for your second question Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)

– Justin Breese
Nov 18 '18 at 16:30

Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.

– Justin Breese
Nov 19 '18 at 3:24

|
show 9 more comments

Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.

To really help with this, we will need to understand a little more information:

How many devices do you have? Will that number grow or is it finite?

In plain English, what is an example of queries that you are trying to answer?

Incorporating this into the answer based on your answers (below):

Alright, here is a potential idea...

We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.

You're going to want two tables:

Lookup table

Sensor table

Lookup table will look something like:

CREATE TABLE lookup-table (

deviceid TEXT,

sensor-map MAP,

PRIMARY KEY (deviceid)

);

deviceid is the unique ID for each device

sensor-map is a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})

That way each device has a mapping of sensors that is available to it

Example query would be: SELECT * FROM lookup-table WHERE deviceid = 1234;

Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value

Sensor table will look like:

CREATE TABLE sensor_data (

sensorid TEXT,

sensor_value (whatever data type fits what you need),

ts TIMESTAMP,

reading_date date,

time_bucket int,

PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day

How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2

Within your application provide the specific sensorid and
time_bucket will come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, then time_bucket = 1) and reading_date will come from the actual day that you are querying

Since you are clustering with ts DESC then it will retrieve the
latest reading for that given sensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;

By maintaining ts as a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten

To query:

There will be one query to retrieve all of the sensorid that a device has; once you have those sensorid, you can then use it for the next step

There will be n queries for each sensor_value for each sensorid

Since we are bucketing (via time_bucket), you should have an even distribution throughout all of the partitions

Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...

Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query

Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw

Create an additional Cassandra (C*) table and do some parallel writes

For the additional C* table:

CREATE TABLE sensor_by_value (

sensor-value INT,

ts TIMESTAMP,

sensorid TEXT,

reading_date DATE,

time_bucket INT,

PRIMARY KEY ((sensor-value, reading_date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

You will definitely have to do some time bucketing here:

Remember, we don't want any more than 100k rows per partition

You'll have to understand the possible values (range)

The frequency of each reading

If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...

Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it

How much you have to bucket will depend on the frequency

Good luck! :-)

Final tip

Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live

Your data seems immutable after the initial insert

TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need

A default_ttl will also help with the operational overhead of deleting data after you don't need it anymore.

Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.

To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!

edited Nov 20 '18 at 17:33

answered Nov 16 '18 at 17:25

Justin Breese

463

Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?

– ByteFlinger
Nov 17 '18 at 9:46

Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.

– ByteFlinger
Nov 17 '18 at 10:23

Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...

– Justin Breese
Nov 18 '18 at 16:15

One more question: for your second question Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)

– Justin Breese
Nov 18 '18 at 16:30

Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.

– Justin Breese
Nov 19 '18 at 3:24

|
show 9 more comments

Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.

To really help with this, we will need to understand a little more information:

How many devices do you have? Will that number grow or is it finite?

In plain English, what is an example of queries that you are trying to answer?

Incorporating this into the answer based on your answers (below):

Alright, here is a potential idea...

We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.

You're going to want two tables:

Lookup table

Sensor table

Lookup table will look something like:

CREATE TABLE lookup-table (

deviceid TEXT,

sensor-map MAP,

PRIMARY KEY (deviceid)

);

deviceid is the unique ID for each device

sensor-map is a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})

That way each device has a mapping of sensors that is available to it

Example query would be: SELECT * FROM lookup-table WHERE deviceid = 1234;

Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value

Sensor table will look like:

CREATE TABLE sensor_data (

sensorid TEXT,

sensor_value (whatever data type fits what you need),

ts TIMESTAMP,

reading_date date,

time_bucket int,

PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day

How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2

Within your application provide the specific sensorid and
time_bucket will come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, then time_bucket = 1) and reading_date will come from the actual day that you are querying

Since you are clustering with ts DESC then it will retrieve the
latest reading for that given sensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;

By maintaining ts as a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten

To query:

There will be one query to retrieve all of the sensorid that a device has; once you have those sensorid, you can then use it for the next step

There will be n queries for each sensor_value for each sensorid

Since we are bucketing (via time_bucket), you should have an even distribution throughout all of the partitions

Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...

Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query

Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw

Create an additional Cassandra (C*) table and do some parallel writes

For the additional C* table:

CREATE TABLE sensor_by_value (

sensor-value INT,

ts TIMESTAMP,

sensorid TEXT,

reading_date DATE,

time_bucket INT,

PRIMARY KEY ((sensor-value, reading_date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

You will definitely have to do some time bucketing here:

Remember, we don't want any more than 100k rows per partition

You'll have to understand the possible values (range)

The frequency of each reading

If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...

Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it

How much you have to bucket will depend on the frequency

Good luck! :-)

Final tip

Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live

Your data seems immutable after the initial insert

TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need

A default_ttl will also help with the operational overhead of deleting data after you don't need it anymore.

Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.

To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!

edited Nov 20 '18 at 17:33

answered Nov 16 '18 at 17:25

Justin Breese

463

Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.

To really help with this, we will need to understand a little more information:

How many devices do you have? Will that number grow or is it finite?

In plain English, what is an example of queries that you are trying to answer?

Incorporating this into the answer based on your answers (below):

Alright, here is a potential idea...

We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.

You're going to want two tables:

Lookup table

Sensor table

Lookup table will look something like:

CREATE TABLE lookup-table (

deviceid TEXT,

sensor-map MAP,

PRIMARY KEY (deviceid)

);

deviceid is the unique ID for each device

sensor-map is a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})

That way each device has a mapping of sensors that is available to it

Example query would be: SELECT * FROM lookup-table WHERE deviceid = 1234;

Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value

Sensor table will look like:

CREATE TABLE sensor_data (

sensorid TEXT,

sensor_value (whatever data type fits what you need),

ts TIMESTAMP,

reading_date date,

time_bucket int,

PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day

How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2

Within your application provide the specific sensorid and
time_bucket will come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, then time_bucket = 1) and reading_date will come from the actual day that you are querying

Since you are clustering with ts DESC then it will retrieve the
latest reading for that given sensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;

By maintaining ts as a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten

To query:

There will be one query to retrieve all of the sensorid that a device has; once you have those sensorid, you can then use it for the next step

There will be n queries for each sensor_value for each sensorid

Since we are bucketing (via time_bucket), you should have an even distribution throughout all of the partitions

Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...

Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query

Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw

Create an additional Cassandra (C*) table and do some parallel writes

For the additional C* table:

CREATE TABLE sensor_by_value (

sensor-value INT,

ts TIMESTAMP,

sensorid TEXT,

reading_date DATE,

time_bucket INT,

PRIMARY KEY ((sensor-value, reading_date), ts)

) WITH CLUSTERING ORDER BY (ts DESC);

You will definitely have to do some time bucketing here:

Remember, we don't want any more than 100k rows per partition

You'll have to understand the possible values (range)

The frequency of each reading

If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...

Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it

How much you have to bucket will depend on the frequency

Good luck! :-)

Final tip

Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live

Your data seems immutable after the initial insert

TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need

A default_ttl will also help with the operational overhead of deleting data after you don't need it anymore.

Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.

To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!

edited Nov 20 '18 at 17:33

answered Nov 16 '18 at 17:25

Justin Breese

463

edited Nov 20 '18 at 17:33

answered Nov 16 '18 at 17:25

Justin Breese

463

answered Nov 16 '18 at 17:25

Justin Breese

463

answered Nov 16 '18 at 17:25

Justin Breese

463

Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?

– ByteFlinger
Nov 17 '18 at 9:46

Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.

– ByteFlinger
Nov 17 '18 at 10:23

Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...

– Justin Breese
Nov 18 '18 at 16:15

One more question: for your second question Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)

– Justin Breese
Nov 18 '18 at 16:30

Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.

– Justin Breese
Nov 19 '18 at 3:24

|
show 9 more comments

Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?

– ByteFlinger
Nov 17 '18 at 9:46

Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.

– ByteFlinger
Nov 17 '18 at 10:23

Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...

– Justin Breese
Nov 18 '18 at 16:15

One more question: for your second question Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)

– Justin Breese
Nov 18 '18 at 16:30

Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.

– Justin Breese
Nov 19 '18 at 3:24

Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?

– ByteFlinger
Nov 17 '18 at 9:46

Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.

– ByteFlinger
Nov 17 '18 at 10:23

Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...

– Justin Breese
Nov 18 '18 at 16:15

One more question: for your second question Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)

– Justin Breese
Nov 18 '18 at 16:30

Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.

– Justin Breese
Nov 19 '18 at 3:24

|
show 9 more comments

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vfrdtyky