Cassandra Data modeling IoT best practices
I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.
The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)
I'd like to be able to query for the latest value of a sensor for a specific list of sensors and devices in more or less real-time. Also devices do not always send data and may be down for long periods of time.
After a lot of reading I came up with something like this
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
The idea behind this would be to perform one query per device and sensor such as
Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1
And run this for each device and sensor. It's not one query to return it all (Which would be ideal) but seems to be fast enough to run for potentially up to 100 sensors or so (With possibilities for parallelizing the queries) for a few devices.
However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.
I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
date TEXT
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
And then query like
Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1
Does that even make sense? It feels like it might mitigate storage issues and allow for easier archiving of old data in the future however how do I go about querying for the latest value of a specific sensor and device if that device was down for a day or more? Do I really have to query for 1 day, if nothing is found, query the previous day and so forth (Maybe limit it to only the last few days or so)?
Are there better ways to handle this in Cassandra or am I in the right direction?
cassandra data-modeling iot cassandra-3.0
add a comment |
I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.
The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)
I'd like to be able to query for the latest value of a sensor for a specific list of sensors and devices in more or less real-time. Also devices do not always send data and may be down for long periods of time.
After a lot of reading I came up with something like this
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
The idea behind this would be to perform one query per device and sensor such as
Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1
And run this for each device and sensor. It's not one query to return it all (Which would be ideal) but seems to be fast enough to run for potentially up to 100 sensors or so (With possibilities for parallelizing the queries) for a few devices.
However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.
I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
date TEXT
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
And then query like
Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1
Does that even make sense? It feels like it might mitigate storage issues and allow for easier archiving of old data in the future however how do I go about querying for the latest value of a specific sensor and device if that device was down for a day or more? Do I really have to query for 1 day, if nothing is found, query the previous day and so forth (Maybe limit it to only the last few days or so)?
Are there better ways to handle this in Cassandra or am I in the right direction?
cassandra data-modeling iot cassandra-3.0
add a comment |
I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.
The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)
I'd like to be able to query for the latest value of a sensor for a specific list of sensors and devices in more or less real-time. Also devices do not always send data and may be down for long periods of time.
After a lot of reading I came up with something like this
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
The idea behind this would be to perform one query per device and sensor such as
Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1
And run this for each device and sensor. It's not one query to return it all (Which would be ideal) but seems to be fast enough to run for potentially up to 100 sensors or so (With possibilities for parallelizing the queries) for a few devices.
However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.
I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
date TEXT
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
And then query like
Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1
Does that even make sense? It feels like it might mitigate storage issues and allow for easier archiving of old data in the future however how do I go about querying for the latest value of a specific sensor and device if that device was down for a day or more? Do I really have to query for 1 day, if nothing is found, query the previous day and so forth (Maybe limit it to only the last few days or so)?
Are there better ways to handle this in Cassandra or am I in the right direction?
cassandra data-modeling iot cassandra-3.0
I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.
The idea is to have several devices, each with several sensors attached to it sending data periodically (up to around 200000 values per device per day per sensor)
I'd like to be able to query for the latest value of a sensor for a specific list of sensors and devices in more or less real-time. Also devices do not always send data and may be down for long periods of time.
After a lot of reading I came up with something like this
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
The idea behind this would be to perform one query per device and sensor such as
Select deviceid, sensorid, ts, value where deviceid = "device1" and sensorid = "temperature" limit 1
And run this for each device and sensor. It's not one query to return it all (Which would be ideal) but seems to be fast enough to run for potentially up to 100 sensors or so (With possibilities for parallelizing the queries) for a few devices.
However from what I have read so far, I understand this would give me a lot of columns for my row and it might be complicated in terms of long term storage and Cassandra limitations.
I am thinking that maybe adding something like the date to the table like so (as seen on some blogs and guides) might be a good idea
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
date TEXT
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
And then query like
Select deviceid, sensorid, date, ts, value where deviceid = "device1" and sensorid = "temperature" and date = "2018-11-14" limit 1
Does that even make sense? It feels like it might mitigate storage issues and allow for easier archiving of old data in the future however how do I go about querying for the latest value of a specific sensor and device if that device was down for a day or more? Do I really have to query for 1 day, if nothing is found, query the previous day and so forth (Maybe limit it to only the last few days or so)?
Are there better ways to handle this in Cassandra or am I in the right direction?
cassandra data-modeling iot cassandra-3.0
cassandra data-modeling iot cassandra-3.0
asked Nov 14 '18 at 21:54
ByteFlingerByteFlinger
7661729
7661729
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Part of the problem that you'll run into is that each sensor will be having 200k readings per day. In general, you want to keep each partition under 100k rows. So, your second idea (having date as part of the PK) may have perf issues.
Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.
To really help with this, we will need to understand a little more information:
- How many devices do you have? Will that number grow or is it finite?
- In plain English, what is an example of queries that you are trying to answer?
Incorporating this into the answer based on your answers (below):
Alright, here is a potential idea...
We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.
You're going to want two tables:
- Lookup table
- Sensor table
Lookup table will look something like:
CREATE TABLE lookup-table (
deviceid TEXT,
sensor-map MAP,
PRIMARY KEY (deviceid)
);
deviceidis the unique ID for each device
sensor-mapis a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})- That way each device has a mapping of sensors that is available to it
- Example query would be:
SELECT * FROM lookup-table WHERE deviceid = 1234;
- Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value
Sensor table will look like:
CREATE TABLE sensor_data (
sensorid TEXT,
sensor_value (whatever data type fits what you need),
ts TIMESTAMP,
reading_date date,
time_bucket int,
PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
- As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day
- How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (
reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2 - Within your application provide the specific
sensoridand
time_bucketwill come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, thentime_bucket = 1) andreading_datewill come from the actual day that you are querying - Since you are clustering with
ts DESCthen it will retrieve the
latest reading for that givensensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;
- By maintaining
tsas a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten
Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, then it isn't an even and we'll have to figure out another way to bucket. But, I think that you get what is going on.
To query:
- There will be one query to retrieve all of the
sensoridthat a device has; once you have thosesensorid, you can then use it for the next step - There will be n queries for each
sensor_valuefor eachsensorid
- Since we are bucketing (via
time_bucket), you should have an even distribution throughout all of the partitions
Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...
- Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query
- Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw
- Create an additional Cassandra (C*) table and do some parallel writes
For the additional C* table:
CREATE TABLE sensor_by_value (
sensor-value INT,
ts TIMESTAMP,
sensorid TEXT,
reading_date DATE,
time_bucket INT,
PRIMARY KEY ((sensor-value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
You will definitely have to do some time bucketing here:
- Remember, we don't want any more than 100k rows per partition
- You'll have to understand the possible values (range)
- The frequency of each reading
- If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...
- Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it
- How much you have to bucket will depend on the frequency
- Good luck! :-)
Final tip
Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live
Your data seems immutable after the initial insert
TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need
A
default_ttlwill also help with the operational overhead of deleting data after you don't need it anymore.
Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.
To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!
Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?
– ByteFlinger
Nov 17 '18 at 9:46
Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.
– ByteFlinger
Nov 17 '18 at 10:23
Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...
– Justin Breese
Nov 18 '18 at 16:15
One more question: for your second questionFind the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)
– Justin Breese
Nov 18 '18 at 16:30
Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.
– Justin Breese
Nov 19 '18 at 3:24
|
show 9 more comments
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53309291%2fcassandra-data-modeling-iot-best-practices%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Part of the problem that you'll run into is that each sensor will be having 200k readings per day. In general, you want to keep each partition under 100k rows. So, your second idea (having date as part of the PK) may have perf issues.
Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.
To really help with this, we will need to understand a little more information:
- How many devices do you have? Will that number grow or is it finite?
- In plain English, what is an example of queries that you are trying to answer?
Incorporating this into the answer based on your answers (below):
Alright, here is a potential idea...
We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.
You're going to want two tables:
- Lookup table
- Sensor table
Lookup table will look something like:
CREATE TABLE lookup-table (
deviceid TEXT,
sensor-map MAP,
PRIMARY KEY (deviceid)
);
deviceidis the unique ID for each device
sensor-mapis a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})- That way each device has a mapping of sensors that is available to it
- Example query would be:
SELECT * FROM lookup-table WHERE deviceid = 1234;
- Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value
Sensor table will look like:
CREATE TABLE sensor_data (
sensorid TEXT,
sensor_value (whatever data type fits what you need),
ts TIMESTAMP,
reading_date date,
time_bucket int,
PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
- As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day
- How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (
reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2 - Within your application provide the specific
sensoridand
time_bucketwill come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, thentime_bucket = 1) andreading_datewill come from the actual day that you are querying - Since you are clustering with
ts DESCthen it will retrieve the
latest reading for that givensensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;
- By maintaining
tsas a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten
Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, then it isn't an even and we'll have to figure out another way to bucket. But, I think that you get what is going on.
To query:
- There will be one query to retrieve all of the
sensoridthat a device has; once you have thosesensorid, you can then use it for the next step - There will be n queries for each
sensor_valuefor eachsensorid
- Since we are bucketing (via
time_bucket), you should have an even distribution throughout all of the partitions
Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...
- Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query
- Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw
- Create an additional Cassandra (C*) table and do some parallel writes
For the additional C* table:
CREATE TABLE sensor_by_value (
sensor-value INT,
ts TIMESTAMP,
sensorid TEXT,
reading_date DATE,
time_bucket INT,
PRIMARY KEY ((sensor-value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
You will definitely have to do some time bucketing here:
- Remember, we don't want any more than 100k rows per partition
- You'll have to understand the possible values (range)
- The frequency of each reading
- If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...
- Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it
- How much you have to bucket will depend on the frequency
- Good luck! :-)
Final tip
Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live
Your data seems immutable after the initial insert
TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need
A
default_ttlwill also help with the operational overhead of deleting data after you don't need it anymore.
Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.
To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!
Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?
– ByteFlinger
Nov 17 '18 at 9:46
Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.
– ByteFlinger
Nov 17 '18 at 10:23
Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...
– Justin Breese
Nov 18 '18 at 16:15
One more question: for your second questionFind the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)
– Justin Breese
Nov 18 '18 at 16:30
Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.
– Justin Breese
Nov 19 '18 at 3:24
|
show 9 more comments
Part of the problem that you'll run into is that each sensor will be having 200k readings per day. In general, you want to keep each partition under 100k rows. So, your second idea (having date as part of the PK) may have perf issues.
Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.
To really help with this, we will need to understand a little more information:
- How many devices do you have? Will that number grow or is it finite?
- In plain English, what is an example of queries that you are trying to answer?
Incorporating this into the answer based on your answers (below):
Alright, here is a potential idea...
We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.
You're going to want two tables:
- Lookup table
- Sensor table
Lookup table will look something like:
CREATE TABLE lookup-table (
deviceid TEXT,
sensor-map MAP,
PRIMARY KEY (deviceid)
);
deviceidis the unique ID for each device
sensor-mapis a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})- That way each device has a mapping of sensors that is available to it
- Example query would be:
SELECT * FROM lookup-table WHERE deviceid = 1234;
- Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value
Sensor table will look like:
CREATE TABLE sensor_data (
sensorid TEXT,
sensor_value (whatever data type fits what you need),
ts TIMESTAMP,
reading_date date,
time_bucket int,
PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
- As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day
- How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (
reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2 - Within your application provide the specific
sensoridand
time_bucketwill come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, thentime_bucket = 1) andreading_datewill come from the actual day that you are querying - Since you are clustering with
ts DESCthen it will retrieve the
latest reading for that givensensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;
- By maintaining
tsas a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten
Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, then it isn't an even and we'll have to figure out another way to bucket. But, I think that you get what is going on.
To query:
- There will be one query to retrieve all of the
sensoridthat a device has; once you have thosesensorid, you can then use it for the next step - There will be n queries for each
sensor_valuefor eachsensorid
- Since we are bucketing (via
time_bucket), you should have an even distribution throughout all of the partitions
Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...
- Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query
- Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw
- Create an additional Cassandra (C*) table and do some parallel writes
For the additional C* table:
CREATE TABLE sensor_by_value (
sensor-value INT,
ts TIMESTAMP,
sensorid TEXT,
reading_date DATE,
time_bucket INT,
PRIMARY KEY ((sensor-value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
You will definitely have to do some time bucketing here:
- Remember, we don't want any more than 100k rows per partition
- You'll have to understand the possible values (range)
- The frequency of each reading
- If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...
- Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it
- How much you have to bucket will depend on the frequency
- Good luck! :-)
Final tip
Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live
Your data seems immutable after the initial insert
TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need
A
default_ttlwill also help with the operational overhead of deleting data after you don't need it anymore.
Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.
To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!
Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?
– ByteFlinger
Nov 17 '18 at 9:46
Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.
– ByteFlinger
Nov 17 '18 at 10:23
Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...
– Justin Breese
Nov 18 '18 at 16:15
One more question: for your second questionFind the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)
– Justin Breese
Nov 18 '18 at 16:30
Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.
– Justin Breese
Nov 19 '18 at 3:24
|
show 9 more comments
Part of the problem that you'll run into is that each sensor will be having 200k readings per day. In general, you want to keep each partition under 100k rows. So, your second idea (having date as part of the PK) may have perf issues.
Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.
To really help with this, we will need to understand a little more information:
- How many devices do you have? Will that number grow or is it finite?
- In plain English, what is an example of queries that you are trying to answer?
Incorporating this into the answer based on your answers (below):
Alright, here is a potential idea...
We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.
You're going to want two tables:
- Lookup table
- Sensor table
Lookup table will look something like:
CREATE TABLE lookup-table (
deviceid TEXT,
sensor-map MAP,
PRIMARY KEY (deviceid)
);
deviceidis the unique ID for each device
sensor-mapis a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})- That way each device has a mapping of sensors that is available to it
- Example query would be:
SELECT * FROM lookup-table WHERE deviceid = 1234;
- Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value
Sensor table will look like:
CREATE TABLE sensor_data (
sensorid TEXT,
sensor_value (whatever data type fits what you need),
ts TIMESTAMP,
reading_date date,
time_bucket int,
PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
- As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day
- How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (
reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2 - Within your application provide the specific
sensoridand
time_bucketwill come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, thentime_bucket = 1) andreading_datewill come from the actual day that you are querying - Since you are clustering with
ts DESCthen it will retrieve the
latest reading for that givensensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;
- By maintaining
tsas a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten
Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, then it isn't an even and we'll have to figure out another way to bucket. But, I think that you get what is going on.
To query:
- There will be one query to retrieve all of the
sensoridthat a device has; once you have thosesensorid, you can then use it for the next step - There will be n queries for each
sensor_valuefor eachsensorid
- Since we are bucketing (via
time_bucket), you should have an even distribution throughout all of the partitions
Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...
- Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query
- Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw
- Create an additional Cassandra (C*) table and do some parallel writes
For the additional C* table:
CREATE TABLE sensor_by_value (
sensor-value INT,
ts TIMESTAMP,
sensorid TEXT,
reading_date DATE,
time_bucket INT,
PRIMARY KEY ((sensor-value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
You will definitely have to do some time bucketing here:
- Remember, we don't want any more than 100k rows per partition
- You'll have to understand the possible values (range)
- The frequency of each reading
- If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...
- Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it
- How much you have to bucket will depend on the frequency
- Good luck! :-)
Final tip
Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live
Your data seems immutable after the initial insert
TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need
A
default_ttlwill also help with the operational overhead of deleting data after you don't need it anymore.
Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.
To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!
Part of the problem that you'll run into is that each sensor will be having 200k readings per day. In general, you want to keep each partition under 100k rows. So, your second idea (having date as part of the PK) may have perf issues.
Really what you are looking to do is what we refer to as 'bucketing'; how to group things together so queries are usable and performant.
To really help with this, we will need to understand a little more information:
- How many devices do you have? Will that number grow or is it finite?
- In plain English, what is an example of queries that you are trying to answer?
Incorporating this into the answer based on your answers (below):
Alright, here is a potential idea...
We DO care about bucketing though to try to stay around the 100k/partition optimal rows in a partition.
You're going to want two tables:
- Lookup table
- Sensor table
Lookup table will look something like:
CREATE TABLE lookup-table (
deviceid TEXT,
sensor-map MAP,
PRIMARY KEY (deviceid)
);
deviceidis the unique ID for each device
sensor-mapis a JSON map of sensors that a given device has and a corresponding unique ID for that specific sensor (e.g. {temperature: 183439, humidity : 84543292, other-sensor : blah})- That way each device has a mapping of sensors that is available to it
- Example query would be:
SELECT * FROM lookup-table WHERE deviceid = 1234;
- Another approach would be to have individual columns for each type of sensor and the unique ID for each sensor as a value
Sensor table will look like:
CREATE TABLE sensor_data (
sensorid TEXT,
sensor_value (whatever data type fits what you need),
ts TIMESTAMP,
reading_date date,
time_bucket int,
PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
- As each sensor will get 200k readings/day AND we want to keep each partition under 100k rows, that means we want to do two partitions for each sensor each day
- How could you bucket? You should do it in two parts:you need to bucket daily; each sensor gets a new partition each day (
reading_date) and split each day into two (due to the amount of readings that you're expecting); AM or PM; AM equals bucket 1, PM equals bucket 2. Or use 24 hour time where 0-1200 equals 1, 1300-2399 equals 2 - Within your application provide the specific
sensoridand
time_bucketwill come from the time that you're actually requesting
the query (e.g. if time is 1135 hours, thentime_bucket = 1) andreading_datewill come from the actual day that you are querying - Since you are clustering with
ts DESCthen it will retrieve the
latest reading for that givensensorid. So it would look like
SELECT * from sensor_data WHERE reading_date = 12/31/2017 AND sensorid = 1234 AND time_bucket = 1 LIMIT 1;
- By maintaining
tsas a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten
Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, then it isn't an even and we'll have to figure out another way to bucket. But, I think that you get what is going on.
To query:
- There will be one query to retrieve all of the
sensoridthat a device has; once you have thosesensorid, you can then use it for the next step - There will be n queries for each
sensor_valuefor eachsensorid
- Since we are bucketing (via
time_bucket), you should have an even distribution throughout all of the partitions
Lastly: give me the latest sensorid by a given value
To do that there are a couple of different ways...
- Run a Spark job: to do that, you'll have to lift and shift the data to run the Spark query
- Use DataStax Enterprise: with DSE you have an integrated Analytics component based on Spark so you can run Spark jobs without having to manage a separate Spark cluster. Disclosure: I work there, btw
- Create an additional Cassandra (C*) table and do some parallel writes
For the additional C* table:
CREATE TABLE sensor_by_value (
sensor-value INT,
ts TIMESTAMP,
sensorid TEXT,
reading_date DATE,
time_bucket INT,
PRIMARY KEY ((sensor-value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
You will definitely have to do some time bucketing here:
- Remember, we don't want any more than 100k rows per partition
- You'll have to understand the possible values (range)
- The frequency of each reading
- If you have 100 devices, 100 sensors, and each sensor being read up to 200k per day, then you have a potential for up to 2B sensor readings per day...
- Typically, what I have my customers do is run some analysis on their data to understand these bits of info, that way you can be sure to account for it
- How much you have to bucket will depend on the frequency
- Good luck! :-)
Final tip
Look into compaction strategies: specifically time window compaction strategy (TWCS) and adding a default_time_to_live
Your data seems immutable after the initial insert
TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need
A
default_ttlwill also help with the operational overhead of deleting data after you don't need it anymore.
Does this answer and/or satisfy that queries that you're trying to answer? If not, let me know and we can iterate.
To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!
edited Nov 20 '18 at 17:33
answered Nov 16 '18 at 17:25
Justin BreeseJustin Breese
463
463
Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?
– ByteFlinger
Nov 17 '18 at 9:46
Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.
– ByteFlinger
Nov 17 '18 at 10:23
Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...
– Justin Breese
Nov 18 '18 at 16:15
One more question: for your second questionFind the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)
– Justin Breese
Nov 18 '18 at 16:30
Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.
– Justin Breese
Nov 19 '18 at 3:24
|
show 9 more comments
Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?
– ByteFlinger
Nov 17 '18 at 9:46
Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.
– ByteFlinger
Nov 17 '18 at 10:23
Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...
– Justin Breese
Nov 18 '18 at 16:15
One more question: for your second questionFind the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)
– Justin Breese
Nov 18 '18 at 16:30
Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.
– Justin Breese
Nov 19 '18 at 3:24
Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?
– ByteFlinger
Nov 17 '18 at 9:46
Thank you. The main query I am looking for is "Give me the latest value for each sensor, in a list of sensors". We could be OK with limiting this query to a few days and handling the lack of a response if no value is found for the sensor in a reasonable amount of time. Another one we struggle with (though not included in the question above) is "Find the latest reading with a value of X for a specific sensor". Those are our main queries right now. Adding the date to the PK was intended as "bucketing" as you mentioned. Maybe I misunderstood the concept?
– ByteFlinger
Nov 17 '18 at 9:46
Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.
– ByteFlinger
Nov 17 '18 at 10:23
Number of devices is very small to being with (Around 10) but can grow to potentially 500, maybe more. My estimate is to design this for a limit of 100 devices for now.
– ByteFlinger
Nov 17 '18 at 10:23
Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...
– Justin Breese
Nov 18 '18 at 16:15
Thanks for the context. You were definitely on the right track. There are bounds though that you want to stay under to have performant queries: keep each partition under 100k rows, no partition greater than 100MB, no more than 2B columns in a partition. These are guidelines, of course, but that how you make your queries fast. docs.datastax.com/en/dse/6.0/cql/cql/cql_using/… docs.datastax.com/en/dse-planning/doc/planning/… As far as some recommendations, let me get back to you shortly on that...
– Justin Breese
Nov 18 '18 at 16:15
One more question: for your second question
Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)– Justin Breese
Nov 18 '18 at 16:30
One more question: for your second question
Find the latest reading with a value of X for a specific sensor: what is the range of possibilities that the sensor can display? Be sure to include how many decimal points, etc. Also, do you know the distribution (roughly) in those values (e.g. there is an average and it follows a normal distribution, etc.)– Justin Breese
Nov 18 '18 at 16:30
Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.
– Justin Breese
Nov 19 '18 at 3:24
Sorry for all of the comments and questions; I am new to this and trying to help. I just updated my answer above based on your feedback.
– Justin Breese
Nov 19 '18 at 3:24
|
show 9 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53309291%2fcassandra-data-modeling-iot-best-practices%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown