Thread: Storing sensor data

Storing sensor data

From
Ivan Voras
Date:
Hi,

I need to store data about sensor readings. There is a known (but
configurable) number of sensors which can send update data at any time.
The "current" state needs to be kept but also all historical records.
I'm trying to decide between these two designs:

1) create a table for "current" data, one record for each sensor, update
this table when a sensor reading arrives, create a trigger that would
transfer old record data to a history table (of basically the same
structure)
2) write only to the history table, use relatively complex queries or
outside-the-database magic to determine what the "current" values of the
sensors are.

The volume of sensor data is potentially huge, on the order of 500,000
updates per hour. Sensor data consists of a few numeric(15,5) values.

I think the second design would be easiest on the database, but since
the current sensor state can potentially be queried often, it might be
too slow to read.

Any recommendations?



Re: Storing sensor data

From
Heikki Linnakangas
Date:
Ivan Voras wrote:
> I need to store data about sensor readings. There is a known (but
> configurable) number of sensors which can send update data at any time.
> The "current" state needs to be kept but also all historical records.
> I'm trying to decide between these two designs:
>
> 1) create a table for "current" data, one record for each sensor, update
> this table when a sensor reading arrives, create a trigger that would
> transfer old record data to a history table (of basically the same
> structure)
> 2) write only to the history table, use relatively complex queries or
> outside-the-database magic to determine what the "current" values of the
> sensors are.

3) write only to the history table, but have an INSERT trigger to update
the table with "current" data. This has the same performance
characteristics as 1, but lets you design your application like 2.

I think I'd choose this approach (or 2), since it can handle
out-of-order or delayed arrival of sensor readings gracefully (assuming
they are timestamped at source).
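
An untested sketch of 3, assuming readings are just (sensor_id, ts,
value) and the current table keeps one row per sensor:

  CREATE TABLE sensor_history (
      sensor_id integer        NOT NULL,
      ts        timestamptz    NOT NULL,
      value     numeric(15,5)  NOT NULL
  );

  CREATE TABLE sensor_current (
      sensor_id integer PRIMARY KEY,
      ts        timestamptz    NOT NULL,
      value     numeric(15,5)  NOT NULL
  );

  CREATE FUNCTION sensor_current_upd() RETURNS trigger AS $$
  BEGIN
      -- only overwrite if the new reading is newer, so late arrivals
      -- don't clobber the current value
      UPDATE sensor_current
         SET ts = NEW.ts, value = NEW.value
       WHERE sensor_id = NEW.sensor_id
         AND ts < NEW.ts;
      IF NOT FOUND AND NOT EXISTS
         (SELECT 1 FROM sensor_current WHERE sensor_id = NEW.sensor_id)
      THEN
          -- first reading ever seen for this sensor
          INSERT INTO sensor_current VALUES (NEW.sensor_id, NEW.ts, NEW.value);
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER sensor_current_upd AFTER INSERT ON sensor_history
      FOR EACH ROW EXECUTE PROCEDURE sensor_current_upd();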

If you go with 2, I'd recommend still creating a view to encapsulate
the complex query for the current values, to make application
development simpler. And if it gets slow, you can easily swap the view
for a table, updated with triggers or periodically, without changing
the application.
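
With 2 the view could be as simple as this (again just a sketch, using
the same made-up layout as above):

  CREATE VIEW sensor_current_view AS
      SELECT DISTINCT ON (sensor_id) sensor_id, ts, value
        FROM sensor_history
       ORDER BY sensor_id, ts DESC;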

> The volume of sensor data is potentially huge, on the order of 500,000
> updates per hour. Sensor data is few numeric(15,5) numbers.

Whichever design you choose, you should also consider partitioning the data.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Storing sensor data

From
Nikolas Everett
Date:
Option 1 is somewhere between two and three times more work for the database than option 2.

Do you need every sensor update to hit the database?  In a situation like this I'd be tempted to keep the current values in the application itself and then sweep them all into the database periodically.  If some of the sensor updates should hit the database faster, you could push those in as you get them rather than wait for your sweeper.  This setup has the advantage that you can scale up the number of sensors and the frequency the sensors report without having to scale up the disks.  You can also do the sweeping all in one transaction or even in one batch update.
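
The sweep itself can then be a single statement per batch, e.g. (a sketch; the table and column names are made up):

  BEGIN;
  INSERT INTO sensor_history (sensor_id, ts, value) VALUES
      (17, '2009-05-28 14:00:01', 23.40000),
      (18, '2009-05-28 14:00:01', 21.10000),
      (17, '2009-05-28 14:00:31', 23.50000);  -- ...rest of the buffered readings
  COMMIT;
  -- or COPY sensor_history FROM STDIN for bigger sweeps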


On Thu, May 28, 2009 at 9:31 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Ivan Voras wrote:
>> The volume of sensor data is potentially huge, on the order of 500,000
>> updates per hour. Sensor data is few numeric(15,5) numbers.
>
> Whichever design you choose, you should also consider partitioning the data.


Amen.  Do that.

Re: Storing sensor data

From
Alexander Staubo
Date:
On Thu, May 28, 2009 at 2:54 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> The volume of sensor data is potentially huge, on the order of 500,000
> updates per hour. Sensor data is few numeric(15,5) numbers.

The size of that dataset, combined with the apparent simplicity of
your schema and the apparent requirement for mostly sequential access
(I'm guessing about the latter two), leads me to suspect you would be
happier with something other than a traditional relational database.

I don't know how exact your historical data has to be. Could you get
by with something like RRDTool? RRDTool is a round-robin database that
stores multiple levels of historical values aggregated by function. So
you typically create an "average" database, a "max" database and so
on, with the appropriate functions to transform the data, and you
subdivide these into day, month, year and so on, at the granularity of
your choice.

When you store a value, the historical data is aggregated at the
appropriate levels of granularity -- so the current-day database is
more precise than the monthly one, and so on -- and you always have
access to the exact current data. RRDTool is used by software such as
Munin and Cacti that track a huge number of readings over time for
graphing.

If you require precise data with the ability to filter, aggregate and
correlate over multiple dimensions, something like Hadoop -- or one of
the Hadoop-based column database implementations, such as HBase or
Hypertable -- might be a better option, combined with MapReduce/Pig to
execute analysis jobs.

A.

Re: Storing sensor data

From
Ivan Voras
Date:
2009/5/28 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Ivan Voras wrote:
>>
>> I need to store data about sensor readings. There is a known (but
>> configurable) number of sensors which can send update data at any time.
>> The "current" state needs to be kept but also all historical records.
>> I'm trying to decide between these two designs:
>>
>> 1) create a table for "current" data, one record for each sensor, update
>> this table when a sensor reading arrives, create a trigger that would
>> transfer old record data to a history table (of basically the same
>> structure)
>> 2) write only to the history table, use relatively complex queries or
>> outside-the-database magic to determine what the "current" values of the
>> sensors are.
>
> 3) write only to the history table, but have an INSERT trigger to update the
> table with "current" data. This has the same performance characteristics as
> 1, but let's you design your application like 2.

Excellent idea!

> I think I'd choose this approach (or 2), since it can handle out-of-order or
> delayed arrival of sensor readings gracefully (assuming they are timestamped
> at source).

It seems like your approach is currently the winner.

> If you go with 2, I'd recommend to still create a view to encapsulate the
> complex query for the current values, to make the application development
> simpler. And if it gets slow, you can easily swap the view with a table,
> updated with triggers or periodically, without changing the application.
>
>> The volume of sensor data is potentially huge, on the order of 500,000
>> updates per hour. Sensor data is few numeric(15,5) numbers.
>
> Whichever design you choose, you should also consider partitioning the data.

I'll look into it, but we'll first see if we can get away with
limiting the time the data needs to be available.

Re: Storing sensor data

From
Ivan Voras
Date:
2009/5/28 Nikolas Everett <nik9000@gmail.com>:
> Option 1 is about somewhere between 2 and 3 times more work for the database
> than option 2.

Yes, for writes.

> Do you need every sensor update to hit the database?  In a situation like

We can't miss an update - they can be delayed but they all need to be recorded.

> this I'd be tempted to keep the current values in the application itself and
> then sweep them all into the database periodically.  If some of the sensor
> updates should hit the database faster, you could push those in as you get
> them rather than wait for your sweeper.  This setup has the advantage that
> you can scale up the number of sensors and the frequency the sensors report
> without having to scale up the disks.  You can also do the sweeping all in
> one transaction or even in one batch update.

It would be nice, but then we'd need to invest more effort in making
the front-end buffering resilient.

Re: Storing sensor data

From
Kenneth Marshall
Date:
On Thu, May 28, 2009 at 04:55:34PM +0200, Ivan Voras wrote:
> 2009/5/28 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> > Ivan Voras wrote:
> >>
> >> I need to store data about sensor readings. There is a known (but
> >> configurable) number of sensors which can send update data at any time.
> >> The "current" state needs to be kept but also all historical records.
> >> I'm trying to decide between these two designs:
> >>
> >> 1) create a table for "current" data, one record for each sensor, update
> >> this table when a sensor reading arrives, create a trigger that would
> >> transfer old record data to a history table (of basically the same
> >> structure)
> >> 2) write only to the history table, use relatively complex queries or
> >> outside-the-database magic to determine what the "current" values of the
> >> sensors are.
> >
> > 3) write only to the history table, but have an INSERT trigger to update the
> > table with "current" data. This has the same performance characteristics as
> > 1, but let's you design your application like 2.
>
> Excellent idea!
>
> > I think I'd choose this approach (or 2), since it can handle out-of-order or
> > delayed arrival of sensor readings gracefully (assuming they are timestamped
> > at source).
>
> It seems like your approach is currently the winner.
>
> > If you go with 2, I'd recommend to still create a view to encapsulate the
> > complex query for the current values, to make the application development
> > simpler. And if it gets slow, you can easily swap the view with a table,
> > updated with triggers or periodically, without changing the application.
> >
> >> The volume of sensor data is potentially huge, on the order of 500,000
> >> updates per hour. Sensor data is few numeric(15,5) numbers.
> >
> > Whichever design you choose, you should also consider partitioning the data.
>
> I'll look into it, but we'll first see if we can get away with
> limiting the time the data needs to be available.
>

Mr. Voras,

One big benefit of partitioning is that you can prune old data with
minimal impact to the running system. Doing a large bulk delete would
be extremely I/O-intensive without partition support. We use this for
a DB log system and it allows us to simply truncate a day table instead
of running a delete -- much, much faster.
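
With the usual inheritance-style partitioning that looks roughly like
this (a sketch, reusing the history table sketched earlier in the
thread):

  -- one child table per day
  CREATE TABLE sensor_history_2009_05_28 (
      CHECK (ts >= '2009-05-28' AND ts < '2009-05-29')
  ) INHERITS (sensor_history);

  -- with constraint_exclusion = on, queries that filter on ts only
  -- touch the relevant children

  -- expiring a day later is nearly free compared to a bulk DELETE:
  TRUNCATE TABLE sensor_history_2009_05_28;
  -- or: DROP TABLE sensor_history_2009_05_28;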

Regards,
Ken

Re: Storing sensor data

From
Ivan Voras
Date:
2009/5/28 Alexander Staubo <alex@bengler.no>:
> On Thu, May 28, 2009 at 2:54 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> The volume of sensor data is potentially huge, on the order of 500,000
>> updates per hour. Sensor data is few numeric(15,5) numbers.
>
> The size of that dataset, combined with the apparent simplicity of
> your schema and the apparent requirement for most-sequential access
> (I'm guessing about the latter two),

Your guesses are correct, except every now and then a random value
indexed on a timestamp needs to be retrieved.

> all lead me to suspect you would
> be happier with something other than a traditional relational
> database.
>
> I don't know how exact your historical data has to be. Could you get

No "lossy" compression is allowed. Exact data is needed for the whole dataset-

> If you require precise data with the ability to filter, aggregate and
> correlate over multiple dimensions, something like Hadoop -- or one of
> the Hadoop-based column database implementations, such as HBase or
> Hypertable -- might be a better option, combined with MapReduce/Pig to
> execute analysis jobs

This looks like an interesting idea to investigate. Do you have more
experience with such databases? How do they fare with the following
requirements:

* Storing large datasets (do they pack data well in the database? No
wasted space like in e.g. hash tables?)
* Retrieving specific random records based on a timestamp or record ID?
* Storing "inifinite" datasets (i.e. whose size is not known in
advance - cf. e.g. hash tables)

On the other hand, we could periodically transfer data from PostgreSQL
into a simpler database (e.g. BDB) for archival purposes (at the
expense of more code). Would they be better suited?

Re: Storing sensor data

From
Ivan Voras
Date:
2009/5/28 Kenneth Marshall <ktm@rice.edu>:

>
> One big benefit of partitioning is that you can prune old data with
> minimal impact to the running system. Doing a large bulk delete would
> be extremely I/O impacting without partion support. We use this for
> a DB log system and it allows us to simply truncate a day table instead
> of a delete -- much, much faster.

Thanks. I'll need to investigate how much administrative overhead and
fragility partitioning will introduce since the data will also be
replicated between 2 servers (I'm thinking of using Slony). Any
experience with this combination?

Re: Storing sensor data

From
Alexander Staubo
Date:
On Thu, May 28, 2009 at 5:06 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> If you require precise data with the ability to filter, aggregate and
>> correlate over multiple dimensions, something like Hadoop -- or one of
>> the Hadoop-based column database implementations, such as HBase or
>> Hypertable -- might be a better option, combined with MapReduce/Pig to
>> execute analysis jobs
>
> This looks like an interesting idea to investigate. Do you have more
> experience with such databases? How do they fare with the following
> requirements:

We might want to take this discussion off-list, since this list is
about PostgreSQL. Feel free to reply privately.

> * Storing large datasets (do they pack data well in the database? No
> wasted space like in e.g. hash tables?)

Column databases like Hypertable and HBase are designed to store data
quite efficiently. Columns are grouped into units called column
families and stored together in chunks usually called SSTables, after
the Google Bigtable paper. (When you design your database you must
determine which columns are usually accessed together, in order to
avoid incurring the I/O cost of loading non-pertinent columns.) Each
SSTable is like a partition. When a chunk is stored to disk, it is
compressed, with each column stored sequentially for optimal
compression.

I have used HBase, but I don't have any feel for how much space it
wastes. In theory, though, space usage should be better than with
PostgreSQL. I have also used Cassandra, another column database I
would recommend, which is very efficient. In many ways I prefer
Cassandra to HBase -- it's leaner, completely decentralized (no single
point of failure) and independent of the rather huge, monolithic
Hadoop project -- but it does not currently support MapReduce. If you
want to implement some kind of distributed analysis system, you will
need to write it yourself.

All three column stores support mapping information by a time
dimension. Each time you write a key, you also provide a timestamp. In
theory you can retain the entire history of a single key. HBase lets
you specify how many revisions to retain; not sure what Cassandra
does. However, Cassandra introduces the notion of a "supercolumn
family", another grouping level which lets you use the timestamp as a
column key. To explain how this works, consider the following inserts:

  # insert(table_name, key, column, value, timestamp)
  db.insert("readings", "temperature_sensor", "value:1", 23, "200905281725023")
  db.insert("readings", "temperature_sensor", "value:2", 27, "200905281725023")
  db.insert("readings", "temperature_sensor", "value:3", 21, "200905281725023")

The resulting "temperature_sensor" row will have three column values:

  value:1   value:2   value:3
  23        27        21

You can keep adding values and the row will get bigger. Because
columns are dynamic, only that row will grow; all other rows will stay
the same size. Cassandra users usually use the column name as a kind
of value -- imagine it's like subindexing an array.

As you can see, I also passed a timestamp (the 2009.. bit), which is
used for versioning. Since anyone can write to any node in a cluster,
Cassandra needs to be able to resolve conflicts.

Note that these databases are inherently distributed. You can run them
on a single node just fine -- and that might be appropriate in your
case -- but they really shine when you run a whole cluster. Cassandra
is multi-master, so you can just boot up a number of nodes and read
from/write to any of them.

> * Retrieving specific random records based on a timestamp or record ID?

Absolutely.

> * Storing "inifinite" datasets (i.e. whose size is not known in
> advance - cf. e.g. hash tables)

This is one area where column databases are better than relational
ones. The schema is completely dynamic, and you can treat it as a hash
table.

> On the other hand, we could periodically transfer data from PostgreSQL
> into a simpler database (e.g. BDB) for archival purposes (at the
> expense of more code). Would they be better suited?

Considering the size and sequential nature of the data, I think they
would be a better match than a simple key-value store like BDB.

A.

Re: Storing sensor data

From
Grzegorz Jaśkiewicz
Date:
It depends on how soon you need to access the data after it's
created. The way I do it in my systems, I get data from 8 points, a
bit less than you, but I dump it to CSV and import it on the database
host (a separate server). Now, you could go to BDB or whatever, but
that's not the solution.

So, I would try dumping it to a file and having a separate process,
maybe a separate server, that imports and stores it in the database.
Whatever you do, as the guys said - partition.

Re: Storing sensor data

From
Kenneth Marshall
Date:
On Thu, May 28, 2009 at 05:24:33PM +0200, Ivan Voras wrote:
> 2009/5/28 Kenneth Marshall <ktm@rice.edu>:
>
> >
> > One big benefit of partitioning is that you can prune old data with
> > minimal impact to the running system. Doing a large bulk delete would
> > be extremely I/O impacting without partion support. We use this for
> > a DB log system and it allows us to simply truncate a day table instead
> > of a delete -- much, much faster.
>
> Thanks. I'll need to investigate how much administrative overhead and
> fragility partitioning will introduce since the data will also be
> replicated between 2 servers (I'm thinking of using Slony). Any
> experience with this combination?
>

We use Slony1 on a number of databases, but none yet on which we
use data partitioning.

Cheers,
Ken

Re: Storing sensor data

From
Greg Jaman
Date:
I currently have a database doing something very similar.  I set up partition tables with predictable names based on the data's timestamp week number, e.g. (Data_YYYY_WI).

I have a trigger on the parent partition table to redirect data to the correct partition ( tablename := 'Data_' || to_char('$NEW(ts)'::timestamptz, 'IYYY_IW') ), then I use dynamic SQL to do the insert.  I did some optimization by writing it in pl/TCL and using global variables to store prepared insert statements.

Most queries for me are based on the date and we have decent performance with our current setup.  For last/current sensor data we just store the last dataID in the sensor record.  I haven't thought of a better way yet.  After batch inserts we calculate the last reading for each participating sensorID inserted.
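
The recalculation itself is basically this (a sketch; the table and column names here are simplified):

  UPDATE "Sensor" AS s
     SET "lastDataID" = d."dataID"
    FROM (SELECT DISTINCT ON ("sensorID") "sensorID", "dataID"
            FROM "Data"
           WHERE "dataID" >= 1234567  -- first dataID of the batch just inserted
           ORDER BY "sensorID", "dataID" DESC) AS d
   WHERE s."sensorID" = d."sensorID";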

With partition tables we struggled with the query to get the latest data:  select * from "Data" where "sensorID" = x order by ts desc limit 1  -- on the partitioned parent table.  See http://archives.postgresql.org/pgsql-performance/2008-11/msg00284.php
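
In plain pl/pgsql the redirect trigger looks roughly like this (mine is pl/TCL with prepared statements, and the column names here are simplified):

  CREATE OR REPLACE FUNCTION data_insert_redirect() RETURNS trigger AS $$
  DECLARE
      tbl text := 'Data_' || to_char(NEW.ts, 'IYYY_IW');
  BEGIN
      EXECUTE 'INSERT INTO ' || quote_ident(tbl)
           || ' ("sensorID", ts, value) VALUES ('
           || quote_literal(NEW."sensorID"::text) || ', '
           || quote_literal(NEW.ts::text) || ', '
           || quote_literal(NEW.value::text) || ')';
      RETURN NULL;  -- the row is stored only in the child table
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER data_insert_redirect BEFORE INSERT ON "Data"
      FOR EACH ROW EXECUTE PROCEDURE data_insert_redirect();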





On Thu, May 28, 2009 at 7:55 AM, Ivan Voras <ivoras@freebsd.org> wrote:
2009/5/28 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Ivan Voras wrote:
>>
>> I need to store data about sensor readings. There is a known (but
>> configurable) number of sensors which can send update data at any time.
>> The "current" state needs to be kept but also all historical records.
>> I'm trying to decide between these two designs:
>>
>> 1) create a table for "current" data, one record for each sensor, update
>> this table when a sensor reading arrives, create a trigger that would
>> transfer old record data to a history table (of basically the same
>> structure)
>> 2) write only to the history table, use relatively complex queries or
>> outside-the-database magic to determine what the "current" values of the
>> sensors are.
>
> 3) write only to the history table, but have an INSERT trigger to update the
> table with "current" data. This has the same performance characteristics as
> 1, but let's you design your application like 2.

Excellent idea!

> I think I'd choose this approach (or 2), since it can handle out-of-order or
> delayed arrival of sensor readings gracefully (assuming they are timestamped
> at source).

It seems like your approach is currently the winner.

> If you go with 2, I'd recommend to still create a view to encapsulate the
> complex query for the current values, to make the application development
> simpler. And if it gets slow, you can easily swap the view with a table,
> updated with triggers or periodically, without changing the application.
>
>> The volume of sensor data is potentially huge, on the order of 500,000
>> updates per hour. Sensor data is few numeric(15,5) numbers.
>
> Whichever design you choose, you should also consider partitioning the data.

I'll look into it, but we'll first see if we can get away with
limiting the time the data needs to be available.


Re: Storing sensor data

From
Greg Jaman
Date:
I also forgot to note that I had no problems setting up replication via londiste (skytools).   The cronjob that creates the partition each week for me also adds the table to the replication set.   As simple as:
   londiste.py londiste.ini provider add 'public.Data_YYYY_WI'
   londiste.py londiste.ini subscriber add 'public.Data_YYYY_WI'





