Thread: Best way to handle multi-billion row read-only table?
Hello.

I'm putting together a database to store the readings from various measurement devices for later processing. Since these things (water pressure monitors attached to very large water pipes) take readings at 200Hz and are typically deployed over multiple sites for several months at a time, I've got many billions of rows of data, each (at the moment) with the following simple format:

value REAL NOT NULL,
sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
channel INTEGER REFERENCES channel(id) NOT NULL

(Where the "channel" table contains metadata to identify the particular sensor, data logger, etc. used to obtain the data, and the combination of channel and sample_time is unique.)

Once loaded into the database the data will never be deleted or modified and will typically be accessed over a particular date range for a particular channel (e.g. "sample_time >= X AND sample_time <= Y AND channel=Z"). A typical query won't return more than a few million rows, and speed is not desperately important (as long as the time is measured in minutes rather than hours).

Are there any recommended ways to organise this? Should I partition my big table into multiple smaller ones which will always fit in memory (this would result in several hundreds or thousands of sub-tables)? Are there any ways to keep the index size to a minimum? At the moment I have a few weeks of data, about 180GB, loaded into a single table and indexed on sample_time and channel, and the index takes up 180GB too.

Since this is all for a typically budget-restricted PhD project, the hardware is just a high-end desktop workstation with (at the moment) 2*2TB drives organised into a single 4TB partition using FreeBSD's vinum system.

Many thanks for any help,
Asher.
On 2/9/2010 12:47 PM, Asher wrote:
> Since these things (water pressure monitors attached to very large
> water pipes) take readings at 200Hz and are typically deployed over
> multiple sites for several months at a time, I've got many billions of
> rows of data [...]

Well, first: does 200Hz mean 200 samples per channel per second? That is very fast sampling for a pressure sensor, and I would be surprised if the meters are actually giving real results at that rate. I would look at reducing it to what the meter is actually capable of delivering. What kind of A/D card is being used? This affects what makes sense to record.

I would look into table partitioning:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
http://wiki.postgresql.org/wiki/Table_partitioning

One big index over such small records will not be a big win, because the index is going to be the same size as the table. Look into limiting the number of records each index covers:
http://www.postgresql.org/docs/8.4/static/sql-createindex.html
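To make that concrete, an inheritance-based monthly partition (as 8.4 does it) might look something like this; the table and partition names here are just for illustration:

-- Parent table: holds no rows itself, queries go through it.
CREATE TABLE sample (
    value       REAL NOT NULL,
    sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
    channel     INTEGER NOT NULL REFERENCES channel(id)
);

-- One child table per month; the CHECK constraint lets the planner skip
-- partitions that cannot match the WHERE clause (constraint exclusion).
CREATE TABLE sample_2010_02 (
    CHECK (sample_time >= '2010-02-01' AND sample_time < '2010-03-01')
) INHERITS (sample);

-- Index only the children, so each individual index stays small.
CREATE INDEX sample_2010_02_chan_time_idx ON sample_2010_02 (channel, sample_time);

-- Constraint exclusion must not be 'off'; the 8.4 default ('partition')
-- already applies it to queries on inheritance children.
SHOW constraint_exclusion;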
I've only gone up to about a billion rows, but table partitioning seems to be the way to go to me. I did per-day partitioning, and just had the job that inserts the daily data add the new table automatically. With the partitioning, a query only has to pull up the tables for the specific days, so it acts as a kind of date index already.

Alex Thurlow
Blastro Networks

http://www.blastro.com
http://www.roxwel.com
http://www.yallwire.com

On 2/9/2010 11:47 AM, Asher wrote:
> Are there any recommended ways to organise this? Should I partition my
> big table into multiple smaller ones which will always fit in memory
> (this would result in several hundreds or thousands of sub-tables)?
> Are there any ways to keep the index size to a minimum? [...]
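As a rough sketch of the kind of helper such an insert job could call to create the day's partition on the fly (the function and naming here are invented for illustration, assuming the "sample" parent table sketched above):

-- Creates a per-day child table of "sample" plus its index, for the day
-- being loaded.
CREATE OR REPLACE FUNCTION add_day_partition(part_day date) RETURNS void AS $$
DECLARE
    part text := 'sample_' || to_char(part_day, 'YYYY_MM_DD');
BEGIN
    EXECUTE 'CREATE TABLE ' || part
         || ' (CHECK (sample_time >= ' || quote_literal(part_day::text)
         || ' AND sample_time < ' || quote_literal((part_day + 1)::text)
         || ')) INHERITS (sample)';
    EXECUTE 'CREATE INDEX ' || part || '_time_idx ON ' || part || ' (sample_time)';
END;
$$ LANGUAGE plpgsql;

The load job would then run something like SELECT add_day_partition('2010-02-09'); before inserting that day's rows.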
Justin Graf wrote:
> That is very fast sampling for a pressure sensor, and I would be
> surprised if the meters are actually giving real results at that rate.
> I would look at reducing it to what the meter is actually capable of
> delivering. What kind of A/D card is being used? This affects what
> makes sense to record.

Yes, we really are measuring at 200 samples per second. We're trying to capture high resolution images of pressure transients as they move along water distribution pipelines (the backbones of the water network, typically 4'-6' in diameter, carrying 500-1000 litres/second) to understand how they travel and what stress they put upon the pipe. We're using custom data loggers at the moment, based around Intel iMote2 Linux systems with a high-speed QuickFilter ADC (and the sensors we're using can cope at 200Hz).

> I would look into table partitioning
> http://www.postgresql.org/docs/current/static/ddl-partitioning.html
> http://wiki.postgresql.org/wiki/Table_partitioning

Thanks for that, it looks like partitioning is the way to go. I'm assuming that I should try and keep my total_relation_sizes less than the memory size of the machine?

> One big index over such small records will not be a big win, because
> the index is going to be the same size as the table. Look into limiting
> the number of records each index covers:
> http://www.postgresql.org/docs/8.4/static/sql-createindex.html

If I partition so that each partition holds data for a single channel (and set a CHECK constraint for this) then I can presumably remove the channel from the index, since constraint exclusion will mean that only partitions holding the channel I'm interested in will be searched in a query. Given that within a partition all of my sample_times will be different, do you know if there's a more efficient way to index these?

Many thanks,
Asher
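For illustration, a per-channel partition along those lines might look like this (the channel id and names are invented; "sample" is the parent table sketched earlier):

-- One child per channel; the CHECK fixes the channel for the whole partition.
CREATE TABLE sample_chan_42 (
    CHECK (channel = 42)
) INHERITS (sample);

-- With the channel pinned by the CHECK, the index only needs sample_time.
CREATE INDEX sample_chan_42_time_idx ON sample_chan_42 (sample_time);

-- Constraint exclusion then skips every other channel's partition:
SELECT value, sample_time
FROM sample
WHERE channel = 42
  AND sample_time >= '2010-01-01' AND sample_time < '2010-01-08';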
On Tue, Feb 9, 2010 at 10:47 AM, Asher <asher@piceur.co.uk> wrote:
> Are there any recommended ways to organise this? Should I partition my
> big table into multiple smaller ones which will always fit in memory
> (this would result in several hundreds or thousands of sub-tables)?

Partitioning is the standard way. Note that you can partition on one axis, or two, or more. In this case partitioning on time and channel might make the most sense.

> Are there any ways to keep the index size to a minimum? At the moment I
> have a few weeks of data, about 180GB, loaded into a single table and
> indexed on sample_time and channel, and the index takes up 180GB too.

It may be that with small enough partitions indexes aren't really needed. That's been the case for a lot of the data I've worked with in the past.

> Since this is all for a typically budget-restricted PhD project, the
> hardware is just a high-end desktop workstation with (at the moment)
> 2*2TB drives organised into a single 4TB partition using FreeBSD's
> vinum system.

Partitioning should definitely help. You might also want to go with RAID-1 instead of RAID-0, since the read performance is similar under most modern OSes. I know Linux now spreads reads across both drives; I'd assume BSD does too. That way you've got better reliability and about the same read performance. Load times will be about double, but that's a one-time thing, right?
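As a sketch of the two-axis idea (names and ranges invented for illustration), each child table would simply carry both constraints:

-- Partitioned on channel and month at the same time.
CREATE TABLE sample_chan_42_2010_02 (
    CHECK (channel = 42),
    CHECK (sample_time >= '2010-02-01' AND sample_time < '2010-03-01')
) INHERITS (sample);

-- A query for one channel over one date range then touches only the few
-- child tables whose CHECK constraints overlap the WHERE clause.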
Asher Hoskins wrote:
> If I partition so that each partition holds data for a single channel
> (and set a CHECK constraint for this) then I can presumably remove the
> channel from the index, since constraint exclusion will mean that only
> partitions holding the channel I'm interested in will be searched in a
> query. Given that within a partition all of my sample_times will be
> different, do you know if there's a more efficient way to index these?

How do you plan on accessing this monster data? Do you expect to be looking up single values or small sets of values at a specific time?

It seems to me like this is the sort of data that's more often processed in the aggregate, like running a Fourier analysis over sliding windows, and that sort of data processing may well be more efficiently done with fixed-block binary files rather than a relational database, as there are no real relationships in this data.

For instance, a directory for each sensor, with a directory for each week, and a file for each hour, containing the hour's worth of samples in fixed binary blocks after a file header identifying it. You can then randomly access a specific time sample by using fseek to (sample_time - start_time_of_block) * blocksize + headersize, or whatever.
On 2/9/2010 4:41 PM, Asher Hoskins wrote:
> Thanks for that, it looks like partitioning is the way to go. I'm
> assuming that I should try and keep my total_relation_sizes less than
> the memory size of the machine?

This depends on what the queries look like. As others have stated, when partitioning you have to consider how the data is queried.

> If I partition so that each partition holds data for a single channel
> (and set a CHECK constraint for this) then I can presumably remove the
> channel from the index [...] Given that within a partition all of my
> sample_times will be different, do you know if there's a more efficient
> way to index these?

Given that the timestamp will most likely be in the WHERE clause, no. On the plus side, it's only 8 bytes.
Is it also possible to denormalize by putting the 'channel' data in the first table (especially if it isn't very much)? Maintaining a foreign key constraint can impact performance significantly in most RDBMSs, even when deferring checking. I could be wrong, but I suspect PostgreSQL is no different. Or keep the data normalized and remove the constraint altogether. Also remove any primary key constraint so that it doesn't have to check uniqueness, and avoid as many indexes as you can. You have to take a leap of faith that you created your program well enough to not get out of sync.

I would be interested to hear comments on this. These are some of the things we did on systems I have worked on running Oracle that handled even higher volumes (tens to hundreds of thousands of transactions per second or higher, sustained throughout the day at least on the lower volume). Granted we had real heavy hardware, but the DBAs forbade us to create constraints, indexes, etc. for this reason, except on less active tables.

Everyone has already talked about partitioning, but load balancing across machines, if you can afford a couple or a few more, could help too. Not sure what facility Postgres has for this though (I would be interested to hear comments on this too! :-)

BillR
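For what it's worth, the constraint-free variant being described would be something like this sketch; the extra metadata columns are assumptions for illustration, not part of Asher's actual channel table:

-- Denormalized, constraint-free table: nothing to check or maintain per insert.
CREATE TABLE sample_denorm (
    value       REAL NOT NULL,
    sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
    channel     INTEGER NOT NULL,   -- kept, but no REFERENCES check
    sensor_name TEXT,               -- example metadata folded in from "channel"
    logger_id   INTEGER
    -- no PRIMARY KEY, no UNIQUE constraint, no indexes
);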
On Tue, Feb 9, 2010 at 8:06 PM, BillR <iambill@williamrosmus.com> wrote:
> Is it also possible to denormalize by putting the 'channel' data in the
> first table (especially if it isn't very much)? Maintaining a foreign
> key constraint can impact performance significantly in most RDBMSs,
> even when deferring checking. [...] Also remove any primary key
> constraint so that it doesn't have to check uniqueness, and avoid as
> many indexes as you can.

I was under the impression the data was being gathered elsewhere and then imported, so the insert performance isn't as critical as it would be if this were being done in real time.
Asher wrote:
> Once loaded into the database the data will never be deleted or
> modified and will typically be accessed over a particular date range
> for a particular channel (e.g. "sample_time >= X AND sample_time <= Y
> AND channel=Z"). A typical query won't return more than a few million
> rows, and speed is not desperately important (as long as the time is
> measured in minutes rather than hours).

One approach to consider is partitioning by sample_time and not even including the channel number in the index. You've got tiny records; there are going to be hundreds of channels of data on each data page pulled in, right? Why not minimize physical I/O by reducing the index size: just read that whole section of time into memory (the rows should be pretty closely clustered and therefore mostly sequential I/O), and then push the filtering by channel onto the CPU instead. If you've got billions of rows, you're going to end up disk-bound anyway; minimizing physical I/O and random seeking around at the expense of CPU time could be a big win.

--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
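Concretely, using the monthly children sketched earlier in the thread, that would mean something like the following (still just an illustrative sketch):

-- Same time-based child tables as before, but index sample_time alone.
CREATE INDEX sample_2010_02_time_idx ON sample_2010_02 (sample_time);

-- The index narrows the scan to the time window; the channel filter is then
-- applied to whatever rows come in, trading CPU for a much smaller index.
SELECT value, sample_time
FROM sample
WHERE sample_time >= '2010-02-03' AND sample_time <= '2010-02-04'
  AND channel = 7;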
On Tue, Feb 9, 2010 at 11:51 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> One approach to consider is partitioning by sample_time and not even
> including the channel number in the index. [...] If you've got billions
> of rows, you're going to end up disk-bound anyway; minimizing physical
> I/O and random seeking around at the expense of CPU time could be a big
> win.

If they're put in in a one-time load, load them in channel order into the partitions, and the stats should see the perfect ordering and know to seq scan the right part of the table. I think.
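A sketch of what that ordered one-time load could look like (the staging table here is an assumption for illustration):

-- Fill each partition from a staging table in channel/time order, so one
-- channel's rows end up physically contiguous and correlation stats are high.
INSERT INTO sample_2010_02 (value, sample_time, channel)
SELECT value, sample_time, channel
FROM staging_2010_02
ORDER BY channel, sample_time;

-- Refresh statistics so the planner sees the ordering.
ANALYZE sample_2010_02;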
John R Pierce wrote:
> How do you plan on accessing this monster data? Do you expect to be
> looking up single values or small sets of values at a specific time?
> It seems to me like this is the sort of data that's more often
> processed in the aggregate, like running a Fourier analysis over
> sliding windows, and that sort of data processing may well be more
> efficiently done with fixed-block binary files rather than a relational
> database, as there are no real relationships in this data.

The data will initially be accessed via a simple GUI which will allow browsing over a subset of the data (subsampled down to 1 sample/minute/hour, etc. during the data load phase, and so massively smaller datasets), and then once something "interesting" has been found manually (fully automatic detection of transients has to wait until we've formally described what we mean by "transient" :-)) the start and end times can be handed over to our automatic processing code to go through the full dataset.

I did consider just sticking the data into a series of big dumb files, but by putting them in a DB I can both maintain automatic links between the full and subsampled data sets and between each data point and the equipment that measured it and, possibly more importantly, I can provide a simpler interface for the other people on my project to access the data. I'm a computer scientist but I'm doing my PhD in the Civil Engineering dept and all of my colleagues are civil engineers - all quite happy using Matlab's database plugin but less happy writing traditional code to crunch through raw files. I'm aware that I'm taking a, possibly quite large, performance hit by using a database but I'm hoping that the advantages will outweigh this.

Many thanks for all the replies to my query. I'm going to go with a partitioned table design and start uploading some data. I'll post how it performs once I've got some real-size data in it.

Asher.
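For example, the subsampled browsing sets could be built at load time with something along these lines (table and column names are illustrative only, assuming the "sample" table used earlier):

-- One averaged value (plus min/max for spotting transients) per channel
-- per minute, built once after each load.
CREATE TABLE sample_minute AS
SELECT channel,
       date_trunc('minute', sample_time) AS minute_start,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM sample
GROUP BY channel, date_trunc('minute', sample_time);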
On Wed, Feb 10, 2010 at 2:32 PM, Asher <asher@piceur.co.uk> wrote:
> The data will initially be accessed via a simple GUI which will allow
> browsing over a subset of the data (subsampled down to 1
> sample/minute/hour, etc.

It sounds like you could use a tool like rrd that keeps various levels of aggregation and intelligently chooses the right level for the given query. I think there are such tools though I'm not sure there are any free ones.

--
greg
2010/2/10 Greg Stark <gsstark@mit.edu>:
> It sounds like you could use a tool like rrd that keeps various levels
> of aggregation and intelligently chooses the right level for the given
> query.

Use as much memory as possible to fit the indexes, as well as portions of the table space itself, in RAM. Of course, poor indexing can kill any effort.

--
Vincenzo Romano
NotOrAnd Information Technologies
NON QVIETIS MARIBVS NAVTA PERITVS
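In PostgreSQL terms that mostly comes down to the usual memory settings in postgresql.conf. A sketch only: the values below are placeholders, not recommendations, and need tuning to the actual machine and workload:

# postgresql.conf (sketch, placeholder values)
shared_buffers = 2GB            # PostgreSQL's own buffer cache
effective_cache_size = 12GB     # planner's estimate of RAM available for caching (PG + OS)
work_mem = 64MB                 # per-sort/hash working memory for the big range queries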