Thread: Scaling SELECT:s with the number of disks on a stripe
Hello,

I am looking to use PostgreSQL for storing some very simple flat data,
mostly in a single table. The amount of data will be in the hundreds of
gigabytes range. Each row is on the order of 100-300 bytes in size; in
other words, small enough that I am expecting disk I/O to be seek bound
(even if PostgreSQL reads a full page at a time, since a page is
significantly smaller than the stripe size of the volume).

The only important performance characteristics are insertion/deletion
performance, and the performance of trivial SELECT queries whose WHERE
clause tests equality on one of the columns.

Other than absolute performance, an important goal is to be able to
scale fairly linearly with the number of underlying disk drives. We are
fully willing to take a disk seek per item selected, as long as it
scales.

To this end I have been doing some benchmarking to see whether the plan
is going to be feasible. On a 12 disk hardware stripe, insertion
performance does scale somewhat with concurrent inserters. However, I
am seeing surprising effects with SELECT:s: a single selector generates
the same amount of disk activity as two concurrent selectors (I was
easily expecting about twice).

The query is simple:

SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;

No ordering, no joins, no nothing. Selecting concurrently with two
different values of 'xxx' yields the same amount of disk activity
(never any significant CPU activity). Note that the total amount of
data is too large to fit in RAM (> 500 million rows), and the number of
distinct values in the value column is 10000. The column in the WHERE
clause is indexed.

So my first question is - why am I not seeing this scaling? The
absolute amount of disk activity with a single selector is consistent
with what I would expect from a SINGLE disk, which is completely
expected since I never thought PostgreSQL would introduce disk I/O
concurrency on its own. But this means that adding additional readers
doing random-access reads *should* scale very well with 12 underlying
disks in a stripe.

(Note that I have seen fairly similar results on other RAID variants
too, including software RAID5 (yes yes I know), in addition to the
hardware stripe.)

These tests have been done on Linux 2.6.19.3 and PostgreSQL 8.1.

Secondly, I am seeing a query plan switch after a certain threshold.
Observe:

perftest=# explain select * from test where val='7433' limit 1000;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit  (cost=0.00..4016.50 rows=1000 width=143)
   ->  Index Scan using test_val_ix on test  (cost=0.00..206620.88 rows=51443 width=143)
         Index Cond: ((val)::text = '7433'::text)
(3 rows)

Now increasing to a limit of 10000:

perftest=# explain select * from test where val='7433' limit 10000;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=360.05..38393.36 rows=10000 width=143)
   ->  Bitmap Heap Scan on test  (cost=360.05..196014.82 rows=51443 width=143)
         Recheck Cond: ((val)::text = '7433'::text)
         ->  Bitmap Index Scan on test_val_ix  (cost=0.00..360.05 rows=51443 width=0)
               Index Cond: ((val)::text = '7433'::text)
(5 rows)

The interesting part is that the latter query is entirely CPU bound (no
disk I/O at all) for an extended period of time before even beginning
to read data from disk. And when it *does* start performing disk I/O,
the performance is about the same as for the other case. In other
words, the change in query plan seems to do nothing but add overhead.
What is the bitmap heap scan supposed to be doing that would increase
performance above a "seek once per matching row" plan? I haven't been
able to Google my way to what the intended benefit is of a heap scan
vs. a plain index scan.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
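For readers who want to reproduce the situation on a small scale, a
minimal sketch of the schema and query shape under discussion might
look like the following. The table, column and index names are taken
from the EXPLAIN output below (the prose uses "value" where the plans
use "val"); the filler column, data types and row count are
illustrative guesses, not details from the thread.

-- Illustrative only: "test", "val" and "test_val_ix" appear in the
-- EXPLAIN output; "payload" and the sizes are invented.
CREATE TABLE test (
    val     text,    -- ~10000 distinct values
    payload text     -- filler to reach roughly 100-300 bytes per row
);

CREATE INDEX test_val_ix ON test (val);

-- Load some random data (far less than the >500 million rows discussed):
INSERT INTO test
SELECT ((random() * 10000)::int)::text,
       repeat('x', 150)
FROM generate_series(1, 1000000);

ANALYZE test;

-- The query under test:
SELECT * FROM test WHERE val = '7433' LIMIT 1000;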
Hello Peter,

If you are dealing with timed data or similar, you may consider
partitioning your table(s).

In order to deal with large data, I've built a "logical" partition
system, where the target partition is defined by the date of my data
(the date is part of the filenames that I import...). Instead of using
the Postgres partitioning framework, I keep the table boundaries in a
reference table.

Then I've built a function that takes the different query parameters as
arguments (column list, where clause...). This function retrieves the
list of tables to query from my reference table and builds the final
query, binding the different subqueries from each partition with
"UNION ALL". It also requires an additional reference table that
describes the table columns (data type, behaviour, e.g. groupable,
summable...).

This allowed me to replace many "delete" statements with "drop table"
statements, which is probably the main advantage of the solution. The
biggest issue was the implementation time ;-) but I'm really happy with
the resulting performance.

HTH,

Marc
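To make the "logical partitioning" idea above concrete, a rough sketch
of the approach could look like the following. Everything here - the
table names, the date-range columns, the function signature - is
hypothetical and much simpler than the implementation Marc describes
(which also keeps per-column metadata); it only illustrates the
reference table plus a function that assembles a UNION ALL query over
the matching partitions.

-- Hypothetical sketch of the "logical partition" idea.
CREATE TABLE partition_registry (
    table_name text PRIMARY KEY,
    day_start  date NOT NULL,
    day_end    date NOT NULL
);

-- Build (without executing) a UNION ALL query over the partitions that
-- overlap the requested date range.
CREATE OR REPLACE FUNCTION build_partition_query(p_columns text,
                                                 p_where   text,
                                                 p_from    date,
                                                 p_to      date)
RETURNS text AS $$
DECLARE
    r     record;
    parts text[] := '{}';
BEGIN
    FOR r IN
        SELECT table_name
        FROM partition_registry
        WHERE day_end >= p_from AND day_start <= p_to
        ORDER BY day_start
    LOOP
        parts := parts || ('SELECT ' || p_columns || ' FROM ' || r.table_name
                           || ' WHERE ' || p_where);
    END LOOP;
    RETURN array_to_string(parts, ' UNION ALL ');
END;
$$ LANGUAGE plpgsql;

-- Example (hypothetical) call:
-- SELECT build_partition_query('val, payload', 'val = ''7433''',
--                              '2007-03-01', '2007-03-31');

-- Expiring a day of data is then DROP TABLE plus a one-row DELETE from
-- partition_registry, instead of a large DELETE on one big table.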
On 2007-03-30, Peter Schuller <peter.schuller@infidyne.com> wrote:
[...]
> Other than absolute performance, an important goal is to be able to
> scale fairly linearly with the number of underlying disk drives. We
> are fully willing to take a disk seek per item selected, as long as
> it scales.
>
> To this end I have been doing some benchmarking to see whether the
> plan is going to be feasible. On a 12 disk hardware stripe, insertion
> performance does scale somewhat with concurrent inserters. However, I
> am seeing surprising effects with SELECT:s: a single selector
> generates the same amount of disk activity as two concurrent
> selectors (I was easily expecting about twice).
>
> The query is simple:
>
> SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;

I tested this on a 14-way software RAID10 on FreeBSD, using pg 8.1.6,
and couldn't reproduce anything like it. With one client I get about
200 disk requests per second, scaling almost exactly linearly for the
first 5 or so clients, as expected. At 14 clients it was down to about
150 reqs/sec per client, but the total throughput continued to increase
with additional concurrency up to about 60 clients, giving about 3600
reqs/sec (260 per disk, which is about right for 10krpm SCSI disks
under highly concurrent random loads).

> So my first question is - why am I not seeing this scaling?

A good question. Have you tried testing the disks directly? e.g. create
some huge files, and run a few concurrent random readers on them? That
would test the array and the filesystem without involving postgres.

> Secondly, I am seeing a query plan switch after a certain
> threshold. Observe:

[snip index scan changing to bitmapscan]

This is entirely expected. With the larger row count, it is more likely
(or so the planner estimates) that rows will need to be fetched from
adjacent or at least nearby blocks, thus a plan which fetches rows in
physical table order rather than index order would be expected to be
superior. The planner takes into account the estimated startup cost and
per-row cost when planning LIMIT queries; therefore it is no surprise
that for larger limits, it switches to a plan with a higher startup
cost but lower per-row cost.

> The interesting part is that the latter query is entirely CPU bound
> (no disk I/O at all) for an extended period of time before even
> beginning to read data from disk.

Most likely your index is small enough that large parts of it will be
cached in RAM, so that the scan of the index to build the bitmap does
not need to hit the disk much if at all.

> And when it *does* start performing disk I/O, the performance is
> about the same as for the other case. In other words, the change in
> query plan seems to do nothing but add overhead.

This is going to depend quite a bit on the physical ordering of the
data relative to the indexed values.

> What is the bitmap heap scan supposed to be doing that would increase
> performance above a "seek once per matching row" plan? I haven't been
> able to Google my way to what the intended benefit is of a heap scan
> vs. a plain index scan.

The bitmap scan visits the heap in heap order, rather than index order,
thus enabling it to take advantage of prefetch and other
sequential-read optimizations (in the underlying OS and disk subsystem,
rather than in pg itself).

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
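As a side note, the LIMIT costing described above can be checked
directly against the two EXPLAIN outputs earlier in the thread: the
Limit node's cost is simply the child plan's cost interpolated by the
fraction of the estimated rows that the limit will actually fetch. A
quick back-of-the-envelope check, using nothing but the numbers shown
in the plans:

-- Reproducing the planner's Limit costs from the child plan estimates:
SELECT 206620.88 * 1000 / 51443                       AS index_scan_limit_cost,
       360.05 + (196014.82 - 360.05) * 10000 / 51443  AS bitmap_scan_limit_cost;
-- Gives roughly 4016.5 and 38393, matching the two Limit nodes above.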
Hello,

> If you are dealing with timed data or similar, you may consider
> partitioning your table(s).

Unfortunately this is not the case; the insertion is more or less
random (not quite, but for the purpose of this problem it is).

Thanks for the pointers though. That is sure to be useful in some other
context down the road.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
Hello,

>> SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;
>
> I tested this on a 14-way software RAID10 on FreeBSD, using pg 8.1.6,
> and couldn't reproduce anything like it. With one client I get about
> 200 disk requests per second, scaling almost exactly linearly for the
> first 5 or so clients, as expected. At 14 clients it was down to
> about 150 reqs/sec per client, but the total throughput continued to
> increase with additional concurrency up to about 60 clients, giving
> about 3600 reqs/sec (260 per disk, which is about right for 10krpm
> SCSI disks under highly concurrent random loads).

Ok. That is very interesting; so there is definitely nothing
fundamental in PG that prevents the scaling (even if on FreeBSD).

> A good question. Have you tried testing the disks directly? e.g.
> create some huge files, and run a few concurrent random readers on
> them? That would test the array and the filesystem without involving
> postgres.

I have confirmed that I am seeing expected performance for random short
and highly concurrent reads in one large (> 200 GB) file. The I/O is
done using libaio however, so depending on implementation I suppose the
I/O scheduling behavior of the fs/raid driver might be affected
compared to having a number of concurrent threads doing synchronous
reads. I will try to confirm performance in a way that will more
closely match PostgreSQL's behavior.

I have to say though that I will be pretty surprised if the performance
is not matched in that test.

Is there any chance there is some operating system conditional code in
pg itself that might affect this behavior? Some kind of purposeful
serialization of I/O for example (even if that sounds like an extremely
strange thing to do)?

> This is entirely expected. With the larger row count, it is more
> likely (or so the planner estimates) that rows will need to be
> fetched from adjacent or at least nearby blocks, thus a plan which
> fetches rows in physical table order rather than index order would be
> expected to be superior. The planner takes into account the estimated
> startup cost and per-row cost when planning LIMIT queries; therefore
> it is no surprise that for larger limits, it switches to a plan with
> a higher startup cost but lower per-row cost.

Roger that, makes sense. I had misunderstood the meaning of the heap
scan.

> Most likely your index is small enough that large parts of it will be
> cached in RAM, so that the scan of the index to build the bitmap does
> not need to hit the disk much if at all.

Even so however, several seconds of CPU activity to scan the index for
a few tens of thousands of entries sounds a bit excessive. Or does it
not? Because at that level, the CPU bound period alone is approaching
the time it would take to seek for each entry instead. But then I
presume the amount of work is similar/the same for the other case,
except it's being done at the beginning of the query instead of before
each seek.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
On 2007-04-02, Peter Schuller <peter.schuller@infidyne.com> wrote:
> I have confirmed that I am seeing expected performance for random
> short and highly concurrent reads in one large (> 200 GB) file. The
> I/O is done using libaio however, so depending on implementation I
> suppose the I/O scheduling behavior of the fs/raid driver might be
> affected compared to having a number of concurrent threads doing
> synchronous reads. I will try to confirm performance in a way that
> will more closely match PostgreSQL's behavior.
>
> I have to say though that I will be pretty surprised if the
> performance is not matched in that test.

The next question then is whether anything in your postgres
configuration is preventing it getting useful performance from the OS.
What settings have you changed in postgresql.conf? Are you using any
unusual settings within the OS itself?

> Is there any chance there is some operating system conditional code
> in pg itself that might affect this behavior?

Unlikely.

>> Most likely your index is small enough that large parts of it will
>> be cached in RAM, so that the scan of the index to build the bitmap
>> does not need to hit the disk much if at all.
>
> Even so however, several seconds of CPU activity to scan the index
> for a few tens of thousands of entries sounds a bit excessive. Or
> does it not? Because at that level, the CPU bound period alone is
> approaching the time it would take to seek for each entry instead.
> But then I presume the amount of work is similar/the same for the
> other case, except it's being done at the beginning of the query
> instead of before each seek.

You're forgetting the LIMIT clause. For the straight index scan, the
query aborts when the LIMIT is reached, having scanned only the
specified number of index rows (plus any index entries that turned out
to be dead in the heap). For the bitmap scan case, the limit can be
applied only after the heap scan is under way, therefore the index scan
to build the bitmap will need to scan ~50k rows, not the 10k specified
in the limit, so the amount of time spent scanning the index is 50
times larger than in the straight index scan case.

However, I do suspect you have a problem here somewhere, because in my
tests the time taken to do the bitmap index scan on 50k rows, with the
index in cache, is on the order of 30ms (where the data is cached in
shared_buffers) to 60ms (where the data is cached by the OS). That's on
a 2.8GHz xeon.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
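One way to pin down where those seconds are going (a hypothetical
diagnostic, not something proposed in the thread itself) is to run the
query under EXPLAIN ANALYZE, which reports actual time per plan node,
and to compare against the other plan shape by temporarily discouraging
bitmap scans for the session:

-- Per-node actual timings for the bitmap plan: the Bitmap Index Scan
-- line shows the index phase, the Bitmap Heap Scan line the heap phase.
-- Note that EXPLAIN ANALYZE actually executes the query.
EXPLAIN ANALYZE SELECT * FROM test WHERE val = '7433' LIMIT 10000;

-- For comparison, force the plain index scan for the same query:
SET enable_bitmapscan = off;
EXPLAIN ANALYZE SELECT * FROM test WHERE val = '7433' LIMIT 10000;
RESET enable_bitmapscan;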
Hello,

> The next question then is whether anything in your postgres
> configuration is preventing it getting useful performance from the
> OS. What settings have you changed in postgresql.conf?

The only options not commented out are the following (it's not even
tweaked for buffer sizes and such, since in this case I am not
interested in things like sort performance and cache locality other
than as an afterthought):

hba_file = '/etc/postgresql/8.1/main/pg_hba.conf'
ident_file = '/etc/postgresql/8.1/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/8.1-main.pid'
listen_addresses = '*'
port = 5432
max_connections = 100
unix_socket_directory = '/var/run/postgresql'
ssl = true
shared_buffers = 1000
log_line_prefix = '%t '
stats_command_string = on
stats_row_level = on
autovacuum = on
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

> Are you using any unusual settings within the OS itself?

No. It's a pretty standard kernel. The only local tweaking done is
enabling/disabling various things; there are no special patches used or
attempts to create a minimalistic kernel or anything like that.

> You're forgetting the LIMIT clause. For the straight index scan, the
> query aborts when the LIMIT is reached, having scanned only the
> specified number of index rows (plus any index entries that turned
> out to be dead in the heap). For the bitmap scan case, the limit can
> be applied only after the heap scan is under way, therefore the index
> scan to build the bitmap will need to scan ~50k rows, not the 10k
> specified in the limit, so the amount of time spent scanning the
> index is 50 times larger than in the straight index scan case.

Ok - makes sense that it has to scan the entire subset of the index for
the value in question. I will have to tweak the CPU/disk cost settings
(which I have, on purpose, not yet done).

> However, I do suspect you have a problem here somewhere, because in
> my tests the time taken to do the bitmap index scan on 50k rows, with
> the index in cache, is on the order of 30ms (where the data is cached
> in shared_buffers) to 60ms (where the data is cached by the OS).
> That's on a 2.8GHz xeon.

This is on a machine with 2.33GHz xeons and I wasn't trying to
exaggerate. I timed it and it is CPU bound (in userspace; next to no
system CPU usage at all) for about 15 seconds for the case of selecting
with a limit of 10000.

Given that there is no disk activity I can't imagine any buffer sizes
or such affecting this other than userspace vs. kernelspace CPU
concerns (since obviously the data being worked on is in RAM). Or am I
missing something?

It is worth noting that the SELECT of fewer entries is entirely disk
bound; there is almost no CPU usage whatsoever. Even taking the
cumulative CPU usage into account (gut feeling calculation, nothing
scientific) and multiplying by 50 you are nowhere near 15 seconds of
CPU boundness. So it is indeed strange.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
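For the record, the "CPU/disk cost settings" referred to above are
ordinary planner GUCs that can be inspected and experimented with per
session before touching postgresql.conf. The SET values below are
purely illustrative, not recommendations from the thread:

-- The planner cost knobs can be inspected and tweaked per session:
SHOW random_page_cost;        -- cost of a random page fetch, relative to sequential
SHOW cpu_tuple_cost;          -- per-row CPU cost
SHOW cpu_index_tuple_cost;    -- per-index-entry CPU cost
SHOW effective_cache_size;    -- planner's assumption about OS + PG caching

-- Example experiment (session-local; values are made up for illustration):
SET random_page_cost = 2;
SET effective_cache_size = 262144;   -- ~2 GB expressed in 8 kB pages (8.1 units)
EXPLAIN SELECT * FROM test WHERE val = '7433' LIMIT 10000;
RESET random_page_cost;
RESET effective_cache_size;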
On 4-Apr-07, at 2:01 AM, Peter Schuller wrote:

[...]

> shared_buffers = 1000

This is way too low. If this is 8.x, then set it to 25% of available
memory, and effective_cache_size should be 3x that.
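As a worked example of that rule of thumb - assuming, purely
hypothetically, a machine with 4 GB of RAM, since the thread never
states how much memory the box has - and remembering that 8.1 takes
both settings in 8 kB pages rather than in memory units:

# postgresql.conf sketch for a hypothetical 4 GB machine (8.1 units: 8 kB pages)
shared_buffers = 131072          # 131072 * 8 kB = 1 GB, i.e. ~25% of RAM
effective_cache_size = 393216    # 3 x shared_buffers = 3 GB, in 8 kB pages

Changing shared_buffers requires a server restart, and may also require
raising the kernel's SHMMAX limit.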
On 2007-04-04, Peter Schuller <peter.schuller@infidyne.com> wrote:
>> The next question then is whether anything in your postgres
>> configuration is preventing it getting useful performance from the
>> OS. What settings have you changed in postgresql.conf?
>
> The only options not commented out are the following (it's not even
> tweaked for buffer sizes and such, since in this case I am not
> interested in things like sort performance and cache locality other
> than as an afterthought):
>
> shared_buffers = 1000

I'd always do benchmarks with a realistic value of shared_buffers (i.e.
much higher than that).

Another thought that comes to mind is that the bitmap index scan does
depend on the size of work_mem.

Try increasing your shared_buffers to a reasonable working value (say
10%-15% of RAM - I was testing on a machine with 4GB of RAM, using a
shared_buffers setting of 50000), and increase work_mem to 16364, and
see if there are any noticeable changes in behaviour.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Hello,

> I'd always do benchmarks with a realistic value of shared_buffers
> (i.e. much higher than that).
>
> Another thought that comes to mind is that the bitmap index scan does
> depend on the size of work_mem.
>
> Try increasing your shared_buffers to a reasonable working value (say
> 10%-15% of RAM - I was testing on a machine with 4GB of RAM, using a
> shared_buffers setting of 50000), and increase work_mem to 16364, and
> see if there are any noticeable changes in behaviour.

Increasing the buffer size and work_mem did have a significant effect.
I can understand it in the case of the heap scan, but I am still
surprised at the index scan. Could pg be serializing the entire query
as a result of insufficient buffers/work_mem to satisfy multiple
concurrent queries?

With both turned up, not only is the heap scan no longer visibly CPU
bound, I am seeing some nice scaling in terms of disk I/O. I have not
yet benchmarked to the point of being able to say whether it's entirely
linear, but it certainly seems to at least be approaching the ballpark.

Thank you for the help! I guess I made a bad call not tweaking this. My
thinking was that I explicitly did not want to turn it up, so that I
could benchmark the raw performance of disk I/O rather than having
things cached in memory more than they already would be. But apparently
it had other side effects I did not consider.

Thanks again,

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org