Thread: Questions on query planner, join types, and work_mem
I have spent the last couple of weeks digging into a Postgres performance problem that ultimately boiled down to this: the planner was choosing hash joins even though the hash table built from the join keys was much larger than the configured work_mem. We found we could make performance much better by either
1) increasing work_mem to 500MB or more, or
2) forcing the planner to choose index-backed nested loops by turning off hash and merge joins as well as bitmap and sequential scans.
Now we are trying to decide which of these paths to choose, and asking why the planner doesn't handle this for us.
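Concretely, the two workarounds amount to session-level settings along these lines (the 500MB figure is just one of the values we tried):

-- option 1: raise work_mem for the session
set work_mem = '500MB';

-- option 2: force index-backed nested loops
set enable_hashjoin = off;
set enable_mergejoin = off;
set enable_bitmapscan = off;
set enable_seqscan = off;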
Background: LabKey builds an open source platform for biomedical research data. The platform consists of a Tomcat web application and a relational database. We support two databases, Postgres and SQL Server. We started with SQL Server because we were very familiar with it; two of our technical team came from the SQL Server development team. We chose Postgres because we assessed that it was the open source database most likely to handle our application's requirements for capacity and for complex, nested, generated SQL. Postgres is now the default database for our platform and most of our key customers use it. In general we've been very satisfied with Postgres' performance and compatibility, but our customers are starting to hit situations where we really need to be able to understand why a particular operation is slow. We are currently recommending version 8.4 and using that ourselves.
The core of the problem query was
SELECT * INTO snapshot_table FROM
(SELECT ... FROM tableA A LEFT OUTER JOIN tableB B ON (A.lsid = B.lsid) and A.datasetid = ? ) query1
The join column, lsid, is a poor choice for a join column: it is a long varchar value (average length 101 characters) that only becomes unique far out on the right-hand side. But we are stuck with this choice. I can post the SQL query and table definitions if it will help, but changes to either of those would be risky and difficult, whereas setting the work_mem value or forcing nested loop joins is less risky.
The performance curve looks something like this:

Join Type     work_mem (MB)   time to populate snapshot (min)
--------------------------------------------------------------
Hash                 50             85
Hash                200             38
Hash                400             21
Hash                500             12
Hash               1000             12
--------------------------------------------------------------
NestedLoop           50             15
NestedLoop          200             11
NestedLoop          400             11
NestedLoop          500             10
NestedLoop         1000             10
--------------------------------------------------------------
Table A contains about 3.5 million rows, and table B contains about 4.4 million rows. From the EXPLAIN ANALYZE reports I concluded that the planner seemed to be accurately estimating the approximate number of rows returned on each side of the join node. I also noticed that in the work_mem = 50MB test, the hash join execution used over a GB of space in pgsql_tmp; that space grew and shrank slowly over the course of the test.
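(For anyone reproducing this, one rough way to watch that temp space from SQL rather than from the filesystem is a superuser query against the data directory; this assumes the default tablespace and that base/pgsql_tmp already exists:)

select pg_size_pretty(sum((pg_stat_file('base/pgsql_tmp/' || f)).size)::bigint)
  from pg_ls_dir('base/pgsql_tmp') f;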
Now for the questions:
1) If we tell the customer to set his work_mem value to 500MB or 1GB in postgresql.conf, what problems might they see? The documentation and the guidelines we received from Rupinder Singh in support suggest a much lower value, e.g. a max work_mem of 10MB. Other documentation, such as the "Guide to Posting Slow Query Questions", suggests at least testing up to 1GB. What is a reasonable maximum to configure for all connections?
2) How is work_mem used by a query execution? For example, does each hash table in an execution get allocated a full work_mem's worth of memory? Is this memory released when the query finishes, or does it stay attached to the connection or some other object?
3) Is there a reason why the planner doesn't seem to recognize when the hash table won't fit in the current work_mem, and choose a low-memory plan instead?
Excuse the long-winded post; I was trying to give the facts and nothing but the facts.
Thanks,
Peter Hussey
LabKey Software
Hi,

On Tue, Jul 27, 2010 at 04:08:16PM -0700, Peter Hussey wrote:
> Now for the questions:
> 1) If we tell the customer to set his work_mem value to 500MB or 1GB in
> postgresql.conf, what problems might they see? The documentation and the
> guidelines we received from Rupinder Singh in support suggest a much lower
> value, e.g. a max work_mem of 10MB. Other documentation such as the "Guide
> to Posting Slow Query Questions" suggests at least testing up to 1GB. What
> is a reasonable maximum to configure for all connections?

Well. That depends on the amount of expected concurrency and available memory. Obviously you can set it way higher in an OLAPish, low-concurrency setting than in an OLTP environment.

That setting is significantly complex to estimate in my opinion. For one, the actual usage depends on the complexity of the queries; for another, to be halfway safe you have to use avail_mem/(max_connections * max_nodes_of_most_complex_query), which is often a very pessimistic and unusably low estimate.

> 2) How is work_mem used by a query execution? For example, does each hash
> table in an execution get allocated a full work_mem's worth of memory? Is
> this memory released when the query is finished, or does it stay attached to
> the connection or some other object?

Each node of the query can use one work_mem worth of data (sometimes a bit more). The memory is released after the query finishes (or possibly earlier, depending on the structure of the query). The specific allocation pattern and implementation details (of malloc) influence how and when that memory is actually returned to the OS.

> 3) is there a reason why the planner doesn't seem to recognize the condition
> when the hash table won't fit in the current work_mem, and choose a
> low-memory plan instead?

Hard to say without more information. Bad estimates, maybe? Best show your query plan (EXPLAIN ANALYZE), the table definitions, and some details about the hardware (i.e. whether it has 1GB of memory or 256GB).

Andres
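(To make that rule of thumb concrete with made-up numbers: 8GB of memory available for query work, max_connections = 100, and perhaps 4 memory-hungry nodes in the worst query gives 8GB / (100 * 4) = 20MB per node, which is why the "safe" estimate lands so far below the 500MB that made this particular query fast.)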
Peter Hussey <peter@labkey.com> writes:
> I have spent the last couple of weeks digging into a Postgres performance
> problem that ultimately boiled down to this: the planner was choosing hash
> joins even though the hash table built from the join keys was much larger
> than the configured work_mem.

What Postgres version is this, exactly? ("8.4" is not the answer I want.)

> the join column, lsid, is a poor choice for a join column: it is a long
> varchar value (average length 101 characters) that only becomes unique far
> out on the right-hand side.

Hm, but it is unique eventually? It's not necessarily bad for hashing as long as that's so.

> 1) If we tell the customer to set his work_mem value to 500MB or 1GB in
> postgresql.conf, what problems might they see?

That would almost certainly be disastrous. If you have to follow the hack-work_mem path, I'd suggest increasing it locally in the session executing the problem query, and only for the duration of that query. Use SET, or even SET LOCAL.

> 2) How is work_mem used by a query execution?

Well, the issue you're hitting is that the executor is dividing the query into batches to keep the size of the in-memory hash table below work_mem. The planner should expect that and estimate the cost of the hash technique appropriately, but seemingly it's failing to do so. Since you didn't provide EXPLAIN ANALYZE output, though, it's hard to be sure.

> 3) is there a reason why the planner doesn't seem to recognize the condition
> when the hash table won't fit in the current work_mem, and choose a
> low-memory plan instead?

That's the question, all right. I wonder if it's got something to do with the wide-varchar nature of the join key ... but again that's just speculation with no facts. Please show us EXPLAIN ANALYZE results for the hash plan with both small and large work_mem, as well as for the nestloop plan.

			regards, tom lane
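In practice the per-query form Tom describes looks something like this (the 500MB figure is just the value from the earlier tests):

begin;
set local work_mem = '500MB';
select * into snapshot_table from ( ... ) query1;
commit;   -- the setting reverts automatically at transaction end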
Excerpts from Tom Lane's message of mar jul 27 20:05:02 -0400 2010: > Peter Hussey <peter@labkey.com> writes: > > 2) How is work_mem used by a query execution? > > Well, the issue you're hitting is that the executor is dividing the > query into batches to keep the size of the in-memory hash table below > work_mem. The planner should expect that and estimate the cost of > the hash technique appropriately, but seemingly it's failing to do so. > Since you didn't provide EXPLAIN ANALYZE output, though, it's hard > to be sure. Hmm, I wasn't aware that hash joins worked this way wrt work_mem. Is this visible in the explain output? If it's something subtle (like an increased total cost), may I suggest that it'd be a good idea to make it explicit somehow in the machine-readable outputs?
Hello,

> the join column, lsid, is a poor choice for a join column as it is a
> long varchar value (avg length 101 characters) that only becomes unique
> far out on the right-hand side.

Would a join on a substring of the 'far out on the right-hand side' part (did you mean the rightmost characters, or that almost all of the 101 characters are needed before values become unique?), together with a function-based index, help?

Regards,
Jayadevan
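For concreteness, the kind of rewrite being suggested might look like the following (the offset 80 and the index name are purely illustrative, and whether the tail of lsid is selective enough is exactly the open question):

create index b_lsid_tail on tableB (substr(lsid, 80));

select ...
  from tableA A
  left outer join tableB B
    on substr(A.lsid, 80) = substr(B.lsid, 80)
   and A.lsid = B.lsid
   and A.datasetid = ?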
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Tom Lane's message of mar jul 27 20:05:02 -0400 2010: >> Well, the issue you're hitting is that the executor is dividing the >> query into batches to keep the size of the in-memory hash table below >> work_mem. The planner should expect that and estimate the cost of >> the hash technique appropriately, but seemingly it's failing to do so. > Hmm, I wasn't aware that hash joins worked this way wrt work_mem. Is > this visible in the explain output? As of 9.0, any significant difference between "Hash Batches" and "Original Hash Batches" would be a cue that the planner blew the estimate. For Peter's problem, we're just going to have to look to see if the estimated cost changes in a sane way between the small-work_mem and large-work_mem cases. regards, tom lane
Peter Hussey <peter@labkey.com> writes:
> Using the default of 1MB work_mem, the planner chooses a hash join plan:
> "Hash Left Join (cost=252641.82..11847353.87 rows=971572 width=111) (actual
> time=124196.670..280461.604 rows=968080 loops=1)"
> ...
> For the same default 1MB work_mem, a nested loop plan is better
> "Nested Loop Left Join (cost=8.27..15275401.19 rows=971572 width=111)
> (actual time=145.015..189957.023 rows=968080 loops=1)"
> ...

Hm. A nestloop with nearly a million rows on the outside is pretty scary. The fact that you aren't unhappy with that version of the plan, rather than the hash, indicates that the "object" table must be fully cached in memory, otherwise the repeated indexscans would be a lot slower than this:

> "  ->  Index Scan using uq_object on object obj (cost=0.00..3.51 rows=1
> width=95) (actual time=0.168..0.170 rows=1 loops=968080)"
> "        Index Cond: ((sd.lsid)::text = (obj.objecturi)::text)"

My take on it is that the estimate of the hash plan's cost isn't bad; what's bad is that the planner is mistakenly estimating the nestloop as being worse. What you need to do is adjust the planner's cost parameters so that it has a better idea of the true cost of repeated index probes in your environment. Crank up effective_cache_size if you didn't already, and experiment with lowering random_page_cost. See the list archives for more discussion of these parameters.

			regards, tom lane
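Those are ordinary settings that can be experimented with per session before touching postgresql.conf; the values below are only illustrative starting points:

set effective_cache_size = '2GB';
set random_page_cost = 2;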
I already had effective_cache_size set to 500MB.
I experimented with lowering random_page_cost to 3, then 2. It made no difference in the choice of plan that I could see. In the EXPLAIN ANALYZE output the estimated costs of the nested loop were in fact lowered, but so were the costs of the hash join plan, and the hash join remained the lowest predicted cost in all tests I tried.
What seems wrong to me is that the hash join strategy shows almost no difference in estimated costs as work_mem goes from 1MB to 500MB. The cost function decreases by 1%, but the actual time for the query to execute decreases by 86% as work_mem goes from 1MB to 500MB.
My questions are still
1) Does the planner have any component of cost calculations based on the size of work_mem, and if so why do those calculations seem to have so little effect here?
2) Why is the setting of work_mem something left to the admin and/or developer? Couldn't the optimizer say how much it thinks it needs to build a hash table based on size of the keys and estimated number of rows?
It is difficult for a software development platform like ours to take advantage of suggestions to set work_mem, or to change the cost function, or to turn join strategies on or off for individual queries. The SQL we issue is formed by user interaction with the product and is rarely static. How would we know when to turn something on or off? That's why I'm looking for a configuration solution that I can set on a database-wide basis and have it work well for all queries.
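(For example, a per-database override is the kind of scope I mean; the database name and value here are only placeholders:)

alter database labkey set work_mem = '100MB';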
thanks
Peter
--
Peter Hussey
LabKey Software
206-667-7193 (office)
206-291-5625 (cell)
On Mon, Aug 2, 2010 at 5:23 PM, Peter Hussey <peter@labkey.com> wrote:
> I already had effective_cache_size set to 500MB.
>
> I experimented with lowering random_page_cost to 3, then 2. It made no
> difference in the choice of plan that I could see. In the explain analyze
> output the estimated costs of the nested loop were in fact lowered, but so
> were the costs of the hash join plan, and the hash join remained the lowest
> predicted cost in all tests I tried.

What do you get if you set random_page_cost to a small value such as 0.01?

> What seems wrong to me is that the hash join strategy shows almost no
> difference in estimated costs as work_mem goes from 1MB to 500MB. The cost
> function decreases by 1%, but the actual time for the query to execute
> decreases by 86% as work_mem goes from 1MB to 500MB.

Wow. It would be interesting to find out how many batches are being used. Unfortunately, releases prior to 9.0 don't display that information.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Peter Hussey <peter@labkey.com> writes: > My questions are still > 1) Does the planner have any component of cost calculations based on the > size of work_mem, Sure. > and if so why do those calculations seem to have so > little effect here? Since you haven't provided sufficient information to let someone else reproduce what you're seeing, it's pretty hard to say. It might have something to do with the particularly wide join key values you're using, but that's mere speculation based on the one tidbit you provided. There might be some other effect altogether that's making it do the wrong thing. > 2) Why is the setting of work_mem something left to the admin and/or > developer? Because we're not smart enough to find a way to avoid that. regards, tom lane
On Mon, 2010-08-02 at 14:23 -0700, Peter Hussey wrote:
> I already had effective_cache_size set to 500MB.
>
> I experimented with lowering random_page_cost to 3 then 2.

In the case of a fully cached database it is closer to 1.

> 2) Why is the setting of work_mem something left to the admin and/or
> developer? Couldn't the optimizer say how much it thinks it needs to
> build a hash table based on size of the keys and estimated number of
> rows?

Yes, it can say how much it thinks it needs to build a hash table; the part it can't figure out is how much it can afford, based on things like the number of concurrent queries and how much work_mem those are using. Any work_mem used will be subtracted from the total memory pool, which also affects how much of the files the system can cache.

> It is difficult for a software development platform like ours to take
> advantage of suggestions to set work_mem, or to change the cost
> function, or turn join strategies on or off for individual queries. The
> SQL we issue is formed by user interaction with the product and is rarely
> static. How would we know when to turn something on or off? That's
> why I'm looking for a configuration solution that I can set on a
> database-wide basis and have it work well for all queries.

Keep trying. The closer you get with your conf to real conditions, the better choices the optimiser can make ;)

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training
On Tue, Aug 3, 2010 at 3:03 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > In case of fully cached database it is closer to 1. In the case of a fully cached database I believe the correct answer begins with a decimal point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Peter Hussey <peter@labkey.com> wrote: > I already had effective_cache_size set to 500MB. That seems awfully small. You do realize that this setting does not cause PostgreSQL to allocate any memory; it merely advises how much disk space is likely to be cached. It should normally be set to the sum of your shared_buffers setting and whatever your OS reports as cached. Setting it too small will discourage the optimizer from picking plans which use indexes. > I experimented with lowering random_page_cost to 3 then 2. As others have said, in a fully cached system that's still too high. If the active portion of your database is fully cached, you should set random_page_cost and seq_page_cost to the same value, and that value should probably be in the range of 0.1 to 0.005. It can get trickier if the active portion is largely but not fully cached; we have one server where we found, through experimentation, that we got better plans overall with seq_page_cost = 0.3 and random_page_cost = 0.5 than any other settings we tried. -Kevin
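As a concrete sketch of the "fully cached" style of settings Kevin describes (tried per session first; the exact values are the experimental part):

set seq_page_cost = 0.1;
set random_page_cost = 0.1;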
On Wed, 2010-08-04 at 09:14 -0400, Robert Haas wrote: > On Tue, Aug 3, 2010 at 3:03 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > > In case of fully cached database it is closer to 1. > > In the case of a fully cached database I believe the correct answer > begins with a decimal point. The number 1 here was suggested in relation to seq_page_cost, which is 1. For fully cached db there is no additional seek time for random access, so seq_page_cost == random_page_cost. Of course there are more variables than just *_page_cost, so if you nail down any other one, you may end with less than 1 for both page costs. I have always used seq_page_cost = 1 in my thinking and adjusted others relative to it. -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Hannu Krosing <hannu@2ndquadrant.com> writes:
> Of course there are more variables than just *_page_cost, so if you nail
> down any other one, you may end with less than 1 for both page costs.
>
> I have always used seq_page_cost = 1 in my thinking and adjusted others
> relative to it.

Right, seq_page_cost = 1 is sort of the traditional reference point, but you don't have to do it that way. The main point here is that for an all-in-RAM database, the standard page access costs are too high relative to the CPU effort costs:

regression=# select name, setting from pg_settings where name like '%cost';
         name         | setting
----------------------+---------
 cpu_index_tuple_cost | 0.005
 cpu_operator_cost    | 0.0025
 cpu_tuple_cost       | 0.01
 random_page_cost     | 4
 seq_page_cost        | 1
(5 rows)

To model an all-in-RAM database, you can either dial down both random_page_cost and seq_page_cost to 0.1 or so, or set random_page_cost to 1 and increase all the CPU costs. The former is less effort ;-)

It should be noted also that there's not all that much evidence backing up the default values of the cpu_xxx_cost variables. In the past those didn't matter much because I/O costs always swamped CPU costs anyway. But I can foresee us having to twiddle those defaults and maybe refine the CPU cost model more, as all-in-RAM cases get more common.

			regards, tom lane
On Wed, 2010-08-04 at 14:00 -0400, Tom Lane wrote:
> To model an all-in-RAM database, you can either dial down both
> random_page_cost and seq_page_cost to 0.1 or so, or set random_page_cost
> to 1 and increase all the CPU costs. The former is less effort ;-)
>
> It should be noted also that there's not all that much evidence backing
> up the default values of the cpu_xxx_cost variables. In the past those
> didn't matter much because I/O costs always swamped CPU costs anyway.
> But I can foresee us having to twiddle those defaults and maybe refine
> the CPU cost model more, as all-in-RAM cases get more common.

Especially the context switch + copy between shared buffers and the system disk cache will become noticeable at these speeds.

An easy way to test it is loading a table with a few indexes, once with a shared_buffers value that is enough for only the main table and once with one that fits both the table and the indexes.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training
On Wed, 2010-08-04 at 21:41 +0300, Hannu Krosing wrote:
> Especially the context switch + copy between shared buffers and the system
> disk cache will become noticeable at these speeds.
>
> An easy way to test it is loading a table with a few indexes, once with
> a shared_buffers value that is enough for only the main table and
> once with one that fits both the table and the indexes.

OK, just to back this up I ran the following test with 28MB and 128MB shared buffers.

create table sbuf_test(f1 float, f2 float, f3 float);
create index sbuf_test1 on sbuf_test(f1);
create index sbuf_test2 on sbuf_test(f2);
create index sbuf_test3 on sbuf_test(f3);

and then did the following 3 times for each shared_buffers setting:

truncate sbuf_test;
insert into sbuf_test
  select random(), random(), random() from generate_series(1,600000);

The main table size was 31MB, the indexes were 18MB each, for a total size of 85MB.

In the case of 128MB shared buffers, the insert ran in 14 sec (+/- 1 sec).

In the case of 28MB shared buffers, the insert ran between 346 and 431 sec, that is 20-30 _times_ slower.

There was ample space for keeping the indexes in the Linux cache (it has 1GB cached currently), though the system may have decided to start writing them to disk, so I suspect that most of the time was spent copying random index pages back and forth between shared buffers and disk cache.

I did not verify this, so there may be some other factors involved, but this seems like the most obvious suspect.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training
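For reference, sizes like the 31MB / 18MB figures above can be checked with the standard size functions (the pattern simply matches the test table and its indexes):

select relname, pg_size_pretty(pg_relation_size(oid))
  from pg_class
 where relname like 'sbuf_test%';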
Hannu Krosing wrote: > There was ample space for keeping the indexes in linux cache (it has 1GB > cached currently) though the system may have decided to start writing it > to disk, so I suspect that most of the time was spent copying random > index pages back and forth between shared buffers and disk cache. > Low shared_buffers settings will result in the same pages more often being written multiple times per checkpoint, particularly index pages, which is less efficient than keeping in the database cache and updating them there. This is a slightly different issue than just the overhead of copying them back and forth; by keeping them in cache, you actually reduce writes to the OS cache. What I do to quantify that is...well, the attached shows it better than I can describe; only works on 9.0 or later as it depends on a feature I added for this purpose there. It measures exactly how much buffer cache churn happened during a test, in this case creating a pgbench database. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
This time with attachment... -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Hannu Krosing <hannu@2ndquadrant.com> writes: > There was ample space for keeping the indexes in linux cache (it has 1GB > cached currently) though the system may have decided to start writing it > to disk, so I suspect that most of the time was spent copying random > index pages back and forth between shared buffers and disk cache. If you're on a platform that has oprofile, you could probably verify that rather than just guess it ... regards, tom lane
Greg Smith <greg@2ndquadrant.com> wrote:
> What I do to quantify that is...well, the attached shows it better
> than I can describe; only works on 9.0 or later as it depends on a
> feature I added for this purpose there. It measures exactly how
> much buffer cache churn happened during a test, in this case
> creating a pgbench database.

I'm not entirely sure I understand what I'm supposed to get from that. On a 3GB workstation, a compile from a recent HEAD checkout, with a default postgresql.conf file, I get this:

-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:25:46.683766-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 0
buffers_alloc      | 73

Initializing pgbench

-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:27:49.062551-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 633866
buffers_alloc      | 832

I boost shared_buffers from 32MB to 320MB, restart, and get this:

-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:30:42.816719-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 0
buffers_alloc      | 0

Initializing pgbench

-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:32:40.750098-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 630794
buffers_alloc      | 2523

So run time dropped from 123 seconds to 118 seconds, buffers_backend dropped by less than 0.5%, and buffers_alloc went up. Assuming this is real, and not just "in the noise" -- what conclusions would you draw from this? Dedicating an additional 10% of my free memory got me a 4% speed improvement? Was I supposed to try with other scales? Which ones?

-Kevin
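(Presumably the snapshots above come from running something close to the following in expanded mode before and after each step; the actual script is in Greg's attachment, which is not reproduced here:)

select now(), * from pg_stat_bgwriter;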
On Wed, 2010-08-04 at 22:03 +0300, Hannu Krosing wrote:
> > An easy way to test it is loading a table with a few indexes, once with
> > a shared_buffers value that is enough for only the main table and
> > once with one that fits both the table and the indexes.

I re-ran the test and checked idx_blks_read for the 28MB case:

hannu=# select * from pg_statio_user_indexes where relname = 'sbuf_test';

 schemaname |  relname  | indexrelname | idx_blks_read | idx_blks_hit
------------+-----------+--------------+---------------+--------------
 hannu      | sbuf_test | sbuf_test1   |         71376 |      1620908
 hannu      | sbuf_test | sbuf_test2   |         71300 |      1620365
 hannu      | sbuf_test | sbuf_test3   |         71436 |      1619619

This means that a total of 214112 index blocks were read back from disk cache (and obviously at least some of them had to be copied the other way as well). This seems to indicate about 1 ms for moving pages over the user/system boundary (Intel Core2 Duo T7500 @ 2.20GHz, Ubuntu 9.10, 4GB RAM).

For 128MB shared buffers, the total idx_blks_read for the 3 indexes was about 6300.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training
Kevin Grittner wrote: > Assuming this is real, and not just "in the noise" -- what conclusions would you > draw from this? Was trying to demonstrate the general ability of pg_stat_bgwriter snapshots at points in time to directly measure the buffer activity Hannu was theorizing about, not necessarily show a useful benchmark of any sort with that. Watching pgbench create a database isn't all that interesting unless you either a) increase the database scale such that at least one timed checkpoint kicks in, or b) turn on archive_mode so the whole WAL COPY optimization is defeated. More on this topic later, just happened to have that little example script ready to demonstrate the measurement concept. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Hannu Krosing wrote:
> Do you mean "written to disk", or written out from shared_buffers to
> disk cache?

The latter turns into the former eventually, so both really. The kernel will do some amount of write combining for you if you're lucky. But not in all cases; it may decide to write something out to physical disk before the second write shows up.

--
Greg Smith   2ndQuadrant US   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Hannu Krosing <hannu@2ndquadrant.com> wrote: > This seems to indicate about 1 ms for moving pages over > user/system boundary. (Intel Core2 Duo T7500 @ 2.20GHz, Ubuntu > 9.10, 4GB RAM) Using Greg's test script on a box with two cores like this: Intel(R) Pentium(R) D CPU 3.40GHz Linux kgrittn-desktop 2.6.31-22-generic #60-Ubuntu SMP Thu May 27 00:22:23 UTC 2010 i686 GNU/Linux Dividing the run time by accumulated buffers_backend, it comes to less than 0.2 ms per dirty buffer flushed. If I get a few spare ticks I'll try again while checking what vmstat and oprofile say about how much of that went to things besides the transfer from shared buffers to the OS. I mean, it's possible I was waiting on actual disk I/O at some point. -Kevin
Greg Smith <greg@2ndquadrant.com> wrote: > Was trying to demonstrate the general ability of pg_stat_bgwriter > snapshots at points in time to directly measure the buffer > activity Hannu was theorizing about, not necessarily show a useful > benchmark of any sort with that. Ah, OK. Sorry I didn't pick up on that; I was struggling to tease out some particular effect you expected to see in the numbers from that particular run. :-/ -Kevin
On Wed, 2010-08-04 at 15:16 -0400, Greg Smith wrote:
> Low shared_buffers settings will result in the same pages more often
> being written multiple times per checkpoint,

Do you mean "written to disk", or written out from shared_buffers to disk cache?

> particularly index pages,
> which is less efficient than keeping them in the database cache and updating
> them there. This is a slightly different issue than just the overhead
> of copying them back and forth; by keeping them in cache, you actually
> reduce writes to the OS cache.

That's what I meant. Both writes to and reads from the OS cache take a significant amount of time once you are not doing real disk I/O.
Tom Lane wrote:
> Right, seq_page_cost = 1 is sort of the traditional reference point,
> but you don't have to do it that way. The main point here is that for
> an all-in-RAM database, the standard page access costs are too high
> relative to the CPU effort costs:
> [...]
> To model an all-in-RAM database, you can either dial down both
> random_page_cost and seq_page_cost to 0.1 or so, or set random_page_cost
> to 1 and increase all the CPU costs. The former is less effort ;-)
>
> It should be noted also that there's not all that much evidence backing
> up the default values of the cpu_xxx_cost variables. In the past those
> didn't matter much because I/O costs always swamped CPU costs anyway.
> But I can foresee us having to twiddle those defaults and maybe refine
> the CPU cost model more, as all-in-RAM cases get more common.

This confused me. If we are assuming the data is in effective_cache_size, why are we adding sequential/random page cost to the query cost routines?

--
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ It's impossible for everything to be true. +
On Wed, Aug 11, 2010 at 9:42 PM, Bruce Momjian <bruce@momjian.us> wrote: > This confused me. If we are assuing the data is in > effective_cache_size, why are we adding sequential/random page cost to > the query cost routines? See the comments for index_pages_fetched(). We basically assume that all data starts uncached at the beginning of each query - in fact, each plan node. effective_cache_size only measures the chances that if we hit the same block again later in the execution of something like a nested-loop-with-inner-indexscan, it'll still be in cache. It's an extremely weak knob, and unless you have tables or indices that are larger than RAM, the only mistake you can make is setting it too low. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas wrote:
> On Wed, Aug 11, 2010 at 9:42 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > This confused me. If we are assuming the data is in
> > effective_cache_size, why are we adding sequential/random page cost to
> > the query cost routines?
>
> See the comments for index_pages_fetched(). We basically assume that
> all data starts uncached at the beginning of each query - in fact,
> each plan node. effective_cache_size only measures the chances that
> if we hit the same block again later in the execution of something
> like a nested-loop-with-inner-indexscan, it'll still be in cache.
>
> It's an extremely weak knob, and unless you have tables or indices
> that are larger than RAM, the only mistake you can make is setting it
> too low.

The attached patch documents that there is no assumption that data remains in the disk cache between queries. I thought this information might be helpful.

--
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ It's impossible for everything to be true. +

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77cacdd..520170b 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** SET ENABLE_SEQSCAN TO OFF;
*** 2424,2430 ****
        space. This parameter has no effect on the size of shared
        memory allocated by <productname>PostgreSQL</productname>, nor
        does it reserve kernel disk cache; it is used only for estimation
!       purposes. The default is 128 megabytes (<literal>128MB</>).
       </para>
      </listitem>
     </varlistentry>
--- 2424,2432 ----
        space. This parameter has no effect on the size of shared
        memory allocated by <productname>PostgreSQL</productname>, nor
        does it reserve kernel disk cache; it is used only for estimation
!       purposes. The system also does not assume data remains in
!       the disk cache between queries. The default is 128 megabytes
!       (<literal>128MB</>).
       </para>
      </listitem>
     </varlistentry>
Bruce Momjian wrote:
> The attached patch documents that there is no assumption that data
> remains in the disk cache between queries. I thought this information
> might be helpful.

Applied.

--
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ It's impossible for everything to be true. +