Thread: How to force Nested Loop plan?
I'm trying to understand how I can get the planner to always do the right thing with this query:

    EXPLAIN ANALYZE
    SELECT aa_t.min_date_time
    FROM aa_t, bb_t, cc_t
    WHERE bb_t.bb_id = aa_t.bb_id
      AND aa_t.realm_id = cc_t.realm_id
      AND aa_t.server_id = 21
    ORDER BY aa_t.min_date_time DESC
    LIMIT 1
    OFFSET 674;

There's an extreme elbow in the performance curve around the sum of LIMIT and OFFSET. The two plans follow. First, for the query above:

 Limit  (cost=21569.56..21601.56 rows=1 width=84) (actual time=59.60..59.69 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..110535.66 rows=3454 width=84) (actual time=0.19..59.20 rows=676 loops=1)
         ->  Nested Loop  (cost=0.00..93177.46 rows=3454 width=65) (actual time=0.14..44.41 rows=676 loops=1)
               ->  Index Scan Backward using aa_t20 on aa_t  (cost=0.00..76738.77 rows=3454 width=46) (actual time=0.10..31.30 rows=676 loops=1)
                     Filter: (server_id = 21::numeric)
               ->  Index Scan using cc_t1 on cc_t  (cost=0.00..4.75 rows=1 width=19) (actual time=0.01..0.01 rows=1 loops=676)
                     Index Cond: ("outer".realm_id = cc_t.realm_id)
         ->  Index Scan using bb_t1 on bb_t  (cost=0.00..5.01 rows=1 width=19) (actual time=0.02..0.02 rows=1 loops=676)
               Index Cond: (bb_t.bb_id = "outer".bb_id)
 Total runtime: 59.89 msec
(10 rows)

Setting OFFSET to 675 in the above query results in this plan, which is 100 times slower:

 Limit  (cost=21614.48..21614.48 rows=1 width=84) (actual time=4762.39..4762.39 rows=1 loops=1)
   ->  Sort  (cost=21612.79..21621.42 rows=3454 width=84) (actual time=4761.45..4761.92 rows=677 loops=1)
         Sort Key: aa_t.min_date_time
         ->  Merge Join  (cost=21139.96..21409.80 rows=3454 width=84) (actual time=4399.80..4685.24 rows=41879 loops=1)
               Merge Cond: ("outer".bb_id = "inner".bb_id)
               ->  Sort  (cost=8079.83..8184.53 rows=41879 width=19) (actual time=936.99..967.37 rows=41879 loops=1)
                     Sort Key: bb_t.bb_id
                     ->  Seq Scan on bb_t  (cost=0.00..4864.79 rows=41879 width=19) (actual time=0.06..729.60 rows=41879 loops=1)
               ->  Sort  (cost=13060.13..13068.76 rows=3454 width=65) (actual time=3462.76..3493.97 rows=41879 loops=1)
                     Sort Key: aa_t.bb_id
                     ->  Merge Join  (cost=12794.42..12857.14 rows=3454 width=65) (actual time=2923.62..3202.78 rows=41879 loops=1)
                           Merge Cond: ("outer".realm_id = "inner".realm_id)
                           ->  Sort  (cost=12762.78..12771.41 rows=3454 width=46) (actual time=2920.78..2950.87 rows=41879 loops=1)
                                 Sort Key: aa_t.realm_id
                                 ->  Index Scan using aa_t5 on aa_t  (cost=0.00..12559.79 rows=3454 width=46) (actual time=0.18..2589.22 rows=41879 loops=1)
                                       Index Cond: (server_id = 21::numeric)
                           ->  Sort  (cost=31.64..32.78 rows=455 width=19) (actual time=2.54..33.12 rows=42163 loops=1)
                                 Sort Key: cc_t.realm_id
                                 ->  Seq Scan on cc_t  (cost=0.00..11.55 rows=455 width=19) (actual time=0.04..0.86 rows=455 loops=1)
 Total runtime: 4792.84 msec
(20 rows)

Twiddling effective_cache_size and random_page_cost allows for a larger LIMIT+OFFSET number, but not large enough. These tests were made with effective_cache_size = 400000 and random_page_cost = 4.

I can raise the LIMIT+OFFSET elbow to 1654 by changing the query thusly:

<     AND aa_t.server_id = 21
---
>     AND aa_t.server_id IN (21, 0)

The value 0 is an invalid server_id, so I know it won't be returned. However, I've got 41K rows that could be returned by this query, and growing, and 1654 is obviously not enough. (aa_t is 690K rows, bb_t is 41K rows, and cc_t is 500 rows.)

If I drop the ORDER BY, the query goes much faster, but the query is useless without the ORDER BY. I've figured out that the second plan is slow because it is writing a huge result set to disk (200+ MB).
This doesn't make sense to me, since sort_mem is 32000.

Is there a way to tell the optimizer to always use the Nested Loop plan instead of the Merge Join plan? Turning off enable_mergejoin is obviously not an option.

Thanks,
Rob
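As a reproduction aid, the merge join can be discouraged for a single session without touching the server-wide configuration; this is only a sketch using the query from the post, not a recommended production fix:

    SET enable_mergejoin = off;   -- affects only the current session
    EXPLAIN ANALYZE
    SELECT aa_t.min_date_time
    FROM aa_t, bb_t, cc_t
    WHERE bb_t.bb_id = aa_t.bb_id
      AND aa_t.realm_id = cc_t.realm_id
      AND aa_t.server_id = 21
    ORDER BY aa_t.min_date_time DESC
    LIMIT 1 OFFSET 675;
    RESET enable_mergejoin;       -- restore the default for the rest of the session

This only helps confirm that the nested-loop plan stays fast past the elbow; it does not address why the planner prefers the merge join.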
Rob Nagler <nagler@bivio.biz> writes:
> I'm trying to understand how I can get the planner to always do the
> right thing with this query:
>
> SELECT aa_t.min_date_time
> FROM aa_t, bb_t, cc_t
> WHERE bb_t.bb_id = aa_t.bb_id
>   AND aa_t.realm_id = cc_t.realm_id
>   AND aa_t.server_id = 21
> ORDER BY aa_t.min_date_time DESC
> LIMIT 1
> OFFSET 674
>
> ->  Index Scan Backward using aa_t20 on aa_t  (cost=0.00..76738.77 rows=3454 width=46) (actual time=0.10..31.30 rows=676 loops=1)
>       Filter: (server_id = 21::numeric)

The reason the planner does not much like this plan is that it's estimating that quite a lot of rows will have to be hit in min_date_time order before it finds enough rows with server_id = 21. Thus the high cost estimate for the above step.

I suspect that the reason you like this plan is that there's actually substantial correlation between server_id and min_date_time, such that the required rows are found quickly. Before trying to force the planner into what you consider an optimal plan, you had better ask yourself whether you can expect that correlation to hold up in the future. If not, your plan could become pessimal pretty quickly.

I'd suggest creating a double-column index:

    create index aa_ti on aa_t (server_id, min_date_time);

and altering the query to read

    ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC

(you need this kluge to make sure the planner recognizes that the new index matches the ORDER BY request). Then you should get a plan with a much smaller cost coefficient for this step.

        regards, tom lane

PS: does server_id really need to be NUMERIC? Why not integer, or at worst bigint?
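Putting Tom's two suggestions together against the original query looks roughly like this; aa_ti is his example index name, and the extra server_id sort key is the kluge he describes:

    CREATE INDEX aa_ti ON aa_t (server_id, min_date_time);

    SELECT aa_t.min_date_time
    FROM aa_t, bb_t, cc_t
    WHERE bb_t.bb_id = aa_t.bb_id
      AND aa_t.realm_id = cc_t.realm_id
      AND aa_t.server_id = 21
    ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC  -- matches the new two-column index
    LIMIT 1 OFFSET 674;

Because the WHERE clause pins server_id to a single value, adding it to the ORDER BY cannot change which rows are returned; it only lets the planner see that the new index already delivers rows in the requested order.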
Tom Lane writes:
> The reason the planner does not much like this plan is that it's
> estimating that quite a lot of rows will have to be hit in min_date_time
> order before it finds enough rows with server_id = 21. Thus the high
> cost estimate for the above step.

Thanks for the speedy and useful reply! More questions follow. :)

Very interesting. How does it know "quite a lot"? Is there something I can do to get the planner to analyze the data better?

> I suspect that the reason you like this plan is that there's actually
> substantial correlation between server_id and min_date_time, such that
> the required rows are found quickly. Before trying to force the planner
> into what you consider an optimal plan, you had better ask yourself
> whether you can expect that correlation to hold up in the future.
> If not, your plan could become pessimal pretty quickly.

The correlation holds. min_date_time increases over time as records are inserted. server_id is uniformly distributed over time. There's no randomness. There is at least one 21 record for every value of min_date_time. 21 is a special server_id containing aggregate (denormalized) data for the other servers. I thought about putting it in a separate table, but that would complicate the code, as the data is identical to the non-aggregated case. Do you have any suggestions for organizing the data/query now that you know this?

> I'd suggest creating a double-column index:

Thanks. I'll try this.

I'm a very big fan of declarative programming. However, there's a danger in declarative programming when the interpreter isn't smart enough. When I add this index, I will slow down inserts (about 20K/day) and increase data size (this is the second largest table in the database). Moreover, if the planner is improved, I should fix my code, delete the index, etc. Is there a way of giving the planner direct hints, as in Oracle? They can be ignored when the optimizer is improved, just as "register" is ignored by C compilers nowadays.

Adding the extra index and ORDER BY is also not easy in our case. The query is dynamically generated. I can force the query ORDER BY to be whatever I want, but I would lose the ability to do interesting things, like the automatic generation of ORDER BY when someone clicks on a column header in the application. Indeed, there are times when people want to sort on other columns in the query. I reduced the problem to the salient details for my post to this list.

What if the ORDER BY was:

    ORDER BY aa_t.server_id DESC, cc_t.name ASC

Would the planner do the right thing?

> PS: does server_id really need to be NUMERIC? Why not integer, or at
> worst bigint?

It is a NUMERIC(18). It could be a bigint. What would be the change in performance of this query if we changed it to bigint?

BTW, my customer is probably going to be switching to Oracle. This particular query has been one of the reasons. Maybe this change will help us stay with Postgres.

Thanks,
Rob
Rob Nagler <nagler@bivio.biz> writes:
> Tom Lane writes:
>> The reason the planner does not much like this plan is that it's
>> estimating that quite a lot of rows will have to be hit in min_date_time
>> order before it finds enough rows with server_id = 21.

> Very interesting. How does it know "quite a lot"?

It doesn't, because it has no cross-column-correlation stats. The default assumption is that there's no correlation.

> server_id is uniformly distributed over time. There's
> no randomness. There is at least one 21 record for every value of
> min_date_time.

That doesn't really tell me anything. What's the proportion of 21 records out of the total table?

> 21 is a special server_id containing aggregate
> (denormalized) data for the other servers. I thought about putting it
> in a separate table, but this would complicate the code as the data is
> identical to the non-aggregated case.

Hm. If most of your queries are for id 21, an alternative approach is to create single-column partial indexes:

    create index fooi on foo (min_date_time) where server_id = 21;

This reduces the cost of maintaining the index but also makes it useful *only* for id = 21 queries. On the plus side, you don't need to hack the ORDER BY clause to get your queries to use it. Your choice...

> What if the ORDER BY was:
> ORDER BY aa_t.server_id DESC, cc_t.name ASC
> Would the planner do the right thing?

What do you consider the right thing? cc_t.name doesn't seem connected to this table at all --- or did I miss something?

> It is a NUMERIC(18). It could be a bigint. What would be the change
> in performance of this query if we changed it to bigint?

Hard to say. I can tell you that the raw comparison operator is a lot quicker for bigint than for numeric, but I don't have any hard numbers about what percentage of total CPU time is involved. You'd pretty much have to try it for yourself to see what the effect is in your queries.

If you've been generically using NUMERIC(n) where you could be using integer or bigint, then I think you've probably paid a high price without knowing it. I don't know what Oracle's cost tradeoffs are for these datatypes, but I can tell you that Postgres's integer types are way faster (and more compact) than our NUMERIC.

        regards, tom lane
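Translated to the table and column names in the thread, the partial-index idea would be something like this (the index name is illustrative):

    CREATE INDEX aa_t_mdt_21 ON aa_t (min_date_time) WHERE server_id = 21;
    ANALYZE aa_t;

The original query, ORDER BY and all, stays untouched; the planner can use the partial index (scanned backward for the DESC ordering) only for queries whose WHERE clause implies server_id = 21.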
On Sat, 2003-08-30 at 10:47, Rob Nagler wrote:
> Tom Lane writes:
[snip]
> enough. When I add this index, I will slow down inserts (about
> 20K/day) and increase data size (this is the second largest table in
[snip]

Since I gather that this is a web site, can we presume that the inserts are clumped into an 8-hour range? 20,000/8 = 2,500/hour, which is 41.67/minute. If you can't do 0.69 inserts/second, something is wrong, and it ain't hardware, and it ain't Postgresql...

> > PS: does server_id really need to be NUMERIC? Why not integer, or at
> > worst bigint?
>
> It is a NUMERIC(18). It could be a bigint. What would be the change
> in performance of this query if we changed it to bigint?

http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-INT
http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-NUMERIC-DECIMAL

Scalars are faster than arbitrary-precision types. Small (32-bit) scalars are faster than big (64-bit) scalars on x86 hardware.

-- 
-----------------------------------------------------------------
Ron Johnson, Jr.     ron.l.johnson@cox.net
Jefferson, LA USA

"Adventure is a sign of incompetence"
Stephanson, great polar explorer
Tom Lane writes:
> That doesn't really tell me anything. What's the proportion of 21
> records out of the total table?

Currently we have about 15 servers, so 6% of the data is uniformly distributed with the value 21.

> create index fooi on foo (min_date_time) where server_id = 21;
>
> This reduces the cost of maintaining the index but also makes it useful
> *only* for id = 21 queries. On the plus side, you don't need to hack
> the ORDER BY clause to get your queries to use it. Your choice...

I like that better, thanks. After testing I found the elbow was at 1610 records with this index, but this clause still yields better performance at 1654 records:

    AND aa_t.server_id IN (21, 0)

This is independent of the existence of the new index. Interestingly, when I drop the aa_t5 index, the elbow goes up to 1729 with the IN (21, 0) query.

You might ask: why do I have an index at all? That's from my Oracle experience. server_id is a foreign key into the server table. If you don't create an index on a foreign key, Oracle locks the entire foreign table when you modify the local table. With an index, it only locks a row in the foreign table. This can cause a major bottleneck, but in this case the server table is static, so the index is superfluous. Since there are only 16 values, the index should really be a bitmap index (Oracle speak, sorry, I don't know the PG term). Dropping the index probably won't change any of the other queries, I think.

Without the aa_t5 index and after the elbow, the Index Scan is replaced with a Seq Scan, which is just about as fast, but still 50 times slower than before the elbow:

 Limit  (cost=34071.30..34071.31 rows=1 width=84) (actual time=5111.14..5111.15 rows=1 loops=1)
   ->  Sort  (cost=34066.98..34075.61 rows=3454 width=84) (actual time=5108.74..5109.96 rows=1733 loops=1)
         Sort Key: aa_t.min_date_time
         ->  Merge Join  (cost=33801.26..33863.98 rows=3454 width=84) (actual time=4868.62..5020.58 rows=41879 loops=1)
               Merge Cond: ("outer".realm_id = "inner".realm_id)
               ->  Sort  (cost=31.64..32.78 rows=455 width=19) (actual time=3.06..3.38 rows=415 loops=1)
                     Sort Key: cc_t.realm_id
                     ->  Seq Scan on cc_t  (cost=0.00..11.55 rows=455 width=19) (actual time=0.05..0.99 rows=455 loops=1)
               ->  Sort  (cost=33769.63..33778.26 rows=3454 width=65) (actual time=4865.20..4895.28 rows=41879 loops=1)
                     Sort Key: aa_t.realm_id
                     ->  Merge Join  (cost=33296.79..33566.63 rows=3454 width=65) (actual time=4232.52..4541.24 rows=41879 loops=1)
                           Merge Cond: ("outer".bb_id = "inner".bb_id)
                           ->  Sort  (cost=25216.97..25225.60 rows=3454 width=46) (actual time=3213.53..3243.65 rows=41879 loops=1)
                                 Sort Key: aa_t.bb_id
                                 ->  Seq Scan on aa_t  (cost=0.00..25013.97 rows=3454 width=46) (actual time=20.07..2986.11 rows=41879 loops=1)
                                       Filter: (server_id = 21::numeric)
                           ->  Sort  (cost=8079.83..8184.53 rows=41879 width=19) (actual time=1018.95..1049.37 rows=41879 loops=1)
                                 Sort Key: bb_t.bb_id
                                 ->  Seq Scan on bb_t  (cost=0.00..4864.79 rows=41879 width=19) (actual time=0.04..810.88 rows=41879 loops=1)
 Total runtime: 5141.22 msec

What I'm not sure about is why it decides to switch modes so "early", i.e., at about 5% of the table size or less. It seems that an Index Scan would give better mileage than a Seq Scan for possibly up to 50% of the table in this case. I clearly don't understand the internals, but the elbow seems rather sharp to me.

> > What if the ORDER BY was:
> > ORDER BY aa_t.server_id DESC, cc_t.name ASC
> > Would the planner do the right thing?
>
> What do you consider the right thing?
> cc_t.name doesn't seem connected
> to this table at all --- or did I miss something?

Sorry, this is a red herring. Please ignore.

> If you've been generically using NUMERIC(n) where you could be using
> integer or bigint, then I think you've probably paid a high price
> without knowing it. I don't know what Oracle's cost tradeoffs are for
> these datatypes, but I can tell you that Postgres's integer types are
> way faster (and more compact) than our NUMERIC.

I'll try to figure out what the price is in our case. I think Oracle does a pretty good job on data compression for NUMERIC. I haven't dealt with a large Postgres database until this one, so I guess it's time to learn. :)

We actually have been quite pleased with Postgres's performance without paying much attention to it before this. When we first set up Oracle, we got into all of its parameters pretty heavily. With Postgres, we just tried it and it worked. This is the first query where we ran out of ideas to try.

BTW, everybody's help on this list is fantastic. Usually, I can find the answer to my question (and have been doing so for 3 years) on this list without asking.

Thanks,
Rob
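For measuring the NUMERIC-versus-bigint difference on a copy of the data, one possible approach on 7.3 (which has no ALTER COLUMN ... TYPE; that arrived in a later release) is to rebuild the column by hand. This is a rough sketch only; any indexes, constraints, or foreign keys on server_id would have to be recreated afterwards:

    BEGIN;
    ALTER TABLE aa_t ADD COLUMN server_id_big bigint;
    UPDATE aa_t SET server_id_big = server_id::bigint;   -- copy the NUMERIC values across
    ALTER TABLE aa_t DROP COLUMN server_id;
    ALTER TABLE aa_t RENAME COLUMN server_id_big TO server_id;
    COMMIT;
    ANALYZE aa_t;   -- refresh statistics for the rebuilt column

Timing the problem query before and after on the copy gives a concrete answer to "what would the change in performance be".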
On Sat, 2003-08-30 at 15:42, Rob Nagler wrote:
[snip]
> We actually have been quite pleased with Postgres's performance
> without paying much attention to it before this. When we first set up
> Oracle, we got into all of its parameters pretty heavily. With
> Postgres, we just tried it and it worked. This is the first query
> where we ran out of ideas to try.

Dumb question: given your out-of-the-box satisfaction, could it be that postgresql.conf hasn't been tweaked?

-- 
-----------------------------------------------------------------
Ron Johnson, Jr.     ron.l.johnson@cox.net
Jefferson, LA USA

484,246 sq mi are needed for 6 billion people to live, 4 persons
per lot, in lots that are 60'x150'. That is ~ California, Texas
and Missouri. Alternatively, France, Spain and The United Kingdom.
Rob Nagler <nagler@bivio.biz> writes:
> What I'm not sure about is why it decides to switch modes so "early",
> i.e., at about 5% of the table size or less.

Given the default cost parameters and cost models, that's the correct place to switch. Since the estimate evidently doesn't match reality for your case, you might want to play with the parameters. Reducing random_page_cost would be the first thing I'd try. Some people think that increasing effective_cache_size is a good idea too, though I feel that that has only marginal impact on the planner's choices.

Keep in mind though that you seem to be experimenting with a fully-cached database; you may find that the planner's beliefs more nearly approach reality when actual I/O has to occur.

Another thing I'd be interested to know about is how closely the physical order of the table entries correlates with min_date_time. A high correlation reduces the actual cost of the indexscan (since visiting the rows in index order becomes less of a random-access proposition). We are aware that the planner doesn't model this effect very well at present ...

        regards, tom lane
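Both parameters can be changed per session for this kind of experiment before anything is written into postgresql.conf; the values below are placeholders, not recommendations:

    SET random_page_cost = 2;            -- default is 4; lower values favor index scans
    SET effective_cache_size = 100000;   -- measured in 8KB disk pages
    -- re-run the EXPLAIN ANALYZE from the original post and compare the chosen plans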
Ron Johnson writes:
> Dumb question: given your out-of-the-box satisfaction, could it be
> that postgresql.conf hasn't been tweaked?

Here are the modified values:

    shared_buffers = 8000
    wal_buffers = 80
    sort_mem = 32000
    effective_cache_size = 400000
    random_page_cost = 4
    autocommit = false
    timezone = UTC

I had run a test with effective_cache_size set to a high value to see what would happen. I also adjusted random_page_cost:

    random_page_cost    effective_cache_size    elbow
    4                   40000                   675
    .5                  40000                   592
    .1                  40000                   392
    4                   1000                    30

My conclusion is that random_page_cost should be left alone and that a higher effective_cache_size is better.

BTW, the hardware is 2 x 2.4GHz Xeon, 1.2GB RAM, SCSI (Linux software RAID) with 10K RPM disks. This is close to the production box, although we are planning on adding more memory to production.

Rob
Tom Lane writes:
> Keep in mind though that you seem to be experimenting with a
> fully-cached database; you may find that the planner's beliefs more
> nearly approach reality when actual I/O has to occur.

My hope is that the entire database should fit in memory. This may not be the case right now with only 1GB, but it should be close. The pgsql/data/base/NNN directory is about 1.5GB on production. I'm pretty sure that with constant vacuuming we could keep that size down. A pg_dump is about 60MB now, growing at about .5MB a day.

> Another thing I'd be interested to know about is how closely the
> physical order of the table entries correlates with min_date_time.

Probably "pretty close". The primary key of aa_t is (bb_id, server_id), and bb_id is a sequence. aa_t is updated heavily on production, but these tests are on a fresh import, so vacuuming and index order are not a factor. We do a reload every now and then to improve performance on production. min_date_time is highly correlated with bb_id, because both are increasing constantly. server_id is one of 16 values.

> A high correlation reduces the actual cost of the indexscan (since
> visiting the rows in index order becomes less of a random-access
> proposition). We are aware that the planner doesn't model this effect
> very well at present ...

Oracle's optimizer is lacking here, too. The best optimizer I've seen was at Tandem, and even then hints were required.

Are there plans for explicit hints to the planner?

Thanks,
Rob
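The physical-order correlation Tom is asking about is visible in the planner's own statistics; a sketch for checking it, using the column names from the thread:

    ANALYZE aa_t;
    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'aa_t'
      AND attname IN ('min_date_time', 'bb_id', 'server_id');

Values close to 1.0 or -1.0 mean the heap is stored nearly in that column's order, which is when a backward index scan on min_date_time is cheapest in practice.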
Rob Nagler <nagler@bivio.biz> writes:
> Are there plans for explicit hints to the planner?

Personally, I'm philosophically opposed to planner hints; see previous discussions in the archives.

        regards, tom lane
On Sun, 2003-08-31 at 18:12, Tom Lane wrote:
> Rob Nagler <nagler@bivio.biz> writes:
> > Are there plans for explicit hints to the planner?
>
> Personally, I'm philosophically opposed to planner hints; see previous
> discussions in the archives.

How about (if you don't already do it) ranked (or approximately ranked) b-tree indexes, where each node also stores the (approximate) count of tuple pointers under it?

This way, the planner would know whether or how skewed a tree is, and (approximately) how many tuples a given WHERE predicate resolves to.

-- 
-----------------------------------------------------------------
Ron Johnson, Jr.     ron.l.johnson@cox.net
Jefferson, LA USA

"Fair is where you take your cows to be judged."
Unknown
Ron Johnson <ron.l.johnson@cox.net> writes:
> How about (if you don't already do it) ranked (or approximately
> ranked) b-tree indexes, where each node also stores the (approximate)
> count of tuple pointers under it?
> This way, the planner would know whether or how skewed a tree is,
> and (approximately) how many tuples a given WHERE predicate resolves
> to.

Why is that better than our existing implementation of column statistics?

        regards, tom lane
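For reference, the per-column statistics Tom mentions can already be tuned and inspected; a sketch using the thread's table (the statistics target of 100 is illustrative; the 7.3 default is 10):

    ALTER TABLE aa_t ALTER COLUMN server_id SET STATISTICS 100;
    ANALYZE aa_t;
    SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'aa_t' AND attname = 'server_id';

A larger target makes ANALYZE sample more rows and keep more most-common-value entries, which is how the planner estimates how many rows a predicate like server_id = 21 will match.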