Thread: Merge Join vs Nested Loop
I have some odd cases here joining two tables - the planner insists on
Merge Join, but Nested Loop is really faster - and that makes sense,
since I'm selecting just a small partition of the data available.  All
planner constants seem to be set at the default values; the only way to
get a shift towards Nested Loops seems to be to raise the constants.  I
believe our memory is big enough to hold the indices, and that the
effective_cache_size is set to a sane value (but how to verify that,
anyway?).

What causes the nested loops to be estimated so costly - or is it the
merge joins that are estimated too cheaply?  Should I raise all the
planner cost constants, or only one of them?

Here are some sample explains:

prod=> explain analyze select * from ticket join users on users_id=users.id where ticket.created>'2006-09-25 17:00';
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..67664.15 rows=10977 width=675) (actual time=0.038..202.877 rows=10627 loops=1)
   ->  Index Scan using ticket_on_created on ticket  (cost=0.00..11665.94 rows=10977 width=80) (actual time=0.014..35.571 rows=10627 loops=1)
         Index Cond: (created > '2006-09-25 17:00:00'::timestamp without time zone)
   ->  Index Scan using users_pkey on users  (cost=0.00..5.00 rows=1 width=595) (actual time=0.007..0.008 rows=1 loops=10627)
         Index Cond: ("outer".users_id = users.id)
 Total runtime: 216.612 ms
(6 rows)

prod=> explain analyze select * from ticket join users on users_id=users.id where ticket.created>'2006-09-25 16:00';
                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=12844.93..68580.37 rows=11401 width=675) (actual time=106.631..1712.458 rows=11554 loops=1)
   Merge Cond: ("outer".id = "inner".users_id)
   ->  Index Scan using users_pkey on users  (cost=0.00..54107.38 rows=174508 width=595) (actual time=0.041..1215.221 rows=174599 loops=1)
   ->  Sort  (cost=12844.93..12873.43 rows=11401 width=80) (actual time=105.753..123.905 rows=11554 loops=1)
         Sort Key: ticket.users_id
         ->  Index Scan using ticket_on_created on ticket  (cost=0.00..12076.68 rows=11401 width=80) (actual time=0.074..65.297 rows=11554 loops=1)
               Index Cond: (created > '2006-09-25 16:00:00'::timestamp without time zone)
 Total runtime: 1732.452 ms
(8 rows)
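For reference, this is roughly how I've been inspecting the current
settings (standard GUC names only; output omitted here):

  SHOW effective_cache_size;
  SHOW random_page_cost;
  SELECT name, setting
    FROM pg_settings
   WHERE name IN ('effective_cache_size', 'random_page_cost',
                  'cpu_tuple_cost', 'cpu_index_tuple_cost', 'cpu_operator_cost');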
Tobias Brox <tobias@nordicbet.com> writes:
> What causes the nested loops to be estimated so costly - or is it the
> merge joins that are estimated too cheaply?  Should I raise all the
> planner cost constants, or only one of them?

If your tables are small enough to fit (mostly) in memory, then the
planner tends to overestimate the cost of a nestloop because it fails to
account for cacheing effects across multiple scans of the inner table.
This is addressed in 8.2, but in earlier versions about all you can do
is reduce random_page_cost, and a sane setting of that (ie not less than
1.0) may not be enough to push the cost estimates where you want them.
Still, reducing random_page_cost ought to be your first recourse.

			regards, tom lane
[Tom Lane - Tue at 06:09:56PM -0400]
> If your tables are small enough to fit (mostly) in memory, then the
> planner tends to overestimate the cost of a nestloop because it fails to
> account for cacheing effects across multiple scans of the inner table.
> This is addressed in 8.2, but in earlier versions about all you can do
> is reduce random_page_cost, and a sane setting of that (ie not less than
> 1.0) may not be enough to push the cost estimates where you want them.
> Still, reducing random_page_cost ought to be your first recourse.

Thank you.  Reducing the random page cost did reduce the nested loop
cost significantly; sadly, the merge join costs were reduced even
further, causing the planner to favor them even more than before.
Setting effective_cache_size really low solved the issue, but I believe
we would rather want a high effective_cache_size.

Eventually, setting effective_cache_size to near-0 and random_page_cost
to 1 could be a desperate measure.  Another one is to turn off
merge/hash joins and seq scans.  It could be a worthwhile experiment if
nothing else :-)

The bulk of our database is historical data that most often is not
touched at all, though one never knows for sure until the queries have
all run through - so table partitioning does not seem to be an option.
My general idea is that nested loops would cause the most recent data
and the most important parts of the indexes to stay in the OS cache.
Does this make sense from an expert's point of view? :-)
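If I do go for that experiment, I imagine it would be something along
these lines - strictly per session, just to see which plan comes out
(the toggles are the stock enable_* flags; the query is one of the
samples from my first mail):

  SET enable_mergejoin = off;   -- steer the planner away from merge joins in this session
  SET enable_hashjoin = off;
  SET enable_seqscan = off;
  EXPLAIN ANALYZE
    SELECT * FROM ticket JOIN users ON users_id = users.id
    WHERE ticket.created > '2006-09-25 16:00';
  RESET enable_mergejoin; RESET enable_hashjoin; RESET enable_seqscan;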
On Wed, 2006-09-27 at 11:48 +0200, Tobias Brox wrote:
> [Tom Lane - Tue at 06:09:56PM -0400]
> > If your tables are small enough to fit (mostly) in memory, then the
> > planner tends to overestimate the cost of a nestloop because it fails to
> > account for cacheing effects across multiple scans of the inner table.
> > This is addressed in 8.2, but in earlier versions about all you can do
> > is reduce random_page_cost, and a sane setting of that (ie not less than
> > 1.0) may not be enough to push the cost estimates where you want them.
> > Still, reducing random_page_cost ought to be your first recourse.
>
> Thank you.  Reducing the random page cost did reduce the nested loop
> cost significantly; sadly, the merge join costs were reduced even
> further, causing the planner to favor them even more than before.
> Setting effective_cache_size really low solved the issue, but I believe
> we would rather want a high effective_cache_size.
>
> Eventually, setting effective_cache_size to near-0 and random_page_cost
> to 1 could be a desperate measure.  Another one is to turn off
> merge/hash joins and seq scans.  It could be a worthwhile experiment if
> nothing else :-)
>
> The bulk of our database is historical data that most often is not
> touched at all, though one never knows for sure until the queries have
> all run through - so table partitioning does not seem to be an option.
> My general idea is that nested loops would cause the most recent data
> and the most important parts of the indexes to stay in the OS cache.
> Does this make sense from an expert's point of view? :-)

Have you tried changing the cpu_* cost options to see how they affect
merge versus nested loop?
[Scott Marlowe - Wed at 09:58:30AM -0500]
> Have you tried changing the cpu_* cost options to see how they affect
> merge versus nested loop?

As said in the original post, increasing any of them shifts the planner
towards nested loops instead of merge joins.  I didn't check which of
the cost constants made the most impact.
On Wed, 2006-09-27 at 17:05 +0200, Tobias Brox wrote:
> [Scott Marlowe - Wed at 09:58:30AM -0500]
> > Have you tried changing the cpu_* cost options to see how they affect
> > merge versus nested loop?
>
> As said in the original post, increasing any of them shifts the planner
> towards nested loops instead of merge joins.  I didn't check which of
> the cost constants made the most impact.

So, by decreasing them, you should move away from nested loops then,
right?  Has that not worked for some reason?
[Scott Marlowe - Wed at 10:19:24AM -0500]
> So, by decreasing them, you should move away from nested loops then,
> right?  Has that not worked for some reason?

I want to move to nested loops; they are empirically faster in many of
our queries, and that makes sense since we've got quite big tables and
most of the queries only touch a small partition of the data.

I've identified that moving any of the cost constants (including
random_page_cost) upwards gives me the right result, but I'm still wary
whether this is the right thing to do.  Even if so, which constants
should I target first?  I could of course try to analyze a bit which
constants give the biggest impact.  Then again, we have many more
queries hitting the database than the few I'm researching (and the ones
I'm researching are very simplified versions of the real queries).
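For what it's worth, the kind of one-at-a-time experiment I have in mind
would look roughly like this (the value below is made up for
illustration, not a recommendation; the query is one of the samples from
my first mail):

  SHOW cpu_tuple_cost;                 -- note the configured values first
  SHOW cpu_index_tuple_cost;
  SHOW cpu_operator_cost;

  SET cpu_tuple_cost = 0.05;           -- nudge one constant at a time
  EXPLAIN ANALYZE
    SELECT * FROM ticket JOIN users ON users_id = users.id
    WHERE ticket.created > '2006-09-25 17:00';
  RESET cpu_tuple_cost;                -- back to the configured value before the next test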
On Wed, 2006-09-27 at 10:26, Tobias Brox wrote:
> [Scott Marlowe - Wed at 10:19:24AM -0500]
> > So, by decreasing them, you should move away from nested loops then,
> > right?  Has that not worked for some reason?
>
> I want to move to nested loops; they are empirically faster in many of
> our queries, and that makes sense since we've got quite big tables and
> most of the queries only touch a small partition of the data.
>
> I've identified that moving any of the cost constants (including
> random_page_cost) upwards gives me the right result, but I'm still wary
> whether this is the right thing to do.  Even if so, which constants
> should I target first?  I could of course try to analyze a bit which
> constants give the biggest impact.  Then again, we have many more
> queries hitting the database than the few I'm researching (and the ones
> I'm researching are very simplified versions of the real queries).

Ahh, the other direction then.  I would think it's safer to nudge these
a bit than to drop random_page_cost to 1 or set effective_cache_size to
1000, etc...  But you should certainly test the other queries and/or
keep an eye on your database while it's running to make sure those
changes don't impact other users.
[Scott Marlowe - Wed at 10:31:35AM -0500]
> And remember, you can always change any of those settings in session for
> just this one query to force the planner to make the right decision.

Sure ... I could identify the most problematic queries and hack up the
software application to modify the config settings for those exact
queries ... but it's a very ugly solution. :-)  Particularly if Tom Lane
is correct that the preference for merge joins instead of nested loops
is indeed a bug.
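For the record, the per-query override Scott describes would presumably
look something like this from the application side (the settings and the
query here are just placeholders; SET LOCAL only lasts until the end of
the transaction):

  BEGIN;
  SET LOCAL random_page_cost = 1;     -- placeholder value, effective for this transaction only
  SET LOCAL enable_mergejoin = off;
  SELECT * FROM ticket JOIN users ON users_id = users.id
   WHERE ticket.created > '2006-09-25 17:00';
  COMMIT;                             -- the SET LOCALs do not outlive the transaction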
I found a way to survive yet some more weeks :-)

One of the queries we've had the most problems with today is in
principle something like:

  select A.*, sum(B.*) from A join B where A.created > x and ...
  group by A.* order by A.created desc limit 32

There are on average two rows in B for every row in A.  Note the
'limit 32' part.

I rewrote the query to:

  select A.*, (select sum(B.*) from B ...) from A
  where A.created > x and ... order by A.created desc limit 32;

And voila, the planner figured out it needed just a few rows from A, and
execution time was cut from 1-2 minutes down to 20 ms. :-)

I've also started thinking a bit harder about table partitioning; if we
add some more redundancy both to the queries and the database, it may
help us drastically reduce the real expenses of some of the merge
joins...
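Spelled out a bit more concretely - the table and column names below are
made up for illustration, and the real queries are messier:

  -- original shape: the aggregate forces the full join before the LIMIT can cut anything off
  SELECT a.id, a.created, sum(b.amount) AS b_sum
    FROM a JOIN b ON b.a_id = a.id
   WHERE a.created > '2006-09-25'
   GROUP BY a.id, a.created
   ORDER BY a.created DESC
   LIMIT 32;

  -- rewritten shape: a correlated subselect, evaluated only for the 32 rows actually kept
  SELECT a.id, a.created,
         (SELECT sum(b.amount) FROM b WHERE b.a_id = a.id) AS b_sum
    FROM a
   WHERE a.created > '2006-09-25'
   ORDER BY a.created DESC
   LIMIT 32;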