Thread: order of nested loop
I have two queries that return identical results. One is a SELECT DISTINCT and the other is the same query without the DISTINCT. The explain for the second one makes it seem as if it would be faster:

  Sort  (cost=73560.75..73560.75 rows=3 width=604)
vs.
  Sort  (cost=67246.81..67246.82 rows=3 width=604)

However in reality the first query runs much faster. The problem is this nested loop:

not distinct:
  ->  Subquery Scan "*SELECT* 2"  (cost=0.00..30602.38 rows=25 width=604)
        ->  Limit  (cost=0.00..30602.38 rows=25 width=604)
              ->  Nested Loop  (cost=0.00..5499145.64 rows=4492 width=604)

================ vs. =================================

distinct:
  ->  Sort  (cost=36903.81..36915.04 rows=4492 width=604)
        Sort Key: <snip>
        ->  Nested Loop  (cost=0.00..36631.27 rows=4492 width=604)

In the query with the distinct, one table is scanned first; in the other the order is reversed. This makes all the difference in the query, because in my test case there is only one matching entry in one of the tables, and that is always the table that determines the number of rows in the result (and except in pathological cases it will always return far fewer rows than the other table). So how can I tell postgres which table to scan first in the loop?
Joseph Shraibman <jks@selectacast.net> writes:
> I have two queries that return identical results. One is a SELECT
> DISTINCT and the other is the same query without the DISTINCT.

Could we see the actual queries?  And the complete EXPLAIN ANALYZE output?

			regards, tom lane
Joseph Shraibman <joseph@xtenit.com> writes:
> Tom Lane wrote:
>> Joseph Shraibman <jks@selectacast.net> writes:
>>> I have two queries that return identical results. One is a SELECT
>>> DISTINCT and the other is the same query without the DISTINCT.
>>
>> Could we see the actual queries?  And the complete EXPLAIN ANALYZE
>> output?

> (private email, because I don't want to sanitize this huge thing for the list)

Okay, but people send bigger things than this to the list all the time...

> [fast case]
> (SELECT distinct
> u.userkey,u.adminper,u.addper,u.approvper,u.ownerper,u.subper,u.banned,u.status,u.config,u.rules, ...
> ORDER BY lower(d.username) LIMIT 25)

>   ->  Limit  (cost=36903.81..36916.31 rows=25 width=604) (actual time=10.92..10.93 rows=1 loops=1)
>         ->  Unique  (cost=36903.81..37128.43 rows=449 width=604) (actual time=10.92..10.92 rows=1 loops=1)
>               ->  Sort  (cost=36903.81..36915.04 rows=4492 width=604) (actual time=10.91..10.91 rows=1 loops=1)
>                     Sort Key: lower(d.username), u.userkey, u.adminper, u.addper, u.approvper, u.ownerper, u.subper,
>                       u.banned, u.status, u.config, u.rules, (subplan), CASE WHEN ((u.rules IS NULL) OR (u.rules = ''::text))
>                       THEN ''::text ELSE 'r'::text END, d.username, d.realname, d.firstname, (subplan), d.status, 2
>                     ->  Nested Loop  (cost=0.00..36631.27 rows=4492 width=604) (actual time=10.65..10.67 rows=1 loops=1)
>                           ->  Index Scan using usertable_podkey_key on usertable u  (cost=0.00..16003.53 rows=5446 width=555) (actual time=0.04..0.05 rows=1 loops=1)
>                                 Index Cond: (podkey = 20)
>                                 Filter: ((status = 2) AND (NOT banned))
>                           ->  Index Scan using directory_pkey on directory d  (cost=0.00..3.78 rows=1 width=49) (actual time=0.02..0.03 rows=1 loops=1)
>                                 Index Cond: (d.userkey = "outer".userkey)
>                                 Filter: ((status = 2) OR (status = 5))

> [slow case]
> (SELECT
> u.userkey,u.adminper,u.addper,u.approvper,u.ownerper,u.subper,u.banned,u.status,u.config,u.rules, ...
> ORDER BY lower(d.username) LIMIT 25)

>   ->  Limit  (cost=0.00..30602.38 rows=25 width=604) (actual time=74810.10..102624.62 rows=1 loops=1)
>         ->  Nested Loop  (cost=0.00..5499145.64 rows=4492 width=604) (actual time=74810.10..102624.61 rows=1 loops=1)
>               ->  Index Scan using directory_lower_username_key on directory d  (cost=0.00..2380577.42 rows=525568 width=49) (actual time=15.61..79588.09 rows=589718 loops=1)
>                     Filter: ((status = 2) OR (status = 5))
>               ->  Index Scan using usertable_pkey on usertable u  (cost=0.00..5.92 rows=1 width=555) (actual time=0.04..0.04 rows=0 loops=589718)
>                     Index Cond: (("outer".userkey = u.userkey) AND (u.podkey = 20))
>                     Filter: ((status = 2) AND (NOT banned))

As near as I can tell, the failure of estimation is not in the slow case
--- the planner correctly estimates that this plan is expensive.
Rather, it's in the fast case, because the planner mistakenly thinks
that that is also expensive.  The critical error is right here:

>                           ->  Index Scan using usertable_podkey_key on usertable u  (cost=0.00..16003.53 rows=5446 width=555) (actual time=0.04..0.05 rows=1 loops=1)
>                                 Index Cond: (podkey = 20)
>                                 Filter: ((status = 2) AND (NOT banned))

That scan is estimated to yield 5446 rows and it only yields 1.  Do you
have any idea why the estimate is so far off?  (My guess is that podkey,
status and banned are correlated to a large extent, but you tell us.)

			regards, tom lane
Tom,

I am curious. Why perform a sort first? Would it not be more efficient
to create an in-memory index on the fly? It seems to me that it would
consume fewer resources, especially if there are many duplicates. Also,
caching the last key lookup would save time if the records are
"bunched". What am I missing here?

JL
Jean-Luc Lachance <jllachan@nsd.ca> writes:
> I am curious. Why perform a sort first? Would it not be more efficient
> to create an in-memory index on the fly?

Why would you think that?  Creating an index involves a sort ...
there's no free lunch there that I can see.

			regards, tom lane
Tom Lane wrote:
> Jean-Luc Lachance <jllachan@nsd.ca> writes:
>> I am curious. Why perform a sort first? Would it not be more efficient
>> to create an in-memory index on the fly?
>
> Why would you think that?  Creating an index involves a sort ...
> there's no free lunch there that I can see.

There is only a small difference, but DISTINCT implies a unique index,
which means there is no need to sort the duplicate records.
Tom Lane wrote:
> That scan is estimated to yield 5446 rows and it only yields 1.  Do you
> have any idea why the estimate is so far off?  (My guess is that podkey,
> status and banned are correlated to a large extent, but you tell us.)

The relationship between d and u is like this: there is a row in d for
each user, and for each pod they are a member of there is an entry in u.
So when I'm querying u for the members of a particular pod, I'm filtering
by the status of the d entry and the status of the u entry. There are a
lot of entries in d, but only a few of them will be members of a
particular pod. Thus it would make sense to first get the entries in u,
filter them, then filter by their status in d. There will be an entry in
d for each entry in u, but not vice versa.

The planner shows this for the scan on d:
(cost=0.00..2380577.42 rows=525568 width=49)
Maybe it thinks it will reach the limit of 25 before it actually does,
which is why it is willing to try something so expensive?
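For readers following along, a minimal sketch of the schema this description and the plans imply. The table, column, and index names come from the plans quoted above; the column types, the REFERENCES clause, and the exact primary-key definitions are assumptions, not something shown in the thread:

  -- directory: one row per user
  CREATE TABLE directory (
      userkey   integer PRIMARY KEY,   -- directory_pkey in the plans
      username  text,
      realname  text,
      firstname text,
      status    integer                -- filtered with (status = 2 OR status = 5)
  );
  CREATE INDEX directory_lower_username_key ON directory (lower(username));

  -- usertable: one row per (user, pod) membership,
  -- plus the adminper/addper/... permission columns omitted here
  CREATE TABLE usertable (
      userkey  integer REFERENCES directory,
      podkey   integer,                -- filtered with podkey = 20
      status   integer,                -- filtered with status = 2
      banned   boolean,                -- filtered with NOT banned
      config   text,
      rules    text,
      PRIMARY KEY (userkey, podkey)    -- usertable_pkey in the plans
  );
  CREATE INDEX usertable_podkey_key ON usertable (podkey);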
Joseph Shraibman <jks@selectacast.net> writes:
> The planner shows this for the scan on d:
> (cost=0.00..2380577.42 rows=525568 width=49)
> Maybe it thinks it will reach the limit of 25 before it actually does,
> which is why it is willing to try something so expensive?

Yeah, exactly, it's extrapolating that it will only actually have to
process a relatively small part of that scan.  Which would be true if
it were getting 4492 rows out of the join per estimate, and not just 1
per reality.  This is the same estimation failure as in the other plan,
I think, but it's a lot simpler to see in the other plan.

> ... Thus it would make sense to first get the entries in u,
> filter them, then filter by their status in d.

Right, but the problem is to know how many u entries will get through
the filter.  When that estimate is off by a factor of ~5000, it's no
surprise that the planner is likely to choose the wrong plan.  If you
could cut that estimation error by even a factor of 2, it would have
made the right choices here.

So we're back to my previous question: why is that estimate so far off?
You might try comparing

	explain select * from usertable where podkey = 20;
	select count(*) from usertable where podkey = 20;

to see whether the estimate is failing on the basic podkey=20 part.
If that doesn't seem too far off, add in the status = 2 and/or
(NOT banned) parts to see what confuses it.  I'd like to see the
pg_stats rows for these three columns, too.

BTW, you have done an ANALYZE recently on usertable, I hope.

			regards, tom lane
Tom Lane wrote:
> So we're back to my previous question: why is that estimate so far off?
> You might try comparing
> 	explain select * from usertable where podkey = 20;
> 	select count(*) from usertable where podkey = 20;

=> explain select * from usertable where podkey = 20;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Index Scan using usertable_podkey_key on usertable  (cost=0.00..16019.99 rows=5923 width=625)
   Index Cond: (podkey = 20)
(2 rows)

=> select count(*) from usertable where podkey = 20;
 count
-------
     3
(1 row)

=> select * from pg_stats where tablename = 'usertable' and attname in ('podkey','status','banned');
 schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+---------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 public     | usertable | podkey  | 0 | 4 | 66 | {<actual numbers deleted, but 20 isn't one of them>} | {0.208,0.156,0.112,0.0696667,0.0306667,0.028,0.0273333,0.025,0.0243333,0.023} | {10,90,137,140,197,207,246,264,267,269,300} | 0.53816
 public     | usertable | status  | 0 | 2 | 4 | {2,4,1,3} | {0.938,0.0496667,0.0103333,0.002} |  | 0.840237
 public     | usertable | banned  | 0 | 1 | 2 | {f,t} | {0.982,0.018} |  | 0.9339
(3 rows)

=> select count(*) from usertable;
  count
---------
 1121190
(1 row)

> to see whether the estimate is failing on the basic podkey=20 part.
> If that doesn't seem too far off, add in the status = 2 and/or
> (NOT banned) parts to see what confuses it.  I'd like to see the
> pg_stats rows for these three columns, too.
>
> BTW, you have done an ANALYZE recently on usertable, I hope.

Yeah, every week by cron job.
Joseph Shraibman <jks@selectacast.net> writes:
> => explain select * from usertable where podkey = 20;
>  Index Scan using usertable_podkey_key on usertable
>  (cost=0.00..16019.99 rows=5923 width=625)

> => select count(*) from usertable where podkey = 20;
>  count
> -------
>      3
> (1 row)

Well, there's our problem :-(

I would suggest bumping up the statistics target for usertable.podkey
(see ALTER TABLE SET STATISTICS).  Since it's only an int4 column you
could make the target 100 (instead of the default 10) without much
cost.  That should give substantially finer-grain detail and hopefully
bring this estimate down out of the stratosphere.  Re-analyze the table
and see what it gets you.

			regards, tom lane
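Concretely, the suggested change is a per-column statistics-target bump followed by a fresh ANALYZE. A minimal sketch (100 is just the target Tom suggests; the final EXPLAIN simply re-checks the estimate):

  -- keep a finer-grained histogram / most-common-values list for podkey
  ALTER TABLE usertable ALTER COLUMN podkey SET STATISTICS 100;

  -- recollect statistics so the planner actually sees the extra detail
  ANALYZE usertable;

  -- then re-check the row estimate
  EXPLAIN SELECT * FROM usertable WHERE podkey = 20;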
On Tue, Jun 17, 2003 at 11:53:05AM -0400, Jean-Luc Lachance wrote:
> Tom Lane wrote:
>> Jean-Luc Lachance <jllachan@nsd.ca> writes:
>>> I am curious. Why perform a sort first? Would it not be more efficient
>>> to create an in-memory index on the fly?
>>
>> Why would you think that?  Creating an index involves a sort ...
>> there's no free lunch there that I can see.
>
> There is only a small difference, but DISTINCT implies a unique index,
> which means there is no need to sort the duplicate records.

Also (and not specific to this example), an index doesn't need to sort
entire tuples. If the tuples are wide enough, it would be faster to
build an index, then use that index to access the tuples, especially if
you're going to read through the data more than once (the tuples might
have to be hopelessly large to make a single scan effective).

On a related note: since temporary tables are only visible to a single
connection, do they have full MVCC info in them, or can it be bypassed?
If it's not there you'd probably need some other means to allow for
transactions, but the performance gain might be well worth it.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Tom Lane wrote:
> I would suggest bumping up the statistics target for usertable.podkey
> (see ALTER TABLE SET STATISTICS).  Since it's only an int4 column you
> could make the target 100 (instead of the default 10) without much
> cost.  That should give substantially finer-grain detail and hopefully
> bring this estimate down out of the stratosphere.  Re-analyze the table
> and see what it gets you.

Thanks, that seems to have helped.

Now I have this other query:

                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=46167.67..81194.20 rows=175858 width=13) (actual time=8807.10..16097.85 rows=249551 loops=1)
   Hash Cond: ("outer".userkey = "inner".userkey)
   ->  Seq Scan on u  (cost=0.00..25886.79 rows=175858 width=7) (actual time=0.10..3329.93 rows=249551 loops=1)
         Filter: (podkey = 260)
   ->  Hash  (cost=20167.55..20167.55 rows=637355 width=6) (actual time=8806.36..8806.36 rows=0 loops=1)
         ->  Seq Scan on d  (cost=0.00..20167.55 rows=637355 width=6) (actual time=28.13..7317.19 rows=638045 loops=1)
 Total runtime: 16926.00 msec
(7 rows)

How do I read that? Is it creating a hash out of the data in d, then
going through u doing a join?
Joseph Shraibman <jks@selectacast.net> writes:
> How do I read that? Is it creating a hash out of the data in d, then
> going through u doing a join?

Yeah.  Given the numbers of rows involved, the plan seems pretty
reasonable --- I doubt you can do a lot better within the context
you're showing here.  To make it faster you'll have to find a way
to not need to look at all the rows.

			regards, tom lane
Joseph Shraibman <joseph@xtenit.com> writes:
> Well there is no reason for it to look at all the rows in d, since it
> has a filter on u that should produce far fewer rows than there are in d.

What filter?  The scan is producing 250 thousand rows, man!  You expect
it to join 250k rows in no time flat?

You can try forcing merge or nestloop joins with the enable_foo
switches, but I suspect you'll find that for the problem as posed,
this is the best plan available.  To make it go faster you will need
a way to eliminate a large percentage of one or both tables before
the join happens.

			regards, tom lane
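For reference, the enable_foo switches Tom mentions are per-session planner settings. A minimal sketch of how one might experiment with them, toggling one at a time and re-running EXPLAIN ANALYZE on the query in question (the query itself isn't shown in the thread):

  SET enable_hashjoin  = off;      -- discourage hash joins for this session
  -- SET enable_mergejoin = off;   -- likewise for merge joins
  -- SET enable_nestloop  = off;   -- likewise for nested loops

  -- ... re-run EXPLAIN ANALYZE on the query here and compare the plans ...

  RESET enable_hashjoin;           -- put things back when done experimenting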
"Jim C. Nasby" <jim@nasby.net> writes: > On a related note; since temporary tables are only visible to a single > connection, do they have full MVCC info in them, or can it be bypassed? You can't really bypass it. You still need the transaction number so you can tell if a tuple was created by a committed, aborted, or the current transaction, and you still need the command number for the same reasons as usual. We do (in 7.3) bypass WAL logging of changes in temp tables, and we also use non-shared buffers for them so as to avoid buffer locking overhead. Not sure there's much more win to be had there. (The local buffer manager code could stand improvement, but that's a different issue.) regards, tom lane
On Tue, Jun 17, 2003 at 07:29:16PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
>> On a related note: since temporary tables are only visible to a single
>> connection, do they have full MVCC info in them, or can it be bypassed?
>
> You can't really bypass it.  You still need the transaction number so
> you can tell if a tuple was created by a committed, aborted, or the
> current transaction, and you still need the command number for the same
> reasons as usual.

You do need that info, but I think a local-only temp table means you
don't have to use XID (and maybe CID, I don't really know how that's
used) to store the info. This is because only a single transaction can
reference the table. As part of committing an update, you can either
set a simple flag in the old tuples, or you might even be able to just
delete them (though nested transactions would probably break that). You
would have to keep a list of which tuples are old and new, but that's
probably more efficient than 16/20 bytes per tuple.

OTOH, I've also been wondering if MVCC info could be stored more
efficiently across the board. One idea is to keep a list of valid MVCC
states, and assign int (or maybe even smaller) IDs to every entry in
that list. Each tuple would then carry only that entry number. This
would probably not be good for tables that see a lot of single-row
transactions (though there's another advantage that might offset the
overhead), but for tables that only see fairly large transactions it
could be a tremendous gain.

The other advantage to normalizing the MVCC info is that it should then
be reasonable to store it in indexes as well as base tuples. This could
dramatically speed up index work, since you don't have to read the base
tuple to see if it's still valid. It would also allow things like index
covering, and count(*) to do only an index scan.

Another alternative to an MVCC ID would be to store the MVCC info
outside the mainline storage with a reference back to the base tuple,
and use some form of RLE, i.e.

	xmin, cmin, xmax, cmax, xvac, min_tid, max_tid

where min_tid and max_tid define the range of tuples that the MVCC info
applies to. I'm sure there are plenty of other ways MVCC info could be
stored without using 16/20 bytes per tuple.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > ... I'm sure there's plenty of other ways MVCC info could be > stored without using 16/20 bytes per tuple. I didn't really see a single workable idea there. Keep in mind that storage space is only one consideration (and not a real big one given modern disk-drive sizes). Ask yourself about atomicity, failure recovery, and update costs. RLE encoding of tuple states? Get real --- how many rows could get wiped out by a one-bit lossage? How extensive are the on-disk changes needed to encode a one-tuple change in state, and how do you recover if the machine crashes when only some of those changes are down to disk? In my opinion PG's on-disk structures are barely reliable enough now; we don't want to introduce compression schemes with the potential for large cross-tuple failure modes. Storing commit state in index entries has been repeatedly proposed and repeatedly rejected, too. It converts an atomic operation (update one word in one page) into a non-atomic, multi-page operation, which creates lots of performance and reliability problems. And the point of an index is to be smaller than the main table --- the more stuff you cram into an index tuple header, the less the advantage of having the index. Criticism in the form of a patch with experimental evidence is welcome, but I'm not really interested in debating what-if proposals, especially not ones that are already discussed in the archives. regards, tom lane
On Wed, Jun 18, 2003 at 01:42:02AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
>> ... I'm sure there are plenty of other ways MVCC info could be
>> stored without using 16/20 bytes per tuple.
>
> I didn't really see a single workable idea there.  Keep in mind that
> storage space is only one consideration (and not a real big one given
> modern disk-drive sizes).  Ask yourself about atomicity, failure

Disk space is cheap; disk bandwidth is insanely expensive, as is memory
bandwidth.

> recovery, and update costs.  RLE encoding of tuple states?  Get real ---
> how many rows could get wiped out by a one-bit lossage?  How extensive
> are the on-disk changes needed to encode a one-tuple change in state,
> and how do you recover if the machine crashes when only some of those
> changes are down to disk?  In my opinion PG's on-disk structures are
> barely reliable enough now; we don't want to introduce compression
> schemes with the potential for large cross-tuple failure modes.

Well, with more efficient MVCC info storage, adding error-correction
capability becomes more practical. On the other hand, is it really
pgsql's responsibility to handle errors caused by bad hardware? I don't
think any other database tries to.

> Storing commit state in index entries has been repeatedly proposed
> and repeatedly rejected, too.  It converts an atomic operation
> (update one word in one page) into a non-atomic, multi-page operation,
> which creates lots of performance and reliability problems.  And the
> point of an index is to be smaller than the main table --- the more
> stuff you cram into an index tuple header, the less the advantage
> of having the index.

Well, it doesn't have to be stored in the index itself. Moving MVCC
information to a separate set of pages from the base tuples would
provide the same ability.

> Criticism in the form of a patch with experimental evidence is welcome,
> but I'm not really interested in debating what-if proposals, especially
> not ones that are already discussed in the archives.

Fair enough, though I hope some others are interested, since my C
coding ability is nil. Can you point me at the archived discussions?
Searching for index and mvcc reveals nothing (actually, searching for
'index' returns nothing, which surely must be a bug).
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Tom Lane wrote:
> Joseph Shraibman <jks@selectacast.net> writes:
>> How do I read that? Is it creating a hash out of the data in d, then
>> going through u doing a join?
>
> Yeah.  Given the numbers of rows involved, the plan seems pretty
> reasonable --- I doubt you can do a lot better within the context
> you're showing here.  To make it faster you'll have to find a way
> to not need to look at all the rows.
>
> 			regards, tom lane

Well there is no reason for it to look at all the rows in d, since it
has a filter on u that should produce far fewer rows than there are in
d. In both the plan and the actual results, the rows returned from u
are fewer than the rows from d.