Thread: A Better Way? (Multi-Left Join Lookup)
Hi!
Can someone please point me to a resource (or suggest a solution) that will improve the performance of this query? I have some thoughts but figure I should avoid reinventing the wheel since this seems like something that has to have been solved already.
I am working on a query where I have a list of identifiers (sample set has about 8,500 records) and I have three other queries that return a subset of these 8,500 identifiers
Basic query is designed as such:
WITH
full_set AS ( ) -- 8,500 records
, sub_1 AS () -- also about 8,500
, sub_2 AS () -- maybe 5,000
, sub_3 AS () - - maybe 3,000
SELECT full_set.*
, COALESCE(sub_1.field, FALSE)
, COALESCE(sub_2.field, FALSE)
, COALESCE(sub_2.field, FALSE)
FROM full_set
LEFT JOIN sub_1
LEFT JOIN sub_2
LEFT JOIN sub_3
The goal is to output a boolean for each record in “full_set” specifying whether a corresponding records exists in the sub-set. If the record exists “sub_x.field” is defined to be TRUE and thus is output otherwise sub_x.field is NULL and coalesce returns FALSE.
I have some explain+analyze results for this and an equivalent query that uses sub-queries in FROM instead of CTEs but I figure I’ll throw this out on general as it seems fairly basic. I am guessing that anything more deep should be sent to the performance sub-list (which I haven’t subscribed to at this point) with the explain+analyze results attached.
The performance of this query is exponential due to the fact that the sub-queries/CTEs are not indexed and so each subset has to be scanned completely for each record in the full set.
The identifiers are sortable and so my first thought was to try and divide each dataset into sub-sets of, say, 1000, then UNION the results of the 9 passes it would take to process all 8,500 records. Would a Recursive CTE be able to accomplish this? The sticking point is that the identifiers not pure integers (usually, in my current dataset they are) and even when the range of values is variable.
I am working with 9.0
Thank You!
David J.
"David Johnston" <polobo@yahoo.com> writes: > WITH > full_set AS ( ) -- 8,500 records > , sub_1 AS () -- also about 8,500 > , sub_2 AS () -- maybe 5,000 > , sub_3 AS () - - maybe 3,000 > SELECT full_set.* > , COALESCE(sub_1.field, FALSE) > , COALESCE(sub_2.field, FALSE) > , COALESCE(sub_2.field, FALSE) > FROM full_set > LEFT JOIN sub_1 > LEFT JOIN sub_2 > LEFT JOIN sub_3 > The performance of this query is exponential due to the fact that the > sub-queries/CTEs are not indexed and so each subset has to be scanned > completely for each record in the full set. Surely not. Neither merge nor hash joins require an index. What plan is getting selected? Are you sure there's at most one match in each "sub" set for each row in the "full" set? If you were getting a large number of matches in some cases, the size of the result could balloon to something unfortunate ... but we have not got enough information to know. regards, tom lane
On 20 Jul 2012, at 22:30, David Johnston wrote: > Hi! > > Can someone please point me to a resource (or suggest a solution) that will improve the performance of this query? I havesome thoughts but figure I should avoid reinventing the wheel since this seems like something that has to have been solvedalready. > > I am working on a query where I have a list of identifiers (sample set has about 8,500 records) and I have three otherqueries that return a subset of these 8,500 identifiers > > Basic query is designed as such: > > WITH > full_set AS ( ) -- 8,500 records > , sub_1 AS () -- also about 8,500 > , sub_2 AS () -- maybe 5,000 > , sub_3 AS () - - maybe 3,000 > SELECT full_set.* > , COALESCE(sub_1.field, FALSE) > , COALESCE(sub_2.field, FALSE) > , COALESCE(sub_2.field, FALSE) > > FROM full_set > LEFT JOIN sub_1 > LEFT JOIN sub_2 > LEFT JOIN sub_3 > > The goal is to output a boolean for each record in “full_set” specifying whether a corresponding records exists in thesub-set. If the record exists “sub_x.field” is defined to be TRUE and thus is output otherwise sub_x.field is NULL andcoalesce returns FALSE. You are creating a product of the result sets for sub_1 to _3 there, while you only seem to need the union of the three. Perhaps something like this is what you're after? WITH full_set AS ( ) , subs AS ( SELECT 1 AS sub, TRUE AS field, ... FROM sub_1 UNION ALL SELECT 2 AS sub, TRUE AS field, ... FROM sub_2 UNION ALL SELECT 3 AS sub, TRUE AS field, ... FROM sub_3 ) SELECT ... FROM full_set LEFT JOIN subs If you need those rows to be distinct, use UNION instead of UNION ALL, but the database needs to do more work for that. Alban Hertroys -- Screwing up is an excellent way to attach something to the ceiling.
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Friday, July 20, 2012 4:47 PM > To: David Johnston > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup) > > "David Johnston" <polobo@yahoo.com> writes: > > WITH > > full_set AS ( ) -- 8,500 records > > , sub_1 AS () -- also about 8,500 > > , sub_2 AS () -- maybe 5,000 > > , sub_3 AS () - - maybe 3,000 > > SELECT full_set.* > > , COALESCE(sub_1.field, FALSE) > > , COALESCE(sub_2.field, FALSE) > > , COALESCE(sub_2.field, FALSE) > > FROM full_set > > LEFT JOIN sub_1 > > LEFT JOIN sub_2 > > LEFT JOIN sub_3 > > > The performance of this query is exponential due to the fact that the > > sub-queries/CTEs are not indexed and so each subset has to be scanned > > completely for each record in the full set. > > Surely not. Neither merge nor hash joins require an index. What plan is > getting selected? Are you sure there's at most one match in each "sub" set > for each row in the "full" set? If you were getting a large number of matches > in some cases, the size of the result could balloon to something unfortunate > ... but we have not got enough information to know. > > regards, tom lane The final result, in this case would have 8,500 records AND sub_1.field would be TRUE for basically all of them and FALSE for the minimal remainder sub_2.field would be TRUE for 5,000 of them and FALSE for 3,500 of them sub_3.field would be TRUE for 3,000 of them and FALSE for 5,500 of them There is never, in reality, two records in a sub-table for a single record in the master table. It is possible a record exists in a sub-table but not in the main table but I do not care about those (thus the LEFT instead of a FULL OUTER JOIN). I have attached a scrubbed query and explain/analyze. Let me know if something more is needed. I have included two versions of the query, one using CTE and the other using mostly sub-selects. I had run ANALYZE on the pertinent tables but the CTE queries all perform quite quickly when run by themselves. In looking at the source tables for the data I did notice that I have not properly defined the relevant INDEXes as being UNIQUE. This applies to two of the sub-tables. The third sub-table requires the use of "DISTINCT". The joining columns with each set of data are unique when fed into the LEFT JOIN. The master CTE/Query is generated via a function call and it also generates unique keys for the LEFT JOIN. Thank you for your help! David J.
Attachment
> -----Original Message----- > From: Alban Hertroys [mailto:haramrae@gmail.com] > Sent: Friday, July 20, 2012 5:03 PM > To: David Johnston > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup) > > On 20 Jul 2012, at 22:30, David Johnston wrote: > > > Hi! > > > > Can someone please point me to a resource (or suggest a solution) that will > improve the performance of this query? I have some thoughts but figure I > should avoid reinventing the wheel since this seems like something that has > to have been solved already. > > > > I am working on a query where I have a list of identifiers (sample set has > about 8,500 records) and I have three other queries that return a subset of > these 8,500 identifiers > > > > Basic query is designed as such: > > > > WITH > > full_set AS ( ) -- 8,500 records > > , sub_1 AS () -- also about 8,500 > > , sub_2 AS () -- maybe 5,000 > > , sub_3 AS () - - maybe 3,000 > > SELECT full_set.* > > , COALESCE(sub_1.field, FALSE) > > , COALESCE(sub_2.field, FALSE) > > , COALESCE(sub_2.field, FALSE) > > > > FROM full_set > > LEFT JOIN sub_1 > > LEFT JOIN sub_2 > > LEFT JOIN sub_3 > > > > The goal is to output a boolean for each record in "full_set" specifying > whether a corresponding records exists in the sub-set. If the record exists > "sub_x.field" is defined to be TRUE and thus is output otherwise sub_x.field > is NULL and coalesce returns FALSE. > > You are creating a product of the result sets for sub_1 to _3 there, while you > only seem to need the union of the three. > > Perhaps something like this is what you're after? > > WITH > full_set AS ( ) > , subs AS ( > SELECT 1 AS sub, TRUE AS field, ... FROM sub_1 > UNION ALL > SELECT 2 AS sub, TRUE AS field, ... FROM sub_2 > UNION ALL > SELECT 3 AS sub, TRUE AS field, ... FROM sub_3 > ) > SELECT ... > FROM full_set > LEFT JOIN subs > > If you need those rows to be distinct, use UNION instead of UNION ALL, but > the database needs to do more work for that. > > > Alban Hertroys > Using "UNION" I increase the number of output rows such that an identifier that has a matching record in all three subsets will appear 3-times in the result. Now, I can run this through a GROUP BY and use CASE statements to get it back into the multi-column format required but that seems messy. Also, there should not be a "product" between the sub-queries but only between an individual sub-query and the main query. The fact there are 3 sub-queries should result in additive resource consumption (al. la. UNION): [ M x (A + B + C) == MA + MB + MC ]. The left side is the UNION suggestion while the right-side is the current multi-left-join suggestion. Data wise they are equivalent but the left-side uses additional rows while the right-side uses additional columns. That said I will play with it just to see if the pre-UNION and a post-GROUP performs better than the multi-left-join that seems to be the most direct solution. Thank You! David J.
"David Johnston" <polobo@yahoo.com> writes: >> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >> Surely not. Neither merge nor hash joins require an index. What plan is >> getting selected? > I have attached a scrubbed query and explain/analyze. Let me know if > something more is needed. Well, here's your problem: > CTE master_listing {# The LEFT side of the multi-joins #} > -> Subquery Scan on call (cost=22762.65..22762.94 rows=1 width=32) (actual time=619.158..735.559 rows=8656 loops=1) The planner thinks master_listing will return only one row, which would make a nestloop the right way to do things. However, with 8500 rows coming out, the nestloop iterates 8500 times and takes forever. So what you need to do is figure out why that rowcount estimate is so far off and do whatever's needful to make it better. It does not have to be dead on --- even an estimate of a few dozen rows would likely be enough to discourage the planner from using a nestloop. You haven't shown enough info for anybody else to guess exactly why the rowcount estimate is bad, though. regards, tom lane
> -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Tom Lane > Sent: Friday, July 20, 2012 6:51 PM > To: David Johnston > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup) > > "David Johnston" <polobo@yahoo.com> writes: > >> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Surely not. Neither merge > >> nor hash joins require an index. What plan is getting selected? > > > I have attached a scrubbed query and explain/analyze. Let me know if > > something more is needed. > > Well, here's your problem: > > > CTE master_listing {# The LEFT side of the multi-joins #} > > -> Subquery Scan on call (cost=22762.65..22762.94 rows=1 > > width=32) (actual time=619.158..735.559 rows=8656 loops=1) > > The planner thinks master_listing will return only one row, which would > make a nestloop the right way to do things. However, with 8500 rows coming > out, the nestloop iterates 8500 times and takes forever. > > So what you need to do is figure out why that rowcount estimate is so far off > and do whatever's needful to make it better. It does not have to be dead on > --- even an estimate of a few dozen rows would likely be enough to > discourage the planner from using a nestloop. > > You haven't shown enough info for anybody else to guess exactly why the > rowcount estimate is bad, though. > > regards, tom lane > OK. So, EXPLAIN SELECT function_call(...) -- yields a planner expectation of 1 row [Whereas] EXPLAIN SELECT * FROM function_call(...) -- yields a planner expectation of "result_rows" which defaults to 1000 The syntax: SELECT function_call(field_on_another_relation) FROM another_relation Is convenient in order to avoid... SELECT * FROM function_call( (SELECT field_on_another_relation FROM another_relation) ); ...especially when you need multiple fields from "another_relation" I guess I get the idea that a function used "inline" is generally going to return a single result and so the estimate of "1" is most probable. May I suggest, then, that the CREATE FUNCTION documentation for "ROWS result_rows" be modified: Something like: "The default assumption is 1,000 rows if the function is called in the FROM clause of a query. If it is called anywhere else (e.g., the Select List) the assumption is 1 row regardless of an explicit or default ROWS estimate." Was this an intentional design decision to override the result_rows estimate of the function if it is used in the select list? I get the general reasoning behind it and do not know enough regarding the internals to make a judgement but if intentional could it maybe be a little smarter as to when the override occurs? Obviously the ideal solution is to implement LATERAL... FYI: I included the following section in the query I provided because I suspected the function call may have been the issue... , master_listing AS ( SELECT -- identifier fields FROM ( SELECT (func).* FROM ( SELECT fuction_generating_8500_records(...) --<<<<< Use of function in Select List; results in the 1-row estimate. 1,000 rows triggers the "MERGE LEFT JOIN" plan ) func FROM scenario_info ) call ) master (function_column_rename) ) Thank You! David J.
"David Johnston" <polobo@yahoo.com> writes: > So, > EXPLAIN SELECT function_call(...) -- yields a planner expectation of 1 row > [Whereas] > EXPLAIN SELECT * FROM function_call(...) -- yields a planner expectation of > "result_rows" which defaults to 1000 Hm ... > Was this an intentional design decision to override the result_rows estimate > of the function if it is used in the select list? Not so much an intentional decision as just that nobody ever did anything about it. In general, SRFs in the target list are considered a legacy feature, which we're going to deprecate as soon as we have LATERAL so that there's a better-defined substitute. So I'd not want to expend a lot of work on this. But it probably would be possible to extract the row count estimates nearly for free while extracting the target list's cost estimate, which we already have to make a pass over the expressions for. Let me go look at that ... regards, tom lane
I wrote: > "David Johnston" <polobo@yahoo.com> writes: >> So, >> EXPLAIN SELECT function_call(...) -- yields a planner expectation of 1 row >> [Whereas] >> EXPLAIN SELECT * FROM function_call(...) -- yields a planner expectation of >> "result_rows" which defaults to 1000 > Hm ... >> Was this an intentional design decision to override the result_rows estimate >> of the function if it is used in the select list? > Not so much an intentional decision as just that nobody ever did > anything about it. I've now done something about that for 9.2. I'm loath to back-patch it into any already-stable releases, though, for fear of destabilizing plan choices that production applications might be relying on. regards, tom lane
> -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Tom Lane > Sent: Saturday, July 21, 2012 5:48 PM > To: David Johnston > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup) > > I wrote: > > "David Johnston" <polobo@yahoo.com> writes: > >> So, > >> EXPLAIN SELECT function_call(...) -- yields a planner expectation of > >> 1 row [Whereas] EXPLAIN SELECT * FROM function_call(...) -- yields a > >> planner expectation of "result_rows" which defaults to 1000 > > > Hm ... > > >> Was this an intentional design decision to override the result_rows > >> estimate of the function if it is used in the select list? > > > Not so much an intentional decision as just that nobody ever did > > anything about it. > > I've now done something about that for 9.2. I'm loath to back-patch it into > any already-stable releases, though, for fear of destabilizing plan choices that > production applications might be relying on. > > regards, tom lane > Understood and agree. It isn't like the proper estimates cannot be gotten - it just is less than syntactically beautiful to do so. Maybe a documentation patch instead of a code patch would be in order to at least give people of chance to learn about the inconsistent behavior before it bites them in 8.3 to 9.1? For me personally I read and learned about the function row estimate property and didn't make the connection between the fact I knew I was using the default of 1000 and the planner was telling me it was only using 1. Thank you for your responsiveness on this. David J.