Thread: A Better Way? (Multi-Left Join Lookup)

A Better Way? (Multi-Left Join Lookup)

From

"David Johnston"

Date:

20 July 2012, 17:31:19

Hi!

Can someone please point me to a resource (or suggest a solution) that will improve the performance of this query? I have some thoughts but figure I should avoid reinventing the wheel since this seems like something that has to have been solved already.

I am working on a query where I have a list of identifiers (sample set has about 8,500 records) and I have three other queries that return a subset of these 8,500 identifiers

Basic query is designed as such:

WITH

full_set AS ( ) -- 8,500 records

, sub_1 AS () -- also about 8,500

, sub_2 AS () -- maybe 5,000

, sub_3 AS () - - maybe 3,000

SELECT full_set.*

, COALESCE(sub_1.field, FALSE)

, COALESCE(sub_2.field, FALSE)

FROM full_set

LEFT JOIN sub_1

LEFT JOIN sub_2

LEFT JOIN sub_3

The goal is to output a boolean for each record in “full_set” specifying whether a corresponding records exists in the sub-set. If the record exists “sub_x.field” is defined to be TRUE and thus is output otherwise sub_x.field is NULL and coalesce returns FALSE.

I have some explain+analyze results for this and an equivalent query that uses sub-queries in FROM instead of CTEs but I figure I’ll throw this out on general as it seems fairly basic. I am guessing that anything more deep should be sent to the performance sub-list (which I haven’t subscribed to at this point) with the explain+analyze results attached.

The performance of this query is exponential due to the fact that the sub-queries/CTEs are not indexed and so each subset has to be scanned completely for each record in the full set.

The identifiers are sortable and so my first thought was to try and divide each dataset into sub-sets of, say, 1000, then UNION the results of the 9 passes it would take to process all 8,500 records. Would a Recursive CTE be able to accomplish this? The sticking point is that the identifiers not pure integers (usually, in my current dataset they are) and even when the range of values is variable.

I am working with 9.0

Thank You!

David J.

Re: A Better Way? (Multi-Left Join Lookup)

From

Tom Lane

Date:

20 July 2012, 17:46:48

"David Johnston" <polobo@yahoo.com> writes:
> WITH
>   full_set AS ( ) -- 8,500 records
> , sub_1 AS () -- also about 8,500
> , sub_2 AS () -- maybe 5,000
> , sub_3 AS () - - maybe 3,000
> SELECT full_set.*
> , COALESCE(sub_1.field, FALSE)
> , COALESCE(sub_2.field, FALSE)
> , COALESCE(sub_2.field, FALSE)
> FROM full_set
> LEFT JOIN sub_1
> LEFT JOIN sub_2
> LEFT JOIN sub_3

> The performance of this query is exponential due to the fact that the
> sub-queries/CTEs are not indexed and so each subset has to be scanned
> completely for each record in the full set.

Surely not.  Neither merge nor hash joins require an index.  What plan
is getting selected?  Are you sure there's at most one match in each
"sub" set for each row in the "full" set?  If you were getting a large
number of matches in some cases, the size of the result could balloon
to something unfortunate ... but we have not got enough information to
know.

            regards, tom lane

Re: A Better Way? (Multi-Left Join Lookup)

From

Alban Hertroys

Date:

20 July 2012, 18:03:13

On 20 Jul 2012, at 22:30, David Johnston wrote:

> Hi!
>
> Can someone please point me to a resource (or suggest a solution) that will improve the performance of this query?  I
havesome thoughts but figure I should avoid reinventing the wheel since this seems like something that has to have been
solvedalready.  
>
> I am working on a query where I have a list of identifiers (sample set has about 8,500 records) and I have three
otherqueries that return a subset of these 8,500 identifiers 
>
> Basic query is designed as such:
>
> WITH
>   full_set AS ( ) -- 8,500 records
> , sub_1 AS () -- also about 8,500
> , sub_2 AS () -- maybe 5,000
> , sub_3 AS () - - maybe 3,000
> SELECT full_set.*
> , COALESCE(sub_1.field, FALSE)
> , COALESCE(sub_2.field, FALSE)
> , COALESCE(sub_2.field, FALSE)
>
> FROM full_set
> LEFT JOIN sub_1
> LEFT JOIN sub_2
> LEFT JOIN sub_3
>
> The goal is to output a boolean for each record in “full_set” specifying whether a corresponding records exists in
thesub-set.  If the record exists “sub_x.field” is defined to be TRUE and thus is output otherwise sub_x.field is NULL
andcoalesce returns FALSE. 

You are creating a product of the result sets for sub_1 to _3 there, while you only seem to need the union of the
three.

Perhaps something like this is what you're after?

WITH
  full_set AS ( )
, subs AS (
  SELECT 1 AS sub, TRUE AS field, ... FROM sub_1
  UNION ALL
  SELECT 2 AS sub, TRUE AS field, ... FROM sub_2
  UNION ALL
  SELECT 3 AS sub, TRUE AS field, ... FROM sub_3
)
SELECT ...
FROM full_set
LEFT JOIN subs

If you need those rows to be distinct, use UNION instead of UNION ALL, but the database needs to do more work for that.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

Re: A Better Way? (Multi-Left Join Lookup)

From

"David Johnston"

Date:

20 July 2012, 18:38:05

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Friday, July 20, 2012 4:47 PM
> To: David Johnston
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup)
>
> "David Johnston" <polobo@yahoo.com> writes:
> > WITH
> >   full_set AS ( ) -- 8,500 records
> > , sub_1 AS () -- also about 8,500
> > , sub_2 AS () -- maybe 5,000
> > , sub_3 AS () - - maybe 3,000
> > SELECT full_set.*
> > , COALESCE(sub_1.field, FALSE)
> > , COALESCE(sub_2.field, FALSE)
> > , COALESCE(sub_2.field, FALSE)
> > FROM full_set
> > LEFT JOIN sub_1
> > LEFT JOIN sub_2
> > LEFT JOIN sub_3
>
> > The performance of this query is exponential due to the fact that the
> > sub-queries/CTEs are not indexed and so each subset has to be scanned
> > completely for each record in the full set.
>
> Surely not.  Neither merge nor hash joins require an index.  What plan is
> getting selected?  Are you sure there's at most one match in each "sub"
set
> for each row in the "full" set?  If you were getting a large number of
matches
> in some cases, the size of the result could balloon to something
unfortunate
> ... but we have not got enough information to know.
>
>             regards, tom lane

The final result, in this case would have 8,500 records AND
sub_1.field would be TRUE for basically all of them and FALSE for the
minimal remainder
sub_2.field would be TRUE for 5,000 of them and FALSE for 3,500 of them
sub_3.field would be TRUE for 3,000 of them and FALSE for 5,500 of them

There is never, in reality, two records in a sub-table for a single record
in the master table.  It is possible a record exists in a sub-table but not
in the main table but I do not care about those (thus the LEFT instead of a
FULL OUTER JOIN).

I have attached a scrubbed query and explain/analyze.  Let me know if
something more is needed.

I have included two versions of the query, one using CTE and the other using
mostly sub-selects.

I had run ANALYZE on the pertinent tables but the CTE queries all perform
quite quickly when run by themselves.

In looking at the source tables for the data I did notice that I have not
properly defined the relevant INDEXes as being UNIQUE.  This applies to two
of the sub-tables.  The third sub-table requires the use of "DISTINCT".  The
joining columns with each set of data are unique when fed into the LEFT
JOIN.  The master CTE/Query is generated via a function call and it also
generates unique keys for the LEFT JOIN.

Thank you for your help!

David J.

Attachment

query_and_explain.txt

Re: A Better Way? (Multi-Left Join Lookup)

From

"David Johnston"

Date:

20 July 2012, 18:50:05

> -----Original Message-----
> From: Alban Hertroys [mailto:haramrae@gmail.com]
> Sent: Friday, July 20, 2012 5:03 PM
> To: David Johnston
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup)
>
> On 20 Jul 2012, at 22:30, David Johnston wrote:
>
> > Hi!
> >
> > Can someone please point me to a resource (or suggest a solution) that
will
> improve the performance of this query?  I have some thoughts but figure I
> should avoid reinventing the wheel since this seems like something that
has
> to have been solved already.
> >
> > I am working on a query where I have a list of identifiers (sample set
has
> about 8,500 records) and I have three other queries that return a subset
of
> these 8,500 identifiers
> >
> > Basic query is designed as such:
> >
> > WITH
> >   full_set AS ( ) -- 8,500 records
> > , sub_1 AS () -- also about 8,500
> > , sub_2 AS () -- maybe 5,000
> > , sub_3 AS () - - maybe 3,000
> > SELECT full_set.*
> > , COALESCE(sub_1.field, FALSE)
> > , COALESCE(sub_2.field, FALSE)
> > , COALESCE(sub_2.field, FALSE)
> >
> > FROM full_set
> > LEFT JOIN sub_1
> > LEFT JOIN sub_2
> > LEFT JOIN sub_3
> >
> > The goal is to output a boolean for each record in "full_set" specifying
> whether a corresponding records exists in the sub-set.  If the record
exists
> "sub_x.field" is defined to be TRUE and thus is output otherwise
sub_x.field
> is NULL and coalesce returns FALSE.
>
> You are creating a product of the result sets for sub_1 to _3 there, while
you
> only seem to need the union of the three.
>
> Perhaps something like this is what you're after?
>
> WITH
>   full_set AS ( )
> , subs AS (
>   SELECT 1 AS sub, TRUE AS field, ... FROM sub_1
>   UNION ALL
>   SELECT 2 AS sub, TRUE AS field, ... FROM sub_2
>   UNION ALL
>   SELECT 3 AS sub, TRUE AS field, ... FROM sub_3
> )
> SELECT ...
> FROM full_set
> LEFT JOIN subs
>
> If you need those rows to be distinct, use UNION instead of UNION ALL, but
> the database needs to do more work for that.
>
>
> Alban Hertroys
>

Using "UNION" I increase the number of output rows such that an identifier
that has a matching record in all three subsets will appear 3-times in the
result.  Now, I can run this through a GROUP BY and use CASE statements to
get it back into the multi-column format required but that seems messy.
Also, there should not be a "product" between the sub-queries but only
between an individual sub-query and the main query.  The fact there are 3
sub-queries should result in additive resource consumption (al. la. UNION):
[ M x (A + B + C) == MA + MB + MC ].  The left side is the UNION suggestion
while the right-side is the current multi-left-join suggestion.  Data wise
they are equivalent but the left-side uses additional rows while the
right-side uses additional columns.

That said I will play with it just to see if the pre-UNION and a post-GROUP
performs better than the multi-left-join that seems to be the most direct
solution.

Thank You!

David J.

Re: A Better Way? (Multi-Left Join Lookup)

From

Tom Lane

Date:

20 July 2012, 19:51:26

"David Johnston" <polobo@yahoo.com> writes:
>> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> Surely not.  Neither merge nor hash joins require an index.  What plan is
>> getting selected?

> I have attached a scrubbed query and explain/analyze.  Let me know if
> something more is needed.

Well, here's your problem:

>   CTE master_listing {# The LEFT side of the multi-joins #}
>     ->  Subquery Scan on call  (cost=22762.65..22762.94 rows=1 width=32) (actual time=619.158..735.559 rows=8656
loops=1)

The planner thinks master_listing will return only one row, which would
make a nestloop the right way to do things.  However, with 8500 rows
coming out, the nestloop iterates 8500 times and takes forever.

So what you need to do is figure out why that rowcount estimate is so
far off and do whatever's needful to make it better.  It does not have
to be dead on --- even an estimate of a few dozen rows would likely be
enough to discourage the planner from using a nestloop.

You haven't shown enough info for anybody else to guess exactly why
the rowcount estimate is bad, though.

            regards, tom lane

Re: A Better Way? (Multi-Left Join Lookup)

From

"David Johnston"

Date:

20 July 2012, 20:39:19

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Friday, July 20, 2012 6:51 PM
> To: David Johnston
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup)
>
> "David Johnston" <polobo@yahoo.com> writes:
> >> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Surely not.  Neither merge
> >> nor hash joins require an index.  What plan is getting selected?
>
> > I have attached a scrubbed query and explain/analyze.  Let me know if
> > something more is needed.
>
> Well, here's your problem:
>
> >   CTE master_listing {# The LEFT side of the multi-joins #}
> >     ->  Subquery Scan on call  (cost=22762.65..22762.94 rows=1
> > width=32) (actual time=619.158..735.559 rows=8656 loops=1)
>
> The planner thinks master_listing will return only one row, which would
> make a nestloop the right way to do things.  However, with 8500 rows
coming
> out, the nestloop iterates 8500 times and takes forever.
>
> So what you need to do is figure out why that rowcount estimate is so far
off
> and do whatever's needful to make it better.  It does not have to be dead
on
> --- even an estimate of a few dozen rows would likely be enough to
> discourage the planner from using a nestloop.
>
> You haven't shown enough info for anybody else to guess exactly why the
> rowcount estimate is bad, though.
>
>             regards, tom lane
>

OK.

So,

EXPLAIN SELECT function_call(...)  -- yields a planner expectation of 1 row

[Whereas]

EXPLAIN SELECT * FROM function_call(...) -- yields a planner expectation of
"result_rows" which defaults to 1000

The syntax:

SELECT function_call(field_on_another_relation)
    FROM another_relation

Is convenient in order to avoid...

SELECT * FROM function_call(
    (SELECT field_on_another_relation FROM another_relation)
    );

...especially when you need multiple fields from "another_relation"

I guess I get the idea that a function used "inline" is generally going to
return a single result and so the estimate of "1" is most probable.

May I suggest, then, that the CREATE FUNCTION documentation for "ROWS
result_rows" be modified:

Something like:

"The default assumption is 1,000 rows if the function is called in the FROM
clause of a query.  If it is called anywhere else (e.g., the Select List)
the assumption is 1 row regardless of an explicit or default ROWS estimate."

Was this an intentional design decision to override the result_rows estimate
of the function if it is used in the select list?  I get the general
reasoning behind it and do not know enough regarding the internals to make a
judgement but if intentional could it maybe be a little smarter as to when
the override occurs?  Obviously the ideal solution is to implement
LATERAL...

FYI:  I included the following section in the query I provided because I
suspected the function call may have been the issue...

, master_listing AS (

SELECT

    -- identifier fields

    FROM (
        SELECT (func).* FROM (
            SELECT fuction_generating_8500_records(...)
--<<<<< Use of function in Select List; results in the 1-row estimate.
1,000 rows triggers the "MERGE LEFT JOIN" plan
            ) func
            FROM scenario_info
         ) call
    ) master (function_column_rename)

)

Thank You!

David J.

Re: A Better Way? (Multi-Left Join Lookup)

From

Tom Lane

Date:

21 July 2012, 14:36:32

"David Johnston" <polobo@yahoo.com> writes:
> So,
> EXPLAIN SELECT function_call(...)  -- yields a planner expectation of 1 row
> [Whereas]
> EXPLAIN SELECT * FROM function_call(...) -- yields a planner expectation of
> "result_rows" which defaults to 1000

Hm ...

> Was this an intentional design decision to override the result_rows estimate
> of the function if it is used in the select list?

Not so much an intentional decision as just that nobody ever did
anything about it.

In general, SRFs in the target list are considered a legacy feature,
which we're going to deprecate as soon as we have LATERAL so that
there's a better-defined substitute.  So I'd not want to expend a lot
of work on this.  But it probably would be possible to extract the row
count estimates nearly for free while extracting the target list's cost
estimate, which we already have to make a pass over the expressions for.
Let me go look at that ...

            regards, tom lane

Re: A Better Way? (Multi-Left Join Lookup)

From

Tom Lane

Date:

21 July 2012, 18:47:53

I wrote:
> "David Johnston" <polobo@yahoo.com> writes:
>> So,
>> EXPLAIN SELECT function_call(...)  -- yields a planner expectation of 1 row
>> [Whereas]
>> EXPLAIN SELECT * FROM function_call(...) -- yields a planner expectation of
>> "result_rows" which defaults to 1000

> Hm ...

>> Was this an intentional design decision to override the result_rows estimate
>> of the function if it is used in the select list?

> Not so much an intentional decision as just that nobody ever did
> anything about it.

I've now done something about that for 9.2.  I'm loath to back-patch it
into any already-stable releases, though, for fear of destabilizing
plan choices that production applications might be relying on.

            regards, tom lane

Re: A Better Way? (Multi-Left Join Lookup)

From

"David Johnston"

Date:

21 July 2012, 21:49:16

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Saturday, July 21, 2012 5:48 PM
> To: David Johnston
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] A Better Way? (Multi-Left Join Lookup)
>
> I wrote:
> > "David Johnston" <polobo@yahoo.com> writes:
> >> So,
> >> EXPLAIN SELECT function_call(...)  -- yields a planner expectation of
> >> 1 row [Whereas] EXPLAIN SELECT * FROM function_call(...) -- yields a
> >> planner expectation of "result_rows" which defaults to 1000
>
> > Hm ...
>
> >> Was this an intentional design decision to override the result_rows
> >> estimate of the function if it is used in the select list?
>
> > Not so much an intentional decision as just that nobody ever did
> > anything about it.
>
> I've now done something about that for 9.2.  I'm loath to back-patch it
into
> any already-stable releases, though, for fear of destabilizing plan
choices that
> production applications might be relying on.
>
>             regards, tom lane
>

Understood and agree.  It isn't like the proper estimates cannot be gotten -
it just is less than syntactically beautiful to do so.

Maybe a documentation patch instead of a code patch would be in order to at
least give people of chance to learn about the inconsistent behavior before
it bites them in 8.3 to 9.1?  For me personally I read and learned about the
function row estimate property and didn't make the connection between the
fact I knew I was using the default of 1000 and the planner was telling me
it was only using 1.

Thank you for your responsiveness on this.

David J.