Thread: [PERFORM] select subquery versus join subquery

[PERFORM] select subquery versus join subquery

From
Jeff Janes
Date:
I need to make a view which decorates rows from a parent table with aggregated values from a child table.  I can think of two ways to write it, one which aggregates the child table to make a new query table and joins the parent to that, as shown in "view1" below.  Or does subselect in the select list to aggregate just the currently matching rows, and returns that value, as in "view2" below.

While these two are semantically equivalent, the planner doesn't understand that, and always executes them pretty much the way you would naively do it based on the text of the query.

But view1 is pretty slow if the WHERE clause is highly selective (like "WHERE cutoff<0.00001") because it has to summarize the entire child table just to pull out a few rows.  But view2 is pretty slow if the entire view or most of it (like "WHERE cutoff<0.9") is being returned.

Is there some 3rd way to write the query which allows the planner to switch between strategies (summarize whole table vs summarize values on demand) depending on the known selectivity of the where clause?

In this case, the planner is getting the relative cost estimates roughly correct.  It is not a problem of mis-estimation.

I can always create two views, view_small and view_large, and swap between them based on my own knowledge of how restrictive a query is likely to be, but that is rather annoying.  Especially in the real-world situation, which is quite a bit more complex than this.

create table thing as select x as id, random() as cutoff from generate_series(1,2000000) f(x);

create table thing_alias as select floor(power(random()*power(2000000,5),0.2))::int thing_id, md5(x::text), random() as priority from generate_series(1,150000) f(x);

create index on thing_alias (thing_id );

create index on thing (cutoff );

vacuum; analyze;

create view view1 as select id, md5,cutoff from thing left join 
  (
     select distinct on (thing_id) thing_id, md5 from thing_alias 
     order by thing_id, priority desc
  ) as foo 
  on (thing_id=id);

create view view2 as select id, 
  (
     select md5 from thing_alias where thing_id=id 
        order by priority desc limit 1
  ) as md5, 
  cutoff from thing;

Cheers,

Jeff

Re: [PERFORM] select subquery versus join subquery

From
"Gunnar \"Nick\" Bluth"
Date:
Am 05/22/2017 um 09:57 PM schrieb Jeff Janes:
> I need to make a view which decorates rows from a parent table with
> aggregated values from a child table.  I can think of two ways to write
> it, one which aggregates the child table to make a new query table and
> joins the parent to that, as shown in "view1" below.  Or does subselect
> in the select list to aggregate just the currently matching rows, and
> returns that value, as in "view2" below.
>
> While these two are semantically equivalent, the planner doesn't
> understand that, and always executes them pretty much the way you would
> naively do it based on the text of the query.
>
> But view1 is pretty slow if the WHERE clause is highly selective (like
> "WHERE cutoff<0.00001") because it has to summarize the entire child
> table just to pull out a few rows.  But view2 is pretty slow if the
> entire view or most of it (like "WHERE cutoff<0.9") is being returned.
>
> Is there some 3rd way to write the query which allows the planner to
> switch between strategies (summarize whole table vs summarize values on
> demand) depending on the known selectivity of the where clause?
>
> In this case, the planner is getting the relative cost estimates roughly
> correct.  It is not a problem of mis-estimation.
>
> I can always create two views, view_small and view_large, and swap
> between them based on my own knowledge of how restrictive a query is
> likely to be, but that is rather annoying.  Especially in the real-world
> situation, which is quite a bit more complex than this.
>
> create table thing as select x as id, random() as cutoff from
> generate_series(1,2000000) f(x);
>
> create table thing_alias as select
> floor(power(random()*power(2000000,5),0.2))::int thing_id, md5(x::text),
> random() as priority from generate_series(1,150000) f(x);
>
> create index on thing_alias (thing_id );
>
> create index on thing (cutoff );
>
> vacuum; analyze;
>
> create view view1 as select id, md5,cutoff from thing left join
>   (
>      select distinct on (thing_id) thing_id, md5 from thing_alias
>      order by thing_id, priority desc
>   ) as foo
>   on (thing_id=id);
>
> create view view2 as select id,
>   (
>      select md5 from thing_alias where thing_id=id
>         order by priority desc limit 1
>   ) as md5,
>   cutoff from thing;
>
> Cheers,
>
> Jeff

Hi Jeff,

how does something like

CREATE OR REPLACE VIEW public.view3 AS
 SELECT thing.id,
    foo.md5,
    thing.cutoff
   FROM thing,
    LATERAL ( SELECT DISTINCT ON (thing_alias.thing_id)
thing_alias.thing_id,
            thing_alias.md5
           FROM thing_alias
          WHERE thing_alias.thing_id = thing.id
          ORDER BY thing_alias.thing_id, thing_alias.priority DESC) foo

work for you? At least that's always using an index scan here, as
opposed to view1, which (for me) defaults to a SeqScan on thing_alias at
a low cutoff.

*****
Note btw. that both view1 and view2 don't return any md5 values for me,
while view3 does!
*****

Results (ms, median of 3 runs):
cutoff<  0.1   0.9
view1:   348   1022
view2:   844   6484
view3:   842   5976

With

 LATERAL ( SELECT string_agg(thing_alias.md5, ','::text) AS md5
           FROM thing_alias
          WHERE thing_alias.thing_id = thing.id
          GROUP BY thing_alias.thing_id) foo

(which seems to make more sense ;-)

I yield 483 (0.1) and 3410 (0.9) ms (and return md5-Aggregates).

Cheers,
--
Gunnar "Nick" Bluth
DBA ELSTER

Tel:   +49 911/991-4665
Mobil: +49 172/8853339


Attachment

Re: [PERFORM] select subquery versus join subquery

From
Jeff Janes
Date:
On Tue, May 23, 2017 at 4:03 AM, Gunnar "Nick" Bluth <gunnar.bluth.extern@elster.de> wrote:
Am 05/22/2017 um 09:57 PM schrieb Jeff Janes:
>
> create view view2 as select id,
>   (
>      select md5 from thing_alias where thing_id=id
>         order by priority desc limit 1
>   ) as md5,
>   cutoff from thing;
>
> Cheers,
>
> Jeff

Hi Jeff,

how does something like

CREATE OR REPLACE VIEW public.view3 AS
 SELECT thing.id,
    foo.md5,
    thing.cutoff
   FROM thing,
    LATERAL ( SELECT DISTINCT ON (thing_alias.thing_id)
thing_alias.thing_id,
            thing_alias.md5
           FROM thing_alias
          WHERE thing_alias.thing_id = thing.id
          ORDER BY thing_alias.thing_id, thing_alias.priority DESC) foo

work for you? At least that's always using an index scan here, as
opposed to view1, which (for me) defaults to a SeqScan on thing_alias at
a low cutoff.

Unfortunately that always uses the index scan, even at a high cutoff where aggregation on the seq scan and then hash joining is more appropriate.  So it is very similar to view2, except that it doesn't return the rows from "thing" which have zero corresponding rows in thing_alias.

*****
Note btw. that both view1 and view2 don't return any md5 values for me,
while view3 does!
*****

Because of the way I constructed the data, using the power transform of the uniform random distribution, the early rows of the view (if sorted by thing_id) are mostly null in the md5 column, so if you only look at the first few screen-fulls you might not see any md5.  But your view does effectively an inner join rather than a left join, so your view gets rid of the rows with a NULL md5.  Most things don't have aliases; of the things that do, most have 1; and some have a several.

 

Cheers,

Jeff

Re: [PERFORM] select subquery versus join subquery

From
"Gunnar \"Nick\" Bluth"
Date:
Am 05/23/2017 um 06:59 PM schrieb Jeff Janes:
> On Tue, May 23, 2017 at 4:03 AM, Gunnar "Nick" Bluth
> <gunnar.bluth.extern@elster.de <mailto:gunnar.bluth.extern@elster.de>>
> wrote:
8>< -----
>
> Unfortunately that always uses the index scan, even at a high cutoff
> where aggregation on the seq scan and then hash joining is more
> appropriate.  So it is very similar to view2, except that it doesn't
> return the rows from "thing" which have zero corresponding rows in
> thing_alias.
>
>     *****
>     Note btw. that both view1 and view2 don't return any md5 values for me,
>     while view3 does!
>     *****
>
>
> Because of the way I constructed the data, using the power transform of
> the uniform random distribution, the early rows of the view (if sorted
> by thing_id) are mostly null in the md5 column, so if you only look at
> the first few screen-fulls you might not see any md5.  But your view
> does effectively an inner join rather than a left join, so your view
> gets rid of the rows with a NULL md5.  Most things don't have aliases;
> of the things that do, most have 1; and some have a several.

D'oh, of course! My bad... shouldn't have looked at the results with
LIMIT :-/

My next best guess would involve a MatView for the aggregates...
--
Gunnar "Nick" Bluth
DBA ELSTER

Tel:   +49 911/991-4665
Mobil: +49 172/8853339


Attachment