[PERFORM] select subquery versus join subquery - Mailing list pgsql-performance

From Jeff Janes
Subject [PERFORM] select subquery versus join subquery
Date
Msg-id CAMkU=1zxu+xZ7X44FpZ=Syot_STrUiGR9miB8K-HCqvb=KMtwA@mail.gmail.com
Whole thread Raw
Responses Re: [PERFORM] select subquery versus join subquery  ("Gunnar \"Nick\" Bluth" <gunnar.bluth.extern@elster.de>)
List pgsql-performance
I need to make a view which decorates rows from a parent table with aggregated values from a child table.  I can think of two ways to write it, one which aggregates the child table to make a new query table and joins the parent to that, as shown in "view1" below.  Or does subselect in the select list to aggregate just the currently matching rows, and returns that value, as in "view2" below.

While these two are semantically equivalent, the planner doesn't understand that, and always executes them pretty much the way you would naively do it based on the text of the query.

But view1 is pretty slow if the WHERE clause is highly selective (like "WHERE cutoff<0.00001") because it has to summarize the entire child table just to pull out a few rows.  But view2 is pretty slow if the entire view or most of it (like "WHERE cutoff<0.9") is being returned.

Is there some 3rd way to write the query which allows the planner to switch between strategies (summarize whole table vs summarize values on demand) depending on the known selectivity of the where clause?

In this case, the planner is getting the relative cost estimates roughly correct.  It is not a problem of mis-estimation.

I can always create two views, view_small and view_large, and swap between them based on my own knowledge of how restrictive a query is likely to be, but that is rather annoying.  Especially in the real-world situation, which is quite a bit more complex than this.

create table thing as select x as id, random() as cutoff from generate_series(1,2000000) f(x);

create table thing_alias as select floor(power(random()*power(2000000,5),0.2))::int thing_id, md5(x::text), random() as priority from generate_series(1,150000) f(x);

create index on thing_alias (thing_id );

create index on thing (cutoff );

vacuum; analyze;

create view view1 as select id, md5,cutoff from thing left join 
  (
     select distinct on (thing_id) thing_id, md5 from thing_alias 
     order by thing_id, priority desc
  ) as foo 
  on (thing_id=id);

create view view2 as select id, 
  (
     select md5 from thing_alias where thing_id=id 
        order by priority desc limit 1
  ) as md5, 
  cutoff from thing;

Cheers,

Jeff

pgsql-performance by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: [PERFORM] Bulk persistence strategy
Next
From: Clemens Eisserer
Date:
Subject: [PERFORM] Can postgresql plan a query using multiple CPU cores?