Re: Order by (for 15 rows) adds 30 seconds to query time - Mailing list pgsql-performance

From Kevin Grittner
Subject Re: Order by (for 15 rows) adds 30 seconds to query time
Date
Msg-id 4B3875C6020000250002D97D@gw.wicourts.gov
Whole thread Raw
In response to Re: Order by (for 15 rows) adds 30 seconds to query time  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Order by (for 15 rows) adds 30 seconds to query time  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yeah, that is expected.  Nestloop inner indexscans have a rowcount
> estimate that is different from that of the parent table --- the
> parent's rowcount is what would be applicable for another type of
> join, such as merge or hash, where the join condition is applied
> at the join node not in the relation scan.
>
> The problem here boils down to the fact that examine_variable
> punts on appendrel variables:

>  * XXX This means the Var represents a column of an append
>  * relation. Later add code to look at the member relations and
>  * try to derive some kind of combined statistics?

> This means you get a default estimate for the selectivity of the
> join condition, so the joinrel size estimate ends up being 0.005 *
> 1 * 40000.  That's set long before we ever generate indexscan
> plans, and I don't think there's any clean way to correct the size
> estimate when we do.

Thanks for the explanation.

> Fixing this has been on the to-do list since forever.

Which item?  I looked and couldn't find one which seems to fit.
(I was hoping to find a reference to a discussion thread.)

> I don't think we'll make much progress on it until we have an
> explicit notion of partitioned tables.

I'm not clear that a generalized solution for partitioned tables
would solve the production query from the OP.  The OP was using
table extension to model the properties of the data.  In the actual
production problem, the table structure involved, for example,
materials -- some of which were containers (which had all the
properties of other materials, plus some unique to containers); so
the containers table extends the materials table to model that.  In
the problem query, he wanted to find all the materials related to
some item.  He was also joining to location, which might be (among
other things) a waypoint or a container (both extending location).
Note that a container is both a location and a material.

> The approach contemplated in the comment, of assembling some stats
> on-the-fly from the stats for individual child tables, doesn't
> seem real practical from a planning-time standpoint.

Can you give a thumbnail sketch of why that is?

> The thing that you really want to know here is that there will be
> only one matching id value in the whole partitioned table

It would seem to matter nearly as much if statistics indicated you
would get five rows out of twenty million.

> it's difficult to extract from independent sets of stats.

Since we make the attempt for most intermediate results, it's not
immediately clear to me why it's so hard here.  Not that I'm
doubting your assertion that it *is* hard; I'm just trying to see
*why* it is.  Perhaps the code which generates such estimates for
everything else could be made available here?

The usefulness of inheritance to model data would seem to be rather
limited without better optimizer support.

-Kevin

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Order by (for 15 rows) adds 30 seconds to query time
Next
From: Tom Lane
Date:
Subject: Re: Order by (for 15 rows) adds 30 seconds to query time