Thread: Use zero for nullness estimates of system attributes

Use zero for nullness estimates of system attributes

From
Edmund Horner
Date:
I added some code to selfuncs.c to estimate the selectivity of CTID, including nullness, in my ongoing attempt to add TID range scans [1].  And as Tom pointed out [2], no system attribute can be null, so we might as well handle them all.

That's what the attached patch does.
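
Concretely, the most visible effect is on null-test estimates against system attributes.  A rough sketch (the table here is made up, and the exact numbers depend on the row count):

    CREATE TABLE t AS SELECT g AS id FROM generate_series(1, 10000) g;
    ANALYZE t;

    -- Previously the planner had no statistics for ctid and fell back to a
    -- small hardcoded nullness default (0.005), estimating roughly 50 rows
    -- for IS NULL and 9950 for IS NOT NULL.  With the patch, IS NULL should
    -- be estimated as (effectively) zero rows and IS NOT NULL as all 10000.
    EXPLAIN SELECT * FROM t WHERE ctid IS NULL;
    EXPLAIN SELECT * FROM t WHERE ctid IS NOT NULL;
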
I observed a few interesting things with outer join selectivity:

While system attributes aren't NULL in the table, they can be in queries such as:

    SELECT *
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE b.ctid IS NULL;

And the patch does affect the estimates for such plans.  But it just replaces one hardcoded nullness estimate (0.005) with another (0.0), which seems no worse than the original.

I was a bit concerned that with, for example,

    CREATE TABLE a (id INTEGER);
    INSERT INTO a SELECT * FROM generate_series(1,1000);
    ANALYZE a;
    CREATE TABLE b (id INTEGER, id2 INTEGER);
    INSERT INTO b SELECT *, * FROM generate_series(1,10);
    ANALYZE b;

    EXPLAIN ANALYZE
    SELECT * FROM a LEFT JOIN b ON a.id = b.id
    WHERE b.ctid IS NULL;

you get a row estimate of 1 (versus the actual 990).  This isn't specific to system attributes: the plain left-join selectivity calculation doesn't seem to take the join selectivity into account, while the anti-join calculation does.
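
For comparison, here's the same question written as an explicit anti join (just a sketch of the equivalent query against the tables above):

    EXPLAIN ANALYZE
    SELECT * FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id);

The anti-join estimate does use the join clause, so I'd expect it to land much closer to the actual 990.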

I do not think this affects the usefulness of the present patch, but maybe it's something we could improve.

Finally: I thought about introducing a macro to attnum.h:

    /*
     * AttrNumberIsForSystemAttr
     *        True iff the attribute number corresponds to a system attribute.
     */
    #define AttrNumberIsForSystemAttr(attributeNumber) \
        ((bool) ((attributeNumber) < 0))

But there's a zillion places that could be changed to use it, so I haven't in this version of the patch.

Edmund



Re: Use zero for nullness estimates of system attributes

From
Tom Lane
Date:
Edmund Horner <ejrh00@gmail.com> writes:
> I added some code to selfuncs.c to estimate the selectivity of CTID,
> including nullness, in my ongoing attempt to add TID range scans [1].  And
> as Tom pointed out [2], no system attribute can be null, so we might as
> well handle them all.
> That's what the attached patch does.

Seems pretty uncontroversial, so pushed.

> I observed a few interesting things with outer join selectivity:
> While system attributes aren't NULL in the table, they can be in queries
> such as:

Yeah, none of our selectivity calculations account for the possibility
that we're above a join that has affected the distribution of a Var's
values.  Going to NULL in an outer join is just part of that issue.
I don't feel this patch needs to solve it, and anyway it'd be a rather
massive rethink.

> Finally: I thought about introducing a macro to attnum.h:
>     #define AttrNumberIsForSystemAttr(attributeNumber) \
>      ((bool) ((attributeNumber) < 0))
> But there's a zillion places that could be changed to use it, so I haven't
> in this version of the patch.

I can't get too excited about that.  Even if the reader is unfamiliar with
the negative-attno convention, most of these places are commented in a way
that makes it clear what's going on.

            regards, tom lane


Re: Use zero for nullness estimates of system attributes

From
Jim Finnerty
Date:
re: Yeah, none of our selectivity calculations account for the possibility
that we're above a join that has affected the distribution of a Var's
values.

Tom, there's an analogous issue of adjusting distinct values on a per-column
basis based on the selectivity of other local predicates.  Several
commercial RDBMS's make such adjustments in an effort to get better
selectivity estimates when there are multiple local predicates.  Is this
something that the PG community has considered and decided not to do because
of the additional planning-time overhead that may be required, or just
something that hasn't been discussed or tackled yet?



-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL


Re: Use zero for nullness estimates of system attributes

From
Tom Lane
Date:
Jim Finnerty <jfinnert@amazon.com> writes:
> Tom, there's an analogous issue of adjusting distinct values on a per-column
> basis based on the selectivity of other local predicates.  Several
> commercial RDBMS's make such adjustments in an effort to get better
> selectivity estimates when there are multiple local predicates.  Is this
> something that the PG community has considered and decided not to do because
> of the additional planning-time overhead that may be required, or just
> something that hasn't been discussed or tackled yet?

I think what you're talking about is the correlated-variables problem,
which we have made a start on with the "extended statistics" mechanism,
though certainly a lot remains to be done.
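
For anyone not familiar with it, the mechanism looks roughly like this
(statistics object, table, and column names are made up):

    -- Collect cross-column statistics so that the planner can correct
    -- estimates for quals that mention both columns together.
    CREATE STATISTICS s_xy (ndistinct, dependencies) ON x, y FROM some_table;
    ANALYZE some_table;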

            regards, tom lane


Re: Use zero for nullness estimates of system attributes

From
Jim Finnerty
Date:
It's related, but what I was referring to applies even to the uncorrelated
case: suppose you have something like:

select x, sum(z) 
from t
where
    x > 5 and y in ('a', 'b', 'c')
group by x;

let's say that 'a', 'b', and 'c' are not frequent values of y, so the
estimated selectivity is based on the n_distinct of y and the 3 values.  Now
imagine that x > 5 is applied first.  That reduces the number of qualifying
rows by the selectivity of (x > 5), but it may also reduce the number of
distinct values of y.  If it reduces the n_distinct of y, then the IN
predicate selectivity should be adjusted also.

The negative n_distinct representation (used when n_distinct is a large
fraction of numrows) already provides automatic scaling of n_distinct when
the table grows or shrinks.  Adjusting the n_distinct of y after applying
the predicate (x > 5) is conceptually the same thing, except that the
scaling is not linear: it would be based on a (fairly cheap) probabilistic
formula for how many distinct values of y can be expected to survive the
predicate on x, rather than assuming that all of them do.
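
As an illustration of the kind of formula I mean (made-up numbers, and not
anything the planner does today): treat the filter as keeping each row
independently with probability s, so a value that occurs numrows/n_distinct
times survives with probability 1 - (1 - s)^(numrows/n_distinct).

    -- Hypothetical example: keep 200 of 1000 rows; y has 50 distinct values
    -- spread uniformly.  Nearly all of them are still expected to survive.
    SELECT round(ndistinct * (1 - power(1 - kept_rows / total_rows,
                                        total_rows / ndistinct))) AS survivors
    FROM (VALUES (1000.0, 200.0, 50.0)) AS v(total_rows, kept_rows, ndistinct);
    -- survivors comes out around 49 of the 50.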

If you have correlation information for columns (x, y) you'd want to use
that information instead, but in the absence of that information you could
make this adjustment.

It might make a good intern project if somebody wants to investigate it.



-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL


Re: Use zero for nullness estimates of system attributes

From
Tom Lane
Date:
Jim Finnerty <jfinnert@amazon.com> writes:
> It's related, but what I was referring to applies even to the uncorrelated
> case: suppose you have something like:

> select x, sum(z) 
> from t
> where
>     x > 5 and y in ('a', 'b', 'c')
> group by x;

> let's say that 'a', 'b', and 'c' are not frequent values of y, so the
> estimated selectivity is based on the n_distinct of y and the 3 values.  Now
> imagine that x > 5 is applied first.  That reduces the number of qualifying
> rows by the selectivity of (x > 5), but it may also reduce the number of
> distinct values of y.  If it reduces the n_distinct of y, then the IN
> predicate selectivity should be adjusted also.

I don't actually think that's a foregone conclusion.  If the two where
clauses are in fact independent, then simply multiplying their
selectivities together is the right thing.
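
A quick way to see it against Jim's example table t (hypothetical data, of
course): under independence, the fraction of rows passing both clauses is
just the product of the individual fractions, which is what multiplying the
selectivities assumes.

    -- If the clauses are truly independent, sel_both comes out close to
    -- sel_x * sel_y.
    SELECT avg((x > 5)::int)                          AS sel_x,
           avg((y IN ('a', 'b', 'c'))::int)           AS sel_y,
           avg((x > 5 AND y IN ('a', 'b', 'c'))::int) AS sel_both
    FROM t;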

            regards, tom lane