From: Jeff Janes
Subject: statistics for array types
Msg-id: CAMkU=1x2W1gpEP3AQsrSA30uxQk1Sau5VDOLL4LkhWLwrOY8Lw@mail.gmail.com
Responses: Re: statistics for array types (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
           Re: statistics for array types (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-hackers
While reviewing some recent patches, I noticed that the statistics gathered for array types have some pre-existing shortcomings.

The main one is that when the arrays contain only rare elements, the MCE (most common elements) list ends up empty, and there is no histogram to fall back on the way there is for scalar stats.  So the selectivity estimator has to punt completely and use a hard-wired default of 0.5%, without recourse to any data at all.

The rationale for requiring an element to clear a frequency threshold before it is eligible for inclusion in the MCE array seems to be that this puts some theoretical bound on the amount of error we are likely to have in that element's estimated frequency.  But I think it is better to exceed that theoretical bound than to have no data at all.

The attached patch forces there to be at least one element in the MCE array, keeping the single element with the highest predicted frequency if the array would otherwise be empty.  Any other element queried for is then assumed to be no more common than that most common element.
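
To make that rule concrete, here is a rough SQL analogue of what the patched stats collection keeps, run against the foobar table built in the example below.  This is just a sketch of the idea, not the actual C code, and the 0.01 cutoff is purely illustrative, not the threshold ANALYZE really uses:

with elem_freqs as (
  select elem, count(*)::float8 / 100000 as freq
  from foobar, unnest(foo) as elem
  group by elem
), over_cutoff as (
  select * from elem_freqs where freq >= 0.01  -- illustrative cutoff only
)
select * from over_cutoff       -- everything that clears the cutoff...
union all
(select * from elem_freqs       -- ...or, if nothing does, force in
 where not exists (select 1 from over_cutoff)
 order by freq desc             -- the single most frequent element
 limit 1);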

I'd also briefly considered just having the part of the code that pulls the stats out of pg_stats interpret an empty MCE array as meaning that nothing is more frequent than the threshold, but then that part of the code would need to know how the threshold is chosen, which just seems wrong.  It would also need to distinguish a NULL MCE that means no stats were gathered from one that means stats were gathered but nothing met the threshold.

I'm only marginally interested in this area, but since I had already done the investigation I wanted to present it here.  I'd be quite happy if someone else wanted to take over the patch (or the concept behind it).

Here is an example where the estimate drops from 500 rows (the hard-wired 0.5% of 100,000) to 3 rows, while the correct number is zero:

create table foobar as
select (
  select
    array_agg(floor((random()*1000000000+g.y/1000000+f.x/10000000))::int)
    from generate_series(1,10) g(y)
  ) foo
from generate_series(1,100000) f(x);

Unpatched:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=500 width=61) (actual time=76.448..76.448 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.211 ms
 Execution time: 76.492 ms
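
The missing per-element data behind that 500-row guess can be seen through the standard pg_stats view (this query is just for inspection; nothing about it is specific to the patch).  With elements this rare, nothing met the cutoff, so both columns come back NULL:

select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'foobar' and attname = 'foo';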


With patch:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=3 width=61) (actual time=78.604..78.604 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.199 ms
 Execution time: 78.665 ms
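
As a sanity check (a brute-force query over the raw data, not anything the planner does), the 3-row estimate should be in the ballpark of the top per-element frequency, since the patch clamps any unlisted element to the frequency of the one retained MCE:

-- occurrences per element; with 10 draws from ~1e9 values per row,
-- this is effectively the number of rows containing the element
select elem, count(*) as occurrences,
       count(*)::float8 / 100000 as freq
from foobar, unnest(foo) as elem
group by elem
order by count(*) desc
limit 5;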

Cheers,

Jeff
