From: Jeff Janes
Subject: statistics for array types
Msg-id: CAMkU=1x2W1gpEP3AQsrSA30uxQk1Sau5VDOLL4LkhWLwrOY8Lw@mail.gmail.com
Responses: Re: statistics for array types (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
           Re: statistics for array types (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-hackers
While reviewing some recent patches, I noticed that the statistics gathered for array types have some pre-existing shortcomings.

The main one is that when the arrays contain only rare elements, the MCE (most common elements) list ends up empty, and there is no histogram to fall back on the way there is for scalar stats.  So the selectivity estimator has to punt completely and use a hard-wired default of 0.5%, without recourse to any data at all.

The rationale for requiring an element to clear a frequency threshold before it is eligible for inclusion in the MCE array seems to be that this puts some theoretical bound on the amount of error we are likely to have in that element's estimated frequency.  But I think it is better to exceed that theoretical bound than to have no data at all.

The attached patch forces there to be at least one element in the MCE array, keeping the single element with the highest predicted frequency if the array would otherwise be empty.  Any other element queried for is then assumed to be no more common than that most common element.
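
To make that rule concrete, here is a rough SQL analogue of what the patched stats collection keeps, run against the foobar table built in the example below.  This is just a sketch of the idea, not the actual C code, and the 0.01 cutoff is purely illustrative, not the threshold ANALYZE really uses:

with elem_freqs as (
  select elem, count(*)::float8 / 100000 as freq
  from foobar, unnest(foo) as elem
  group by elem
), over_cutoff as (
  select * from elem_freqs where freq >= 0.01  -- illustrative cutoff only
)
select * from over_cutoff       -- everything that clears the cutoff...
union all
(select * from elem_freqs       -- ...or, if nothing does, force in
 where not exists (select 1 from over_cutoff)
 order by freq desc             -- the single most frequent element
 limit 1);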

I'd also briefly considered just having the part of the code that pulls the stats out of pg_stats interpret an empty MCE array as meaning that nothing is more frequent than the threshold, but then that part of the code would need to know how the threshold is chosen, which just seems wrong.  It would also need to distinguish a NULL MCE that means no stats were gathered from one that means stats were gathered but nothing met the threshold.

I'm only marginally interested in this area, but since I had already done the investigation I wanted to present it here.  I'd be quite happy if someone else wanted to take over the patch (or the concept behind it).

Here is an example where the estimate drops from 500 rows (the hard-wired 0.5% of 100,000) to 3 rows, while the correct number is zero:

create table foobar as
select (
  select
    array_agg(floor((random()*1000000000+g.y/1000000+f.x/10000000))::int)
    from generate_series(1,10) g(y)
  ) foo
from generate_series(1,100000) f(x);

Unpatched:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=500 width=61) (actual time=76.448..76.448 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.211 ms
 Execution time: 76.492 ms
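
The missing per-element data behind that 500-row guess can be seen through the standard pg_stats view (this query is just for inspection; nothing about it is specific to the patch).  With elements this rare, nothing met the cutoff, so both columns come back NULL:

select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'foobar' and attname = 'foo';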


With patch:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=3 width=61) (actual time=78.604..78.604 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.199 ms
 Execution time: 78.665 ms
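
As a sanity check (a brute-force query over the raw data, not anything the planner does), the 3-row estimate should be in the ballpark of the top per-element frequency, since the patch clamps any unlisted element to the frequency of the one retained MCE:

-- occurrences per element; with 10 draws from ~1e9 values per row,
-- this is effectively the number of rows containing the element
select elem, count(*) as occurrences,
       count(*)::float8 / 100000 as freq
from foobar, unnest(foo) as elem
group by elem
order by count(*) desc
limit 5;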

Cheers,

Jeff
