WIP: collect frequency statistics for arrays - Mailing list pgsql-hackers

From Alexander Korotkov
Subject WIP: collect frequency statistics for arrays
Msg-id AANLkTin02SOXNzFxju74Hf-uqhAmymR0-hyAodi4G0Of@mail.gmail.com
Responses Re: WIP: collect frequency statistics for arrays  (Robert Haas <robertmhaas@gmail.com>)
Re: WIP: collect frequency statistics for arrays  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
A WIP patch for statistics collection on arrays is attached. It largely copies the statistics collection for tsvector, with the following differences:
1) The default comparison, hash, and equality functions of the element data type are used (taken from the corresponding default operator classes).
2) The @> and && operators do not take into account how many times an element occurs in an array, e.g. '{1}'::int[] @> '{1,1}'::int[] is true, and so on. That is why the statistics collection and selectivity estimation functions count each distinct element of an array only once.
3) array_typanalyze collects the frequency of null elements as a separate value (like the maximum and minimum frequencies in ts_typanalyze). Currently it is not used in selectivity estimation, but it may be useful in the future.
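
To illustrate points 2 and 3, here is a minimal Python sketch of the counting idea; it is not the code from the patch. It assumes exact counting over an in-memory sample, and the names collect_element_stats and num_mcelem are hypothetical. Each array contributes at most once to an element's count, and null elements are tracked as a separate fraction.

```python
from collections import Counter


def collect_element_stats(rows, num_mcelem=10):
    """Sketch only. rows is an iterable of lists standing in for array values;
    None inside a list stands in for a null element.
    Returns (most common elements with their row frequencies, null-element fraction).
    """
    element_rows = Counter()  # element -> number of rows containing it at least once
    rows_with_null = 0
    total = 0
    for arr in rows:
        total += 1
        distinct = set(arr)  # duplicates inside one array are counted once
        if None in distinct:
            rows_with_null += 1
            distinct.discard(None)
        element_rows.update(distinct)
    if total == 0:
        return {}, 0.0
    # Keep only the num_mcelem most common elements (hypothetical cutoff).
    mcelem = {e: c / total for e, c in element_rows.most_common(num_mcelem)}
    return mcelem, rows_with_null / total
```

For the sample [[1, 1, 2], [1], [2, None], [3, 3, 3]] this gives frequencies 0.5, 0.5 and 0.25 for the elements 1, 2 and 3, and a null-element fraction of 0.25.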

I have also run into the following problems:
1) Selectivity estimation for the ANY and ALL keywords does not seem as easy as for operators, because their selectivity is estimated inside the planner. So the planner would have to be modified to do selectivity estimation for these keywords. Probably I'm missing something.
2) I didn't implement selectivity estimation for the "column <@ const" and "column = const" cases. The problem with "column <@ const" is that we need to estimate the frequency of occurrence of any element not in const. We could try to collect a statistic on the frequency of all elements that are not among the most common elements, based on the assumption that they occur independently. But I'm not sure that this statistic would be precise enough. The "column = const" case has an additional problem: the @> and && operators ignore element occurrence count and order, while the = operator requires an exact match. That's why statistics for the @> and && operators can be applied to = only very approximately.
3) I need to test selectivity estimation for arrays, but it is hard to tell which distributions are typical for arrays. For example, we know that the data in a tsvector is based on natural language, so we can assume something about its distribution. But we know almost nothing about the data in arrays, because an array can hold any data (a tsvector can also hold any data, but using it for non-natural-language data is outside its purpose).
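
The independence-based estimates discussed in problem 2 can be sketched roughly as follows. This is only an illustration in Python, not the estimator from the patch. It assumes that mcelem maps each most common element to the fraction of rows containing it, that elements occur independently, that default_freq is a hypothetical fallback frequency for elements missing from mcelem, and that rest_freq is a hypothetical aggregate probability that a row contains at least one element outside mcelem.

```python
import math


def sel_contains(mcelem, const, default_freq=0.0):
    """column @> const: every distinct element of const must be present."""
    return math.prod(mcelem.get(e, default_freq) for e in set(const))


def sel_overlap(mcelem, const, default_freq=0.0):
    """column && const: at least one distinct element of const is present."""
    return 1.0 - math.prod(1.0 - mcelem.get(e, default_freq) for e in set(const))


def sel_contained_by(mcelem, const, rest_freq=0.0):
    """column <@ const: no element outside const may be present.

    Known elements outside const are excluded one by one; rest_freq
    (hypothetical) covers all elements that are not in mcelem at all.
    """
    wanted = set(const)
    p = math.prod(1.0 - f for e, f in mcelem.items() if e not in wanted)
    return p * (1.0 - rest_freq)
```

With mcelem = {1: 0.5, 2: 0.5, 3: 0.25}, sel_contains(mcelem, [1, 1, 2]) gives 0.25 and sel_overlap(mcelem, [1, 3]) gives 0.625. Duplicates in const are collapsed with set(), matching the fact that @> and && ignore occurrence counts. The quality of the <@ estimate depends entirely on how well rest_freq can be measured, which is the open question raised above.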

------
With best regards,
Alexander Korotkov.
