WIP: collect frequency statistics for arrays - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	WIP: collect frequency statistics for arrays
Date	February 23, 2011 14:00:56
Msg-id	AANLkTin02SOXNzFxju74Hf-uqhAmymR0-hyAodi4G0Of@mail.gmail.com
Responses	Re: WIP: collect frequency statistics for arrays (Robert Haas <robertmhaas@gmail.com>) Re: WIP: collect frequency statistics for arrays (Alexander Korotkov <aekorotkov@gmail.com>)
List	pgsql-hackers

A WIP patch for statistics collection on arrays is attached. It largely follows the statistics collection for tsvector, with the following differences:

1) The default comparison, hash and equality functions for the element data type are used (taken from the corresponding default operator classes).

2) The @> and && operators don't take into account how many times an element occurs within an array, e.g. '{1}'::int[] @> '{1,1}'::int[] is true, and so on. That's why the statistics collection and selectivity estimation functions count each distinct element only once per array.

3) array_typanalyze collects the frequency of null elements as a separate value (like the maximum and minimum frequencies in ts_typanalyze). Currently it is not used in selectivity estimation, but it may be useful in the future.
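
To illustrate points 2 and 3, here is a minimal sketch in Python of per-array deduplicated element counting with a separate null-element frequency. It is not the code from the attached patch: the function name collect_element_stats is hypothetical, NULL is modelled as None, and the counting is exact over an in-memory sample, whereas a real typanalyze function works on a row sample and may use an approximate counting scheme.

```python
from collections import Counter


def collect_element_stats(arrays):
    """Return (element frequencies, null-element frequency).

    Frequencies are fractions of non-NULL arrays that contain the
    element at least once. Hypothetical helper for illustration only.
    """
    element_rows = Counter()  # element -> number of arrays containing it
    null_rows = 0             # number of arrays containing a NULL element
    total = 0                 # number of non-NULL arrays seen

    for arr in arrays:
        if arr is None:       # a NULL array value is skipped entirely
            continue
        total += 1
        distinct = set(arr)   # count each element at most once per array
        if None in distinct:
            null_rows += 1
            distinct.discard(None)
        element_rows.update(distinct)

    if total == 0:
        return {}, 0.0
    freqs = {elem: rows / total for elem, rows in element_rows.items()}
    return freqs, null_rows / total
```

For example, calling it on [[1, 1, 2], [2, None], [3], None] counts element 1 once for the first array even though it appears twice, and reports the null-element frequency separately from the element frequencies.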

I've also run into the following problems:

1) Selectivity estimation for the ANY and ALL keywords does not seem as easy as for operators, because their selectivity is estimated inside the planner. So the planner would have to be modified to do selectivity estimation for these keywords. Probably I'm missing something.

2) I didn't implement selectivity estimation for the "column <@ const" and "column = const" cases. The problem with the "column <@ const" case is that we need to estimate the frequency of occurrence of any element not in const. We could try to collect the combined frequency of all elements that are not among the most common elements, based on the assumption that they occur independently. But I'm not sure this statistic would be precise enough. The "column = const" case has another problem as well: the @> and && operators ignore element occurrence count and order, while the = operator requires an exact match. That's why statistics for the @> and && operators can be applied to = only very approximately.

3) I need to test selectivity estimation for arrays, but it's hard to tell which distributions are typical for arrays. For example, we know that data in tsvector is based on natural language, so we can assume something about its distribution. But we know almost nothing about the data in arrays, because they can hold any data (a tsvector can also hold any data, but using it for non-natural-language data is outside its purpose).
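
To make the estimates discussed in point 2 concrete, here is a rough sketch in Python of how per-element frequencies could be combined if elements are assumed to occur independently. This is an illustration of the idea only, not the formulas from the attached patch. The function names are hypothetical, freqs is assumed to map each tracked element to the fraction of rows whose array contains it, and rest_freq is an assumed extra statistic: the fraction of rows containing at least one element outside the tracked set.

```python
def sel_contains(freqs, const, default=0.0):
    """Estimate "column @> const": every distinct element of const is present."""
    sel = 1.0
    for elem in set(const):            # duplicates in const do not matter
        sel *= freqs.get(elem, default)
    return sel


def sel_overlaps(freqs, const, default=0.0):
    """Estimate "column && const": at least one element of const is present."""
    none_present = 1.0
    for elem in set(const):
        none_present *= 1.0 - freqs.get(elem, default)
    return 1.0 - none_present


def sel_contained_by(freqs, const, rest_freq):
    """Estimate "column <@ const": no element outside const is present.

    rest_freq is the assumed probability that a row contains any
    element that is not tracked in freqs at all.
    """
    allowed = set(const)
    no_outsider = 1.0
    for elem, freq in freqs.items():
        if elem not in allowed:
            no_outsider *= 1.0 - freq
    return no_outsider * (1.0 - rest_freq)
```

With freqs = {1: 0.5, 2: 0.25}, the sketch gives 0.125 for "column @> '{1,1,2}'" and 0.625 for "column && '{1,2}'". How far real data departs from the independence assumption is exactly the precision concern raised above.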

------
With best regards,
Alexander Korotkov.

Attachment

arrayanalyze-0.1.patch.gz
