Thread: WIP: collect frequency statistics for arrays
A WIP patch adding statistics collection for arrays is attached. It generally follows the statistics collection for tsvector, but with the following differences:
1) The default comparison, hash, and equality functions for the element data type are used (taken from the corresponding default operator classes).
2) The @> and && operators don't take element occurrence counts into account, i.e. '{1}'::int[] @> '{1,1}'::int[] is true, and so on. That's why the statistics collection and selectivity estimation functions count each distinct array element only once (see the example after this list).
3) array_typanalyze collects the frequency of the null element into a separate value (like the maximum and minimum frequencies in ts_typanalyze). It is not currently used in selectivity estimation, but it may be useful in the future.
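For example (arbitrary literals, just to illustrate point 2):

    -- Both return true: @> and && only ask whether an element is present,
    -- not how many times it occurs.
    SELECT '{1}'::int[] @> '{1,1}'::int[];
    SELECT '{1,2}'::int[] && '{2,2,2}'::int[];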
I've also run into the following problems:
1) Selectivity estimation for the ANY and ALL keywords doesn't seem as easy as for operators, because their selectivity is estimated inside the planner. So the planner would have to be modified to do selectivity estimation for these keywords. Perhaps I'm missing something.
2) I didn't implement selectivity estimation for the "column <@ const" and "column = const" cases. The problem with "column <@ const" is that we need to estimate the frequency of occurrence of any element not in const. We could try to collect the total frequency of all elements outside the most common ones, based on the assumption that they occur independently, but I'm not sure that statistic would be precise enough. The "column = const" case has an additional problem: the @> and && operators ignore element occurrence counts and order, while the = operator requires an exact match, so statistics gathered for @> and && apply to = only very approximately (as illustrated in the example after this list).
3) I need to test the selectivity estimation for arrays, but it's hard to know which distributions are typical for arrays. For example, we know that tsvector data is based on natural language, so we can assume something about its distribution. But we know almost nothing about the data in arrays, because they can hold anything (a tsvector can also hold arbitrary data, but using it for non-natural-language data is outside its purpose).
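To make point 2 concrete (again with arbitrary literals), = is sensitive to order and occurrence counts while the indexable operators are not:

    SELECT '{1,2}'::int[] @> '{2,1}'::int[];      -- true: order is ignored
    SELECT '{1,2}'::int[] = '{2,1}'::int[];       -- false: = compares position by position
    SELECT '{1}'::int[] = '{1,1}'::int[];         -- false: = also compares lengths
    SELECT '{1,1,2}'::int[] <@ '{1,2,3}'::int[];  -- true: every left-hand element
                                                  -- occurs on the right side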
With best regards,
Alexander Korotkov.
Attachment
On Wed, Feb 23, 2011 at 10:00 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> A WIP patch adding statistics collection for arrays is attached.

Please add this patch to the currently open CommitFest at https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The second version of the patch is attached. Main changes:
1) Handling of the ANY and ALL keywords.
2) Statistics on array length are collected; they are used for the "column <@ const" case.
3) A relatively accurate estimation of "column <@ const" selectivity. This estimation is the hardest part of the patch. I'm worried that the calculations could be too expensive. Also, the current comments likely don't clarify much...
With best regards,
Alexander Korotkov.
Attachment
I forgot to commit before making the diff. Here is the correct version.
------
With best regards,
Alexander Korotkov.
Attachment
On Mon, May 23, 2011 at 6:54 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> The second version of the patch is attached. Main changes:
> 1) Handling of the ANY and ALL keywords.
> 2) Statistics on array length are collected; they are used for the
> "column <@ const" case.
> 3) A relatively accurate estimation of "column <@ const" selectivity.
> This estimation is the hardest part of the patch. I'm worried that the
> calculations could be too expensive. Also, the current comments likely
> don't clarify much...

Just had a brief look at this ahead of starting review.

Initial comments are that the code is well structured and I doubt there will be problems at the code level. Looks like a good patch.

At the moment I see no tests. If this code will be exercised by existing tests then you should put some notes with the patch to explain that, or at least provide some pointers as to how I might test this.

Also, I'd like to see some more explanation. Either in comments, or just as a post to hackers. That saves me time, but we need to be clear about what this does and does not do, what it might do in the future etc. 3+ years from now we need to be able to remember what the code was supposed to do. You will forget yourself in time, if you write enough patches. Based on this, I think you'll be writing quite a few more.

And of course, a few lines for the docs also.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 10, 2011 at 9:03 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Initial comments are that the code is well structured and I doubt
> there will be problems at the code level. Looks like a good patch.
I'm worried about the performance of the "column <@ const" estimation. It takes O(m*(n+m)) time, where m is the const length and n is the statistics target. That could be too slow in some cases.
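For a rough sense of scale (illustrative numbers only, not measurements): with the default statistics target n = 100 and a constant of m = 100 elements, that is about 100 * (100 + 100) = 20,000 operations for a single estimate, and raising the statistics target to 400 brings it to 100 * (400 + 100) = 50,000.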
> At the moment I see no tests. If this code will be exercised by
> existing tests then you should put some notes with the patch to
> explain that, or at least provide some pointers as to how I might test
> this.
I didn't find any existing tests that check selectivity estimation accuracy, and I found it difficult to create them, because regression tests give a binary result while estimation accuracy is a quantitative value. The existing regression tests only cover the case where a typanalyze or selectivity estimation function fails outright. I've added an "ANALYZE array_op_test;" line to the array test so that those tests cover that failure case for this patch's functions too.
It seems that selectivity estimation accuracy should be tested manually on various distributions. I've done a very small amount of such testing. Unfortunately, a few months passed before I got the idea for the "column <@ const" case, and now I don't have sufficient time for it due to my GSoC project. It would be great if you could help me with these tests.
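For reference, a manual accuracy check could look something like this (array_op_test and its int[] column i come from the existing array regression test; the constant is arbitrary):

    ANALYZE array_op_test;
    EXPLAIN ANALYZE SELECT * FROM array_op_test WHERE i @> '{32}'::int[];
    -- Compare the planner's estimated row count against the actual row
    -- count reported by EXPLAIN ANALYZE, repeating over several data
    -- distributions.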
> Also, I'd like to see some more explanation. Either in comments, or
> just as a post to hackers. That saves me time, but we need to be clear
> about what this does and does not do, what it might do in the future
> etc. 3+ years from now we need to be able to remember what the code
> was supposed to do. You will forget yourself in time, if you write
> enough patches. Based on this, I think you'll be writing quite a few
> more.
I've added some more comments. I'm afraid they may need to be completely rewritten before committing because of my English. If any particular points need more clarification, please point them out.
> And of course, a few lines for the docs also.
I found that the statistics patch for tsvector only updated the documentation for the pg_stats view, so I've updated that section a little as well.
With best regards,
Alexander Korotkov.
Attachment
On Sun, Jun 12, 2011 at 12:17 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I'm worried about the performance of the "column <@ const" estimation.
> It takes O(m*(n+m)) time, where m is the const length and n is the
> statistics target. That could be too slow in some cases.

Hmm, that doesn't sound terribly promising. The best thing to do here is probably to construct a pessimal case and test it. So, say, look for a column <@ a 100-element array with a statistics target of 400... once you know how bad it is, then we can make intelligent decisions about where to go with it.

If the data type is hashable, you could consider building a hash table on the MCVs and then do a probe for each element in the array. I think that's better than the other way around because there can't be more than 10k MCVs, whereas the input constant could be arbitrarily long. I'm not entirely sure whether this case is important enough to be worth spending a lot of code on, but then again it might not be that much code.

Another option is to bound the number of operations you're willing to perform to some reasonable limit, say, 10 * default_statistics_target. Work out ceil((10 * default_statistics_target) / number-of-elements-in-const) and consider at most that many MCVs. When this limit kicks in you'll get a less-accurate selectivity estimate, but that's a reasonable price to pay for not blowing out planning time.

>> At the moment I see no tests. If this code will be exercised by
>> existing tests then you should put some notes with the patch to
>> explain that, or at least provide some pointers as to how I might
>> test this.
>
> I didn't find any existing tests that check selectivity estimation
> accuracy, and I found it difficult to create them, because regression
> tests give a binary result while estimation accuracy is a quantitative
> value. The existing regression tests only cover the case where a
> typanalyze or selectivity estimation function fails outright. I've
> added an "ANALYZE array_op_test;" line to the array test so that those
> tests cover that failure case for this patch's functions too.
> It seems that selectivity estimation accuracy should be tested manually
> on various distributions. I've done a very small amount of such
> testing. Unfortunately, a few months passed before I got the idea for
> the "column <@ const" case, and now I don't have sufficient time for it
> due to my GSoC project. It would be great if you could help me with
> these tests.

AFAIK, we don't have regression tests for any other selectivity estimator; except perhaps to the extent that we verify they don't crash.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 13, 2011 at 8:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> If the data type is hashable, you could consider building a hash table
> on the MCVs and then do a probe for each element in the array. I
> think that's better than the other way around because there can't be
> more than 10k MCVs, whereas the input constant could be arbitrarily
> long. I'm not entirely sure whether this case is important enough to
> be worth spending a lot of code on, but then again it might not be
> that much code.

Unfortunately, the most time-consuming part isn't the element comparisons. It is the complex computations in the calc_distr function.
> Another option is to bound the number of operations you're willing to
> perform to some reasonable limit, say, 10 * default_statistics_target.
> Work out ceil((10 * default_statistics_target) /
> number-of-elements-in-const) and consider at most that many MCVs.
> When this limit kicks in you'll get a less-accurate selectivity
> estimate, but that's a reasonable price to pay for not blowing out
> planning time.
Good option. I'm going to add such a limit to my patch.
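For concreteness (my arithmetic, simply applying the formula above): with default_statistics_target = 100 the budget is 10 * 100 = 1000 operations, so a 25-element constant would consider at most ceil(1000 / 25) = 40 MCVs, while a 4-element constant could look at up to 250.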
With best regards,
Alexander Korotkov.
Here is a version of the patch with a few more comments and some fixes.
------
With best regards,
Alexander Korotkov.
Attachment
Alexander Korotkov wrote:
> Here is a version of the patch with a few more comments and some fixes.

Where are we on this?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
On Fri, Oct 14, 2011 at 3:24 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Alexander Korotkov wrote:
>> Here is a version of the patch with a few more comments and some fixes.

The CommitFest app page is here:

https://commitfest.postgresql.org/action/patch_view?id=539

It was reviewed in July by Nate Boley, and never updated. Looking now, I see that Alexander wasn't Cc'd on the review, so it's possible he missed the message?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> Looking now, I see that Alexander wasn't Cc'd on the review, so it's
> possible he missed the message?

We've corresponded off list and have discussed my review at some length. Alex submitted an updated patch on Sep 22 to me personally (although not to the list? Alex?), with the promise of a new version with improved comments.

Best,
Nathan
Hi!
Thanks for your attention to my patch!
On Sat, Oct 15, 2011 at 2:47 PM, Nathan Boley <npboley@gmail.com> wrote:
>> Looking now, I see that Alexander wasn't Cc'd on the review, so it's
>> possible he missed the message?
>
> We've corresponded off list and have discussed my review at some length.
> Alex submitted an updated patch on Sep 22 to me personally (although
> not to the list? Alex?), with the promise of a new version with
> improved comments.
Oh, I didn't notice that I'd posted the updated patch off-list. So, here is a repost of that version of the patch. The list of changes is below:
1) A distinct slot is used for the length histogram.
2) Standard statistics are collected for arrays.
3) Most common values and most common elements are mapped to distinct columns of the pg_stats view, because both are now calculated for arrays (see the query sketch below).
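For illustration, the new statistics could be inspected with something like the query below (a sketch only; the exact names of the new pg_stats columns may differ in the patch, and array_op_test/i are just the regression-test table and column):

    -- most_common_vals: most common whole arrays;
    -- most_common_elems / most_common_elem_freqs: most common elements
    -- and their frequencies.
    SELECT most_common_vals, most_common_elems, most_common_elem_freqs
    FROM pg_stats
    WHERE tablename = 'array_op_test' AND attname = 'i';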
Now it's hard for me to give this as much time as I would like, but I hope to present improved comments and testing by the end of October.
------
With best regards,
Alexander Korotkov.
Attachment
Alexander Korotkov wrote:
> Here is a version of the patch with a few more comments and some fixes.

Where are we on the idea of better statistics for arrays?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
>> Here is a version of the patch with a few more comments and some fixes.
>
> Where are we on the idea of better statistics for arrays?

I need to finish reviewing the patch: I'll try to send in something tomorrow night. So far it looks good.

Best,
Nathan