Re: PATCH: adaptive ndistinct estimator v4 - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: PATCH: adaptive ndistinct estimator v4 |
Date | |
Msg-id | 55818880.10300@2ndquadrant.com Whole thread Raw |
In response to | Re: PATCH: adaptive ndistinct estimator v4 (Jeff Janes <jeff.janes@gmail.com>) |
List | pgsql-hackers |
Hi, On 05/13/15 23:07, Jeff Janes wrote: > With the warning it is very hard to correlate the discrepancy you do > see with which column is causing it, as the warnings don't include > table or column names (Assuming of course that you run it on a > substantial database--if you just run it on a few toy cases then the > warning works well). That's true. I've added attnum/attname to the warning in the attached version of the patch. > If we want to have an explicitly experimental patch which we want > people with interesting real-world databases to report back on, what > kind of patch would it have to be to encourage that to happen? Or are > we never going to get such feedback no matter how friendly we make > it? Another problem is that you really need to have the gold standard > to compare them to, and getting that is expensive (which is why we > resort to sampling in the first place). I don't think there is much > to be done on that front other than bite the bullet and just do > it--perhaps only for the tables which have discrepancies. Not sure. The "experimental" part of the patch was not really aimed at the users outside the development community - it was meant to be used by members of the community, possibly testing it on customer databases I don't think adding the GUC into the final release is a good idea, it's just a noise in the config no-one would actually use. > Some of the regressions I've seen are at least partly a bug: > > + /* find the 'm' value minimizing the difference */ > + for (m = 1; m <= total_rows; m += step) > + { > + double q = k / (sample_rows * m); > > sample_rows and m are both integers, and their product overflows > vigorously. A simple cast to double before the multiplication fixes > the first example I produced. The estimate goes from 137,177 to > 1,108,076. The reality is 1,062,223. > > Perhaps m should be just be declared a double, as it is frequently > used in double arithmetic. Yeah, I just discovered this bug independently. There's another bug that the adaptive_estimator takes total_rows as integer, so it breaks for tables with more than INT_MAX rows. Both are fixed in the v5. > > Therefore, I think that: > > 1. This should be committed near the beginning of a release cycle, > not near the end. That way, if there are problem cases, we'll have a > year or so of developer test to shake them out. > > > It can't hurt, but how effective will it be? Will developers know or > care whether ndistinct happened to get better or worse while they > are working on other things? I would think that problems will be > found by focused testing, or during beta, and probably not by > accidental discovery during the development cycle. It can't hurt, but > I don't know how much it will help. I agree with that - it's unlikely the regressions will get discovered randomly. OTOH I'd expect non-trivial number of people on this list to have a few examples of ndistinct failures, and testing those would be more useful I guess. But that's unlikely to find the cases that worked OK before and got broken by the new estimator :-( > I agree with the "experimental GUC". That way if hackers do happen to > see something suspicious, they can just turn it off and see what > difference it makes. If they have to reverse out a patch from 6 months > ago in an area of the code they aren't particularly interested in and > then recompile their code and then juggle two different sets of > binaries, they will likely just shrug it off without investigation. +1 -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: