Thread: Re: Re: [COMMITTERS] pgsql: Implement multivariate n-distinct coefficients

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Robert Haas wrote:
>> dromedary and arapaima have failures like this, which seems likely
>> related to this commit:
>> 
>> EXPLAIN
>> SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
>> QUERY PLAN
>> ---------------------------------------------------------------------
>> !  HashAggregate  (cost=225.00..235.00 rows=1000 width=16)
>>      Group Key: a, d
>> !    ->  Seq Scan on ndistinct  (cost=0.00..150.00 rows=10000 width=8)
>> (3 rows)

> Yes.  What seems to be going on here is that both arapaima and
> dromedary are 32-bit machines; all the 64-bit ones are passing (except
> for prion, which showed a real relcache bug that I already stomped).
> Now, the difference is that the total cost on those machines for the
> seqscan is 150 instead of 155.  Tomas suggests that this happens because
> MAXALIGN is different, leading to tuples being packed differently: the
> expected cost (on our 64-bit laptops) is 155, and the cost we get on
> 32-bit is 150 -- so 5 pages of difference.  We insert 10000 rows into
> the table; 4 bytes per tuple would amount to 40 kB, which is exactly 5
> pages.

> I'll push an alternate expected file for this test, which we think is
> the simplest fix.

Why not use COSTS OFF?  Or I'll put that even more strongly: all the
existing regression tests use COSTS OFF, exactly to avoid this sort of
machine-dependent output.  There had better be a really damn good
reason not to use it here.
        regards, tom lane
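
The 5-page explanation quoted above can be checked with a little arithmetic. The sketch below assumes the planner's stock settings (seq_page_cost = 1.0, cpu_tuple_cost = 0.01) and the default 8 kB block size; the page counts are inferred from the reported costs, not measured on the buildfarm machines, and the function names are made up for illustration.

```python
# Assumed PostgreSQL defaults; a build with other settings gives other numbers.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
BLOCK_SIZE = 8192  # bytes per heap page


def seqscan_total_cost(pages, rows):
    """Total cost of an unfiltered sequential scan: page I/O plus per-tuple CPU."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST


def pages_for_cost(total_cost, rows):
    """Invert the formula above to recover the page count behind a reported cost."""
    return round((total_cost - rows * CPU_TUPLE_COST) / SEQ_PAGE_COST)


rows = 10000                            # rows=10000 in the Seq Scan line
pages_32 = pages_for_cost(150.0, rows)  # cost seen on the 32-bit machines
pages_64 = pages_for_cost(155.0, rows)  # cost in the expected (64-bit) output

# Extra bytes per tuple implied by the page difference.
extra_bytes_per_tuple = (pages_64 - pages_32) * BLOCK_SIZE / rows

print(pages_32, pages_64, extra_bytes_per_tuple)
```

Under these assumptions the two costs correspond to 50 and 55 pages, and the 5-page gap works out to about 4 bytes per tuple, which is consistent with an alignment difference between the two architectures.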



From: Alvaro Herrera

Tom Lane wrote:

> Why not use COSTS OFF?  Or I'll put that even more strongly: all the
> existing regression tests use COSTS OFF, exactly to avoid this sort of
> machine-dependent output.  There had better be a really damn good
> reason not to use it here.

If we use COSTS OFF, the test is completely pointless, as the plans look
identical regardless of whether the multivariate stats are being used or
not.

If we had a ROWS option to EXPLAIN that showed the estimated number of
rows but not the cost, that would be an option.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
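
No such ROWS option is described in this thread. As a rough sketch of the output it would produce, the following filters ordinary EXPLAIN text so that only the row estimate survives. The helper name and the regular expression are hypothetical, and the pattern only handles the plain `(cost=... rows=... width=...)` form shown in the failing diff.

```python
import re

# Matches the parenthesised estimate that EXPLAIN prints after a plan node,
# e.g. "(cost=225.00..235.00 rows=1000 width=16)".
_ESTIMATE = re.compile(r"\(cost=\d+\.\d+\.\.\d+\.\d+ rows=(\d+) width=\d+\)")


def keep_rows_only(plan_line):
    """Drop cost and width from one line of EXPLAIN output, keeping rows."""
    return _ESTIMATE.sub(r"(rows=\1)", plan_line)


plan = [
    " HashAggregate  (cost=225.00..235.00 rows=1000 width=16)",
    "   Group Key: a, d",
    "   ->  Seq Scan on ndistinct  (cost=0.00..150.00 rows=10000 width=8)",
]

for line in plan:
    print(keep_rows_only(line))
```

With the costs stripped, the 150-versus-155 difference no longer appears in the output, while the rows=1000 estimate on the HashAggregate node is still visible.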



Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Why not use COSTS OFF?  Or I'll put that even more strongly: all the
>> existing regression tests use COSTS OFF, exactly to avoid this sort of
>> machine-dependent output.  There had better be a really damn good
>> reason not to use it here.

> If we use COSTS OFF, the test is completely pointless, as the plans look
> identical regardless of whether the multivariate stats are being used or
> not.

Well, I think you are going to find that the exact costs are far too
fragile to have in the regression test output.  Just because you wish
you could test them this way doesn't mean you can.

> If we had a ROWS option to EXPLAIN that showed the estimated number of
> rows but not the cost, that would be an option.

Unlikely to be any better.  All these numbers are subject to lots of
noise, e.g. due to auto-analyze happening at unexpected times, random
sampling during analyze, etc.  If you try to constrain the test case
enough that none of that happens, I wonder how useful it will really be.

What I would suggest is devising a test case whereby you actually
get a different plan shape now than you did before.  That shouldn't
be too terribly hard, or else what was the point?
        regards, tom lane