Use int8 for int4/int2 aggregate accumulators? - Mailing list pgsql-hackers

From Tom Lane
Subject Use int8 for int4/int2 aggregate accumulators?
Date
Msg-id 24638.997133222@sss.pgh.pa.us
Whole thread Raw
Responses Re: Use int8 for int4/int2 aggregate accumulators?
List pgsql-hackers
This was discussed on pgsql-general a little bit on 21-July, but the
discussion died off without reaching a conclusion.  I'd like to
put out a concrete proposal and see if anyone has objections.

1. SUM() and AVG() for int2 and int4 inputs should accumulate the
running sum as an INT8, not a NUMERIC, for speed reasons.  INT8 seems
large enough to avoid overflow in practical situations.  The final
output datatype of AVG() will still be NUMERIC, but the final output
of SUM() will become INT8 for these two input types.

2. STDDEV() and VARIANCE() for int2 and int4 inputs will continue to
use NUMERIC for accuracy and overflow reasons (accumulating sum(x^2)
is much more prone to overflow than sum(x)).  So will all these
aggregates for INT8.

3. As a separate proposal, we could change COUNT()'s running counter
and output datatype from INT4 to INT8.  This would make it a little
slower but effectively overflow-proof.


All of these changes are within the latitude that the SQL92 spec
affords (it just says that the output values are exact numeric with
implementation-defined precision and scale).  Issues to consider are:

* On machines with no 8-byte-int C datatype, the accumulator would
effectively be int4.  This would make the behavior no worse than
currently for COUNT(), and no worse than it was in 7.0 for SUM() and
AVG(), so that doesn't bother me a whole lot.  But it would be a
new source of cross-platform behavioral differences.

* Changing the output datatype of these operations --- especially COUNT
--- might affect or even break applications.  We got a few complaints,
not many, about changing SUM() and AVG() from integer to NUMERIC output
in 7.1.  Changing SUM() to INT8 isn't likely to hurt anyone who survived
that transition.  But COUNT() is much more widely used and is more
likely to affect people.  Should we keep it at INT4 output to avoid
compatibility problems?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Notes about int8 sequences
Next
From: mlw
Date:
Subject: Re: Re: AW: Re: OID wraparound: summary and proposal