Thread: Re: [HACKERS] Reducing the overhead of NUMERIC data

Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Simon Riggs
On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
> > fit length info into 1 or 2 bytes. So there's something to be said for a
> > small numeric type that has less overhead and a large numeric (what we
> > have today).
>
> I don't think it'd be worth having 2 types.  Remember that the weight is
> measured in base-10k digits.  Suppose for instance
>     sign        1 bit
>     weight        7 bits (-64 .. +63)
>     dscale        8 bits (0..255)
> This gives us a dynamic range of 1e-256 to 1e255 as well as the ability
> to represent up to 255 displayable fraction digits.  Does anyone know
> any real database applications where that's not enough?
>
> (I'm neglecting NaN here in supposing we need only 1 bit for sign,
> but we could have a special encoding for NaN.  Perhaps disallow the
> weight = -64 case and use that to signal NaN.)

I've coded a short patch to do this, which is the result of two
alternate patches and some thinking, but maybe not enough yet.

The patch given here differs from the above on two counts:

This sets...
#define NUMERIC_MAX_PRECISION        64

since

#define NUMERIC_MAX_RESULT_SCALE    (NUMERIC_MAX_PRECISION * 2)

We don't seem to be able to use all of the bits actually available to us
in the format. Perhaps we need to decouple these now? Previously we had
room for 14 bits, which gave a maximum of 16384; we were using a
NUMERIC_MAX_PRECISION of 1000, so doubling it didn't give problems.

The above on-disk format showed sign & weight together, whereas the
current code has sign and dscale together. Trying to put sign and weight
together is somewhat difficult, since weight is itself a signed value.

I coded it up that way around, which is reasonably straightforward but
harder than the patch enclosed here. But AFAICS - which isn't that far
normally, I grant you - doing things that way around would require some
twos-complement work to get things correct when weight is negative. That
worries me.

IMHO we should accept the step down in maximum numeric precision (down
to "only" 64 digits) rather than put extra processing into every
manipulation of a NUMERIC datatype. With luck, I've misunderstood and we
can have both performance and precision.

If not, I commend 64 digits to you as sufficient for every imaginable
purpose - saving 2 bytes off every numeric column. (And still 28 decimal
places more accurate than Oracle).

Best Regards, Simon Riggs


Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
>> I don't think it'd be worth having 2 types.  Remember that the weight is
>> measured in base-10k digits.  Suppose for instance
>>     sign        1 bit
>>     weight        7 bits (-64 .. +63)
>>     dscale        8 bits (0..255)

> I've coded a short patch to do this, which is the result of two
> alternate patches and some thinking, but maybe not enough yet.

What your patch does is

    sign        2 bits
    weight        8 bits (-128..127)
    dscale        6 bits (0..63)

which is simply pretty lame: weight effectively has a factor of 8 more
dynamic range than dscale in this representation.  What's the point of
being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale
only lets you show 63 fractional digits?  You've got to allocate the
bits in a saner fashion.  Yes, that takes a little more work.

Also, since the internal (unpacked) calculation representation has a
much wider dynamic range than this, it'd probably be appropriate to add
some range checks to the code that forms a packed value from unpacked.

            regards, tom lane

Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Simon Riggs
On Wed, 2005-11-02 at 13:46 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
> >> I don't think it'd be worth having 2 types.  Remember that the weight is
> >> measured in base-10k digits.  Suppose for instance
> >>     sign        1 bit
> >>     weight        7 bits (-64 .. +63)
> >>     dscale        8 bits (0..255)
>
> > I've coded a short patch to do this, which is the result of two
> > alternate patches and some thinking, but maybe not enough yet.
>
> What your patch does is

Thanks for checking this so quickly.

>
>     sign        2 bits

OK, that's just a mistake in my second patch. That's easily corrected.
Please ignore that for now.

>     weight        8 bits (-128..127)
>     dscale        6 bits (0..63)
>
> which is simply pretty lame: weight effectively has a factor of 8 more
> dynamic range than dscale in this representation.  What's the point of
> being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale
> only lets you show 63 fractional digits?  You've got to allocate the
> bits in a saner fashion.  Yes, that takes a little more work.

I wasn't trying to claim the bit assignment made sense. My point was
that the work to mangle the two fields together so that it does make
sense looked like it would take more CPU (since the standard
representation of signed integers differs for +ve and -ve values). It is
the extra CPU I'm worried about, not the wasted bits on the weight.
Spending CPU cycles on *all* numerics just so we can have numbers with
more than +/-64 decimal places doesn't seem a good trade. Hence I stuck
the numeric sign back on the dscale, which is why dscale and weight seem
out of balance.

So, AFAICS, the options are:

0. (current CVS tip) Numeric range up to 1000 digits, with an additional
   2 bytes per column value
1. Numeric range up to 128 digits, but with overhead to extract the last bit
2. Numeric range up to 64 digits

I'm suggesting we choose (2)... other views are welcome.

(I'll code it whichever way we decide.)

> Also, since the internal (unpacked) calculation representation has a
> much wider dynamic range than this, it'd probably be appropriate to add
> some range checks to the code that forms a packed value from unpacked.

Well, there already is one that does that; otherwise I would have added
one as you suggest. (The unpacked code uses int values, whereas the
previous packed format used u/int16 values.)

Best Regards, Simon Riggs


Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> I wasn't trying to claim the bit assignment made sense. My point was
> that the work to mangle the two fields together to make it make sense
> looked like it would take more CPU (since the standard representation of
> signed integers is different for +ve and -ve values). It is the more CPU
> I'm worried about, not the wasted bits on the weight.

I think that's purely hypothetical.  The existing representation, as
well as the one you propose, both require shift-and-mask operations
to pull the fields out of the packed format.  The format I'm suggesting
would require some different shift-and-mask operations.  As a first
approximation it'd be a wash, and any actual differences would be
CPU-specific enough that we shouldn't put a whole lot of weight on the
point.  (C compilers tend to be fairly bright about optimizing field
extraction operations.)

Moreover, you've forgotten the basic gating factor here, which is
whether such a proposal will get accepted at all.  Reducing the
available range from 1000 digits to 255 might pass without too much
objection, but dropping it down another factor of 4 to 63 starts to
bring it uncomfortably close to mattering to people.

[ thinks for a moment... ]  Actually, neither proposal is going to get
off the ground, because the parser's handling of numeric constants is
predicated on the assumption that type NUMERIC can handle any valid
value of FLOAT8, and so we can get away with converting to NUMERIC on
sight and then coercing to float later if parse analysis finds out the
constant should be float.  If the dynamic range of NUMERIC is less than
10^308 then this fails.  So we have to find another bit somewhere, or
the idea is dead in the water.

            regards, tom lane

Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Simon Riggs
On Wed, 2005-11-02 at 15:09 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I wasn't trying to claim the bit assignment made sense. My point was
> > that the work to mangle the two fields together to make it make sense
> > looked like it would take more CPU (since the standard representation of
> > signed integers is different for +ve and -ve values). It is the more CPU
> > I'm worried about, not the wasted bits on the weight.
>
> I think that's purely hypothetical.  The existing representation, as
> well as the one you propose, both require shift-and-mask operations
> to pull the fields out of the packed format.  The format I'm suggesting
> would require some different shift-and-mask operations.  As a first
> approximation it'd be a wash, and any actual differences would be
> CPU-specific enough that we shouldn't put a whole lot of weight on the
> point.  (C compilers tend to be fairly bright about optimizing field
> extraction operations.)

OK

> Moreover, you've forgotten the basic gating factor here, which is
> whether such a proposal will get accepted at all.  Reducing the
> available range from 1000 digits to 255 might pass without too much
> objection, but dropping it down another factor of 4 to 63 starts to
> bring it uncomfortably close to mattering to people.
>
> [ thinks for a moment... ]  Actually, neither proposal is going to get
> off the ground, because the parser's handling of numeric constants is
> predicated on the assumption that type NUMERIC can handle any valid
> value of FLOAT8, and so we can get away with converting to NUMERIC on
> sight and then coercing to float later if parse analysis finds out the
> constant should be float.  If the dynamic range of NUMERIC is less than
> 10^308 then this fails.  So we have to find another bit somewhere, or
> the idea is dead in the water.

Well, that certainly is obscure, I'll give you that. 308 huh?

The middle ground between 64 and 308 is somewhere around 255, yes? :-)

I'll get on it. Including Catch-308.

Best Regards, Simon Riggs





Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Simon Riggs
On Wed, 2005-11-02 at 15:09 -0500, Tom Lane wrote:

> [ thinks for a moment... ]  Actually, neither proposal is going to get
> off the ground, because the parser's handling of numeric constants is
> predicated on the assumption that type NUMERIC can handle any valid
> value of FLOAT8, and so we can get away with converting to NUMERIC on
> sight and then coercing to float later if parse analysis finds out the
> constant should be float.  If the dynamic range of NUMERIC is less than
> 10^308 then this fails.  So we have to find another bit somewhere, or
> the idea is dead in the water.

We convert a Value node to a Const in
backend/parser/parse_node.c:make_const

It seems straightforward enough to put in an additional test, similar to
the ones already there, so that if it's too big for a decimal we make it
a float straight away - only a float can be that big in that case. After
that I can't really see what the problem is?

There is only a single place where numeric_float8() is called anywhere
in the code, which is during selectivity calculations. In that case we
actually call numeric_float8_no_overflow(). If it's a FLOAT8OID, then we
can simply avoid the conversion, since it would already be one.

Can you explain further? Thanks,

Best Regards, Simon Riggs




Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> It seems straightforward enough to put in an additional test, similar to
> the ones already there so that if its too big for a decimal we make it a
> float straight away - only a float can be that big in that case. After
> that I can't really see what the problem is?

Wrong answer.  You'll be introducing weird corner cases into the type
resolution behavior.

An approach that would actually have some credibility would be to not
resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC
pseudotype with coercion behavior comparable to the existing UNKNOWN
type for string literals.  This has been proposed before but hasn't
really been needed so far.  Of course, this converts the project from a
minor localized hack on NUMERIC into a major piece of fiddling with the
type resolution rules, with the potential for unforeseen side-effects on
the behavior of other data types.  It might be worth doing anyway --- I
don't recall at the moment what problems UNKNOWNNUMERIC was intended to
solve, but if they're still open issues then it's something we ought to
get around to sometime.

            regards, tom lane

Re: [HACKERS] Reducing the overhead of NUMERIC data

From: "Jim C. Nasby"
On Wed, Nov 02, 2005 at 06:12:37PM -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > It seems straightforward enough to put in an additional test, similar to
> > the ones already there so that if its too big for a decimal we make it a
> > float straight away - only a float can be that big in that case. After
> > that I can't really see what the problem is?
>
> Wrong answer.  You'll be introducing weird corner cases into the type
> resolution behavior.
>
> An approach that would actually have some credibility would be to not
> resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC
> pseudotype with coercion behavior comparable to the existing UNKNOWN
> type for string literals.  This has been proposed before but hasn't
> really been needed so far.  Of course, this converts the project from a
> minor localized hack on NUMERIC into a major piece of fiddling with the
> type resolution rules, with the potential for unforeseen side-effects on
> the behavior of other data types.  It might be worth doing anyway --- I
> don't recall at the moment what problems UNKNOWNNUMERIC was intended to
> solve, but if they're still open issues then it's something we ought to
> get around to sometime.

Thought I'd look to see if I could find anything about UNKNOWNNUMERIC,
but no such luck (ISTM we really need a better way to find discussion on
old ideas...) But while looking I did find this TODO, which might be
relevant to the current discussion:

# Change NUMERIC to enforce the maximum precision, and increase it

Unfortunately I can't find any reference to that in the archives...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Andrew Dunstan
[patches removed]

Tom Lane wrote:

>Simon Riggs <simon@2ndquadrant.com> writes:
>
>>It seems straightforward enough to put in an additional test, similar to
>>the ones already there so that if its too big for a decimal we make it a
>>float straight away - only a float can be that big in that case. After
>>that I can't really see what the problem is?
>
>Wrong answer.  You'll be introducing weird corner cases into the type
>resolution behavior.
>
>An approach that would actually have some credibility would be to not
>resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC
>pseudotype with coercion behavior comparable to the existing UNKNOWN
>type for string literals.  This has been proposed before but hasn't
>really been needed so far.  Of course, this converts the project from a
>minor localized hack on NUMERIC into a major piece of fiddling with the
>type resolution rules, with the potential for unforeseen side-effects on
>the behavior of other data types.  It might be worth doing anyway --- I
>don't recall at the moment what problems UNKNOWNNUMERIC was intended to
>solve, but if they're still open issues then it's something we ought to
>get around to sometime.

Could someone please quantify how much bang we might get for what seems
like quite a lot of bucks?

I appreciate the need for speed, but the saving here strikes me as
marginal at best, unless my instincts are all wrong (quite possible).

cheers

andrew

Re: [HACKERS] Reducing the overhead of NUMERIC data

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> Could someone please quantify how much bang we might get for what seems
> like quite a lot of bucks?
> I appreciate the need for speed, but the saving here strikes me as
> marginal at best, unless my instincts are all wrong (quite possible)

Two bytes per numeric value is not a lot, agreed.

If we were willing to invent the "varlena2" datum format then we could
save four bytes per numeric, plus reduce numeric's alignment requirement
from int to short which would probably save another byte per value on
average.  I'm not sure that that's worth doing if numeric and inet are
the only beneficiaries, but it might be.

From a speed perspective the only thing favoring UNKNOWNNUMERIC is the
possibility for saving some conversion operations when the grammar's
initial guess about datatype is wrong.  That's probably pretty marginal
though.  I was thinking of it more as a vehicle for helping us clean up
the type-assignment behavior some more.  The example I have in my notes
is that "float4col = 1.8" is certain to fail since 1.8 will be taken as
float8, not float4, and then you have precision mismatch problems.
You can make it work by quoting: "float4col = '1.8'" but that's surely
pretty ugly.  If the constant were initially UNKNOWNNUMERIC then it
would end up as float4 without that hack.

            regards, tom lane

Re: [HACKERS] Reducing the overhead of NUMERIC data

From: "Qingqing Zhou"
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> If we were willing to invent the "varlena2" datum format then we could
> save four bytes per numeric, plus reduce numeric's alignment requirement
> from int to short which would probably save another byte per value on
> average.  I'm not sure that that's worth doing if numeric and inet are
> the only beneficiaries, but it might be.
>

I would support "varlena2" from the viewpoint of user applications.
Another benefit is for the char types. Many applications come from DB2,
Oracle or SQL Server:

            Max Char Length
DB2         32672
SQL Server  8000
Oracle      4000

All of the above would need only varlena2. To support bigger char types,
we could follow the tradition of "long varchar", etc.

Regards,
Qingqing