Thread: Reducing the overhead of NUMERIC data
Currently, the overhead of the NUMERIC datatype is 8 bytes. Each value is stored on disk as:

    typedef struct NumericData
    {
        int32    varlen;        /* Variable size (std varlena header) */
        int16    n_weight;      /* Weight of 1st digit */
        uint16   n_sign_dscale; /* Sign + display scale */
        char     n_data[1];     /* Digits (really array of NumericDigit) */
    } NumericData;

Let's see if we can reduce that:

varlen is int32 to match the standard varlena header. However, the max number of digits of the datatype is less than the threshold at which values get toasted. So no NUMERIC values ever get toasted - in which case, why worry about matching the size of varlena? Let's reduce it to 2 bytes, which still gives us up to 1000 digits as we have now.

n_weight seems to exist because we do not store trailing zeroes. So 1000000 is stored as 1 with a weight of 6. My experience is that large numbers of trailing zeroes do not occur with any frequency in real measurement or financial data, and that this is an over-optimization. This is probably a hangover from the original algorithm, rather than a conscious design goal for PostgreSQL?

n_sign_dscale shows us where the decimal point is. We could actually store a marker representing the decimal point, which would cost us 0.5 byte rather than 2 bytes. Since we use 4 bits to represent a decimal digit, that leaves a few bit patterns spare to represent either decimal-point-and-positive-sign or decimal-point-and-negative-sign. (We would still need to store trailing zeroes after the decimal point.)

So, assuming I have this all correct, we could reduce the on-disk storage for NUMERIC datatypes to the following struct. This gives an overhead of just 2.5 bytes, plus the loss of the trailing-zeroes optimization, which I assess as having almost no value anyway in 99.9999% of data values (literally...).
    typedef struct NumericData
    {
        int16    varlen;        /* Variable size (std varlena header) */
        char     n_data[1];     /* Digits (really array of NumericDigit),
                                 * with the sign/decimal point stored as a
                                 * 0.5-byte marker within the digits */
    } NumericData;

A saving of 5.5 bytes per NUMERIC column per row could make an interesting difference to row length in large tables. The above would require a reasonable amount of change to the existing datatype code, so I'm not suggesting it's a simple change to an .h file. But the changes as proposed would seem able to be made to the existing NUMERIC type, rather than inventing another similar one.

Is my thinking straight on this?

Best Regards, Simon Riggs
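For reference, the on-disk digits are actually base-10000 NumericDigits, so "1000000" is really one digit (100) with a base-10000 weight of 1, rather than 1 with a weight of 6. The decomposition, including the trailing-zero stripping discussed above, can be sketched as follows; this is illustrative code written for this discussion, not the actual PostgreSQL implementation:

```c
#include <assert.h>

#define NBASE 10000   /* NumericDigits are base-10000 in the real format */

/* Decompose a non-negative integer into base-NBASE digits plus a weight,
 * dropping trailing zero digits as the on-disk format does.  Simplified
 * sketch: integers only, no sign, no dscale. */
static int to_numeric_digits(long val, short digits[], int *weight)
{
    short tmp[16];
    int n = 0, strip = 0;

    if (val == 0) { *weight = 0; return 0; }   /* zero stores no digits */
    while (val > 0) { tmp[n++] = (short)(val % NBASE); val /= NBASE; }
    while (tmp[strip] == 0) strip++;           /* trailing zero digits */
    *weight = n - 1;              /* exponent of 1st digit, in units of NBASE */
    for (int i = 0; i < n - strip; i++)
        digits[i] = tmp[n - 1 - i];
    return n - strip;             /* number of stored digits */
}
```

So 1000000 comes out as the single digit 100 with weight 1 (i.e. 100 × 10000^1), which is the base-10000 form of "1 with a weight of 6".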
Simon Riggs <simon@2ndquadrant.com> writes:
> varlen is int32 to match the standard varlena header. However, the max
> number of digits of the datatype is less than the threshold at which
> values get toasted. So no NUMERIC values ever get toasted - in which
> case, why worry about matching the size of varlena - let's reduce it to 2
> bytes which still gives us up to 1000 digits as we have now.

Because that will require an extra case in the code that disassembles tuples, which will slow down *everything*, even in databases that don't contain one single NUMERIC value. I think you need more than "let's save 2 bytes in NUMERICs" to justify that.

> n_weight seems to exist because we do not store trailing zeroes. So
> 1000000 is stored as 1 with a weight of 6. My experience is that large
> numbers of trailing zeroes do not occur with any frequency in real
> measurement or financial data and that this is an over-optimization.

That seems debatable. Keep in mind that not storing extra zeroes cuts computation time as well as storage.

> n_sign_dscale shows us where the decimal point is. We could actually
> store a marker representing the decimal point, which would cost us 0.5
> byte rather than 2 bytes. Since we have 4 bits to represent a decimal
> digit, that leaves a few bits spare to represent either a decimal-
> point-and-positive-sign and decimal-point-and-negative-sign. (We would
> still need to store trailing zeroes even after the decimal point).

This is completely bogus. How are you going to remember the sign except by always storing a marker? ISTM this proposal just moves the sign/decimal-point overhead from one place to another, ie, somewhere in the NumericDigit array instead of in a fixed field. Also, you can't drop dscale without abandoning the efficient base-10000 representation, at least not unless you want people complaining that the database shows NUMERIC(3) data with four decimal places.
It might be reasonable to restrict the range of NUMERIC to the point that we could fit the weight/sign/dscale into 2 bytes instead of 4, thereby saving 2 bytes per NUMERIC. I'm not excited about the other aspects of this, though. regards, tom lane
On Tue, Nov 01, 2005 at 09:22:17PM +0000, Simon Riggs wrote:
> varlen is int32 to match the standard varlena header. However, the max
> number of digits of the datatype is less than the threshold at which
> values get toasted. So no NUMERIC values ever get toasted - in which
> case, why worry about matching the size of varlena - let's reduce it to 2
> bytes which still gives us up to 1000 digits as we have now.

The other ideas may have merit, I don't know. But this one is a no-goer. The backend currently recognises three forms of Datum:

- Fixed length, by value: integers, chars, anything short enough to fit in a word
- Fixed length, by reference: datetime, etc., anything that's fixed length but too long for a word
- Variable length: anything variable: text, varchar(), etc.

The last all, without exception, have a varlena header. This makes the code easy, because all variable-length values look the same for copying, loading, storing, etc. You are proposing a fourth type, say VARLENA2, which looks a lot like a varlena but isn't. I think the sheer volume of code that would need to be checked is huge. Also, things like pg_attribute would need changing because you have to represent this new state somehow. I seriously doubt this is going to happen.

Your other possible optimisations have other issues.

> n_weight seems to exist because we do not store trailing zeroes. So
> 1000000 is stored as 1 with a weight of 6. My experience is that large
> numbers of trailing zeroes do not occur with any frequency in real
> measurement or financial data and that this is an over-optimization.
> This is probably a hangover from the original algorithm, rather than a
> conscious design goal for PostgreSQL?

But if you are storing large numbers then it's helpful. Whether it's worth the cost...

> n_sign_dscale shows us where the decimal point is. We could actually
> store a marker representing the decimal point, which would cost us 0.5
> byte rather than 2 bytes. Since we have 4 bits to represent a decimal
> digit, that leaves a few bits spare to represent either a decimal-
> point-and-positive-sign and decimal-point-and-negative-sign. (We would
> still need to store trailing zeroes even after the decimal point).

Consider the algorithm: a number is stored as base + exponent. To multiply two numbers you can multiply the bases and add the exponents. OTOH, if you store the decimal point inside the data, now you have to extract it again before you can do any calculating. So you've traded CPU time for disk space. Is disk space cheaper or more expensive than CPU? Debatable, I guess.

> So, assuming I have this all correct, means we could reduce the on disk
> storage for NUMERIC datatypes to the following struct. This gives an
> overhead of just 2.5 bytes, plus the loss of the optimization of
> trailing zeroes, which I assess as having almost no value anyway in
> 99.9999% of data values (literally...).

Actually, I have a table with a column declared as numeric(12,4) because there have to be 4 decimal places. As it turns out, the decimal places are mostly zero, so the optimisation works for me.

Interesting ideas, but there are a lot of hurdles to jump, I think...

-- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
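The multiplication point above can be made concrete with a toy mantissa-and-exponent type. This struct and helper are invented here purely for illustration (not PostgreSQL code): with base and exponent held in separate fields, multiplying needs no scan for a decimal-point marker embedded in the digit string.

```c
#include <assert.h>

/* Toy number: value = mantissa * 10^exponent.  With this shape,
 * multiplication is multiply-the-mantissas, add-the-exponents --
 * no need to first locate a decimal-point marker in the data. */
struct toy_num { long mantissa; int exponent; };

static struct toy_num toy_mul(struct toy_num a, struct toy_num b)
{
    struct toy_num r;
    r.mantissa = a.mantissa * b.mantissa;   /* overflow ignored in this sketch */
    r.exponent = a.exponent + b.exponent;
    return r;
}
```

If the decimal point were instead a marker inside the digit array, each operand would first have to be rescanned to recover the exponent before this one-line multiply could run.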
On Tue, Nov 01, 2005 at 04:54:11PM -0500, Tom Lane wrote:
> It might be reasonable to restrict the range of NUMERIC to the point
> that we could fit the weight/sign/dscale into 2 bytes instead of 4,
> thereby saving 2 bytes per NUMERIC. I'm not excited about the other
> aspects of this, though.

FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to fit length info into 1 or 2 bytes. So there's something to be said for a small numeric type that has less overhead and a large numeric (what we have today).

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, Nov 01, 2005 at 11:16:58PM +0100, Martijn van Oosterhout wrote:
> Consider the algorithm: A number is stored as base + exponent. To
> multiply two numbers you can multiply the bases and add the exponents.
> OTOH, if you store the decimal inside the data, now you have to extract
> it again before you can do any calculating. So you've traded CPU time
> for disk space. Is diskspace cheaper or more expensive than CPU?
> Debatable I guess.

Well, I/O bandwidth is much more expensive than either CPU or disk space...

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Martijn van Oosterhout <kleptog@svana.org> writes:
> You are proposing a fourth type, say VARLENA2, which looks a lot like a
> varlena but it's not. I think the sheer volume of code that would need
> to be checked is huge. Also, things like pg_attribute would need
> changing because you have to represent this new state somehow.

It wouldn't be an impossible amount of code --- for precedent see back when we made cstring into a full-fledged datatype (cstring is already a fourth option in your list, BTW). That patch wasn't all that large, IIRC. The issue in my mind is the performance implications of adding an additional case to places that are already hotspots. There were compelling functional reasons to pay that price to make cstring work, but "save 2 bytes per numeric" doesn't seem like it rises to that level. Maybe if we had a few other datatypes that could also use the feature. [ thinks... ] inet/cidr comes to mind but I don't see any others. The case seems a bit weak :-(

regards, tom lane
On Tue, 2005-11-01 at 16:54 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > varlen is int32 to match the standard varlena header. However, the max
> > number of digits of the datatype is less than the threshold at which
> > values get toasted. So no NUMERIC values ever get toasted - in which
> > case, why worry about matching the size of varlena - let's reduce it to 2
> > bytes which still gives us up to 1000 digits as we have now.
>
> Because that will require an extra case in the code that disassembles
> tuples, which will slow down *everything* even in databases that don't
> contain one single NUMERIC value. I think you need more than "let's
> save 2 bytes in NUMERICs" to justify that.

OK

> > n_weight seems to exist because we do not store trailing zeroes. So
> > 1000000 is stored as 1 with a weight of 6. My experience is that large
> > numbers of trailing zeroes do not occur with any frequency in real
> > measurement or financial data and that this is an over-optimization.
>
> That seems debatable. Keep in mind that not storing extra zeroes cuts
> computation time as well as storage.

Check what % difference this makes. 2 bytes on everything makes more difference than a 1-byte saving on a few percent of values.

> > n_sign_dscale shows us where the decimal point is. We could actually
> > store a marker representing the decimal point, which would cost us 0.5
> > byte rather than 2 bytes. Since we have 4 bits to represent a decimal
> > number, that leaves a few bits spare to represent either a decimal-
> > point-and-positive-sign and decimal-point-and-negative-sign. (We would
> > still need to store trailing zeroes even after the decimal point).
>
> This is completely bogus. How are you going to remember the sign except
> by always storing a marker? ISTM this proposal just moves the
> sign/decimalpoint overhead from one place to another, ie, somewhere in
> the NumericDigit array instead of in a fixed field.

That is exactly my proposal. Thus an overhead of 0.5 bytes rather than 2 bytes, as I explained... but

> Also, you can't drop dscale without abandoning the efficient base-10000
> representation, at least not unless you want people complaining that the
> database shows NUMERIC(3) data with four decimal places.

... I take it I have misunderstood the storage format.

> It might be reasonable to restrict the range of NUMERIC to the point
> that we could fit the weight/sign/dscale into 2 bytes instead of 4,
> thereby saving 2 bytes per NUMERIC. I'm not excited about the other
> aspects of this, though.

That seems easily doable - it seemed like something would stick. Restricting the total number of digits to 255 and a max scale of 254 would allow that saving, yes? We can then have a BIGNUMERIC which would allow up to 1000 digits for anybody out there who ever got that high.

I'm sure there's a few %, so I won't dismiss you entirely, guys...

Best Regards, Simon Riggs
On Tue, Nov 01, 2005 at 05:40:35PM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > You are proposing a fourth type, say VARLENA2, which looks a lot like a
> > varlena but it's not. I think the sheer volume of code that would need
> > to be checked is huge. Also, things like pg_attribute would need
> > changing because you have to represent this new state somehow.
>
> It wouldn't be an impossible amount of code --- for precedent see back
> when we made cstring into a full-fledged datatype (cstring is already
> a fourth option in your list BTW). That patch wasn't all that large IIRC.
> The issue in my mind is the performance implications of adding an
> additional case to places that are already hotspots. There were
> compelling functional reasons to pay that price to make cstring work,
> but "save 2 bytes per numeric" doesn't seem like it rises to that level.
> Maybe if we had a few other datatypes that could also use the feature.
> [ thinks... ] inet/cidr comes to mind but I don't see any others.
> The case seems a bit weak :-(

Would varchar(255) fit into that case? There's a heck of a lot of people who use that as "well, dunno how big this is so I'll just use 255". A better use case would be places where you know you'll only need 10-20 characters; saving 2 bytes in those cases would likely be worth it. This would work for char as well (and given that people are probably not in the habit of defining very large chars, it would probably be even more useful there). Of course that means either yet another varchar/char type, or we have some automatic cut-over for fields defined over a certain size...

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
"Jim C. Nasby" <jnasby@pervasive.com> writes: > FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to > fit length info into 1 or 2 bytes. So there's something to be said for a > small numeric type that has less overhead and a large numeric (what we > have today). I don't think it'd be worth having 2 types. Remember that the weight is measured in base-10k digits. Suppose for instancesign 1 bitweight 7 bits (-64 .. +63)dscale 8 bits(0..255) This gives us a dynamic range of 1e-256 to 1e255 as well as the ability to represent up to 255 displayable fraction digits. Does anyone know any real database applications where that's not enough? (I'm neglecting NaN here in supposing we need only 1 bit for sign, but we could have a special encoding for NaN. Perhaps disallow the weight = -64 case and use that to signal NaN.) regards, tom lane
On Tue, 2005-11-01 at 23:16 +0100, Martijn van Oosterhout wrote lots of useful things, thank you.

> > So, assuming I have this all correct, means we could reduce the on disk
> > storage for NUMERIC datatypes to the following struct. This gives an
> > overhead of just 2.5 bytes, plus the loss of the optimization of
> > trailing zeroes, which I assess as having almost no value anyway in
> > 99.9999% of data values (literally...).
>
> Actually, I have a table with a column declared as numeric(12,4)
> because there has to be 4 decimal places. As it turns out, the decimal
> places are mostly zero so the optimisation works for me.

Of course it fits some data. The point is whether it is useful for most people's data. My contention is that *most* (but definitely nowhere near all) NUMERIC data is either financial or measured data. That usually means it has digits that follow Benford's Law - which for this discussion is a variant on a uniform random distribution. Optimizing for trailing zeroes just isn't worth the very minimal benefits, in most cases. It doesn't really matter that it saves on storage and processing time in those cases - Amdahl's Law says we can ignore that saving because the optimized case is not prevalent enough for us to care.

Anybody like to work out a piece of SQL to perform data profiling and derive the distribution of values with trailing zeroes? I'd be happy to be proved wrong with an analysis of real data tables.

Best Regards, Simon Riggs
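The per-value measurement behind the profiling request above is just a count of trailing zero decimal digits — the digits the n_weight optimization compresses away. A minimal sketch of that check (an illustrative helper; the actual profiling would of course be SQL run over real tables):

```c
#include <assert.h>

/* Count trailing zero decimal digits of an integer value, i.e. how many
 * digits the trailing-zeroes optimization would avoid storing. */
static int trailing_zero_digits(long val)
{
    int n = 0;
    if (val == 0)
        return 0;   /* all-zero is a separate case, as noted downthread */
    if (val < 0)
        val = -val;
    while (val % 10 == 0) { val /= 10; n++; }
    return n;
}
```

Running this over a sample of column values and histogramming the results would give exactly the distribution being asked for.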
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Tue, Nov 01, 2005 at 05:40:35PM -0500, Tom Lane wrote: >> Maybe if we had a few other datatypes that could also use the feature. >> [ thinks... ] inet/cidr comes to mind but I don't see any others. >> The case seems a bit weak :-( > Would varchar(255) fit into that case? That's attractive at first thought, but not when you stop to consider that most of the string-datatype support is built around the assumption that text, varchar, and char share the same underlying representation. You'd have to write a whole bunch of new code to support such a datatype. regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Anybody like to work out a piece of SQL to perform data profiling and
> derive the distribution of values with trailing zeroes?

Don't forget leading zeroes. And all-zero (we omit digits entirely in that case). I don't think you can claim that zero isn't a common case.

regards, tom lane
On 11/1/05 2:38 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
> FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
> fit length info into 1 or 2 bytes. So there's something to be said for a
> small numeric type that has less overhead and a large numeric (what we
> have today).

The 38-digit limit is the decimal size of a 128-bit signed integer. The optimization has less to do with the size of the length info and more to do with fast math and fixed structure size.

J. Andrew Rogers
On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Anybody like to work out a piece of SQL to perform data profiling and
> > derive the distribution of values with trailing zeroes?
>
> Don't forget leading zeroes. And all-zero (we omit digits entirely in
> that case). I don't think you can claim that zero isn't a common case.

The question is: how common?

For INTEGERs I would accept that many are often zero. For NUMERIC, these are seldom exactly zero, IMHO.

This is one of those issues where we need to run tests and take input. We cannot decide this sort of thing just by debate alone. So, I'll leave this as a less potentially fruitful line of enquiry.

Best Regards, Simon Riggs
On 11/2/05, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > Anybody like to work out a piece of SQL to perform data profiling and
> > > derive the distribution of values with trailing zeroes?
> >
> > Don't forget leading zeroes. And all-zero (we omit digits entirely in
> > that case). I don't think you can claim that zero isn't a common case.
>
> The question is: how common?
>
> For INTEGERs I would accept that many are often zero. For NUMERIC, these
> are seldom exactly zero, IMHO.

Seconded. My INTEGER data does have quite a few zeros, but most of my NUMERIC columns hold debits and credits. Those are almost never zero.

> This is one of those issues where we need to run tests and take input.
> We cannot decide this sort of thing just by debate alone. So, I'll leave
> this as a less potentially fruitful line of enquiry.
>
> Best Regards, Simon Riggs

-- Mike Rylander mrylander@gmail.com GPLS -- PINES Development Database Developer http://open-ils.org
I am not able to quickly find your numeric format, so I'll just throw this in. MaxDB (I only mention this because the format and algorithms are now under the GPL, so they can be reviewed by the public) uses a nifty number format that allows the use of memcmp to compare two numbers when they are in the same precision and scale. Basically, the first byte contains the sign and the number of digits in the number (the number of digits is complemented if the number is negative), then the next N bytes contain the actual decimal digits, where N is the number of decimal digits / 2 (so two decimal digits per byte). Trailing 0's are removed to save space. So:

    0     is stored as {128}
    1     is stored as {193, 16}
    1000  is stored as {196, 16}
    1001  is stored as {196, 16, 1}   (hex: C4 10 01)
    -1    is stored as {63, 144}
    -1001 is stored as {60, 144}

Their storage allows for a max of 63 digits in a number, but it should be no problem to increase the size to 2 bytes, thus allowing up to 16,385 digits.

The advantages are:
- the ability to memcmp two numbers
- compact storage (can be made more compact if you choose to save hex digits instead of decimal, but I'm not sure you want to do that)

The disadvantages are as follows:
- this format does not remember the database definition for the number (that is, no precision or scale); numeric functions must be told what they are. It would be nice if the number kept track of that as well...
- comparing two numbers that are not the same precision and scale means converting one or both (if both precision and scale are different you may have to convert both)
- calculations (addition, subtraction, etc.) require functions to extract the digits and do the calculation a digit at a time
- I do not know of any trig functions, so they would need to be written

If anyone is interested, I would be happy to discuss this further.

Mike Pollard SUPRA Server SQL Engineering and Support Cincom Systems, Inc
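The memcmp property can be demonstrated with a deliberately simplified variant of such a scheme (written for this discussion, not MaxDB's exact encoding: non-negative integers only, the first byte is a plain digit count rather than the biased sign/exponent byte, and complement handling for negatives is omitted):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* First byte: total decimal digit count (larger magnitudes sort later).
 * Then two BCD digits per byte, trailing zero digits stripped. */
static int encode(long val, uint8_t out[])
{
    int d[20], n = 0, strip = 0, len = 1, hi = -1;

    if (val == 0) { out[0] = 0; return 1; }          /* zero: count 0 */
    while (val > 0) { d[n++] = (int)(val % 10); val /= 10; }
    while (d[strip] == 0) strip++;                   /* strip trailing 0s */
    out[0] = (uint8_t) n;
    for (int i = n - 1; i >= strip; i--) {           /* most significant first */
        if (hi < 0) hi = d[i];
        else { out[len++] = (uint8_t)((hi << 4) | d[i]); hi = -1; }
    }
    if (hi >= 0) out[len++] = (uint8_t)(hi << 4);    /* pad odd digit */
    return len;
}

/* memcmp on the shared prefix; on a tie the shorter encoding (fewer
 * significant digits) is the smaller number. */
static int num_cmp(long a, long b)
{
    uint8_t ea[12], eb[12];
    int la = encode(a, ea), lb = encode(b, eb);
    int c = memcmp(ea, eb, (size_t)(la < lb ? la : lb));
    return c != 0 ? c : la - lb;
}
```

With this layout 1000 encodes as {4, 0x10} and 1001 as {4, 0x10, 0x01}, mirroring the {196, 16} / {196, 16, 1} examples above modulo the sign bias, and byte-wise comparison reproduces numeric order without decoding.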
On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
> > fit length info into 1 or 2 bytes. So there's something to be said for a
> > small numeric type that has less overhead and a large numeric (what we
> > have today).
>
> I don't think it'd be worth having 2 types. Remember that the weight is
> measured in base-10k digits. Suppose for instance
>     sign    1 bit
>     weight  7 bits (-64 .. +63)
>     dscale  8 bits (0..255)
> This gives us a dynamic range of 1e-256 to 1e255 as well as the ability
> to represent up to 255 displayable fraction digits. Does anyone know
> any real database applications where that's not enough?
>
> (I'm neglecting NaN here in supposing we need only 1 bit for sign,
> but we could have a special encoding for NaN. Perhaps disallow the
> weight = -64 case and use that to signal NaN.)

I've coded a short patch to do this, which is the result of two alternate patches and some thinking, but maybe not enough yet. The patch given here is different on two counts from the above.

This sets

    #define NUMERIC_MAX_PRECISION 64

since

    #define NUMERIC_MAX_RESULT_SCALE (NUMERIC_MAX_PRECISION * 2)

We don't seem to be able to use all of the bits actually available to us in the format. Perhaps we need to decouple these now? Previously, we had room for 14 bits, which gave a maximum of 16384. We were using a NUMERIC_MAX of 1000, so doubling it didn't give problems.

The above on-disk format showed sign & weight together, whereas the current code has sign and dscale together. Trying to put sign and weight together is somewhat difficult, since weight is itself a signed value. I coded it up that way around, which is reasonably straightforward but harder than the patch enclosed here. But AFAICS - which isn't that far normally, I grant you - doing things that way around would require some twos-complement work to get things correct when weight is negative.
That worries me. IMHO we should accept the step down in maximum numeric precision (down to "only" 64 digits) rather than put extra processing into every manipulation of a NUMERIC datatype. With luck, I've misunderstood and we can have both performance and precision. If not, I commend 64 digits to you as sufficient for every imaginable purpose - saving 2 bytes off every numeric column. (And still 28 decimal places more accurate than Oracle). Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
>> I don't think it'd be worth having 2 types. Remember that the weight is
>> measured in base-10k digits. Suppose for instance
>>     sign    1 bit
>>     weight  7 bits (-64 .. +63)
>>     dscale  8 bits (0..255)
>
> I've coded a short patch to do this, which is the result of two
> alternate patches and some thinking, but maybe not enough yet.

What your patch does is

    sign    2 bits
    weight  8 bits (-128..127)
    dscale  6 bits (0..63)

which is simply pretty lame: weight effectively has a factor of 8 more dynamic range than dscale in this representation. What's the point of being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale only lets you show 63 fractional digits? You've got to allocate the bits in a saner fashion. Yes, that takes a little more work.

Also, since the internal (unpacked) calculation representation has a much wider dynamic range than this, it'd probably be appropriate to add some range checks to the code that forms a packed value from unpacked.

regards, tom lane
On Wed, Nov 02, 2005 at 08:48:25AM +0000, Simon Riggs wrote:
> On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > Anybody like to work out a piece of SQL to perform data profiling and
> > > derive the distribution of values with trailing zeroes?
> >
> > Don't forget leading zeroes. And all-zero (we omit digits entirely in
> > that case). I don't think you can claim that zero isn't a common case.
>
> The question is: how common?
>
> For INTEGERs I would accept that many are often zero. For NUMERIC, these
> are seldom exactly zero, IMHO.
>
> This is one of those issues where we need to run tests and take input.
> We cannot decide this sort of thing just by debate alone. So, I'll leave
> this as a less potentially fruitful line of enquiry.

Is it worth coming up with some script that users can run against a table to provide us with real data?

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
On Wed, 2005-11-02 at 13:46 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I've coded a short patch to do this, which is the result of two
> > alternate patches and some thinking, but maybe not enough yet.
>
> What your patch does is

Thanks for checking this so quickly.

>     sign    2 bits

OK, that's just a mistake in my second patch. That's easily corrected. Please ignore that for now.

>     weight  8 bits (-128..127)
>     dscale  6 bits (0..63)
>
> which is simply pretty lame: weight effectively has a factor of 8 more
> dynamic range than dscale in this representation. What's the point of
> being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale
> only lets you show 63 fractional digits? You've got to allocate the
> bits in a saner fashion. Yes, that takes a little more work.

I wasn't trying to claim the bit assignment made sense. My point was that the work to mangle the two fields together to make it make sense looked like it would take more CPU (since the standard representation of signed integers is different for +ve and -ve values). It is the extra CPU I'm worried about, not the wasted bits on the weight. Spending CPU cycles on *all* numerics just so we can have numbers with > +/-64 decimal places doesn't seem a good trade. Hence I stuck the numeric sign back on the dscale, and so dscale and weight seem out of balance.

So, AFAICS, the options are:

0. (current cvstip) numeric range up to 1000 digits, with an additional 2 bytes per column value
1. numeric range up to 128 digits, but with overhead to extract the last bit
2. numeric range up to 64 digits

I'm suggesting we choose (2)... other views are welcome. (I'll code it whichever way we decide.)
> Also, since the internal (unpacked) calculation representation has a
> much wider dynamic range than this, it'd probably be appropriate to add
> some range checks to the code that forms a packed value from unpacked.

Well, there already is one that does that, otherwise I would have added one as you suggest. (The unpacked code has int values, whereas the previous packed format used u/int16 values.)

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> I wasn't trying to claim the bit assignment made sense. My point was
> that the work to mangle the two fields together to make it make sense
> looked like it would take more CPU (since the standard representation of
> signed integers is different for +ve and -ve values). It is the more CPU
> I'm worried about, not the wasted bits on the weight.

I think that's purely hypothetical. The existing representation, as well as the one you propose, both require shift-and-mask operations to pull the fields out of the packed format. The format I'm suggesting would require some different shift-and-mask operations. As a first approximation it'd be a wash, and any actual differences would be CPU-specific enough that we shouldn't put a whole lot of weight on the point. (C compilers tend to be fairly bright about optimizing field extraction operations.)

Moreover, you've forgotten the basic gating factor here, which is whether such a proposal will get accepted at all. Reducing the available range from 1000 digits to 255 might pass without too much objection, but dropping it down another factor of 4 to 63 starts to bring it uncomfortably close to mattering to people.

[ thinks for a moment... ] Actually, neither proposal is going to get off the ground, because the parser's handling of numeric constants is predicated on the assumption that type NUMERIC can handle any valid value of FLOAT8, and so we can get away with converting to NUMERIC on sight and then coercing to float later if parse analysis finds out the constant should be float. If the dynamic range of NUMERIC is less than 10^308 then this fails. So we have to find another bit somewhere, or the idea is dead in the water.

regards, tom lane
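The 10^308 constraint can be checked against the header layouts under discussion. On IEEE-754 platforms DBL_MAX is just under 1.8 × 10^308, so a base-10000 weight must reach about 308/4 ≈ 77 to cover it, while a 7-bit signed weight tops out at 63. The arithmetic, as a sketch:

```c
#include <assert.h>
#include <float.h>

/* Smallest base-10000 weight w whose magnitude range (values below
 * 10000^(w+1), i.e. 10^(4w+4)) covers the largest FLOAT8 value.
 * Illustrative arithmetic for the thread's gating constraint. */
static int weight_needed_for_double(void)
{
    return (DBL_MAX_10_EXP + 3) / 4;   /* ceil((308 + 1) / 4) = 77 */
}
```

Since 77 > 63, a 7-bit weight cannot cover FLOAT8's range, which is exactly why another bit (or a wider weight field) has to be found somewhere.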
On Wed, Nov 02, 2005 at 12:53:07PM -0600, Jim C. Nasby wrote:
> > This is one of those issues where we need to run tests and take input.
> > We cannot decide this sort of thing just by debate alone. So, I'll leave
> > this as a less potentially fruitful line of enquiry.
>
> Is it worth coming up with some script that users can run against a
> table to provide us with real data?

Like I said, I have a few columns of numeric(12,4). They're costs in cents, to 4 decimal places. The test is (column = trunc(column)). Sample data:

      col1   |  col2   |  col3  |  col4
    ---------+---------+--------+---------
     21.0000 | 10.1818 | 0.0000 | 21.0000
     22.0000 | 11.2727 | 0.0000 | 22.0000
     22.0000 |  6.0909 | 0.0000 | 22.0000

For each column (across 17 million rows):

    Col 1: 83% trailing zeros
    Col 2: 49%
    Col 3: 94%
    Col 4: 83%

AIUI, I currently get the four decimal places for free, and the idea is to store them explicitly. Fact is, things that cost fractions of cents are not that common, in this database anyway. As for the argument in general, this table is so wide that any gain will vanish into the slack at the end of a block, so it won't actually matter...

Have a nice day,
-- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
On Wed, 2005-11-02 at 15:09 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I wasn't trying to claim the bit assignment made sense. My point was > > that the work to mangle the two fields together to make it make sense > > looked like it would take more CPU (since the standard representation of > > signed integers is different for +ve and -ve values). It is the more CPU > > I'm worried about, not the wasted bits on the weight. > > I think that's purely hypothetical. The existing representation, as > well as the one you propose, both require shift-and-mask operations > to pull the fields out of the packed format. The format I'm suggesting > would require some different shift-and-mask operations. As a first > approximation it'd be a wash, and any actual differences would be > CPU-specific enough that we shouldn't put a whole lot of weight on the > point. (C compilers tend to be fairly bright about optimizing field > extraction operations.) OK > Moreover, you've forgotten the basic gating factor here, which is > whether such a proposal will get accepted at all. Reducing the > available range from 1000 digits to 255 might pass without too much > objection, but dropping it down another factor of 4 to 63 starts to > bring it uncomfortably close to mattering to people. > > [ thinks for a moment... ] Actually, neither proposal is going to get > off the ground, because the parser's handling of numeric constants is > predicated on the assumption that type NUMERIC can handle any valid > value of FLOAT8, and so we can get away with converting to NUMERIC on > sight and then coercing to float later if parse analysis finds out the > constant should be float. If the dynamic range of NUMERIC is less than > 10^308 then this fails. So we have to find another bit somewhere, or > the idea is dead in the water. Well, that certainly is obscure, I'll give you that. 308 huh? The middle ground between 64 and 308 is somewhere around 255, yes? :-) I'll get on it. 
Including Catch-308. Best Regards, Simon Riggs
On Wed, 2005-11-02 at 15:09 -0500, Tom Lane wrote: > [ thinks for a moment... ] Actually, neither proposal is going to get > off the ground, because the parser's handling of numeric constants is > predicated on the assumption that type NUMERIC can handle any valid > value of FLOAT8, and so we can get away with converting to NUMERIC on > sight and then coercing to float later if parse analysis finds out the > constant should be float. If the dynamic range of NUMERIC is less than > 10^308 then this fails. So we have to find another bit somewhere, or > the idea is dead in the water. We convert a Value node to a Const in backend/parser/parse_node.c:make_const. It seems straightforward enough to put in an additional test, similar to the ones already there, so that if it's too big for a decimal we make it a float straight away - only a float can be that big in that case. After that I can't really see what the problem is? There is only a single call where numeric_float8() is called anywhere in the code, which is during selectivity calculations. In that case we actually call numeric_float8_no_overflow(). If it's a FLOAT8OID, then we can simply avoid the conversion, since it already would be one. Can you explain further? Thanks, Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > It seems straightforward enough to put in an additional test, similar to > the ones already there so that if its too big for a decimal we make it a > float straight away - only a float can be that big in that case. After > that I can't really see what the problem is? Wrong answer. You'll be introducing weird corner cases into the type resolution behavior. An approach that would actually have some credibility would be to not resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC pseudotype with coercion behavior comparable to the existing UNKNOWN type for string literals. This has been proposed before but hasn't really been needed so far. Of course, this converts the project from a minor localized hack on NUMERIC into a major piece of fiddling with the type resolution rules, with the potential for unforeseen side-effects on the behavior of other data types. It might be worth doing anyway --- I don't recall at the moment what problems UNKNOWNNUMERIC was intended to solve, but if they're still open issues then it's something we ought to get around to sometime. regards, tom lane
On Wed, Nov 02, 2005 at 06:12:37PM -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > It seems straightforward enough to put in an additional test, similar to > > the ones already there so that if its too big for a decimal we make it a > > float straight away - only a float can be that big in that case. After > > that I can't really see what the problem is? > > Wrong answer. You'll be introducing weird corner cases into the type > resolution behavior. > > An approach that would actually have some credibility would be to not > resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC > pseudotype with coercion behavior comparable to the existing UNKNOWN > type for string literals. This has been proposed before but hasn't > really been needed so far. Of course, this converts the project from a > minor localized hack on NUMERIC into a major piece of fiddling with the > type resolution rules, with the potential for unforeseen side-effects on > the behavior of other data types. It might be worth doing anyway --- I > don't recall at the moment what problems UNKNOWNNUMERIC was intended to > solve, but if they're still open issues then it's something we ought to > get around to sometime. Thought I'd look to see if I could find anything about UNKNOWNNUMERIC, but no such luck (ISTM we really need a better way to find discussion on old ideas...) But while looking I did find this TODO, which might be relevant to the current discussion: # Change NUMERIC to enforce the maximum precision, and increase it Unfortunately I can't find any reference to that in the archives... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
[patches removed] Tom Lane wrote: >Simon Riggs <simon@2ndquadrant.com> writes: > > >>It seems straightforward enough to put in an additional test, similar to >>the ones already there so that if its too big for a decimal we make it a >>float straight away - only a float can be that big in that case. After >>that I can't really see what the problem is? >> >> > >Wrong answer. You'll be introducing weird corner cases into the type >resolution behavior. > >An approach that would actually have some credibility would be to not >resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC >pseudotype with coercion behavior comparable to the existing UNKNOWN >type for string literals. This has been proposed before but hasn't >really been needed so far. Of course, this converts the project from a >minor localized hack on NUMERIC into a major piece of fiddling with the >type resolution rules, with the potential for unforeseen side-effects on >the behavior of other data types. It might be worth doing anyway --- I >don't recall at the moment what problems UNKNOWNNUMERIC was intended to >solve, but if they're still open issues then it's something we ought to >get around to sometime. > > > > Could someone please quantify how much bang we might get for what seems like quite a lot of bucks? I appreciate the need for speed, but the saving here strikes me as marginal at best, unless my instincts are all wrong (quite possible) cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Could someone please quantify how much bang we might get for what seems > like quite a lot of bucks? > I appreciate the need for speed, but the saving here strikes me as > marginal at best, unless my instincts are all wrong (quite possible) Two bytes per numeric value is not a lot, agreed. If we were willing to invent the "varlena2" datum format then we could save four bytes per numeric, plus reduce numeric's alignment requirement from int to short which would probably save another byte per value on average. I'm not sure that that's worth doing if numeric and inet are the only beneficiaries, but it might be. From a speed perspective the only thing favoring UNKNOWNNUMERIC is the possibility for saving some conversion operations when the grammar's initial guess about datatype is wrong. That's probably pretty marginal though. I was thinking of it more as a vehicle for helping us clean up the type-assignment behavior some more. The example I have in my notes is that "float4col = 1.8" is certain to fail since 1.8 will be taken as float8, not float4, and then you have precision mismatch problems. You can make it work by quoting: "float4col = '1.8'" but that's surely pretty ugly. If the constant were initially UNKNOWNNUMERIC then it would end up as float4 without that hack. regards, tom lane
On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > Could someone please quantify how much bang we might get for what seems > > like quite a lot of bucks? > > I appreciate the need for speed, but the saving here strikes me as > > marginal at best, unless my instincts are all wrong (quite possible) > > Two bytes per numeric value is not a lot, agreed. I'm optimising for Data Warehousing. If you have a very large table with a higher proportion of numerics on it, then your saving can be >5% of tablesize which could be very useful. For the general user, it might produce less benefit, I accept. At the moment we've established we can do this fairly much for free. i.e. nobody cares about the drop in digits (to 255) and the other coding seems very minimal. I've accepted that the runtime costs are almost zero. Even the coding is small, so we've talked more lines than the patch already. Seems like a quick win; Tom is helping me to make sure it is not a quick and dirty win and I very much appreciate that. [I've taken this off -patches to show that we are not really discussing minor mods to the original patch anymore.] > If we were willing to invent the "varlena2" datum format then we could > save four bytes per numeric, plus reduce numeric's alignment requirement > from int to short which would probably save another byte per value on > average. I'm not sure that that's worth doing if numeric and inet are > the only beneficiaries, but it might be. That and variations can be the next discussion. They sound good. Those ideas are complementary, if I understand them. Best Regards, Simon Riggs
On Thu, 2005-11-03 at 08:27 +0000, Simon Riggs wrote: > On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > If we were willing to invent the "varlena2" datum format then we could > > save four bytes per numeric, plus reduce numeric's alignment requirement > > from int to short which would probably save another byte per value on > > average. I'm not sure that that's worth doing if numeric and inet are > > the only beneficiaries, but it might be. > > That and variations can be the next discussion. They sound good. Kicking off the discussion on that... The varlena2 datum format sounds interesting. If we did that, I'd also like to apply that thought to VAR/CHAR(32000) and below. (The benefit of varlena2 is a saving of 2 bytes + ~1 byte alignment, yes? The other two bytes come from the other numeric savings discussed.) Alternatively, what I'd been thinking about was altering the self-contained nature of PostgreSQL datatypes. In other databases, CHAR(12) and NUMERIC(12) are fixed length datatypes. In PostgreSQL, they are dynamically varying datatypes. What actually happens is that in many other systems the datatype is the same, but additional metadata is provided for that particular attribute. So CHAR(12) is a datatype of CHAR with a metadata item called length which is set to 12 for that attribute. On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of that datatype having a 4 byte varlena header. In this example, all of those instantiations have the varlena header set to 12, essentially wasting the 4-byte header. It seems like it would be an interesting move to allow the attribute metadata to be stored in the TupleDesc, so we can store it once, rather than once per row. If we did this we would need two datatypes where currently we need only one. We would still need the variable-length char datatype VARCHAR, and we would be inventing a new fixed-char datatype CHAR(n) with the length as metadata.
This would give us two things:
- reduce many attributes by 4 bytes in length
- allow attribute access to increase considerably in speed for queries, sorts etc. since more of the tuple offsets will be constant

Anyway, I accept that many will say I clearly don't understand Object Relational. It seems like this could be done without actually breaking anything. The question is, how much work would it be? Best Regards, Simon Riggs
On Thu, Nov 03, 2005 at 01:49:46PM +0000, Simon Riggs wrote: > In other databases, CHAR(12) and NUMERIC(12) are fixed length datatypes. > In PostgreSQL, they are dynamically varying datatypes. Please explain how a CHAR(12) can store 12 UTF-8 characters when each character may be 1 to 4 bytes, unless the CHAR itself is variable length... > What actually happens is that in many other systems the datatype is the > same, but additional metadata is provided for that particular attribute. > So CHAR(12) is a datatype of CHAR with a metadata item called length > which is set to 12 for that attribute. We already have this metadata, it's called atttypmod and it's stored in pg_attribute. That's where the 12 for CHAR(12) is stored BTW. > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of > that datatype having a 4 byte varlena header. In this example, all of > those instantiations having the varlena header set to 12, so essentially > wasting the 4 byte header. Nope, the varlena header stores the actual length on disk. If you store "hello" in a char(12) field it takes only 9 bytes (4 for the header, 5 for the data), which is less than 12. Good ideas, but it all hinges on the fact that CHAR(12) can take a fixed amount of space, which simply isn't true in a multibyte encoding. Having a different header for things shorter than 255 bytes has been discussed before, that's another argument though. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Simon Riggs wrote: > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of > that datatype having a 4 byte varlena header. In this example, all of > those instantiations having the varlena header set to 12, so essentially > wasting the 4 byte header. We need the length word because the actual size in bytes is variable, due to multibyte encoding considerations. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, 2005-11-03 at 11:13 -0300, Alvaro Herrera wrote: > Simon Riggs wrote: > > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of > > that datatype having a 4 byte varlena header. In this example, all of > > those instantiations having the varlena header set to 12, so essentially > > wasting the 4 byte header. > > We need the length word because the actual size in bytes is variable, > due to multibyte encoding considerations. Succinctly put, thanks. Incidentally, you remind me that other databases do *not* vary the character length, even if they do have varying length UTF-8 within them. So if you define CHAR(255) then it could blow up at a random length if you store UTF-8 within it. That's behaviour that I could never sanction, so I'll leave this now. Best Regards, Simon Riggs
On Thu, 3 Nov 2005, Simon Riggs wrote: > On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > Andrew Dunstan <andrew@dunslane.net> writes: > > > Could someone please quantify how much bang we might get for what seems > > > like quite a lot of bucks? > > > I appreciate the need for speed, but the saving here strikes me as > > > marginal at best, unless my instincts are all wrong (quite possible) > > > > Two bytes per numeric value is not a lot, agreed. > > I'm optimising for Data Warehousing. If you have a very large table with > a higher proportion of numerics on it, then your saving can be >5% of > tablesize which could be very useful. For the general user, it might > produce less benefit, I accept. > > At the moment we've established we can do this fairly much for free. > i.e. nobody cares about the drop in digits (to 255) and the other coding I don't believe the above is safe to say, yet. AFAICS, this has been discussed only on hackers (and patches) in this discussion, whereas this sort of change should probably be brought up on general as well to get a greater understanding of whether there are people who care. I expect that there won't be, but given that I'm still not sure what the plan to support applications upward is for this change, I think it's probably a good idea to query a larger segment of the population.
Simon Riggs wrote: > On Thu, 2005-11-03 at 11:13 -0300, Alvaro Herrera wrote: > >>Simon Riggs wrote: >> >>>On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of >>>that datatype having a 4 byte varlena header. In this example, all of >>>those instantiations having the varlena header set to 12, so essentially >>>wasting the 4 byte header. >> >>We need the length word because the actual size in bytes is variable, >>due to multibyte encoding considerations. > > > Succinctly put, thanks. > > Incidentally, you remind me that other databases do *not* vary the > character length, even if they do have varying length UTF-8 within them. > So if you define CHAR(255) then it could blow up at a random length if > you store UTF-8 within it. > > That's behaviour that I could never sanction, so I'll leave this now. > > Best Regards, Simon Riggs > Just as a side note, in Oracle you can use the syntax (f.ex on a db with utf-8 charset): column VARCHAR2(10 CHAR) ...to indicate that Oracle should fit 10 characters there. It might use up to 40 bytes in the db, but that's up to Oracle. If I s/10 CHAR/10, at most 10 characters will fit. This works very well. The only catch is that it's not good to use more than 1000 chars since Oracle's varchars don't want to go past 4000 bytes. Best regards, Marcus
Stephan Szabo wrote: > > On Thu, 3 Nov 2005, Simon Riggs wrote: > > > On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > > Andrew Dunstan <andrew@dunslane.net> writes: > > > > Could someone please quantify how much bang we might get for what seems > > > > like quite a lot of bucks? > > > > I appreciate the need for speed, but the saving here strikes me as > > > > marginal at best, unless my instincts are all wrong (quite possible) > > > > > > Two bytes per numeric value is not a lot, agreed. > > > > I'm optimising for Data Warehousing. If you have a very large table with > > a higher proportion of numerics on it, then your saving can be >5% of > > tablesize which could be very useful. For the general user, it might > > produce less benefit, I accept. > > > > At the moment we've established we can do this fairly much for free. > > i.e. nobody cares about the drop in digits (to 255) and the other coding > > I don't believe the above is safe to say, yet. AFAICS, this has been > discussed only on hackers (and patches) in this discussion, whereas this > sort of change should probably be brought up on general as well to get a > greater understanding of whether there are people who care. I expect that > there won't be, but given that I'm still not sure what the plan to support > applications upward is for this change, I think it's probably a good idea > to query a larger segment of the population. Agreed. With the proposal, we are saving perhaps 5% storage space for numeric fields, but are adding code complexity and reducing its possible precision. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> On Thu, 3 Nov 2005, Simon Riggs wrote: >>> At the moment we've established we can do this fairly much for free. > Agreed. With the proposal, we are saving perhaps 5% storage space for > numeric fields, but are adding code complexity and reducing its possible > precision. Having to invent UNKNOWNNUMERIC is hardly what I'd call "for free". That takes it out of the realm of being a small localized project. I'd feel a lot happier about this if we could keep the dynamic range up to, say, 10^512 so that it's still true that NUMERIC can be a universal parse-time representation. That would also make it even more unlikely that anyone would complain about loss of functionality. To do that we'd need 8 bits for weight (-128..127 for a base-10K exponent is enough) but we need 9 bits for dscale which does not quite fit. I think we could make it go by cramming the sign and the high-order dscale bit into the first NumericDigit --- the digit itself can only be 0..9999 so there are a couple of bits to spare. This probably *would* slow down packing and unpacking of numerics, but just by a couple lines of C. Arguably the net reduction in I/O costs would justify that. regards, tom lane
On Thu, 2005-11-03 at 07:03 -0800, Stephan Szabo wrote: > I don't believe the above is safe to say, yet. AFAICS, this has been > discussed only on hackers (and patches) in this discussion, whereas this > sort of change should probably be brought up on general as well to get a > greater understanding of whether there are people who care. I expect that > there won't be, but given that I'm still not sure what the plan to support > applications upward is for this change, I think it's probably a good idea > to query a larger segment of the population. Fairly spoken, I believe. I would suggest that I work just a little more with Tom et al to make sure we have a plan that can fly. If so, I will query GENERAL as you suggest. Best Regards, Simon Riggs
Simon Riggs wrote: >On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > >>Andrew Dunstan <andrew@dunslane.net> writes: >> >> >>>Could someone please quantify how much bang we might get for what seems >>>like quite a lot of bucks? >>>I appreciate the need for speed, but the saving here strikes me as >>>marginal at best, unless my instincts are all wrong (quite possible) >>> >>> >>Two bytes per numeric value is not a lot, agreed. >> >> > >I'm optimising for Data Warehousing. If you have a very large table with >a higher proportion of numerics on it, then your saving can be >5% of >tablesize which could be very useful. For the general user, it might >produce less benefit, I accept. > > > Well, it could also be argued that DW apps could often get away with using floating point types, even where the primary source needs to be in fixed point for accuracy, and that could generate lots of savings in speed and space. But I guess everybody gets to make their own choices. cheers andrew
On Thu, Nov 03, 2005 at 03:09:26PM +0100, Martijn van Oosterhout wrote: > On Thu, Nov 03, 2005 at 01:49:46PM +0000, Simon Riggs wrote: > > In other databases, CHAR(12) and NUMERIC(12) are fixed length datatypes. > > In PostgreSQL, they are dynamically varying datatypes. > Please explain how a CHAR(12) can store 12 UTF-8 characters when each > character may be 1 to 4 bytes, unless the CHAR itself is variable > length... > ... > Nope, the varlena header stores the actual length on disk. If you store > "hello" in a char(12) field it takes only 9 bytes (4 for the header, 5 > for the data), which is less than 12. > ... > Having a different header for things shorter than 255 bytes has been > discussed before, that's another argument though. It's unfortunate that the length is encoded multiple times. In UTF-8, for instance, each character has its length encoded in the most significant bits. Complicated to extract, however, the data is encoded twice: once in the header, and once in the combination of the column attribute and the per-character lengths. For "other databases", the column could be encoded as 2 byte characters or 4 byte characters, allowing it to be fixed. I find myself doubting that ASCII characters could be encoded more efficiently in such formats than with the inclusion of a length header and per-character length encoding, but for multibyte characters, the race is probably even. :-) I dunno... no opinion on the matter here, but I did want to point out that the field can be fixed length without a header. Those proposing such a change, however, should accept that this may result in an overall expense. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
On Thu, 2005-11-03 at 11:36 -0500, Andrew Dunstan wrote: > Well, it could also be argued that DW apps could often get away with > using floating point types, even where the primary source needs to be in > fixed point for accuracy, and that could generate lots of savings in > speed and space. But I guess everybody gets to make their own choices. I've never seen a requirement for a DW to be less accurate than the primary source. Its almost always the other way around. If its money, then no. If its other physical measurements, then yes. There's lots of scientific apps that wouldn't use NUMERICs much, but there's enough apps that do for me to care. Best Regards, Simon Riggs
On Thu, Nov 03, 2005 at 12:28:02PM -0500, mark@mark.mielke.cc wrote: > It's unfortunate that the length is encoded multiple times. In UTF-8, > for instance, each character has its length encoded in the most > significant bits. Complicated to extract, however, the data is encoded > twice. 1 in the header, and 1 in the combination between the column > attribute, and the per character lengths. > > For "other databases", the column could be encoded as 2 byte characters > or 4 byte characters, allowing it to be fixed. I find myself doubting > that ASCII characters could be encoded more efficiently in such formats, > than the inclusion of a length header and per character length encoding, > but for multibyte characters, the race is probably even. :-) That's called UTF-16 and is currently not supported by PostgreSQL at all. That may change, since the locale library ICU requires UTF-16 for everything. The question is, if someone declares a field CHAR(20), do they really mean to fix 40 bytes of storage for each and every row? I doubt it, that's even more wasteful of space than a varlena header. Which puts you right back to variable length fields. > I dunno... no opinion on the matter here, but I did want to point out > that the field can be fixed length without a header. Those proposing such > a change, however, should accept that this may result in an overall > expense. The only time this may be useful is for *very* short fields, in the order of 4 characters or less. Else the overhead swamps the varlena header... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
On 2005-11-03, Martijn van Oosterhout <kleptog@svana.org> wrote: >> For "other databases", the column could be encoded as 2 byte characters >> or 4 byte characters, allowing it to be fixed. I find myself doubting >> that ASCII characters could be encoded more efficiently in such formats, >> than the inclusion of a length header and per character length encoding, >> but for multibyte characters, the race is probably even. :-) > > That's called UTF-16 and is currently not supported by PostgreSQL at > all. That may change, since the locale library ICU requires UTF-16 for > everything. Even UTF-16 is not fixed-length; some characters require 4 bytes, not 2. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
On Thu, 2005-11-03 at 10:32 -0500, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > >> On Thu, 3 Nov 2005, Simon Riggs wrote: > >>> At the moment we've established we can do this fairly much for free. > > > Agreed. With the proposal, we are saving perhaps 5% storage space for > > numeric fields, but are adding code complexity and reducing its possible > > precision. > > Having to invent UNKNOWNNUMERIC is hardly what I'd call "for free". > That takes it out of the realm of being a small localized project. > I'd feel a lot happier about this if we could keep the dynamic range > up to, say, 10^512 so that it's still true that NUMERIC can be a > universal parse-time representation. That would also make it even > more unlikely that anyone would complain about loss of functionality. > > To do that we'd need 8 bits for weight (-128..127 for a base-10K > exponent is enough) but we need 9 bits for dscale which does not > quite fit. I think we could make it go by cramming the sign and > the high-order dscale bit into the first NumericDigit --- the > digit itself can only be 0..9999 so there are a couple of bits > to spare. This probably *would* slow down packing and unpacking of > numerics, but just by a couple lines of C. Arguably the net reduction > in I/O costs would justify that. This sounds a much better solution and the code neater too. This still would be a very small patch, not very intrusive at all. I'll have a hack at this now. Best Regards, Simon Riggs
On 11/3/05, Martijn van Oosterhout <kleptog@svana.org> wrote: > That's called UTF-16 and is currently not supported by PostgreSQL at > all. That may change, since the locale library ICU requires UTF-16 for > everything. UTF-16 doesn't get us out of the variable length character game, for that we need UTF-32... Unless we were to only support UCS-2, which is what some databases do for their Unicode support. I think that would be a huge step back and, as you pointed out below, it is not efficient. :) > The question is, if someone declares a field CHAR(20), do they really > mean to fix 40 bytes of storage for each and every row? I doubt it, > that's even more wasteful of space than a varlena header. > > Which puts you right back to variable length fields. Another way to look at this is in the context of compression: with Unicode, characters are really 32-bit values... But only a small range of these values is common. So we store and work with them in a compressed format, UTF-8. The cost of compression is that fixed-width fields cannot be fixed width, and some operations are much more expensive than they would be otherwise. As such it might be more interesting to ask some other questions like: are we using the best compression algorithm for the application, and why do we sometimes stack two compression algorithms? For longer fields would we be better off working with UTF-32 and being more aggressive about where we LZ compress the fields? > > I dunno... no opinion on the matter here, but I did want to point out > > that the field can be fixed length without a header. Those proposing such > > a change, however, should accept that this may result in an overall > > expense. > > The only time this may be useful is for *very* short fields, in the > order of 4 characters or less. Else the overhead swamps the varlena > header... Not even 4 characters if we are to support all of unicode...
Length + UTF-8 is a win vs UTF-32 in most cases for fields with more than one character.
Gregory Maxwell <gmaxwell@gmail.com> writes:
> Another way to look at this is in the context of compression: With
> unicode, characters are really 32bit values... But only a small range
> of these values is common. So we store and work with them in a
> compressed format, UTF-8.

> As such it might be more interesting to ask some other questions like:
> are we using the best compression algorithm for the application, and,
> why do we sometimes stack two compression algorithms?

Actually, the real reason we use UTF-8 and not any of the
sorta-fixed-size representations of Unicode is that the backend is by
and large an ASCII, null-terminated-string engine. *All* of the
supported backend encodings are ASCII-superset codes. Making everything
null-safe in order to allow use of UCS2 or UCS4 would be a huge amount
of work, and the benefit is at best questionable.

regards, tom lane
On Thu, Nov 03, 2005 at 09:17:43PM -0500, Tom Lane wrote:
> Actually, the real reason we use UTF-8 and not any of the
> sorta-fixed-size representations of Unicode is that the backend is by
> and large an ASCII, null-terminated-string engine. *All* of the
> supported backend encodings are ASCII-superset codes. Making
> everything null-safe in order to allow use of UCS2 or UCS4 would be
> a huge amount of work, and the benefit is at best questionable.

Perhaps on a side note - my intuition (which sometimes lies) would tell
me that, if the above is true, the backend is doing unnecessary copies
of read-only data, if only to insert a '\0' at the end of the strings.
Is this true?

I'm thinking along the lines of the other threads that speak of
PostgreSQL being CPU or I/O bound, not disk bound, for many sorts of
operations. Is PostgreSQL unnecessarily copying string data around (and
other data, I would assume)?

Cheers, mark

-- mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood
Coder | Ottawa, Ontario, Canada | http://mark.mielke.cc/
On Fri, Nov 04, 2005 at 08:38:38AM -0500, mark@mark.mielke.cc wrote:
> Perhaps on a side note - my intuition (which sometimes lies) would tell
> me that, if the above is true, the backend is doing unnecessary copies
> of read-only data, if only to insert a '\0' at the end of the strings.
> Is this true?

It's not quite that bad. Obviously for all on-disk datatypes zeros are
allowed. Bit strings, arrays, timestamps, numerics can all have
embedded nulls, and they have a length header.

Where this becomes an issue is for things like table names, field
names, encoding names, etc. The "name" type is a fixed-length string
which is kept in a way that it can be treated as a C string. If these
could contain null characters it would get messy.

I can conceive that the backend could support a UTF-16 datatype which
would be indexable and have various support functions. But as soon as
it came to talking to clients, it would be converted back to UTF-8
because libpq treats all strings coming back as null-terminated.
Similarly, queries sent couldn't be anything other than UTF-8 either.
And if users can't send or receive UTF-16 text, why should the backend
store it that way?

> I'm thinking along the lines of the other threads that speak of
> PostgreSQL being CPU or I/O bound, not disk bound, for many sorts of
> operations. Is PostgreSQL unnecessarily copying string data around
> (and other data, I would assume)?

Well, there is a bit of copying around while creating tuples and such,
but it's not to add null terminators.
Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Fri, Nov 04, 2005 at 04:13:29PM +0100, Martijn van Oosterhout wrote:
> It's not quite that bad. Obviously for all on-disk datatypes zeros are
> allowed. Bit strings, arrays, timestamps, numerics can all have
> embedded nulls, and they have a length header.

Are you and Tom conflicting in opinion? :-)

I read "the backend is by and large an ASCII, null-terminated-string
engine" with "we use UTF-8 [for varlena strings?]" as: a lot of the
code assumes varlena strings are '\0' terminated, and - an assumption
on my part - the varlena strings are not stored in the backend with a
'\0' terminator, therefore they require being copied out, terminated
with a '\0', before they can be used? Or perhaps I'm just confused. :-)

> > I'm thinking along the lines of the other threads that speak of
> > PostgreSQL being CPU or I/O bound, not disk bound, for many sorts of
> > operations. Is PostgreSQL unnecessarily copying string data around
> > (and other data, I would assume)?
>
> Well, there is a bit of copying around while creating tuples and such,
> but it's not to add null terminators.

How much effort (past discussions that I've missed from a decade ago?
hehe) has been put into determining whether a zero-copy architecture,
or really, a minimum-copy architecture, would address some of these
bottlenecks? Am I dreaming? :-)

Cheers, mark
mark@mark.mielke.cc writes:
> I read "the backend is by and large an ASCII, null-terminated-string
> engine" with "we use UTF-8 [for varlena strings?]" as: a lot of the
> code assumes varlena strings are '\0' terminated, and - an assumption
> on my part - the varlena strings are not stored in the backend with a
> '\0' terminator, therefore they require being copied out, terminated
> with a '\0', before they can be used?

There are places where we have to do that, the worst from a performance
viewpoint being in string comparison --- we have to null-terminate both
values before we can pass them to strcoll().

One of the large bits that would have to be done before we could even
contemplate using UCS2/UCS4 is getting rid of our dependence on strcoll,
since its API is null-terminated-string.

> How much effort (past discussions that I've missed from a decade ago?
> hehe) has been put into determining whether a zero-copy architecture,
> or really, a minimum-copy architecture, would address some of these
> bottlenecks? Am I dreaming? :-)

We've already done it in places; for instance, the new implementation
of "virtual tuples" in TupleTableSlots eliminates a lot of copying of
pass-by-reference values.

regards, tom lane
On Fri, Nov 04, 2005 at 01:54:04PM -0500, Tom Lane wrote:
> There are places where we have to do that, the worst from a performance
> viewpoint being in string comparison --- we have to null-terminate both
> values before we can pass them to strcoll().
>
> One of the large bits that would have to be done before we could even
> contemplate using UCS2/UCS4 is getting rid of our dependence on strcoll,
> since its API is null-terminated-string.

Yeah, and while one way of removing that dependence is to use ICU, that
library wants everything in UTF-16. So we replace "copying to add a NUL
to the string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
argument for UTF-16 is that if you're using a language that doesn't use
ASCII at all, UTF-8 gets inefficient pretty quickly.

Locale sensitive, efficient storage, fast comparisons: pick any two!

My guess is that in the long run there would be two basic string
datatypes: one a UTF-8, null-terminated string used in the backend code
as a standard C string, default collation strcmp; the other UTF-16, for
user data that wants to be able to collate in a locale-dependent way.

Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org>
On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Yeah, and while one way of removing that dependence is to use ICU, that
> library wants everything in UTF-16. So we replace "copying to add a NUL
> to the string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
> argument for UTF-16 is that if you're using a language that doesn't use
> ASCII at all, UTF-8 gets inefficient pretty quickly.

Is this really the case? Only Unicode values 000800 - 00FFFF are
smaller in UTF-16 than in UTF-8, and in their case it's three bytes vs
two. Cyrillic, Arabic, Greek, Latin, etc. are all two bytes in both.
So, yes, in some cases UTF-8 will use three bytes where UTF-16 would
use two, but that's less inefficient than UTF-16 is for ASCII, which
many people find acceptable.

> Locale sensitive, efficient storage, fast comparisons: pick any two!

I don't know that the choices are that limited. As I indicated earlier
in the thread, I think it's useful to think of all of these encodings
as just different compression algorithms. If our desire was to have all
three, the backend could be made null-safe and we could use the
locale-sensitive and fast representation (probably UTF-16 or UTF-32) in
memory, and store on disk whatever is most efficient for storage:
LZ-compressed UTF-whatever for fat fields, UTF-8 for mostly-ASCII small
fields, SCSU for non-ASCII short fields
(http://www.unicode.org/reports/tr6/), etc.

> My guess is that in the long run there would be two basic string
> datatypes: one a UTF-8, null-terminated string used in the backend code
> as a standard C string, default collation strcmp; the other UTF-16, for
> user data that wants to be able to collate in a locale-dependent way.

So if we need locale-dependent collation we suffer 2x inflation for
many texts, and multibyte complexity is still required if we are to
collate correctly when there are characters outside of the BMP. Yuck.

Disk storage type, memory storage type, user API type, and collation
should be decoupled.
On 11/4/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Yeah, and while one way of removing that dependence is to use ICU, that
> > library wants everything in UTF-16.
>
> Really? Can't it do UCS4 (UTF-32)? There's a nontrivial population
> of our users that isn't satisfied with UTF-16 anyway, so if that really
> is a restriction of ICU then we're going to have to look elsewhere :-(

The correct question to ask is something like "Does it support non-BMP
characters?" or "Does it really support UTF-16 or just UCS2?"

UTF-16 is (now) a variable-width encoding which is a strict superset of
UCS2 and which allows the representation of all Unicode characters.
UCS2 is fixed width and only supports characters from the basic
multilingual plane. UTF-32 and UCS4 are (now) effectively the same
thing and can represent all Unicode characters with a 4-byte
fixed-length word.

The code can demand UTF-16 but still be fine for non-BMP characters.
However, many things which claim to support UTF-16 really only support
UCS2, or at least have bugs in their handling of non-BMP characters.
Software that supports UTF-8 is somewhat more likely to support non-BMP
characters correctly, since the variable-length code paths get more of
a workout in many environments. :)
Martijn van Oosterhout <kleptog@svana.org> writes:
> Yeah, and while one way of removing that dependence is to use ICU, that
> library wants everything in UTF-16.

Really? Can't it do UCS4 (UTF-32)? There's a nontrivial population of
our users that isn't satisfied with UTF-16 anyway, so if that really is
a restriction of ICU then we're going to have to look elsewhere :-(

regards, tom lane
On Thu, Nov 03, 2005 at 04:07:41PM +0100, Marcus Engene wrote:
> Simon Riggs wrote:
> > On Thu, 2005-11-03 at 11:13 -0300, Alvaro Herrera wrote:
> > > Simon Riggs wrote:
> > > > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of
> > > > that datatype having a 4 byte varlena header. In this example, all of
> > > > those instantiations have the varlena header set to 12, so essentially
> > > > wasting the 4 byte header.
> > >
> > > We need the length word because the actual size in bytes is variable,
> > > due to multibyte encoding considerations.
> >
> > Succinctly put, thanks.
> >
> > Incidentally, you remind me that other databases do *not* vary the
> > character length, even if they do have varying length UTF-8 within them.
> > So if you define CHAR(255) then it could blow up at a random length if
> > you store UTF-8 within it.
> >
> > That's behaviour that I could never sanction, so I'll leave this now.
> >
> > Best Regards, Simon Riggs
>
> Just as a side note, in Oracle you can use the syntax (f.ex. on a db
> with utf-8 charset):
>
>     column VARCHAR2(10 CHAR)
>
> ...to indicate that Oracle should fit 10 characters there. It might use
> up to 40 bytes in the db, but that's up to Oracle. If I s/10 CHAR/10/,
> at most 10 characters will fit.
>
> This works very well. The only catch is that it's not good to use more
> than 1000 chars since Oracle's varchars don't want to go past 4000 bytes.

Likewise other databases use different character types such as NCHAR
(nationalized char), which is the 16-bit variant. I think it's
perfectly acceptable to have a char type that is fixed-width in terms
of number of bytes, so long as we provide an alternative. Heck, in my
experience char is only used to store things like hashes that are in
ASCII anyway.

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Thu, Nov 03, 2005 at 10:32:03AM -0500, Tom Lane wrote:
> I'd feel a lot happier about this if we could keep the dynamic range
> up to, say, 10^512 so that it's still true that NUMERIC can be a
> universal parse-time representation. That would also make it even
> more unlikely that anyone would complain about loss of functionality.

Would it be feasible to have a type that satisfies that constraint but
isn't generally intended for on-disk use? My thought is that this new
type would be used mostly for casting purposes. Kind of like the
UNKNOWNNUMERIC but easier to do since it'd just be another type.

(BTW, I'm not suggesting that we disallow on-disk storage of the type,
only discourage it unless someone really, really needs an absurd number
of digits.)
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Would it be feasible to have a type that satisfies that constraint but
> isn't generally intended for on-disk use? My thought is that this new
> type would be used mostly for casting purposes. Kind of like the
> UNKNOWNNUMERIC but easier to do since it'd just be another type.

What exactly would be easier about it? ISTM you just described
UNKNOWNNUMERIC to a T.

regards, tom lane
On Fri, Nov 04, 2005 at 04:30:27PM -0500, Tom Lane wrote:
> What exactly would be easier about it? ISTM you just described
> UNKNOWNNUMERIC to a T.

Apologies then; it sounded like UNKNOWNNUMERIC was going to be
something that was internal-use only, presumably making it much harder
to implement than just adding an additional type.
On Fri, Nov 04, 2005 at 02:58:05PM -0500, Gregory Maxwell wrote:
> The correct question to ask is something like "Does it support non-BMP
> characters?" or "Does it really support UTF-16 or just UCS2?"

It's all on their website:

: How is a Unicode string represented in ICU?
:
: A Unicode string is currently represented as UTF-16 by default. The
: endianness of UTF-16 is platform dependent. You can guarantee the
: endianness of UTF-16 by using a converter. UTF-16 strings can be
: converted to other Unicode forms by using a converter or with the UTF
: conversion macros.
:
: ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
: support surrogates, and UTF-16 does support surrogates. This means
: that UCS-2 only supports UTF-16's Basic Multilingual Plane (BMP). The
: notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
: its default encoding to UTF-16.

<snip>

: What is the performance difference between UTF-8 and UTF-16?
:
: Most of the time, the memory throughput of the hard drive and RAM is
: the main performance constraint. UTF-8 is 50% smaller than UTF-16 for
: US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South
: Asian scripts. There is no memory difference for Latin extensions,
: Greek, Cyrillic, Hebrew, and Arabic.

<snip>

http://icu.sourceforge.net/userguide/icufaq.html

: Using UTF-8 strings with ICU
:
: As mentioned in the overview of this chapter, ICU and most other
: Unicode-supporting software uses 16-bit Unicode for internal
: processing. However, there are circumstances where UTF-8 is used
: instead. This is usually the case for software that does little or no
: processing of non-ASCII characters, and/or for APIs that predate
: Unicode, use byte-based strings, and cannot be changed or replaced
: for various reasons.

<snip>

: While ICU does not natively use UTF-8 strings, there are many ways to
: work with UTF-8 strings and ICU. The following list is probably
: incomplete.

http://icu.sourceforge.net/userguide/strings.html#strings

Basically you use a "converter" to process the UTF-8 strings,
presumably converting them to UTF-16 (which is not UCS-2, as noted
above). UTF-32 needs a converter also, so no point using that either.

> The code can demand UTF-16 but still be fine for non-BMP characters.
> However, many things which claim to support UTF-16 really only support
> UCS2 or at least have bugs in their handling of non-BMP characters.
> Software that supports UTF-8 is somewhat more likely to support
> non-BMP characters correctly since the variable length code paths get
> more of a workout in many environments. :)

I think ICU deals with that, but feel free to peruse the website
yourself...

Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org>
On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
[snip]
> : ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
> : support surrogates, and UTF-16 does support surrogates. This means
> : that UCS-2 only supports UTF-16's Basic Multilingual Plane (BMP). The
> : notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
> : its default encoding to UTF-16.
> <snip>

This means it's fine: ICU's use of UTF-16 will not break our support
for all of Unicode. Conversion to and from UTF-16 isn't cheap, however,
if you're doing it all the time. Storing ASCII in UTF-16 is pretty
lame, and widespread use of UTF-16 tends to hide bugs in the handling
of non-BMP characters. ... I would be somewhat surprised to see a
substantial performance difference in working with UTF-16 data over
UTF-8, but then again ... they'd know and I wouldn't.

Another lame aspect of using Unicode encodings other than UTF-8
internally is that it's harder to figure out what is text in GDB output
and such, which can make debugging more difficult.
On Fri, Nov 04, 2005 at 07:15:22PM -0500, Gregory Maxwell wrote:
> Another lame aspect of using Unicode encodings other than UTF-8
> internally is that it's harder to figure out what is text in GDB
> output and such, which can make debugging more difficult.

Yeah, that's one of the reasons I think UTF-16 should be restricted to
user data only. There's no particular reason why system tables and
indexes need to be in a locale-sensitive collate order. If someone
really wants that they can pay for the conversion costs at query time.

The problem is that PostgreSQL doesn't support multiple collate orders;
that's being worked on though.

Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org>
In article <e692861c0511041149n6fe36345oba7c43d1d48bef3d@mail.gmail.com>,
Gregory Maxwell <gmaxwell@gmail.com> writes:
> On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
> > Yeah, and while one way of removing that dependence is to use ICU, that
> > library wants everything in UTF-16. So we replace "copying to add a NUL
> > to the string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
> > argument for UTF-16 is that if you're using a language that doesn't use
> > ASCII at all, UTF-8 gets inefficient pretty quickly.

> Is this really the case? Only Unicode values 000800 - 00FFFF are
> smaller in UTF-16 than in UTF-8, and in their case it's three bytes vs
> two. Cyrillic, Arabic, Greek, Latin, etc. are all two bytes in both.

IMHO the best encoding for "Cyrillic, Arabic, Greek, Latin, etc." is
ISO-8859-*: just one byte. You need UTF-* only when you want to have
more than one of them in the same column.
On Thu, 2005-11-03 at 10:32 -0500, Tom Lane wrote:
> I'd feel a lot happier about this if we could keep the dynamic range
> up to, say, 10^512 so that it's still true that NUMERIC can be a
> universal parse-time representation. That would also make it even
> more unlikely that anyone would complain about loss of functionality.
>
> To do that we'd need 8 bits for weight (-128..127 for a base-10K
> exponent is enough) but we need 9 bits for dscale which does not
> quite fit. I think we could make it go by cramming the sign and
> the high-order dscale bit into the first NumericDigit --- the
> digit itself can only be 0..9999 so there are a couple of bits
> to spare. This probably *would* slow down packing and unpacking of
> numerics, but just by a couple lines of C. Arguably the net reduction
> in I/O costs would justify that.

I've got a working version of the code using the above scheme, with
these additional wrinkles:

NaN is indicated by weight=-128, giving a dynamic range of 10^508.

Zeroes are fully compressed, except when the scale > 255. In that case,
the first digit is present to signify the presence of the high-order
scale bit.

Comments?

Once 8.1 is released, I'll go back and see if I can improve the coding
in a few days with fresh eyes, then submit a patch.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2005-11-03 at 10:32 -0500, Tom Lane wrote:
> > I think we could make it go by cramming the sign and
> > the high-order dscale bit into the first NumericDigit --- the
> > digit itself can only be 0..9999 so there are a couple of bits
> > to spare.

> I've got a working version of the code using the above scheme,

Really? After I woke up a bit more I realized there was only one bit
and change to spare, not two, so I don't see how it would work.
(Unless you want to force a minimum of two NumericDigits when the
dscale exceeds 255, and flag the scale in the second one --- which
might be OK but it's getting messier and messier...)

regards, tom lane
On Sun, 2005-11-06 at 11:26 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I've got a working version of the code using the above scheme,
>
> Really? After I woke up a bit more I realized there was only one bit
> and change to spare, not two, so I don't see how it would work.

Not sure why you think that. Seems to fit...

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Sun, 2005-11-06 at 11:26 -0500, Tom Lane wrote:
> > Really? After I woke up a bit more I realized there was only one bit
> > and change to spare, not two, so I don't see how it would work.

> Not sure why you think that. Seems to fit...

[ counts on fingers and toes too ... ] Obviously, I was more awake the
first time. Never mind :-(

regards, tom lane