Thread: Reducing the overhead of NUMERIC data
Currently, the overhead of the NUMERIC datatype is 8 bytes. Each value is stored on disk as:

    typedef struct NumericData
    {
        int32    varlen;        /* Variable size (std varlena header) */
        int16    n_weight;      /* Weight of 1st digit */
        uint16   n_sign_dscale; /* Sign + display scale */
        char     n_data[1];     /* Digits (really array of NumericDigit) */
    } NumericData;

Let's see if we can reduce that:

varlen is int32 to match the standard varlena header. However, the max number of digits of the datatype is less than the threshold at which values get toasted. So no NUMERIC values ever get toasted - in which case, why worry about matching the size of varlena? Let's reduce it to 2 bytes, which still gives us up to 1000 digits as we have now.

n_weight seems to exist because we do not store trailing zeroes. So 1000000 is stored as 1 with a weight of 6. My experience is that large numbers of trailing zeroes do not occur with any frequency in real measurement or financial data, and that this is an over-optimization. This is probably a hangover from the original algorithm, rather than a conscious design goal for PostgreSQL?

n_sign_dscale shows us where the decimal point is. We could actually store a marker representing the decimal point, which would cost us 0.5 byte rather than 2 bytes. Since we use 4 bits to represent a decimal digit, that leaves a few bit patterns spare to represent either decimal-point-and-positive-sign or decimal-point-and-negative-sign. (We would still need to store trailing zeroes after the decimal point.)

So, assuming I have this all correct, we could reduce the on-disk storage for NUMERIC datatypes to the following struct. This gives an overhead of just 2.5 bytes, plus the loss of the trailing-zeroes optimization, which I assess as having almost no value anyway in 99.9999% of data values (literally...).
    typedef struct NumericData
    {
        int16    varlen;        /* Variable size (std varlena header) */
        char     n_data[1];     /* Digits (really array of NumericDigit),
                                 * with the sign/decimal point stored as a
                                 * 0.5-byte marker within the digits */
    } NumericData;

A saving of 5.5 bytes per NUMERIC column per row could make an interesting difference to row length in large tables. The above would require a reasonable amount of change to the existing datatype code, so I'm not suggesting it's a simple change to an .h file. But the changes as proposed would seem able to be made to the existing NUMERIC type, rather than inventing another similar one.

Is my thinking straight on this?

Best Regards, Simon Riggs
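For reference, the on-disk digits are actually base-10000 NumericDigits, so "1000000" is really one digit (100) with a base-10000 weight of 1, rather than 1 with a weight of 6. The decomposition, including the trailing-zero stripping discussed above, can be sketched as follows; this is illustrative code written for this discussion, not the actual PostgreSQL implementation:

```c
#include <assert.h>

#define NBASE 10000   /* NumericDigits are base-10000 in the real format */

/* Decompose a non-negative integer into base-NBASE digits plus a weight,
 * dropping trailing zero digits as the on-disk format does.  Simplified
 * sketch: integers only, no sign, no dscale. */
static int to_numeric_digits(long val, short digits[], int *weight)
{
    short tmp[16];
    int n = 0, strip = 0;

    if (val == 0) { *weight = 0; return 0; }   /* zero stores no digits */
    while (val > 0) { tmp[n++] = (short)(val % NBASE); val /= NBASE; }
    while (tmp[strip] == 0) strip++;           /* trailing zero digits */
    *weight = n - 1;              /* exponent of 1st digit, in units of NBASE */
    for (int i = 0; i < n - strip; i++)
        digits[i] = tmp[n - 1 - i];
    return n - strip;             /* number of stored digits */
}
```

So 1000000 comes out as the single digit 100 with weight 1 (i.e. 100 × 10000^1), which is the base-10000 form of "1 with a weight of 6".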
Simon Riggs <simon@2ndquadrant.com> writes:
> varlen is int32 to match the standard varlena header. However, the max
> number of digits of the datatype is less than the threshold at which
> values get toasted. So no NUMERIC values ever get toasted - in which
> case, why worry about matching the size of varlena - let's reduce it to 2
> bytes which still gives us up to 1000 digits as we have now.

Because that will require an extra case in the code that disassembles tuples, which will slow down *everything*, even in databases that don't contain one single NUMERIC value. I think you need more than "let's save 2 bytes in NUMERICs" to justify that.

> n_weight seems to exist because we do not store trailing zeroes. So
> 1000000 is stored as 1 with a weight of 6. My experience is that large
> numbers of trailing zeroes do not occur with any frequency in real
> measurement or financial data and that this is an over-optimization.

That seems debatable. Keep in mind that not storing extra zeroes cuts computation time as well as storage.

> n_sign_dscale shows us where the decimal point is. We could actually
> store a marker representing the decimal point, which would cost us 0.5
> byte rather than 2 bytes. Since we have 4 bits to represent a decimal
> digit, that leaves a few bits spare to represent either a decimal-
> point-and-positive-sign and decimal-point-and-negative-sign. (We would
> still need to store trailing zeroes even after the decimal point).

This is completely bogus. How are you going to remember the sign except by always storing a marker? ISTM this proposal just moves the sign/decimal-point overhead from one place to another, ie, somewhere in the NumericDigit array instead of in a fixed field. Also, you can't drop dscale without abandoning the efficient base-10000 representation, at least not unless you want people complaining that the database shows NUMERIC(3) data with four decimal places.
It might be reasonable to restrict the range of NUMERIC to the point that we could fit the weight/sign/dscale into 2 bytes instead of 4, thereby saving 2 bytes per NUMERIC. I'm not excited about the other aspects of this, though. regards, tom lane
On Tue, Nov 01, 2005 at 09:22:17PM +0000, Simon Riggs wrote:
> varlen is int32 to match the standard varlena header. However, the max
> number of digits of the datatype is less than the threshold at which
> values get toasted. So no NUMERIC values ever get toasted - in which
> case, why worry about matching the size of varlena - let's reduce it to 2
> bytes which still gives us up to 1000 digits as we have now.

The other ideas may have merit, I don't know. But this one is a no-goer. The backend currently recognises three forms of Datum:

- Fixed length, by value: integers, chars, anything short enough to fit in a word
- Fixed length, by reference: datetime, etc., anything that's fixed length but too long for a word
- Variable length: anything variable: text, varchar(), etc.

The last all, without exception, have a varlena header. This makes the code easy, because all variable-length values look the same for copying, loading, storing, etc. You are proposing a fourth type, say VARLENA2, which looks a lot like a varlena but isn't. I think the sheer volume of code that would need to be checked is huge. Also, things like pg_attribute would need changing because you have to represent this new state somehow. I seriously doubt this is going to happen.

Your other possible optimisations have other issues.

> n_weight seems to exist because we do not store trailing zeroes. So
> 1000000 is stored as 1 with a weight of 6. My experience is that large
> numbers of trailing zeroes do not occur with any frequency in real
> measurement or financial data and that this is an over-optimization.
> This is probably a hangover from the original algorithm, rather than a
> conscious design goal for PostgreSQL?

But if you are storing large numbers then it's helpful. Whether it's worth the cost...

> n_sign_dscale shows us where the decimal point is. We could actually
> store a marker representing the decimal point, which would cost us 0.5
> byte rather than 2 bytes. Since we have 4 bits to represent a decimal
> digit, that leaves a few bits spare to represent either a decimal-
> point-and-positive-sign and decimal-point-and-negative-sign. (We would
> still need to store trailing zeroes even after the decimal point).

Consider the algorithm: a number is stored as base + exponent. To multiply two numbers you can multiply the bases and add the exponents. OTOH, if you store the decimal point inside the data, now you have to extract it again before you can do any calculating. So you've traded CPU time for disk space. Is disk space cheaper or more expensive than CPU? Debatable, I guess.

> So, assuming I have this all correct, means we could reduce the on disk
> storage for NUMERIC datatypes to the following struct. This gives an
> overhead of just 2.5 bytes, plus the loss of the optimization of
> trailing zeroes, which I assess as having almost no value anyway in
> 99.9999% of data values (literally...).

Actually, I have a table with a column declared as numeric(12,4) because there have to be 4 decimal places. As it turns out, the decimal places are mostly zero, so the optimisation works for me.

Interesting ideas, but there are a lot of hurdles to jump, I think...

-- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
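The multiplication point above can be made concrete with a toy mantissa-and-exponent type. This struct and helper are invented here purely for illustration (not PostgreSQL code): with base and exponent held in separate fields, multiplying needs no scan for a decimal-point marker embedded in the digit string.

```c
#include <assert.h>

/* Toy number: value = mantissa * 10^exponent.  With this shape,
 * multiplication is multiply-the-mantissas, add-the-exponents --
 * no need to first locate a decimal-point marker in the data. */
struct toy_num { long mantissa; int exponent; };

static struct toy_num toy_mul(struct toy_num a, struct toy_num b)
{
    struct toy_num r;
    r.mantissa = a.mantissa * b.mantissa;   /* overflow ignored in this sketch */
    r.exponent = a.exponent + b.exponent;
    return r;
}
```

If the decimal point were instead a marker inside the digit array, each operand would first have to be rescanned to recover the exponent before this one-line multiply could run.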
On Tue, Nov 01, 2005 at 04:54:11PM -0500, Tom Lane wrote:
> It might be reasonable to restrict the range of NUMERIC to the point
> that we could fit the weight/sign/dscale into 2 bytes instead of 4,
> thereby saving 2 bytes per NUMERIC. I'm not excited about the other
> aspects of this, though.

FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to fit length info into 1 or 2 bytes. So there's something to be said for a small numeric type that has less overhead and a large numeric (what we have today).

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, Nov 01, 2005 at 11:16:58PM +0100, Martijn van Oosterhout wrote:
> Consider the algorithm: A number is stored as base + exponent. To
> multiply two numbers you can multiply the bases and add the exponents.
> OTOH, if you store the decimal inside the data, now you have to extract
> it again before you can do any calculating. So you've traded CPU time
> for disk space. Is diskspace cheaper or more expensive than CPU?
> Debatable I guess.

Well, I/O bandwidth is much more expensive than either CPU or disk space...

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Martijn van Oosterhout <kleptog@svana.org> writes:
> You are proposing a fourth type, say VARLENA2, which looks a lot like a
> varlena but it's not. I think the sheer volume of code that would need
> to be checked is huge. Also, things like pg_attribute would need
> changing because you have to represent this new state somehow.

It wouldn't be an impossible amount of code --- for precedent see back when we made cstring into a full-fledged datatype (cstring is already a fourth option in your list, BTW). That patch wasn't all that large, IIRC. The issue in my mind is the performance implications of adding an additional case to places that are already hotspots. There were compelling functional reasons to pay that price to make cstring work, but "save 2 bytes per numeric" doesn't seem like it rises to that level. Maybe if we had a few other datatypes that could also use the feature. [ thinks... ] inet/cidr comes to mind but I don't see any others. The case seems a bit weak :-(

regards, tom lane
On Tue, 2005-11-01 at 16:54 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > varlen is int32 to match the standard varlena header. However, the max
> > number of digits of the datatype is less than the threshold at which
> > values get toasted. So no NUMERIC values ever get toasted - in which
> > case, why worry about matching the size of varlena - let's reduce it to 2
> > bytes which still gives us up to 1000 digits as we have now.
>
> Because that will require an extra case in the code that disassembles
> tuples, which will slow down *everything* even in databases that don't
> contain one single NUMERIC value. I think you need more than "let's
> save 2 bytes in NUMERICs" to justify that.

OK

> > n_weight seems to exist because we do not store trailing zeroes. So
> > 1000000 is stored as 1 with a weight of 6. My experience is that large
> > numbers of trailing zeroes do not occur with any frequency in real
> > measurement or financial data and that this is an over-optimization.
>
> That seems debatable. Keep in mind that not storing extra zeroes cuts
> computation time as well as storage.

Check what % difference this makes. 2 bytes on everything makes more difference than a 1-byte saving on a few percent of values.

> > n_sign_dscale shows us where the decimal point is. We could actually
> > store a marker representing the decimal point, which would cost us 0.5
> > byte rather than 2 bytes. Since we have 4 bits to represent a decimal
> > number, that leaves a few bits spare to represent either a decimal-
> > point-and-positive-sign and decimal-point-and-negative-sign. (We would
> > still need to store trailing zeroes even after the decimal point).
>
> This is completely bogus. How are you going to remember the sign except
> by always storing a marker? ISTM this proposal just moves the
> sign/decimalpoint overhead from one place to another, ie, somewhere in
> the NumericDigit array instead of in a fixed field.

That is exactly my proposal. Thus an overhead of 0.5 bytes rather than 2 bytes, as I explained... but

> Also, you can't drop dscale without abandoning the efficient base-10000
> representation, at least not unless you want people complaining that the
> database shows NUMERIC(3) data with four decimal places.

... I take it I have misunderstood the storage format.

> It might be reasonable to restrict the range of NUMERIC to the point
> that we could fit the weight/sign/dscale into 2 bytes instead of 4,
> thereby saving 2 bytes per NUMERIC. I'm not excited about the other
> aspects of this, though.

That seems easily doable - it seemed like something would stick. Restricting the total number of digits to 255 and a max scale of 254 would allow that saving, yes? We can then have a BIGNUMERIC which would allow up to 1000 digits for anybody out there who ever got that high.

I'm sure there's a few %, so I won't dismiss you entirely, guys...

Best Regards, Simon Riggs
On Tue, Nov 01, 2005 at 05:40:35PM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > You are proposing a fourth type, say VARLENA2, which looks a lot like a
> > varlena but it's not. I think the sheer volume of code that would need
> > to be checked is huge. Also, things like pg_attribute would need
> > changing because you have to represent this new state somehow.
>
> It wouldn't be an impossible amount of code --- for precedent see back
> when we made cstring into a full-fledged datatype (cstring is already
> a fourth option in your list BTW). That patch wasn't all that large IIRC.
> The issue in my mind is the performance implications of adding an
> additional case to places that are already hotspots. There were
> compelling functional reasons to pay that price to make cstring work,
> but "save 2 bytes per numeric" doesn't seem like it rises to that level.
> Maybe if we had a few other datatypes that could also use the feature.
> [ thinks... ] inet/cidr comes to mind but I don't see any others.
> The case seems a bit weak :-(

Would varchar(255) fit into that case? There's a heck of a lot of people who use that as "well, dunno how big this is so I'll just use 255". A better use case would be places where you know you'll only need 10-20 characters; saving 2 bytes in those cases would likely be worth it. This would work for char as well (and given that people are probably not in the habit of defining very large chars, it would probably be even more useful there). Of course that means either yet another varchar/char type, or we have some automatic cut-over for fields defined over a certain size...

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
"Jim C. Nasby" <jnasby@pervasive.com> writes: > FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to > fit length info into 1 or 2 bytes. So there's something to be said for a > small numeric type that has less overhead and a large numeric (what we > have today). I don't think it'd be worth having 2 types. Remember that the weight is measured in base-10k digits. Suppose for instancesign 1 bitweight 7 bits (-64 .. +63)dscale 8 bits(0..255) This gives us a dynamic range of 1e-256 to 1e255 as well as the ability to represent up to 255 displayable fraction digits. Does anyone know any real database applications where that's not enough? (I'm neglecting NaN here in supposing we need only 1 bit for sign, but we could have a special encoding for NaN. Perhaps disallow the weight = -64 case and use that to signal NaN.) regards, tom lane
On Tue, 2005-11-01 at 23:16 +0100, Martijn van Oosterhout wrote lots of useful things, thank you.

> > So, assuming I have this all correct, means we could reduce the on disk
> > storage for NUMERIC datatypes to the following struct. This gives an
> > overhead of just 2.5 bytes, plus the loss of the optimization of
> > trailing zeroes, which I assess as having almost no value anyway in
> > 99.9999% of data values (literally...).
>
> Actually, I have a table with a column declared as numeric(12,4)
> because there has to be 4 decimal places. As it turns out, the decimal
> places are mostly zero so the optimisation works for me.

Of course it fits some data. The point is whether it is useful for most people's data. My contention is that *most* (but definitely nowhere near all) NUMERIC data is either financial or measured data. That usually means it has digits that follow Benford's Law - which for this discussion is a variant on a uniform random distribution. Optimizing for trailing zeroes just isn't worth the very minimal benefits, in most cases. It doesn't really matter that it saves on storage and processing time in those cases - Amdahl's Law says we can ignore that saving because the optimized case is not prevalent enough for us to care.

Anybody like to work out a piece of SQL to perform data profiling and derive the distribution of values with trailing zeroes? I'd be happy to be proved wrong with an analysis of real data tables.

Best Regards, Simon Riggs
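The per-value measurement behind the profiling request above is just a count of trailing zero decimal digits — the digits the n_weight optimization compresses away. A minimal sketch of that check (an illustrative helper; the actual profiling would of course be SQL run over real tables):

```c
#include <assert.h>

/* Count trailing zero decimal digits of an integer value, i.e. how many
 * digits the trailing-zeroes optimization would avoid storing. */
static int trailing_zero_digits(long val)
{
    int n = 0;
    if (val == 0)
        return 0;   /* all-zero is a separate case, as noted downthread */
    if (val < 0)
        val = -val;
    while (val % 10 == 0) { val /= 10; n++; }
    return n;
}
```

Running this over a sample of column values and histogramming the results would give exactly the distribution being asked for.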
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Tue, Nov 01, 2005 at 05:40:35PM -0500, Tom Lane wrote: >> Maybe if we had a few other datatypes that could also use the feature. >> [ thinks... ] inet/cidr comes to mind but I don't see any others. >> The case seems a bit weak :-( > Would varchar(255) fit into that case? That's attractive at first thought, but not when you stop to consider that most of the string-datatype support is built around the assumption that text, varchar, and char share the same underlying representation. You'd have to write a whole bunch of new code to support such a datatype. regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Anybody like to work out a piece of SQL to perform data profiling and
> derive the distribution of values with trailing zeroes?

Don't forget leading zeroes. And all-zero (we omit digits entirely in that case). I don't think you can claim that zero isn't a common case.

regards, tom lane
On 11/1/05 2:38 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
> FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
> fit length info into 1 or 2 bytes. So there's something to be said for a
> small numeric type that has less overhead and a large numeric (what we
> have today).

The 38-digit limit is the decimal size of a 128-bit signed integer. The optimization has less to do with the size of the length info and more to do with fast math and fixed structure size.

J. Andrew Rogers
On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Anybody like to work out a piece of SQL to perform data profiling and
> > derive the distribution of values with trailing zeroes?
>
> Don't forget leading zeroes. And all-zero (we omit digits entirely in
> that case). I don't think you can claim that zero isn't a common case.

The question is: how common?

For INTEGERs I would accept that many are often zero. For NUMERIC, these are seldom exactly zero, IMHO.

This is one of those issues where we need to run tests and take input. We cannot decide this sort of thing just by debate alone. So, I'll leave this as a less potentially fruitful line of enquiry.

Best Regards, Simon Riggs
On 11/2/05, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > Anybody like to work out a piece of SQL to perform data profiling and
> > > derive the distribution of values with trailing zeroes?
> >
> > Don't forget leading zeroes. And all-zero (we omit digits entirely in
> > that case). I don't think you can claim that zero isn't a common case.
>
> The question is: how common?
>
> For INTEGERs I would accept that many are often zero. For NUMERIC, these
> are seldom exactly zero, IMHO.

Seconded. My INTEGER data does have quite a few zeros, but most of my NUMERIC columns hold debits and credits. Those are almost never zero.

> This is one of those issues where we need to run tests and take input.
> We cannot decide this sort of thing just by debate alone. So, I'll leave
> this as a less potentially fruitful line of enquiry.
>
> Best Regards, Simon Riggs

-- Mike Rylander mrylander@gmail.com GPLS -- PINES Development Database Developer http://open-ils.org
I am not able to quickly find your numeric format, so I'll just throw this in. MaxDB (I only mention this because the format and algorithms are now under the GPL, so they can be reviewed by the public) uses a nifty number format that allows the use of memcmp to compare two numbers when they are in the same precision and scale. Basically, the first byte contains the sign and the number of digits in the number (the number of digits is complemented if the number is negative), then the next N bytes contain the actual decimal digits, where N is the number of decimal digits / 2 (so two decimal digits per byte). Trailing 0's are removed to save space. So:

    0     is stored as {128}
    1     is stored as {193, 16}
    1000  is stored as {196, 16}
    1001  is stored as {196, 16, 1}   (hex: C4 10 01)
    -1    is stored as {63, 144}
    -1001 is stored as {60, 144}

Their storage allows for a max of 63 digits in a number, but it should be no problem to increase the size to 2 bytes, thus allowing up to 16,385 digits.

The advantages are:
- the ability to memcmp two numbers
- compact storage (can be made more compact if you choose to save hex digits instead of decimal, but I'm not sure you want to do that)

The disadvantages are as follows:
- this format does not remember the database definition for the number (that is, no precision or scale); numeric functions must be told what they are. It would be nice if the number kept track of that as well...
- comparing two numbers that are not the same precision and scale means converting one or both (if both precision and scale are different you may have to convert both)
- calculations (addition, subtraction, etc.) require functions to extract the digits and do the calculation a digit at a time
- I do not know of any trig functions, so they would need to be written

If anyone is interested, I would be happy to discuss this further.

Mike Pollard SUPRA Server SQL Engineering and Support Cincom Systems, Inc
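The memcmp property can be demonstrated with a deliberately simplified variant of such a scheme (written for this discussion, not MaxDB's exact encoding: non-negative integers only, the first byte is a plain digit count rather than the biased sign/exponent byte, and complement handling for negatives is omitted):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* First byte: total decimal digit count (larger magnitudes sort later).
 * Then two BCD digits per byte, trailing zero digits stripped. */
static int encode(long val, uint8_t out[])
{
    int d[20], n = 0, strip = 0, len = 1, hi = -1;

    if (val == 0) { out[0] = 0; return 1; }          /* zero: count 0 */
    while (val > 0) { d[n++] = (int)(val % 10); val /= 10; }
    while (d[strip] == 0) strip++;                   /* strip trailing 0s */
    out[0] = (uint8_t) n;
    for (int i = n - 1; i >= strip; i--) {           /* most significant first */
        if (hi < 0) hi = d[i];
        else { out[len++] = (uint8_t)((hi << 4) | d[i]); hi = -1; }
    }
    if (hi >= 0) out[len++] = (uint8_t)(hi << 4);    /* pad odd digit */
    return len;
}

/* memcmp on the shared prefix; on a tie the shorter encoding (fewer
 * significant digits) is the smaller number. */
static int num_cmp(long a, long b)
{
    uint8_t ea[12], eb[12];
    int la = encode(a, ea), lb = encode(b, eb);
    int c = memcmp(ea, eb, (size_t)(la < lb ? la : lb));
    return c != 0 ? c : la - lb;
}
```

With this layout 1000 encodes as {4, 0x10} and 1001 as {4, 0x10, 0x01}, mirroring the {196, 16} / {196, 16, 1} examples above modulo the sign bias, and byte-wise comparison reproduces numeric order without decoding.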
On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
> > fit length info into 1 or 2 bytes. So there's something to be said for a
> > small numeric type that has less overhead and a large numeric (what we
> > have today).
>
> I don't think it'd be worth having 2 types. Remember that the weight is
> measured in base-10k digits. Suppose for instance
>     sign    1 bit
>     weight  7 bits (-64 .. +63)
>     dscale  8 bits (0..255)
> This gives us a dynamic range of 1e-256 to 1e255 as well as the ability
> to represent up to 255 displayable fraction digits. Does anyone know
> any real database applications where that's not enough?
>
> (I'm neglecting NaN here in supposing we need only 1 bit for sign,
> but we could have a special encoding for NaN. Perhaps disallow the
> weight = -64 case and use that to signal NaN.)

I've coded a short patch to do this, which is the result of two alternate patches and some thinking, but maybe not enough yet. The patch given here is different on two counts from the above.

This sets

    #define NUMERIC_MAX_PRECISION 64

since

    #define NUMERIC_MAX_RESULT_SCALE (NUMERIC_MAX_PRECISION * 2)

We don't seem to be able to use all of the bits actually available to us in the format. Perhaps we need to decouple these now? Previously, we had room for 14 bits, which gave a maximum of 16384. We were using a NUMERIC_MAX of 1000, so doubling it didn't give problems.

The above on-disk format showed sign & weight together, whereas the current code has sign and dscale together. Trying to put sign and weight together is somewhat difficult, since weight is itself a signed value. I coded it up that way around, which is reasonably straightforward but harder than the patch enclosed here. But AFAICS - which isn't that far normally, I grant you - doing things that way around would require some twos-complement work to get things correct when weight is negative.
That worries me. IMHO we should accept the step down in maximum numeric precision (down to "only" 64 digits) rather than put extra processing into every manipulation of a NUMERIC datatype. With luck, I've misunderstood and we can have both performance and precision. If not, I commend 64 digits to you as sufficient for every imaginable purpose - saving 2 bytes off every numeric column. (And still 28 decimal places more accurate than Oracle). Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
>> I don't think it'd be worth having 2 types. Remember that the weight is
>> measured in base-10k digits. Suppose for instance
>>     sign    1 bit
>>     weight  7 bits (-64 .. +63)
>>     dscale  8 bits (0..255)
>
> I've coded a short patch to do this, which is the result of two
> alternate patches and some thinking, but maybe not enough yet.

What your patch does is

    sign    2 bits
    weight  8 bits (-128..127)
    dscale  6 bits (0..63)

which is simply pretty lame: weight effectively has a factor of 8 more dynamic range than dscale in this representation. What's the point of being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale only lets you show 63 fractional digits? You've got to allocate the bits in a saner fashion. Yes, that takes a little more work.

Also, since the internal (unpacked) calculation representation has a much wider dynamic range than this, it'd probably be appropriate to add some range checks to the code that forms a packed value from unpacked.

regards, tom lane
On Wed, Nov 02, 2005 at 08:48:25AM +0000, Simon Riggs wrote:
> On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > Anybody like to work out a piece of SQL to perform data profiling and
> > > derive the distribution of values with trailing zeroes?
> >
> > Don't forget leading zeroes. And all-zero (we omit digits entirely in
> > that case). I don't think you can claim that zero isn't a common case.
>
> The question is: how common?
>
> For INTEGERs I would accept that many are often zero. For NUMERIC, these
> are seldom exactly zero, IMHO.
>
> This is one of those issues where we need to run tests and take input.
> We cannot decide this sort of thing just by debate alone. So, I'll leave
> this as a less potentially fruitful line of enquiry.

Is it worth coming up with some script that users can run against a table to provide us with real data?

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
On Wed, 2005-11-02 at 13:46 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I've coded a short patch to do this, which is the result of two
> > alternate patches and some thinking, but maybe not enough yet.
>
> What your patch does is

Thanks for checking this so quickly.

>     sign    2 bits

OK, that's just a mistake in my second patch. That's easily corrected. Please ignore that for now.

>     weight  8 bits (-128..127)
>     dscale  6 bits (0..63)
>
> which is simply pretty lame: weight effectively has a factor of 8 more
> dynamic range than dscale in this representation. What's the point of
> being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale
> only lets you show 63 fractional digits? You've got to allocate the
> bits in a saner fashion. Yes, that takes a little more work.

I wasn't trying to claim the bit assignment made sense. My point was that the work to mangle the two fields together to make it make sense looked like it would take more CPU (since the standard representation of signed integers is different for +ve and -ve values). It is the extra CPU I'm worried about, not the wasted bits on the weight. Spending CPU cycles on *all* numerics just so we can have numbers with > +/-64 decimal places doesn't seem a good trade. Hence I stuck the numeric sign back on the dscale, and so dscale and weight seem out of balance.

So, AFAICS, the options are:

0. (current cvstip) numeric range up to 1000 digits, with an additional 2 bytes per column value
1. numeric range up to 128 digits, but with overhead to extract the last bit
2. numeric range up to 64 digits

I'm suggesting we choose (2)... other views are welcome. (I'll code it whichever way we decide.)
> Also, since the internal (unpacked) calculation representation has a
> much wider dynamic range than this, it'd probably be appropriate to add
> some range checks to the code that forms a packed value from unpacked.

Well, there already is one that does that, otherwise I would have added one as you suggest. (The unpacked code has int values, whereas the previous packed format used u/int16 values.)

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> I wasn't trying to claim the bit assignment made sense. My point was
> that the work to mangle the two fields together to make it make sense
> looked like it would take more CPU (since the standard representation of
> signed integers is different for +ve and -ve values). It is the more CPU
> I'm worried about, not the wasted bits on the weight.

I think that's purely hypothetical. The existing representation, as well as the one you propose, both require shift-and-mask operations to pull the fields out of the packed format. The format I'm suggesting would require some different shift-and-mask operations. As a first approximation it'd be a wash, and any actual differences would be CPU-specific enough that we shouldn't put a whole lot of weight on the point. (C compilers tend to be fairly bright about optimizing field extraction operations.)

Moreover, you've forgotten the basic gating factor here, which is whether such a proposal will get accepted at all. Reducing the available range from 1000 digits to 255 might pass without too much objection, but dropping it down another factor of 4 to 63 starts to bring it uncomfortably close to mattering to people.

[ thinks for a moment... ] Actually, neither proposal is going to get off the ground, because the parser's handling of numeric constants is predicated on the assumption that type NUMERIC can handle any valid value of FLOAT8, and so we can get away with converting to NUMERIC on sight and then coercing to float later if parse analysis finds out the constant should be float. If the dynamic range of NUMERIC is less than 10^308 then this fails. So we have to find another bit somewhere, or the idea is dead in the water.

regards, tom lane
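The 10^308 constraint can be checked against the header layouts under discussion. On IEEE-754 platforms DBL_MAX is just under 1.8 × 10^308, so a base-10000 weight must reach about 308/4 ≈ 77 to cover it, while a 7-bit signed weight tops out at 63. The arithmetic, as a sketch:

```c
#include <assert.h>
#include <float.h>

/* Smallest base-10000 weight w whose magnitude range (values below
 * 10000^(w+1), i.e. 10^(4w+4)) covers the largest FLOAT8 value.
 * Illustrative arithmetic for the thread's gating constraint. */
static int weight_needed_for_double(void)
{
    return (DBL_MAX_10_EXP + 3) / 4;   /* ceil((308 + 1) / 4) = 77 */
}
```

Since 77 > 63, a 7-bit weight cannot cover FLOAT8's range, which is exactly why another bit (or a wider weight field) has to be found somewhere.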
On Wed, Nov 02, 2005 at 12:53:07PM -0600, Jim C. Nasby wrote:
> > This is one of those issues where we need to run tests and take input.
> > We cannot decide this sort of thing just by debate alone. So, I'll leave
> > this as a less potentially fruitful line of enquiry.
>
> Is it worth coming up with some script that users can run against a
> table to provide us with real data?

Like I said, I have a few columns of numeric(12,4). They're costs in cents, to 4 decimal places. The test is (column = trunc(column)). Sample data:

      col1   |  col2   |  col3  |  col4
    ---------+---------+--------+---------
     21.0000 | 10.1818 | 0.0000 | 21.0000
     22.0000 | 11.2727 | 0.0000 | 22.0000
     22.0000 |  6.0909 | 0.0000 | 22.0000

For each column (across 17 million rows):

    Col 1: 83% trailing zeros
    Col 2: 49%
    Col 3: 94%
    Col 4: 83%

AIUI, I currently get the four decimal places for free, and the idea is to store them explicitly. Fact is, things that cost fractions of cents are not that common, in this database anyway. As for the argument in general, this table is so wide that any gain will vanish into the slack at the end of a block, so it won't actually matter...

Have a nice day,
-- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
On Wed, 2005-11-02 at 15:09 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I wasn't trying to claim the bit assignment made sense. My point was > > that the work to mangle the two fields together to make it make sense > > looked like it would take more CPU (since the standard representation of > > signed integers is different for +ve and -ve values). It is the more CPU > > I'm worried about, not the wasted bits on the weight. > > I think that's purely hypothetical. The existing representation, as > well as the one you propose, both require shift-and-mask operations > to pull the fields out of the packed format. The format I'm suggesting > would require some different shift-and-mask operations. As a first > approximation it'd be a wash, and any actual differences would be > CPU-specific enough that we shouldn't put a whole lot of weight on the > point. (C compilers tend to be fairly bright about optimizing field > extraction operations.) OK > Moreover, you've forgotten the basic gating factor here, which is > whether such a proposal will get accepted at all. Reducing the > available range from 1000 digits to 255 might pass without too much > objection, but dropping it down another factor of 4 to 63 starts to > bring it uncomfortably close to mattering to people. > > [ thinks for a moment... ] Actually, neither proposal is going to get > off the ground, because the parser's handling of numeric constants is > predicated on the assumption that type NUMERIC can handle any valid > value of FLOAT8, and so we can get away with converting to NUMERIC on > sight and then coercing to float later if parse analysis finds out the > constant should be float. If the dynamic range of NUMERIC is less than > 10^308 then this fails. So we have to find another bit somewhere, or > the idea is dead in the water. Well, that certainly is obscure, I'll give you that. 308 huh? The middle ground between 64 and 308 is somewhere around 255, yes? :-) I'll get on it. 
Including Catch-308. Best Regards, Simon Riggs
On Wed, 2005-11-02 at 15:09 -0500, Tom Lane wrote: > [ thinks for a moment... ] Actually, neither proposal is going to get > off the ground, because the parser's handling of numeric constants is > predicated on the assumption that type NUMERIC can handle any valid > value of FLOAT8, and so we can get away with converting to NUMERIC on > sight and then coercing to float later if parse analysis finds out the > constant should be float. If the dynamic range of NUMERIC is less than > 10^308 then this fails. So we have to find another bit somewhere, or > the idea is dead in the water. We convert a Value node to a Const in backend/parser/parse_node.c:make_const. It seems straightforward enough to put in an additional test, similar to the ones already there, so that if it's too big for a decimal we make it a float straight away - only a float can be that big in that case. After that I can't really see what the problem is? There is only a single call where numeric_float8() is called anywhere in the code, which is during selectivity calculations. In that case we actually call numeric_float8_no_overflow(). If it's a FLOAT8OID, then we can simply avoid the conversion, since it already would be one. Can you explain further? Thanks, Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > It seems straightforward enough to put in an additional test, similar to > the ones already there so that if its too big for a decimal we make it a > float straight away - only a float can be that big in that case. After > that I can't really see what the problem is? Wrong answer. You'll be introducing weird corner cases into the type resolution behavior. An approach that would actually have some credibility would be to not resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC pseudotype with coercion behavior comparable to the existing UNKNOWN type for string literals. This has been proposed before but hasn't really been needed so far. Of course, this converts the project from a minor localized hack on NUMERIC into a major piece of fiddling with the type resolution rules, with the potential for unforeseen side-effects on the behavior of other data types. It might be worth doing anyway --- I don't recall at the moment what problems UNKNOWNNUMERIC was intended to solve, but if they're still open issues then it's something we ought to get around to sometime. regards, tom lane
On Wed, Nov 02, 2005 at 06:12:37PM -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > It seems straightforward enough to put in an additional test, similar to > > the ones already there so that if its too big for a decimal we make it a > > float straight away - only a float can be that big in that case. After > > that I can't really see what the problem is? > > Wrong answer. You'll be introducing weird corner cases into the type > resolution behavior. > > An approach that would actually have some credibility would be to not > resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC > pseudotype with coercion behavior comparable to the existing UNKNOWN > type for string literals. This has been proposed before but hasn't > really been needed so far. Of course, this converts the project from a > minor localized hack on NUMERIC into a major piece of fiddling with the > type resolution rules, with the potential for unforeseen side-effects on > the behavior of other data types. It might be worth doing anyway --- I > don't recall at the moment what problems UNKNOWNNUMERIC was intended to > solve, but if they're still open issues then it's something we ought to > get around to sometime. Thought I'd look to see if I could find anything about UNKNOWNNUMERIC, but no such luck (ISTM we really need a better way to find discussion on old ideas...) But while looking I did find this TODO, which might be relevant to the current discussion: # Change NUMERIC to enforce the maximum precision, and increase it Unfortunately I can't find any reference to that in the archives... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
[patches removed] Tom Lane wrote: >Simon Riggs <simon@2ndquadrant.com> writes: > > >>It seems straightforward enough to put in an additional test, similar to >>the ones already there so that if its too big for a decimal we make it a >>float straight away - only a float can be that big in that case. After >>that I can't really see what the problem is? >> >> > >Wrong answer. You'll be introducing weird corner cases into the type >resolution behavior. > >An approach that would actually have some credibility would be to not >resolve constants to NUMERIC right away, but to invent an UNKNOWNNUMERIC >pseudotype with coercion behavior comparable to the existing UNKNOWN >type for string literals. This has been proposed before but hasn't >really been needed so far. Of course, this converts the project from a >minor localized hack on NUMERIC into a major piece of fiddling with the >type resolution rules, with the potential for unforeseen side-effects on >the behavior of other data types. It might be worth doing anyway --- I >don't recall at the moment what problems UNKNOWNNUMERIC was intended to >solve, but if they're still open issues then it's something we ought to >get around to sometime. > > > > Could someone please quantify how much bang we might get for what seems like quite a lot of bucks? I appreciate the need for speed, but the saving here strikes me as marginal at best, unless my instincts are all wrong (quite possible) cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Could someone please quantify how much bang we might get for what seems > like quite a lot of bucks? > I appreciate the need for speed, but the saving here strikes me as > marginal at best, unless my instincts are all wrong (quite possible) Two bytes per numeric value is not a lot, agreed. If we were willing to invent the "varlena2" datum format then we could save four bytes per numeric, plus reduce numeric's alignment requirement from int to short which would probably save another byte per value on average. I'm not sure that that's worth doing if numeric and inet are the only beneficiaries, but it might be. From a speed perspective the only thing favoring UNKNOWNNUMERIC is the possibility for saving some conversion operations when the grammar's initial guess about datatype is wrong. That's probably pretty marginal though. I was thinking of it more as a vehicle for helping us clean up the type-assignment behavior some more. The example I have in my notes is that "float4col = 1.8" is certain to fail since 1.8 will be taken as float8, not float4, and then you have precision mismatch problems. You can make it work by quoting: "float4col = '1.8'" but that's surely pretty ugly. If the constant were initially UNKNOWNNUMERIC then it would end up as float4 without that hack. regards, tom lane
On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > Could someone please quantify how much bang we might get for what seems > > like quite a lot of bucks? > > I appreciate the need for speed, but the saving here strikes me as > > marginal at best, unless my instincts are all wrong (quite possible) > > Two bytes per numeric value is not a lot, agreed. I'm optimising for Data Warehousing. If you have a very large table with a higher proportion of numerics on it, then your saving can be >5% of tablesize which could be very useful. For the general user, it might produce less benefit, I accept. At the moment we've established we can do this fairly much for free. i.e. nobody cares about the drop in digits (to 255) and the other coding seems very minimal. I've accepted that the runtime costs are almost zero. Even the coding is small, so we've talked more lines than the patch already. Seems like a quick win; Tom is helping me to make sure it is not a quick and dirty win and I very much appreciate that. [I've taken this off -patches to show that we are not really discussing minor mods to the original patch anymore.] > If we were willing to invent the "varlena2" datum format then we could > save four bytes per numeric, plus reduce numeric's alignment requirement > from int to short which would probably save another byte per value on > average. I'm not sure that that's worth doing if numeric and inet are > the only beneficiaries, but it might be. That and variations can be the next discussion. They sound good. Those ideas are complementary, if I understand them. Best Regards, Simon Riggs
On Thu, 2005-11-03 at 08:27 +0000, Simon Riggs wrote: > On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > If we were willing to invent the "varlena2" datum format then we could > > save four bytes per numeric, plus reduce numeric's alignment requirement > > from int to short which would probably save another byte per value on > > average. I'm not sure that that's worth doing if numeric and inet are > > the only beneficiaries, but it might be. > > That and variations can be the next discussion. They sound good. Kicking off the discussion on that... The varlena2 datum format sounds interesting. If we did that, I'd also like to apply that thought to VAR/CHAR(32000) and below. (The benefit of varlena2 is a saving of 2 bytes + ~1 byte alignment, yes? The other two bytes come from the other numeric savings discussed.) Alternatively, what I'd been thinking about was altering the self-contained nature of PostgreSQL datatypes. In other databases, CHAR(12) and NUMERIC(12) are fixed length datatypes. In PostgreSQL, they are dynamically varying datatypes. What actually happens is that in many other systems the datatype is the same, but additional metadata is provided for that particular attribute. So CHAR(12) is a datatype of CHAR with a metadata item called length which is set to 12 for that attribute. On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of that datatype having a 4 byte varlena header. In this example, all of those instantiations have the varlena header set to 12, essentially wasting the 4-byte header. It seems like it would be an interesting move to allow the attribute metadata to be stored in the TupleDesc, so we can store it once, rather than once per row. If we did this we would need two datatypes where currently we need only one. We would still need the variable-length char datatype VARCHAR, and we would be inventing a new fixed-char datatype CHAR(n) with the length as metadata.
This would give us two things:
- reduce many attributes by 4 bytes in length
- allow attribute access to increase considerably in speed for queries, sorts etc. since more of the tuple offsets will be constant

Anyway, I accept that many will say I clearly don't understand Object Relational. It seems like this could be done without actually breaking anything. The question is, how much work would it be? Best Regards, Simon Riggs
On Thu, Nov 03, 2005 at 01:49:46PM +0000, Simon Riggs wrote: > In other databases, CHAR(12) and NUMERIC(12) are fixed length datatypes. > In PostgreSQL, they are dynamically varying datatypes. Please explain how a CHAR(12) can store 12 UTF-8 characters when each character may be 1 to 4 bytes, unless the CHAR itself is variable length... > What actually happens is that in many other systems the datatype is the > same, but additional metadata is provided for that particular attribute. > So CHAR(12) is a datatype of CHAR with a metadata item called length > which is set to 12 for that attribute. We already have this metadata, it's called atttypmod and it's stored in pg_attribute. That's where the 12 for CHAR(12) is stored BTW. > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of > that datatype having a 4 byte varlena header. In this example, all of > those instantiations having the varlena header set to 12, so essentially > wasting the 4 byte header. Nope, the varlena header stores the actual length on disk. If you store "hello" in a char(12) field it takes only 9 bytes (4 for the header, 5 for the data), which is less than 12. Good ideas, but it all hinges on the fact that CHAR(12) can take a fixed amount of space, which simply isn't true in a multibyte encoding. Having a different header for things shorter than 255 bytes has been discussed before, that's another argument though. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Simon Riggs wrote: > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of > that datatype having a 4 byte varlena header. In this example, all of > those instantiations having the varlena header set to 12, so essentially > wasting the 4 byte header. We need the length word because the actual size in bytes is variable, due to multibyte encoding considerations. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, 2005-11-03 at 11:13 -0300, Alvaro Herrera wrote: > Simon Riggs wrote: > > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of > > that datatype having a 4 byte varlena header. In this example, all of > > those instantiations having the varlena header set to 12, so essentially > > wasting the 4 byte header. > > We need the length word because the actual size in bytes is variable, > due to multibyte encoding considerations. Succinctly put, thanks. Incidentally, you remind me that other databases do *not* vary the character length, even if they do have varying length UTF-8 within them. So if you define CHAR(255) then it could blow up at a random length if you store UTF-8 within it. That's behaviour that I could never sanction, so I'll leave this now. Best Regards, Simon Riggs
On Thu, 3 Nov 2005, Simon Riggs wrote: > On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > Andrew Dunstan <andrew@dunslane.net> writes: > > > Could someone please quantify how much bang we might get for what seems > > > like quite a lot of bucks? > > > I appreciate the need for speed, but the saving here strikes me as > > > marginal at best, unless my instincts are all wrong (quite possible) > > > > Two bytes per numeric value is not a lot, agreed. > > I'm optimising for Data Warehousing. If you have a very large table with > a higher proportion of numerics on it, then your saving can be >5% of > tablesize which could be very useful. For the general user, it might > produce less benefit, I accept. > > At the moment we've established we can do this fairly much for free. > i.e. nobody cares about the drop in digits (to 255) and the other coding I don't believe the above is safe to say, yet. AFAICS, this has been discussed only on hackers (and patches) in this discussion, whereas this sort of change should probably be brought up on general as well to get a greater understanding of whether there are people who care. I expect that there won't be, but given that I'm still not sure what the plan to support applications upward is for this change, I think it's probably a good idea to query a larger segment of the population.
Simon Riggs wrote: > On Thu, 2005-11-03 at 11:13 -0300, Alvaro Herrera wrote: > >>Simon Riggs wrote: >> >>>On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of >>>that datatype having a 4 byte varlena header. In this example, all of >>>those instantiations having the varlena header set to 12, so essentially >>>wasting the 4 byte header. >> >>We need the length word because the actual size in bytes is variable, >>due to multibyte encoding considerations. > > > Succinctly put, thanks. > > Incidentally, you remind me that other databases do *not* vary the > character length, even if they do have varying length UTF-8 within them. > So if you define CHAR(255) then it could blow up at a random length if > you store UTF-8 within it. > > That's behaviour that I could never sanction, so I'll leave this now. > > Best Regards, Simon Riggs > Just as a side note, in Oracle you can use the syntax (f.ex on a db with utf-8 charset): column VARCHAR2(10 CHAR) ...to indicate that Oracle should fit 10 characters there. It might use up to 40 bytes in the db, but that's up to Oracle. If I s/10 CHAR/10, at most 10 characters will fit. This works very well. The only catch is that it's not good to use more than 1000 chars since Oracle's varchars don't want to go past 4000 bytes. Best regards, Marcus
Stephan Szabo wrote: > > On Thu, 3 Nov 2005, Simon Riggs wrote: > > > On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > > Andrew Dunstan <andrew@dunslane.net> writes: > > > > Could someone please quantify how much bang we might get for what seems > > > > like quite a lot of bucks? > > > > I appreciate the need for speed, but the saving here strikes me as > > > > marginal at best, unless my instincts are all wrong (quite possible) > > > > > > Two bytes per numeric value is not a lot, agreed. > > > > I'm optimising for Data Warehousing. If you have a very large table with > > a higher proportion of numerics on it, then your saving can be >5% of > > tablesize which could be very useful. For the general user, it might > > produce less benefit, I accept. > > > > At the moment we've established we can do this fairly much for free. > > i.e. nobody cares about the drop in digits (to 255) and the other coding > > I don't believe the above is safe to say, yet. AFAICS, this has been > discussed only on hackers (and patches) in this discussion, whereas this > sort of change should probably be brought up on general as well to get a > greater understanding of whether there are people who care. I expect that > there won't be, but given that I'm still not sure what the plan to support > applications upward is for this change, I think it's probably a good idea > to query a larger segment of the population. Agreed. With the proposal, we are saving perhaps 5% storage space for numeric fields, but are adding code complexity and reducing its possible precision. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> On Thu, 3 Nov 2005, Simon Riggs wrote: >>> At the moment we've established we can do this fairly much for free. > Agreed. With the proposal, we are saving perhaps 5% storage space for > numeric fields, but are adding code complexity and reducing its possible > precision. Having to invent UNKNOWNNUMERIC is hardly what I'd call "for free". That takes it out of the realm of being a small localized project. I'd feel a lot happier about this if we could keep the dynamic range up to, say, 10^512 so that it's still true that NUMERIC can be a universal parse-time representation. That would also make it even more unlikely that anyone would complain about loss of functionality. To do that we'd need 8 bits for weight (-128..127 for a base-10K exponent is enough) but we need 9 bits for dscale which does not quite fit. I think we could make it go by cramming the sign and the high-order dscale bit into the first NumericDigit --- the digit itself can only be 0..9999 so there are a couple of bits to spare. This probably *would* slow down packing and unpacking of numerics, but just by a couple lines of C. Arguably the net reduction in I/O costs would justify that. regards, tom lane
On Thu, 2005-11-03 at 07:03 -0800, Stephan Szabo wrote: > I don't believe the above is safe to say, yet. AFAICS, this has been > discussed only on hackers (and patches) in this discussion, whereas this > sort of change should probably be brought up on general as well to get a > greater understanding of whether there are people who care. I expect that > there won't be, but given that I'm still not sure what the plan to support > applications upward is for this change, I think it's probably a good idea > to query a larger segment of the population. Fairly spoken, I believe. I would suggest that I work just a little more with Tom et al to make sure we have a plan that can fly. If so, I will query GENERAL as you suggest. Best Regards, Simon Riggs
Simon Riggs wrote: >On Wed, 2005-11-02 at 19:12 -0500, Tom Lane wrote: > > >>Andrew Dunstan <andrew@dunslane.net> writes: >> >> >>>Could someone please quantify how much bang we might get for what seems >>>like quite a lot of bucks? >>>I appreciate the need for speed, but the saving here strikes me as >>>marginal at best, unless my instincts are all wrong (quite possible) >>> >>> >>Two bytes per numeric value is not a lot, agreed. >> >> > >I'm optimising for Data Warehousing. If you have a very large table with >a higher proportion of numerics on it, then your saving can be >5% of >tablesize which could be very useful. For the general user, it might >produce less benefit, I accept. > > > Well, it could also be argued that DW apps could often get away with using floating point types, even where the primary source needs to be in fixed point for accuracy, and that could generate lots of savings in speed and space. But I guess everybody gets to make their own choices. cheers andrew
On Thu, Nov 03, 2005 at 03:09:26PM +0100, Martijn van Oosterhout wrote: > On Thu, Nov 03, 2005 at 01:49:46PM +0000, Simon Riggs wrote: > > In other databases, CHAR(12) and NUMERIC(12) are fixed length datatypes. > > In PostgreSQL, they are dynamically varying datatypes. > Please explain how a CHAR(12) can store 12 UTF-8 characters when each > character may be 1 to 4 bytes, unless the CHAR itself is variable > length... > ... > Nope, the varlena header stores the actual length on disk. If you store > "hello" in a char(12) field it takes only 9 bytes (4 for the header, 5 > for the data), which is less than 12. > ... > Having a different header for things shorter than 255 bytes has been > discussed before, that's another argument though. It's unfortunate that the length is encoded multiple times. In UTF-8, for instance, each character has its length encoded in the most significant bits. Complicated to extract, however, the data is encoded twice: once in the header, and once in the combination of the column attribute and the per-character lengths. For "other databases", the column could be encoded as 2 byte characters or 4 byte characters, allowing it to be fixed. I find myself doubting that ASCII characters could be encoded more efficiently in such formats than with the inclusion of a length header and per-character length encoding, but for multibyte characters, the race is probably even. :-) I dunno... no opinion on the matter here, but I did want to point out that the field can be fixed length without a header. Those proposing such a change, however, should accept that this may result in an overall expense. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
On Thu, 2005-11-03 at 11:36 -0500, Andrew Dunstan wrote: > Well, it could also be argued that DW apps could often get away with > using floating point types, even where the primary source needs to be in > fixed point for accuracy, and that could generate lots of savings in > speed and space. But I guess everybody gets to make their own choices. I've never seen a requirement for a DW to be less accurate than the primary source. Its almost always the other way around. If its money, then no. If its other physical measurements, then yes. There's lots of scientific apps that wouldn't use NUMERICs much, but there's enough apps that do for me to care. Best Regards, Simon Riggs
On Thu, Nov 03, 2005 at 12:28:02PM -0500, mark@mark.mielke.cc wrote: > It's unfortunate that the length is encoded multiple times. In UTF-8, > for instance, each character has its length encoded in the most > significant bits. Complicated to extract, however, the data is encoded > twice. 1 in the header, and 1 in the combination between the column > attribute, and the per character lengths. > > For "other databases", the column could be encoded as 2 byte characters > or 4 byte characters, allowing it to be fixed. I find myself doubting > that ASCII characters could be encoded more efficiently in such formats, > than the inclusion of a length header and per character length encoding, > but for multibyte characters, the race is probably even. :-) That's called UTF-16 and is currently not supported by PostgreSQL at all. That may change, since the locale library ICU requires UTF-16 for everything. The question is, if someone declares a field CHAR(20), do they really mean to fix 40 bytes of storage for each and every row? I doubt it, that's even more wasteful of space than a varlena header. Which puts you right back to variable length fields. > I dunno... no opinion on the matter here, but I did want to point out > that the field can be fixed length without a header. Those proposing such > a change, however, should accept that this may result in an overall > expense. The only time this may be useful is for *very* short fields, in the order of 4 characters or less. Else the overhead swamps the varlena header... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
On 2005-11-03, Martijn van Oosterhout <kleptog@svana.org> wrote: >> For "other databases", the column could be encoded as 2 byte characters >> or 4 byte characters, allowing it to be fixed. I find myself doubting >> that ASCII characters could be encoded more efficiently in such formats, >> than the inclusion of a length header and per character length encoding, >> but for multibyte characters, the race is probably even. :-) > > That's called UTF-16 and is currently not supported by PostgreSQL at > all. That may change, since the locale library ICU requires UTF-16 for > everything. Even UTF-16 is not fixed-length; some characters require 4 bytes, not 2. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
On Thu, 2005-11-03 at 10:32 -0500, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > >> On Thu, 3 Nov 2005, Simon Riggs wrote: > >>> At the moment we've established we can do this fairly much for free. > > > Agreed. With the proposal, we are saving perhaps 5% storage space for > > numeric fields, but are adding code complexity and reducing its possible > > precision. > > Having to invent UNKNOWNNUMERIC is hardly what I'd call "for free". > That takes it out of the realm of being a small localized project. > I'd feel a lot happier about this if we could keep the dynamic range > up to, say, 10^512 so that it's still true that NUMERIC can be a > universal parse-time representation. That would also make it even > more unlikely that anyone would complain about loss of functionality. > > To do that we'd need 8 bits for weight (-128..127 for a base-10K > exponent is enough) but we need 9 bits for dscale which does not > quite fit. I think we could make it go by cramming the sign and > the high-order dscale bit into the first NumericDigit --- the > digit itself can only be 0..9999 so there are a couple of bits > to spare. This probably *would* slow down packing and unpacking of > numerics, but just by a couple lines of C. Arguably the net reduction > in I/O costs would justify that. This sounds a much better solution and the code neater too. This still would be a very small patch, not very intrusive at all. I'll have a hack at this now. Best Regards, Simon Riggs
On 11/3/05, Martijn van Oosterhout <kleptog@svana.org> wrote: > That's called UTF-16 and is currently not supported by PostgreSQL at > all. That may change, since the locale library ICU requires UTF-16 for > everything. UTF-16 doesn't get us out of the variable length character game, for that we need UTF-32... Unless we were to only support UCS-2, which is what some databases do for their Unicode support. I think that would be a huge step back and, as you pointed out below, it is not efficient. :) > The question is, if someone declares a field CHAR(20), do they really > mean to fix 40 bytes of storage for each and every row? I doubt it, > that's even more wasteful of space than a varlena header. > > Which puts you right back to variable length fields. Another way to look at this is in the context of compression: with Unicode, characters are really 32-bit values... But only a small range of these values is common. So we store and work with them in a compressed format, UTF-8. The cost of compression is that fixed-width fields cannot be fixed width, and some operations are much more expensive than they would be otherwise. As such it might be more interesting to ask some other questions like: are we using the best compression algorithm for the application, and why do we sometimes stack two compression algorithms? For longer fields would we be better off working with UTF-32 and being more aggressive about where we LZ compress the fields? > > I dunno... no opinion on the matter here, but I did want to point out > > that the field can be fixed length without a header. Those proposing such > > a change, however, should accept that this may result in an overall > > expense. > > The only time this may be useful is for *very* short fields, in the > order of 4 characters or less. Else the overhead swamps the varlena > header... Not even 4 characters if we are to support all of unicode...
Length + UTF-8 is a win vs UTF-32 in most cases for fields with more than one character.
Gregory Maxwell <gmaxwell@gmail.com> writes:
> Another way to look at this is in the context of compression: With
> unicode, characters are really 32bit values... But only a small range
> of these values is common. So we store and work with them in a
> compressed format, UTF-8.

> As such it might be more interesting to ask some other questions like:
> are we using the best compression algorithm for the application, and,
> why do we sometimes stack two compression algorithms?

Actually, the real reason we use UTF-8 and not any of the
sorta-fixed-size representations of Unicode is that the backend is by
and large an ASCII, null-terminated-string engine. *All* of the
supported backend encodings are ASCII-superset codes. Making everything
null-safe in order to allow use of UCS2 or UCS4 would be a huge amount
of work, and the benefit is at best questionable.

regards, tom lane
On Thu, Nov 03, 2005 at 09:17:43PM -0500, Tom Lane wrote:
> Actually, the real reason we use UTF-8 and not any of the
> sorta-fixed-size representations of Unicode is that the backend is by
> and large an ASCII, null-terminated-string engine. *All* of the
> supported backend encodings are ASCII-superset codes. Making
> everything null-safe in order to allow use of UCS2 or UCS4 would be
> a huge amount of work, and the benefit is at best questionable.

Perhaps on a side note - my intuition (which sometimes lies) would tell
me that, if the above is true, the backend is doing unnecessary copies
of read-only data, if only to insert a '\0' at the end of the strings.
Is this true?

I'm thinking along the lines of the other threads that speak of
PostgreSQL being CPU or I/O bound, not disk bound, for many sorts of
operations. Is PostgreSQL unnecessarily copying string data around (and
other data, I would assume)?

Cheers, mark

-- mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood
Coder | Ottawa, Ontario, Canada | http://mark.mielke.cc/
On Fri, Nov 04, 2005 at 08:38:38AM -0500, mark@mark.mielke.cc wrote:
> Perhaps on a side note - my intuition (which sometimes lies) would tell
> me that, if the above is true, the backend is doing unnecessary copies
> of read-only data, if only to insert a '\0' at the end of the strings.
> Is this true?

It's not quite that bad. Obviously for all on-disk datatypes zeros are
allowed. Bit strings, arrays, timestamps, numerics can all have
embedded nulls, and they have a length header.

Where this becomes an issue is for things like table names, field
names, encoding names, etc. The "name" type is a fixed-length string
which is kept in a way that it can be treated as a C string. If these
could contain null characters it would get messy.

I can conceive that the backend could support a UTF-16 datatype which
would be indexable and have various support functions. But as soon as
it came to talking to clients, it would be converted back to UTF-8
because libpq treats all strings coming back as null-terminated.
Similarly, queries sent couldn't be anything other than UTF-8 either.
And if users can't send or receive UTF-16 text, why should the backend
store it that way?

> I'm thinking along the lines of the other threads that speak of
> PostgreSQL being CPU or I/O bound, not disk bound, for many sorts of
> operations. Is PostgreSQL unnecessarily copying string data around
> (and other data, I would assume)?

Well, there is a bit of copying around while creating tuples and such,
but it's not to add null terminators.
Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Fri, Nov 04, 2005 at 04:13:29PM +0100, Martijn van Oosterhout wrote:
> It's not quite that bad. Obviously for all on-disk datatypes zeros are
> allowed. Bit strings, arrays, timestamps, numerics can all have
> embedded nulls, and they have a length header.

Are you and Tom conflicting in opinion? :-)

I read "the backend is by and large an ASCII, null-terminated-string
engine" with "we use UTF-8 [for varlena strings?]" as: a lot of the
code assumes varlena strings are '\0' terminated, and - an assumption
on my part - the varlena strings are not stored in the backend with a
'\0' terminator, therefore they require being copied out, terminated
with a '\0', before they can be used? Or perhaps I'm just confused. :-)

> > I'm thinking along the lines of the other threads that speak of
> > PostgreSQL being CPU or I/O bound, not disk bound, for many sorts of
> > operations. Is PostgreSQL unnecessarily copying string data around
> > (and other data, I would assume)?
>
> Well, there is a bit of copying around while creating tuples and such,
> but it's not to add null terminators.

How much effort (past discussions that I've missed from a decade ago?
hehe) has been put into determining whether a zero-copy architecture,
or really, a minimum-copy architecture, would address some of these
bottlenecks? Am I dreaming? :-)

Cheers, mark
mark@mark.mielke.cc writes:
> I read "the backend is by and large an ASCII, null-terminated-string
> engine" with "we use UTF-8 [for varlena strings?]" as: a lot of the
> code assumes varlena strings are '\0' terminated, and - an assumption
> on my part - the varlena strings are not stored in the backend with a
> '\0' terminator, therefore they require being copied out, terminated
> with a '\0', before they can be used?

There are places where we have to do that, the worst from a performance
viewpoint being in string comparison --- we have to null-terminate both
values before we can pass them to strcoll().

One of the large bits that would have to be done before we could even
contemplate using UCS2/UCS4 is getting rid of our dependence on strcoll,
since its API is null-terminated-string.

> How much effort (past discussions that I've missed from a decade ago?
> hehe) has been put into determining whether a zero-copy architecture,
> or really, a minimum-copy architecture, would address some of these
> bottlenecks? Am I dreaming? :-)

We've already done it in places; for instance, the new implementation
of "virtual tuples" in TupleTableSlots eliminates a lot of copying of
pass-by-reference values.

regards, tom lane
On Fri, Nov 04, 2005 at 01:54:04PM -0500, Tom Lane wrote:
> There are places where we have to do that, the worst from a performance
> viewpoint being in string comparison --- we have to null-terminate both
> values before we can pass them to strcoll().
>
> One of the large bits that would have to be done before we could even
> contemplate using UCS2/UCS4 is getting rid of our dependence on strcoll,
> since its API is null-terminated-string.

Yeah, and while one way of removing that dependence is to use ICU, that
library wants everything in UTF-16. So we replace "copying to add a NUL
to the string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
argument for UTF-16 is that if you're using a language that doesn't use
ASCII at all, UTF-8 gets inefficient pretty quickly.

Locale sensitive, efficient storage, fast comparisons: pick any two!

My guess is that in the long run there would be two basic string
datatypes: one a UTF-8, null-terminated string used in the backend code
as a standard C string, default collation strcmp; the other UTF-16, for
user data that wants to be able to collate in a locale-dependent way.

Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org>
On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Yeah, and while one way of removing that dependence is to use ICU, that
> library wants everything in UTF-16. So we replace "copying to add a NUL
> to the string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
> argument for UTF-16 is that if you're using a language that doesn't use
> ASCII at all, UTF-8 gets inefficient pretty quickly.

Is this really the case? Only Unicode values 000800 - 00FFFF are
smaller in UTF-16 than in UTF-8, and in their case it's three bytes vs
two. Cyrillic, Arabic, Greek, Latin, etc. are all two bytes in both.
So, yes, in some cases UTF-8 will use three bytes where UTF-16 would
use two, but that's less inefficient than UTF-16 is for ASCII, which
many people find acceptable.

> Locale sensitive, efficient storage, fast comparisons: pick any two!

I don't know that the choices are that limited. As I indicated earlier
in the thread, I think it's useful to think of all of these encodings
as just different compression algorithms. If our desire was to have all
three, the backend could be made null-safe and we could use the
locale-sensitive and fast representation (probably UTF-16 or UTF-32) in
memory, and store on disk whatever is most efficient for storage:
LZ-compressed UTF-whatever for fat fields, UTF-8 for mostly-ASCII small
fields, SCSU for non-ASCII short fields
(http://www.unicode.org/reports/tr6/), etc.

> My guess is that in the long run there would be two basic string
> datatypes: one a UTF-8, null-terminated string used in the backend code
> as a standard C string, default collation strcmp; the other UTF-16, for
> user data that wants to be able to collate in a locale-dependent way.

So if we need locale-dependent collation we suffer 2x inflation for
many texts, and multibyte complexity is still required if we are to
collate correctly when there are characters outside of the BMP. Yuck.

Disk storage type, memory storage type, user API type, and collation
should be decoupled.
On 11/4/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Yeah, and while one way of removing that dependence is to use ICU, that
> > library wants everything in UTF-16.
>
> Really? Can't it do UCS4 (UTF-32)? There's a nontrivial population
> of our users that isn't satisfied with UTF-16 anyway, so if that really
> is a restriction of ICU then we're going to have to look elsewhere :-(

The correct question to ask is something like "Does it support non-BMP
characters?" or "Does it really support UTF-16 or just UCS2?"

UTF-16 is (now) a variable-width encoding which is a strict superset of
UCS2 and which allows the representation of all Unicode characters.
UCS2 is fixed width and only supports characters from the basic
multilingual plane. UTF-32 and UCS4 are (now) effectively the same
thing and can represent all Unicode characters with a 4-byte
fixed-length word.

The code can demand UTF-16 but still be fine for non-BMP characters.
However, many things which claim to support UTF-16 really only support
UCS2, or at least have bugs in their handling of non-BMP characters.
Software that supports UTF-8 is somewhat more likely to support non-BMP
characters correctly, since the variable-length code paths get more of
a workout in many environments. :)
Martijn van Oosterhout <kleptog@svana.org> writes:
> Yeah, and while one way of removing that dependence is to use ICU, that
> library wants everything in UTF-16.

Really? Can't it do UCS4 (UTF-32)? There's a nontrivial population of
our users that isn't satisfied with UTF-16 anyway, so if that really is
a restriction of ICU then we're going to have to look elsewhere :-(

regards, tom lane
On Thu, Nov 03, 2005 at 04:07:41PM +0100, Marcus Engene wrote:
> Simon Riggs wrote:
> > On Thu, 2005-11-03 at 11:13 -0300, Alvaro Herrera wrote:
> > > Simon Riggs wrote:
> > > > On PostgreSQL, CHAR(12) is a bpchar datatype with all instantiations of
> > > > that datatype having a 4 byte varlena header. In this example, all of
> > > > those instantiations have the varlena header set to 12, so essentially
> > > > wasting the 4 byte header.
> > >
> > > We need the length word because the actual size in bytes is variable,
> > > due to multibyte encoding considerations.
> >
> > Succinctly put, thanks.
> >
> > Incidentally, you remind me that other databases do *not* vary the
> > character length, even if they do have varying length UTF-8 within them.
> > So if you define CHAR(255) then it could blow up at a random length if
> > you store UTF-8 within it.
> >
> > That's behaviour that I could never sanction, so I'll leave this now.
> >
> > Best Regards, Simon Riggs
>
> Just as a side note, in Oracle you can use the syntax (f.ex. on a db
> with utf-8 charset):
>
>     column VARCHAR2(10 CHAR)
>
> ...to indicate that Oracle should fit 10 characters there. It might use
> up to 40 bytes in the db, but that's up to Oracle. If I s/10 CHAR/10/,
> at most 10 characters will fit.
>
> This works very well. The only catch is that it's not good to use more
> than 1000 chars since Oracle's varchars don't want to go past 4000 bytes.

Likewise other databases use different character types such as NCHAR
(nationalized char), which is the 16-bit variant. I think it's
perfectly acceptable to have a char type that is fixed-width in terms
of number of bytes, so long as we provide an alternative. Heck, in my
experience char is only used to store things like hashes that are in
ASCII anyway.

-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Thu, Nov 03, 2005 at 10:32:03AM -0500, Tom Lane wrote:
> I'd feel a lot happier about this if we could keep the dynamic range
> up to, say, 10^512 so that it's still true that NUMERIC can be a
> universal parse-time representation. That would also make it even
> more unlikely that anyone would complain about loss of functionality.

Would it be feasible to have a type that satisfies that constraint but
isn't generally intended for on-disk use? My thought is that this new
type would be used mostly for casting purposes. Kind of like the
UNKNOWNNUMERIC but easier to do since it'd just be another type.

(BTW, I'm not suggesting that we disallow on-disk storage of the type,
only discourage it unless someone really, really needs an absurd number
of digits.)
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Would it be feasible to have a type that satisfies that constraint but
> isn't generally intended for on-disk use? My thought is that this new
> type would be used mostly for casting purposes. Kind of like the
> UNKNOWNNUMERIC but easier to do since it'd just be another type.

What exactly would be easier about it? ISTM you just described
UNKNOWNNUMERIC to a T.

regards, tom lane
On Fri, Nov 04, 2005 at 04:30:27PM -0500, Tom Lane wrote:
> What exactly would be easier about it? ISTM you just described
> UNKNOWNNUMERIC to a T.

Apologies then; it sounded like UNKNOWNNUMERIC was going to be
something that was internal-use only, presumably making it much harder
to implement than just adding an additional type.
On Fri, Nov 04, 2005 at 02:58:05PM -0500, Gregory Maxwell wrote:
> The correct question to ask is something like "Does it support non-BMP
> characters?" or "Does it really support UTF-16 or just UCS2?"

It's all on their website:

: How is a Unicode string represented in ICU?
:
: A Unicode string is currently represented as UTF-16 by default. The
: endianness of UTF-16 is platform dependent. You can guarantee the
: endianness of UTF-16 by using a converter. UTF-16 strings can be
: converted to other Unicode forms by using a converter or with the UTF
: conversion macros.
:
: ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
: support surrogates, and UTF-16 does support surrogates. This means
: that UCS-2 only supports UTF-16's Basic Multilingual Plane (BMP). The
: notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
: its default encoding to UTF-16.

<snip>

: What is the performance difference between UTF-8 and UTF-16?
:
: Most of the time, the memory throughput of the hard drive and RAM is
: the main performance constraint. UTF-8 is 50% smaller than UTF-16 for
: US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South
: Asian scripts. There is no memory difference for Latin extensions,
: Greek, Cyrillic, Hebrew, and Arabic.

<snip>

http://icu.sourceforge.net/userguide/icufaq.html

: Using UTF-8 strings with ICU
:
: As mentioned in the overview of this chapter, ICU and most other
: Unicode-supporting software uses 16-bit Unicode for internal
: processing. However, there are circumstances where UTF-8 is used
: instead. This is usually the case for software that does little or no
: processing of non-ASCII characters, and/or for APIs that predate
: Unicode, use byte-based strings, and cannot be changed or replaced
: for various reasons.

<snip>

: While ICU does not natively use UTF-8 strings, there are many ways to
: work with UTF-8 strings and ICU. The following list is probably
: incomplete.

http://icu.sourceforge.net/userguide/strings.html#strings

Basically you use a "converter" to process the UTF-8 strings,
presumably converting them to UTF-16 (which is not UCS-2, as noted
above). UTF-32 needs a converter also, so no point using that either.

> The code can demand UTF-16 but still be fine for non-BMP characters.
> However, many things which claim to support UTF-16 really only support
> UCS2 or at least have bugs in their handling of non-BMP characters.
> Software that supports UTF-8 is somewhat more likely to support
> non-BMP characters correctly since the variable length code paths get
> more of a workout in many environments. :)

I think ICU deals with that, but feel free to peruse the website
yourself...

Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org>
On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
[snip]
> : ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
> : support surrogates, and UTF-16 does support surrogates. This means
> : that UCS-2 only supports UTF-16's Basic Multilingual Plane (BMP). The
> : notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
> : its default encoding to UTF-16.
> <snip>

This means it's fine: ICU's use of UTF-16 will not break our support
for all of Unicode. Conversion to and from UTF-16 isn't cheap, however,
if you're doing it all the time. Storing ASCII in UTF-16 is pretty
lame, and widespread use of UTF-16 tends to hide bugs in the handling
of non-BMP characters. ... I would be somewhat surprised to see a
substantial performance difference in working with UTF-16 data over
UTF-8, but then again ... they'd know and I wouldn't.

Another lame aspect of using Unicode encodings other than UTF-8
internally is that it's harder to figure out what is text in GDB output
and such, which can make debugging more difficult.
On Fri, Nov 04, 2005 at 07:15:22PM -0500, Gregory Maxwell wrote:
> Another lame aspect of using Unicode encodings other than UTF-8
> internally is that it's harder to figure out what is text in GDB
> output and such, which can make debugging more difficult.

Yeah, that's one of the reasons I think UTF-16 should be restricted to
user data only. There's no particular reason why system tables and
indexes need to be in a locale-sensitive collate order. If someone
really wants that they can pay for the conversion costs at query time.

The problem is that PostgreSQL doesn't support multiple collate orders;
that's being worked on though.

Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org>
In article <e692861c0511041149n6fe36345oba7c43d1d48bef3d@mail.gmail.com>,
Gregory Maxwell <gmaxwell@gmail.com> writes:
> On 11/4/05, Martijn van Oosterhout <kleptog@svana.org> wrote:
> > Yeah, and while one way of removing that dependence is to use ICU, that
> > library wants everything in UTF-16. So we replace "copying to add a NUL
> > to the string" with "converting UTF-8 to UTF-16 on each call". Ugh! The
> > argument for UTF-16 is that if you're using a language that doesn't use
> > ASCII at all, UTF-8 gets inefficient pretty quickly.

> Is this really the case? Only Unicode values 000800 - 00FFFF are
> smaller in UTF-16 than in UTF-8, and in their case it's three bytes vs
> two. Cyrillic, Arabic, Greek, Latin, etc. are all two bytes in both.

IMHO the best encoding for "Cyrillic, Arabic, Greek, Latin, etc." is
ISO-8859-*: just one byte. You need UTF-* only when you want to have
more than one of them in the same column.
On Thu, 2005-11-03 at 10:32 -0500, Tom Lane wrote:
> I'd feel a lot happier about this if we could keep the dynamic range
> up to, say, 10^512 so that it's still true that NUMERIC can be a
> universal parse-time representation. That would also make it even
> more unlikely that anyone would complain about loss of functionality.
>
> To do that we'd need 8 bits for weight (-128..127 for a base-10K
> exponent is enough) but we need 9 bits for dscale which does not
> quite fit. I think we could make it go by cramming the sign and
> the high-order dscale bit into the first NumericDigit --- the
> digit itself can only be 0..9999 so there are a couple of bits
> to spare. This probably *would* slow down packing and unpacking of
> numerics, but just by a couple lines of C. Arguably the net reduction
> in I/O costs would justify that.

I've got a working version of the code using the above scheme, with
these additional wrinkles:

NaN is indicated by weight=-128, giving a dynamic range of 10^508.

Zeroes are fully compressed, except when the scale > 255. In that case,
the first digit is present to signify the presence of the high-order
scale bit.

Comments?

Once 8.1 is released, I'll go back and see if I can improve the coding
in a few days with fresh eyes, then submit a patch.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2005-11-03 at 10:32 -0500, Tom Lane wrote:
> > I think we could make it go by cramming the sign and
> > the high-order dscale bit into the first NumericDigit --- the
> > digit itself can only be 0..9999 so there are a couple of bits
> > to spare.

> I've got a working version of the code using the above scheme,

Really? After I woke up a bit more I realized there was only one bit
and change to spare, not two, so I don't see how it would work.
(Unless you want to force a minimum of two NumericDigits when the
dscale exceeds 255, and flag the scale in the second one --- which
might be OK but it's getting messier and messier...)

regards, tom lane
On Sun, 2005-11-06 at 11:26 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I've got a working version of the code using the above scheme,
>
> Really? After I woke up a bit more I realized there was only one bit
> and change to spare, not two, so I don't see how it would work.

Not sure why you think that. Seems to fit...

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Sun, 2005-11-06 at 11:26 -0500, Tom Lane wrote:
> > Really? After I woke up a bit more I realized there was only one bit
> > and change to spare, not two, so I don't see how it would work.

> Not sure why you think that. Seems to fit...

[ counts on fingers and toes too ... ] Obviously, I was more awake the
first time. Never mind :-(

regards, tom lane