Thread: OCTET_LENGTH is wrong
I noticed OCTET_LENGTH will return the size of the data after TOAST may have compressed it. While this could be useful information, this behaviour has no basis in the SQL standard and it's not what is documented. Moreover, it eliminates the standard useful behaviour of OCTET_LENGTH, which is to show the length in bytes of a multibyte string. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
> I noticed OCTET_LENGTH will return the size of the data after TOAST may
> have compressed it. While this could be useful information, this
> behaviour has no basis in the SQL standard and it's not what is
> documented. Moreover, it eliminates the standard useful behaviour of
> OCTET_LENGTH, which is to show the length in bytes of a multibyte string.

I wondered about that too, the first time I noticed it. On the other
hand, knowing the compressed length is kinda useful too, at least for
hacking and DBA purposes. (One might also like to know whether a value
has been moved out of line, which is not currently determinable.)

I don't want to force an initdb at this stage, at least not without
compelling reason, so adding more functions right now is not feasible.
Maybe a TODO item for next time.

That leaves us with the question whether to change OCTET_LENGTH now
or leave it for later. Anyone?

BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls. It really needs a short-circuit path for
single-byte encodings.

			regards, tom lane
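[Editor's note: the short-circuit Tom asks for amounts to skipping the per-character scan entirely when the server encoding's maximum character width is one byte. A minimal sketch in C, where `max_encoding_len` and `toy_mblen` are hypothetical stand-ins for the server's encoding metadata and pg_mblen(), not the actual backend API:]

```c
#include <assert.h>

/* Hypothetical stand-in for the server encoding's max character width:
 * 1 for single-byte encodings such as LATIN1, up to 4 for UTF-8. */
static int max_encoding_len = 1;

/* Toy UTF-8-style pg_mblen() substitute: bytes in the character
 * starting at s, judged from its first byte. */
static int toy_mblen(const char *s)
{
    unsigned char c = (unsigned char) *s;
    if (c < 0x80) return 1;
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    return 4;
}

/* Character count with the proposed short-circuit: for single-byte
 * encodings the character count simply equals the byte count, so the
 * expensive per-character loop is skipped. */
static int char_length(const char *s, int byte_len)
{
    int n = 0, i = 0;

    if (max_encoding_len == 1)
        return byte_len;        /* short-circuit: no scan needed */

    while (i < byte_len)
    {
        i += toy_mblen(s + i);
        n++;
    }
    return n;
}
```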
> Peter Eisentraut <peter_e@gmx.net> writes:
> > I noticed OCTET_LENGTH will return the size of the data after TOAST may
> > have compressed it. While this could be useful information, this
> > behaviour has no basis in the SQL standard and it's not what is
> > documented. Moreover, it eliminates the standard useful behaviour of
> > OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
>
> I wondered about that too, the first time I noticed it. On the other
> hand, knowing the compressed length is kinda useful too, at least for
> hacking and DBA purposes. (One might also like to know whether a value
> has been moved out of line, which is not currently determinable.)
>
> I don't want to force an initdb at this stage, at least not without
> compelling reason, so adding more functions right now is not feasible.
> Maybe a TODO item for next time.
>
> That leaves us with the question whether to change OCTET_LENGTH now
> or leave it for later. Anyone?

I am unconcerned about showing people the actual toasted length. Seems
we should get octet_length() computed on the un-TOASTED length, if we
can.

> BTW, I noticed that textlength() is absolutely unreasonably slow when
> MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> overhead, and soon discovered that what I was looking at was nothing
> but pg_mblen() calls. It really needs a short-circuit path for
> single-byte encodings.

Added to TODO:

	* Optimize textlength(), etc. for single-byte encodings

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Peter Eisentraut <peter_e@gmx.net> writes:
> > I noticed OCTET_LENGTH will return the size of the data after TOAST may
> > have compressed it. While this could be useful information, this
> > behaviour has no basis in the SQL standard and it's not what is
> > documented. Moreover, it eliminates the standard useful behaviour of
> > OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
>
> I wondered about that too, the first time I noticed it. On the other
> hand, knowing the compressed length is kinda useful too, at least for
> hacking and DBA purposes. (One might also like to know whether a value
> has been moved out of line, which is not currently determinable.)

It seems the behavior of OCTET_LENGTH varies according to the
corresponding data type:

	TEXT: returns the size of data AFTER TOAST
	VARCHAR and CHAR: returns the size of data BEFORE TOAST

I think we should fix at least these inconsistencies, but I am not sure
it's totally wrong that OCTET_LENGTH returns the length AFTER TOAST.
The SQL standard does not have any idea about TOAST, of course. Also, I
tend to agree with Tom's point about hackers and DBAs.

> I don't want to force an initdb at this stage, at least not without
> compelling reason, so adding more functions right now is not feasible.
> Maybe a TODO item for next time.
>
> That leaves us with the question whether to change OCTET_LENGTH now
> or leave it for later. Anyone?

My opinion is to leave it for 7.3, along with the idea of adding new
functions.

> BTW, I noticed that textlength() is absolutely unreasonably slow when
> MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> overhead, and soon discovered that what I was looking at was nothing
> but pg_mblen() calls. It really needs a short-circuit path for
> single-byte encodings.

It's easy to optimize that. However I cannot access CVS anymore after
the IP address change. Will post patches later...
--
Tatsuo Ishii
Peter Eisentraut <peter_e@gmx.net> writes:
> ... Moreover, it eliminates the standard useful behaviour of
> OCTET_LENGTH, which is to show the length in bytes of a multibyte string.

While I don't necessarily dispute this, I do kinda wonder where you
derive the statement. AFAICS, SQL92 defines OCTET_LENGTH in terms
of BIT_LENGTH:

6.6 General Rule 5:

         a) Let S be the <string value expression>. If the value of S is
            not the null value, then the result is the smallest integer
            not less than the quotient of the division (BIT_LENGTH(S)/8).

         b) Otherwise, the result is the null value.

and BIT_LENGTH is defined in the next GR:

         a) Let S be the <string value expression>. If the value of S is
            not the null value, then the result is the number of bits in
            the value of S.

         b) Otherwise, the result is the null value.

While SQL92 is pretty clear about <bit string>, I'm damned if I can see
anywhere that they define how many bits are in a character string value.
So who's to say what representation is to be used to count the bits?
If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
shouldn't a compressed representation be reasonable too?

			regards, tom lane
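[Editor's note: mechanically, General Rule 5 is just a ceiling division. A one-line sketch of the only part the spec actually pins down:]

```c
#include <assert.h>

/* SQL92 6.6 GR 5: OCTET_LENGTH(S) is the smallest integer not less
 * than BIT_LENGTH(S)/8 -- i.e. ceil(bits/8), done in integer math. */
static int octet_length_from_bits(int bit_length)
{
    return (bit_length + 7) / 8;
}
```

The open question in the thread is entirely about what `bit_length` should be, not about this arithmetic.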
> > BTW, I noticed that textlength() is absolutely unreasonably slow when
> > MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> > overhead, and soon discovered that what I was looking at was nothing
> > but pg_mblen() calls. It really needs a short-circuit path for
> > single-byte encodings.
>
> It's easy to optimize that. However I cannot access CVS anymore after
> the IP address change. Will post patches later...

Seems I got the cvs access again (I was asked my pass phrase again)
and I have committed changes for this.

Modified functions are:

	bpcharlen
	textlen
	varcharlen
--
Tatsuo Ishii
> > > BTW, I noticed that textlength() is absolutely unreasonably slow when
> > > MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> > > overhead, and soon discovered that what I was looking at was nothing
> > > but pg_mblen() calls. It really needs a short-circuit path for
> > > single-byte encodings.
> >
> > It's easy to optimize that. However I cannot access CVS anymore after
> > the IP address change. Will post patches later...
>
> Seems I got the cvs access again (I was asked my pass phrase again)
> and I have committed changes for this.
>
> Modified functions are:
>
> 	bpcharlen
> 	textlen
> 	varcharlen

Did you go with the pre or post-TOAST length for these types? I vote
for pre-TOAST because it seems much more useful to ordinary users.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > > BTW, I noticed that textlength() is absolutely unreasonably slow when
> > > MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> > > overhead, and soon discovered that what I was looking at was nothing
> > > but pg_mblen() calls. It really needs a short-circuit path for
> > > single-byte encodings.
> >
> > It's easy to optimize that. However I cannot access CVS anymore after
> > the IP address change. Will post patches later...
>
> Seems I got the cvs access again (I was asked my pass phrase again)
> and I have committed changes for this.
>
> Modified functions are:
>
> 	bpcharlen
> 	textlen
> 	varcharlen

OK, sorry, I see you did the optimization, not changed the length
function.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane writes: > a) Let S be the <string value expression>. If the value of S is > not the null value, then the result is the number of bits in > the value of S. > b) Otherwise, the result is the null value. > > While SQL92 is pretty clear about <bit string>, I'm damned if I can see > anywhere that they define how many bits are in a character string value. > So who's to say what representation is to be used to count the bits? > If, say, UTF-16 and UTF-8 are equally reasonable choices, then why > shouldn't a compressed representation be reasonable too? I think "the value of S" implies "the user-accessible representation of the value of S", in the sense, "How much memory do I need to allocate to store this value". Furthermore, the size of the TOAST representation that is returned now is just one particular of several intermediate representations. For instance, it does not include the VARHDRSZ and it does not include the size of the tuple headers when it's stored externally. Thus, this size is heavily skewed toward low numbers and doesn't tell you much about either the disk end or the user's end. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes: > I think "the value of S" implies "the user-accessible representation of > the value of S", in the sense, "How much memory do I need to allocate to > store this value". If I take that argument seriously, I have to conclude that OCTET_LENGTH should return the string length measured in the current client encoding (which may have little to do with its size in the server, if the server's encoding is different). If the client actually retrieves the string then that's how much memory he'll need. I presume that where you want to come out is OCTET_LENGTH = uncompressed length in the server's encoding ... but so far no one has really made a convincing argument why that answer is better or more spec-compliant than any other answer. In particular, it's not obvious to me why "number of bytes we're actually using on disk" is wrong. regards, tom lane
> I think "the value of S" implies "the user-accessible representation of
> the value of S", in the sense, "How much memory do I need to allocate to
> store this value".
>
> Furthermore, the size of the TOAST representation that is returned now is
> just one particular of several intermediate representations. For
> instance, it does not include the VARHDRSZ and it does not include the
> size of the tuple headers when it's stored externally. Thus, this size is
> heavily skewed toward low numbers and doesn't tell you much about either
> the disk end or the user's end.

Yes, good arguments. If we want to implement storage_length at some
later time, I think the compressed length may be appropriate, but for
general use, I think we need to return the uncompressed length,
especially considering that multibyte makes the ordinary length() return
the number of characters, so users need a way to get the byte length.

Attached is a patch that makes text return the same value as char()
and varchar() already do. As Tatsuo pointed out, they were
inconsistent. All the other octet_length() functions look fine, so it
was only text that had this problem.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
retrieving revision 1.74
diff -c -r1.74 varlena.c
*** src/backend/utils/adt/varlena.c	2001/10/25 05:49:46	1.74
--- src/backend/utils/adt/varlena.c	2001/11/18 19:11:52
***************
*** 273,284 ****
  Datum
  textoctetlen(PG_FUNCTION_ARGS)
  {
! 	struct varattrib *t = (struct varattrib *) PG_GETARG_RAW_VARLENA_P(0);
  
! 	if (!VARATT_IS_EXTERNAL(t))
! 		PG_RETURN_INT32(VARATT_SIZE(t) - VARHDRSZ);
! 
! 	PG_RETURN_INT32(t->va_content.va_external.va_extsize);
  }
  
  /*
--- 273,281 ----
  Datum
  textoctetlen(PG_FUNCTION_ARGS)
  {
! 	text	   *arg = PG_GETARG_VARCHAR_P(0);
  
! 	PG_RETURN_INT32(VARSIZE(arg) - VARHDRSZ);
  }
  
  /*
On Sun, 18 Nov 2001, Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > I think "the value of S" implies "the user-accessible representation of > > the value of S", in the sense, "How much memory do I need to allocate to > > store this value". > > If I take that argument seriously, I have to conclude that OCTET_LENGTH > should return the string length measured in the current client encoding > (which may have little to do with its size in the server, if the > server's encoding is different). If the client actually retrieves the > string then that's how much memory he'll need. > > I presume that where you want to come out is OCTET_LENGTH = uncompressed > length in the server's encoding ... but so far no one has really made > a convincing argument why that answer is better or more spec-compliant > than any other answer. In particular, it's not obvious to me why > "number of bytes we're actually using on disk" is wrong. I'm not sure, but if we say that the on disk representation is the value of the character value expression whose size is being checked, wouldn't that be inconsistent with the other uses of the character value expression in places like substr where we don't use the on disk representation? Unless you're saying that the string value expression that is that character value expression is the compressed one and the character value expression is the uncompressed one.
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > On Sun, 18 Nov 2001, Tom Lane wrote: >> I presume that where you want to come out is OCTET_LENGTH = uncompressed >> length in the server's encoding ... but so far no one has really made >> a convincing argument why that answer is better or more spec-compliant >> than any other answer. In particular, it's not obvious to me why >> "number of bytes we're actually using on disk" is wrong. > I'm not sure, but if we say that the on disk representation is the > value of the character value expression whose size is being checked, > wouldn't that be inconsistent with the other uses of the character value Yeah, it would be and is. In fact, the present code has some interesting behaviors: if foo.x is a text value long enough to be toasted, then you get different results from SELECT OCTET_LENGTH(x) FROM foo; SELECT OCTET_LENGTH(x || '') FROM foo; since the result of the concatenation expression won't be compressed. I'm not actually here to defend the existing code; in fact I believe the XXX comment on textoctetlen questioning its correctness is mine. What I am trying to point out is that the spec is so vague that it's not clear what the correct answer is. regards, tom lane
> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> > On Sun, 18 Nov 2001, Tom Lane wrote:
> >> I presume that where you want to come out is OCTET_LENGTH = uncompressed
> >> length in the server's encoding ... but so far no one has really made
> >> a convincing argument why that answer is better or more spec-compliant
> >> than any other answer. In particular, it's not obvious to me why
> >> "number of bytes we're actually using on disk" is wrong.
>
> > I'm not sure, but if we say that the on disk representation is the
> > value of the character value expression whose size is being checked,
> > wouldn't that be inconsistent with the other uses of the character value
>
> Yeah, it would be and is. In fact, the present code has some
> interesting behaviors: if foo.x is a text value long enough to be
> toasted, then you get different results from
>
> 	SELECT OCTET_LENGTH(x) FROM foo;
>
> 	SELECT OCTET_LENGTH(x || '') FROM foo;
>
> since the result of the concatenation expression won't be compressed.
>
> I'm not actually here to defend the existing code; in fact I believe the
> XXX comment on textoctetlen questioning its correctness is mine. What
> I am trying to point out is that the spec is so vague that it's not
> clear what the correct answer is.

Well, if the standard is unclear, we should return the most reasonable
answer, which has to be the non-compressed length.

In multibyte encodings, when we started returning length() in
_characters_ instead of bytes, I assumed the major use for octet_length
was to return the number of bytes needed to hold the value on the
client side.

In single-byte encodings, octet_length is the same as length(), so
returning a compressed length may make sense, but I don't think we want
different meanings for the function for single and multi-byte encodings.

I guess the issue is that for single-byte encodings, octet_length is
pretty useless because it is the same as length, but for multi-byte
encodings, octet_length is invaluable and almost has to return
non-compress bytes because uncompressed is that the client sees.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > octet_length is invaluable and almost has to return > non-compress bytes because uncompressed is that the client sees. What about encoding? regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > octet_length is invaluable and almost has to return
> > non-compress bytes because uncompressed is that the client sees.
                                               ^^^^
                                               what

> What about encoding?

Single-byte encodings have the same character and byte lengths. Only
multi-byte encodings are different, right?

In thinking about it, I think the function is called octet_length() to
emphasize it returns the length in octets (bytes) rather than the length
in characters.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > What about encoding?
>
> Single-byte encodings have the same character and byte lengths. Only
> multi-byte encodings are different, right?
>
> In thinking about it, I think the function is called octet_length()
> to emphasize it returns the length in octets (bytes) rather than the
> length in characters.

I think Tom's point is whether octet_length() should regard the input
text as being encoded in the client-side encoding or not. My vote is
that octet_length() assumes the database encoding. If you need the
client-side encoded text length, you could do something like:

	select octet_length(convert('foo', pg_client_encoding()));

Note that there was a nasty bug in convert() which prevented the above
from working. I have committed fixes.
--
Tatsuo Ishii
Tom Lane writes: > What > I am trying to point out is that the spec is so vague that it's not > clear what the correct answer is. I guess the authors of SQL92 never imagined someone would question what "value of S" means. In SQL99 they included it: SQL 99 Part 1, 4.4.3.2. A value of character type is a string (sequence) of characters drawn from some character repertoire. Just to be sure... SQL 99 Part 1, 3.1 q) q) sequence: An ordered collection of objects that are not necessarily distinct. I don't have a set theory text available, but I think this should give a fair indication that the number of bits in the value of S is the sum of the bits in each individual character (which is in turn vaguely defined elsewhere in SQL99) -- at least in Euclidean memory architectures. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes: > I don't have a set theory text available, but I think this should give a > fair indication that the number of bits in the value of S is the sum of > the bits in each individual character (which is in turn vaguely defined > elsewhere in SQL99) -- at least in Euclidean memory architectures. But "how many bits in a character?" is exactly the question at this point. To be fair, I don't think our notion of on-the-fly encoding translation is envisioned anywhere in the SQL spec, so perhaps we shouldn't expect it to tell us which encoding to count the bits in. regards, tom lane
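[Editor's note: Tom's "how many bits in a character?" question is easy to make concrete: the very same code point needs a different number of bytes in two perfectly reasonable encodings. A sketch using the standard UTF-8 and UTF-16 size rules — this is general Unicode arithmetic, not PostgreSQL code:]

```c
#include <assert.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
static int utf8_bytes(unsigned int cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Bytes needed in UTF-16: 2 inside the BMP, 4 (a surrogate pair) above. */
static int utf16_bytes(unsigned int cp)
{
    return cp < 0x10000 ? 2 : 4;
}
```

Since neither count is more "correct" than the other, BIT_LENGTH of a character string is only well defined once an encoding is fixed — which is exactly the gap in the spec being discussed.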
Tom,

While the text datatypes have additional issues with encodings, that is
not true for the bytea type. I think it does make sense that a client
be able to get the size in bytes that the bytea type value will return
to the client. If you are storing files in a bytea column, getting the
file size by calling octet_length would be very useful.

thanks,
--Barry

Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> I think "the value of S" implies "the user-accessible representation of
>> the value of S", in the sense, "How much memory do I need to allocate to
>> store this value".
>
> If I take that argument seriously, I have to conclude that OCTET_LENGTH
> should return the string length measured in the current client encoding
> (which may have little to do with its size in the server, if the
> server's encoding is different). If the client actually retrieves the
> string then that's how much memory he'll need.
>
> I presume that where you want to come out is OCTET_LENGTH = uncompressed
> length in the server's encoding ... but so far no one has really made
> a convincing argument why that answer is better or more spec-compliant
> than any other answer. In particular, it's not obvious to me why
> "number of bytes we're actually using on disk" is wrong.
>
> 			regards, tom lane
OK, I have applied this patch so text octet_length returns the
non-compressed length of the data, to match octet_length of the other
types. I also removed the XXX comments added by Tom.

---------------------------------------------------------------------------

> > I think "the value of S" implies "the user-accessible representation of
> > the value of S", in the sense, "How much memory do I need to allocate to
> > store this value".
> >
> > Furthermore, the size of the TOAST representation that is returned now is
> > just one particular of several intermediate representations. For
> > instance, it does not include the VARHDRSZ and it does not include the
> > size of the tuple headers when it's stored externally. Thus, this size is
> > heavily skewed toward low numbers and doesn't tell you much about either
> > the disk end or the user's end.
>
> Yes, good arguments. If we want to implement storage_length at some
> later time, I think the compressed length may be appropriate, but for
> general use, I think we need to return the uncompressed length,
> especially considering that multibyte makes the ordinary length() return
> the number of characters, so users need a way to get the byte length.
>
> Attached is a patch that makes text return the same value as char()
> and varchar() already do. As Tatsuo pointed out, they were
> inconsistent. All the other octet_length() functions look fine, so it
> was only text that had this problem.
>
> Index: src/backend/utils/adt/varlena.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
> retrieving revision 1.74
> diff -c -r1.74 varlena.c
> *** src/backend/utils/adt/varlena.c	2001/10/25 05:49:46	1.74
> --- src/backend/utils/adt/varlena.c	2001/11/18 19:11:52
> ***************
> *** 273,284 ****
>   Datum
>   textoctetlen(PG_FUNCTION_ARGS)
>   {
> ! 	struct varattrib *t = (struct varattrib *) PG_GETARG_RAW_VARLENA_P(0);
>  
> ! 	if (!VARATT_IS_EXTERNAL(t))
> ! 		PG_RETURN_INT32(VARATT_SIZE(t) - VARHDRSZ);
> ! 
> ! 	PG_RETURN_INT32(t->va_content.va_external.va_extsize);
>   }
>  
>   /*
> --- 273,281 ----
>   Datum
>   textoctetlen(PG_FUNCTION_ARGS)
>   {
> ! 	text	   *arg = PG_GETARG_VARCHAR_P(0);
>  
> ! 	PG_RETURN_INT32(VARSIZE(arg) - VARHDRSZ);
>   }
>  
>   /*

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Barry Lind <barry@xythos.com> writes: > While the text datatypes have additional issues with encodings, that is > not true for the bytea type. I think it does make sense that a client > be able to get the size in bytes that the bytea type value will return > to the client. bytea does that already. It's only text that has (or had, till a few minutes ago) the funny behavior. I'm not set on the notion that octet_length should return on-disk size; that's clearly not what's contemplated by SQL92, so I'm happy to agree that if we want that we should add a new function to get it. ("storage_length", maybe.) What's bothering me right now is the difference between client and server encodings. It seems that the only plausible use for octet_length is to do memory allocation on the client side, and for that purpose the length ought to be measured in the client encoding. People seem to be happy with letting octet_length take the easy way out (measure in the server encoding), and I'm trying to get someone to explain to me why that's the right behavior. I don't see it. regards, tom lane
Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> ... Moreover, it eliminates the standard useful behaviour of
>> OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
>
> While I don't necessarily dispute this, I do kinda wonder where you
> derive the statement. AFAICS, SQL92 defines OCTET_LENGTH in terms
> of BIT_LENGTH:
>
> 6.6 General Rule 5:
>
>          a) Let S be the <string value expression>. If the value of S is
>             not the null value, then the result is the smallest integer
>             not less than the quotient of the division (BIT_LENGTH(S)/8).
>          b) Otherwise, the result is the null value.
>
> and BIT_LENGTH is defined in the next GR:
>
>          a) Let S be the <string value expression>. If the value of S is
>             not the null value, then the result is the number of bits in
>             the value of S.
>          b) Otherwise, the result is the null value.
>
> While SQL92 is pretty clear about <bit string>, I'm damned if I can see
> anywhere that they define how many bits are in a character string value.
> So who's to say what representation is to be used to count the bits?
> If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
> shouldn't a compressed representation be reasonable too?

One objection I have to this is that nobody uses the compressed
representation in client libraries, whereas they do use both UTF-16 and
UTF-8. At least UTF-8 is available as a client encoding.

It is also probably possible for the length of the "possibly compressed"
representation to change without the underlying data changing (for
example, when you set a bit somewhere that disables compression and
UPDATE some other field in the tuple), making the result of OCTET_LENGTH
dependent on things other than the argument string.

I also like the property of _uncompressed_ OCTET_LENGTH that

	OCTET_LENGTH(s||s) == 2 * OCTET_LENGTH(s)

which is almost never true for the compressed length.

----------------
Hannu
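[Editor's note: Hannu's additivity property is easy to check against any compressor — raw byte length is additive under concatenation, compressed length generally is not. A toy run-length sketch; TOAST actually uses an LZ-family scheme, RLE is used here only because it makes the point cheaply:]

```c
#include <assert.h>
#include <string.h>

/* Toy "compressed length": two bytes (char, count) per run of equal
 * characters. Purely illustrative -- not what TOAST does. */
static int rle_len(const char *s)
{
    int len = 0;

    while (*s)
    {
        char c = *s;
        while (*s == c)     /* skip to the end of this run */
            s++;
        len += 2;           /* one byte for the char, one for the count */
    }
    return len;
}
```

Doubling a string of a single repeated character doubles its uncompressed octet length but leaves this "compressed length" unchanged, so an OCTET_LENGTH based on compressed size cannot satisfy the identity above.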
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> ! text *arg = PG_GETARG_VARCHAR_P(0); Er, shouldn't that be PG_GETARG_TEXT_P? regards, tom lane
> What's bothering me right now is the > difference between client and server encodings. It seems that the only > plausible use for octet_length is to do memory allocation on the client > side, and for that purpose the length ought to be measured in the client > encoding. People seem to be happy with letting octet_length take the > easy way out (measure in the server encoding), and I'm trying to get > someone to explain to me why that's the right behavior. I > don't see it. I agree. octet_length should be the number of bytes the client gets when he does "select textfield from atable". Andreas
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> ! 	text	   *arg = PG_GETARG_VARCHAR_P(0);
>
> Er, shouldn't that be PG_GETARG_TEXT_P?

Sorry, fixed. Cut/paste error.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Summary:

There have been three ideas of what octet_length() should return:

	1) compressed on-disk storage length
	2) byte length in server-side encoding
	3) byte length in client-side encoding

7.3 will do #2 for all data types. The text type did not do #2 in
7.1.X, and it appears that is the only release where octet_length(text)
returned #1. This is the patch that made octet_length(text) return #1
in 7.1.X:

	Revision 1.62, Wed Jul 5 23:11:35 2000 UTC, by tgl
	Changes since 1.61: +12 -20 lines

	Update textin() and textout() to new fmgr style. This is just
	phase one of updating the whole text datatype, but there are so
	dang many calls of these two routines that it seems worth a
	separate commit.

The open question is whether we should be doing #3. If you want to use
octet_length to allocate space on the client side, #3 is really the
proper value, as Tom has argued. Tatsuo is happy with #2.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> It seems the behavior of OCTET_LENGTH varies according to the
> corresponding data type:
>
> 	TEXT: returns the size of data AFTER TOAST
> 	VARCHAR and CHAR: returns the size of data BEFORE TOAST

Fixed in CVS. TEXT is now like CHAR/VARCHAR.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
> Barry Lind <barry@xythos.com> writes:
>> While the text datatypes have additional issues with encodings, that is
>> not true for the bytea type. I think it does make sense that a client
>> be able to get the size in bytes that the bytea type value will return
>> to the client.
>
> bytea does that already. It's only text that has (or had, till a few
> minutes ago) the funny behavior.
>
> I'm not set on the notion that octet_length should return on-disk size;
> that's clearly not what's contemplated by SQL92, so I'm happy to agree
> that if we want that we should add a new function to get it.
> ("storage_length", maybe.) What's bothering me right now is the
> difference between client and server encodings. It seems that the only
> plausible use for octet_length is to do memory allocation on the client
> side,

Allocating memory seems to me to be the driver's (libpq, JDBC, ODBC, ...)
problem and not something to be done by client code beforehand; at least
for libpq (AFAIK) we don't have any means of giving it a pre-allocated
storage area for one field. There is enough information in the wire
protocol for allocating right-sized chunks at the time the query result
is read. An additional call of "SELECT OCTET_LENGTH(someCol)" seems
orders of magnitude slower than doing it at the right time in the driver.

> and for that purpose the length ought to be measured in the client
> encoding. People seem to be happy with letting octet_length take the
> easy way out (measure in the server encoding), and I'm trying to get
> someone to explain to me why that's the right behavior. I don't see it.

Perhaps we need another function "OCTET_LENGTH(someCol, encoding)" for
getting what we want, and also client_encoding() and server_encoding()
functions for supplying it some universal defaults?
OTOH, from reading on Unicode I've come to the conclusion that there are
often several ways of expressing the same string in Unicode, so for a
non-Unicode server encoding and a client requesting Unicode (say UTF-8)
there can be several different ways to express the same string. Thus
there is no single absolute OCTET_LENGTH in the client encoding for all
cases, and giving the actual uncompressed length seems most reasonable.

For Unicode both in backend and frontend we could also make OCTET_LENGTH
return not an int but an integer interval of the shortest and longest
possible encoding ;)

------------------
Hannu
On Mon, Nov 19, 2001 at 02:34:56PM -0500, Bruce Momjian wrote:
> Summary:
>
> There have been three ideas of what octet_length() should return:
>
> 1) compressed on-disk storage length
> 2) byte length in server-side encoding
> 3) byte length in client-side encoding

A possibility of choice would be very nice... Why not add everything:

	octet_length_storage()
	octet_length_server()
	octet_length_client()

and leave the problem of the right choice to the user? The standard
octet_length() could then be made an alias for 1), 2), or 3), depending
on the result of this discussion.

> The open question is whether we should be doing #3. If you want to use
> octet_length to allocate space on the client side, #3 is really the

If Tom needs to be sure, he can use octet_length_client().

> proper value, as Tom has argued. Tatsuo is happy with #2.

...and Tatsuo can use octet_length_server(). The important thing is that
both will still be happy :-)

Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> > There have been three ideas of what octet_length() should return:
> >
> > 1) compressed on-disk storage length
> > 2) byte length in server-side encoding
> > 3) byte length in client-side encoding
>
> A possibility of choice would be very nice... Why not add everything:
>
> 	octet_length_storage()
> 	octet_length_server()
> 	octet_length_client()

We only need one of octet_length_server() or octet_length_client().
We could emulate the rest using convert() etc.
--
Tatsuo Ishii
> > There have been three ideas of what octet_length() should return:
> >  1) compressed on-disk storage length
> >  2) byte length in server-side encoding
> >  3) byte length in client-side encoding
> ...
> > The open question is whether we should be doing #3.

There is no question in my mind that (3) must be the result of
octet_length(). Any of the other options may give an interesting result,
but one of no practical use to a client trying to retrieve data. And
everything is a client!

                     - Thomas
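The three candidate meanings only diverge when compression or a multibyte
encoding is in play. A rough sketch in Python, using zlib merely as a
stand-in for TOAST's compressor (pg_lzcompress behaves differently) and
assuming a hypothetical UTF-8 server encoding with an EUC_JP client:

```python
import zlib

s = "日本語テキスト"                 # seven characters of multibyte text

server_bytes = s.encode("utf-8")    # 2) byte length in the server encoding
client_bytes = s.encode("euc_jp")   # 3) byte length in the client encoding
stored_bytes = zlib.compress(server_bytes)  # 1) compressed storage length
                                            #    (stand-in for pg_lzcompress)

print(len(server_bytes))  # 21 -- three bytes per character in UTF-8
print(len(client_bytes))  # 14 -- two bytes per character in EUC_JP
# len(stored_bytes) varies; short strings can even grow under compression
```

The same column value thus yields three different "octet lengths"
depending on which representation you measure, which is the whole
disagreement in this thread.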
> > > There have been three ideas of what octet_length() should return:
> > >  1) compressed on-disk storage length
> > >  2) byte length in server-side encoding
> > >  3) byte length in client-side encoding
> ...
> > > The open question is whether we should be doing #3.
>
> There is no question in my mind that (3) must be the result of
> octet_length(). Any of the other options may give an interesting result,
> but of no practical use to a client trying to retrieve data. And
> everything is a client!

Added to TODO:

	* Add octet_length_server() and octet_length_client() (Thomas, Tatsuo)

-- Bruce Momjian
> > > There have been three ideas of what octet_length() should return:
> > >  1) compressed on-disk storage length
> > >  2) byte length in server-side encoding
> > >  3) byte length in client-side encoding
> ...
> > > The open question is whether we should be doing #3.
>
> There is no question in my mind that (3) must be the result of
> octet_length(). Any of the other options may give an interesting result,
> but of no practical use to a client trying to retrieve data. And
> everything is a client!

Also added to TODO:

	* Make octet_length_client() the same as octet_length()

-- Bruce Momjian
Thomas Lockhart wrote:
>>> There have been three ideas of what octet_length() should return:
>>>  1) compressed on-disk storage length
>>>  2) byte length in server-side encoding
>>>  3) byte length in client-side encoding
> ...
>>> The open question is whether we should be doing #3.
>
> There is no question in my mind that (3) must be the result of
> octet_length(). Any of the other options may give an interesting result,
> but of no practical use to a client trying to retrieve data.

What practical use does #3 give ;) Do you really envision a program that
does two separate queries to retrieve some string -- first to query its
storage length and then to actually read it -- instead of just reading
it? I don't think we even have an interface in any of our libs where we
can give a pre-allocated buffer to a client library to fill in. Or do we?

> And everything is a client!

So in a PL/pgSQL function doing some data manipulation through SPI, who
is the "client" -- the server, the client, or some third party?

---------------
Hannu
Tatsuo Ishii wrote:
>>> There have been three ideas of what octet_length() should return:
>>>
>>>  1) compressed on-disk storage length
>>>  2) byte length in server-side encoding
>>>  3) byte length in client-side encoding
>>
>> A possibility of choice would be very nice... Why not add everything:
>>
>> 	octet_length_storage()
>> 	octet_length_server()
>> 	octet_length_client()
>
> We only need one of octet_length_server() or octet_length_client().

And I guess that octet_length_server() is cheaper, as it does not do a
convert() when not needed.

> We could emulate the rest using convert() etc.
> --
> Tatsuo Ishii
Bruce Momjian wrote:
>>>> There have been three ideas of what octet_length() should return:
>>>>  1) compressed on-disk storage length
>>>>  2) byte length in server-side encoding
>>>>  3) byte length in client-side encoding
>> ...
>>>> The open question is whether we should be doing #3.
>>
>> There is no question in my mind that (3) must be the result of
>> octet_length(). Any of the other options may give an interesting result,
>> but of no practical use to a client trying to retrieve data. And
>> everything is a client!
>
> Also added to TODO:
>
> 	* Make octet_length_client() the same as octet_length()

Will this break backward compatibility?

-------------
Hannu
> >>>> There have been three ideas of what octet_length() should return:
> >>>>  1) compressed on-disk storage length
> >>>>  2) byte length in server-side encoding
> >>>>  3) byte length in client-side encoding
> >> ...
> >>>> The open question is whether we should be doing #3.
> >>
> >> There is no question in my mind that (3) must be the result of
> >> octet_length(). Any of the other options may give an interesting
> >> result, but of no practical use to a client trying to retrieve data.
> >> And everything is a client!
> >
> > Also added to TODO:
> >
> > 	* Make octet_length_client() the same as octet_length()
>
> Will this break backward compatibility?

Well, sort of. 7.1 had text returning the compressed length. We changed
that to the server-side encoding length in 7.2. Changing that to the
client encoding will break clients, but what meaningful thing could they
do with the server-side encoding?

-- Bruce Momjian
> Bruce Momjian writes:
>
> > Also added to TODO:
> >
> > 	* Make octet_length_client() the same as octet_length()
>
> Have we decided on that one yet?

Uh, Thomas said he was certain about it. I will add a question mark to
the TODO item.

-- Bruce Momjian
Bruce Momjian writes:

> Also added to TODO:
>
> 	* Make octet_length_client() the same as octet_length()

Have we decided on that one yet?

--
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> What's bothering me right now is the difference between client and
>> server encodings.
>
> OCTET_LENGTH returns the size of its argument, not the size of some
> possible future shape of that argument.

That would serve equally well as an argument for returning the
compressed length of the string, I think. You'll need to do better.

My take on it is that when a particular client encoding is specified,
Postgres does its best to provide the illusion that your data actually
is stored in that encoding. If we don't make OCTET_LENGTH agree, then
we're breaking the illusion.

			regards, tom lane
Tom Lane writes:

> What's bothering me right now is the difference between client and
> server encodings. It seems that the only plausible use for
> octet_length is to do memory allocation on the client side, and for
> that purpose the length ought to be measured in the client encoding.

OCTET_LENGTH returns the size of its argument, not the size of some
possible future shape of that argument. There is absolutely no guarantee
that the string that is processed by OCTET_LENGTH will ever reach any
kind of client. There are procedural languages, for instance, or CREATE
TABLE AS.

Whether or not this behaviour is most likely or most useful is a
different question, but let's not silently readopt standard functions
for non-standard purposes -- we've just gotten past that one.

--
Peter Eisentraut   peter_e@gmx.net
> Tom Lane writes:
>
> > What's bothering me right now is the difference between client and
> > server encodings. It seems that the only plausible use for
> > octet_length is to do memory allocation on the client side, and for
> > that purpose the length ought to be measured in the client encoding.
>
> OCTET_LENGTH returns the size of its argument, not the size of some
> possible future shape of that argument. There is absolutely no guarantee
> that the string that is processed by OCTET_LENGTH will ever reach any
> kind of client. There are procedural languages, for instance, or CREATE
> TABLE AS.

Yes, agreed. I argued that a server-side octet_length would be valuable
for server-side functions. However, others felt the client side was more
important.

-- Bruce Momjian
> OCTET_LENGTH returns the size of its argument, not the size of some
> possible future shape of that argument. There is absolutely no guarantee
> that the string that is processed by OCTET_LENGTH will ever reach any
> kind of client. There are procedural languages, for instance, or CREATE
> TABLE AS.
>
> Whether or not this behaviour is most likely or most useful is a
> different question, but let's not silently readopt standard functions
> for non-standard purposes -- we've just gotten past that one.

I think the essential problem with OCTET_LENGTH (and with any other text
function) is that we currently do not have a way to associate encoding
information with each text object. Probably we could solve this after
implementing the CREATE CHARACTER SET stuff.
--
Tatsuo Ishii
Tom Lane writes:

> > OCTET_LENGTH returns the size of its argument, not the size of some
> > possible future shape of that argument.
>
> That would serve equally well as an argument for returning the
> compressed length of the string, I think. You'll need to do better.

TOAST is not part of the conceptual computational model. The fact that
the compressed representation is available to functions at all is
somewhat peculiar (although I'm not questioning it). I've already
attempted to show that returning the size of the compressed
representation doesn't fit the letter of the standard.

> My take on it is that when a particular client encoding is specified,
> Postgres does its best to provide the illusion that your data actually
> is stored in that encoding. If we don't make OCTET_LENGTH agree, then
> we're breaking the illusion.

The way I've seen it, we consider the encoding conversion to happen "on
the wire" while both the server and the client run in their own
encoding. In that model it's appropriate that computations in the
server use the encoding in the server.

However, if the model is that it should appear to clients that the
entire setup magically runs in "their" encoding, then the other
behaviour would be better. In that case the database encoding is really
only an optimization hint, because the actual encoding in the server is
of no matter. This model would certainly be attractive as well, but
there could be a few problems. For instance, I don't know if the
convert() function would make sense then. (Does it even make sense now?)

Also, we do need to consider carefully how to interface this "illusion"
to operations contained strictly within the server (e.g., CREATE TABLE
AS, column defaults) and to procedural languages that may or may not
come with encoding ideas of their own.

--
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
> However, if the model is that it should appear to clients that the entire
> setup magically runs in "their" encoding then the other behaviour would be
> better. In that case the database encoding is really only an optimization
> hint because the actual encoding in the server is of no matter. This
> model would certainly be attractive as well, but there could be a few
> problems. For instance, I don't know if the convert() function would make
> sense then. (Does it even make sense now?)

I'm not sure that it does; it seems not to fit the model well at all.
For example, if I do "SELECT convert(somestring, someencoding)" where
someencoding is anything but the server's encoding, then I will get
bogus results, because when the data is returned to the client it will
get an inappropriate server-to-client-encoding translation applied to
it. Even if I ask to convert to the client encoding, I will get wrong
answers (two passes of the conversion). Whatever you might expect
convert to do, that wouldn't seem to be it.

> Also, we do need to consider carefully how to interface this "illusion" to
> operations contained strictly within the server (e.g., CREATE TABLE AS,
> column defaults) and to procedural languages that may or may not come with
> encoding ideas of their own.

True. I think that pltcl has now got this more or less licked, but
plperl hasn't ...

			regards, tom lane
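The "bogus results" case can be modeled outside the server: if the server
encoding is UTF-8 and convert() hands back LATIN1 bytes, the wire layer
still treats the column as UTF-8 when translating for the client. A
rough Python sketch of that failure mode (the encodings are illustrative
assumptions, not what any particular installation uses):

```python
s = "café"

server_form = s.encode("utf-8")      # column data in the server encoding
converted = s.encode("latin-1")      # what convert(s, 'LATIN1') hands back

# On output, the wire layer blindly applies the server-to-client
# translation, i.e. it first decodes the bytes as UTF-8 -- but the
# converted value is no longer valid UTF-8 (0xE9 is a bare lead byte):
try:
    converted.decode("utf-8")
    print("translated cleanly (unexpected)")
except UnicodeDecodeError:
    print("bogus result: LATIN1 bytes are not valid UTF-8")
```

The double-conversion case Tom mentions is the same mechanism: even when
the target of convert() happens to equal the client encoding, the wire
translation is applied a second time on top of it.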
> problems. For instance, I don't know if the convert() function would make
> sense then. (Does it even make sense now?)

Yes. Consider you have a UNICODE database and want to sort by a French
or whatever LATIN locale.

	SELECT * FROM t1 ORDER BY convert(text_column, 'LATIN1');

would be the only way to accomplish that.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Yes. Consider you have a UNICODE database and want to sort by a French
> or whatever LATIN locale.
>
> 	SELECT * FROM t1 ORDER BY convert(text_column, 'LATIN1');
>
> would be the only way to accomplish that.

That in itself would not get the job done; how is the sort operator to
know what collation order you want?

The SQL92 spec suggests that the syntax should be

	... ORDER BY text_column COLLATE French;

(note collation names are not standardized AFAICT). Seems to me it
should then be the system's responsibility to make this happen,
including any encoding conversion that might be needed before the
comparisons could be done.

			regards, tom lane
Tom Lane wrote:
>
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > Yes. Consider you have a UNICODE database and want to sort by a French
> > or whatever LATIN locale.
> > 	SELECT * FROM t1 ORDER BY convert(text_column, 'LATIN1');
> > would be the only way to accomplish that.
>
> That in itself would not get the job done; how is the sort operator
> to know what collation order you want?
>
> The SQL92 spec suggests that the syntax should be
>
> 	... ORDER BY text_column COLLATE French;
>
> (note collation names are not standardized AFAICT). Seems to me it
> should then be the system's responsibility to make this happen,
> including any encoding conversion that might be needed before the
> comparisons could be done.

Thanks to PostgreSQL's flexibility, you can currently write a contrib
function convert(text_column, 'LATIN1', locale) that returns a (new)
text_with_locale type that has locale-aware comparison operators.

--------------
Hannu
Tatsuo Ishii writes:

> > problems. For instance, I don't know if the convert() function would
> > make sense then. (Does it even make sense now?)
>
> Yes. Consider you have a UNICODE database and want to sort by a French
> or whatever LATIN locale.
>
> 	SELECT * FROM t1 ORDER BY convert(text_column, 'LATIN1');
>
> would be the only way to accomplish that.

I don't think so. The sort order is independent of the character
encoding, and vice versa. It must be, because

1) One language can be represented in different encodings and should
   obviously still sort the same.

2) One encoding can serve for plenty of languages, which all sort
   differently.

--
Peter Eisentraut   peter_e@gmx.net
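The second point can be illustrated with a toy collation defined on
characters rather than on stored bytes. The rank tables below are
deliberately simplified assumptions (real German and Swedish collation
rules are far richer), but they show one character repertoire sorting two
different ways, with the storage encoding never entering into it:

```python
# Hypothetical, simplified collation tables: German treats 'ö' as a
# variant of 'o'; Swedish sorts 'ö' after 'z'.  (Real rules are richer.)
german_rank  = {c: i for i, c in enumerate("abcdefghijklmnoöpqrstuvwxyz")}
swedish_rank = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyzö")}

def sort_with(rank, words):
    # The collation key is built from characters, so the result is the
    # same whether the words were stored as LATIN1, UTF-8, or anything
    # else that can represent them.
    return sorted(words, key=lambda w: [rank[c] for c in w])

words = ["zon", "över", "ost"]
print(sort_with(german_rank, words))   # ['ost', 'över', 'zon']
print(sort_with(swedish_rank, words))  # ['ost', 'zon', 'över']
```

Two collations over one encoding, and one collation usable over many
encodings: either way, collation and encoding vary independently.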
> > Yes. Consider you have a UNICODE database and want to sort by a French
> > or whatever LATIN locale.
> >
> > 	SELECT * FROM t1 ORDER BY convert(text_column, 'LATIN1');
> >
> > would be the only way to accomplish that.
>
> I don't think so. The sort order is independent of the character
> encoding, and vice versa. It must be, because
>
> 1) One language can be represented in different encodings and should
>    obviously still sort the same.
>
> 2) One encoding can serve for plenty of languages, which all sort
>    differently.

I assume you are talking about the concept of SQL92/99's COLLATE syntax.
But I was just talking about what we can do in 7.2, which apparently
does not have SQL92's COLLATE syntax.

BTW,

> I don't think so. The sort order is independent of the character
> encoding, and vice versa. It must be, because

this seems different from SQL's CREATE COLLATION syntax. From SQL99's
CREATE COLLATION definition:

	CREATE COLLATION <collation name> FOR
		<character set specification>
		FROM <existing collation name>
		[ <pad characteristic> ]

So it seems a collation depends on a character set.
--
Tatsuo Ishii
> > Yes. Consider you have a UNICODE database and want to sort by a French
> > or whatever LATIN locale.
> > 	SELECT * FROM t1 ORDER BY convert(text_column, 'LATIN1');
> > would be the only way to accomplish that.
>
> That in itself would not get the job done; how is the sort operator
> to know what collation order you want?

I assume locale support is enabled, of course.

> The SQL92 spec suggests that the syntax should be
>
> 	... ORDER BY text_column COLLATE French;
>
> (note collation names are not standardized AFAICT). Seems to me it
> should then be the system's responsibility to make this happen,
> including any encoding conversion that might be needed before the
> comparisons could be done.

I'm not talking about our (hopefully) upcoming implementation of the
SQL92 COLLATE syntax. That's the ideal and should be our goal, but what
I have shown is how we can do the job in 7.2 now.
--
Tatsuo Ishii
Tatsuo Ishii writes:

> > I don't think so. The sort order is independent of the character
> > encoding, and vice versa. It must be, because
>
> This seems different from SQL's CREATE COLLATION syntax.
> From SQL99's CREATE COLLATION definition:
>
> 	CREATE COLLATION <collation name> FOR
> 		<character set specification>
> 		FROM <existing collation name>
> 		[ <pad characteristic> ]
>
> So it seems a collation depends on a character set.

I see. But that really doesn't have anything to do with reality. In
fact, it completely undermines the transparency of the character set
encoding that we're probably trying to achieve.

--
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut wrote:
>
> Tatsuo Ishii writes:
>
> > > I don't think so. The sort order is independent of the character
> > > encoding, and vice versa. It must be, because
> >
> > This seems different from SQL's CREATE COLLATION syntax.
> > From SQL99's CREATE COLLATION definition:
> >
> > 	CREATE COLLATION <collation name> FOR
> > 		<character set specification>
> > 		FROM <existing collation name>
> > 		[ <pad characteristic> ]
> >
> > So it seems a collation depends on a character set.
>
> I see. But that really doesn't have anything to do with reality. In
> fact, it completely undermines the transparency of the character set
> encoding that we're probably trying to achieve.

A collation being independent of a character set is a separate problem
from a collation being _defined_ on a character set -- without a known
character set I can't see how you could define it; i.e., "COLLATION for
any 8-bit charset" just does not make sense.

-----------------
Hannu
> > I see. But that really doesn't have anything to do with reality. In
> > fact, it completely undermines the transparency of the character set
> > encoding that we're probably trying to achieve.
>
> COLLATION being independent of character set is a separate problem
> from COLLATION being _defined_ on character set - without a known
> character set I can't see how you can define it.
> i.e. "COLLATION for any 8-bit charset" just does not make sense.

Correct. An IGNORE_CASE collation will not apply to some languages that
do not have an upper/lower case concept, such as Japanese.
--
Tatsuo Ishii