Thread: OCTET_LENGTH is wrong

OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
I noticed OCTET_LENGTH will return the size of the data after TOAST may
have compressed it.  While this could be useful information, this
behaviour has no basis in the SQL standard and it's not what is
documented.  Moreover, it eliminates the standard useful behaviour of
OCTET_LENGTH, which is to show the length in bytes of a multibyte string.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I noticed OCTET_LENGTH will return the size of the data after TOAST may
> have compressed it.  While this could be useful information, this
> behaviour has no basis in the SQL standard and it's not what is
> documented.  Moreover, it eliminates the standard useful behaviour of
> OCTET_LENGTH, which is to show the length in bytes of a multibyte string.

I wondered about that too, the first time I noticed it.  On the other
hand, knowing the compressed length is kinda useful too, at least for
hacking and DBA purposes.  (One might also like to know whether a value
has been moved out of line, which is not currently determinable.)

I don't want to force an initdb at this stage, at least not without
compelling reason, so adding more functions right now is not feasible.
Maybe a TODO item for next time.

That leaves us with the question whether to change OCTET_LENGTH now
or leave it for later.  Anyone?

BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls.  It really needs a short-circuit path for
single-byte encodings.
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > I noticed OCTET_LENGTH will return the size of the data after TOAST may
> > have compressed it.  While this could be useful information, this
> > behaviour has no basis in the SQL standard and it's not what is
> > documented.  Moreover, it eliminates the standard useful behaviour of
> > OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
> 
> I wondered about that too, the first time I noticed it.  On the other
> hand, knowing the compressed length is kinda useful too, at least for
> hacking and DBA purposes.  (One might also like to know whether a value
> has been moved out of line, which is not currently determinable.)
> 
> I don't want to force an initdb at this stage, at least not without
> compelling reason, so adding more functions right now is not feasible.
> Maybe a TODO item for next time.
> 
> That leaves us with the question whether to change OCTET_LENGTH now
> or leave it for later.  Anyone?

I am unconcerned about showing people the actual toasted length.  Seems
we should get octet_length() computed on the un-TOASTED length, if we
can.

> BTW, I noticed that textlength() is absolutely unreasonably slow when
> MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> overhead, and soon discovered that what I was looking at was nothing
> but pg_mblen() calls.  It really needs a short-circuit path for
> single-byte encodings.

Added to TODO:
* Optimize textlength(), etc. for single-byte encodings

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > I noticed OCTET_LENGTH will return the size of the data after TOAST may
> > have compressed it.  While this could be useful information, this
> > behaviour has no basis in the SQL standard and it's not what is
> > documented.  Moreover, it eliminates the standard useful behaviour of
> > OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
> 
> I wondered about that too, the first time I noticed it.  On the other
> hand, knowing the compressed length is kinda useful too, at least for
> hacking and DBA purposes.  (One might also like to know whether a value
> has been moved out of line, which is not currently determinable.)

It seems the behavior of OCTET_LENGTH varies acording to the
corresponding data type:

TEXT: returns the size of data AFTER TOAST
VARCHAR and CHAR: returns the size of data BEFORE TOAST

I think we should fix at least these inconsistencies but am not sure
if it's totally wrong that OCTET_LENGTH returns the length AFTER
TOAST. The SQL standard does not have any idea about TOAST of course.
Also, I tend to agree with Tom's point about hackers and DBAs.

> I don't want to force an initdb at this stage, at least not without
> compelling reason, so adding more functions right now is not feasible.
> Maybe a TODO item for next time.
>
> That leaves us with the question whether to change OCTET_LENGTH now
> or leave it for later.  Anyone?

My opinion is leaving it for 7.3, with the idea (adding new
functions).

> BTW, I noticed that textlength() is absolutely unreasonably slow when
> MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> overhead, and soon discovered that what I was looking at was nothing
> but pg_mblen() calls.  It really needs a short-circuit path for
> single-byte encodings.

It's easy to optimize that. However I cannot access CVS anymore after
the IP address change. Will post patches later...
--
Tatsuo Ishii



Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> ... Moreover, it eliminates the standard useful behaviour of
> OCTET_LENGTH, which is to show the length in bytes of a multibyte string.

While I don't necessarily dispute this, I do kinda wonder where you
derive the statement.  AFAICS, SQL92 defines OCTET_LENGTH in terms
of BIT_LENGTH:

6.6 General Rule 5:
           a) Let S be the <string value expression>. If the value of S is             not the null value, then the
resultis the smallest integer             not less than the quotient of the division (BIT_LENGTH(S)/8).           b)
Otherwise,the result is the null value.
 

and BIT_LENGTH is defined in the next GR:
           a) Let S be the <string value expression>. If the value of S is             not the null value, then the
resultis the number of bits in             the value of S.           b) Otherwise, the result is the null value.
 

While SQL92 is pretty clear about <bit string>, I'm damned if I can see
anywhere that they define how many bits are in a character string value.
So who's to say what representation is to be used to count the bits?
If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
shouldn't a compressed representation be reasonable too?
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> > BTW, I noticed that textlength() is absolutely unreasonably slow when
> > MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> > overhead, and soon discovered that what I was looking at was nothing
> > but pg_mblen() calls.  It really needs a short-circuit path for
> > single-byte encodings.
> 
> It's easy to optimize that. However I cannot access CVS anymore after
> the IP address change. Will post patches later...

Seems I got the cvs access again (I was asked my pass phrase again)
and I have committed changes for this.

Modified functions are:

bpcharlen
textlen
varcharlen
--
Tatsuo Ishii


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> > > BTW, I noticed that textlength() is absolutely unreasonably slow when
> > > MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> > > overhead, and soon discovered that what I was looking at was nothing
> > > but pg_mblen() calls.  It really needs a short-circuit path for
> > > single-byte encodings.
> > 
> > It's easy to optimize that. However I cannot access CVS anymore after
> > the IP address change. Will post patches later...
> 
> Seems I got the cvs access again (I was asked my pass phrase again)
> and I have committed changes for this.
> 
> Modified functions are:
> 
> bpcharlen
> textlen
> varcharlen

Did you go with the pre or post-TOAST length for these types?  I vote
for pre-TOAST because it seems much more useful to ordinary users.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> > > BTW, I noticed that textlength() is absolutely unreasonably slow when
> > > MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
> > > overhead, and soon discovered that what I was looking at was nothing
> > > but pg_mblen() calls.  It really needs a short-circuit path for
> > > single-byte encodings.
> > 
> > It's easy to optimize that. However I cannot access CVS anymore after
> > the IP address change. Will post patches later...
> 
> Seems I got the cvs access again (I was asked my pass phrase again)
> and I have committed changes for this.
> 
> Modified functions are:
> 
> bpcharlen
> textlen
> varcharlen

OK, sorry, I see you did the optimization, not changed the length
functio.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
Tom Lane writes:

>             a) Let S be the <string value expression>. If the value of S is
>               not the null value, then the result is the number of bits in
>               the value of S.
>             b) Otherwise, the result is the null value.
>
> While SQL92 is pretty clear about <bit string>, I'm damned if I can see
> anywhere that they define how many bits are in a character string value.
> So who's to say what representation is to be used to count the bits?
> If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
> shouldn't a compressed representation be reasonable too?

I think "the value of S" implies "the user-accessible representation of
the value of S", in the sense, "How much memory do I need to allocate to
store this value".

Furthermore, the size of the TOAST representation that is returned now is
just one particular of several intermediate representations.  For
instance, it does not include the VARHDRSZ and it does not include the
size of the tuple headers when it's stored externally.  Thus, this size is
heavily skewed toward low numbers and doesn't tell you much about either
the disk end or the user's end.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I think "the value of S" implies "the user-accessible representation of
> the value of S", in the sense, "How much memory do I need to allocate to
> store this value".

If I take that argument seriously, I have to conclude that OCTET_LENGTH
should return the string length measured in the current client encoding
(which may have little to do with its size in the server, if the
server's encoding is different).  If the client actually retrieves the
string then that's how much memory he'll need.

I presume that where you want to come out is OCTET_LENGTH = uncompressed
length in the server's encoding ... but so far no one has really made
a convincing argument why that answer is better or more spec-compliant
than any other answer.  In particular, it's not obvious to me why
"number of bytes we're actually using on disk" is wrong.
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> I think "the value of S" implies "the user-accessible representation of
> the value of S", in the sense, "How much memory do I need to allocate to
> store this value".
>
> Furthermore, the size of the TOAST representation that is returned now is
> just one particular of several intermediate representations.  For
> instance, it does not include the VARHDRSZ and it does not include the
> size of the tuple headers when it's stored externally.  Thus, this size is
> heavily skewed toward low numbers and doesn't tell you much about either
> the disk end or the user's end.

Yes, good arguments.  If we want to implement storage_length at some
later time, I think the compressed length may be appropriate, but for
general use, I think we need to return the uncompressed length,
especially considering that multibyte makes the ordinary 2length return
number of characters, so users need a way to get byte length.

Attached is a patch that makes text return the same value type as char()
and varchar() already do.  As Tatsuo pointed out, they were
inconsistent.  All the other octet_length() functions look fine so it
was only text that had this problem.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
retrieving revision 1.74
diff -c -r1.74 varlena.c
*** src/backend/utils/adt/varlena.c    2001/10/25 05:49:46    1.74
--- src/backend/utils/adt/varlena.c    2001/11/18 19:11:52
***************
*** 273,284 ****
  Datum
  textoctetlen(PG_FUNCTION_ARGS)
  {
!     struct varattrib *t = (struct varattrib *) PG_GETARG_RAW_VARLENA_P(0);

!     if (!VARATT_IS_EXTERNAL(t))
!         PG_RETURN_INT32(VARATT_SIZE(t) - VARHDRSZ);
!
!     PG_RETURN_INT32(t->va_content.va_external.va_extsize);
  }

  /*
--- 273,281 ----
  Datum
  textoctetlen(PG_FUNCTION_ARGS)
  {
!     text    *arg = PG_GETARG_VARCHAR_P(0);

!     PG_RETURN_INT32(VARSIZE(arg) - VARHDRSZ);
  }

  /*

Re: OCTET_LENGTH is wrong

From
Stephan Szabo
Date:
On Sun, 18 Nov 2001, Tom Lane wrote:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > I think "the value of S" implies "the user-accessible representation of
> > the value of S", in the sense, "How much memory do I need to allocate to
> > store this value".
>
> If I take that argument seriously, I have to conclude that OCTET_LENGTH
> should return the string length measured in the current client encoding
> (which may have little to do with its size in the server, if the
> server's encoding is different).  If the client actually retrieves the
> string then that's how much memory he'll need.
>
> I presume that where you want to come out is OCTET_LENGTH = uncompressed
> length in the server's encoding ... but so far no one has really made
> a convincing argument why that answer is better or more spec-compliant
> than any other answer.  In particular, it's not obvious to me why
> "number of bytes we're actually using on disk" is wrong.

I'm not sure, but if we say that the on disk representation is the
value of the character value expression whose size is being checked,
wouldn't that be inconsistent with the other uses of the character value
expression in places like substr where we don't use the on disk
representation?  Unless you're saying that the string value expression
that is that character value expression is the compressed one and
the character value expression is the uncompressed one.




Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Sun, 18 Nov 2001, Tom Lane wrote:
>> I presume that where you want to come out is OCTET_LENGTH = uncompressed
>> length in the server's encoding ... but so far no one has really made
>> a convincing argument why that answer is better or more spec-compliant
>> than any other answer.  In particular, it's not obvious to me why
>> "number of bytes we're actually using on disk" is wrong.

> I'm not sure, but if we say that the on disk representation is the
> value of the character value expression whose size is being checked,
> wouldn't that be inconsistent with the other uses of the character value

Yeah, it would be and is.  In fact, the present code has some
interesting behaviors: if foo.x is a text value long enough to be
toasted, then you get different results from
SELECT OCTET_LENGTH(x) FROM foo;
SELECT OCTET_LENGTH(x || '') FROM foo;

since the result of the concatenation expression won't be compressed.

I'm not actually here to defend the existing code; in fact I believe the
XXX comment on textoctetlen questioning its correctness is mine.  What
I am trying to point out is that the spec is so vague that it's not
clear what the correct answer is.
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> > On Sun, 18 Nov 2001, Tom Lane wrote:
> >> I presume that where you want to come out is OCTET_LENGTH = uncompressed
> >> length in the server's encoding ... but so far no one has really made
> >> a convincing argument why that answer is better or more spec-compliant
> >> than any other answer.  In particular, it's not obvious to me why
> >> "number of bytes we're actually using on disk" is wrong.
> 
> > I'm not sure, but if we say that the on disk representation is the
> > value of the character value expression whose size is being checked,
> > wouldn't that be inconsistent with the other uses of the character value
> 
> Yeah, it would be and is.  In fact, the present code has some
> interesting behaviors: if foo.x is a text value long enough to be
> toasted, then you get different results from
> 
>     SELECT OCTET_LENGTH(x) FROM foo;
> 
>     SELECT OCTET_LENGTH(x || '') FROM foo;
> 
> since the result of the concatenation expression won't be compressed.
> 
> I'm not actually here to defend the existing code; in fact I believe the
> XXX comment on textoctetlen questioning its correctness is mine.  What
> I am trying to point out is that the spec is so vague that it's not
> clear what the correct answer is.

Well, if the standard is unclear, we should assume to return the most
reasonable answer, which has to be non-compressed length.  

In multibyte encodings, when we started returning length() in
_characters_ instead of bytes, I assumed the major use for octet_length
was to return the number of bytes needed to hold the value on the client
side.

In single byte encodings, octet_length is the same as length() so
returning a compressed length may make sense, but I don't think we want
different meanings for the function for single and multi-byte encodings.

I guess the issue is that for single-byte encodings, octet_length is
pretty useless because it is the same as length, but for multi-byte
encodings, octet_length is invaluable and almost has to return
non-compress bytes because uncompressed is that the client sees.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> octet_length is invaluable and almost has to return
> non-compress bytes because uncompressed is that the client sees.

What about encoding?
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > octet_length is invaluable and almost has to return
> > non-compress bytes because uncompressed is that the client sees.                                              ^^^^
                                           what
 

> What about encoding?

Single-byte encodings have the same character and byte lengths.  Only
multi-byte encodings are different, right?

In thinking about it, I think the function is called octet_length()
to emphasize is returns the length in octets (bytes) rather than the
length in characters.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> > What about encoding?
> 
> Single-byte encodings have the same character and byte lengths.  Only
> multi-byte encodings are different, right?
> 
> In thinking about it, I think the function is called octet_length()
> to emphasize is returns the length in octets (bytes) rather than the
> length in characters.

I think Tom's point is whether octet_length() should regard input text
being encoded in the client side encoding or not.

My vote is octet_length() assumes database encodeding.
If you need client side encoded text length, you could do something
like:

select octet_length(convert('foo',pg_client_encoding()));

Note that there was a nasty bug in convert() which prevents above
working. I have committed fixes.
--
Tatsuo Ishii


Re: OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
Tom Lane writes:

> What
> I am trying to point out is that the spec is so vague that it's not
> clear what the correct answer is.

I guess the authors of SQL92 never imagined someone would question what
"value of S" means.  In SQL99 they included it:

SQL 99 Part 1, 4.4.3.2.
        A value of character type is a string (sequence) of characters        drawn from some character repertoire.

Just to be sure...

SQL 99 Part 1, 3.1 q)
        q) sequence: An ordered collection of objects that are not           necessarily distinct.

I don't have a set theory text available, but I think this should give a
fair indication that the number of bits in the value of S is the sum of
the bits in each individual character (which is in turn vaguely defined
elsewhere in SQL99) -- at least in Euclidean memory architectures.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I don't have a set theory text available, but I think this should give a
> fair indication that the number of bits in the value of S is the sum of
> the bits in each individual character (which is in turn vaguely defined
> elsewhere in SQL99) -- at least in Euclidean memory architectures.

But "how many bits in a character?" is exactly the question at this
point.  To be fair, I don't think our notion of on-the-fly encoding
translation is envisioned anywhere in the SQL spec, so perhaps we
shouldn't expect it to tell us which encoding to count the bits in.
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Barry Lind
Date:
Tom,

While the text datatypes have additional issues with encodings, that is 
not true for the bytea type.  I think it does make sense that a client 
be able to get the size in bytes that the bytea type value will return 
to the client.  If you are storing files in a bytea column getting the 
file size by calling octet_length would be very useful.

thanks,
--Barry


Tom Lane wrote:

> Peter Eisentraut <peter_e@gmx.net> writes:
> 
>>I think "the value of S" implies "the user-accessible representation of
>>the value of S", in the sense, "How much memory do I need to allocate to
>>store this value".
>>
> 
> If I take that argument seriously, I have to conclude that OCTET_LENGTH
> should return the string length measured in the current client encoding
> (which may have little to do with its size in the server, if the
> server's encoding is different).  If the client actually retrieves the
> string then that's how much memory he'll need.
> 
> I presume that where you want to come out is OCTET_LENGTH = uncompressed
> length in the server's encoding ... but so far no one has really made
> a convincing argument why that answer is better or more spec-compliant
> than any other answer.  In particular, it's not obvious to me why
> "number of bytes we're actually using on disk" is wrong.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 
> 




Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
OK, I have applied this patch so text octet_length returns
non-compressed length of data, to match octet_length of other types.

I also removed the XXX comments added by Tom.

---------------------------------------------------------------------------

> > I think "the value of S" implies "the user-accessible representation of
> > the value of S", in the sense, "How much memory do I need to allocate to
> > store this value".
> > 
> > Furthermore, the size of the TOAST representation that is returned now is
> > just one particular of several intermediate representations.  For
> > instance, it does not include the VARHDRSZ and it does not include the
> > size of the tuple headers when it's stored externally.  Thus, this size is
> > heavily skewed toward low numbers and doesn't tell you much about either
> > the disk end or the user's end.
> 
> Yes, good arguments.  If we want to implement storage_length at some
> later time, I think the compressed length may be appropriate, but for
> general use, I think we need to return the uncompressed length,
> especially considering that multibyte makes the ordinary 2length return
> number of characters, so users need a way to get byte length.
> 
> Attached is a patch that makes text return the same value type as char()
> and varchar() already do.  As Tatsuo pointed out, they were
> inconsistent.  All the other octet_length() functions look fine so it
> was only text that had this problem.
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

> Index: src/backend/utils/adt/varlena.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
> retrieving revision 1.74
> diff -c -r1.74 varlena.c
> *** src/backend/utils/adt/varlena.c    2001/10/25 05:49:46    1.74
> --- src/backend/utils/adt/varlena.c    2001/11/18 19:11:52
> ***************
> *** 273,284 ****
>   Datum
>   textoctetlen(PG_FUNCTION_ARGS)
>   {
> !     struct varattrib *t = (struct varattrib *) PG_GETARG_RAW_VARLENA_P(0);
>   
> !     if (!VARATT_IS_EXTERNAL(t))
> !         PG_RETURN_INT32(VARATT_SIZE(t) - VARHDRSZ);
> ! 
> !     PG_RETURN_INT32(t->va_content.va_external.va_extsize);
>   }
>   
>   /*
> --- 273,281 ----
>   Datum
>   textoctetlen(PG_FUNCTION_ARGS)
>   {
> !     text    *arg = PG_GETARG_VARCHAR_P(0);
>   
> !     PG_RETURN_INT32(VARSIZE(arg) - VARHDRSZ);
>   }
>   
>   /*

> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Barry Lind <barry@xythos.com> writes:
> While the text datatypes have additional issues with encodings, that is 
> not true for the bytea type.  I think it does make sense that a client 
> be able to get the size in bytes that the bytea type value will return 
> to the client.

bytea does that already.  It's only text that has (or had, till a few
minutes ago) the funny behavior.

I'm not set on the notion that octet_length should return on-disk size;
that's clearly not what's contemplated by SQL92, so I'm happy to agree
that if we want that we should add a new function to get it.
("storage_length", maybe.)  What's bothering me right now is the
difference between client and server encodings.  It seems that the only
plausible use for octet_length is to do memory allocation on the client
side, and for that purpose the length ought to be measured in the client
encoding.  People seem to be happy with letting octet_length take the
easy way out (measure in the server encoding), and I'm trying to get
someone to explain to me why that's the right behavior.  I don't see it.
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Hannu Krosing
Date:

Tom Lane wrote:

>Peter Eisentraut <peter_e@gmx.net> writes:
>
>>... Moreover, it eliminates the standard useful behaviour of
>>OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
>>
>
>While I don't necessarily dispute this, I do kinda wonder where you
>derive the statement.  AFAICS, SQL92 defines OCTET_LENGTH in terms
>of BIT_LENGTH:
>
>6.6 General Rule 5:
>
>            a) Let S be the <string value expression>. If the value of S is
>              not the null value, then the result is the smallest integer
>              not less than the quotient of the division (BIT_LENGTH(S)/8).
>            b) Otherwise, the result is the null value.
>
>and BIT_LENGTH is defined in the next GR:
>
>            a) Let S be the <string value expression>. If the value of S is
>              not the null value, then the result is the number of bits in
>              the value of S.
>            b) Otherwise, the result is the null value.
>
>While SQL92 is pretty clear about <bit string>, I'm damned if I can see
>anywhere that they define how many bits are in a character string value
>So who's to say what representation is to be used to count the bits?
>If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
>shouldn't a compressed representation be reasonable too?
>
One objection I have to this, is the fact that nobody uses the compressed
representation in client libraries whrereas they do use both UTF-16 and 
UTF-8.
At least UTF-8 is available as client encoding.

And probably it is possible that the length of the "possibly compressed" 
representation
can change without the underlying data changing (for example when you 
set a bit
somewhere that disables compression and UPDATE some other field in the 
tuple)
making the result of OCTET_LENGTH dependent on other things than the 
argument
string.

I also like the propery of _uncompressed_ OCTET_LENGTH that
OCTET_LENGTH(s||s) == 2 * OCTET_LENGTH(s)
which is almost never true for compressed length

----------------
Hannu




Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> !     text    *arg = PG_GETARG_VARCHAR_P(0);

Er, shouldn't that be PG_GETARG_TEXT_P?
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
"Zeugswetter Andreas SB SD"
Date:
> What's bothering me right now is the
> difference between client and server encodings.  It seems that the
only
> plausible use for octet_length is to do memory allocation on the
client
> side, and for that purpose the length ought to be measured in the
client
> encoding.  People seem to be happy with letting octet_length take the
> easy way out (measure in the server encoding), and I'm trying to get
> someone to explain to me why that's the right behavior.  I 
> don't see it.

I agree. octet_length should be the number of bytes the client gets when
he 
does "select textfield from atable".

Andreas


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> !     text    *arg = PG_GETARG_VARCHAR_P(0);
> 
> Er, shouldn't that be PG_GETARG_TEXT_P?

Sorry, fixed.  Cut/paste error.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
Summary:

There have been three ideas of what octet_length() sould return:
1) compressed on-disk storage length2) byte length in server-side encoding3) byte length in client-side encoding

7.3 will do #2 for all data types.  We didn't have text type doing #2 in
7.1.X, but it appears that is the only release where octet_length(text)
returned #1.  This is the patch that made octet_length(text) return #1
in 7.1.X:
 Revision 1.62 / (download) - annotate - [select for diffs] , Wed Jul 5 23:11:35 2000 UTC (16 months, 2 weeks ago) by
tgl Changes since 1.61: +12 -20 lines Diff to previous 1.61 
 
 Update textin() and textout() to new fmgr style.  This is just phase one of updating the whole text datatype, but
thereare so dang many calls of these two routines that it seems worth a separate commit.
 

The open question is whether we should be doing #3.  If you want to use
octet_length to allocate space on the client side, #3 is really the
proper value, as Tom has argued.  Tatsuo is happy with #2.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> It seems the behavior of OCTET_LENGTH varies acording to the
> corresponding data type:
> 
> TEXT: returns the size of data AFTER TOAST
> VARCHAR and CHAR: returns the size of data BEFORE TOAST

Fixed in CVS.  TEXT now like CHAR/VARCHAR.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Hannu Krosing
Date:

Tom Lane wrote:

>Barry Lind <barry@xythos.com> writes:
>
>>While the text datatypes have additional issues with encodings, that is 
>>not true for the bytea type.  I think it does make sense that a client 
>>be able to get the size in bytes that the bytea type value will return 
>>to the client.
>>
>
>bytea does that already.  It's only text that has (or had, till a few
>minutes ago) the funny behavior.
>
>I'm not set on the notion that octet_length should return on-disk size;
>that's clearly not what's contemplated by SQL92, so I'm happy to agree
>that if we want that we should add a new function to get it.
>("storage_length", maybe.)  What's bothering me right now is the
>difference between client and server encodings.  It seems that the only
>plausible use for octet_length is to do memory allocation on the client
>side,
>
Allocating memory seems for me to be drivers (libpq, JDBC, ODBC,...) 
problem and
not something to be done by client code beforehand - at least for libpq 
(AFAIK) we
don't have any means of giving it a pre-allocated storage area for one 
field.

There is enough information in wire protocol for allocating right-sized 
chunks at the
time query result is read. An additional call of "SELECT 
OCTET_LENGTH(someCol)"
seems orders of magnitude slower than doing it at the right time in the 
driver .

>and for that purpose the length ought to be measured in the client
>encoding.  People seem to be happy with letting octet_length take the
>easy way out (measure in the server encoding), and I'm trying to get
>someone to explain to me why that's the right behavior.  I don't see it.
>
perhaps we need another function "OCTET_LENGTH(someCol, encoding)" for
getting what we want and also client_encoding() and server_encoding() 
for supplying
it some universal defaults ?

OTOH, from reading on Unicode I've came to a conlusion that there are 
often several
ways for expressing the same string in Unicode, so for server encoding 
not unicode and
client requesting unicode (say UTF-8) there can be several different 
ways to express
the same string. Thus there is no absolute OCTET_LENGTH for 
client_encoding for
all cases. Thus giving the actual uncompressed length seems most reasonable.

For unicode both in backend and frontend we could also make OCTET_LENGTH
return not int but an integer-interval of shortest and longest possible 
encoding ;)

------------------
Hannu






Re: OCTET_LENGTH is wrong

From
Karel Zak
Date:
On Mon, Nov 19, 2001 at 02:34:56PM -0500, Bruce Momjian wrote:
> Summary:
> 
> There have been three ideas of what octet_length() sould return:
> 
>     1) compressed on-disk storage length
>     2) byte length in server-side encoding
>     3) byte length in client-side encoding
Very nice is possibility of choice... What add everything:
   octet_length_storage()   octet_length_server()   octet_length_client()and problem of right choice put to user. And
thestandard octet_length() make as alias to 1) or 2) or 3) -- depend on result of this discussion.
 

> The open question is whether we should be doing #3.  If you want to use
> octet_length to allocate space on the client side, #3 is really the
If Tom needs be sure, he can uses octet_length_client().

> proper value, as Tom has argued.  Tatsuo is happy with #2.

...and Tatsuo can uses octet_length_server(). The important thing
is that both will still happy :-)
       Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/C, PostgreSQL, PHP, WWW, http://docs.linux.cz,
http://mape.jcu.cz


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> > There have been three ideas of what octet_length() sould return:
> > 
> >     1) compressed on-disk storage length
> >     2) byte length in server-side encoding
> >     3) byte length in client-side encoding
> 
>  Very nice is possibility of choice... What add everything:
> 
>     octet_length_storage()
>     octet_length_server()
>     octet_length_client()

We only need one of octet_length_server() or octet_length_client().
We could emulate the rest using convert() etc.
--
Tatsuo Ishii


Re: OCTET_LENGTH is wrong

From
Thomas Lockhart
Date:
> > There have been three ideas of what octet_length() sould return:
> >       1) compressed on-disk storage length
> >       2) byte length in server-side encoding
> >       3) byte length in client-side encoding
...
> > The open question is whether we should be doing #3.

There is no question in my mind that (3) must be the result of
octet_length(). Any of the other options may give an interesting result,
but of no practical use to a client trying to retrieve data. And
everything is a client!
                      - Thomas


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> > > There have been three ideas of what octet_length() sould return:
> > >       1) compressed on-disk storage length
> > >       2) byte length in server-side encoding
> > >       3) byte length in client-side encoding
> ...
> > > The open question is whether we should be doing #3.
> 
> There is no question in my mind that (3) must be the result of
> octet_length(). Any of the other options may give an interesting result,
> but of no practical use to a client trying to retrieve data. And
> everything is a client!

Added to TODO:
* Add octet_length_server() and octet_length_client() (Thomas, Tatsuo)

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> > > There have been three ideas of what octet_length() sould return:
> > >       1) compressed on-disk storage length
> > >       2) byte length in server-side encoding
> > >       3) byte length in client-side encoding
> ...
> > > The open question is whether we should be doing #3.
> 
> There is no question in my mind that (3) must be the result of
> octet_length(). Any of the other options may give an interesting result,
> but of no practical use to a client trying to retrieve data. And
> everything is a client!

Also added to TODO:
* Make octet_length_client the same as octet_length() 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Hannu Krosing
Date:

Thomas Lockhart wrote:

>>>There have been three ideas of what octet_length() sould return:
>>>      1) compressed on-disk storage length
>>>      2) byte length in server-side encoding
>>>      3) byte length in client-side encoding
>>>
>...
>
>>>The open question is whether we should be doing #3.
>>>
>
>There is no question in my mind that (3) must be the result of
>octet_length(). Any of the other options may give an interesting result,
>but of no practical use to a client trying to retrieve data.
>
What practical use does #3 give ;) do you really envision a program that 
does 2
separate queries to retrieve some string, first to query its storage 
length and then
to actually read it, instead of just reading it ?

I don't think we evan have a interface in any of our libs where we can 
give a
pre-allocated buffer to a client library to fill in. Or do we ?

>And everything is a client!
>
So in a PL/PgSQL function doing some data manipulation through SPI the 
"client" is
who - the server or the client or some third party ?

---------------
Hannu



Re: OCTET_LENGTH is wrong

From
Hannu Krosing
Date:

Tatsuo Ishii wrote:

>>>There have been three ideas of what octet_length() sould return:
>>>
>>>    1) compressed on-disk storage length
>>>    2) byte length in server-side encoding
>>>    3) byte length in client-side encoding
>>>
>> Very nice is possibility of choice... What add everything:
>>
>>    octet_length_storage()
>>    octet_length_server()
>>    octet_length_client()
>>
>
>We only need one of octet_length_server() or octet_length_client().
>
And i guess that octet_length_server() is cheaper as it does not do a 
convert()
when not needed.

>
>We could emulate the rest using convert() etc.
>--
>Tatsuo Ishii
>
>---------------------------(end of broadcast)---------------------------
>TIP 6: Have you searched our list archives?
>
>http://archives.postgresql.org
>




Re: OCTET_LENGTH is wrong

From
Hannu Krosing
Date:

Bruce Momjian wrote:

>>>>There have been three ideas of what octet_length() sould return:
>>>>      1) compressed on-disk storage length
>>>>      2) byte length in server-side encoding
>>>>      3) byte length in client-side encoding
>>>>
>>...
>>
>>>>The open question is whether we should be doing #3.
>>>>
>>There is no question in my mind that (3) must be the result of
>>octet_length(). Any of the other options may give an interesting result,
>>but of no practical use to a client trying to retrieve data. And
>>everything is a client!
>>
>
>Also added to TODO:
>
>    * Make octet_length_client the same as octet_length() 
>
Will this break backward compatibility ?

-------------
Hannu



Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> >>>>There have been three ideas of what octet_length() sould return:
> >>>>      1) compressed on-disk storage length
> >>>>      2) byte length in server-side encoding
> >>>>      3) byte length in client-side encoding
> >>>>
> >>...
> >>
> >>>>The open question is whether we should be doing #3.
> >>>>
> >>There is no question in my mind that (3) must be the result of
> >>octet_length(). Any of the other options may give an interesting result,
> >>but of no practical use to a client trying to retrieve data. And
> >>everything is a client!
> >>
> >
> >Also added to TODO:
> >
> >    * Make octet_length_client the same as octet_length() 
> >
> Will this break backward compatibility ?

Well, sort of.  7.1 had text returning compressed length.  We changed
that to server-side encoding in 7.2.  Changing that to client encoding
will break clients, but what meaningful thing could they do with the
server-side encoding?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> Bruce Momjian writes:
> 
> > Also added to TODO:
> >
> >     * Make octet_length_client the same as octet_length()
> 
> Have we decided on that one yet?

Uh, Thomas said he was certain about it.  I will add a question mark to
the TODO item.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
Bruce Momjian writes:

> Also added to TODO:
>
>     * Make octet_length_client the same as octet_length()

Have we decided on that one yet?

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> What's bothering me right now is the difference between client and
>> server encodings.

> OCTET_LENGTH returns the size of its argument, not the size of some
> possible future shape of that argument.

That would serve equally well as an argument for returning the
compressed length of the string, I think.  You'll need to do better.

My take on it is that when a particular client encoding is specified,
Postgres does its best to provide the illusion that your data actually
is stored in that encoding.  If we don't make OCTET_LENGTH agree, then
we're breaking the illusion.
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
Tom Lane writes:

> What's bothering me right now is the difference between client and
> server encodings.  It seems that the only plausible use for
> octet_length is to do memory allocation on the client side, and for
> that purpose the length ought to be measured in the client encoding.

OCTET_LENGTH returns the size of its argument, not the size of some
possible future shape of that argument.  There is absolutely no guarantee
that the string that is processed by OCTET_LENGTH will ever reach any kind
of client.  There are procedural languages, for instance, or CREATE TABLE
AS.

Whether or not this behaviour is most likely or most useful is a different
question, but let's not silently readopt standard functions for
non-standard purposes -- we've just gotten past that one.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Bruce Momjian
Date:
> Tom Lane writes:
> 
> > What's bothering me right now is the difference between client and
> > server encodings.  It seems that the only plausible use for
> > octet_length is to do memory allocation on the client side, and for
> > that purpose the length ought to be measured in the client encoding.
> 
> OCTET_LENGTH returns the size of its argument, not the size of some
> possible future shape of that argument.  There is absolutely no guarantee
> that the string that is processed by OCTET_LENGTH will ever reach any kind
> of client.  There are procedural languages, for instance, or CREATE TABLE
> AS.

Yes, agreed.  I argued that server-side octet_length would be valuable
for server-side functions.  However, others felt client-side was more
important.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> OCTET_LENGTH returns the size of its argument, not the size of some
> possible future shape of that argument.  There is absolutely no guarantee
> that the string that is processed by OCTET_LENGTH will ever reach any kind
> of client.  There are procedural languages, for instance, or CREATE TABLE
> AS.
> 
> Whether or not this behaviour is most likely or most useful is a different
> question, but let's not silently readopt standard functions for
> non-standard purposes -- we've just gotten past that one.

I think the essential problem with OCTET_LENGTH(and with any other
text functions) is we currently do not have a way to associate
encoding information with each text object. Probably we could solve
this after implementation of CREATE CHARACTER SET stuffs.
--
Tatsuo Ishii


Re: OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
Tom Lane writes:

> > OCTET_LENGTH returns the size of its argument, not the size of some
> > possible future shape of that argument.
>
> That would serve equally well as an argument for returning the
> compressed length of the string, I think.  You'll need to do better.

TOAST is not part of the conceptual computational model.  The fact that
the compressed representation is available to functions at all is somewhat
peculiar (although I'm not questioning it).  I've already attempted to
show that returning the size of the compressed representation doesn't fit
the letter of the standard.

> My take on it is that when a particular client encoding is specified,
> Postgres does its best to provide the illusion that your data actually
> is stored in that encoding.  If we don't make OCTET_LENGTH agree, then
> we're breaking the illusion.

The way I've seen it we consider the encoding conversion to happen "on the
wire" while both the server and the client run in their own encoding.  In
that model it's appropriate that computations in the server use the
encoding in the server.

However, if the model is that it should appear to clients that the entire
setup magically runs in "their" encoding then the other behaviour would be
better.  In that case the database encoding is really only an optimization
hint because the actual encoding in the server is of no matter.  This
model would certainly be attractive as well, but there could be a few
problems.  For instance, I don't know if the convert() function would make
sense then.  (Does it even make sense now?)

Also, we do need to consider carefully how to interface this "illusion" to
operations contained strictly within the server (e.g., CREATE TABLE AS,
column defaults) and to procedural languages that may or may not come with
encoding ideas of their own.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> However, if the model is that it should appear to clients that the entire
> setup magically runs in "their" encoding then the other behaviour would be
> better.  In that case the database encoding is really only an optimization
> hint because the actual encoding in the server is of no matter.  This
> model would certainly be attractive as well, but there could be a few
> problems.  For instance, I don't know if the convert() function would make
> sense then.  (Does it even make sense now?)

I'm not sure that it does; it seems not to fit the model well at all.
For example, if I do "SELECT convert(somestring, someencoding)" where
someencoding is anything but the server's encoding, then I will get
bogus results, because when the data is returned to the client it
will get an inappropriate server-to-client-encoding translation
applied to it.  Even if I ask to convert to the client encoding,
I will get wrong answers (two passes of the conversion).  Whatever you
might expect convert to do, that wouldn't seem to be it.

> Also, we do need to consider carefully how to interface this "illusion" to
> operations contained strictly within the server (e.g., CREATE TABLE AS,
> column defaults) and to procedural languages that may or may not come with
> encoding ideas of their own.

True.  I think that pltcl has now got this more or less licked, but
plperl hasn't ...
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> problems.  For instance, I don't know if the convert() function would make
> sense then.  (Does it even make sense now?)

Yes. Consider you have UNICODE database and want to sort by French or
whatever LATIN locale.
 SELECT * FROM t1 ORDER BY convert(text_column,'LATIN1');

would be the only way to accomplish that.
--
Tatsuo Ishii


Re: OCTET_LENGTH is wrong

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Yes. Consider you have UNICODE database and want to sort by French or
> whatever LATIN locale.
>      SELECT * FROM t1 ORDER BY convert(text_column,'LATIN1');
> would be the only way to accomplish that.

That in itself would not get the job done; how is the sort operator
to know what collation order you want?

The SQL92 spec suggests that the syntax should be
... ORDER BY text_column COLLATE French;

(note collation names are not standardized AFAICT).  Seems to me it
should then be the system's responsibility to make this happen,
including any encoding conversion that might be needed before the
comparisons could be done.
        regards, tom lane


Re: OCTET_LENGTH is wrong

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > Yes. Consider you have UNICODE database and want to sort by French or
> > whatever LATIN locale.
> >        SELECT * FROM t1 ORDER BY convert(text_column,'LATIN1');
> > would be the only way to accomplish that.
> 
> That in itself would not get the job done; how is the sort operator
> to know what collation order you want?
> 
> The SQL92 spec suggests that the syntax should be
> 
>         ... ORDER BY text_column COLLATE French;
> 
> (note collation names are not standardized AFAICT).  Seems to me it
> should then be the system's responsibility to make this happen,
> including any encoding conversion that might be needed before the
> comparisons could be done.

Thanks to postgreSQL's flexibility you can currently make a contrib 
function convert(text_column,'LATIN1',locale) that returns a (new) 
text_with_locale type that has locale_aware comparison operators.

--------------
Hannu


Re: OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> > problems.  For instance, I don't know if the convert() function would make
> > sense then.  (Does it even make sense now?)
>
> Yes. Consider you have UNICODE database and want to sort by French or
> whatever LATIN locale.
>
>      SELECT * FROM t1 ORDER BY convert(text_column,'LATIN1');
>
> would be the only way to accomplish that.

I don't think so.  The sort order is independent of the character
encoding, and vice versa.  It must be, because

1) One language can be represented in different encodings and should
obviously still sort the same.

2) One encoding can serve for plenty of languages, which all sort
differently.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> > Yes. Consider you have UNICODE database and want to sort by French or
> > whatever LATIN locale.
> >
> >      SELECT * FROM t1 ORDER BY convert(text_column,'LATIN1');
> >
> > would be the only way to accomplish that.
> 
> I don't think so.  The sort order is independent of the character
> encoding, and vice versa.  It must be, because
> 
> 1) One language can be represented in different encodings and should
> obviously still sort the same.
> 
> 2) One encoding can serve for plenty of languages, which all sort
> differently.

I assume you are talking about the concept of SQL92/99's COLLATE
syntax.  But I just talked about what we could do in 7.2 which
apprarently does not have the SQL92's COLLATE syntax.

BTW,

> I don't think so.  The sort order is independent of the character
> encoding, and vice versa.  It must be, because

This seems different from SQL's CREATE COLLATION syntax.
From SQL99's CREATE COLLATION definition:
             CREATE COLLATION <collation name> FOR             <character set specification>               FROM
<existingcollation name>                 [ <pad characteristic> ]
 

So it seems a collation depends on a character set.
--
Tatsuo Ishii


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> > Yes. Consider you have UNICODE database and want to sort by French or
> > whatever LATIN locale.
> >      SELECT * FROM t1 ORDER BY convert(text_column,'LATIN1');
> > would be the only way to accomplish that.
> 
> That in itself would not get the job done; how is the sort operator
> to know what collation order you want?

I assume the locale support enabled of course.

> The SQL92 spec suggests that the syntax should be
> 
>     ... ORDER BY text_column COLLATE French;
> 
> (note collation names are not standardized AFAICT).  Seems to me it
> should then be the system's responsibility to make this happen,
> including any encoding conversion that might be needed before the
> comparisons could be done.

I'm not talking about our (hopefully) upcoming implementation of SQL92
COLLATE syntax. It's ideal and should be our goal, but what I have
shown is how we could do the job in 7.2 now.
--
Tatsuo Ishii


Re: OCTET_LENGTH is wrong

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> > I don't think so.  The sort order is independent of the character
> > encoding, and vice versa.  It must be, because
>
> This seems different from SQL's CREATE COLLATION syntax.
> >From SQL99's CREATE COLLATION definition:
>
>               CREATE COLLATION <collation name> FOR
>               <character set specification>
>                 FROM <existing collation name>
>                   [ <pad characteristic> ]
>
> So it seems a collation depends on a character set.

I see.  But that really doesn't have anything to do with reality.  In
fact, it completely undermines the transparency of the character set
encoding that we're probably trying to achieve.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: OCTET_LENGTH is wrong

From
Hannu Krosing
Date:
Peter Eisentraut wrote:
> 
> Tatsuo Ishii writes:
> 
> > > I don't think so.  The sort order is independent of the character
> > > encoding, and vice versa.  It must be, because
> >
> > This seems different from SQL's CREATE COLLATION syntax.
> > >From SQL99's CREATE COLLATION definition:
> >
> >               CREATE COLLATION <collation name> FOR
> >               <character set specification>
> >                 FROM <existing collation name>
> >                   [ <pad characteristic> ]
> >
> > So it seems a collation depends on a character set.
> 
> I see.  But that really doesn't have anything to do with reality.  In
> fact, it completely undermines the transparency of the character set
> encoding that we're probably trying to achieve.

COLLATION being independent of character set is a separate problem 
from COLLATION being _defined_ on character set - without a known 
character set I can't see how you can define it. 
i.e. "COLLACTION for any 8-bit charset" just does not make sense.

-----------------
Hannu


Re: OCTET_LENGTH is wrong

From
Tatsuo Ishii
Date:
> > I see.  But that really doesn't have anything to do with reality.  In
> > fact, it completely undermines the transparency of the character set
> > encoding that we're probably trying to achieve.
> 
> COLLATION being independent of character set is a separate problem 
> from COLLATION being _defined_ on character set - without a known 
> character set I can't see how you can define it. 
> i.e. "COLLACTION for any 8-bit charset" just does not make sense.

Correct. IGNORE_CASE collation will not apply to some languages those
do not have upper/lower case concept such as Japanese.
--
Tatsuo Ishii