Thread: Multibyte char encoding atttypmod weirdness

Multibyte char encoding atttypmod weirdness

From
"Huaxin WANG"
Date:
Version: PostgreSQL 7.2.1 (7.3 not tested)

Summary:

When the locale is set to a language with a multibyte character encoding,
such as ja_JP.eucjp, and the character encoding is set to EUC_JP, then for
char(20) columns (attributes) the libpq value
((PGresult *)res)->attDescs[0].atttypmod returned by PQfmod(res, 0) is not
correct.  It is neither 20, nor 20+4 as reported on the hackers' mailing
list [1], but something that varies (I failed to figure out how).  In my
specific case, it is 25.

Is it a bug, or a feature that needs special care and is not documented
in the PostgreSQL documentation?  Is this extra byte of overhead reflected
by VARHDRSZ?  But a simple fgrep -r VARHDRSZ in the header files showed:

internal/c.h:#define VARHDRSZ           ((int32) sizeof(int32))
internal/c.h: * always VARSIZE(ptr) - VARHDRSZ.
server/access/tuptoaster.h:                             VARHDRSZ))
server/utils/varbit.h:/* Header overhead *in addition to* VARHDRSZ */
server/utils/varbit.h:#define VARBITBYTES(PTR)  (VARSIZE(PTR) - VARHDRSZ - VARBITHDRSZ)
server/utils/varbit.h:                          VARHDRSZ + VARBITHDRSZ)
server/c.h:#define VARHDRSZ             ((int32) sizeof(int32))
server/c.h: * always VARSIZE(ptr) - VARHDRSZ.

which means VARHDRSZ is sizeof(int32), which is always a constant 4
bytes.  Is VARBITHDRSZ relevant to this problem?  But VARBITHDRSZ is not
defined in any of the header files that "make install-all-headers"
installed.

BTW, if it's not a bug, an implementation this inconsistent with common
sense is ugly and a potential source of buggy code.

[1] http://archives.postgresql.org/pgsql-hackers/1998-03/msg00430.php

Re: Multibyte char encoding atttypmod weirdness

From
Tom Lane
Date:
"Huaxin WANG" <wanghx@netspeed-tech.com> writes:
> When locale is set to multibyte char encoding languages,
> such as ja_JP.eucjp, and char encoding set to EUC_JP, for the char(20)
> columns (attributes), the libpq ((PGresult *)res)->attDescs[0].atttypmod
> returned by PQfmod(res, 0) is not correct.  It's neither 20, nor 20+4 as
> reported in the hackers' mail list [1], but something varying (which I
> failed to figure out).  In my specific case, it's 25.

I don't think so.  A column declared as char(N) *will* have an atttypmod
of N+4.  The actual physical length in bytes of a column entry might
be more, though, since we measure N in terms of characters not bytes.

            regards, tom lane
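
[Editor's note] The character/byte distinction above can be illustrated
with a small sketch.  It assumes the standard EUC-JP byte layout (lead byte
0x8E for a 2-byte sequence, 0x8F for a 3-byte sequence, 0xA1..0xFE for a
2-byte sequence); the function name eucjp_char_count is hypothetical and is
not part of libpq or PostgreSQL.  It is not a validating decoder, only a
way to compare a character count with strlen():

```c
#include <stddef.h>

/*
 * Count the characters in a NUL-terminated EUC-JP byte string.
 * Hypothetical helper for illustration only.
 *
 *   0x8E        starts a 2-byte sequence (half-width katakana)
 *   0x8F        starts a 3-byte sequence (JIS X 0212)
 *   0xA1..0xFE  starts a 2-byte sequence (JIS X 0208)
 *   otherwise   the byte is counted as one single-byte character
 *
 * A sequence cut short by the terminating NUL counts as one character.
 */
size_t eucjp_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    size_t chars = 0;

    while (*p != '\0') {
        size_t len = 1;

        if (*p == 0x8E)
            len = 2;
        else if (*p == 0x8F)
            len = 3;
        else if (*p >= 0xA1 && *p <= 0xFE)
            len = 2;

        /* Advance over the sequence, never stepping past the NUL. */
        while (len > 0 && *p != '\0') {
            p++;
            len--;
        }
        chars++;
    }
    return chars;
}
```

For instance, a value made of five 2-byte characters followed by fifteen
blanks is 20 characters long but occupies 25 bytes, so strlen() on it
reports 25 while the declared width is still 20.  Whether that is the exact
data behind the report above is not stated in the thread.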

Re: Multibyte char encoding atttypmod weirdness

From
"Huaxin WANG"
Date:
Sorry but I made a mistake in describing the problem.

PQfmod(...) returns 20 + 4, but strlen(PQgetvalue(...)) returns
something that varies and is more than 24.

Since you said atttypmod is the char length + 4 and "The actual physical
length in bytes of a column entry might be more", the byte length depends
on the current locale settings, and multibyte/wide char related functions
should be used to calculate it.  Is there a simple and direct way to get
this byte length through the libpq API?  I will try to figure it out.

Thank you very much for your informative and helpful reply.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Huaxin WANG" <wanghx@netspeed-tech.com>
Cc: <pgsql-bugs@postgresql.org>
Sent: Monday, February 24, 2003 11:07 PM
Subject: Re: [BUGS] Multibyte char encoding atttypmod weirdness


> "Huaxin WANG" <wanghx@netspeed-tech.com> writes:
> > When locale is set to multibyte char encoding languages,
> > such as ja_JP.eucjp, and char encoding set to EUC_JP, for the char(20)
> > columns (attributes), the libpq ((PGresult *)res)->attDescs[0].atttypmod
> > returned by PQfmod(res, 0) is not correct.  It's neither 20, nor 20+4 as
> > reported in the hackers' mail list [1], but something varying (which I
> > failed to figure out).  In my specific case, it's 25.
>
> I don't think so.  A column declared as char(N) *will* have an atttypmod
> of N+4.  The actual physical length in bytes of a column entry might
> be more, though, since we measure N in terms of characters not bytes.
>
> regards, tom lane
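
[Editor's note] On the question of getting the byte length through libpq:
PQgetlength(res, row, col) returns the actual length in bytes of a field
value, while PQfmod(res, col) returns the type modifier (N+4 for char(N),
as stated in the reply above, or -1 when the column has no modifier).  The
sketch below shows one way the two numbers might be related.  The helper
names and the CHAR_TYPMOD_OFFSET constant are hypothetical; the value 4 is
taken from the VARHDRSZ discussion in this thread.

```c
#include <stdbool.h>

/* Offset added to N in the atttypmod of char(N); assumed to be 4 here,
 * matching VARHDRSZ as discussed in the thread. */
#define CHAR_TYPMOD_OFFSET 4

/*
 * Recover the declared width N (in characters) from an atttypmod value,
 * e.g. the result of PQfmod(res, col).  Returns -1 when the modifier does
 * not carry a width (for example when PQfmod returned -1).
 */
int char_limit_from_typmod(int atttypmod)
{
    if (atttypmod < CHAR_TYPMOD_OFFSET)
        return -1;
    return atttypmod - CHAR_TYPMOD_OFFSET;
}

/*
 * Report whether a field's byte length, e.g. the result of
 * PQgetlength(res, row, col), is larger than the declared character
 * width.  With a multibyte encoding this is expected, not an error,
 * because the width counts characters rather than bytes.
 */
bool byte_len_exceeds_char_limit(int atttypmod, int byte_len)
{
    int limit = char_limit_from_typmod(atttypmod);

    return limit >= 0 && byte_len > limit;
}
```

With the numbers from this thread, a modifier of 24 gives a declared width
of 20 characters, and a field of 25 bytes exceeds that width, which is
consistent with the value containing multibyte characters.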