Thread: Multibyte char encoding atttypmod weirdness
Version: PostgreSQL 7.2.1 (7.3 not tested)

Summary: When the locale is set to a multibyte character encoding language, such as ja_JP.eucjp, and the character encoding is set to EUC_JP, the atttypmod for char(20) columns (attributes), i.e. the libpq ((PGresult *)res)->attDescs[0].atttypmod returned by PQfmod(res, 0), is not correct. It is neither 20, nor 20+4 as reported on the hackers mailing list [1], but something varying (which I failed to figure out). In my specific case, it is 25.

Is it a bug, or a feature that needs special care and is not documented in the PostgreSQL documentation? Is this extra byte overhead reflected by VARHDRSZ? But a simple fgrep -r VARHDRSZ in the header files showed:

    internal/c.h:#define VARHDRSZ ((int32) sizeof(int32))
    internal/c.h: * always VARSIZE(ptr) - VARHDRSZ.
    server/access/tuptoaster.h: VARHDRSZ))
    server/utils/varbit.h:/* Header overhead *in addition to* VARHDRSZ */
    server/utils/varbit.h:#define VARBITBYTES(PTR) (VARSIZE(PTR) - VARHDRSZ - VARBITHDRSZ)
    server/utils/varbit.h: VARHDRSZ + VARBITHDRSZ)
    server/c.h:#define VARHDRSZ ((int32) sizeof(int32))
    server/c.h: * always VARSIZE(ptr) - VARHDRSZ.

which means VARHDRSZ should be sizeof(int32), which is always a constant 4 bytes. Is VARBITHDRSZ relevant to this problem? VARBITHDRSZ is not defined in any of the header files installed by "make install-all-headers".

BTW, if it's not a bug, this kind of implementation, inconsistent with common sense, is ugly and a potential source of buggy code.

[1] http://archives.postgresql.org/pgsql-hackers/1998-03/msg00430.php
"Huaxin WANG" <wanghx@netspeed-tech.com> writes: > When locale is set to multibyte char encoding languages, > such as ja_JP.eucjp, and char encoding set to EUC_JP, for the char(20) > columns (attributes), the libpq ((PGresult *)res)->attDescs[0].atttypmod > returned by PQfmod(res, 0) is not correct. It's neither 20, nor 20+4 as > reported in the hackers' mail list [1], but something varying (which I > failed > to figure out). In my specific case, it's 25. I don't think so. A column declared as char(N) *will* have an atttypmod of N+4. The actual physical length in bytes of a column entry might be more, though, since we measure N in terms of characters not bytes. regards, tom lane
Sorry, but I made a mistake in describing the problem. PQfmod(...) returns 20 + 4, but strlen(PQgetvalue(...)) returns something varying, more than 24.

Since you said atttypmod is the char length + 4 and "The actual physical length in bytes of a column entry might be more", the byte length depends on the current locale settings, and multibyte/wide-char related functions should be used to calculate it. Is there a simple and direct way to get this byte length through the libpq API? I will try to figure it out.

Thank you very much for your informative and helpful reply.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Huaxin WANG" <wanghx@netspeed-tech.com>
Cc: <pgsql-bugs@postgresql.org>
Sent: Monday, February 24, 2003 11:07 PM
Subject: Re: [BUGS] Multibyte char encoding atttypmod weirdness

> "Huaxin WANG" <wanghx@netspeed-tech.com> writes:
> > When locale is set to multibyte char encoding languages,
> > such as ja_JP.eucjp, and char encoding set to EUC_JP, for the char(20)
> > columns (attributes), the libpq ((PGresult *)res)->attDescs[0].atttypmod
> > returned by PQfmod(res, 0) is not correct. It's neither 20, nor 20+4 as
> > reported in the hackers' mail list [1], but something varying (which I
> > failed to figure out). In my specific case, it's 25.
>
> I don't think so. A column declared as char(N) *will* have an atttypmod
> of N+4. The actual physical length in bytes of a column entry might
> be more, though, since we measure N in terms of characters not bytes.
>
> regards, tom lane
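
On the question of byte length versus character length: libpq's PQgetlength(res, row, col) reports the byte length of a field value directly, which for text results is the same number strlen(PQgetvalue(...)) yields. Going the other way, from bytes to characters, needs knowledge of the encoding. The following is a minimal sketch assuming the client encoding is EUC_JP; the function names eucjp_char_bytes and eucjp_char_count are hypothetical, and the byte rules are the standard EUC-JP ones (0x8E starts a 2-byte sequence, 0x8F a 3-byte sequence, any other high-bit byte a 2-byte sequence).

```c
#include <stddef.h>

/* Byte length of the EUC_JP character that starts with lead byte c. */
static size_t eucjp_char_bytes(unsigned char c)
{
    if (c == 0x8E)
        return 2;               /* SS2: half-width katakana */
    if (c == 0x8F)
        return 3;               /* SS3: JIS X 0212 */
    if (c & 0x80)
        return 2;               /* JIS X 0208 */
    return 1;                   /* ASCII */
}

/*
 * Count the characters in the first nbytes bytes of s.
 * A character truncated at the end of the buffer still counts as one.
 */
size_t eucjp_char_count(const char *s, size_t nbytes)
{
    size_t chars = 0;
    size_t i = 0;

    while (i < nbytes) {
        i += eucjp_char_bytes((unsigned char) s[i]);
        chars++;
    }
    return chars;
}
```

With a libpq result one would call it as eucjp_char_count(PQgetvalue(res, row, col), PQgetlength(res, row, col)). As a purely illustrative case, a blank-padded char(20) value holding five 2-byte characters and fifteen 1-byte characters is 25 bytes long but still counts as 20 characters. Where the libpq build exposes PQmblen together with PQclientEncoding, those may be preferable to hard-coding one encoding's rules.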