Thread: COPY ENCODING revisited

COPY ENCODING revisited

From

Itagaki Takahiro

Date:

17 February 2011, 02:50:08

COPY ENCODING patch was returned with feedback,
  https://commitfest.postgresql.org/action/patch_view?id=501
but we still need it for file_fdw.  Using client_encoding at runtime
is reasonable for one-time COPY command, but logically nonsense for
persistent file_fdw tables.

Base on the latest patch,
  http://archives.postgresql.org/pgsql-hackers/2011-01/msg02903.php
I added pg_any_to_server() and pg_server_to_any() functions instead of
exposing FmgrInfo in pg_wchar.h.  They are same as pg_client_to_server()
and pg_server_to_client(), but accept any encoding. They use cached
conversion procs only if the specified encoding matches the client encoding.

According to Harada's research,
  http://archives.postgresql.org/pgsql-hackers/2011-01/msg02397.php
non-cached conversions are slower than cached ones. This version provides
the same performance before when file and client encoding are same,
but would be a bit slower on other cases. We could improve the performance
in future versions, for example, caching each used conversion proc in
pg_do_pg_do_encoding_conversion().

file_fdw will support ENCODING option. Also, if not specified it might
have to store the client_encoding at CREATE FOREIGN TABLE. Even if we use
a different client_encoding at SELECT, the encoding at definition is used.

ENCODING 'quoted name' issue is also fixed; it always requires quoted names.
I think we only accept non-quoted text as identifier names. Unquoted text
should be treated as "double quoted", but encoding names are not identifiers.

--
Itagaki Takahiro

Attachment

copy_encoding-20110217.patch

Re: COPY ENCODING revisited

From

Robert Haas

Date:

17 February 2011, 17:57:38

On Wed, Feb 16, 2011 at 10:45 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
> COPY ENCODING patch was returned with feedback,
>  https://commitfest.postgresql.org/action/patch_view?id=501
> but we still need it for file_fdw.  Using client_encoding at runtime
> is reasonable for one-time COPY command, but logically nonsense for
> persistent file_fdw tables.
>
> Base on the latest patch,
>  http://archives.postgresql.org/pgsql-hackers/2011-01/msg02903.php
> I added pg_any_to_server() and pg_server_to_any() functions instead of
> exposing FmgrInfo in pg_wchar.h.  They are same as pg_client_to_server()
> and pg_server_to_client(), but accept any encoding. They use cached
> conversion procs only if the specified encoding matches the client encoding.
>
> According to Harada's research,
>  http://archives.postgresql.org/pgsql-hackers/2011-01/msg02397.php
> non-cached conversions are slower than cached ones. This version provides
> the same performance before when file and client encoding are same,
> but would be a bit slower on other cases. We could improve the performance
> in future versions, for example, caching each used conversion proc in
> pg_do_pg_do_encoding_conversion().
>
> file_fdw will support ENCODING option. Also, if not specified it might
> have to store the client_encoding at CREATE FOREIGN TABLE. Even if we use
> a different client_encoding at SELECT, the encoding at definition is used.
>
> ENCODING 'quoted name' issue is also fixed; it always requires quoted names.
> I think we only accept non-quoted text as identifier names. Unquoted text
> should be treated as "double quoted", but encoding names are not identifiers.

I am not qualified to fully review this patch because I'm not all that
familiar with the encoding stuff, but it looks reasonably sensible on
a quick read-through.  I am supportive of making a change in this area
even at this late date, because it seems to me that if we're not going
to change this then we're pretty much giving up on having a usable
file_fdw in 9.1.  And since postgresql_fdw isn't in very good shape
either, that would mean we may as well give up on SQL/MED.  We might
have to do that anyway, but I don't think we should do it just because
of this issue, if there's a reasonable fix.

I don't think the fact that the performance bites is a reason not to
do this.  As you say, that can always be improved in the future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: COPY ENCODING revisited

From

Hitoshi Harada

Date:

17 February 2011, 18:04:14

2011/2/17 Itagaki Takahiro <itagaki.takahiro@gmail.com>:
> Base on the latest patch,
>  http://archives.postgresql.org/pgsql-hackers/2011-01/msg02903.php
> I added pg_any_to_server() and pg_server_to_any() functions instead of
> exposing FmgrInfo in pg_wchar.h.  They are same as pg_client_to_server()
> and pg_server_to_client(), but accept any encoding. They use cached
> conversion procs only if the specified encoding matches the client encoding.

It sounds good to me since the approach doesn't make any overhead to
the current behavior, although additional ENCODING option usage gets a
bit slower. Nothing lost.

FWIW, I finally found the good example to cache miscellaneous data in
file local, namely regexp.c. It allocates compiled regular expressions
up to 32 by using malloc(). We need only 4 or so for encoding
conversion cache, in which cache search doesn't seem like overhead. I
don't have time to make it as patch, but in a few days I'll try it.

Regards,


--
Hitoshi Harada

Re: COPY ENCODING revisited

From

Itagaki Takahiro

Date:

18 February 2011, 02:42:26

On Fri, Feb 18, 2011 at 04:04, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> FWIW, I finally found the good example to cache miscellaneous data in
> file local, namely regexp.c. It allocates compiled regular expressions
> up to 32 by using malloc().

I'm not exactly sure the cache usage in mbutils.c because it doesn't have
routine for cache invalidation at all.  Conversion procs are rarely
replaced, but we might not ought to keep cached functions unlimitedly.

Regexp cache is OK because the algorithm cannot be modified at runtime.

> We need only 4 or so for encoding
> conversion cache, in which cache search doesn't seem like overhead.

We need to research what we should cache for conversion procs.
We will need 4 bytes per conversion pair if we cache only OIDs,
but sizeof(FmgrInfo) bytes if we use the same way as ToXXXConvProc cache.

-- 
Itagaki Takahiro

Re: COPY ENCODING revisited

From

Itagaki Takahiro

Date:

18 February 2011, 10:18:48

On Fri, Feb 18, 2011 at 03:57, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 16, 2011 at 10:45 PM, Itagaki Takahiro
> I am not qualified to fully review this patch because I'm not all that
> familiar with the encoding stuff, but it looks reasonably sensible on
> a quick read-through.  I am supportive of making a change in this area
> even at this late date, because it seems to me that if we're not going
> to change this then we're pretty much giving up on having a usable
> file_fdw in 9.1.  And since postgresql_fdw isn't in very good shape
> either, that would mean we may as well give up on SQL/MED.  We might
> have to do that anyway, but I don't think we should do it just because
> of this issue, if there's a reasonable fix.

One design issue is the new function names: extern char *pg_client_to_server(const char *s, int len); extern char
*pg_server_to_client(constchar *s, int len); 
+ extern char *pg_any_to_server(const char *s, int len, int encoding);
+ extern char *pg_server_to_any(const char *s, int len, int encoding);

They don't contain any words related to "encoding" or "conversion".

Ishii-san, do you have comments? I guess you designed the area.
Please let me know if there are better alternatives.

--
Itagaki Takahiro

Re: COPY ENCODING revisited

From

Itagaki Takahiro

Date:

21 February 2011, 04:34:22

On Fri, Feb 18, 2011 at 20:12, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
> + extern char *pg_any_to_server(const char *s, int len, int encoding);
> + extern char *pg_server_to_any(const char *s, int len, int encoding);

I applied the version with additional codes for file_fdw.

-- 
Itagaki Takahiro