Re: Perform COPY FROM encoding conversions in larger chunks - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Perform COPY FROM encoding conversions in larger chunks
Msg-id 02da25ef-b579-2236-d3cd-0d07819cce98@iki.fi
In response to Re: Perform COPY FROM encoding conversions in larger chunks  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: Perform COPY FROM encoding conversions in larger chunks  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers

On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
> 
> 0001 through 0003 are straightforward, and I think they can be committed 
> now if you like.
> 
> 0004 is also pretty straightforward. The check you proposed upthread for 
> pg_upgrade seems like the best solution to make that workable. I'll take 
> a look at 0005 soon.
> 
> I measured the conversions that were rewritten in 0003, and there is 
> indeed a noticeable speedup:
> 
> Big5 to EUC-TW:
> 
> head    196ms
> 0001-3  152ms
> 
> EUC-TW to Big5:
> 
> head    190ms
> 0001-3  144ms
> 
> I've attached the driver function for reference. Example use:
> 
> select drive_conversion(
>    1000, 'euc_tw'::name, 'big5'::name,
>    convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
> );

Thanks! I have committed patches 0001 and 0003 in this series, with 
minor comment fixes. Next I'm going to write the pg_upgrade check for 
patch 0004, to get that into a committable state too.

> I took a look at the test suite also, and the only thing to note is a 
> couple places where the comment doesn't match the code:
> 
> +  -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
> +  byte1 = hex('0e');
> +  for byte2 in hex('a1')..hex('df') loop
> +    return next b(byte1, byte2);
> +  end loop;
> +
> +  -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
> +  byte1 = hex('0f');
> +  for byte2 in hex('a1')..hex('fe') loop
> +    for byte3 in hex('a1')..hex('fe') loop
> +      return next b(byte1, byte2, byte3);
> +    end loop;
> +  end loop;
> 
> Not sure if it matters, but thought I'd mention it anyway.

Good catch! The comments were correct, and the tests were wrong, not 
testing those 2- and 3-byte encoded characters as intended. It doesn't 
matter for testing this patch; I only included those euc_jis_2004 tests 
for the sake of completeness. But if someone finds this test suite in 
the archives and wants to use it for something real, make sure you fix 
that first.
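
For anyone reusing the test suite, here is a minimal sketch of the byte
sequences the quoted comments describe. It assumes the lead bytes are meant
to be 0x8e (SS2) and 0x8f (SS3) rather than 0x0e and 0x0f. It is written in
Python purely for illustration; the names ss2_sequences and ss3_sequences
are hypothetical and are not part of the patch or its PL/pgSQL test suite.
It only enumerates the byte ranges the loops cover and does not claim that
every sequence maps to an assigned character.

```python
# Illustrative sketch only: enumerate the byte sequences that the quoted
# test comments describe, with the lead bytes the comments name.

SS2 = 0x8E  # single shift 2: lead byte of the 2-byte sequences
SS3 = 0x8F  # single shift 3: lead byte of the 3-byte sequences


def ss2_sequences():
    """Yield 2-byte sequences: 0x8e followed by 0xa1..0xdf (inclusive)."""
    for byte2 in range(0xA1, 0xDF + 1):
        yield bytes([SS2, byte2])


def ss3_sequences():
    """Yield 3-byte sequences: 0x8f followed by two bytes in 0xa1..0xfe."""
    for byte2 in range(0xA1, 0xFE + 1):
        for byte3 in range(0xA1, 0xFE + 1):
            yield bytes([SS3, byte2, byte3])


if __name__ == "__main__":
    # Show the first sequence of each kind as hex.
    print(next(ss2_sequences()).hex())
    print(next(ss3_sequences()).hex())
```

In the quoted PL/pgSQL, the equivalent change would be to use '8e' and '8f'
in place of '0e' and '0f' when assigning byte1.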

- Heikki


