Re: Bug in UTF8-Validation Code? - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Bug in UTF8-Validation Code?
Date
Msg-id 45FCBA2E.7010303@dunslane.net
In response to Re: Bug in UTF8-Validation Code?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bug in UTF8-Validation Code?  (Grzegorz Jaskiewicz <gj@pointblue.com.pl>)
List pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Here are some timing tests in 1m rows of random utf8 encoded 100 char 
>> data. It doesn't look to me like the saving you're suggesting is worth 
>> the trouble.
>>     
>
> Hmm ... not sure I believe your numbers.  Using a test file of 1m lines
> of 100 random latin1 characters converted to utf8 (thus, about half and
> half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
> encoding:
>
> regression=# \timing
> Timing is on.
> regression=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.047 ms
> regression=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 4337.089 ms
>
> and this in UTF8 encoding:
>
> utf8=# \timing
> Timing is on.
> utf8=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.108 ms
> utf8=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 7776.583 ms
>
> The numbers aren't super repeatable, but it sure looks to me like the
> encoding check adds at least 50% to the runtime in this example; so
> doing it twice seems unpleasant.
>
> (This is CVS HEAD, compiled without assert checking, on an x86_64
> Fedora Core 6 box.)
>

Here are some test results that are closer to yours. I used a temp table 
and had cassert off and fsync off, and tried with several encodings.

The additional load from the test isn't 50% (I think you arrived at that 
50% by adding the cost of going from ASCII to UTF8 to the cost of the 
test itself), but it is nevertheless appreciable.

I agree that we should look at skipping the test when the client and 
server encodings are the same, so we can reduce the difference.

cheers

andrew
                  Run  SQL_ASCII     LATIN1       UTF8

Without test        1    4659.38    4766.07    9134.53
                    2    7999.64    4003.13    6231.41
                    3    4178.46    6178.89    7266.39
                    4     4201.7    3930.84   10154.38
                    5    4092.44    4444.52    9438.24
                    6    3977.34    4197.09    8866.56
              Average    4851.49    4586.76    8515.25

With test           1   11993.86    12625.8   10109.89
                    2    4647.16    9192.53   11251.27
                    3    4211.02    9903.77   10097.37
                    4    9203.62    7045.06   10372.25
                    5    4121.39    4138.78   10386.92
                    6    3722.73    4552.09    7432.56
              Average    6316.63    7909.67    9941.71
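
To make the comparison explicit, here is a minimal Python sketch that turns the averages above into relative overheads. It is only arithmetic on the six averages reported in the table; the names `averages` and `overhead_percent` are made up for this illustration, and with the run-to-run spread visible in the individual runs the results should be read as rough indications only.

```python
# Averages copied from the table above (same units as the table).
averages = {
    "SQL_ASCII": {"without": 4851.49, "with": 6316.63},
    "LATIN1":    {"without": 4586.76, "with": 7909.67},
    "UTF8":      {"without": 8515.25, "with": 9941.71},
}


def overhead_percent(baseline, measured):
    """Relative increase of `measured` over `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Extra cost of the test, per encoding, relative to the same encoding
# without the test.
test_overhead = {
    enc: overhead_percent(v["without"], v["with"])
    for enc, v in averages.items()
}

# Cost of UTF8 relative to SQL_ASCII with the test off in both cases,
# i.e. the encoding difference with the test itself excluded.
encoding_gap = overhead_percent(
    averages["SQL_ASCII"]["without"], averages["UTF8"]["without"]
)

for enc, pct in test_overhead.items():
    print(f"{enc:10s} test adds {pct:5.1f}%")
print(f"UTF8 vs SQL_ASCII without test: {encoding_gap:5.1f}%")
```

Comparing within one encoding isolates the cost of the test, while the last figure shows how much of a SQL_ASCII-to-UTF8 comparison is present even when the test is off.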





