Re: Bug in UTF8-Validation Code? - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Bug in UTF8-Validation Code?
Date
Msg-id 45FC4F85.7090804@dunslane.net
In response to Re: Bug in UTF8-Validation Code?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bug in UTF8-Validation Code?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

Tom Lane wrote:
> I wrote:
>> Actually, I have to take back that objection: on closer look, COPY
>> validates the data only once and does so before applying its own
>> backslash-escaping rules.  So there is a risk in that path too.
>
>> It's still pretty annoying to be validating the data twice in the
>> common case where no backslash reduction occurred, but I'm not sure
>> I see any good way to avoid it.
>
> Further thought here: if we put encoding verification into textin()
> and related functions, could we *remove* it from COPY IN, in the common
> case where client and server encodings are the same?  Currently, copy.c
> forces a trip through pg_client_to_server for multibyte encodings
> even when the encodings are the same, so as to perform validation.
> But I'm wondering whether we'd still need that.  There's no risk of
> SQL injection in COPY data.  Bogus input encoding could possibly
> make for confusion about where the field boundaries are, but bad
> data is bad data in any case.
>
>             regards, tom lane


Here are some timing tests on 1 million rows of random UTF-8 encoded
data, 100 characters per row. It doesn't look to me like the saving
you're suggesting is worth the trouble.

baseline:

Time: 28228.325 ms
Time: 25987.740 ms
Time: 25950.707 ms
Time: 25756.371 ms
Time: 27589.719 ms
Time: 25774.417 ms


after adding the suggested extra encoding check to textin():


Time: 26722.376 ms
Time: 28343.226 ms
Time: 26529.364 ms
Time: 28020.140 ms
Time: 24836.853 ms
Time: 24860.530 ms
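
For a rough comparison of the two sets of runs, the timings above can be summarized with a short Python sketch. This is not part of the original test; the numbers are copied from the listings above, and the variable and function names are only illustrative.

```python
from statistics import mean

# Wall-clock times in milliseconds, copied from the two sets of runs above.
baseline = [28228.325, 25987.740, 25950.707, 25756.371, 27589.719, 25774.417]
with_check = [26722.376, 28343.226, 26529.364, 28020.140, 24836.853, 24860.530]


def summarize(runs):
    """Return (mean, fastest, slowest) for a list of timings in ms."""
    return mean(runs), min(runs), max(runs)


base_mean, base_min, base_max = summarize(baseline)
check_mean, check_min, check_max = summarize(with_check)

print(f"baseline:   mean {base_mean:.1f} ms, range {base_min:.1f} to {base_max:.1f} ms")
print(f"with check: mean {check_mean:.1f} ms, range {check_min:.1f} to {check_max:.1f} ms")

# Difference of the means, also as a percentage of the baseline mean.
delta = check_mean - base_mean
print(f"difference of means: {delta:.1f} ms ({100 * delta / base_mean:.3f}%)")

# Run-to-run spread within the baseline alone, for comparison.
print(f"baseline spread: {base_max - base_min:.1f} ms")
```

With these twelve numbers the means differ by a few milliseconds, while runs within each set vary by a couple of seconds, so the two configurations cannot be told apart at this sample size.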


Script is:

\timing
create table xyz (x text);
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
drop table xyz;
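
The script above assumes that /tmp/utf8.data already exists; how it was generated is not shown. The following Python sketch is one possible way to produce a comparable file. The function name `write_random_utf8_rows`, the choice of character ranges, and the fixed seed are all assumptions made for illustration, not the method actually used for the test.

```python
import random

# Inclusive code point ranges to draw from (an assumption, chosen so that
# every character is valid UTF-8 and safe in COPY text format: no tab,
# newline, carriage return, or backslash).
CODEPOINT_RANGES = [
    (0x21, 0x5B),      # printable ASCII below the backslash
    (0x5D, 0x7E),      # printable ASCII above the backslash
    (0xA1, 0xFF),      # Latin-1 supplement (2-byte sequences)
    (0x4E00, 0x9FA5),  # CJK unified ideographs (3-byte sequences)
]


def write_random_utf8_rows(path, rows=1_000_000, chars_per_row=100, seed=0):
    """Write `rows` lines of `chars_per_row` random characters as UTF-8."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for _ in range(rows):
            chars = []
            for _ in range(chars_per_row):
                low, high = rng.choice(CODEPOINT_RANGES)
                chars.append(chr(rng.randint(low, high)))
            out.write("".join(chars) + "\n")


# Example call matching the path used by the script above:
# write_random_utf8_rows("/tmp/utf8.data")
```

Each line of the output is one row for the single text column of table xyz. Avoiding backslashes means COPY performs no escape processing on the data, which corresponds to the common case discussed in the quoted message.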


Test platform: Fedora Core 6 (FC6), Athlon 64.


cheers

andrew


