Re: Perform COPY FROM encoding conversions in larger chunks - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Perform COPY FROM encoding conversions in larger chunks
Msg-id 06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi
In response to Re: Perform COPY FROM encoding conversions in larger chunks  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: Perform COPY FROM encoding conversions in larger chunks  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
> 
> 0001 through 0003 are straightforward, and I think they can be committed 
> now if you like.

Thanks for the review!

I did some more rigorous microbenchmarking of patches 1 and 2. I used the 
attached test script, which calls the convert_from() function to perform 
UTF-8 verification on two large strings, about 60 kB each. One of the 
strings is pure ASCII, and the other is an HTML page that contains a mix 
of ASCII and multibyte characters.

Compiled with "gcc -O2", gcc version 10.2.1 20210110 (Debian 10.2.1-6)

            | mixed | ascii
------------+-------+-------
  master    |  1866 |  1250
  patch 1   |   959 |   507
  patch 1+2 |  1396 |   987

So, the first patch, 
0001-Add-new-mbverifystr-function-for-each-encoding.patch, made a huge 
difference, even with pure-ASCII input. That's very surprising, because 
there is already a fast-path for pure-ASCII input in pg_verify_mbstr_len().
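
For readers unfamiliar with the idea, here is a minimal sketch of a UTF-8 
verification loop with a per-byte ASCII fast path. It is not the code from 
master or from either patch; the function name sketch_utf8_valid_prefix() 
and its return convention (length of the valid prefix) are assumptions 
made for illustration, and unlike a real server routine it treats a zero 
byte as ordinary ASCII.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code: return how many leading bytes
 * of s[0..len) form well-formed UTF-8.  A result equal to len means the
 * whole input verified.
 */
size_t
sketch_utf8_valid_prefix(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    assert(s != NULL || len == 0);

    while (i < len)
    {
        unsigned char c = s[i];
        unsigned char lo = 0x80;    /* allowed range of the second byte */
        unsigned char hi = 0xBF;
        size_t      n;              /* total length of this sequence */
        size_t      k;

        /* Fast path: an ASCII byte needs no further checks. */
        if (c < 0x80)
        {
            i++;
            continue;
        }

        if (c >= 0xC2 && c <= 0xDF)
            n = 2;
        else if (c >= 0xE0 && c <= 0xEF)
        {
            n = 3;
            if (c == 0xE0)
                lo = 0xA0;          /* reject overlong encodings */
            else if (c == 0xED)
                hi = 0x9F;          /* reject UTF-16 surrogates */
        }
        else if (c >= 0xF0 && c <= 0xF4)
        {
            n = 4;
            if (c == 0xF0)
                lo = 0x90;          /* reject overlong encodings */
            else if (c == 0xF4)
                hi = 0x8F;          /* reject code points above U+10FFFF */
        }
        else
            break;                  /* stray continuation or invalid lead byte */

        /* Sequence must fit in the input and its second byte be in range. */
        if (len - i < n || s[i + 1] < lo || s[i + 1] > hi)
            break;

        /* Any remaining bytes must be plain continuation bytes. */
        for (k = 2; k < n; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                break;
        if (k < n)
            break;

        i += n;
    }
    return i;
}
```

With pure-ASCII input, only the first branch of the loop ever runs, which 
is why one would not expect changes to the multibyte handling to matter 
much for that case.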

Even more surprising was that the second patch 
(0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch) 
actually made things worse again. I thought it would give a modest gain, 
but nope.

It seems to me that GCC is not doing a good job at optimizing the loop in 
pg_verify_mbstr(). The first patch fixes that, but the second patch 
somehow trips up GCC again.

So I also tried this with "gcc -O3" and clang:

Compiled with "gcc -O3"

            | mixed | ascii
------------+-------+-------
  master    |  1522 |  1225
  patch 1   |   753 |   507
  patch 1+2 |   868 |   507

Compiled with "clang -O2", Debian clang version 11.0.1-2

            | mixed | ascii
------------+-------+-------
  master    |  1257 |   520
  patch 1   |   899 |   507
  patch 1+2 |   884 |   508

With gcc -O3, the results are better, but the second patch still seems 
harmful. With clang, I got the result I expected: almost no difference 
with pure-ASCII input, because there's already a fast-path for that, and 
a nice speedup with multibyte characters. Still, I was surprised how big 
the speedup from the first patch was, and how little additional gain the 
second patch gives.

Based on these results, I'm going to commit the first patch, but not the 
second one. There are much faster UTF-8 verification routines out there, 
using SIMD instructions and whatnot, and we should consider adopting one 
of those, but that's future work.
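
As a rough illustration of the direction such routines take, the sketch 
below checks eight bytes per iteration for non-ASCII content using plain 
64-bit integer operations. It is not SIMD and not taken from any existing 
library or from these patches; sketch_ascii_prefix_len() is a hypothetical 
name, and a real implementation would go on to validate the multibyte 
sequences as well.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code: return the length of the
 * leading run of ASCII bytes in s[0..len).
 */
size_t
sketch_ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    /* Examine 8 bytes at once: a set high bit in any byte means non-ASCII. */
    while (len - i >= sizeof(uint64_t))
    {
        uint64_t    chunk;

        /* memcpy avoids unaligned loads; compilers turn it into one move. */
        memcpy(&chunk, s + i, sizeof(chunk));
        if (chunk & UINT64_C(0x8080808080808080))
            break;
        i += sizeof(chunk);
    }

    /* Finish byte by byte; this also locates the first non-ASCII byte. */
    while (i < len && s[i] < 0x80)
        i++;

    return i;
}
```

A caller could skip the ASCII prefix this way and hand only the remainder 
to a slower, fully validating loop.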

- Heikki
