
From: Heikki Linnakangas
Subject: Perform COPY FROM encoding conversions in larger chunks
Msg-id: e7861509-3960-538a-9025-b75a61188e01@iki.fi
List: pgsql-hackers
I've been looking at the COPY FROM parsing code, trying to refactor it 
so that parallel COPY would be easier to implement. I haven't touched 
parallelism itself; I've just been looking for ways to smooth the path, 
and for ways to speed up COPY in general.

Currently, COPY FROM parses the input one line at a time. Each line is 
converted to the database encoding separately, or if the file encoding 
matches the database encoding, we just check that the input is valid for 
the encoding. It would be more efficient to do the encoding 
conversion/verification in larger chunks. At least potentially; the 
current conversion/verification implementations work one byte at a time, 
so it doesn't matter too much yet, but there are faster algorithms out 
there that use SIMD instructions or lookup tables, and those benefit 
from larger inputs.

So I'd like to change it so that the encoding conversion/verification is 
done before splitting the input into lines. The problem is that the 
conversion and verification functions throw an error on incomplete 
input, so we can't pass them a chunk of N raw bytes without knowing 
where the character boundaries are. The first step in this effort is to 
change the conversion and verification routines to allow that. Attached 
patches 0001-0004 do that:

For encoding conversions, change the signature of the conversion 
functions by adding a "bool noError" argument and making them return the 
number of input bytes successfully converted. That way, the conversion 
function can be called in a streaming fashion: load a buffer with raw 
input without caring about the character boundaries, call the conversion 
function to convert it except for the few bytes at the end that might be 
an incomplete character, load the buffer with more data, and repeat.
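
To illustrate the calling pattern, a caller of such a function could 
look roughly like this. (This is just a sketch to show the idea, not 
code from the patches; the function name, exact signature and buffer 
handling are made up for illustration.)

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * convert_noerror() stands in for a conversion routine with the proposed
 * signature: with noError = true it converts what it can and returns the
 * number of source bytes consumed, instead of erroring out on a multi-byte
 * character that is cut off at the end of the buffer.
 */
typedef int (*convert_noerror_fn) (const unsigned char *src, int srclen,
                                   unsigned char *dst, int dstlen,
                                   bool noError);

static void
stream_convert(convert_noerror_fn convert_noerror, FILE *fp)
{
    unsigned char raw[65536];
    unsigned char converted[4 * 65536];
    int           leftover = 0;

    for (;;)
    {
        int         nread;
        int         avail;
        int         consumed;

        /* Top up the raw buffer without caring about character boundaries. */
        nread = (int) fread(raw + leftover, 1, sizeof(raw) - leftover, fp);
        avail = leftover + nread;
        if (avail == 0)
            break;              /* all input processed */

        /*
         * Convert everything we can; a truncated character at the end of
         * the buffer is simply left unconsumed instead of raising an error.
         */
        consumed = convert_noerror(raw, avail, converted, sizeof(converted), true);

        /* ... hand 'converted' over to the line-splitting code here ... */

        /* Move the unconsumed tail to the front and read more after it. */
        leftover = avail - consumed;
        memmove(raw, raw + consumed, leftover);

        if (nread == 0)
        {
            /*
             * EOF: whatever is left over is a genuinely incomplete or
             * invalid character, so call the conversion again with
             * noError = false to report the error properly.
             */
            break;
        }
    }
}

The point is that the caller never needs to know where the character 
boundaries are; the conversion function tells it how far it got.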

For encoding verification, add a new function that works similarly. It 
takes N bytes of raw input, verifies as much of it as possible, and 
returns the number of input bytes that were valid. In principle, this 
could've been implemented by calling the existing pg_encoding_mblen() 
and pg_encoding_verifymb() functions in a loop, but it would be too 
slow. The patch adds encoding-specific functions for that instead. The 
UTF-8 implementation is slightly optimized by basically inlining the 
pg_utf8_mblen() call; the other implementations are pretty naive.
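
For illustration, a naive version of that contract, built on the 
existing per-character routine, would look roughly like this (again just 
a sketch, not the patch; the function name is made up, and the 
encoding-specific versions avoid the per-character call overhead):

/*
 * Verify as many complete, valid characters as possible at the start of
 * 's', and return the number of bytes they cover.  Invalid or truncated
 * input at the end is simply not counted, so the caller can retry once
 * it has read more data (or report an error at end of input).
 */
static int
pg_encoding_verifystr_naive(int encoding, const char *s, int len)
{
    const char *start = s;

    while (len > 0)
    {
        int         charlen;

        /* pg_encoding_verifymb() returns -1 for invalid or truncated input */
        charlen = pg_encoding_verifymb(encoding, s, len);
        if (charlen < 0)
            break;

        s += charlen;
        len -= charlen;
    }

    return (int) (s - start);
}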

- Heikki

