
From: Heikki Linnakangas
Subject: Perform COPY FROM encoding conversions in larger chunks
Msg-id: e7861509-3960-538a-9025-b75a61188e01@iki.fi
List: pgsql-hackers
I've been looking at the COPY FROM parsing code, trying to refactor it 
so that parallel COPY would be easier to implement. I haven't touched 
parallelism itself; I've just been looking for ways to smooth the path, 
and for ways to speed up COPY in general.

Currently, COPY FROM parses the input one line at a time. Each line is 
converted to the database encoding separately, or if the file encoding 
matches the database encoding, we just check that the input is valid for 
the encoding. It would be more efficient to do the encoding 
conversion/verification in larger chunks. At least potentially; the 
current conversion/verification implementations work one byte at a time, 
so it doesn't matter too much yet, but there are faster algorithms out 
there that use SIMD instructions or lookup tables, and those benefit 
from larger inputs.

So I'd like to change it so that the encoding conversion/verification is 
done before splitting the input into lines. The problem is that the 
conversion and verification functions throw an error on incomplete 
input, so we can't pass them a chunk of N raw bytes without knowing 
where the character boundaries are. The first step in this effort is to 
change the conversion and verification routines to allow that. Attached 
patches 0001-0004 do that:

For encoding conversions, change the signature of the conversion 
functions by adding a "bool noError" argument and making them return the 
number of input bytes successfully converted. That way, the conversion 
function can be called in a streaming fashion: load a buffer with raw 
input without caring about the character boundaries, call the conversion 
function to convert it except for the few bytes at the end that might be 
an incomplete character, load the buffer with more data, and repeat.
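
To illustrate the calling pattern, a caller of such a function could 
look roughly like this. (This is just a sketch to show the idea, not 
code from the patches; the function name, exact signature and buffer 
handling are made up for illustration.)

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * convert_noerror() stands in for a conversion routine with the proposed
 * signature: with noError = true it converts what it can and returns the
 * number of source bytes consumed, instead of erroring out on a multi-byte
 * character that is cut off at the end of the buffer.
 */
typedef int (*convert_noerror_fn) (const unsigned char *src, int srclen,
                                   unsigned char *dst, int dstlen,
                                   bool noError);

static void
stream_convert(convert_noerror_fn convert_noerror, FILE *fp)
{
    unsigned char raw[65536];
    unsigned char converted[4 * 65536];
    int           leftover = 0;

    for (;;)
    {
        int         nread;
        int         avail;
        int         consumed;

        /* Top up the raw buffer without caring about character boundaries. */
        nread = (int) fread(raw + leftover, 1, sizeof(raw) - leftover, fp);
        avail = leftover + nread;
        if (avail == 0)
            break;              /* all input processed */

        /*
         * Convert everything we can; a truncated character at the end of
         * the buffer is simply left unconsumed instead of raising an error.
         */
        consumed = convert_noerror(raw, avail, converted, sizeof(converted), true);

        /* ... hand 'converted' over to the line-splitting code here ... */

        /* Move the unconsumed tail to the front and read more after it. */
        leftover = avail - consumed;
        memmove(raw, raw + consumed, leftover);

        if (nread == 0)
        {
            /*
             * EOF: whatever is left over is a genuinely incomplete or
             * invalid character, so call the conversion again with
             * noError = false to report the error properly.
             */
            break;
        }
    }
}

The point is that the caller never needs to know where the character 
boundaries are; the conversion function tells it how far it got.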

For encoding verification, add a new function that works similarly. It 
takes N bytes of raw input, verifies as much of it as possible, and 
returns the number of input bytes that were valid. In principle, this 
could've been implemented by calling the existing pg_encoding_mblen() 
and pg_encoding_verifymb() functions in a loop, but it would be too 
slow. The patch adds encoding-specific functions for that instead. The 
UTF-8 implementation is slightly optimized by basically inlining the 
pg_utf8_mblen() call; the other implementations are pretty naive.
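
For illustration, a naive version of that contract, built on the 
existing per-character routine, would look roughly like this (again just 
a sketch, not the patch; the function name is made up, and the 
encoding-specific versions avoid the per-character call overhead):

/*
 * Verify as many complete, valid characters as possible at the start of
 * 's', and return the number of bytes they cover.  Invalid or truncated
 * input at the end is simply not counted, so the caller can retry once
 * it has read more data (or report an error at end of input).
 */
static int
pg_encoding_verifystr_naive(int encoding, const char *s, int len)
{
    const char *start = s;

    while (len > 0)
    {
        int         charlen;

        /* pg_encoding_verifymb() returns -1 for invalid or truncated input */
        charlen = pg_encoding_verifymb(encoding, s, len);
        if (charlen < 0)
            break;

        s += charlen;
        len -= charlen;
    }

    return (int) (s - start);
}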

- Heikki

