Re: Re: Perform COPY FROM encoding conversions in larger chunks - Mailing list pgsql-hackers

From: Chapman Flack
Subject: Re: Re: Perform COPY FROM encoding conversions in larger chunks
Msg-id: 608DB4BB.8080204@anastigmatix.net
In response to: Re: Perform COPY FROM encoding conversions in larger chunks (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On 04/01/21 05:27, Heikki Linnakangas wrote:
> I read through the patches one more time, made a few small comment fixes,
> and pushed.

Wow, this whole thread escaped my attention at the time, though my ears
would have perked right up if the subject had been something like
'improve encoding conversion API to stream a buffer at a time'. I think
this is of great interest beyond one particular use case in COPY FROM.
For example, it could limit the allocations needed when streaming a large
text value out to a client; it might be used to advantage with the recent
work on incrementally detoasting large values, and so on.

This part seems a little underdeveloped:

> * TODO: The conversion function interface is not great.  Firstly, it
> * would be nice to pass through the destination buffer size to the
> * conversion function, so that if you pass a shorter destination buffer, it
> * could still continue to fill up the whole buffer.  Currently, we have to
> * assume worst case expansion and stop the conversion short, even if there
> * is in fact space left in the destination buffer.  Secondly, it would be
> * nice to return the number of bytes written to the caller, to avoid a call
> * to strlen().
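
To spell out the worst-case assumption being complained about: because
the conversion function can't see the destination buffer's length, the
caller has to size the output for maximal expansion before converting
anything, along these lines (a sketch; MAX_CONVERSION_GROWTH is the
existing constant from src/include/mb/pg_wchar.h, currently 4, while
srclen and dst are just my illustrative names):

    /*
     * Assume every source byte could expand to MAX_CONVERSION_GROWTH
     * output bytes, plus room for a terminating NUL -- the same sizing
     * pg_do_encoding_conversion() uses today.
     */
    unsigned char *dst = palloc(srclen * MAX_CONVERSION_GROWTH + 1);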

If I understand correctly, this patch already makes a breaking change to
the conversion function API. If that's going to be the case anyway, I wonder
whether it's worth going further and restructuring the API to eliminate
this odd limitation.

There seems to be a sort of common shape that conversion APIs have evolved
toward, that can be seen in both the ICU4C converters [0] and in Java's [1].
This current tweak to our conversion API seems to get allllmmoooosst there,
but just not quite. For example, noError lets us keep control when
the function has stopped converting, but we don't find out why it
stopped.
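
For reference, if I'm reading the committed patch right, each conversion
function now has roughly this shape at the C level (paraphrased from the
in-tree converters; src_bytes_converted is my placeholder name):

    Datum
    utf8_to_iso8859_1(PG_FUNCTION_ARGS)
    {
        unsigned char *src = (unsigned char *) PG_GETARG_CSTRING(2);
        unsigned char *dest = (unsigned char *) PG_GETARG_CSTRING(3);
        int            len = PG_GETARG_INT32(4);
        bool           noError = PG_GETARG_BOOL(5);
        int            src_bytes_converted = 0;

        /* ... walk src, writing to dest, for up to len source bytes;
         * with noError, stop at invalid input instead of ereporting ... */

        PG_RETURN_INT32(src_bytes_converted); /* how far it got, not why */
    }

So the caller learns how many source bytes were consumed, but must
re-derive for itself whether the stop was end of input or bad input, and
the destination buffer's length never enters the picture at all.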

If we just went the rest of the way and structured the API like those
existing ones, then:

- it would be super easy to write wrappers around ICU4C converters, if
  there were any we wanted to use;

- I could very easily write wrappers presenting any PG-supported charset
  as a Java charset.

The essence of the API common to ICU4C and Java is this (a rough C
sketch of the shape follows the list):

1. You pass the function the address and length of a source buffer,
   the address and length of a destination buffer, and a flag that is true
   if you know there is no further input where this source buffer came from.
   (It's allowable to pass false and only then discover you really have no
   more input after all; then you just make one final call passing true.)

2. The function eats as much as it can of the source buffer, fills as much
   as it can of the destination buffer, and returns indicating one of four
   reasons it stopped:

   underflow - ran out of source buffer
   overflow  - ran out of destination buffer
   malformed - something in source buffer isn't valid in that representation
   unmappable - a valid codepoint not available in destination encoding

   Based on that, the caller refills the source buffer, or drains the
   destination buffer, or handles or reports the malformed or unmappable,
   then repeats.

3. The function should update pointers on return to indicate how much
   of the source buffer it consumed and how much of the destination buffer
   it filled.

4. If it left any bytes unconsumed in the source buffer, the caller must
   preserve them (perhaps moved to the front) for the next call.

5. The converter can have internal state (so it is an object in Java, or
   has a UConverter struct allocated in ICU4C, to have a place for its
   state). The state gets flushed on the final call where the flag is
   passed true. In many cases, the converter can be implemented without
   keeping internal state, if it simply leaves, for example, an
   incomplete sequence at the end of the source buffer unconsumed, so the
   caller will move it to the front and supply the rest. On the other hand,
   any unconsumed input after the final call with flush flag true must be
   treated as malformed.

6. On a malformed or unmappable return, the source buffer is left pointed
   at the start of the offending sequence and the length in bytes of
   that sequence is available for error reporting/recovery.
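
To make that concrete, here is roughly what such an interface might look
like in our terms. Every name here (pg_conv_state, pg_conv_result,
pg_do_conversion, errlen) is hypothetical, purely to illustrate the
shape, not a proposal for actual spellings:

    typedef struct pg_conv_state pg_conv_state;  /* converter's private state */

    typedef enum pg_conv_result
    {
        PG_CONV_UNDERFLOW,   /* ran out of source buffer */
        PG_CONV_OVERFLOW,    /* ran out of destination buffer */
        PG_CONV_MALFORMED,   /* invalid sequence in the source */
        PG_CONV_UNMAPPABLE   /* valid codepoint, no mapping in target */
    } pg_conv_result;

    /*
     * Convert as much as possible of [*src, srcend) into [*dst, dstend),
     * advancing *src and *dst past whatever was consumed and produced.
     * 'final' is true when the caller knows no more input is coming,
     * which also flushes any internal state.  On MALFORMED or UNMAPPABLE,
     * *src is left pointing at the offending sequence and *errlen is set
     * to its length in bytes.
     */
    extern pg_conv_result pg_do_conversion(pg_conv_state *state,
                                           const unsigned char **src,
                                           const unsigned char *srcend,
                                           unsigned char **dst,
                                           unsigned char *dstend,
                                           bool final,
                                           int *errlen);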

The efficient handling of states, returning updated pointers, and so on,
probably requires a function signature with 'internal' in it ... but
the current function signature already has 'internal', so that doesn't
seem like a deal-breaker.
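
A caller such as COPY FROM would then run the classic fill/drain loop.
Again purely illustrative, with refill_source_buffer,
drain_destination_buffer, and report_conversion_error standing in for
whatever the caller actually does at each of those points:

    for (;;)
    {
        pg_conv_result r = pg_do_conversion(state, &src, srcend,
                                            &dst, dstend,
                                            no_more_input, &errlen);

        if (r == PG_CONV_UNDERFLOW)
        {
            if (no_more_input)
                break;          /* done; converter state was flushed */
            /* keep any unconsumed tail bytes, then fetch more input */
            refill_source_buffer(&src, &srcend, &no_more_input);
        }
        else if (r == PG_CONV_OVERFLOW)
        {
            /* make room; resets dst to the front of the buffer */
            drain_destination_buffer(&dst);
        }
        else
        {
            /* MALFORMED or UNMAPPABLE: src points at the offending
             * sequence, which is errlen bytes long */
            report_conversion_error(r, src, errlen);
        }
    }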


Thoughts? It seems a shame to make a breaking change in the conversion
API, only to still end up with an API that "is not great" and is still
impedance-mismatched to other existing prominent conversion APIs.

Regards,
-Chap


[0] https://unicode-org.github.io/icu/userguide/conversion/converters.html#3-buffered-or-streamed

[1] https://docs.oracle.com/javase/9/docs/api/java/nio/charset/CharsetDecoder.html#decode-java.nio.ByteBuffer-java.nio.CharBuffer-boolean-


