Re: Re: Perform COPY FROM encoding conversions in larger chunks - Mailing list pgsql-hackers
From:           Chapman Flack
Subject:        Re: Re: Perform COPY FROM encoding conversions in larger chunks
Date:
Msg-id:         608DB4BB.8080204@anastigmatix.net
In response to: Re: Perform COPY FROM encoding conversions in larger chunks (Heikki Linnakangas <hlinnaka@iki.fi>)
List:           pgsql-hackers
On 04/01/21 05:27, Heikki Linnakangas wrote:
> I read through the patches one more time, made a few small comment fixes,
> and pushed.

Wow, this whole thread escaped my attention at the time, though my ears
would have perked right up if the subject had been something like
'improve encoding conversion API to stream a buffer at a time'.

I think this is of great interest beyond one particular use case in COPY
FROM. For example, it could limit the allocations needed when streaming a
large text value out to a client; it might be used to advantage with the
recent work on incrementally detoasting large values, and so on.

This part seems a little underdeveloped:

> * TODO: The conversion function interface is not great. Firstly, it
> * would be nice to pass through the destination buffer size to the
> * conversion function, so that if you pass a shorter destination buffer, it
> * could still continue to fill up the whole buffer. Currently, we have to
> * assume worst case expansion and stop the conversion short, even if there
> * is in fact space left in the destination buffer. Secondly, it would be
> * nice to return the number of bytes written to the caller, to avoid a call
> * to strlen().

If I understand correctly, this patch already makes a breaking change to
the conversion function API. If that's going to be the case anyway, I
wonder if it's worth going further and changing the API to eliminate this
odd limitation.

There seems to be a sort of common shape that conversion APIs have evolved
toward, which can be seen in both the ICU4C converters [0] and in Java's
[1]. This current tweak to our conversion API seems to get allllmmoooosst
there, but just not quite. For example, noError allows us to keep control
when the function has stopped converting, but we don't find out the reason
it stopped. If we just went the rest of the way and structured the API
like those existing ones, then:

- it would be super easy to write wrappers around ICU4C converters, if
  there were any we wanted to use;

- I could very easily write wrappers presenting any PG-supported charset
  as a Java charset.

The essence of the API common to ICU4C and Java is this (a sketch of the
caller's side follows the list):

1. You pass the function the address and length of a source buffer, the
   address and length of a destination buffer, and a flag that is true if
   you know there is no further input where this source buffer came from.
   (It's allowable to pass false and only then discover you really have no
   more input after all; then you just make one final call passing true.)

2. The function eats as much as it can of the source buffer, fills as much
   as it can of the destination buffer, and returns indicating one of four
   reasons it stopped:

   underflow  - ran out of source buffer
   overflow   - ran out of destination buffer
   malformed  - something in source buffer isn't valid in that representation
   unmappable - a valid codepoint not available in destination encoding

   Based on that, the caller refills the source buffer, or drains the
   destination buffer, or handles or reports the malformed or unmappable,
   then repeats.

3. The function should update pointers on return to indicate how much of
   the source buffer it consumed and how much of the destination buffer
   it filled.

4. If it left any bytes unconsumed in the source buffer, the caller must
   preserve them (perhaps moved to the front) for the next call.

5. The converter can have internal state (so it is an object in Java, or
   has a UConverter struct allocated in ICU4C, to have a place for its
   state). The state gets flushed on the final call where the flag is
   passed true. In many cases, the converter can be implemented without
   keeping internal state, if it simply leaves, for example, an incomplete
   sequence at the end of the source buffer unconsumed, so the caller will
   move it to the front and supply the rest. On the other hand, any
   unconsumed input after the final call with the flush flag true must be
   treated as malformed.

6. On a malformed or unmappable return, the source buffer is left pointed
   at the start of the offending sequence, and the length in bytes of that
   sequence is available for error reporting/recovery.
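To make that concrete, here is a rough sketch in C of what the caller's
side of such an API could look like. None of this is actual PostgreSQL
code; ConvState, convert_buffered, and the plain-stdio refill/drain are
all hypothetical stand-ins for whatever we would really use:

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  typedef struct ConvState ConvState;   /* opaque converter state (point 5) */

  typedef enum
  {
      CONV_UNDERFLOW,                   /* ran out of source buffer */
      CONV_OVERFLOW,                    /* ran out of destination buffer */
      CONV_MALFORMED,                   /* invalid sequence in source encoding */
      CONV_UNMAPPABLE                   /* valid codepoint, absent in target */
  } ConvResult;

  /*
   * Hypothetical converter entry point.  Advances *src and *dst past what
   * it consumed/produced (point 3); on MALFORMED/UNMAPPABLE, leaves *src
   * at the offending sequence and sets *error_len to its length (point 6).
   */
  extern ConvResult convert_buffered(ConvState *cs,
                                     const char **src, const char *src_end,
                                     char **dst, char *dst_end,
                                     bool last_call, int *error_len);

  static void
  drive_conversion(ConvState *cs, FILE *in, FILE *out)
  {
      char        srcbuf[8192];
      char        dstbuf[8192];
      const char *sp = srcbuf;
      char       *sfill = srcbuf;       /* end of valid bytes in srcbuf */
      bool        last_call = false;

      for (;;)
      {
          char       *dp = dstbuf;
          int         error_len;
          ConvResult  res;

          res = convert_buffered(cs, &sp, sfill,
                                 &dp, dstbuf + sizeof(dstbuf),
                                 last_call, &error_len);

          fwrite(dstbuf, 1, dp - dstbuf, out);  /* drain what was produced */

          switch (res)
          {
              case CONV_UNDERFLOW:
                  if (last_call)
                      return;           /* input done, state flushed */
                  /* keep any unconsumed tail (point 4), then refill */
                  memmove(srcbuf, sp, sfill - sp);
                  sfill = srcbuf + (sfill - sp);
                  sp = srcbuf;
                  {
                      size_t  n = fread(sfill, 1,
                                        srcbuf + sizeof(srcbuf) - sfill, in);

                      if (n == 0)
                          last_call = true;  /* one final call, flag true */
                      sfill += n;
                  }
                  break;
              case CONV_OVERFLOW:
                  break;                /* just loop; dstbuf is empty again */
              case CONV_MALFORMED:
              case CONV_UNMAPPABLE:
                  fprintf(stderr, "bad sequence of %d bytes\n", error_len);
                  sp += error_len;      /* or ereport; here we just skip it */
                  break;
          }
      }
  }

Note there is no worst-case-expansion arithmetic anywhere in that loop: an
overflow just pauses the conversion until the destination is drained, which
is exactly the limitation the TODO comment above is describing.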
The efficient handling of state, returning updated pointers, and so on,
probably requires a function signature with 'internal' in it ... but the
current function signature already has 'internal', so that doesn't seem
like a deal-breaker.

Thoughts? It seems a shame to make a breaking change in the conversion
API, only to still end up with an API that "is not great" and is still
impedance-mismatched to other prominent existing conversion APIs.

Regards,
-Chap

[0] https://unicode-org.github.io/icu/userguide/conversion/converters.html#3-buffered-or-streamed
[1] https://docs.oracle.com/javase/9/docs/api/java/nio/charset/CharsetDecoder.html#decode-java.nio.ByteBuffer-java.nio.CharBuffer-boolean-
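P.S. For completeness, here is the other side of the coin: a sketch of
what a converter itself can look like when it keeps no internal state, per
point 5, simply declining to consume a truncated sequence at the end of
the buffer. This is a toy UTF-8 to Latin-1 converter (overlong-form checks
omitted for brevity), reusing the hypothetical ConvResult from the sketch
above; again, none of it is actual PG code:

  static ConvResult
  utf8_to_latin1_stream(const unsigned char **src, const unsigned char *src_end,
                        unsigned char **dst, unsigned char *dst_end,
                        bool last_call, int *error_len)
  {
      const unsigned char *sp = *src;
      unsigned char *dp = *dst;
      ConvResult  result = CONV_UNDERFLOW;

      while (sp < src_end)
      {
          unsigned int cp;
          int          seqlen;

          /* classify the lead byte (overlong-form checks omitted) */
          if (sp[0] < 0x80)
              seqlen = 1, cp = sp[0];
          else if ((sp[0] & 0xE0) == 0xC0)
              seqlen = 2, cp = sp[0] & 0x1F;
          else if ((sp[0] & 0xF0) == 0xE0)
              seqlen = 3, cp = sp[0] & 0x0F;
          else if ((sp[0] & 0xF8) == 0xF0)
              seqlen = 4, cp = sp[0] & 0x07;
          else
          {
              *error_len = 1;           /* bad lead byte (point 6) */
              result = CONV_MALFORMED;
              break;
          }

          if (src_end - sp < seqlen)
          {
              if (!last_call)
                  break;                /* underflow: leave the tail (point 4) */
              *error_len = (int) (src_end - sp);
              result = CONV_MALFORMED;  /* truncated at end of input (point 5) */
              break;
          }

          /* check continuation bytes and accumulate the codepoint */
          for (int i = 1; i < seqlen; i++)
          {
              if ((sp[i] & 0xC0) != 0x80)
              {
                  *error_len = i;
                  result = CONV_MALFORMED;
                  goto done;
              }
              cp = (cp << 6) | (sp[i] & 0x3F);
          }

          if (cp > 0xFF)
          {
              *error_len = seqlen;
              result = CONV_UNMAPPABLE; /* valid, but not in Latin-1 */
              break;
          }

          if (dp >= dst_end)
          {
              result = CONV_OVERFLOW;   /* caller drains and calls again */
              break;
          }

          *dp++ = (unsigned char) cp;
          sp += seqlen;
      }

  done:
      *src = sp;                        /* report progress in place (point 3) */
      *dst = dp;
      return result;
  }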