On 04/02/2021 03:50, Kyotaro Horiguchi wrote:
> At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
>> On 03/02/2021 15:38, John Naylor wrote:
>>> On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi
>>> <mailto:hlinnaka@iki.fi>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > While playing with COPY FROM refactorings in another thread, I noticed
>>> > corner case where I think backslash escaping doesn't work correctly.
>>> > Consider the following input:
>>> >
>>> > \么.foo
>>> I've seen multibyte delimiters in the wild, so it's not as outlandish
>>> as it seems.
>>
>> We don't actually support multi-byte characters as delimiters or quote
>> or escape characters:
>>
>> postgres=# copy copytest from 'foo' with (delimiter '么');
>> ERROR: COPY delimiter must be a single one-byte character
>>
>>> The fix is simple enough, so +1.
>>
>> Thanks, I'll commit and backpatch shortly.
>
> I'm not sure the assumption in the second hunk always holds, but
> that's fine at least with Shift-JIS and -2004 since they are two-byte
> encoding.
The assumption is that a multi-byte character cannot have a special
meaning, as far as the loop in CopyReadLineText is concerned. The
characters with special meaning are '\\', '\n' and '\r'. That hold
regardless of encoding.
Thinking about this a bit more, I think the attached patch is slightly
better. Normally in the loop, raw_buf_ptr points to the next byte to
consume, and 'c' is the last consumed byte. At the end of the loop, we
check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd
and 4th byte if necessary. The crux of the bug is that after the
"raw_buf_ptr++;" to skip the character after the backslash, we left c to
'\\', even though we already consumed the first byte of the next
character. Because of that, the end-of-the-loop check didn't correctly
treat it as a multi-byte character. So a more straightforward fix is to
set 'c' to the byte we skipped over.
- Heikki