Re: Bug in COPY FROM backslash escaping multi-byte chars - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Bug in COPY FROM backslash escaping multi-byte chars
Date
Msg-id c78113ac-8c25-5892-1269-d174d11e3a9f@iki.fi
Whole thread Raw
In response to Re: Bug in COPY FROM backslash escaping multi-byte chars  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On 04/02/2021 03:50, Kyotaro Horiguchi wrote:
> At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
>> On 03/02/2021 15:38, John Naylor wrote:
>>> On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi
>>> <mailto:hlinnaka@iki.fi>> wrote:
>>>   >
>>>   > Hi,
>>>   >
>>>   > While playing with COPY FROM refactorings in another thread, I noticed
>>>   > corner case where I think backslash escaping doesn't work correctly.
>>>   > Consider the following input:
>>>   >
>>>   > \么.foo
>>> I've seen multibyte delimiters in the wild, so it's not as outlandish
>>> as it seems.
>>
>> We don't actually support multi-byte characters as delimiters or quote
>> or escape characters:
>>
>> postgres=# copy copytest from 'foo' with (delimiter '么');
>> ERROR:  COPY delimiter must be a single one-byte character
>>
>>> The fix is simple enough, so +1.
>>
>> Thanks, I'll commit and backpatch shortly.
> 
> I'm not sure the assumption in the second hunk always holds, but
> that's fine at least with Shift-JIS and -2004 since they are two-byte
> encoding.

The assumption is that a multi-byte character cannot have a special 
meaning, as far as the loop in CopyReadLineText is concerned. The 
characters with special meaning are '\\', '\n' and '\r'. That hold 
regardless of encoding.

Thinking about this a bit more, I think the attached patch is slightly 
better. Normally in the loop, raw_buf_ptr points to the next byte to 
consume, and 'c' is the last consumed byte. At the end of the loop, we 
check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd 
and 4th byte if necessary. The crux of the bug is that after the 
"raw_buf_ptr++;" to skip the character after the backslash, we left c to 
'\\', even though we already consumed the first byte of the next 
character. Because of that, the end-of-the-loop check didn't correctly 
treat it as a multi-byte character. So a more straightforward fix is to 
set 'c' to the byte we skipped over.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Multiple full page writes in a single checkpoint?
Next
From: Andres Freund
Date:
Subject: Re: logical replication worker accesses catalogs in error context callback