Thread: psql blows up on BOM character sequence

psql blows up on BOM character sequence

From
Jim Nasby
Date:
See http://www.postgresql.org/message-id/4AFEAB39.3000009@dunslane.net

This is still broken as of fairly recent HEAD; any objections to adding it to TODO?
--
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: psql blows up on BOM character sequence

From
Merlin Moncure
Date:
On Fri, Mar 21, 2014 at 4:02 PM, Jim Nasby <jim@nasby.net> wrote:
> See http://www.postgresql.org/message-id/4AFEAB39.3000009@dunslane.net
>
> This is still broken as of fairly recent HEAD; any objections to adding it to TODO?

Agreed: this is a major annoyance.

merlin



Re: psql blows up on BOM character sequence

From
Andrew Dunstan
Date:
On 03/21/2014 05:06 PM, Merlin Moncure wrote:
> On Fri, Mar 21, 2014 at 4:02 PM, Jim Nasby <jim@nasby.net> wrote:
>> See http://www.postgresql.org/message-id/4AFEAB39.3000009@dunslane.net
>>
>> This is still broken as of fairly recent HEAD; any objections to adding it to TODO?
> Agreed: this is a major annoyance.
>



Surely if it were really a major annoyance, someone would have sent code 
to fix it during the last 4 years and more since the above.

I suspect it's a minor annoyance :-)

But by all means add it to the TODO list if it's not there already.


cheers

andrew



Re: psql blows up on BOM character sequence

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Surely if it were really a major annoyance, someone would have sent code 
> to fix it during the last 4 years and more since the above.

The code would probably be pretty trivial, *if* we had consensus on
what the behavior ought to be.  I'm not sure if we do.  People who
only use Unicode would probably like it if BOMs were unconditionally
swallowed, whether or not psql thinks the client_encoding is UTF8.
(And I seem to recall somebody even proposing that finding a BOM
be cause to switch the client_encoding to UTF8.)  However, these
ideas are complete nonstarters for people who habitually use other
encodings.

The argument about SQL syntax carries no weight for me, at least --- what
about COPY data files?  And I don't really want to suppose that \i can
never be used to insert a portion of a SQL command, either.

I'd be okay with swallowing a leading BOM if and only if client encoding
is UTF8.  This should apply to any file psql reads, whether script or
data.
        regards, tom lane



Re: psql blows up on BOM character sequence

From
Merlin Moncure
Date:
On Fri, Mar 21, 2014 at 4:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd be okay with swallowing a leading BOM if and only if client encoding
> is UTF8.  This should apply to any file psql reads, whether script or
> data.

Yeah.  The one case that doesn't solve is:

cat f1.sql f2.sql | psql ...

Which is common usage in deployment systems where combining things
inside a single transaction scope is important.

There is no way for psql to handle that case though unless you'd strip
*all* BOMs encountered.  Compounding this problem is that there's no
practical way AFAIK to send multiple file to psql via single command
line invocation.  If you pass multiple -f arguments all but one is
ignored.

merlin



Re: psql blows up on BOM character sequence

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> There is no way for psql to handle that case though unless you'd strip
> *all* BOMs encountered.  Compounding this problem is that there's no
> practical way AFAIK to send multiple file to psql via single command
> line invocation.  If you pass multiple -f arguments all but one is
> ignored.

Well, that seems like a solvable but rather independent problem.
I guess one issue is how you'd define the meaning of --single ...
one transaction per run, or one per file?
        regards, tom lane



Re: psql blows up on BOM character sequence

From
"David E. Wheeler"
Date:
On Mar 21, 2014, at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

> Surely if it were really a major annoyance, someone would have sent code to fix it during the last 4 years and more
sincethe above. 
>
> I suspect it's a minor annoyance :-)
>
> But by all means add it to the TODO list if it's not there already.

I have cleaned up many a BOM added to files that made psql blow up. I think PGAdmin III was a culprit, though I’m not
sure(I don’t use, it, cleaned up after coworkers who do). 

David




Re: psql blows up on BOM character sequence

From
Jim Nasby
Date:
On 3/21/14, 8:13 PM, David E. Wheeler wrote:
> On Mar 21, 2014, at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> Surely if it were really a major annoyance, someone would have sent code to fix it during the last 4 years and more
sincethe above.
 
>>
>> I suspect it's a minor annoyance :-)
>>
>> But by all means add it to the TODO list if it's not there already.
>
> I have cleaned up many a BOM added to files that made psql blow up. I think PGAdmin III was a culprit, though I’m not
sure(I don’t use, it, cleaned up after coworkers who do).
 

Yes, my coworker that figured out what the problem was said the culprit here is actually pgAdmin. :(
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: psql blows up on BOM character sequence

From
Jim Nasby
Date:
On 3/21/14, 4:54 PM, Tom Lane wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> There is no way for psql to handle that case though unless you'd strip
>> *all* BOMs encountered.  Compounding this problem is that there's no
>> practical way AFAIK to send multiple file to psql via single command
>> line invocation.  If you pass multiple -f arguments all but one is
>> ignored.
>
> Well, that seems like a solvable but rather independent problem.
> I guess one issue is how you'd define the meaning of --single ...
> one transaction per run, or one per file?

Well, if you're catting multiple files into psql -1, you'd get all the files in one transaction, right? So I'd say
that'swhat should happen.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: psql blows up on BOM character sequence

From
Guillaume Lelarge
Date:
On Sat, 2014-03-22 at 11:23 -0500, Jim Nasby wrote:
> On 3/21/14, 8:13 PM, David E. Wheeler wrote:
> > On Mar 21, 2014, at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> >
> >> Surely if it were really a major annoyance, someone would have sent code to fix it during the last 4 years and
moresince the above.
 
> >>
> >> I suspect it's a minor annoyance :-)
> >>
> >> But by all means add it to the TODO list if it's not there already.
> >
> > I have cleaned up many a BOM added to files that made psql blow up. I think PGAdmin III was a culprit, though I’m
notsure (I don’t use, it, cleaned up after coworkers who do).
 
> 
> Yes, my coworker that figured out what the problem was said the culprit here is actually pgAdmin. :(

Just a quick comment on this. Yes, pgAdmin always added a BOM in every
SQL files it wrote. That being said, since 2010, pgAdmin has an option
that allows the user to request the BOM writing (IOW, it's disabled by
default for new installations of pgAdmin). See in the
preferences/options window, "Query tool / Query file" sub-item, and
"Write BOM for UTF files" checkbox. Make sure it's unchecked.

Either your coworker has an older release (that would be older than 1.14
IIRC), or he didn't change the setting to make it work like he would
like.


-- 
Guillaume
http://blog.guillaume.lelarge.info
http://www.dalibo.com




Re: psql blows up on BOM character sequence

From
"David E. Wheeler"
Date:
On Mar 23, 2014, at 8:03, Guillaume Lelarge <guillaume@lelarge.info> wrote:

Just a quick comment on this. Yes, pgAdmin always added a BOM in every
SQL files it wrote.

From http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom:

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

Re: psql blows up on BOM character sequence

From
Jim Nasby
Date:
On 3/22/14, 11:26 AM, Jim Nasby wrote:
> On 3/21/14, 4:54 PM, Tom Lane wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> There is no way for psql to handle that case though unless you'd strip
>>> *all* BOMs encountered.  Compounding this problem is that there's no
>>> practical way AFAIK to send multiple file to psql via single command
>>> line invocation.  If you pass multiple -f arguments all but one is
>>> ignored.
>>
>> Well, that seems like a solvable but rather independent problem.
>> I guess one issue is how you'd define the meaning of --single ...
>> one transaction per run, or one per file?
>
> Well, if you're catting multiple files into psql -1, you'd get all the files in one transaction, right? So I'd say
that'swhat should happen.
 

It occurs to me that we're going about this the wrong way...

The error here isn't being generated by psql; it's generated by the backend. In the context of a statement (and not,
say,a COPY command).
 

So instead of trying to handle this on the psql side[1], I think we need to handle it in the backend; specifically in
theparser. Is there an easy way to get the parser to ignore the BOM character in the context of commands (but not in
strings)?

[1]: Obviously, BOM could still screw up a psql command like \d. We'd want to address that as well; but I suspect
backendsare the more common scenario.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: psql blows up on BOM character sequence

From
Andrew Dunstan
Date:
On 03/24/2014 02:50 PM, Jim Nasby wrote:
> On 3/22/14, 11:26 AM, Jim Nasby wrote:
>> On 3/21/14, 4:54 PM, Tom Lane wrote:
>>> Merlin Moncure <mmoncure@gmail.com> writes:
>>>> There is no way for psql to handle that case though unless you'd strip
>>>> *all* BOMs encountered.  Compounding this problem is that there's no
>>>> practical way AFAIK to send multiple file to psql via single command
>>>> line invocation.  If you pass multiple -f arguments all but one is
>>>> ignored.
>>>
>>> Well, that seems like a solvable but rather independent problem.
>>> I guess one issue is how you'd define the meaning of --single ...
>>> one transaction per run, or one per file?
>>
>> Well, if you're catting multiple files into psql -1, you'd get all 
>> the files in one transaction, right? So I'd say that's what should 
>> happen.
>
> It occurs to me that we're going about this the wrong way...
>
> The error here isn't being generated by psql; it's generated by the 
> backend. In the context of a statement (and not, say, a COPY command).
>
> So instead of trying to handle this on the psql side[1], I think we 
> need to handle it in the backend; specifically in the parser. Is there 
> an easy way to get the parser to ignore the BOM character in the 
> context of commands (but not in strings)?
>
> [1]: Obviously, BOM could still screw up a psql command like \d. We'd 
> want to address that as well; but I suspect backends are the more 
> common scenario.


But what about COPY files? I don't see why there is less of a case for 
eating a leading BOM for a COPY file (or COPY stdin for that matter, 
given that it can come from \copy) than for an SQL file.

I suspect suspect trying to do this in the parser will be quite messy. 
This needs to happen before the input is converted to the server 
encoding, I think.

cheers

andrew




Re: psql blows up on BOM character sequence

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I suspect suspect trying to do this in the parser will be quite messy. 
> This needs to happen before the input is converted to the server 
> encoding, I think.

Indeed --- what if the server isn't using utf8 internally?

And a larger point is that the server has no idea where the file
boundaries are.  If we were to do this server-side, we'd essentially
end up discarding BOM anywhere, which is more libertine than I care
to be.
        regards, tom lane



Re: psql blows up on BOM character sequence

From
Merlin Moncure
Date:
On Mon, Mar 24, 2014 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I suspect suspect trying to do this in the parser will be quite messy.
>> This needs to happen before the input is converted to the server
>> encoding, I think.
>
> Indeed --- what if the server isn't using utf8 internally?
>
> And a larger point is that the server has no idea where the file
> boundaries are.  If we were to do this server-side, we'd essentially
> end up discarding BOM anywhere, which is more libertine than I care
> to be.

Right -- I had a feeling you'd say that.  That's why the best solution
ISTM is to allow psql to be invoked in such a way that it *does* know
the file boundaries for consolidated scripts; this means better
handling of multiple file arguments.  psql -1 already requires '-f' to
work (vs cat foo.sql | psql) and that's pretty reasonable.  BOM
handling fixes should probably be injected in cases where the precise
beginning points of the file are known, which AFAICT are \i and -f.
So, in short, it seems prudent to:

1. make multiple -f invocation work (with -1 spanning)
2. strip BOM from -f or \i foo.sql if it's there

That will fix all non redirection usages.  Cases involving redirection
are not psql's bailiwick.

merlin



Re: psql blows up on BOM character sequence

From
Merlin Moncure
Date:
On Mon, Mar 24, 2014 at 2:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> psql -1 already requires '-f' to work

actually, it doesn't.  this was fixed recently.

merlin



Re: psql blows up on BOM character sequence

From
Jim Nasby
Date:
On 3/24/14, 1:59 PM, Andrew Dunstan wrote:
>> It occurs to me that we're going about this the wrong way...
>>
>> The error here isn't being generated by psql; it's generated by the backend. In the context of a statement (and not,
say,a COPY command).
 
>>
>> So instead of trying to handle this on the psql side[1], I think we need to handle it in the backend; specifically
inthe parser. Is there an easy way to get the parser to ignore the BOM character in the context of commands (but not in
strings)?
>>
>> [1]: Obviously, BOM could still screw up a psql command like \d. We'd want to address that as well; but I suspect
backendsare the more common scenario.
 
>
>
> But what about COPY files? I don't see why there is less of a case for eating a leading BOM for a COPY file (or COPY
stdinfor that matter, given that it can come from \copy) than for an SQL file.
 

Wait... I thought that was one of the objections... that we wanted to leave a BOM in something like a COPY untouched?
Ifthat's not the case, why not just strip BOM wherever it shows up in psql input? (Granted, not good for \copy or copy;
performance,so we might want to special case those, but that doesn't seem unreasonable...)
 

> I suspect suspect trying to do this in the parser will be quite messy. This needs to happen before the input is
convertedto the server encoding, I think.
 

My hope was that there's a point in the parser where we know whether we're dealing with a command strong or raw data,
andthat we'd be able to only strip this from command strings... or better yet, get the code that looks for a command
stringto simply ignore BOM when it's parsing.
 

Uh... could we just treat BOM as another whitespace character? ISTM the case is basically the same: we don't want "
INSERT... VALUES( '  extra  spaces  ' )  ;  " to blow up because of extra white space, but obviously '  extra  spaces
'needs to stay intact
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: psql blows up on BOM character sequence

From
Tom Lane
Date:
Jim Nasby <jim@nasby.net> writes:
> Wait... I thought that was one of the objections... that we wanted to
> leave a BOM in something like a COPY untouched?

I think most of us are okay with stripping a BOM that appears at the
*beginning* of a text file (assuming there's reason to believe the file
is in UTF8 encoding).  BOM sequences embedded later in the file are a lot
more debatable, and I for one don't want to assume those can be dropped.
I don't know of any legitimate usage of such cases, and think it's
probably better to report an encoding error.

> Uh... could we just treat BOM as another whitespace character?

A BOM is *most certainly not* whitespace.  The only even semi-legitimate
usage it has in UTF8 is as a file encoding marker.  You can bet that the
user whose text editor made the file did not think he had whitespace at
the front.  Anyway, your proposition that leading whitespace is ignorable
fails completely for data files.
        regards, tom lane



Re: psql blows up on BOM character sequence

From
Tatsuo Ishii
Date:
>> Just a quick comment on this. Yes, pgAdmin always added a BOM in every
>> SQL files it wrote.
> 
> From http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom:
> 
> According to the Unicode standard, the BOM for UTF-8 files is not recommended:
> 
> 2.6 Encoding Schemes
> 
> ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data
isconverted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order
Mark”subsection in Section 16.8, Specials, for more information.
 

Right. I think unconditionally adding BOM to a file is evil.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: psql blows up on BOM character sequence

From
Tatsuo Ishii
Date:
> The code would probably be pretty trivial, *if* we had consensus on
> what the behavior ought to be.  I'm not sure if we do.  People who
> only use Unicode would probably like it if BOMs were unconditionally
> swallowed, whether or not psql thinks the client_encoding is UTF8.
> (And I seem to recall somebody even proposing that finding a BOM
> be cause to switch the client_encoding to UTF8.)

This is a bad idea. ISO 8859-1 uses 0xfe and 0xff (BOM) for some
characters.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: psql blows up on BOM character sequence

From
Andrew Dunstan
Date:
On 03/24/2014 08:28 PM, Tatsuo Ishii wrote:
>> The code would probably be pretty trivial, *if* we had consensus on
>> what the behavior ought to be.  I'm not sure if we do.  People who
>> only use Unicode would probably like it if BOMs were unconditionally
>> swallowed, whether or not psql thinks the client_encoding is UTF8.
>> (And I seem to recall somebody even proposing that finding a BOM
>> be cause to switch the client_encoding to UTF8.)
> This is a bad idea. ISO 8859-1 uses 0xfe and 0xff (BOM) for some
> characters.
>
>


Yeah, I think there is no consensus to do anything unless the client 
encoding is UTF8.

cheers

andrew



Re: psql blows up on BOM character sequence

From
Craig Ringer
Date:
On 03/25/2014 02:50 AM, Jim Nasby wrote:
> So instead of trying to handle this on the psql side[1], I think we need
> to handle it in the backend; specifically in the parser. Is there an
> easy way to get the parser to ignore the BOM character in the context of
> commands (but not in strings)?

I disagree. The server deals with SQL statements and client data
streams, it doesn't deal with files.

The BOM is an artifact of *files*. You don't expect to see a BOM in a
UTF-8 string passed to a function call in a library; nor should you
expect one to be passed to you on a network protocol that isn't about
exchanging files.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: psql blows up on BOM character sequence

From
Craig Ringer
Date:
On 03/25/2014 07:05 AM, Tom Lane wrote:
> Jim Nasby <jim@nasby.net> writes:
>> Wait... I thought that was one of the objections... that we wanted to
>> leave a BOM in something like a COPY untouched?
> 
> I think most of us are okay with stripping a BOM that appears at the
> *beginning* of a text file (assuming there's reason to believe the file
> is in UTF8 encoding).  BOM sequences embedded later in the file are a lot
> more debatable, and I for one don't want to assume those can be dropped.
> I don't know of any legitimate usage of such cases, and think it's
> probably better to report an encoding error.

Yep, it's absolutely an encoding error IMO.

Something like:

ERROR: Invalid UTF-8 - probable UTF-8 byte-order mark detected
mid-data-stream.
HINT: Multiple files with byte-order marks were probably concatenated
with a tool that is not Unicode-aware.




-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services