Thread: psql blows up on BOM character sequence
See http://www.postgresql.org/message-id/4AFEAB39.3000009@dunslane.net This is still broken as of fairly recent HEAD; any objections to adding it to TODO? -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Fri, Mar 21, 2014 at 4:02 PM, Jim Nasby <jim@nasby.net> wrote: > See http://www.postgresql.org/message-id/4AFEAB39.3000009@dunslane.net > > This is still broken as of fairly recent HEAD; any objections to adding it to TODO? Agreed: this is a major annoyance. merlin
On 03/21/2014 05:06 PM, Merlin Moncure wrote: > On Fri, Mar 21, 2014 at 4:02 PM, Jim Nasby <jim@nasby.net> wrote: >> See http://www.postgresql.org/message-id/4AFEAB39.3000009@dunslane.net >> >> This is still broken as of fairly recent HEAD; any objections to adding it to TODO? > Agreed: this is a major annoyance. > Surely if it were really a major annoyance, someone would have sent code to fix it during the last 4 years and more since the above. I suspect it's a minor annoyance :-) But by all means add it to the TODO list if it's not there already. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Surely if it were really a major annoyance, someone would have sent code > to fix it during the last 4 years and more since the above. The code would probably be pretty trivial, *if* we had consensus on what the behavior ought to be. I'm not sure if we do. People who only use Unicode would probably like it if BOMs were unconditionally swallowed, whether or not psql thinks the client_encoding is UTF8. (And I seem to recall somebody even proposing that finding a BOM be cause to switch the client_encoding to UTF8.) However, these ideas are complete nonstarters for people who habitually use other encodings. The argument about SQL syntax carries no weight for me, at least --- what about COPY data files? And I don't really want to suppose that \i can never be used to insert a portion of a SQL command, either. I'd be okay with swallowing a leading BOM if and only if client encoding is UTF8. This should apply to any file psql reads, whether script or data. regards, tom lane
On Fri, Mar 21, 2014 at 4:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'd be okay with swallowing a leading BOM if and only if client encoding > is UTF8. This should apply to any file psql reads, whether script or > data. Yeah. The one case that doesn't solve is: cat f1.sql f2.sql | psql ... Which is common usage in deployment systems where combining things inside a single transaction scope is important. There is no way for psql to handle that case though unless you'd strip *all* BOMs encountered. Compounding this problem is that there's no practical way AFAIK to send multiple file to psql via single command line invocation. If you pass multiple -f arguments all but one is ignored. merlin
Merlin Moncure <mmoncure@gmail.com> writes: > There is no way for psql to handle that case though unless you'd strip > *all* BOMs encountered. Compounding this problem is that there's no > practical way AFAIK to send multiple file to psql via single command > line invocation. If you pass multiple -f arguments all but one is > ignored. Well, that seems like a solvable but rather independent problem. I guess one issue is how you'd define the meaning of --single ... one transaction per run, or one per file? regards, tom lane
On Mar 21, 2014, at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > Surely if it were really a major annoyance, someone would have sent code to fix it during the last 4 years and more sincethe above. > > I suspect it's a minor annoyance :-) > > But by all means add it to the TODO list if it's not there already. I have cleaned up many a BOM added to files that made psql blow up. I think PGAdmin III was a culprit, though I’m not sure(I don’t use, it, cleaned up after coworkers who do). David
On 3/21/14, 8:13 PM, David E. Wheeler wrote: > On Mar 21, 2014, at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > >> Surely if it were really a major annoyance, someone would have sent code to fix it during the last 4 years and more sincethe above. >> >> I suspect it's a minor annoyance :-) >> >> But by all means add it to the TODO list if it's not there already. > > I have cleaned up many a BOM added to files that made psql blow up. I think PGAdmin III was a culprit, though I’m not sure(I don’t use, it, cleaned up after coworkers who do). Yes, my coworker that figured out what the problem was said the culprit here is actually pgAdmin. :( -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 3/21/14, 4:54 PM, Tom Lane wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> There is no way for psql to handle that case though unless you'd strip >> *all* BOMs encountered. Compounding this problem is that there's no >> practical way AFAIK to send multiple file to psql via single command >> line invocation. If you pass multiple -f arguments all but one is >> ignored. > > Well, that seems like a solvable but rather independent problem. > I guess one issue is how you'd define the meaning of --single ... > one transaction per run, or one per file? Well, if you're catting multiple files into psql -1, you'd get all the files in one transaction, right? So I'd say that'swhat should happen. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Sat, 2014-03-22 at 11:23 -0500, Jim Nasby wrote: > On 3/21/14, 8:13 PM, David E. Wheeler wrote: > > On Mar 21, 2014, at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > >> Surely if it were really a major annoyance, someone would have sent code to fix it during the last 4 years and moresince the above. > >> > >> I suspect it's a minor annoyance :-) > >> > >> But by all means add it to the TODO list if it's not there already. > > > > I have cleaned up many a BOM added to files that made psql blow up. I think PGAdmin III was a culprit, though I’m notsure (I don’t use, it, cleaned up after coworkers who do). > > Yes, my coworker that figured out what the problem was said the culprit here is actually pgAdmin. :( Just a quick comment on this. Yes, pgAdmin always added a BOM in every SQL files it wrote. That being said, since 2010, pgAdmin has an option that allows the user to request the BOM writing (IOW, it's disabled by default for new installations of pgAdmin). See in the preferences/options window, "Query tool / Query file" sub-item, and "Write BOM for UTF files" checkbox. Make sure it's unchecked. Either your coworker has an older release (that would be older than 1.14 IIRC), or he didn't change the setting to make it work like he would like. -- Guillaume http://blog.guillaume.lelarge.info http://www.dalibo.com
On Mar 23, 2014, at 8:03, Guillaume Lelarge <guillaume@lelarge.info> wrote:
Just a quick comment on this. Yes, pgAdmin always added a BOM in every
SQL files it wrote.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
2.6 Encoding Schemes
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
On 3/22/14, 11:26 AM, Jim Nasby wrote: > On 3/21/14, 4:54 PM, Tom Lane wrote: >> Merlin Moncure <mmoncure@gmail.com> writes: >>> There is no way for psql to handle that case though unless you'd strip >>> *all* BOMs encountered. Compounding this problem is that there's no >>> practical way AFAIK to send multiple file to psql via single command >>> line invocation. If you pass multiple -f arguments all but one is >>> ignored. >> >> Well, that seems like a solvable but rather independent problem. >> I guess one issue is how you'd define the meaning of --single ... >> one transaction per run, or one per file? > > Well, if you're catting multiple files into psql -1, you'd get all the files in one transaction, right? So I'd say that'swhat should happen. It occurs to me that we're going about this the wrong way... The error here isn't being generated by psql; it's generated by the backend. In the context of a statement (and not, say,a COPY command). So instead of trying to handle this on the psql side[1], I think we need to handle it in the backend; specifically in theparser. Is there an easy way to get the parser to ignore the BOM character in the context of commands (but not in strings)? [1]: Obviously, BOM could still screw up a psql command like \d. We'd want to address that as well; but I suspect backendsare the more common scenario. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 03/24/2014 02:50 PM, Jim Nasby wrote: > On 3/22/14, 11:26 AM, Jim Nasby wrote: >> On 3/21/14, 4:54 PM, Tom Lane wrote: >>> Merlin Moncure <mmoncure@gmail.com> writes: >>>> There is no way for psql to handle that case though unless you'd strip >>>> *all* BOMs encountered. Compounding this problem is that there's no >>>> practical way AFAIK to send multiple file to psql via single command >>>> line invocation. If you pass multiple -f arguments all but one is >>>> ignored. >>> >>> Well, that seems like a solvable but rather independent problem. >>> I guess one issue is how you'd define the meaning of --single ... >>> one transaction per run, or one per file? >> >> Well, if you're catting multiple files into psql -1, you'd get all >> the files in one transaction, right? So I'd say that's what should >> happen. > > It occurs to me that we're going about this the wrong way... > > The error here isn't being generated by psql; it's generated by the > backend. In the context of a statement (and not, say, a COPY command). > > So instead of trying to handle this on the psql side[1], I think we > need to handle it in the backend; specifically in the parser. Is there > an easy way to get the parser to ignore the BOM character in the > context of commands (but not in strings)? > > [1]: Obviously, BOM could still screw up a psql command like \d. We'd > want to address that as well; but I suspect backends are the more > common scenario. But what about COPY files? I don't see why there is less of a case for eating a leading BOM for a COPY file (or COPY stdin for that matter, given that it can come from \copy) than for an SQL file. I suspect suspect trying to do this in the parser will be quite messy. This needs to happen before the input is converted to the server encoding, I think. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I suspect suspect trying to do this in the parser will be quite messy. > This needs to happen before the input is converted to the server > encoding, I think. Indeed --- what if the server isn't using utf8 internally? And a larger point is that the server has no idea where the file boundaries are. If we were to do this server-side, we'd essentially end up discarding BOM anywhere, which is more libertine than I care to be. regards, tom lane
On Mon, Mar 24, 2014 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I suspect suspect trying to do this in the parser will be quite messy. >> This needs to happen before the input is converted to the server >> encoding, I think. > > Indeed --- what if the server isn't using utf8 internally? > > And a larger point is that the server has no idea where the file > boundaries are. If we were to do this server-side, we'd essentially > end up discarding BOM anywhere, which is more libertine than I care > to be. Right -- I had a feeling you'd say that. That's why the best solution ISTM is to allow psql to be invoked in such a way that it *does* know the file boundaries for consolidated scripts; this means better handling of multiple file arguments. psql -1 already requires '-f' to work (vs cat foo.sql | psql) and that's pretty reasonable. BOM handling fixes should probably be injected in cases where the precise beginning points of the file are known, which AFAICT are \i and -f. So, in short, it seems prudent to: 1. make multiple -f invocation work (with -1 spanning) 2. strip BOM from -f or \i foo.sql if it's there That will fix all non redirection usages. Cases involving redirection are not psql's bailiwick. merlin
On Mon, Mar 24, 2014 at 2:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > psql -1 already requires '-f' to work actually, it doesn't. this was fixed recently. merlin
On 3/24/14, 1:59 PM, Andrew Dunstan wrote: >> It occurs to me that we're going about this the wrong way... >> >> The error here isn't being generated by psql; it's generated by the backend. In the context of a statement (and not, say,a COPY command). >> >> So instead of trying to handle this on the psql side[1], I think we need to handle it in the backend; specifically inthe parser. Is there an easy way to get the parser to ignore the BOM character in the context of commands (but not in strings)? >> >> [1]: Obviously, BOM could still screw up a psql command like \d. We'd want to address that as well; but I suspect backendsare the more common scenario. > > > But what about COPY files? I don't see why there is less of a case for eating a leading BOM for a COPY file (or COPY stdinfor that matter, given that it can come from \copy) than for an SQL file. Wait... I thought that was one of the objections... that we wanted to leave a BOM in something like a COPY untouched? Ifthat's not the case, why not just strip BOM wherever it shows up in psql input? (Granted, not good for \copy or copy; performance,so we might want to special case those, but that doesn't seem unreasonable...) > I suspect suspect trying to do this in the parser will be quite messy. This needs to happen before the input is convertedto the server encoding, I think. My hope was that there's a point in the parser where we know whether we're dealing with a command strong or raw data, andthat we'd be able to only strip this from command strings... or better yet, get the code that looks for a command stringto simply ignore BOM when it's parsing. Uh... could we just treat BOM as another whitespace character? ISTM the case is basically the same: we don't want " INSERT... VALUES( ' extra spaces ' ) ; " to blow up because of extra white space, but obviously ' extra spaces 'needs to stay intact -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby <jim@nasby.net> writes: > Wait... I thought that was one of the objections... that we wanted to > leave a BOM in something like a COPY untouched? I think most of us are okay with stripping a BOM that appears at the *beginning* of a text file (assuming there's reason to believe the file is in UTF8 encoding). BOM sequences embedded later in the file are a lot more debatable, and I for one don't want to assume those can be dropped. I don't know of any legitimate usage of such cases, and think it's probably better to report an encoding error. > Uh... could we just treat BOM as another whitespace character? A BOM is *most certainly not* whitespace. The only even semi-legitimate usage it has in UTF8 is as a file encoding marker. You can bet that the user whose text editor made the file did not think he had whitespace at the front. Anyway, your proposition that leading whitespace is ignorable fails completely for data files. regards, tom lane
>> Just a quick comment on this. Yes, pgAdmin always added a BOM in every >> SQL files it wrote. > > From http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom: > > According to the Unicode standard, the BOM for UTF-8 files is not recommended: > > 2.6 Encoding Schemes > > ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data isconverted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark”subsection in Section 16.8, Specials, for more information. Right. I think unconditionally adding BOM to a file is evil. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
> The code would probably be pretty trivial, *if* we had consensus on > what the behavior ought to be. I'm not sure if we do. People who > only use Unicode would probably like it if BOMs were unconditionally > swallowed, whether or not psql thinks the client_encoding is UTF8. > (And I seem to recall somebody even proposing that finding a BOM > be cause to switch the client_encoding to UTF8.) This is a bad idea. ISO 8859-1 uses 0xfe and 0xff (BOM) for some characters. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On 03/24/2014 08:28 PM, Tatsuo Ishii wrote: >> The code would probably be pretty trivial, *if* we had consensus on >> what the behavior ought to be. I'm not sure if we do. People who >> only use Unicode would probably like it if BOMs were unconditionally >> swallowed, whether or not psql thinks the client_encoding is UTF8. >> (And I seem to recall somebody even proposing that finding a BOM >> be cause to switch the client_encoding to UTF8.) > This is a bad idea. ISO 8859-1 uses 0xfe and 0xff (BOM) for some > characters. > > Yeah, I think there is no consensus to do anything unless the client encoding is UTF8. cheers andrew
On 03/25/2014 02:50 AM, Jim Nasby wrote: > So instead of trying to handle this on the psql side[1], I think we need > to handle it in the backend; specifically in the parser. Is there an > easy way to get the parser to ignore the BOM character in the context of > commands (but not in strings)? I disagree. The server deals with SQL statements and client data streams, it doesn't deal with files. The BOM is an artifact of *files*. You don't expect to see a BOM in a UTF-8 string passed to a function call in a library; nor should you expect one to be passed to you on a network protocol that isn't about exchanging files. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 03/25/2014 07:05 AM, Tom Lane wrote: > Jim Nasby <jim@nasby.net> writes: >> Wait... I thought that was one of the objections... that we wanted to >> leave a BOM in something like a COPY untouched? > > I think most of us are okay with stripping a BOM that appears at the > *beginning* of a text file (assuming there's reason to believe the file > is in UTF8 encoding). BOM sequences embedded later in the file are a lot > more debatable, and I for one don't want to assume those can be dropped. > I don't know of any legitimate usage of such cases, and think it's > probably better to report an encoding error. Yep, it's absolutely an encoding error IMO. Something like: ERROR: Invalid UTF-8 - probable UTF-8 byte-order mark detected mid-data-stream. HINT: Multiple files with byte-order marks were probably concatenated with a tool that is not Unicode-aware. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services