Thread: Using psql -f to load a UTF8 file
I am running Postgres 8.4.1 on Windows XP Professional Service Pack 3. My database is UTF8. I use psql -f to load files containing DDL and DML commands. I discovered a long time ago that psql does not like UTF8 files: it complains about the byte order mark on the first line. Up to now I have worked round that by making sure that the files were saved as what Microsoft calls "ANSI". However, that option is not available if I want to insert data which includes non-ASCII characters.
I have found a suggestion that psql can be told to expect UTF8 files by creating a file called psqlrc.conf containing the line \encoding unicode. I have tried putting this file (i) in the Postgres data directory, along with postgresql.conf, and (ii) in %APPDATA%\postgresql, but I still get an error:
psql:120919_insert_into_letter.sql:1: ERROR: syntax error at or near "insert
"
LINE 1: insert into LETTER_VAR (var_name, type) values ('REPORT_COP...
"
LINE 1: insert into LETTER_VAR (var_name, type) values ('REPORT_COP...
I have found a workaround, which is to start the file with a line containing just a semicolon. Then the empty "statement" fails, but the other statements succeed:
psql:120919_insert_into_letter_copy2.sql:1: ERROR: syntax error at or near "
"
LINE 1: ;
^
INSERT 0 1
INSERT 0 1
"
LINE 1: ;
^
INSERT 0 1
INSERT 0 1
However, I feel sure that there must be a better way.
I have noted BUG report #6271: psql -f reporting unexpected syntax errors on first command. This involves the same problem. Álvaro Herrera advised that "You need to remove [the byte order mark] before passing the file to psql". But how am I supposed to remove the byte order mark from a UTF8 file? I thought that the whole point of the byte order mark was to tell programs what the file encoding is. Other programs, such as Python, rely on this.
Torsdag 20. september 2012 16.56.16 skrev Alan Millington : > psql". But how am I supposed to remove the byte order mark from a UTF8 > file? I thought that the whole point of the byte order mark was to tell > programs what the file encoding is. Other programs, such as Python, rely > on this. http://en.wikipedia.org/wiki/Byte_order_mark While the Byte Order Mark is important for UTF-16, it's totally irrelevant to the UTF-8 encoding. Still you'll find several editors that automatically input BOMs in every text file. There is usually a setting "Insert Byte Order Mark" somewhere in the configuration, and it may be on by default. regards, Leif
Alan Millington <admillington@yahoo.co.uk> writes: > I am running Postgres 8.4.1 on Windows XP Professional Service Pack 3. My database is UTF8. I use psql -f to load filescontaining DDL and DML commands. I discovered a long time ago that psql does not like UTF8 files: it complains aboutthe byte order mark on the first line. Up to now I have worked round that by making sure that the files were saved aswhat Microsoft calls "ANSI". However, that option is not available if I want to insert data which includes non-ASCII characters. FWIW, psql 9.0 and later will ignore an initial BOM if the client encoding is UTF8. regards, tom lane
Torsdag 20. september 2012 19.27.22 skrev Alan Millington : > Thank you for the link. I am using Notepad, which inserts the byte order > mark. Following the links a bit further, I gather that the version of > Notepad that I am using may not identify a UTF8 file correctly if the byte > order mark is omitted. Also, as I mentioned, Python makes use of it. (From > the Python documentation on Encoding declarations: "If the first bytes of > the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file > encoding is UTF-8 (this is supported, among others, by Microsoft’s > Notepad).") > The conclusion seems to be that I must use one editor for Python, and > another for Postgres. It's been a long time since I last wrote a Python script, but I've always used the explicit encoding directive: #! /usr/bin/env python # -*- encoding: utf-8 -*- See http://docs.python.org/release/2.5.1/ref/encodings.html which also mentions the BOM method as an alternative. regards, Leif
On 09/20/2012 10:44 AM, Leif Biberg Kristensen wrote: > Torsdag 20. september 2012 19.27.22 skrev Alan Millington : >> Thank you for the link. I am using Notepad, which inserts the byte order >> mark. Following the links a bit further, I gather that the version of >> Notepad that I am using may not identify a UTF8 file correctly if the byte >> order mark is omitted. Also, as I mentioned, Python makes use of it. (From >> the Python documentation on Encoding declarations: "If the first bytes of >> the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file >> encoding is UTF-8 (this is supported, among others, by Microsoft’s >> Notepad).") >> The conclusion seems to be that I must use one editor for Python, and >> another for Postgres. I would strongly advise against using Notepad for any kind of text editing. Wordpad works better, or even better yet Notepad ++: http://notepad-plus-plus.org/ > > It's been a long time since I last wrote a Python script, but I've always used > the explicit encoding directive: > > #! /usr/bin/env python > # -*- encoding: utf-8 -*- > > See http://docs.python.org/release/2.5.1/ref/encodings.html which also > mentions the BOM method as an alternative. > > regards, Leif > > -- Adrian Klaver adrian.klaver@gmail.com
On 09/20/12 7:56 AM, Alan Millington wrote: > I discovered a long time ago that psql does not like UTF8 files: it > complains about the byte order mark on the first line. in case it wasn't clear from previous replies, Windows native Unicode format is NOT UTF8, its UTF16, where every character is stored as 2 bytes. -- john r pierce N 37, W 122 santa cruz ca mid-left coast
On 09/20/2012 11:44 PM, Leif Biberg Kristensen wrote: > Torsdag 20. september 2012 16.56.16 skrev Alan Millington : >> psql". But how am I supposed to remove the byte order mark from a UTF8 >> file? I thought that the whole point of the byte order mark was to tell >> programs what the file encoding is. Other programs, such as Python, rely >> on this. > > http://en.wikipedia.org/wiki/Byte_order_mark > > While the Byte Order Mark is important for UTF-16, it's totally irrelevant to > the UTF-8 encoding. I strongly disagree. The BOM provides a useful and standard way to differentiate UTF-8 encoded text files from the random pile of encodings that any given file could be. On many platforms (including all Windows versions) the default system text encoding for 8-bit text is not UTF-8. On such systems, a BOM in a UTF-8 file allows a program/editor to reliably work out that it's UTF-8 and treat it as such, rather than mangling it by interpreting it as the local system encoding. psql should accept UTF-8 with BOM. -- Craig Ringer
On Fri, Sep 21, 2012 at 11:21 AM, Craig Ringer <ringerc@ringerc.id.au> wrote: > I strongly disagree. The BOM provides a useful and standard way to > differentiate UTF-8 encoded text files from the random pile of encodings > that any given file could be. The only reliable way to ascertain the encoding of a hunk of data is with something out-of-band. Relying on the first three bytes being \xEF\xBB\xBF is not much more reliable than detecting based on octet frequency, which is what leads to the "Bush hid the facts" hack in Notepad. This is why many Internet protocols have metadata carried along with the file (eg Content-type in HTTP), rather than relying on internal evidence. > psql should accept UTF-8 with BOM. However, this I would agree with. It's cheap enough to detect, and aside from arbitrarily trying to kill Notepad (which won't happen anyway), there's not a lot of reason to choke on the BOM. But it's not a big deal. ChrisA
Thank you for the link. I am using Notepad, which inserts the byte order mark. Following the links a bit further, I gather that the version of Notepad that I am using may not identify a UTF8 file correctly if the byte order mark is omitted. Also, as I mentioned, Python makes use of it. (From the Python documentation on Encoding declarations: "If the first bytes of the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8 (this is supported, among others, by Microsoft’s Notepad).")
The conclusion seems to be that I must use one editor for Python, and another for Postgres.
From: Leif Biberg Kristensen <leif@solumslekt.org>
To: Postgres general mailing list <pgsql-general@postgresql.org>
Cc: Alan Millington <admillington@yahoo.co.uk>
Sent: Thursday, 20 September 2012, 16:44
Subject: Re: [GENERAL] Using psql -f to load a UTF8 file
To: Postgres general mailing list <pgsql-general@postgresql.org>
Cc: Alan Millington <admillington@yahoo.co.uk>
Sent: Thursday, 20 September 2012, 16:44
Subject: Re: [GENERAL] Using psql -f to load a UTF8 file
Torsdag 20. september 2012 16.56.16 skrev Alan Millington :
> psql". But how am I supposed to remove the byte order mark from a UTF8
> file? I thought that the whole point of the byte order mark was to tell
> programs what the file encoding is. Other programs, such as Python, rely
> on this.
http://en.wikipedia.org/wiki/Byte_order_mark
While the Byte Order Mark is important for UTF-16, it's totally irrelevant to
the UTF-8 encoding. Still you'll find several editors that automatically input
BOMs in every text file. There is usually a setting "Insert Byte Order Mark"
somewhere in the configuration, and it may be on by default.
regards, Leif
On 09/20/12 10:27 AM, Alan Millington wrote: > I am using Notepad, which inserts the byte order mark. Following the > links a bit further, I gather that the version of Notepad that I am > using may not identify a UTF8 file correctly if the byte order mark is > omitted. Also, as I mentioned, Python makes use of it. (From the > Python documentation on Encoding declarations: "If the first bytes of > the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared > file encoding is UTF-8 (this is supported, among others, by > Microsoft’s Notepad).") I've never seen Notepad generate UTF8. Usually its either 8 bit ASCII (ISO8559-1 or something), or its UTF16 aka "Unicode". -- john r pierce N 37, W 122 santa cruz ca mid-left coast
On Fri, Sep 21, 2012 at 2:39 PM, John R Pierce <pierce@hogranch.com> wrote: > On 09/20/12 10:27 AM, Alan Millington wrote: >> >> I am using Notepad, which inserts the byte order mark. Following the links >> a bit further, I gather that the version of Notepad that I am using may not >> identify a UTF8 file correctly if the byte order mark is omitted. Also, as I >> mentioned, Python makes use of it. (From the Python documentation on >> Encoding declarations: "If the first bytes of the file are the UTF-8 >> byte-order mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8 (this >> is supported, among others, by Microsoft’s Notepad).") > > I've never seen Notepad generate UTF8. Usually its either 8 bit ASCII > (ISO8559-1 or something), or its UTF16 aka "Unicode". Those are the defaults; you can tell it to save as UTF-8. But the general advice is: Don't use Notepad! It can't handle Unix newlines either (something which annoys me periodically when I'm on a borrowed Windows machine and need to view a file quickly). There are many better editors around; Notepad++ was mentioned, and NoteTab is another good one. My personal preference is SciTE, available for Linux as well as Windows. You'll start to realize how handy syntax highlighting is when your next bug is caught even before you save, because the apostrophe in the quoted string breaks the colorization. Anything that reduces debugging time can't be a bad thing! ChrisA
> I strongly disagree. The BOM provides a useful and standard way to differentiate UTF-8 encoded text files from the randompile of encodings that any given file could be. I agree on the concept, but I'm having a bit of trouble understanding how a "Byte Order Marker" is useful to an 8-bit encoding.It's not possible to get the byte order wrong in UTF-8, is it? Yes, being able to mark data as being encoded as UTF-8 is useful, but is a BOM the right tool? Alban Hertroys -- Screwing up is an excellent way to attach something to the ceiling.
On Fri, Sep 21, 2012 at 09:21:36AM +0800, Craig Ringer wrote: > On 09/20/2012 11:44 PM, Leif Biberg Kristensen wrote: > > Torsdag 20. september 2012 16.56.16 skrev Alan Millington : > >>psql". But how am I supposed to remove the byte order mark from a UTF8 > >>file? I thought that the whole point of the byte order mark was to tell > >>programs what the file encoding is. Other programs, such as Python, rely > >>on this. > > > >http://en.wikipedia.org/wiki/Byte_order_mark > > > >While the Byte Order Mark is important for UTF-16, it's totally irrelevant to > >the UTF-8 encoding. > > I strongly disagree. The BOM provides a useful and standard way to > differentiate UTF-8 encoded text files from the random pile of > encodings that any given file could be. Use of the BOM in UTF-8 causes a host of display and interoperability problems, and is considered by many to be a broken practice. It's also pointless since there are no byte ordering issues with UTF-8. Best to not use it at all. In any case, the BOM byte sequence does not unambiguously identify UTF-8; it's equally valid for 8-bit charsets, so an external means of specifying the encoding is preferable and more robust. Regards, Roger -- .''`. Roger Leigh : :' : Debian GNU/Linux http://people.debian.org/~rleigh/ `. `' schroot and sbuild http://alioth.debian.org/projects/buildd-tools `- GPG Public Key F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800
On 09/20/2012 10:27 AM, Alan Millington wrote: > Thank you for the link. I am using Notepad, which inserts the byte order > mark. Following the links a bit further, I gather that the version of > Notepad that I am using may not identify a UTF8 file correctly if the > byte order mark is omitted. Also, as I mentioned, Python makes use of > it. (From the Python documentation on Encoding declarations: "If the > first bytes of the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), > the declared file encoding is UTF-8 (this is supported, among others, by > Microsoft’s Notepad).") > The conclusion seems to be that I must use one editor for Python, and > another for Postgres. The problem is Notepad has a limited view of the text world. If you use it only in the Microsoft universe and for simple tasks it is sufficient. The issues arise when you try to work with files outside that scope. For long term peace of mind you will better served by more capable text editors(see other posts for suggestions). As to the Python documentation, it goes on to say: http://docs.python.org/tutorial/interpreter.html#source-code-encoding "Notice that this signature is not understood in older Python releases (2.2 and earlier), and also not understood by the operating system for script files with #! lines (only used on Unix systems)." From the Zen of Python: "Explicit is better than implicit" Following that the preferred way of declaring an encoding would be: # -*- coding: utf-8 -*- See the above link for more information. > -- Adrian Klaver adrian.klaver@gmail.com
On 2012-09-21, Craig Ringer <ringerc@ringerc.id.au> wrote: > I strongly disagree. The BOM provides a useful and standard way to > differentiate UTF-8 encoded text files what is stopping non utf8 files from starting with something that looks like a BOM? -- ⚂⚃ 100% natural