Thread: Using psql -f to load a UTF8 file

Using psql -f to load a UTF8 file

From

Alan Millington

Date:

20 September 2012, 15:34:06

I am running Postgres 8.4.1 on Windows XP Professional Service Pack 3. My database is UTF8. I use psql -f to load files containing DDL and DML commands. I discovered a long time ago that psql does not like UTF8 files: it complains about the byte order mark on the first line. Up to now I have worked round that by making sure that the files were saved as what Microsoft calls "ANSI". However, that option is not available if I want to insert data which includes non-ASCII characters.

I have found a suggestion that psql can be told to expect UTF8 files by creating a file called psqlrc.conf containing the line \encoding unicode. I have tried putting this file (i) in the Postgres data directory, along with postgresql.conf, and (ii) in %APPDATA%\postgresql, but I still get an error:

psql:120919_insert_into_letter.sql:1: ERROR:  syntax error at or near "ï»¿insert
"
LINE 1: ï»¿insert into LETTER_VAR (var_name, type) values ('REPORT_COP...

I have found a workaround, which is to start the file with a line containing just a semicolon. Then the empty "statement" fails, but the other statements succeed:

psql:120919_insert_into_letter_copy2.sql:1: ERROR:  syntax error at or near "ï»¿
"
LINE 1: ï»¿;
        ^
INSERT 0 1
INSERT 0 1

However, I feel sure that there must be a better way.

I have noted BUG report #6271: psql -f reporting unexpected syntax errors on first command. This involves the same problem. Álvaro Herrera advised that "You need to remove [the byte order mark] before passing the file to psql". But how am I supposed to remove the byte order mark from a UTF8 file? I thought that the whole point of the byte order mark was to tell programs what the file encoding is. Other programs, such as Python, rely on this.

Re: Using psql -f to load a UTF8 file

From

Leif Biberg Kristensen

Date:

20 September 2012, 15:45:05

 Torsdag 20. september 2012 16.56.16 skrev Alan Millington :
> psql". But how am I supposed to remove the byte order mark from a UTF8
> file? I thought that the whole point of the byte order mark was to tell
> programs what the file encoding is. Other programs, such as Python, rely
> on this.

http://en.wikipedia.org/wiki/Byte_order_mark

While the Byte Order Mark is important for UTF-16, it's totally irrelevant to
the UTF-8 encoding. Still you'll find several editors that automatically input
BOMs in every text file. There is usually a setting "Insert Byte Order Mark"
somewhere in the configuration, and it may be on by default.

regards, Leif

Re: Using psql -f to load a UTF8 file

From

Tom Lane

Date:

20 September 2012, 17:22:14

Alan Millington <admillington@yahoo.co.uk> writes:
> I am running Postgres 8.4.1 on Windows XP Professional Service Pack 3. My database is UTF8. I use psql -f to load
filescontaining DDL and DML commands. I discovered a long time ago that psql does not like UTF8 files: it complains
aboutthe byte order mark on the first line. Up to now I have worked round that by making sure that the files were saved
aswhat Microsoft calls "ANSI". However, that option is not available if I want to insert data which includes non-ASCII
characters.

FWIW, psql 9.0 and later will ignore an initial BOM if the client
encoding is UTF8.

            regards, tom lane

Re: Using psql -f to load a UTF8 file

From

Leif Biberg Kristensen

Date:

20 September 2012, 17:44:05

 Torsdag 20. september 2012 19.27.22 skrev Alan Millington :
> Thank you for the link. I am using Notepad, which inserts the byte order
> mark. Following the links a bit further, I gather that the version of
> Notepad that I am using may not identify a UTF8 file correctly if the byte
> order mark is omitted. Also, as I mentioned, Python makes use of it. (From
> the Python documentation on Encoding declarations: "If the first bytes of
> the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file
> encoding is UTF-8 (this is supported, among others, by Microsoft’s
> Notepad).")
> The conclusion seems to be that I must use one editor for Python, and
> another for Postgres.

It's been a long time since I last wrote a Python script, but I've always used
the explicit encoding directive:

#! /usr/bin/env python
# -*- encoding: utf-8 -*-

See http://docs.python.org/release/2.5.1/ref/encodings.html which also
mentions the BOM method as an alternative.

regards, Leif

Re: Using psql -f to load a UTF8 file

From

Adrian Klaver

Date:

20 September 2012, 17:54:51

On 09/20/2012 10:44 AM, Leif Biberg Kristensen wrote:
>   Torsdag 20. september 2012 19.27.22 skrev Alan Millington :
>> Thank you for the link. I am using Notepad, which inserts the byte order
>> mark. Following the links a bit further, I gather that the version of
>> Notepad that I am using may not identify a UTF8 file correctly if the byte
>> order mark is omitted. Also, as I mentioned, Python makes use of it. (From
>> the Python documentation on Encoding declarations: "If the first bytes of
>> the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file
>> encoding is UTF-8 (this is supported, among others, by Microsoft’s
>> Notepad).")
>> The conclusion seems to be that I must use one editor for Python, and
>> another for Postgres.

I would strongly advise against using Notepad for any kind of text
editing.  Wordpad works better, or even better yet Notepad ++:

http://notepad-plus-plus.org/

>
> It's been a long time since I last wrote a Python script, but I've always used
> the explicit encoding directive:
>
> #! /usr/bin/env python
> # -*- encoding: utf-8 -*-
>
> See http://docs.python.org/release/2.5.1/ref/encodings.html which also
> mentions the BOM method as an alternative.
>
> regards, Leif
>
>


--
Adrian Klaver
adrian.klaver@gmail.com

Re: Using psql -f to load a UTF8 file

From

John R Pierce

Date:

20 September 2012, 18:33:30

On 09/20/12 7:56 AM, Alan Millington wrote:
> I discovered a long time ago that psql does not like UTF8 files: it
> complains about the byte order mark on the first line.

in case it wasn't clear from previous replies, Windows native Unicode
format is NOT UTF8, its UTF16, where every character is stored as 2 bytes.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast

Re: Using psql -f to load a UTF8 file

From

Craig Ringer

Date:

21 September 2012, 01:21:49

On 09/20/2012 11:44 PM, Leif Biberg Kristensen wrote:
>   Torsdag 20. september 2012 16.56.16 skrev Alan Millington :
>> psql". But how am I supposed to remove the byte order mark from a UTF8
>> file? I thought that the whole point of the byte order mark was to tell
>> programs what the file encoding is. Other programs, such as Python, rely
>> on this.
>
> http://en.wikipedia.org/wiki/Byte_order_mark
>
> While the Byte Order Mark is important for UTF-16, it's totally irrelevant to
> the UTF-8 encoding.

I strongly disagree. The BOM provides a useful and standard way to
differentiate UTF-8 encoded text files from the random pile of encodings
that any given file could be.

On many platforms (including all Windows versions) the default system
text encoding for 8-bit text is not UTF-8. On such systems, a BOM in a
UTF-8 file allows a program/editor to reliably work out that it's UTF-8
and treat it as such, rather than mangling it by interpreting it as the
local system encoding.

psql should accept UTF-8 with BOM.

--
Craig Ringer

Re: Using psql -f to load a UTF8 file

From

Chris Angelico

Date:

21 September 2012, 03:07:18

On Fri, Sep 21, 2012 at 11:21 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> I strongly disagree. The BOM provides a useful and standard way to
> differentiate UTF-8 encoded text files from the random pile of encodings
> that any given file could be.

The only reliable way to ascertain the encoding of a hunk of data is
with something out-of-band. Relying on the first three bytes being
\xEF\xBB\xBF is not much more reliable than detecting based on octet
frequency, which is what leads to the "Bush hid the facts" hack in
Notepad. This is why many Internet protocols have metadata carried
along with the file (eg Content-type in HTTP), rather than relying on
internal evidence.

> psql should accept UTF-8 with BOM.

However, this I would agree with. It's cheap enough to detect, and
aside from arbitrarily trying to kill Notepad (which won't happen
anyway), there's not a lot of reason to choke on the BOM. But it's not
a big deal.

ChrisA

Re: Using psql -f to load a UTF8 file

From

Alan Millington

Date:

21 September 2012, 03:12:42

Thank you for the link. I am using Notepad, which inserts the byte order mark. Following the links a bit further, I gather that the version of Notepad that I am using may not identify a UTF8 file correctly if the byte order mark is omitted. Also, as I mentioned, Python makes use of it. (From the Python documentation on Encoding declarations: "If the first bytes of the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8 (this is supported, among others, by Microsoft’s Notepad).")

The conclusion seems to be that I must use one editor for Python, and another for Postgres.

From: Leif Biberg Kristensen <leif@solumslekt.org>
To: Postgres general mailing list <pgsql-general@postgresql.org>
Cc: Alan Millington <admillington@yahoo.co.uk>
Sent: Thursday, 20 September 2012, 16:44
Subject: Re: [GENERAL] Using psql -f to load a UTF8 file

Torsdag 20. september 2012 16.56.16 skrev Alan Millington :
> psql". But how am I supposed to remove the byte order mark from a UTF8
> file? I thought that the whole point of the byte order mark was to tell
> programs what the file encoding is. Other programs, such as Python, rely
> on this.

http://en.wikipedia.org/wiki/Byte_order_mark

While the Byte Order Mark is important for UTF-16, it's totally irrelevant to
the UTF-8 encoding. Still you'll find several editors that automatically input
BOMs in every text file. There is usually a setting "Insert Byte Order Mark"
somewhere in the configuration, and it may be on by default.

regards, Leif

Re: Using psql -f to load a UTF8 file

From

John R Pierce

Date:

21 September 2012, 04:39:29

On 09/20/12 10:27 AM, Alan Millington wrote:
> I am using Notepad, which inserts the byte order mark. Following the
> links a bit further, I gather that the version of Notepad that I am
> using may not identify a UTF8 file correctly if the byte order mark is
> omitted. Also, as I mentioned, Python makes use of it. (From the
> Python documentation on Encoding declarations: "If the first bytes of
> the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared
> file encoding is UTF-8 (this is supported, among others, by
> Microsoft’s Notepad).")

I've never seen Notepad generate UTF8.   Usually its either 8 bit ASCII
(ISO8559-1 or something), or its UTF16 aka "Unicode".



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast

Re: Using psql -f to load a UTF8 file

From

Chris Angelico

Date:

21 September 2012, 05:30:06

On Fri, Sep 21, 2012 at 2:39 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 09/20/12 10:27 AM, Alan Millington wrote:
>>
>> I am using Notepad, which inserts the byte order mark. Following the links
>> a bit further, I gather that the version of Notepad that I am using may not
>> identify a UTF8 file correctly if the byte order mark is omitted. Also, as I
>> mentioned, Python makes use of it. (From the Python documentation on
>> Encoding declarations: "If the first bytes of the file are the UTF-8
>> byte-order mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8 (this
>> is supported, among others, by Microsoft’s Notepad).")
>
> I've never seen Notepad generate UTF8.   Usually its either 8 bit ASCII
> (ISO8559-1 or something), or its UTF16 aka "Unicode".

Those are the defaults; you can tell it to save as UTF-8.

But the general advice is: Don't use Notepad! It can't handle Unix
newlines either (something which annoys me periodically when I'm on a
borrowed Windows machine and need to view a file quickly). There are
many better editors around; Notepad++ was mentioned, and NoteTab is
another good one. My personal preference is SciTE, available for Linux
as well as Windows. You'll start to realize how handy syntax
highlighting is when your next bug is caught even before you save,
because the apostrophe in the quoted string breaks the colorization.
Anything that reduces debugging time can't be a bad thing!

ChrisA

Re: Using psql -f to load a UTF8 file

From

Alban Hertroys

Date:

21 September 2012, 07:01:52

> I strongly disagree. The BOM provides a useful and standard way to differentiate UTF-8 encoded text files from the
randompile of encodings that any given file could be. 


I agree on the concept, but I'm having a bit of trouble understanding how a "Byte Order Marker" is useful to an 8-bit
encoding.It's not possible to get the byte order wrong in UTF-8, is it? 
Yes, being able to mark data as being encoded as UTF-8 is useful, but is a BOM the right tool?

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

Re: Using psql -f to load a UTF8 file

From

Roger Leigh

Date:

21 September 2012, 09:40:51

On Fri, Sep 21, 2012 at 09:21:36AM +0800, Craig Ringer wrote:
> On 09/20/2012 11:44 PM, Leif Biberg Kristensen wrote:
> >  Torsdag 20. september 2012 16.56.16 skrev Alan Millington :
> >>psql". But how am I supposed to remove the byte order mark from a UTF8
> >>file? I thought that the whole point of the byte order mark was to tell
> >>programs what the file encoding is. Other programs, such as Python, rely
> >>on this.
> >
> >http://en.wikipedia.org/wiki/Byte_order_mark
> >
> >While the Byte Order Mark is important for UTF-16, it's totally irrelevant to
> >the UTF-8 encoding.
>
> I strongly disagree. The BOM provides a useful and standard way to
> differentiate UTF-8 encoded text files from the random pile of
> encodings that any given file could be.

Use of the BOM in UTF-8 causes a host of display and interoperability
problems, and is considered by many to be a broken practice.  It's
also pointless since there are no byte ordering issues with UTF-8.
Best to not use it at all.  In any case, the BOM byte sequence does
not unambiguously identify UTF-8; it's equally valid for 8-bit
charsets, so an external means of specifying the encoding is
preferable and more robust.


Regards,
Roger

--
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux    http://people.debian.org/~rleigh/
 `. `'   schroot and sbuild  http://alioth.debian.org/projects/buildd-tools
   `-    GPG Public Key      F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800

Re: Using psql -f to load a UTF8 file

From

Adrian Klaver

Date:

21 September 2012, 14:34:23

On 09/20/2012 10:27 AM, Alan Millington wrote:
> Thank you for the link. I am using Notepad, which inserts the byte order
> mark. Following the links a bit further, I gather that the version of
> Notepad that I am using may not identify a UTF8 file correctly if the
> byte order mark is omitted. Also, as I mentioned, Python makes use of
> it. (From the Python documentation on Encoding declarations: "If the
> first bytes of the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'),
> the declared file encoding is UTF-8 (this is supported, among others, by
> Microsoft’s Notepad).")
> The conclusion seems to be that I must use one editor for Python, and
> another for Postgres.

The problem is Notepad has a limited view of the text world. If you use
it only in the Microsoft universe and for simple tasks it is sufficient.
The issues arise when you try to work with files outside that scope. For
long term peace of mind you will better served by more capable text
editors(see other posts for suggestions). As to the Python
documentation, it goes on to say:

http://docs.python.org/tutorial/interpreter.html#source-code-encoding

"Notice that this signature is not understood in older Python releases
(2.2 and earlier), and also not understood by the operating system for
script files with #! lines (only used on Unix systems)."

 From the Zen of Python:
"Explicit is better than implicit"

Following that the preferred way of declaring an encoding would be:
# -*- coding: utf-8 -*-

See the above link for more information.
>

--
Adrian Klaver
adrian.klaver@gmail.com

Re: Using psql -f to load a UTF8 file

From

Jasen Betts

Date:

23 September 2012, 07:31:13

On 2012-09-21, Craig Ringer <ringerc@ringerc.id.au> wrote:
> I strongly disagree. The BOM provides a useful and standard way to
> differentiate UTF-8 encoded text files

what is stopping non utf8 files from starting with something that
looks like a BOM?



--
⚂⚃ 100% natural