Thread: encoding confusion with \copy command

encoding confusion with \copy command

From
Martin Waite
Date:
Hi,

I have a postgresql 7.4 server and client on Centos 6.4.  The database server is using UTF-8 encoding.

I have been exploring the use of the \copy command for importing CSV data generated by SQL Server 2008.  The SQL Server 2008 export tool does not escape quotes that appear in the content of fields, so it is useful to be able to specify an obscure character as the QUOTE option of the \copy command to work around this issue.

When I run the following commands in psql, I am surprised that QUOTE is limited to characters in the range 0x01 - 0x7f, and that UTF8 is mentioned in the error message if characters outside the range are chosen:

\encoding WIN1252
\copy yuml from '/tmp/yuml.csv'  WITH CSV HEADER ENCODING 'WIN1252' QUOTE as E'\xff';
ERROR:  invalid byte sequence for encoding "UTF8": 0xff


I thought that if the client (psql) is WIN1252, and the CSV file is specified as WIN1252, then I could specify any valid WIN1252 character as the quote character.   Instead, I am limited to the range of characters that can be encoded as a single byte in UTF-8. Actually, 0x00 is not accepted either, so the range is 0x01 - 0x7F.

Is this a bug or expected behaviour?

Is it the case that the server does the actual CSV parsing, and that, given that my server is in UTF8, I am therefore limited to single-byte UTF8 characters?
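
For what it's worth, if I stay inside that range the command is accepted, e.g. with 0x01 as the quote byte (same table and file as above):

    \encoding WIN1252
    \copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252' QUOTE AS E'\x01'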

regards,
Martin

Re: encoding confusion with \copy command

From
Martin Waite
Date:
Hi Adrian,

I apologise - I meant 9.4

regards,
Martin

On 17 September 2014 14:35, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/17/2014 03:03 AM, Martin Waite wrote:
Hi,

I have a postgresql 7.4 server and client on Centos 6.4.  The database
server is using UTF-8 encoding.

First I think we need to establish what version of Postgres you are using. Are you really using 7.4?


I have been exploring the use of the \copy command for importing CSV
data generated by SQL Server 2008.  SQL Server 2008 export tool does not
escape quotes that are in the content of fields, and so it is useful to
be able to specify obscure characters in the quote option in the \copy
command to work around this issue.

When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is
mentioned in the error message if characters outside the range are chosen:

    \encoding WIN1252
    \copy yuml from '/tmp/yuml.csv'  WITH CSV HEADER ENCODING 'WIN1252'
    QUOTE as E'\xff';
    ERROR:  invalid byte sequence for encoding "UTF8": 0xff


If you are actually on Postgres 7.4 the above would not be a viable command.




I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character
as the quote character.   Instead, I am limited to the range of
characters that can be encoded as a single byte in UTF-8. Actually, 0x00
is not accepted either, so the range is 0x01 - 0x7F.

Is this a bug or expected behaviour ?

Is it the case that the server does the actual CSV parsing, and that
given that my server is in UTF8, I am therefore limited to single-byte
UTF8 characters ?

Actually depending on version you may be limited to ASCII.


regards,
Martin


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: encoding confusion with \copy command

From
Adrian Klaver
Date:
On 09/17/2014 03:03 AM, Martin Waite wrote:
> Hi,
>
> I have a postgresql 7.4 server and client on Centos 6.4.  The database
> server is using UTF-8 encoding.

First I think we need to establish what version of Postgres you are using.
Are you really using 7.4?

>
> I have been exploring the use of the \copy command for importing CSV
> data generated by SQL Server 2008.  SQL Server 2008 export tool does not
> escape quotes that are in the content of fields, and so it is useful to
> be able to specify obscure characters in the quote option in the \copy
> command to work around this issue.
>
> When I run the following commands in psql, I am surprised that QUOTE is
> limited to characters in the range 0x01 - 0x7f, and that UTF8 is
> mentioned in the error message if characters outside the range are chosen:
>
>     \encoding WIN1252
>     \copy yuml from '/tmp/yuml.csv'  WITH CSV HEADER ENCODING 'WIN1252'
>     QUOTE as E'\xff';
>     ERROR:  invalid byte sequence for encoding "UTF8": 0xff


If you are actually on Postgres 7.4 the above would not be a viable command.

>
>
>
> I thought that if the client (psql) is WIN1252, and the CSV file is
> specified as WIN1252, then I could specify any valid WIN1252 character
> as the quote character.   Instead, I am limited to the range of
> characters that can be encoded as a single byte in UTF-8. Actually, 0x00
> is not accepted either, so the range is 0x01 - 0x7F.
>
> Is this a bug or expected behaviour ?
>
> Is it the case that the server does the actual CSV parsing, and that
> given that my server is in UTF8, I am therefore limited to single-byte
> UTF8 characters ?

Actually depending on version you may be limited to ASCII.

>
> regards,
> Martin


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: encoding confusion with \copy command

From
Adrian Klaver
Date:
On 09/17/2014 06:48 AM, Martin Waite wrote:
> Hi Adrian,
>
> I apologise - I meant 9.4

Looks like you will need an intermediate step. A quick search found the
below which might offer a solution:

http://www.excel-sql-server.com/sql-server-export-to-excel-using-bcp-sqlcmd-csv.htm

FYI the good stuff is at the bottom.

There is also this SO answer that refers to SQL Server Management Studio (not sure if you have it):

http://stackoverflow.com/questions/6115054/how-to-get-export-output-in-real-csv-format-in-sql-server-managment-studio

>
> regards,
> Martin


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: encoding confusion with \copy command

From
Martin Waite
Date:
You are right that I need an intermediate step.  I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
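
What I have in mind is only a few lines of Python, roughly like this (just a sketch; the file names, the cp1252 encoding and the 0x01 quote byte are placeholders for my setup):

    import csv

    # Read the SQL Server export liberally: WIN1252 bytes, plus the obscure
    # quote byte chosen to dodge the unescaped embedded quotes.
    with open('yuml_raw.csv', newline='', encoding='cp1252') as src, \
         open('yuml_clean.csv', 'w', newline='', encoding='utf-8') as dst:
        reader = csv.reader(src, quotechar='\x01')
        # Write strict CSV in UTF-8 with ordinary double-quote quoting, which
        # \copy ... WITH CSV HEADER can load without any special options.
        writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            writer.writerow(row)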

Thanks for the help.

On 17 September 2014 15:40, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/17/2014 06:48 AM, Martin Waite wrote:
Hi Adrian,

I apologise - I meant 9.4

Looks like you will need an intermediate step. A quick search found the below which might offer a solution:

http://www.excel-sql-server.com/sql-server-export-to-excel-using-bcp-sqlcmd-csv.htm

FYI the good stuff is at the bottom.

There is also this SO answer that refers to SQL Server Management Studio (not sure if you have it):

http://stackoverflow.com/questions/6115054/how-to-get-export-output-in-real-csv-format-in-sql-server-managment-studio



regards,
Martin


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: encoding confusion with \copy command

From
John DeSoi
Date:
On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:

> You are right that I need an intermediate step.  I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
>

If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.

John DeSoi, Ph.D.



Re: encoding confusion with \copy command

From
Adrian Klaver
Date:
On 09/17/2014 06:08 PM, John DeSoi wrote:
>
> On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:
>
>> You are right that I need an intermediate step.  I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
>>
>
> If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.

Have you tried Save As Unicode:

http://www.ablebits.com/office-addins-blog/2014/04/24/convert-excel-csv/#export-csv-utf8

I vaguely remember using Access to pull in an Excel file and saving CSV
from there. I also seem to remember an option when saving a text file to
create an export template that allowed you to specify the csv options
and formatting. This was with the Pro version of Office and the data
tools installed.

What I do now is use the Python xlrd module to read in the Excel file
and then the csv module to output a CSV file.
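
Roughly like this (Python 3; the file names are just examples):

    import csv
    import xlrd

    book = xlrd.open_workbook('clients.xls')
    sheet = book.sheet_by_index(0)

    with open('clients.csv', 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for rownum in range(sheet.nrows):
            row = sheet.row_values(rownum)
            # Pad short rows so every line has as many columns as the sheet,
            # which sidesteps the fewer-columns-than-the-header problem.
            row += [''] * (sheet.ncols - len(row))
            writer.writerow(row)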

>
> John DeSoi, Ph.D.
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: encoding confusion with \copy command

From
John R Pierce
Date:
On 9/17/2014 6:08 PM, John DeSoi wrote:
> On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:
>
>> You are right that I need an intermediate step.  I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
>>
> If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.

I've used the npgsql .net driver http://npgsql.projects.pgfoundry.org/
with excel... you can directly query tables into spreadsheets via the
'data' feature in excel.



--
john r pierce                                      37N 122W
somewhere on the middle of the left coast