Thread: encoding confusion with \copy command
Hi,
I have a PostgreSQL 7.4 server and client on CentOS 6.4. The database server is using UTF-8 encoding.
I have been exploring the use of the \copy command for importing CSV data generated by SQL Server 2008. The SQL Server 2008 export tool does not escape quotes that occur inside field content, so it is useful to be able to specify an obscure character for the QUOTE option of \copy to work around this issue.
When I run the following commands in psql, I am surprised that QUOTE is limited to characters in the range 0x01 - 0x7f, and that UTF8 is mentioned in the error message if characters outside the range are chosen:
\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252' QUOTE as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff
I thought that if the client (psql) is WIN1252, and the CSV file is specified as WIN1252, then I could specify any valid WIN1252 character as the quote character. Instead, I am limited to the range of characters that can be encoded as a single byte in UTF-8. Actually, 0x00 is not accepted either, so the range is 0x01 - 0x7F.
Is this a bug or expected behaviour?
Is it the case that the server does the actual CSV parsing, and that, given that my server is in UTF8, I am therefore limited to single-byte UTF8 characters?
regards,
Martin
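[A possible workaround, sketched as a Python pre-processing pass. This is an assumption-laden sketch, not something from the thread: it assumes the exporter writes unquoted, comma-separated fields with no embedded commas or newlines, so that embedded quotes can simply be doubled and the default QUOTE '"' used in \copy. The file paths are placeholders.]

```python
# Sketch: rewrite an exported CSV so that embedded quote characters are
# escaped (doubled), letting \copy use the default QUOTE '"'.
# Assumes: fields are NOT quoted by the exporter, and contain no
# embedded commas or newlines -- verify this against the real export.
import csv

def fix_csv(src_path, dst_path, encoding="cp1252"):
    with open(src_path, newline="", encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding=encoding) as dst:
        # QUOTE_ALL quotes every field and doubles any embedded quotes,
        # producing strict CSV that COPY can parse.
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for line in src:
            writer.writerow(line.rstrip("\r\n").split(","))
```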
Hi Adrian,
I apologise - I meant 9.4
regards,
Martin
On 17 September 2014 14:35, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/17/2014 03:03 AM, Martin Waite wrote:
Hi,
I have a postgresql 7.4 server and client on Centos 6.4. The database
server is using UTF-8 encoding.
First I think we need to establish what version of Postgres you are using. Are you really using 7.4?
I have been exploring the use of the \copy command for importing CSV
data generated by SQL Server 2008. SQL Server 2008 export tool does not
escape quotes that are in the content of fields, and so it is useful to
be able to specify obscure characters in the quote option in the \copy
command to work around this issue.
When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is
mentioned in the error message if characters outside the range are chosen:
\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252'
QUOTE as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff
If you are actually on Postgres 7.4 the above would not be a viable command.
I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character
as the quote character. Instead, I am limited to the range of
characters that can be encoded as a single byte in UTF-8. Actually, 0x00
is not accepted either, so the range is 0x01 - 0x7F.
Is this a bug or expected behaviour ?
Is it the case that the server does the actual CSV parsing, and that
given that my server is in UTF8, I am therefore limited to single-byte
UTF8 characters ?
Actually depending on version you may be limited to ASCII.
regards,
Martin
--
Adrian Klaver
adrian.klaver@aklaver.com
You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
Thanks for the help.
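[A minimal sketch of such a liberal-in, strict-out converter using Python's csv module. The dialect settings and the WIN1252-to-UTF-8 transcoding are my assumptions, not something Martin specified; transcoding to the server encoding up front also sidesteps the QUOTE/encoding issue from earlier in the thread.]

```python
# Sketch: read CSV with Python's tolerant defaults, then write it back
# out as strict, minimally quoted CSV suitable for Postgres COPY.
# Assumes the input is WIN1252; output is transcoded to UTF-8.
import csv

def relax_then_strict(src_path, dst_path, src_encoding="cp1252"):
    with open(src_path, newline="", encoding=src_encoding) as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)                              # lenient parse
        writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)   # strict output
        for row in reader:
            writer.writerow(row)
```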
On 17 September 2014 15:40, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 09/17/2014 06:48 AM, Martin Waite wrote:
Hi Adrian,
I apologise - I meant 9.4
Looks like you will need an intermediate step. A quick search found the below which might offer a solution:
http://www.excel-sql-server.com/sql-server-export-to-excel-using-bcp-sqlcmd-csv.htm
FYI the good stuff is at the bottom.
There is also this SO answer that refers to SQL Server Management Studio (not sure if you have it):
http://stackoverflow.com/questions/6115054/how-to-get-export-output-in-real-csv-format-in-sql-server-managment-studio
On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:
> You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.
John DeSoi, Ph.D.
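[For the short-row problem John describes, a small sketch of my own (not from the thread): pad any row that is shorter than the header out to the header's width, so COPY sees a consistent column count. It assumes the first row is the header and that missing trailing cells should become empty strings.]

```python
# Sketch: pad rows that have fewer columns than the header row with
# empty strings, so every row has the same column count for COPY.
import csv

def pad_rows(src_path, dst_path, encoding="utf-8"):
    with open(src_path, newline="", encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding=encoding) as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header)
        for row in reader:
            # Append empty cells until the row matches the header width.
            writer.writerow(row + [""] * (len(header) - len(row)))
```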
On 09/17/2014 06:08 PM, John DeSoi wrote:
> On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:
>> You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
> If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.
Have you tried Save As Unicode:
http://www.ablebits.com/office-addins-blog/2014/04/24/convert-excel-csv/#export-csv-utf8
I vaguely remember using Access to pull in an Excel file and saving CSV from there. I also seem to remember an option when saving a text file to create an export template that allowed you to specify the csv options and formatting. This was with the Pro version of Office and the data tools installed.
What I do now is use the Python xlrd module to read in the Excel file and then the csv module to output a CSV file.
> John DeSoi, Ph.D.
--
Adrian Klaver
adrian.klaver@aklaver.com
On 9/17/2014 6:08 PM, John DeSoi wrote:
> On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:
>> You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
> If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.
I've used the npgsql .net driver http://npgsql.projects.pgfoundry.org/ with excel... you can directly query tables into spreadsheets via the 'data' feature in excel.
--
john r pierce                                      37N 122W
somewhere on the middle of the left coast