Thread: pgsql cannot read utf8 files moved from windows correctly!
Hi,

I copied a table from SQL Server 2005 to a text file (it contains many Chinese words). I saved it as an ANSI-encoded file, but I can't open it in Ubuntu. I tried GBK, GB18030, and UTF8; it just could not be opened.

Then I saved it in Windows with UTF8 encoding, and now I can open it in Ubuntu. I copied it to PostgreSQL, but the file could not be read correctly. For example, here is a file:

--book.txt
bookid(int) bookname(varchar(30))
1 Java

I created a table "book" in PostgreSQL, then entered the command:

copy book from '/home/postgres/data/book.txt'

The error was:

error:invalid input syntax for integer:" 1";
context:line 1,column bookid

I know that every line of a UTF8 file starts with "fffe" or "feff" and ends with "\r\n" in Windows but not in Linux, so the character "1" has a space before it in the error line.

Is there any way I can transfer a UTF8 file from Windows to a Linux system?

Thank you!
On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
> I know that every line of utf8 files is started with "fffe" or "feff"
> and ended with "\r\n" in windows but not in linux,so the character
> "1" has a space before it in the error line.

Err, no. In UTF-16 files it is common to begin the *file* with that character, but UTF-8 doesn't have that character anywhere, it's illegal. Just stripping them out should be fine.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
> -- John F Kennedy
On 12/20/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
> > I know that every line of utf8 files is started with "fffe" or "feff"
> > and ended with "\r\n" in windows but not in linux,so the character
> > "1" has a space before it in the error line.
>
> Err, no. In UTF-16 files it is common to begin the *file* with that
> character, but UTF-8 doesn't have that character anywhere, it's
> illegal. Just stripping them out should be fine.

A BOM is perfectly legal in UTF-8, and it's commonly used as a signature to indicate the text is UTF-8 instead of another encoding. But yes, it is at the beginning of the file only.

http://unicode.org/faq/utf_bom.html#29
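For readers who want to try the "strip it out" advice, here is a minimal Python sketch. It assumes the file really is UTF-8 with the three-byte signature EF BB BF at the very start, and it also converts "\r\n" line endings to "\n". The function name `clean_for_copy` is made up for this illustration and is not part of PostgreSQL or any library.

```python
import codecs


def clean_for_copy(raw: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if present, and convert CRLF to LF.

    Hypothetical helper for illustration; it works on bytes so that
    the rest of the file content is passed through unchanged.
    """
    # codecs.BOM_UTF8 is b'\xef\xbb\xbf'; it can only appear as a
    # signature at the very start of the data.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Windows line endings become Unix line endings.
    return raw.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"1\tJava\r\n"
    print(clean_for_copy(sample))
```

To apply it to a real file, read the file in binary mode (`open(path, "rb")`), pass the bytes through the function, and write the result back in binary mode before running COPY.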
it seems the use of BOM in UTF-8 is discouraged
http://unicode.org/faq/utf_bom.html#BOM

FF FE is UTF16-Little Endian
FE FF is UTF16-Big Endian

Please verify-
Bedankt/
Martin-

----- Original Message -----
From: "Trevor Talbot" <quension@gmail.com>
To: <pgsql-general@postgresql.org>
Sent: Sunday, December 23, 2007 10:39 AM
Subject: Re: [GENERAL] pgsql cannot read utf8 files moved from windows correctly!

> On 12/20/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
> > On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
> > > I know that every line of utf8 files is started with "fffe" or "feff"
> > > and ended with "\r\n" in windows but not in linux,so the character
> > > "1" has a space before it in the error line.
> >
> > Err, no. In UTF-16 files it is common to begin the *file* with that
> > character, but UTF-8 doesn't have that character anywhere, it's
> > illegal. Just stripping them out should be fine.
>
> A BOM is perfectly legal in UTF-8, and it's commonly used as a
> signature to indicate the text is UTF-8 instead of another encoding.
> But yes, it is at the beginning of the file only.
>
> http://unicode.org/faq/utf_bom.html#29
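The byte values mentioned in this thread can be checked directly. The sketch below only looks at the first bytes of the data and reports which signature, if any, is present. `sniff_bom` is a hypothetical name, and the check is deliberately simple: it ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs
from typing import Optional


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None.

    Hypothetical helper for illustration. Limitation: a UTF-32-LE
    file (FF FE 00 00) would be reported as UTF-16-LE here.
    """
    if raw.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if raw.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


if __name__ == "__main__":
    print(sniff_bom(b"\xff\xfe1\x00"))
    print(sniff_bom(b"\xef\xbb\xbf1\tJava"))
    print(sniff_bom(b"1\tJava"))
```

Reading only the first few bytes of the file in binary mode is enough input for this check.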
bookman bookman wrote:
> H i,
>
> I copied a table in sqlserver2005 to a txt file(There were many
> chinese words in it).I saved it as a file encoded by ANSI,but I cant
> open it in ubuntu.I tried GBK,GB18030,
> UTF8,It just could not be opened.
>
> Then I save it in windows with encoding UTF8,then I can open it in
> ubuntu.I copied it to postgresql,but the file could not be read
> correctly.For example,here is a file:
>
> --book.txt
> bookid(int) bookname(varchar(30))
> 1 Java
>
> I created a table "book" in postgre,then I input the command line:
> copy book from '/home/postgres/data/book.txt'
> The error was:
> error:invalid input syntax for integer:" 1";
> context:line 1,column bookid
> I know that every line of utf8 files is started with "fffe" or "feff"
> and ended with "\r\n" in windows but not in linux,so the character
> "1" has a space before it in the error line.
>

Not long ago I ran into a similar problem with UTF-8 and BOM. It turned out that a client of mine had edited some files in an old version of Homesite for Windows, which has a bit of an issue in this area:

http://kb.adobe.com/selfservice/viewContent.do?externalId=tn_19059&sliceId=1

Perhaps yours is a related problem?

brian
On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:
> it seems the use of BOM in UTF-8 is discouraged
> http://unicode.org/faq/utf_bom.html#BOM

Where do you see it being discouraged?
the specifics..

Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

M--

----- Original Message -----
From: "Trevor Talbot" <quension@gmail.com>
To: <pgsql-general@postgresql.org>
Sent: Sunday, December 23, 2007 1:55 PM
Subject: Re: [GENERAL] pgsql cannot read utf8 files moved from windows correctly!

> On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:
> > it seems the use of BOM in UTF-8 is discouraged
> > http://unicode.org/faq/utf_bom.html#BOM
>
> Where do you see it being discouraged?
On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:
> the specifics..
>
> Some byte oriented protocols expect ASCII characters at the beginning of a
> file.
> If UTF-8 is used with these protocols, use of the BOM as encoding form
> signature should be avoided.

Sure, but that isn't true of generic text files, which is one of the major applications of a UTF-8 BOM. Especially when said text files are being fed to something that understands multiple encodings. The other items on that page say as much...
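As one example of a reader that treats the BOM as a signature and not as data, Python's standard "utf-8-sig" codec skips a leading BOM when decoding and accepts input without one. This is only a sketch of the idea; `decode_text` is a hypothetical name, and nothing here describes how PostgreSQL's COPY handles its input.

```python
def decode_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, ignoring an optional leading BOM.

    Hypothetical helper for illustration. The 'utf-8-sig' codec
    removes the signature if it is present and otherwise behaves
    like plain 'utf-8'.
    """
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    with_bom = b"\xef\xbb\xbf1\tJava\n"
    without_bom = b"1\tJava\n"
    # Both inputs decode to the same text.
    print(decode_text(with_bom) == decode_text(without_bom))
```

The same codec name can be passed as the `encoding` argument of `open()` when reading such a file as text.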