Thread: MSSQL to PostgreSQL : Encoding problem
Hi list ! I already posted this as "COPY FROM encoding error", but I have been doing some more tests since then. I'm trying to export data from MS SQL Server to PostgreSQL. The tables are quite big (>20M rows), so a CSV export and a "COPY FROM" import seems to be the only reasonable solution. In DTS, I have 3 options to export a table as a text file : ANSI, OEM and UNICODE. I tried all these options (and I have three files, one for each). I then try to import into PostgreSQL. The furthest I get is when using the UNICODE export, and importing it using a client_encoding set to UTF8 (I also tried WIN1252, LATIN9, LATIN1, ...). The copy then stops with an error : ERROR: invalid byte sequence for encoding "UTF8": 0xff État SQL :22021 The problematic character is the euro currency symbol. Does anyone know how I can solve this ? Thanks a lot ! -- Arnaud
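[Editorial aside, not part of the original thread: the 0xff byte is a strong clue. DTS's "UNICODE" export is UTF-16 (little-endian, opening with the byte-order mark 0xFF 0xFE), which COPY cannot read as UTF-8. A small Python sketch of how the euro sign and that BOM look at the byte level in the encodings discussed here:]

```python
# How the euro sign looks, byte for byte, in the encodings tried in the thread.
import codecs

euro = "\u20ac"  # €

print(euro.encode("cp1252"))       # Windows-1252: b'\x80'
print(euro.encode("iso-8859-15"))  # Latin9:       b'\xa4'
print(euro.encode("utf-8"))        # b'\xe2\x82\xac'
print(euro.encode("utf-16-le"))    # 0xAC 0x20

# Latin1 has no euro sign at all (as noted later in the thread):
try:
    euro.encode("latin-1")
except UnicodeEncodeError:
    print("no euro in Latin1")

# A UTF-16LE file begins with the byte-order mark 0xFF 0xFE -- the very
# 0xff byte COPY rejects when the file is fed to it as UTF-8.
print(codecs.BOM_UTF16_LE)         # b'\xff\xfe'
```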
Arnaud Lesauvage wrote: > Hi list ! > > I already posted this as "COPY FROM encoding error", but I have been > doing some more tests since then. > > I'm trying to export data from MS SQL Server to PostgreSQL. > The tables are quite big (>20M rows), so a CSV export and a "COPY FROM3 > import seems to be the only reasonable solution. Or go via MS-Access/Perl and ODBC/DBI perhaps? > In DTS, I have 3 options to export a table as a text file : ANSI, OEM > and UNICODE. > I tried all these options (and I have three files, one for each). Well, what character-set is your database in? > I then try to import into PostgreSQL. The farther I can get is when > using the UNICODE export, and importing it using a client_encoding set > to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...). > The copy then stops with an error : > ERROR: invalid byte sequence for encoding "UTF8": 0xff > État SQL :22021 > > The problematic character is the euro currency symbol. You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 supports that character-number but it is used for a different symbol. Your first step needs to be to find out what character-set your data is in. Your second is then to decide what char-set you want to use to store it in PG. Then you can decide how to get there. -- Richard Huxton Archonet Ltd
Arnaud Lesauvage wrote: > > >> I then try to import into PostgreSQL. The farther I can get is when >> using the UNICODE export, and importing it using a client_encoding >> set to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...). >> The copy then stops with an error : >> ERROR: invalid byte sequence for encoding "UTF8": 0xff >> État SQL :22021 >> >> The problematic character is the euro currency symbol. > > Exporting from MS SQL Server as Unicode is going to give you full Unicode, not UTF8. Full Unicode is 2 bytes per character and UTF8 is 1, same as ASCII. You will have to encode the Unicode data to UTF8. I have done this in Delphi using its built-in UTF8 encoding and decoding routines. You can get a free copy of Delphi Turbo Explorer which includes components for MS SQL Server and ODBC, so it would be pretty straightforward to get this working. The actual method in Delphi is system.UTF8Encode(widestring). This will encode Unicode to UTF8, which is compatible with a PostgreSQL UTF8 database. I am sure Perl could do it also. Hope this helps. -- Tony Caduto AM Software Design http://www.amsoftwaredesign.com Home of PG Lightning Admin for Postgresql Your best bet for Postgresql Administration
Tony Caduto wrote: > Arnaud Lesauvage wrote: >> >> >>> I then try to import into PostgreSQL. The farther I can get is when >>> using the UNICODE export, and importing it using a client_encoding >>> set to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...). >>> The copy then stops with an error : >>> ERROR: invalid byte sequence for encoding "UTF8": 0xff >>> État SQL :22021 >>> >>> The problematic character is the euro currency symbol. >> >> > Exporting from MS SQL server as unicode is going to give you full > Unicode, not UTF8. Full unicde is 2 bytes per character and UTF8 is 1, > same as ASCII. > You will have to encode the Unicode data to UTF8 Well, UTF8 is a minimum of one byte, but can be longer for non-ASCII characters. The idea being that chars below 128 map to ASCII. There's also UTF16 and I believe UTF32 with 2+ and four byte characters. > I have done this in Delphi using it's built in UTF8 encoding and > decoding routines. You can get a free copy of Delphi Turbo Explorer > which includes components for MS SQL server and ODBC, so it would be > pretty straight forward to get this working. > > The actual method in Delphi is system.UTF8Encode(widestring). This will > encode unicode to UTF8 which is compatible with a Postgresql UTF8 database. Ah, that's useful to know. Windows just doesn't have the same quantity of tools installed as a *nix platform. > I am sure Perl could do it also. And in one line if you're clever enough no doubt ;-) -- Richard Huxton Archonet Ltd
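[Richard's correction can be made concrete with a small illustrative Python sketch: code points below 128 encode in UTF-8 to the same single byte as ASCII, other characters take two to four bytes, and UTF-16 itself needs a four-byte surrogate pair outside the Basic Multilingual Plane:]

```python
# UTF-8 is variable-width (1-4 bytes per character); UTF-16 uses 2 bytes,
# or 4 for a surrogate pair outside the Basic Multilingual Plane.
for ch in ["A", "\u00e9", "\u20ac", "\U00010348"]:   # A, é, €, 𐍈
    print(repr(ch),
          len(ch.encode("utf-8")), "bytes in UTF-8,",
          len(ch.encode("utf-16-le")), "bytes in UTF-16")

# The design point Richard mentions: ASCII maps to itself in UTF-8.
assert "A".encode("utf-8") == "A".encode("ascii") == b"A"
```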
Tomi NA a écrit : >> I think I'll go this way... No other choice, actually ! >> The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS. >> I don't really understand what this is. It supports the euro >> symbol, so it is probably not pure LATIN1, right ? > > I suppose you'd have to look at the latin1 codepage character table > somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the > question. :) Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that LATIN1 is missing the euro sign... Grrrrr I hate this !!! -- Arnaud
Richard Huxton a écrit : > Arnaud Lesauvage wrote: >> Hi list ! >> >> I already posted this as "COPY FROM encoding error", but I have been >> doing some more tests since then. >> >> I'm trying to export data from MS SQL Server to PostgreSQL. >> The tables are quite big (>20M rows), so a CSV export and a "COPY FROM" >> import seems to be the only reasonable solution. > > Or go via MS-Access/Perl and ODBC/DBI perhaps? Yes, I think it would work. The problem is that the DB is too big for this kind of export. Using DTS from MSSQL to export directly to PostgreSQL using the psqlODBC Unicode Driver, I exported ~1000 rows per second in a 2-column table with ~20M rows. That means several days just for this table, and I have bigger ones ! >> In DTS, I have 3 options to export a table as a text file : ANSI, OEM >> and UNICODE. >> I tried all these options (and I have three files, one for each). > > Well, what character-set is your database in? Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS. DTS documentation tells me that exporting in ANSI should export using the current codepage. According to my local settings, my codepage is Windows-1252. This file is not correctly read by COPY when using client_encoding of WIN1252 though... >> I then try to import into PostgreSQL. The farther I can get is when >> using the UNICODE export, and importing it using a client_encoding set >> to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...). >> The copy then stops with an error : >> ERROR: invalid byte sequence for encoding "UTF8": 0xff >> État SQL :22021 >> >> The problematic character is the euro currency symbol. > > You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 supports that > character-number but it is used for a different symbol. > > Your first step needs to be to find out what character-set your data is in. > Your second is then to decide what char-set you want to use to store it > in PG. > Then you can decide how to get there.
In PG, UTF8 was my choice (the DB already exists, I am just adding some tables that are still stored in MSSQL), and according to what you say this was the right choice. The problem is really about reading this file I think. I thought that given the character sets available in PostgreSQL, I would be able to COPY directly from my exported files. If I have to convert them using some third party tool, I'll do that, but that's a bit more painful... -- Arnaud
2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>: > Hi list ! > > I already posted this as "COPY FROM encoding error", but I have > been doing some more tests since then. > > I'm trying to export data from MS SQL Server to PostgreSQL. > The tables are quite big (>20M rows), so a CSV export and a "COPY > FROM3 import seems to be the only reasonable solution. I believe you might have more luck working without files altogether. Use an ETL tool like kettle or even DTS with the pgsql ODBC driver. That's exactly what those tools are for. You still have to get the encodings right, though. I suggest unicode for pgsql, but only you know how the MSSQL database is encoded. t.n.a.
Richard Huxton a écrit : > Arnaud Lesauvage wrote: >> Richard Huxton a écrit : >>> >>> Or go via MS-Access/Perl and ODBC/DBI perhaps? >> >> Yes, I think it would work. The problem is that the DB is too big for >> this kind of export. Using DTS from MSSQL to export directly to >> PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per >> second in a 2-column table with ~20M rows. That means several days just >> for this table, and I have bigger ones ! > > Well it's about 0.25 days, but if it's too long, it's too long. Sure, sorry for the confusion, the problem is with the other tables (same number of rows but a lot of columns, some very large). >> > Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS. >> > DTS documentation tells me that exporting in ANSI should export using >> > the current codepage. >> > According to my local setting, my codepage is Windows-1252. >> > This file is not correctly read by COPY when using client_encoding of >> > WIN1252 though... > > Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to > check the file and confirm one way or the other. Anyone else on the list > got an idea? I just downloaded the GnuWin32 version of iconv. I'm giving it a try and I'll tell you how it went. -- Arnaud
Arnaud Lesauvage wrote: > Tomi NA a écrit : > >>I think I'll go this way... No other choice, actually ! > >>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS. > >>I don't really understand what this is. It supports the euro > >>symbol, so it is probably not pure LATIN1, right ? > > > >I suppose you'd have to look at the latin1 codepage character table > >somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the > >question. :) > > Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that > LATIN1 is missing the euro sign... > Grrrrr I hate this !!! So use Latin9 ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
2006/11/22, Arnaud Lesauvage <thewild@freesurf.fr>: > Tomi NA a écrit : > > 2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>: > >> Hi list ! > >> > >> I already posted this as "COPY FROM encoding error", but I have > >> been doing some more tests since then. > >> > >> I'm trying to export data from MS SQL Server to PostgreSQL. > >> The tables are quite big (>20M rows), so a CSV export and a "COPY > >> FROM3 import seems to be the only reasonable solution. > > > > I believe you might have more luck working without files altogether. > > Use an ETL tool like kettle or even DTS with the pgsql ODBC driver. > > That's exactly what those tools are for. > > > > You still have to get the encodings right, though. > > I suggest unicode for pgsql, but only you know how the MSSQL database > > is encoded. > > I think I'll go this way... No other choice, actually ! > The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS. > I don't really understand what this is. It supports the euro > symbol, so it is probably not pure LATIN1, right ? I suppose you'd have to look at the latin1 codepage character table somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the question. :) t.n.a.
> > I have done this in Delphi using it's built in UTF8 encoding and > > decoding routines. You can get a free copy of Delphi > Turbo Explorer > > which includes components for MS SQL server and ODBC, so it > would be > > pretty straight forward to get this working. > > > > The actual method in Delphi is system.UTF8Encode(widestring). This > > will encode unicode to UTF8 which is compatible with a > Postgresql UTF8 database. > > Ah, that's useful to know. Windows just doesn't have the same > quantity of tools installed as a *nix platform. If your file is small enough, you can just open it up in Notepad and re-save it as UTF8. It might play funny with the BOMs though (byte-order marks). There is also, IIRC, an iconv binary available for Windows that should be able to do such a conversion. Can't remember where though :-) //Magnus
> >> I already posted this as "COPY FROM encoding error", but I > have been >> doing some more tests since then. >> >> I'm trying to export data from MS SQL Server to PostgreSQL. >> The tables are quite big (>20M rows), so a CSV export and a "COPY >> FROM" import seems to be the only reasonable solution. > > > Or go via MS-Access/Perl and ODBC/DBI perhaps? > > Yes, I think it would work. The problem is that the DB is too > big for this kind of export. Using DTS from MSSQL to export > directly to PostgreSQL using psqlODBC Unicode Driver, I > exported ~1000 rows per second in a 2-column table with ~20M > rows. That means several days just for this table, and I have > bigger ones ! > Interesting. What did you set the "Insert batch size" to? (I think that's available for all transformation tasks). And did you remember to check the box for "use transactions"? While it's never as fast as a COPY, it should be possible to make it faster than that, I think. Another option is to just BCP the file out, and then COPY it into postgresql. No nice GUI, but you can tune almost everything with BCP. //Magnus
Alvaro Herrera a écrit : > Arnaud Lesauvage wrote: >> Tomi NA a écrit : >> >>I think I'll go this way... No other choice, actually ! >> >>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS. >> >>I don't really understand what this is. It supports the euro >> >>symbol, so it is probably not pure LATIN1, right ? >> > >> >I suppose you'd have to look at the latin1 codepage character table >> >somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the >> >question. :) >> >> Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that >> LATIN1 is missing the euro sign... >> Grrrrr I hate this !!! > > So use Latin9 ... Of course, but it doesn't work !!! Whatever client encoding I choose in postgresql before COPYing, I get the 'invalid byte sequence error'. The farther I can get is exporting to UNICODE and importing as UTF8. Then COPY only breaks on the euro symbol (otherwise it breaks very early, I think on the first "non-ascii" character). -- Arnaud
Tomi NA a écrit : > 2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>: >> Hi list ! >> >> I already posted this as "COPY FROM encoding error", but I have >> been doing some more tests since then. >> >> I'm trying to export data from MS SQL Server to PostgreSQL. >> The tables are quite big (>20M rows), so a CSV export and a "COPY >> FROM3 import seems to be the only reasonable solution. > > I believe you might have more luck working without files altogether. > Use an ETL tool like kettle or even DTS with the pgsql ODBC driver. > That's exactly what those tools are for. > > You still have to get the encodings right, though. > I suggest unicode for pgsql, but only you know how the MSSQL database > is encoded. I think I'll go this way... No other choice, actually ! The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS. I don't really understand what this is. It supports the euro symbol, so it is probably not pure LATIN1, right ? -- Arnaud
Arnaud Lesauvage wrote: > Alvaro Herrera a écrit : > >Arnaud Lesauvage wrote: > >>Tomi NA a écrit : > >>>>I think I'll go this way... No other choice, actually ! > >>>>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS. > >>>>I don't really understand what this is. It supports the euro > >>>>symbol, so it is probably not pure LATIN1, right ? > >>> > >>>I suppose you'd have to look at the latin1 codepage character table > >>>somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the > >>>question. :) > >> > >>Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that > >>LATIN1 is missing the euro sign... > >>Grrrrr I hate this !!! > > > >So use Latin9 ... > > Of course, but it doesn't work !!! > Whatever client encoding I choose in postgresql before > COPYing, I get the 'invalid byte sequence error'. Humm ... how are you choosing the client encoding? Is it actually working? I don't see how choosing Latin1 or Latin9 and feeding whatever byte sequence would give you an "invalid byte sequence". These charsets don't have any way to validate the bytes, as opposed to what UTF-8 can do. So you could end up with invalid bytes if you choose the wrong client encoding, but that's a different error. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Arnaud Lesauvage wrote: > Richard Huxton a écrit : >> >> Or go via MS-Access/Perl and ODBC/DBI perhaps? > > Yes, I think it would work. The problem is that the DB is too big for > this king of export. Using DTS from MSSQL to export directly to > PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per > second in a 2-columns table with ~20M rows. That means several days just > for this table, and I have bigger ones ! Well it's about 0.25 days, but if it's too long, it's too long. >>> In DTS, I have 3 options to export a table as a text file : ANSI, OEM >>> and UNICODE. >>> I tried all these options (and I have three files, one for each). >> >> Well, what character-set is your database in? > > > Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS. > DTS documentation tells me that exporting in ANSI should export using > the current codepage. > According to my local setting, my codepage is Windows-1252. > This file is not correctly read by COPY when using client_encoding of > WIN1252 though... Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to check the file and confirm one way or the other. Anyone else on the list got an idea? -- Richard Huxton Archonet Ltd
> > Of course, but it doesn't work !!! > Whatever client encoding I choose in postgresql before COPYing, I get > the 'invalid byte sequence error'. > > The farther I can get is exporting to UNICODE and importing as UTF8. > Then COPY only breaks on the euro symbol (otherwise it breaks very > early, I think on the first "non-ascii" character). > > -- Like I said before, UNICODE <> UTF8. That's why the COPY command breaks on the Euro symbol. You will have to export as UNICODE, then encode it as UTF8, then you won't get the breakage. UTF8 is simply a means to store UNICODE pretty much as ASCII text. You could grab a copy of Delphi TurboExplorer and create an import routine using the dbGO ADO components and the PG ODBC driver. Basically you need to encode any UNICODE data going to the PG server with the system.utf8encode function: [Delphi] function *UTF8Encode*(const WS: WideString): UTF8String; Call Utf8Encode to convert a Unicode string to UTF-8. WS is the Unicode string to convert. Utf8Encode returns the corresponding UTF-8 string. I would imagine that Perl also has such routines, but I don't know for sure. These routines might be in FreePascal as well. -- Tony Caduto AM Software Design http://www.amsoftwaredesign.com Home of PG Lightning Admin for Postgresql Your best bet for Postgresql Administration
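[Tony's Delphi recipe translates directly to other languages. As an illustrative sketch (sample row invented here), the Python equivalent of UTF8Encode is just a decode from UTF-16 followed by an encode to UTF-8:]

```python
# What Delphi's UTF8Encode accomplishes, in Python terms: a DTS "UNICODE"
# export is UTF-16; decode it to an abstract text string, then re-encode
# as UTF-8 for PostgreSQL.
row = "champrecherche;\u20ac 12,50\n"     # invented sample row with a euro sign

utf16_bytes = row.encode("utf-16")        # roughly what DTS writes (with a BOM)
decoded = utf16_bytes.decode("utf-16")    # back to a Unicode string
utf8_bytes = decoded.encode("utf-8")      # what COPY with client_encoding UTF8 expects

assert decoded == row
print(utf8_bytes)
```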
Magnus Hagander a écrit : >> > I have done this in Delphi using it's built in UTF8 encoding and >> > decoding routines. You can get a free copy of Delphi >> Turbo Explorer >> > which includes components for MS SQL server and ODBC, so it >> would be >> > pretty straight forward to get this working. >> > >> > The actual method in Delphi is system.UTF8Encode(widestring). This >> > will encode unicode to UTF8 which is compatible with a >> Postgresql UTF8 database. >> >> Ah, that's useful to know. Windows just doesn't have the same >> quantity of tools installed as a *nix platform. > > If your file is small enough, you can just open it up in Notepad and > re-save it as UTF8. It might play funny with the BOMs though > (byte-order marks). > > There is also, IIRC, an iconv binary available for Windows that should > be able to do such a conversion. Can't remember where though :-) The file is way too big for notepad. It is even too big for notepad++. I do have the GnuWin32 version of iconv (*great* software collection, BTW), but still no go... I tried iconv -f "CP1252" -t "UTF-8" detailrecherche_ansi.csv >detailrecherche_cp1252utf8.csv and iconv -f "LATIN-9" -t "UTF-8" detailrecherche_ansi.csv >detailrecherche_latin9utf8.csv Both don't want to load as UTF8 (invalid byte sequence 0x00). I am desperate... -- Arnaud
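[When iconv's -f guess keeps failing, it can help to peek at the file's first bytes before picking a source encoding. An illustrative sketch (the helper name and sample file are invented, not from the thread):]

```python
# Peek at a file's leading bytes to guess its encoding before choosing
# iconv's -f flag. The sniff() helper and sample file are illustrative.
import codecs
import tempfile

def sniff(path):
    head = open(path, "rb").read(4)
    if head.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE (BOM ff fe)"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE (BOM fe ff)"
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    return "no BOM, starts with " + head.hex()

# Demo on a fake DTS-style "UNICODE" export ("utf-16" writes a BOM):
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write("gid;champ;\u20ac\n".encode("utf-16"))
print(sniff(f.name))
```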
Alvaro Herrera a écrit : > Arnaud Lesauvage wrote: >> Alvaro Herrera a écrit : >> >Arnaud Lesauvage wrote: >> >>Tomi NA a écrit : >> >>>>I think I'll go this way... No other choice, actually ! >> >>>>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS. >> >>>>I don't really understand what this is. It supports the euro >> >>>>symbol, so it is probably not pure LATIN1, right ? >> >>> >> >>>I suppose you'd have to look at the latin1 codepage character table >> >>>somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the >> >>>question. :) >> >> >> >>Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that >> >>LATIN1 is missing the euro sign... >> >>Grrrrr I hate this !!! >> > >> >So use Latin9 ... >> >> Of course, but it doesn't work !!! >> Whatever client encoding I choose in postgresql before >> COPYing, I get the 'invalid byte sequence error'. > > Humm ... how are you choosing the client encoding? Is it actually > working? I don't see how choosing Latin1 or Latin9 and feeding whatever > byte sequence would give you an "invalid byte sequence". These charsets > don't have any way to validate the bytes, as opposed to what UTF-8 can > do. So you could end up with invalid bytes if you choose the wrong > client encoding, but that's a different error. > mydb=# SET client_encoding TO LATIN9; SET mydb=# COPY statistiques.detailrecherche (log_gid, champrecherche, valeurrecherche) FROM 'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV; ERROR: invalid byte sequence for encoding "LATIN9": 0x00 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". 
CONTEXT: COPY detailrecherche, line 9212 mydb=# SET client_encoding TO WIN1252; SET mydb=# COPY statistiques.detailrecherche (log_gid, champrecherche, valeurrecherche) FROM 'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV; ERROR: invalid byte sequence for encoding "WIN1252": 0x00 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". CONTEXT: COPY detailrecherche, line 9212 Really, I'd rather have another error, but this is all I can get. This is with the "ANSI" export. With the "UNICODE" export : mydb=# SET client_encoding TO UTF8; SET mydb=# COPY statistiques.detailrecherche (log_gid, champrecherche, valeurrecherche) FROM 'E:\\Production\\Temp\\detailrecherche_unicode.csv' CSV; ERROR: invalid byte sequence for encoding "UTF8": 0xff HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". CONTEXT: COPY detailrecherche, line 592680 So, line 592680 is *a lot* better, but it is still not good! -- Arnaud
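[One plausible reading of the 0x00 errors above, offered as an editorial aside rather than a diagnosis from the thread: if the "ANSI" file in fact carries UTF-16 data, or NUL characters in the data itself, then every ASCII character read through a single-byte encoding drags a 0x00 high byte with it:]

```python
# Plain ASCII encoded as UTF-16LE interleaves a 0x00 high byte after
# every character; read back as Latin9 or WIN1252, those NULs are
# exactly what COPY rejects.
sample = "abc".encode("utf-16-le")
print(sample)                     # b'a\x00b\x00c\x00'
assert sample == b"a\x00b\x00c\x00"
assert b"\x00" in sample
```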
>>>> Or go via MS-Access/Perl and ODBC/DBI perhaps? >>> >>> Yes, I think it would work. The problem is that the DB is too big for >>> this king of export. Using DTS from MSSQL to export directly to >>> PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per >>> second in a 2-columns table with ~20M rows. That means several days just >>> for this table, and I have bigger ones ! >> >> Well it's about 0.25 days, but if it's too long, it's too long. > > Sure, sorry for the confusion, the problem is with the other tables (same > number of rows but a lot of columns, some very large). > well, if its too slow, then you will have to dump the db to a textfile (DTS does this for you) and then convert the textfile to utf8 manually before importing it to pgsql. iconv for win32 will help you there. i found tho it removes some wanted special characters, so watch out. a less "scientific" approach would be using an unicode-aware texteditor to convert it (ultraedit does this pretty nicely, for example). have had good results with it. loading several million rows will always take some time, tho. - thomas
Arnaud Lesauvage wrote: > mydb=# SET client_encoding TO LATIN9; > SET > mydb=# COPY statistiques.detailrecherche (log_gid, > champrecherche, valeurrecherche) FROM > 'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV; > ERROR: invalid byte sequence for encoding "LATIN9": 0x00 > HINT: This error can also happen if the byte sequence does > not match the encoding expected by the server, which is > controlled by "client_encoding". Huh, why do you have a "0x00" byte in there? That's certainly not Latin9 (nor UTF8 as far as I know). Is the file actually Latin-something or did you convert it to something else at some point? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Arnaud Lesauvage wrote: > Alvaro Herrera a écrit : > >Arnaud Lesauvage wrote: > > > >>mydb=# SET client_encoding TO LATIN9; > >>SET > >>mydb=# COPY statistiques.detailrecherche (log_gid, > >>champrecherche, valeurrecherche) FROM > >>'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV; > >>ERROR: invalid byte sequence for encoding "LATIN9": 0x00 > >>HINT: This error can also happen if the byte sequence does > >>not match the encoding expected by the server, which is > >>controlled by "client_encoding". > > > >Huh, why do you have a "0x00" byte in there? That's certainly not > >Latin9 (nor UTF8 as far as I know). > > > >Is the file actually Latin-something or did you convert it to something > >else at some point? > > This is the file generated by DTS with "ANSI" encoding. It > was not altered in any way after that ! > The doc states that ANSI exports with the local codepage > (which is Win1252). That's all I know. :( I thought Win1252 was supposed to be almost the same as Latin1. While I'd expect certain differences, I wouldn't expect it to use 0x00 as data! Maybe you could have DTS export Unicode, which would presumably be UTF-16, then recode that to something else (possibly UTF-8) with GNU iconv. FWIW, I think the preferred way to set the client encoding on psql is \encoding. I'm not sure if it does anything different from the SET command though. -- Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4" "In Europe they call me Niklaus Wirth; in the US they call me Nickel's worth. That's because in Europe they call me by name, and in the US by value!"
Alvaro Herrera a écrit : > Arnaud Lesauvage wrote: >> Alvaro Herrera a écrit : >> >Arnaud Lesauvage wrote: >> > >> >>mydb=# SET client_encoding TO LATIN9; >> >>SET >> >>mydb=# COPY statistiques.detailrecherche (log_gid, >> >>champrecherche, valeurrecherche) FROM >> >>'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV; >> >>ERROR: invalid byte sequence for encoding "LATIN9": 0x00 >> >>HINT: This error can also happen if the byte sequence does >> >>not match the encoding expected by the server, which is >> >>controlled by "client_encoding". >> > >> >Huh, why do you have a "0x00" byte in there? That's certainly not >> >Latin9 (nor UTF8 as far as I know). >> > >> >Is the file actually Latin-something or did you convert it to something >> >else at some point? >> >> This is the file generated by DTS with "ANSI" encoding. It >> was not altered in any way after that ! >> The doc states that ANSI exports with the local codepage >> (which is Win1252). That's all I know. :( > > I thought Win1252 was supposed to be almost the same as Latin1. While > I'd expect certain differences, I wouldn't expect it to use 0x00 as > data! > > Maybe you could have DTS export Unicode, which would presumably be > UTF-16, then recode that to something else (possibly UTF-8) with GNU > iconv. UTF-16 ! That's something I haven't tried ! I'll try an iconv conversion tomorrow from UTF16 to UTF8 ! -- Arnaud
Arnaud Lesauvage wrote: > > I thought Win1252 was supposed to be almost the same as Latin1. While > > I'd expect certain differences, I wouldn't expect it to use 0x00 as > > data! > > > > Maybe you could have DTS export Unicode, which would presumably be > > UTF-16, then recode that to something else (possibly UTF-8) with GNU > > iconv. > > UTF-16 ! That's something I haven't tried ! > I'll try an iconv conversion tomorrow from UTF16 to UTF8 ! Right! To clarify, Unicode is the character set, and UTF8 and UTF16 are ways of representing that character set in 8-bit and 16-bit segments, respectively. PostgreSQL only supports UTF8, and Win32 only supports UTF16 in the operating system. And 0x00 is not a valid value in any of those, that I know of, but perhaps it is in UTF16. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> > > I thought Win1252 was supposed to be almost the same as Latin1. > > > While I'd expect certain differences, I wouldn't expect it to use > > > 0x00 as data! > > > > > > Maybe you could have DTS export Unicode, which would > presumably be > > > UTF-16, then recode that to something else (possibly > UTF-8) with GNU > > > iconv. > > > > UTF-16 ! That's something I haven't tried ! > > I'll try an iconv conversion tomorrow from UTF16 to UTF8 ! > > Right! To clarify, Unicode is the character set, and UTF8 > and UTF16 are ways of representing that characters set in > 8-bit and 16-bit segments, respectively. PostgreSQL only > suports UTF8, and Win32 only supports > UTF16 in the operating system. And 0x00 is not a valid value > in any of those, that I know of, but perhaps it is in UTF16. Actually, Win32 supports UTF8 as well. There are a few operations that aren't supported on it, but you can certainly read and write files in it from most builtin apps. One other problem is that most (all) Win32 documentation talks about UNICODE when they mean UTF16 (in <= NT4, UCS-2). And PostgreSQL used to say UNICODE when we meant UTF8. Adds to the confusion. Finally, UTF-8 does not represent the characters in 8-bit segments - it can use anything from 8 to 32 bits. UTF-16 always uses 16 bits. This also means that you can't talk about "0x00 being valid" in UTF-16, because all characters are 16-bit. It would be "0x0000" or "0x00 0x00". But that requires an application that knows UTF16, which postgresql doesn't, so it reports on the first 0x00. //Magnus
It also might be a big/little endian problem, although I always thought that was platform specific, not locale specific. Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use the two-byte or four-byte versions of UCS encoding using the system's default endian setting. There are many Unicode codepage formats that iconv supports: UTF-8 ISO-10646-UCS-2 UCS-2 CSUNICODE UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11 UCS-2LE UNICODELITTLE ISO-10646-UCS-4 UCS-4 CSUCS4 UCS-4BE UCS-4LE UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE UNICODE-1-1-UTF-7 UTF-7 CSUNICODE11UTF7 UCS-2-INTERNAL UCS-2-SWAPPED UCS-4-INTERNAL UCS-4-SWAPPED Gee, didn't Unicode just so simplify this codepage mess? Remember when it was just ASCII, EBCDIC, ANSI, and localized codepages? -- Brandon Aiken CS/IT Systems Engineer -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Arnaud Lesauvage Sent: Wednesday, November 22, 2006 12:38 PM To: Arnaud Lesauvage; General Subject: Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem Alvaro Herrera a écrit : > Arnaud Lesauvage wrote: >> Alvaro Herrera a écrit : >> >Arnaud Lesauvage wrote: >> > >> >>mydb=# SET client_encoding TO LATIN9; >> >>SET >> >>mydb=# COPY statistiques.detailrecherche (log_gid, >> >>champrecherche, valeurrecherche) FROM >> >>'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV; >> >>ERROR: invalid byte sequence for encoding "LATIN9": 0x00 >> >>HINT: This error can also happen if the byte sequence does >> >>not match the encoding expected by the server, which is >> >>controlled by "client_encoding". >> > >> >Huh, why do you have a "0x00" byte in there? That's certainly not >> >Latin9 (nor UTF8 as far as I know). >> > >> >Is the file actually Latin-something or did you convert it to something >> >else at some point? >> >> This is the file generated by DTS with "ANSI" encoding. It >> was not altered in any way after that !
>> The doc states that ANSI exports with the local codepage
>> (which is Win1252). That's all I know. :(
>
> I thought Win1252 was supposed to be almost the same as Latin1. While
> I'd expect certain differences, I wouldn't expect it to use 0x00 as
> data!
>
> Maybe you could have DTS export Unicode, which would presumably be
> UTF-16, then recode that to something else (possibly UTF-8) with GNU
> iconv.

UTF-16 ! That's something I haven't tried !
I'll try an iconv conversion tomorrow from UTF16 to UTF8 !

--
Arnaud
2006/11/22, Brandon Aiken <BAiken@winemantech.com>:
> Gee, didn't Unicode just so simplify this codepage mess? Remember when
> it was just ASCII, EBCDIC, ANSI, and localized codepages?

Unicode is heaven-sent, compared to 3 or 4 codepages representing any
given (obviously non-English) language, and 3 or 4 more for every other
language you have to deal with in your application.
Perfect? Hardly. But then again, much more so than natural languages.
I'd say we'd deliver products 10-20% faster (in the company I work in)
if people had looked ahead a couple of decades ago and decided upon
something along the lines of Unicode instead of ASCII.

Cheers,
t.n.a.
On Wed, Nov 22, 2006 at 01:55:55PM -0500, Brandon Aiken wrote: > Gee, didn't Unicode just so simplify this codepage mess? Remember > when it was just ASCII, EBCDIC, ANSI, and localized codepages? I think that's one reason why Unix has standardised on UTF-8 rather than one of the other Unicode variants. For transmission between systems it's the easiest to get right... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Brandon Aiken wrote:
> It also might be a big/little endian problem, although I always
> thought that was platform specific, not locale specific.
>
> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which
> should use the two-byte or four-byte versions of UCS encoding using
> the system's default endian setting.

Guys, it worked !!!!
UCS-4-INTERNAL was the right choice !!!

I love you all !

(now I just have an out of memory problem, but that's going to be a new
thread)

--
Arnaud
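[Editor's note] A hedged aside on what the `-INTERNAL` iconv names mean in byte terms: on a little-endian machine, UCS-2-INTERNAL and UCS-4-INTERNAL should correspond to UTF-16-LE and UTF-32-LE respectively. A small Python illustration (stdlib only) of how byte order changes the on-disk bytes for the euro sign:

```python
# The euro sign (U+20AC) under two- and four-byte fixed-width encodings,
# in both byte orders (illustrative, stdlib only).
euro = "\u20ac"

print(euro.encode("utf-16-le").hex())  # ac20     -- 16-bit, little-endian
print(euro.encode("utf-16-be").hex())  # 20ac     -- 16-bit, big-endian
print(euro.encode("utf-32-le").hex())  # ac200000 -- 32-bit, little-endian
print(euro.encode("utf-32-be").hex())  # 000020ac -- 32-bit, big-endian
```

Picking the wrong byte order doesn't fail outright; it just produces different (wrong) characters, which is why an endianness mismatch is hard to spot until decoding breaks.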
Arnaud Lesauvage wrote:
> Brandon Aiken wrote:
>> It also might be a big/little endian problem, although I always
>> thought that was platform specific, not locale specific.
>>
>> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which
>> should use the two-byte or four-byte versions of UCS encoding using
>> the system's default endian setting.
>
> Guys, it worked !!!!
> UCS-4-INTERNAL was the right choice !!!
>
> I love you all !
>
> (now I just have an out of memory problem, but that's going
> to be a new thread)

Guys, it did not work !!! :(
I thought it worked because postgres seemed to be loading the file and
failing at the end with an "out of memory" error, but in fact I think
the conversion removed all end-of-line characters (one line of 1.5GB
was too much for COPY...).

Still searching !

--
Arnaud
2006/11/23, Arnaud Lesauvage <thewild@freesurf.fr>:
> Arnaud Lesauvage wrote:
> > Brandon Aiken wrote:
> >> It also might be a big/little endian problem, although I always
> >> thought that was platform specific, not locale specific.
> >>
> >> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which
> >> should use the two-byte or four-byte versions of UCS encoding using
> >> the system's default endian setting.
> >
> >
> > Guys, it worked !!!!
> > UCS-4-INTERNAL was the right choice !!!
> >
> > I love you all !
> >
> > (now I just have an out of memory problem, but that's going
> > to be a new thread)
>
> Guys, it did not work !!! :(
> I thought it worked because postgres seemed to be loading
> the file and failing at the end with an "out of memory"
> error, but in fact I think the conversion removed all
> end-of-line characters (one line of 1.5GB was too much for
> COPY...).
>
> Still searching !

It will take you a day or two to get started, and then a day or two to
get the job done, but you really might want to look into Kettle or some
other ETL tool to do the job.
It looks to me like you're trying to drive a screw with a hammer.

t.n.a.
Tomi NA wrote:
> 2006/11/23, Arnaud Lesauvage <thewild@freesurf.fr>:
>> Arnaud Lesauvage wrote:
>> > Brandon Aiken wrote:
>> >> It also might be a big/little endian problem, although I always
>> >> thought that was platform specific, not locale specific.
>> >>
>> >> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv,
>> >> which should use the two-byte or four-byte versions of UCS
>> >> encoding using the system's default endian setting.
>> >
>> >
>> > Guys, it worked !!!!
>> > UCS-4-INTERNAL was the right choice !!!
>> >
>> > I love you all !
>> >
>> > (now I just have an out of memory problem, but that's going
>> > to be a new thread)
>>
>> Guys, it did not work !!! :(
>> I thought it worked because postgres seemed to be loading
>> the file and failing at the end with an "out of memory"
>> error, but in fact I think the conversion removed all
>> end-of-line characters (one line of 1.5GB was too much for
>> COPY...).
>>
>> Still searching !
>
> It will take you a day or two to get started, and then a day or two to
> get the job done, but you really might want to look into Kettle or
> some other ETL tool to do the job.
> It looks to me like you're trying to drive a screw with a hammer.

Yes, I might try something else.
I was thinking that others would probably run into this problem
sometime, and that our investigations might help them.
I think I'll forget about this COPY stuff and just export with DTS
through the PostgreSQL Unicode ODBC driver.

--
Arnaud