Thread: Character encodings...

Character encodings...

From
Michael Sobolev
Date:
I am trying to fill up a database using psql program. A file I have prepared
contains Russian in KOI8-R encoding.  When I try to process this file using
`psql -f file db', it fails: no diagnostics, nothing; it just shows that EOF is
reached.  When I replace Russian letters with something in ASCII, it works just
fine.  The main problem is that my second file gets processed just fine.

Where should I look?  What additional information is needed? :)

Thanks,

--
Mike

Re: Character encodings...

From
Oleg Broytmann
Date:
On Thu, 13 Apr 2000, Michael Sobolev wrote:
> I am trying to fill up a database using psql program. A file I have prepared
> contains Russian in KOI8-R encoding.  When I try to process this file using
> `psql -f file db', it fails: no diagnostics, nothing; it just shows that EOF is
> reached.  When I replace Russian letters with something in ASCII, it works just
> fine.  The main problem is that my second file gets processed just fine.
>
> Where should I look?  What additional information is needed? :)

   OS, locale, Postgres version, whether Postgres was compiled with locale,
multibyte...

Oleg.
----
    Oleg Broytmann    http://members.xoom.com/phd2.1/    phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.


Re: Character encodings...

From
Michael Sobolev
Date:
On Thu, Apr 13, 2000 at 10:20:39AM +0000, Oleg Broytmann wrote:
>    OS, locale, Postgres version, whether Postgres was compiled with locale,
> multibyte...
Debian GNU/Linux (frozen), 6.5.3-17 (-17 -- debian revision), yes, =UNICODE.

--
Mike

Re: Character encodings...

From
"Oliver Elphick"
Date:
Michael Sobolev wrote:
  >On Thu, Apr 13, 2000 at 10:20:39AM +0000, Oleg Broytmann wrote:
  >>    OS, locale, Postgres version, whether Postgres was compiled with locale,
  >> multibyte...
  >Debian GNU/Linux (frozen), 6.5.3-17 (-17 -- debian revision), yes, =UNICODE.

Turn on logging in the backend (edit /etc/postgresql/postmaster.init) and
restart the postmaster (/etc/init.d/postgresql restart).  See what you get
in the log.

--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
                 ========================================
     "I sought the LORD, and he heard me, and delivered me
      from all my fears."    Psalms 34:4



Re: Character encodings...

From
Oleg Broytmann
Date:
On Thu, 13 Apr 2000, Michael Sobolev wrote:
> >    OS, locale, Postgres version, whether Postgres was compiled with locale,
> > multibyte...
> Debian GNU/Linux (frozen), 6.5.3-17 (-17 -- debian revision), yes, =UNICODE.

   Not sure how well Postgres works with UNICODE. It works pretty well with
KOI8-R and Windows-1251 encodings...

Oleg.
----
    Oleg Broytmann    http://members.xoom.com/phd2.1/    phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.


Re: Character encodings...

From
Michael Sobolev
Date:
On Thu, Apr 13, 2000 at 11:54:17AM +0100, Oliver Elphick wrote:
> Turn on logging in the backend (edit /etc/postgresql/postmaster.init) and
> restart the postmaster (/etc/init.d/postgresql restart).  See what you get
> in the log.
What level of debug should be sufficient?

I have the impression that it is psql that does not process the input
correctly.

I have a very simple statement:

    insert into news values ('2000-04-13', NULL, '');

This works just fine.  Now I replace '' with 'A' (A -- 65).  It still works
just fine.  Now I replace this latin A with Russian A.  And psql shows:

$ psql -f test.sql stuff
insert into news values ('2000-04-12', NULL, 'А');
EOF

--
Mike

Re: Character encodings...

From
"Oliver Elphick"
Date:
Michael Sobolev wrote:
  >On Thu, Apr 13, 2000 at 11:54:17AM +0100, Oliver Elphick wrote:
  >> Turn on logging in the backend (edit /etc/postgresql/postmaster.init) and
  >> restart the postmaster (/etc/init.d/postgresql restart).  See what you get
  >> in the log.
  >What level of debug should be sufficient?

2

Also set PGECHO in postmaster.init, so that queries are echoed in the log.

  >I have the impression that it is psql that does not process the input
  >correctly.
  >
  >I have a very simple statement:
  >
  >    insert into news values ('2000-04-13', NULL, '');
  >
  >This works just fine.  Now I replace '' with 'A' (A -- 65).  It still works
  >just fine.  Now I replace this latin A with Russian A.  And psql shows:
  >
  >$ psql -f test.sql stuff
  >insert into news values ('2000-04-12', NULL, 'А');
  >EOF

The trouble is, I don't know how to test this.  How do I produce Russian
characters on an English keyboard?

--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
                 ========================================
     "I sought the LORD, and he heard me, and delivered me
      from all my fears."    Psalms 34:4



Re: Character encodings...

From
Michael Sobolev
Date:
On Thu, Apr 13, 2000 at 02:52:13PM +0100, Oliver Elphick wrote:
>   >What level of debug should be sufficient?
>
> 2
>
> Also set PGECHO in postmaster.init, so that queries are echoed in the log.
OK.  I'll try.

> The trouble is, I don't know how to test this.  How do I produce Russian
> characters on an English keyboard?
I am almost sure that this may fail for any character from the upper half of
the 8-bit range (codes 128-255).  In vim: ^V240 :)

--
Mike
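
For anyone without a Russian keyboard layout, a test file can also be generated
programmatically.  The following is a minimal Python sketch, assuming the `news`
table from the earlier message; the helper name `build_test_sql` and the output
file name are hypothetical and not part of the thread.

```python
def build_test_sql(byte_value: int) -> bytes:
    """Return the test INSERT with one raw byte inside the last string literal."""
    if not 0 <= byte_value <= 255:
        raise ValueError("byte_value must fit in a single byte")
    prefix = b"insert into news values ('2000-04-13', NULL, '"
    suffix = b"');\n"
    return prefix + bytes([byte_value]) + suffix


if __name__ == "__main__":
    # 240 is the value entered with ^V240 in vim; any value >= 128 lies in the
    # upper half of the 8-bit range.
    data = build_test_sql(240)
    with open("test.sql", "wb") as handle:  # hypothetical file name
        handle.write(data)
    print(data)
```

The file is written in binary mode so that the byte reaches disk unchanged,
whatever the locale of the machine that generates it.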

Re: Character encodings...

From
Michael Sobolev
Date:
On Thu, Apr 13, 2000 at 02:52:13PM +0100, Oliver Elphick wrote:
> 2
>
> Also set PGECHO in postmaster.init, so that queries are echoed in the log.
Here it goes.  I would not say it's very useful...  Russian a has code 225
(decimal).

--
Mike

binding ShmemCreate(key=52e2c1, size=2006016)
/usr/lib/postgresql/bin/postmaster: ServerLoop:        handling reading 4
/usr/lib/postgresql/bin/postmaster: ServerLoop:        handling reading 4
/usr/lib/postgresql/bin/postmaster: ServerLoop:        handling writing 4
/usr/lib/postgresql/bin/postmaster: BackendStartup: pid 30613 user mss db stuff socket 4
/usr/lib/postgresql/bin/postmaster child[30613]: starting with (/usr/lib/postgresql/bin/postgres -d2 -B 128 -E -v131072 -pstuff ) 
FindExec: found "/usr/lib/postgresql/bin/postgres" using argv[0]
debug info:
    User         = mss
    RemoteHost   = localhost
    RemotePort   = 0
    DatabaseName = stuff
    Verbose      = 2
    Noversion    = f
    timings      = f
    dates        = European
    bufsize      = 128
    sortmem      = 512
    query echo   = t
InitPostgres
    reset_client_encoding()..
    reset_client_encoding() done.
StartTransactionCommand
query: select getdatabaseencoding()
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: SET client_encoding = 'UNICODE'
ProcessUtility: SET client_encoding = 'UNICODE'
CommitTransactionCommand
proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)
/usr/lib/postgresql/bin/postmaster: reaping dead processes...
/usr/lib/postgresql/bin/postmaster: CleanupProc: pid 30613 exited with status 0
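
The byte value 225 mentioned in this message means different things depending
on the encoding used to read it.  A small Python sketch (the helper `describe`
is a hypothetical name) that decodes the same byte three ways:

```python
def describe(raw: bytes) -> dict:
    """Decode the same bytes under several encodings; None marks a failure."""
    result = {}
    for name in ("koi8_r", "latin_1", "utf_8"):
        try:
            result[name] = raw.decode(name)
        except UnicodeDecodeError:
            result[name] = None
    return result


if __name__ == "__main__":
    # 225 decimal is 0xE1
    for encoding, text in describe(bytes([225])).items():
        print(encoding, repr(text))
```

Read as KOI8-R the byte is the Cyrillic capital A, read as Latin-1 it is a
small a with an acute accent, and on its own it is not a valid UTF-8 sequence,
because 0xE1 announces a three-byte character.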

Re: Character encodings...

From
Peter Eisentraut
Date:
On Thu, 13 Apr 2000, Michael Sobolev wrote:

> Here it goes.  I would not say it's very useful...  Russian a has code 225
> (decimal).

> StartTransactionCommand
> query: SET client_encoding = 'UNICODE'
> ProcessUtility: SET client_encoding = 'UNICODE'
> CommitTransactionCommand
> proc_exit(0) [#0]
> shmem_exit(0) [#0]
> exit(0)
> /usr/lib/postgresql/bin/postmaster: reaping dead processes...
> /usr/lib/postgresql/bin/postmaster: CleanupProc: pid 30613 exited with status 0

That looks like the query never got to the backend. This is either a bug
in psql or the multibyte suite. I seem to recall that Unicode isn't fully
supported, so I'd go for the latter. Can Tatsuo comment?


--
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
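
A query that never reaches the backend would be consistent with a client-side
scanner that steps through its input one multibyte character at a time.  The
sketch below is only a simplified Python model of that idea, not psql's actual
code; `utf8_char_length` and `statement_terminates` are hypothetical names.  It
assumes the client encoding is UNICODE, so byte 0xE1 is treated as the start of
a three-byte character.

```python
def utf8_char_length(lead: int) -> int:
    """Length implied by a UTF-8 lead byte (simplified; anything else counts as 1)."""
    if lead < 0x80:
        return 1
    if (lead & 0xE0) == 0xC0:
        return 2
    if (lead & 0xF0) == 0xE0:
        return 3
    if (lead & 0xF8) == 0xF0:
        return 4
    return 1


def statement_terminates(data: bytes) -> bool:
    """Return True if a ';' is found outside a single-quoted literal."""
    in_quote = False
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if byte == ord("'"):
            in_quote = not in_quote
        elif byte == ord(";") and not in_quote:
            return True
        # Skip the whole "character", including bytes that were never checked.
        pos += utf8_char_length(byte)
    return False


if __name__ == "__main__":
    ascii_stmt = b"insert into news values ('2000-04-13', NULL, 'A');\n"
    koi8_stmt = b"insert into news values ('2000-04-13', NULL, '\xe1');\n"
    print(statement_terminates(ascii_stmt))
    print(statement_terminates(koi8_stmt))
```

In this model the KOI8-R statement never terminates, because the closing quote
and parenthesis are swallowed as continuation bytes.  A literal in which the
high byte happens to be followed by two more bytes before the closing quote
does terminate, which might be one reason why some files go through and others
do not.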


Re: Character encodings...

From
Tatsuo Ishii
Date:
> On Thu, 13 Apr 2000, Michael Sobolev wrote:
>
> > Here it goes.  I would not say it's very useful...  Russian a has code 225
> > (decimal).
>
> > StartTransactionCommand
> > query: SET client_encoding = 'UNICODE'
> > ProcessUtility: SET client_encoding = 'UNICODE'
> > CommitTransactionCommand
> > proc_exit(0) [#0]
> > shmem_exit(0) [#0]
> > exit(0)
> > /usr/lib/postgresql/bin/postmaster: reaping dead processes...
> > /usr/lib/postgresql/bin/postmaster: CleanupProc: pid 30613 exited with status 0
>
> That looks like the query never got to the backend. This is either a bug
> in psql or the multibyte suite. I seem to recall that Unicode isn't fully
> supported, so I'd go for the latter. Can Tatsuo comment?

Oh, he is using the multibyte support and expects an automatic code
conversion between KOI8-R and UNICODE that is not supported yet.

What he needs to do is create a database with encoding KOI8-R or
ISO-8859-5.

# make a KOI8-R database
$ createdb -E KOI8

or

# make an ISO-8859-5 database
$ createdb -E LATIN5

In the latter case, he might want to set the PGCLIENTENCODING environment
variable so that conversion between KOI8-R and ISO-8859-5 is performed
automatically.

# if you want to use KOI8-R on your client.
$ export PGCLIENTENCODING=KOI8
or
% setenv PGCLIENTENCODING KOI8
--
Tatsuo Ishii
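
To see what a conversion between KOI8-R and ISO-8859-5 amounts to at the byte
level, here is a minimal Python sketch.  It uses Python's own codecs rather
than anything from PostgreSQL, and the helper name `transcode` is hypothetical.

```python
def transcode(data: bytes, source: str = "koi8_r", target: str = "iso8859_5") -> bytes:
    """Re-encode bytes from one single-byte Cyrillic encoding to another.

    Raises UnicodeEncodeError if a character has no equivalent in the target
    encoding (for example the KOI8-R box-drawing characters).
    """
    return data.decode(source).encode(target)


if __name__ == "__main__":
    koi8_bytes = bytes([225])  # Cyrillic capital A in KOI8-R
    iso_bytes = transcode(koi8_bytes)
    print(koi8_bytes.hex(), "->", iso_bytes.hex())
    # Going back restores the original byte.
    print(transcode(iso_bytes, "iso8859_5", "koi8_r") == koi8_bytes)
```

The same letter has a different byte value in each encoding, so data has to be
converted, not just relabelled, when the client and the database disagree.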

Re: Character encodings...

From
Michael Sobolev
Date:
On Fri, Apr 14, 2000 at 03:44:09PM +0900, Tatsuo Ishii wrote:
> Oh, he is using the multibyte support and expects an automatic code
> conversion between KOI8-R and UNICODE that is not supported yet.
Not exactly.  If you look at my first message, you will see that the problem
is that the behaviour is not consistent.  Sometimes this data gets through,
and sometimes it does not.  I'd say that arbitrary text in KOI8-R can hardly
be anything reasonable in UTF-8, so I would expect all (yes, ALL) my requests
to fail (and preferably with correct diagnostics).

> # make a KOI8-R database
> $ createdb -E KOI8
Thanks.  I was looking for something like this in the man page, but
unfortunately it does not have this information.

> In the latter case, he might want to set the PGCLIENTENCODING environment
> variable so that conversion between KOI8-R and ISO-8859-5 is performed
> automatically.
What are the requirements for this to work?

Thanks,

--
Mike
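
The claim that KOI8-R text is rarely valid UTF-8 can be checked directly.  A
minimal Python sketch, using Python's strict UTF-8 decoder as the judge; the
helper name `is_valid_utf8` and the sample word are assumptions made for
illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    word = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian sample word
    print(is_valid_utf8(word.encode("koi8_r")))   # KOI8-R bytes
    print(is_valid_utf8(word.encode("utf-8")))    # the same word in UTF-8
    print(is_valid_utf8(b"insert into news values (NULL);"))  # pure ASCII
```

A strict check like this rejects the KOI8-R bytes outright, which is the
consistent failure with diagnostics asked for above.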

Re: Character encodings...

From
Tatsuo Ishii
Date:
> On Fri, Apr 14, 2000 at 03:44:09PM +0900, Tatsuo Ishii wrote:
> > Oh, he is using the multibyte support and expects an automatic code
> > conversion between KOI8-R and UNICODE that is not supported yet.
> Not exactly.  If you look at my first message, you will see that the problem
> is that the behaviour is not consistent.  Sometimes this data gets through,
> and sometimes it does not.  I'd say that arbitrary text in KOI8-R can hardly
> be anything reasonable in UTF-8, so I would expect all (yes, ALL) my requests
> to fail (and preferably with correct diagnostics).

Sorry. I don't understand your point. What I wanted to say was that KOI8-R
and UTF-8 are totally different encodings (except for the ASCII part).

> > # make a KOI8-R database
> > $ createdb -E KOI8
> Thanks.  I was looking for something like this in the man page, but
> unfortunately it does not have this information.

Please look at doc/README.mb.

> > In the latter case, he might want to set the PGCLIENTENCODING environment
> > variable so that conversion between KOI8-R and ISO-8859-5 is performed
> > automatically.
> What are the requirements for this to work?

Please explain your background. If you need only KOI8-R, you can
forget about ISO-8859-5.
--
Tatsuo Ishii