Thread: client encoding name normalization in psycopg 2.4

client encoding name normalization in psycopg 2.4

From

Peter Eisentraut

Date:

07 April 2011, 16:47:20

Psycopg 2.4 now tries to map the PostgreSQL client encoding to a Python
codec.  But it fails to consider some variant spellings.  For example, I
have

PGCLIENTENCODING=UTF-8

in my environment, which completely breaks the new psycopg version with

no Python codec for client encoding 'UTF-8'

(The underlying reason for this is that I use

export PGCLIENTENCODING=$(locale charmap)

which gives variant spelling of encoding names across operating
systems.)

The PostgreSQL backend normalizes an encoding name by removing all
non-alnum characters from it.  I suggest psycopg do the same.

Attached is a patch that implements that.  Note that the PostgreSQL
backend version of this actually lowercases the encoding names during
normalization.  I have made this patch uppercase them to keep the patch
smaller, but you may want to consider doing the lowercasing, to keep
things consistent.

Attachment

psycopg-clientencoding-fix.patch

Re: client encoding name normalization in psycopg 2.4

From

Federico Di Gregorio

Date:

08 April 2011, 08:34:45

On 07/04/11 21:46, Peter Eisentraut wrote:
[snip]
> Attached is a patch that implements that.  Note that the PostgreSQL
> backend version of this actually lowercases the encoding names during
> normalization.  I have made this patch uppercase them to keep the patch
> smaller, but you may want to consider doing the lowercasing, to keep
> things consistent.

The patch seems fine to me. I'll check it in later during the we.

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
 If nobody understand you, that doesn't mean you're an artist.
                                                            -- anonymous

Re: client encoding name normalization in psycopg 2.4

From

Daniele Varrazzo

Date:

08 April 2011, 09:59:44

On Fri, Apr 8, 2011 at 12:34 PM, Federico Di Gregorio
<federico.digregorio@dndg.it> wrote:
> On 07/04/11 21:46, Peter Eisentraut wrote:
> [snip]
>> Attached is a patch that implements that.  Note that the PostgreSQL
>> backend version of this actually lowercases the encoding names during
>> normalization.  I have made this patch uppercase them to keep the patch
>> smaller, but you may want to consider doing the lowercasing, to keep
>> things consistent.
>
> The patch seems fine to me. I'll check it in later during the we.

I was working on the patch, but there's something not straightforward.

I think assuming that psycopg2.extensions.encodings[conn.encoding]
will always work is reasonable (also because it's the only way to
convert the PG encoding to a Python encoding). The patch breaks this
assumption, without which getting the Python codec name from the PG
encoding becomes a convoluted operation.

A better fix is probably to set connection.encoding to the normalized
string, so that the lookup will always work. This means that
connection.encoding is no more exactly what returned by SHOW
connection_encoding but I don't think this is really important
(furthermore the current code already converts it in uppercase).

Note that fixing the encodings mapping to set the normalized names as
key is not required: the mapping is extended with the normalized names
when the psycopg2.extensions module is imported. Albeit the non-normal
names would no more strictly required, I'd prefer to leave them as
these variants are the one published in the PG documentation
(http://www.postgresql.org/docs/9.0/static/multibyte.html) and I think
it's a good thing to have them mapped to Python.

I've implemented the above in a fix-encoding branch in my github
repos. If it's fine I'll merge to my devel (now the unit test passes
entirely setting a non-normal PGCLIENTENCODING).

-- Daniele

Re: client encoding name normalization in psycopg 2.4

From

Federico Di Gregorio

Date:

08 April 2011, 10:14:17

On 08/04/11 14:59, Daniele Varrazzo wrote:
> On Fri, Apr 8, 2011 at 12:34 PM, Federico Di Gregorio
> <federico.digregorio@dndg.it> wrote:
>> > On 07/04/11 21:46, Peter Eisentraut wrote:
>> > [snip]
>>> >> Attached is a patch that implements that.  Note that the PostgreSQL
>>> >> backend version of this actually lowercases the encoding names during
>>> >> normalization.  I have made this patch uppercase them to keep the patch
>>> >> smaller, but you may want to consider doing the lowercasing, to keep
>>> >> things consistent.
>> >
>> > The patch seems fine to me. I'll check it in later during the we.
> I was working on the patch, but there's something not straightforward.
>
> I think assuming that psycopg2.extensions.encodings[conn.encoding]
> will always work is reasonable (also because it's the only way to
> convert the PG encoding to a Python encoding). The patch breaks this
> assumption, without which getting the Python codec name from the PG
> encoding becomes a convoluted operation.
>
> A better fix is probably to set connection.encoding to the normalized
> string, so that the lookup will always work. This means that
> connection.encoding is no more exactly what returned by SHOW
> connection_encoding but I don't think this is really important
> (furthermore the current code already converts it in uppercase).

Yep, that was my idea too. The patch already contains code to normalize
the encoding so it should be straightforward to have a normalized
conn.encoding.

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
  99.99999999999999999999% still isn't 100% but sometimes suffice. -- Me

Re: client encoding name normalization in psycopg 2.4

From

Daniele Varrazzo

Date:

18 April 2011, 13:42:46

2011/4/7 Peter Eisentraut <peter_e@gmx.net>:
> Psycopg 2.4 now tries to map the PostgreSQL client encoding to a Python
> codec.  But it fails to consider some variant spellings.  For example, I
> have
>
> PGCLIENTENCODING=UTF-8
>
> in my environment, which completely breaks the new psycopg version with
>
> no Python codec for client encoding 'UTF-8'
>
> (The underlying reason for this is that I use
>
> export PGCLIENTENCODING=$(locale charmap)
>
> which gives variant spelling of encoding names across operating
> systems.)
>
> The PostgreSQL backend normalizes an encoding name by removing all
> non-alnum characters from it.  I suggest psycopg do the same.

Oddly enough, the recent commit that has broken the JDBC driver (with
the server canonicalizing the encoding name) would have made psycopg
2.4 work correctly, as it would have received back from the server one
of the expected encoding names (reference:
http://archives.postgresql.org/pgsql-hackers/2011-04/msg00882.php).

About this issue, I've tested psycopg devel (the soon-to-be-released
2.4.1) with a 9.1 head server and there is no problem to report with
this combination (beside fixing the 2.4 problems with currently
released server versions).

-- Daniele