Thread: client encoding name normalization in psycopg 2.4
Psycopg 2.4 now tries to map the PostgreSQL client encoding to a Python codec. But it fails to consider some variant spellings. For example, I have PGCLIENTENCODING=UTF-8 in my environment, which completely breaks the new psycopg version with no Python codec for client encoding 'UTF-8' (The underlying reason for this is that I use export PGCLIENTENCODING=$(locale charmap) which gives variant spelling of encoding names across operating systems.) The PostgreSQL backend normalizes an encoding name by removing all non-alnum characters from it. I suggest psycopg do the same. Attached is a patch that implements that. Note that the PostgreSQL backend version of this actually lowercases the encoding names during normalization. I have made this patch uppercase them to keep the patch smaller, but you may want to consider doing the lowercasing, to keep things consistent.
Attachment
On 07/04/11 21:46, Peter Eisentraut wrote: [snip] > Attached is a patch that implements that. Note that the PostgreSQL > backend version of this actually lowercases the encoding names during > normalization. I have made this patch uppercase them to keep the patch > smaller, but you may want to consider doing the lowercasing, to keep > things consistent. The patch seems fine to me. I'll check it in later during the we. federico -- Federico Di Gregorio federico.digregorio@dndg.it Studio Associato Di Nunzio e Di Gregorio http://dndg.it If nobody understand you, that doesn't mean you're an artist. -- anonymous
On Fri, Apr 8, 2011 at 12:34 PM, Federico Di Gregorio <federico.digregorio@dndg.it> wrote: > On 07/04/11 21:46, Peter Eisentraut wrote: > [snip] >> Attached is a patch that implements that. Note that the PostgreSQL >> backend version of this actually lowercases the encoding names during >> normalization. I have made this patch uppercase them to keep the patch >> smaller, but you may want to consider doing the lowercasing, to keep >> things consistent. > > The patch seems fine to me. I'll check it in later during the we. I was working on the patch, but there's something not straightforward. I think assuming that psycopg2.extensions.encodings[conn.encoding] will always work is reasonable (also because it's the only way to convert the PG encoding to a Python encoding). The patch breaks this assumption, without which getting the Python codec name from the PG encoding becomes a convoluted operation. A better fix is probably to set connection.encoding to the normalized string, so that the lookup will always work. This means that connection.encoding is no more exactly what returned by SHOW connection_encoding but I don't think this is really important (furthermore the current code already converts it in uppercase). Note that fixing the encodings mapping to set the normalized names as key is not required: the mapping is extended with the normalized names when the psycopg2.extensions module is imported. Albeit the non-normal names would no more strictly required, I'd prefer to leave them as these variants are the one published in the PG documentation (http://www.postgresql.org/docs/9.0/static/multibyte.html) and I think it's a good thing to have them mapped to Python. I've implemented the above in a fix-encoding branch in my github repos. If it's fine I'll merge to my devel (now the unit test passes entirely setting a non-normal PGCLIENTENCODING). -- Daniele
On 08/04/11 14:59, Daniele Varrazzo wrote: > On Fri, Apr 8, 2011 at 12:34 PM, Federico Di Gregorio > <federico.digregorio@dndg.it> wrote: >> > On 07/04/11 21:46, Peter Eisentraut wrote: >> > [snip] >>> >> Attached is a patch that implements that. Note that the PostgreSQL >>> >> backend version of this actually lowercases the encoding names during >>> >> normalization. I have made this patch uppercase them to keep the patch >>> >> smaller, but you may want to consider doing the lowercasing, to keep >>> >> things consistent. >> > >> > The patch seems fine to me. I'll check it in later during the we. > I was working on the patch, but there's something not straightforward. > > I think assuming that psycopg2.extensions.encodings[conn.encoding] > will always work is reasonable (also because it's the only way to > convert the PG encoding to a Python encoding). The patch breaks this > assumption, without which getting the Python codec name from the PG > encoding becomes a convoluted operation. > > A better fix is probably to set connection.encoding to the normalized > string, so that the lookup will always work. This means that > connection.encoding is no more exactly what returned by SHOW > connection_encoding but I don't think this is really important > (furthermore the current code already converts it in uppercase). Yep, that was my idea too. The patch already contains code to normalize the encoding so it should be straightforward to have a normalized conn.encoding. federico -- Federico Di Gregorio federico.digregorio@dndg.it Studio Associato Di Nunzio e Di Gregorio http://dndg.it 99.99999999999999999999% still isn't 100% but sometimes suffice. -- Me
2011/4/7 Peter Eisentraut <peter_e@gmx.net>: > Psycopg 2.4 now tries to map the PostgreSQL client encoding to a Python > codec. But it fails to consider some variant spellings. For example, I > have > > PGCLIENTENCODING=UTF-8 > > in my environment, which completely breaks the new psycopg version with > > no Python codec for client encoding 'UTF-8' > > (The underlying reason for this is that I use > > export PGCLIENTENCODING=$(locale charmap) > > which gives variant spelling of encoding names across operating > systems.) > > The PostgreSQL backend normalizes an encoding name by removing all > non-alnum characters from it. I suggest psycopg do the same. Oddly enough, the recent commit that has broken the JDBC driver (with the server canonicalizing the encoding name) would have made psycopg 2.4 work correctly, as it would have received back from the server one of the expected encoding names (reference: http://archives.postgresql.org/pgsql-hackers/2011-04/msg00882.php). About this issue, I've tested psycopg devel (the soon-to-be-released 2.4.1) with a 9.1 head server and there is no problem to report with this combination (beside fixing the 2.4 problems with currently released server versions). -- Daniele