Thread: Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

From: Heikki Linnakangas
On 05.07.2012 23:31, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@iki.fi>  writes:
>> Fix mapping of PostgreSQL encodings to Python encodings.
>
> The buildfarm doesn't like this --- did you check for side effects on
> regression test results?

Hmm, I ran the regression tests, but not with C encoding. With the 
patch, you no longer get the errdetail you used to when an encoding 
conversion fails:

> ***************
> *** 41,47 ****
>
>   SELECT unicode_plan1();
>   ERROR:  spiexceptions.InternalError: could not convert Python Unicode object to PostgreSQL server encoding
> - DETAIL:  UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)
>   CONTEXT:  Traceback (most recent call last):
>     PL/Python function "unicode_plan1", line 3, in <module>
>       rv = plpy.execute(plan, [u"\x80"], 1)
> --- 39,44 ----

We could just update the expected output; there are two expected outputs 
for this test case and one of them is now wrong. But it would be quite a 
shame to lose that extra information; it's quite valuable. Perhaps we 
should go back to using PLy_elog() here, and find some other way to 
avoid the recursion.

- Heikki
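For reference, the DETAIL line that disappeared is Python's own error text; a quick standalone check in plain Python (outside PL/Python) reproduces it:

```python
# Encoding U+0080 with the "ascii" codec fails exactly as the old expected
# output showed; PL/Python surfaced this exception's text as errdetail.
try:
    "\x80".encode("ascii")
    detail = None
except UnicodeEncodeError as exc:
    detail = "UnicodeEncodeError: %s" % exc

print(detail)
```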


On 05/07/12 22:37, Heikki Linnakangas wrote:
> On 05.07.2012 23:31, Tom Lane wrote:
>> Heikki Linnakangas<heikki.linnakangas@iki.fi> writes:
>>> Fix mapping of PostgreSQL encodings to Python encodings.
>>
>> The buildfarm doesn't like this --- did you check for side effects on
>> regression test results?
>
> Hmm, I ran the regression tests, but not with C encoding. With the
> patch, you no longer get the errdetail you used to when an encoding
> conversion fails:
>
>> ***************
>> *** 41,47 ****
>>
>> SELECT unicode_plan1();
>> ERROR: spiexceptions.InternalError: could not convert Python Unicode
>> object to PostgreSQL server encoding
>> - DETAIL: UnicodeEncodeError: 'ascii' codec can't encode character
>> u'\x80' in position 0: ordinal not in range(128)
>> CONTEXT: Traceback (most recent call last):
>> PL/Python function "unicode_plan1", line 3, in <module>
>> rv = plpy.execute(plan, [u"\x80"], 1)
>> --- 39,44 ----
>
> We could just update the expected output; there are two expected outputs
> for this test case and one of them is now wrong. But it would be quite a
> shame to lose that extra information; it's quite valuable. Perhaps we
> should go back to using PLy_elog() here, and find some other way to
> avoid the recursion.

It seems the problem is that LC_ALL=C makes Postgres use SQL_ASCII 
as the database encoding, and as the comment states, translating PG's 
SQL_ASCII to Python's "ascii" is not ideal.

The problem is that PLyUnicode_Bytes is (via an ifdef) used as 
PyString_ToString on Python 3, which means that there are numerous call 
sites and new ones might appear at any moment. I'm not that keen on 
invoking the traceback machinery on low-level encoding errors.

Hm, since PLyUnicode_Bytes should get a unicode object and return bytes 
in the server encoding, we might just say that for SQL_ASCII we 
arbitrarily choose UTF-8 to encode the unicode code points, so we'd just 
set serverenc = "utf-8" in the first switch case.

That doesn't solve the problem of the missing error detail, though.

Jan
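As an illustration of the naming mismatch (a plain-Python sketch, not the C switch itself; the mapping pairs below are assumed examples, not the full table):

```python
import codecs

# Postgres encoding names don't always match Python codec names; the C
# switch translates between them. SQL_ASCII has no exact Python
# equivalent, so the proposal above maps it to UTF-8, which can encode
# any code point.
pg_to_python = {
    "WIN1252": "cp1252",
    "KOI8R": "koi8_r",
    "SQL_ASCII": "utf-8",  # arbitrary but safe choice
}

for py_codec in pg_to_python.values():
    codecs.lookup(py_codec)  # would raise LookupError if Python lacked it

# The old mapping of SQL_ASCII to "ascii" rejects U+0080; UTF-8 accepts it.
assert "\x80".encode("utf-8") == b"\xc2\x80"
```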


On tor, 2012-07-05 at 22:53 +0200, Jan Urbański wrote:
> The problem is that PLyUnicode_Bytes is (via an ifdef) used as
> PyString_ToString on Python 3, which means that there are numerous call
> sites and new ones might appear at any moment. I'm not that keen on
> invoking the traceback machinery on low-level encoding errors.

Why not?



On 05/07/12 23:30, Peter Eisentraut wrote:
> On tor, 2012-07-05 at 22:53 +0200, Jan Urbański wrote:
>> The problem is that PLyUnicode_Bytes is (via an ifdef) used as
>> PyString_ToString on Python 3, which means that there are numerous call
>> sites and new ones might appear at any moment. I'm not that keen on
>> invoking the traceback machinery on low-level encoding errors.
>
> Why not?

Because it can lead to recursion errors, like the one this patch was 
supposed to fix. The traceback machinery calls into the encoding 
functions, because it converts Python strings (like function names) into 
C strings.
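A toy reproduction of that cycle (plain Python; the function names here are made up for illustration):

```python
import sys

def to_server_encoding(name):
    # stand-in for the C-string conversion failing on a non-ASCII name
    raise UnicodeEncodeError("ascii", name, 0, 1, "ordinal not in range(128)")

def report_error(funcname):
    try:
        return to_server_encoding(funcname)
    except UnicodeEncodeError:
        # the "traceback machinery": it needs the function name as a
        # C string too, so it hits the same failure and re-enters
        return report_error(funcname)

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(200)
try:
    report_error("f")
    outcome = "returned"
except RecursionError:
    outcome = "RecursionError"
finally:
    sys.setrecursionlimit(old_limit)

print(outcome)
```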


Re: Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

From: Heikki Linnakangas
On 06.07.2012 00:54, Jan Urbański wrote:
> On 05/07/12 23:30, Peter Eisentraut wrote:
>> On tor, 2012-07-05 at 22:53 +0200, Jan Urbański wrote:
>>> The problem is that PLyUnicode_Bytes is (via an ifdef) used as
>>> PyString_ToString on Python 3, which means that there are numerous call
>>> sites and new ones might appear at any moment. I'm not that keen on
>>> invoking the traceback machinery on low-level encoding errors.
>>
>> Why not?
>
> Because it can lead to recursion errors, like the one this patch was
> supposed to fix. The traceback machinery calls into the encoding
> functions, because it converts Python strings (like function names) into
> C strings.

In the backend elog routines, there is a global variable 
'recursion_depth', which is incremented when an error-handling routine 
is entered, and decremented afterwards. Can we use a similar mechanism 
in PLy_elog() to detect and stop recursion?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
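In plain Python, the shape of that guard might look like this (a sketch only; the names are illustrative, not the eventual patch):

```python
# Bump a counter on entry, skip the expensive traceback work when
# re-entered too deeply, and always decrement on the way out -- the same
# shape as recursion_depth in the backend's elog.c.
MAX_TRACEBACK_DEPTH = 2
recursion_depth = 0

def build_traceback(trigger_reentry=False):
    global recursion_depth
    recursion_depth += 1
    try:
        if recursion_depth > MAX_TRACEBACK_DEPTH:
            return "(traceback suppressed to avoid recursion)"
        if trigger_reentry:
            # an error while building the traceback re-enters this function
            return build_traceback(trigger_reentry)
        return "normal traceback"
    finally:
        recursion_depth -= 1

print(build_traceback())      # normal case
print(build_traceback(True))  # runaway re-entry gets cut off
```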


On 06/07/12 10:05, Heikki Linnakangas wrote:
> On 06.07.2012 00:54, Jan Urbański wrote:
>> On 05/07/12 23:30, Peter Eisentraut wrote:
>>> On tor, 2012-07-05 at 22:53 +0200, Jan Urbański wrote:
>>>> The problem is that PLyUnicode_Bytes is (via an ifdef) used as
>>>> PyString_ToString on Python 3, which means that there are numerous call
>>>> sites and new ones might appear at any moment. I'm not that keen on
>>>> invoking the traceback machinery on low-level encoding errors.
>>>
>>> Why not?
>>
>> Because it can lead to recursion errors, like the one this patch was
>> supposed to fix. The traceback machinery calls into the encoding
>> functions, because it converts Python strings (like function names) into
>> C strings.
>
> In the backend elog routines, there is a global variable
> 'recursion_depth', which is incremented when an error-handling routine
> is entered, and decremented afterwards. Can we use a similar mechanism
> in PLy_elog() to detect and stop recursion?

I guess we can. I'll try to do some tests to see if there's an easy 
user-triggerable way of causing PLy_elog to recurse; if not, then a 
guard like this should be enough as a safety measure against as-yet 
unknown conditions (as opposed to something we expect to happen 
regularly).

Cheers,
Jan


On 06/07/12 10:14, Jan Urbański wrote:
> On 06/07/12 10:05, Heikki Linnakangas wrote:
>> In the backend elog routines, there is a global variable
>> 'recursion_depth', which is incremented when an error-handling routine
>> is entered, and decremented afterwards. Can we use a similar mechanism
>> in PLy_elog() to detect and stop recursion?
>
> I guess we can. I'll try to do some tests to see if there's an easy
> user-triggerable way of causing PLy_elog to recurse; if not, then a
> guard like this should be enough as a safety measure against as-yet
> unknown conditions (as opposed to something we expect to happen
> regularly).

Attached is a patch that stores the recursion level of PLy_traceback and
prevents it from running if it's too deep (PLy_traceback is the one
doing the heavy lifting, which is why I chose to put the skip logic
there).

I tried a few things and was not able to easily invoke the infinite
recursion condition, but I did notice that there are two more encodings
that have different names in Postgres and in Python (KOI8-R and KOI8-U)
and added them to the switch.

There's still trouble with EUC_TW and MULE_INTERNAL which don't have
Python equivalents. EUC-TW has been discussed in
http://bugs.python.org/issue2066 and rejected (see
http://bugs.python.org/issue2066#msg113731).

If you use any of these encodings, you *will* get into the recursion
trouble described earlier, just as before the patch you'd get into it
with CP1252 as your encoding.

What shall we do about those? Ignore them? Document that if you're using
one of these encodings then PL/Python with Python 2 will be crippled and
with Python 3 just won't work?

Cheers,
Jan


Re: Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

From: Heikki Linnakangas
On 06.07.2012 18:01, Jan Urbański wrote:
> There's still trouble with EUC_TW and MULE_INTERNAL which don't have
> Python equivalents. EUC-TW has been discussed in
> http://bugs.python.org/issue2066 and rejected (see
> http://bugs.python.org/issue2066#msg113731).
>
> If you use any of these encodings, you *will* get into the recursion
> trouble described earlier, just as before the patch you'd get into it
> with CP1252 as your encoding.
>
> What shall we do about those? Ignore them? Document that if you're using
> one of these encodings then PL/Python with Python 2 will be crippled and
> with Python 3 just won't work?

We could convert to UTF-8, and use the PostgreSQL functions to convert 
from UTF-8 to the server encoding. Double conversion might be slow, but 
I think it would be better than failing.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
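In outline, the double conversion looks like this (plain Python standing in for the C code; Python's codecs play the role of pg_do_encoding_conversion here, and LATIN2 is just an example server encoding):

```python
# Step 1: the Python side always encodes to UTF-8.
text = "żółw"
utf8_bytes = text.encode("utf-8")

# Step 2: the server's own conversion routines take UTF-8 to the database
# encoding (done in C by pg_do_encoding_conversion; emulated here).
server_bytes = utf8_bytes.decode("utf-8").encode("iso8859_2")

assert server_bytes.decode("iso8859_2") == text  # round-trips losslessly
```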


On fre, 2012-07-06 at 18:53 +0300, Heikki Linnakangas wrote:
> > What shall we do about those? Ignore them? Document that if you're using
> > one of these encodings then PL/Python with Python 2 will be crippled and
> > with Python 3 just won't work?
> 
> We could convert to UTF-8, and use the PostgreSQL functions to convert 
> from UTF-8 to the server encoding. Double conversion might be slow, but 
> I think it would be better than failing. 

Actually, we already do the other direction that way
(PLyUnicode_FromStringAndSize), so maybe it would be more consistent to
always use this.

I would hesitate to use this as a kind of fallback, because then we
would sometimes be using PostgreSQL's recoding tables and sometimes
Python's recoding tables, which could become confusing.




On 06/07/12 22:47, Peter Eisentraut wrote:
> On fre, 2012-07-06 at 18:53 +0300, Heikki Linnakangas wrote:
>>> What shall we do about those? Ignore them? Document that if you're using
>>> one of these encodings then PL/Python with Python 2 will be crippled and
>>> with Python 3 just won't work?
>>
>> We could convert to UTF-8, and use the PostgreSQL functions to convert
>> from UTF-8 to the server encoding. Double conversion might be slow, but
>> I think it would be better than failing.
>
> Actually, we already do the other direction that way
> (PLyUnicode_FromStringAndSize), so maybe it would be more consistent to
> always use this.
>
> I would hesitate to use this as a kind of fallback, because then we
> would sometimes be using PostgreSQL's recoding tables and sometimes
> Python's recoding tables, which could become confusing.

So you're in favour of doing unicode -> bytes by encoding with UTF-8 and 
then using the server's encoding functions?


Re: Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

From: Heikki Linnakangas
On 07.07.2012 00:12, Jan Urbański wrote:
> On 06/07/12 22:47, Peter Eisentraut wrote:
>> On fre, 2012-07-06 at 18:53 +0300, Heikki Linnakangas wrote:
>>>> What shall we do about those? Ignore them? Document that if you're using
>>>> one of these encodings then PL/Python with Python 2 will be crippled and
>>>> with Python 3 just won't work?
>>>
>>> We could convert to UTF-8, and use the PostgreSQL functions to convert
>>> from UTF-8 to the server encoding. Double conversion might be slow, but
>>> I think it would be better than failing.
>>
>> Actually, we already do the other direction that way
>> (PLyUnicode_FromStringAndSize), so maybe it would be more consistent to
>> always use this.
>>
>> I would hesitate to use this as a kind of fallback, because then we
>> would sometimes be using PostgreSQL's recoding tables and sometimes
>> Python's recoding tables, which could become confusing.
>
> So you're in favour of doing unicode -> bytes by encoding with UTF-8 and
> then using the server's encoding functions?

Sounds reasonable to me. The extra conversion between UTF-8 and UCS-2 
should be quite fast, and it would be good to be consistent in the way 
we do conversions in both directions.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On 12/07/12 11:08, Heikki Linnakangas wrote:
> On 07.07.2012 00:12, Jan Urbański wrote:
>> On 06/07/12 22:47, Peter Eisentraut wrote:
>>> On fre, 2012-07-06 at 18:53 +0300, Heikki Linnakangas wrote:
>>>>> What shall we do about those? Ignore them? Document that if you're using
>>>>> one of these encodings then PL/Python with Python 2 will be crippled and
>>>>> with Python 3 just won't work?
>>>>
>>>> We could convert to UTF-8, and use the PostgreSQL functions to convert
>>>> from UTF-8 to the server encoding. Double conversion might be slow, but
>>>> I think it would be better than failing.
>>>
>>> Actually, we already do the other direction that way
>>> (PLyUnicode_FromStringAndSize), so maybe it would be more consistent to
>>> always use this.
>>>
>>> I would hesitate to use this as a kind of fallback, because then we
>>> would sometimes be using PostgreSQL's recoding tables and sometimes
>>> Python's recoding tables, which could become confusing.
>>
>> So you're in favour of doing unicode -> bytes by encoding with UTF-8 and
>> then using the server's encoding functions?
>
> Sounds reasonable to me. The extra conversion between UTF-8 and UCS-2
> should be quite fast, and it would be good to be consistent in the way
> we do conversions in both directions.
>

I'll implement that then (sorry for not following up on this earlier).

J


On 13/07/12 13:38, Jan Urbański wrote:
> On 12/07/12 11:08, Heikki Linnakangas wrote:
>> On 07.07.2012 00:12, Jan Urbański wrote:
>>> So you're in favour of doing unicode -> bytes by encoding with UTF-8 and
>>> then using the server's encoding functions?
>>
>> Sounds reasonable to me. The extra conversion between UTF-8 and UCS-2
>> should be quite fast, and it would be good to be consistent in the way
>> we do conversions in both directions.
>>
>
> I'll implement that then (sorry for not following up on this earlier).

Here's a patch that always encodes Python unicode objects using UTF-8
and then uses Postgres's internal functions to produce bytes in the
server encoding.

Cheers,
Jan


Re: Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

From: Heikki Linnakangas
On 14.07.2012 17:50, Jan Urbański wrote:
> On 13/07/12 13:38, Jan Urbański wrote:
>> On 12/07/12 11:08, Heikki Linnakangas wrote:
>>> On 07.07.2012 00:12, Jan Urbański wrote:
>>>> So you're in favour of doing unicode -> bytes by encoding with UTF-8
>>>> and
>>>> then using the server's encoding functions?
>>>
>>> Sounds reasonable to me. The extra conversion between UTF-8 and UCS-2
>>> should be quite fast, and it would be good to be consistent in the way
>>> we do conversions in both directions.
>>>
>>
>> I'll implement that then (sorry for not following up on this earlier).
>
> Here's a patch that always encodes Python unicode objects using UTF-8
> and then uses Postgres's internal functions to produce bytes in the
> server encoding.

Thanks.

If pg_do_encoding_conversion() throws an error, you don't get a chance 
to call Py_DECREF() to release the string. Is that a problem?

If an error occurs in PLy_traceback(), after incrementing 
recursion_depth, you don't get a chance to decrement it again. I'm not 
sure if the Py* function calls can fail, but at least seemingly trivial 
things like initStringInfo() can throw an out-of-memory error.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
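The first point in miniature (plain Python; try/finally plays the role that PG_TRY/PG_CATCH would in the C code, and the names here are made up):

```python
released = []

def convert_with_cleanup(text):
    handle = object()                # stands in for the PyObject* being held
    try:
        return text.encode("ascii")  # may raise, like the conversion erroring
    finally:
        released.append(handle)      # stands in for Py_DECREF(string)

convert_with_cleanup("ok")           # success path releases the reference
try:
    convert_with_cleanup("\x80")     # error path must release it too
except UnicodeEncodeError:
    pass

assert len(released) == 2
```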


On 18/07/12 17:17, Heikki Linnakangas wrote:
> On 14.07.2012 17:50, Jan Urbański wrote:
>
> If pg_do_encoding_conversion() throws an error, you don't get a chance
> to call Py_DECREF() to release the string. Is that a problem?
>
> If an error occurs in PLy_traceback(), after incrementing
> recursion_depth, you don't get a chance to decrement it again. I'm not
> sure if the Py* function calls can fail, but at least seemingly trivial
> things like initStringInfo() can throw an out-of-memory error.

Of course you're right (on both counts).

Here's a version with a bunch of PG_TRies thrown in.

Cheers,
Jan

On 20/07/12 08:59, Jan Urbański wrote:
> On 18/07/12 17:17, Heikki Linnakangas wrote:
>> On 14.07.2012 17:50, Jan Urbański wrote:
>>
>> If pg_do_encoding_conversion() throws an error, you don't get a chance
>> to call Py_DECREF() to release the string. Is that a problem?
>>
>> If an error occurs in PLy_traceback(), after incrementing
>> recursion_depth, you don't get a chance to decrement it again. I'm not
>> sure if the Py* function calls can fail, but at least seemingly trivial
>> things like initStringInfo() can throw an out-of-memory error.
>
> Of course you're right (on both counts).
>
> Here's a version with a bunch of PG_TRies thrown in.

Silly me, playing tricks with postincrements before fully waking up.

Here's v3, with a correct inequality test for exceeding the traceback
recursion limit.

J
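For readers wondering what kind of off-by-one a stray postincrement causes, here is a generic illustration in Python (a guess at the class of bug, not the actual patch code):

```python
LIMIT = 2

def runs_when_testing_old_value():
    # like C's "if (depth++ > LIMIT)": the comparison sees the value from
    # before the increment, so one extra level slips through
    depth = runs = 0
    while True:
        old = depth
        depth += 1
        if old > LIMIT:
            return runs
        runs += 1

def runs_when_testing_new_value():
    # increment first, then compare: the bound is exact
    depth = runs = 0
    while True:
        depth += 1
        if depth > LIMIT:
            return runs
        runs += 1

print(runs_when_testing_old_value(), runs_when_testing_new_value())  # prints: 3 2
```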


Re: Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

From: Heikki Linnakangas
On 20.07.2012 10:13, Jan Urbański wrote:
> On 20/07/12 08:59, Jan Urbański wrote:
>> On 18/07/12 17:17, Heikki Linnakangas wrote:
>>> On 14.07.2012 17:50, Jan Urbański wrote:
>>>
>>> If pg_do_encoding_conversion() throws an error, you don't get a chance
>>> to call Py_DECREF() to release the string. Is that a problem?
>>>
>>> If an error occurs in PLy_traceback(), after incrementing
>>> recursion_depth, you don't get a chance to decrement it again. I'm not
>>> sure if the Py* function calls can fail, but at least seemingly trivial
>>> things like initStringInfo() can throw an out-of-memory error.
>>
>> Of course you're right (on both counts).
>>
>> Here's a version with a bunch of PG_TRies thrown in.
>
> Silly me, playing tricks with postincrements before fully waking up.
>
> Here's v3, with a correct inequality test for exceeding the traceback
> recursion limit.

Committed the convert-via-UTF-8 part of this. I'll take a closer look at 
the recursion check next.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On 06/08/12 13:59, Heikki Linnakangas wrote:
> On 20.07.2012 10:13, Jan Urbański wrote:
>> On 20/07/12 08:59, Jan Urbański wrote:
>>> On 18/07/12 17:17, Heikki Linnakangas wrote:
>>>> On 14.07.2012 17:50, Jan Urbański wrote:
>>>>
>>>> If pg_do_encoding_conversion() throws an error, you don't get a chance
>>>> to call Py_DECREF() to release the string. Is that a problem?
>>>>
>>>> If an error occurs in PLy_traceback(), after incrementing
>>>> recursion_depth, you don't get a chance to decrement it again. I'm not
>>>> sure if the Py* function calls can fail, but at least seemingly trivial
>>>> things like initStringInfo() can throw an out-of-memory error.
>>>
>>> Of course you're right (on both counts).
>>>
>>> Here's a version with a bunch of PG_TRies thrown in.
>>
>> Silly me, playing tricks with postincrements before fully waking up.
>>
>> Here's v3, with a correct inequality test for exceeding the traceback
>> recursion limit.
>
> Committed the convert-via-UTF-8 part of this. I'll take a closer look at
> the recursion check next.

Thanks!

Jan