Thread: [pgsql-www] Further UTF8/MIME fixes for the commitfest app

[pgsql-www] Further UTF8/MIME fixes for the commitfest app

From
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Hi Magnus,

Thanks for the quick fix for the email From header bug yesterday.  I
haven't had a chance to test it yet, because I don't want to pollute the
mailing list with test messages.

Please find attached two more patches:

#1 applies the same fix when Cc-ing patch authors
#2 MIME-decodes headers received from the mailing list archive JSON API

I haven't been able to talk to the JSON api, so I couldn't test them
properly, but I did some stand-alone testing of the code snippets.

Note that the MIME decoding only works properly if running under Python
3; the Python 2 version of email.header.decode_header() has broken
detection of the end of encoded-words.
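
As a rough illustration of the decoding step described above (not the attached patch itself), the following Python 3 sketch decodes RFC 2047 encoded-words with the standard library. The helper name decode_mime_header and the sample header value are hypothetical and chosen only for illustration.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw):
    """Return a header value with any RFC 2047 encoded-words decoded.

    decode_header() splits the value into (bytes-or-str, charset) chunks;
    make_header() joins them back into a single Header that str() renders
    as ordinary text.
    """
    return str(make_header(decode_header(raw)))


# Hypothetical sample: an encoded-word directly adjacent to parentheses,
# the situation the thread says Python 2 does not handle.
SAMPLE = "ilmari@ilmari.org (=?utf-8?q?Dagfinn_Ilmari_Manns=C3=A5ker?=)"
decoded = decode_mime_header(SAMPLE)
```

Under Python 3 the encoded-word in SAMPLE is decoded even though it is followed directly by a closing paren; values without encoded-words pass through unchanged.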

Thanks,

ilmari

-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.               - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.                      - Calle Dybedahl



Re: [pgsql-www] Further UTF8/MIME fixes for the commitfest app

From
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Hi again,

ilmari@ilmari.org (Dagfinn Ilmari Mannsåker) writes:

> #2 MIME-decodes headers received from the mailing list archive JSON API
[…]
> Note that the MIME decoding only works properly if running under Python
> 3; the Python 2 version of email.header.decode_header() has broken
> detection of the end of encoded-words.

I just noticed that the mailing list archive suffers from the same
problem, and not for all MIME-encoded headers, only the ones affected by
the above bug. Switching load_message.py to using Python 3 (and running
reparse_message.py) might be enough to fix it.

Cheers,

ilmari

-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.               - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.                      - Calle Dybedahl



Re: [pgsql-www] Further UTF8/MIME fixes for the commitfest app

From
Magnus Hagander
On Wed, Mar 1, 2017 at 5:35 PM, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
> Hi Magnus,
>
> Thanks for the quick fix for the email From header bug yesterday.  I
> haven't had a chance to test it yet, because I don't want to pollute the
> mailing list with test messages.

Apologies for the late response; it's been conference weeks. Thanks, however, for your efforts!

> Please find attached two more patches:
>
> #1 applies the same fix when Cc-ing patch authors

Oops. I thought I had looked for that, but I missed it.

I think this calls for putting the fix in the UserWrapper() class instead, so I did that rather than using your patch, and have pushed the fix.

> #2 MIME-decodes headers received from the mailing list archive JSON API
>
> I haven't been able to talk to the JSON api, so I couldn't test them
> properly, but I did some stand-alone testing of the code snippets.
>
> Note that the MIME decoding only works properly if running under Python
> 3; the Python 2 version of email.header.decode_header() has broken
> detection of the end of encoded-words.

Is the patch still an improvement on python2?

It'll take a bit more than that to just enable python3, as the environment the django app runs in is not python3 and we've got a standardized one across all the smaller webapps. We do need to update it to python3, but that task hasn't been done yet. 

Also, based on your other email about the list archives -- if we fix this in the archives, does that make this patch unnecessary?



Re: [pgsql-www] Further UTF8/MIME fixes for the commitfest app

From
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Magnus Hagander <magnus@hagander.net> writes:

> On Wed, Mar 1, 2017 at 5:35 PM, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
> wrote:
[…]
>> #2 MIME-decodes headers received from the mailing list archive JSON API
>>
>> I haven't been able to talk to the JSON api, so I couldn't test them
>> properly, but I did some stand-alone testing of the code snippets.
>>
>> Note that the MIME decoding only works properly if running under Python
>> 3; the Python 2 version of email.header.decode_header() has broken
>> detection of the end of encoded-words.
>
> Is the patch still an improvement on python2?

No, because it'd be affected by the same problem that causes the
undecoded headers to be returned from the archive app.

> Also, based on your other email about the list archives -- if we fix this
> in the archives, does that make this patch unnecessary?

Yes, this patch is unnecessary if the archive app is fixed, and
insufficient if the commitfest app isn't upgraded to python3.

One possible workaround until upgrading to python3 is feasible would be
for the archive app to do some more munging (akin to the existing
_re_mailworkaround), and inject a space between an encoded-word and an
immediately-adjacent opening/closing paren.
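
A minimal sketch of that kind of munging, assuming an encoded-word has the RFC 2047 shape =?charset?encoding?text?=. The function name pad_encoded_words and the regular expressions are hypothetical; this is not the archive app's actual _re_mailworkaround code.

```python
import re

# Assumed shape of an RFC 2047 encoded-word: =?charset?Q|B?encoded-text?=
# (neither the charset nor the encoded text may contain "?" or whitespace).
_ENCODED_WORD = r"=\?[^?\s]+\?[QqBb]\?[^?\s]*\?="

# "(" immediately followed by an encoded-word.
_OPEN_PAREN = re.compile(r"\((?=" + _ENCODED_WORD + r")")
# An encoded-word immediately followed by ")".
_CLOSE_PAREN = re.compile(r"(" + _ENCODED_WORD + r")(?=\))")


def pad_encoded_words(header):
    """Insert a space between an encoded-word and an adjacent paren."""
    header = _OPEN_PAREN.sub("( ", header)
    header = _CLOSE_PAREN.sub(r"\1 ", header)
    return header
```

The padded value keeps the extra spaces inside the parentheses after decoding, so a real implementation would probably want to strip them again once the header has been decoded.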


- ilmari

-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with the consequences of."
--Skud's Meta-Law
 



Re: [pgsql-www] Further UTF8/MIME fixes for the commitfest app

From
Magnus Hagander


On Tue, Mar 14, 2017 at 2:07 PM, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>
> > On Wed, Mar 1, 2017 at 5:35 PM, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
> > wrote:
> […]
> >> #2 MIME-decodes headers received from the mailing list archive JSON API
> >>
> >> I haven't been able to talk to the JSON api, so I couldn't test them
> >> properly, but I did some stand-alone testing of the code snippets.
> >>
> >> Note that the MIME decoding only works properly if running under Python
> >> 3; the Python 2 version of email.header.decode_header() has broken
> >> detection of the end of encoded-words.
> >
> > Is the patch still an improvement on python2?
>
> No, because it'd be affected by the same problem that causes the
> undecoded headers to be returned from the archive app.

OK.

> > Also, based on your other email about the list archives -- if we fix this
> > in the archives, does that make this patch unnecessary?
>
> Yes, this patch is unnecessary if the archive app is fixed, and
> insufficient if the commitfest app isn't upgraded to python3.
>
> One possible workaround until upgrading to python3 is feasible would be
> for the archive app to do some more munging (akin to the existing
> _re_mailworkaround), and inject a space between an encoded-word and an
> immediately-adjacent opening/closing paren.

Actually, if I read that one right, it would be enough to upgrade the *loader* part of the archives, which is a much more contained problem, as it pretty much only has dependencies on the standard library.

Will have to run some detailed tests on that of course, to make sure it doesn't break anything else (like we have to reparse the 1.2 million messages in the archives and see if something else changes - but we have tools for this), but I think that's probably the best way forward from here.


Re: Further UTF8/MIME fixes for the commitfest app

From
Magnus Hagander


On Sun, Mar 19, 2017 at 2:45 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Tue, Mar 14, 2017 at 2:07 PM, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
> […]
> > One possible workaround until upgrading to python3 is feasible would be
> > for the archive app to do some more munging (akin to the existing
> > _re_mailworkaround), and inject a space between an encoded-word and an
> > immediately-adjacent opening/closing paren.
>
> Actually, if I read that one right, it would be enough to upgrade the *loader* part of the archives, which is a much more contained problem, as it pretty much only has dependencies on the standard library.
>
> Will have to run some detailed tests on that of course, to make sure it doesn't break anything else (like we have to reparse the 1.2 million messages in the archives and see if something else changes - but we have tools for this), but I think that's probably the best way forward from here.


I took a look at this, but it's not a lot of fun.

We currently use utidylib to clean HTML. This one only supports Python 2.

We could move to tidylib (notably without the u), which uses newer versions of everything and exists for Python 3. But the Python 3 version is not available until Debian stretch.

We'd also have to carefully examine the differences between tidylib and utidylib, and should probably do that as a separate step. I guess we'll have to start there.
