Re: No easy way to join discussion in existing thread when not subscribed - Mailing list pgsql-www

From Amir Rohan
Subject Re: No easy way to join discussion in existing thread when not subscribed
Date
Msg-id trinity-0dd0e5d2-38a2-4da7-92da-845e6ed83813-1443555278323@3capp-mailcom-lxa04
Whole thread Raw
Responses Re: No easy way to join discussion in existing thread when not subscribed
Re: No easy way to join discussion in existing thread when not subscribed
List pgsql-www
On 09/29/2015 08:44 PM, Stefan Kaltenbrunner wrote:

> Hi Amir!
> 

Hi Stefan, thanks for reviewing.

>>
>> Please, see attached patch adding a "whole thread[as mbox] link"
>> to pgarchives, by popular request upthread.
> 
> Thanks a lot for the patch - I took a quick look at the patch and have a
> few comments to make..
> 
>>

>>
>> If you'd like changes (hard limits, strip attachments, etc'),
>> do let me know.
> 
> some other things:
> 
> * while this is a preexisting issue in the code (most of the http auth
> requests are handled directly in lighttpd so nobody noticed so far i
> guess) please use "Please authenticate with user 'archives' and
> 'password' antispam"

I don't follow. The Auth handling is duplicated from the "raw" message
view, and I verified basic auth is required, prompt and all.

> * have you verified that the resulting mbox actually contains the
> newline seperator after each message(I have not checked whether the
> source data has it)?

Yes.

I also checked that both Mutt and Thunderbird can import
the generated files.
AFAICT, Adding an extra newline actualy ends up being one too
many. That may be a quirk from the test data I used from
the loader scripts and mboxes provided, but I can't be sure,
so I kept it on, guessing that email clients will tolerate
this sort of benign spec violation. TB and Mutt do.

btw, there's something off with the mbox processing chain you use.
I think it is non-compliant with the spec (as per qmail manpage of
yore), which requires so called ">From quoting":

http://qmail.org/man/man5/mbox.html

See for example <20150802150506.GH11473@alap3.anarazel.de> in
pgsql-hackers.201508, which includes email messages as mime
attachments and triggers (I believe) "missing Message-Ids"
warnings from your tool, and is perhaps mangled in the archives,
I've seen a few dozen of those while testing.

> * are you sure that using unicode() for building the output is going to
> work on all input? - I dont think you can assume that the source data is
> ASCII clean and/or has only valid unicode code points for mapping
> 

Good catch, but I caught earlier and send v2, which keeps everything in
bytes() and avoids encoding issues altogether.

>> It checks hiddenstatus, but does materialize the entire raw thread
>> in memory (including attachments) to form the response, which
>> is unbounded in principle and can be sizable in practice.
>>
>> Perhaps django can do streaming requests, so we can bound
>> the memory usage + timeout. It's been a while.
>
> this is dangerous - the box we are running on has limited <RAM>
> <...>
> Have you done any (approximate) measurements on what the additional
> in-memory overhead in both pg (to build the response) and in django is
> compared to the resulting  mbox?
>

That's more complicated to answer. First, the "whole thread" view
already materializes all the messages in a threads, excepting
attachments, so whether we include those is the main (only?) issue.
We can drop them (stripping them may cost some cpu) if all else fails.

The average email length in -hackers was about 10k in august.
The largest thread contained 91 messages, the median was 3.
So, say it takes 1M to store an mbox file for a large thread,
assuming august is a representative sample.

Although python (py2, especially) is quite wasteful about strings,  it
has only constant overhead per object for bytes() objects, and psycopg2
returns buffer objects which have the same property. Both these
assertions check out with `sys.getsizeof`.
So we're left with pretty much 1:1 memory to final mbox size on the
python side of things, which seems tolerable even with a safety factor
of 5-10 (i don't know if python does zero-copy).

If that's a problem, we can slap a size limit and still keep the 98%
percentile or so of threads accessible.
We can also add a rate limit to mitigate this as a DOS vector, if you're
really concerned about that. Or a captcha, if we must.

On the postgres side (and psycopg2 internals), I'm less knowledgable.
Maybe someone else can jump in? Does postgres (or psycopg2) use
inordinately large amounts of memory when fetching a meg or two of text?
What's the simplest way to profile postgres's memory usage during
a query?

The large data lives in a field which is tellingly schema'd as "bytea".
I bet that's the good stuff, considering who wrote this.

Amir





pgsql-www by date:

Previous
From: Stefan Kaltenbrunner
Date:
Subject: Re: No easy way to join discussion in existing thread when not subscribed
Next
From: Stefan Kaltenbrunner
Date:
Subject: Re: No easy way to join discussion in existing thread when not subscribed