Thread: Re: No easy way to join discussion in existing thread when not subscribed
On 09/29/2015 08:44 PM, Stefan Kaltenbrunner wrote:
> Hi Amir!

Hi Stefan, thanks for reviewing.

>> Please, see attached patch adding a "whole thread [as mbox]" link
>> to pgarchives, by popular request upthread.
>
> Thanks a lot for the patch - I took a quick look at the patch and have a
> few comments to make..
>
>> If you'd like changes (hard limits, strip attachments, etc.),
>> do let me know.
>
> some other things:
>
> * while this is a preexisting issue in the code (most of the http auth
> requests are handled directly in lighttpd so nobody noticed so far i
> guess) please use "Please authenticate with user 'archives' and
> 'password' antispam"

I don't follow. The auth handling is duplicated from the "raw" message
view, and I verified basic auth is required, prompt and all.

> * have you verified that the resulting mbox actually contains the
> newline separator after each message (I have not checked whether the
> source data has it)?

Yes. I also checked that both Mutt and Thunderbird can import the
generated files.

AFAICT, adding an extra newline actually ends up being one too many.
That may be a quirk of the test data I used from the loader scripts and
mboxes provided, but I can't be sure, so I kept it in, guessing that
email clients will tolerate this sort of benign spec violation. TB and
Mutt do.

BTW, there's something off with the mbox processing chain you use. I
think it is non-compliant with the spec (as per the qmail manpage of
yore), which requires so-called ">From quoting":

http://qmail.org/man/man5/mbox.html

See for example <20150802150506.GH11473@alap3.anarazel.de> in
pgsql-hackers.201508, which includes email messages as mime attachments
and triggers (I believe) "missing Message-Ids" warnings from your tool,
and is perhaps mangled in the archives. I've seen a few dozen of those
while testing.

> * are you sure that using unicode() for building the output is going to
> work on all input?
> I don't think you can assume that the source data is
> ASCII clean and/or has only valid unicode code points for mapping

Good catch, but I caught it earlier and sent v2, which keeps everything
in bytes() and avoids encoding issues altogether.

>> It checks hiddenstatus, but does materialize the entire raw thread
>> in memory (including attachments) to form the response, which
>> is unbounded in principle and can be sizable in practice.
>>
>> Perhaps django can do streaming requests, so we can bound
>> the memory usage + timeout. It's been a while.
>
> this is dangerous - the box we are running on has limited <RAM>
> <...>
> Have you done any (approximate) measurements on what the additional
> in-memory overhead in both pg (to build the response) and in django is
> compared to the resulting mbox?

That's more complicated to answer. First, the "whole thread" view
already materializes all the messages in a thread, excepting
attachments, so whether we include those is the main (only?) issue.
We can drop them (stripping them may cost some cpu) if all else fails.

The average email length in -hackers was about 10k in August. The
largest thread contained 91 messages, the median was 3. So, say it
takes 1M to store an mbox file for a large thread, assuming August is
a representative sample.

Although python (py2, especially) is quite wasteful about strings, it
has only constant overhead per object for bytes() objects, and psycopg2
returns buffer objects which have the same property. Both these
assertions check out with `sys.getsizeof`. So we're left with pretty
much 1:1 memory to final mbox size on the python side of things, which
seems tolerable even with a safety factor of 5-10 (I don't know if
python does zero-copy).

If that's a problem, we can slap on a size limit and still keep the
98th percentile or so of threads accessible. We can also add a rate
limit to mitigate this as a DOS vector, if you're really concerned
about that. Or a captcha, if we must.
On the postgres side (and psycopg2 internals), I'm less knowledgeable.
Maybe someone else can jump in? Does postgres (or psycopg2) use
inordinately large amounts of memory when fetching a meg or two of
text? What's the simplest way to profile postgres's memory usage
during a query?

The large data lives in a field which is tellingly schema'd as "bytea".
I bet that's the good stuff, considering who wrote this.

Amir
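The constant-overhead claim above is easy to check directly. A minimal
sketch (plain Python 3; the variable names are mine):

```python
import sys

# bytes() objects carry a fixed per-object header; the payload is stored
# inline, so getsizeof(b) - len(b) is the same constant for any length.
small = b"x" * 10
large = b"x" * 10_000_000

overhead_small = sys.getsizeof(small) - len(small)
overhead_large = sys.getsizeof(large) - len(large)

# Constant per-object overhead regardless of payload size.
assert overhead_small == overhead_large
```

The same check on a str holding the decoded payload shows the larger and
variable per-character cost alluded to above.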
Re: No easy way to join discussion in existing thread when not subscribed
From: Stefan Kaltenbrunner
On 09/29/2015 09:34 PM, Amir Rohan wrote:
> On 09/29/2015 08:44 PM, Stefan Kaltenbrunner wrote:
>> Hi Amir!
>
> Hi Stefan, thanks for reviewing.

np - you are very welcome ;)

>>> Please, see attached patch adding a "whole thread [as mbox]" link
>>> to pgarchives, by popular request upthread.
>>
>> Thanks a lot for the patch - I took a quick look at the patch and have a
>> few comments to make..
>>
>>> If you'd like changes (hard limits, strip attachments, etc.),
>>> do let me know.
>>
>> some other things:
>>
>> * while this is a preexisting issue in the code (most of the http auth
>> requests are handled directly in lighttpd so nobody noticed so far i
>> guess) please use "Please authenticate with user 'archives' and
>> 'password' antispam"
>
> I don't follow. The auth handling is duplicated from the "raw" message
> view, and I verified basic auth is required, prompt and all.

For most accesses to the archives, the string for the basic auth reply
quotes the "archives" and "password" strings with ' - see
http://www.postgresql.org/message-id/20150827215248.GC14857@momjian.us

That one is "mostly" enforced by lighttpd and not the django code - my
"fix" back then missed the django part (hence my comment on it being a
preexisting issue), but we can fix it now while we are at it.

>> * have you verified that the resulting mbox actually contains the
>> newline separator after each message (I have not checked whether the
>> source data has it)?
>
> Yes. I also checked that both Mutt and Thunderbird can import the
> generated files.
>
> AFAICT, adding an extra newline actually ends up being one too many.
> That may be a quirk of the test data I used from the loader scripts and
> mboxes provided, but I can't be sure, so I kept it in, guessing that
> email clients will tolerate this sort of benign spec violation. TB and
> Mutt do.

hmm ok - will see if I can do some tests on the real data in the next
days...
> BTW, there's something off with the mbox processing chain you use. I
> think it is non-compliant with the spec (as per the qmail manpage of
> yore), which requires so-called ">From quoting":
>
> http://qmail.org/man/man5/mbox.html
>
> See for example <20150802150506.GH11473@alap3.anarazel.de> in
> pgsql-hackers.201508, which includes email messages as mime attachments
> and triggers (I believe) "missing Message-Ids" warnings from your tool,
> and is perhaps mangled in the archives. I've seen a few dozen of those
> while testing.

we have a number of current issues where data in the archives gets
mangled/corrupted that we are looking into. We are currently working on
some infrastructure to "test" parsing fixes across all the messages in
the archives to get a better understanding of what kind of effect a
change has. For this specific message, I'm curious how you found it,
though?

>> * are you sure that using unicode() for building the output is going
>> to work on all input? I don't think you can assume that the source
>> data is ASCII clean and/or has only valid unicode code points for
>> mapping
>
> Good catch, but I caught it earlier and sent v2, which keeps everything
> in bytes() and avoids encoding issues altogether.

yeah, missed v2 because it was shown in a separate thread in my MUA -
will look into it soon if I can find time...

>>> It checks hiddenstatus, but does materialize the entire raw thread
>>> in memory (including attachments) to form the response, which
>>> is unbounded in principle and can be sizable in practice.
>>>
>>> Perhaps django can do streaming requests, so we can bound
>>> the memory usage + timeout. It's been a while.
>>
>> this is dangerous - the box we are running on has limited <RAM>
>> <...>
>> Have you done any (approximate) measurements on what the additional
>> in-memory overhead in both pg (to build the response) and in django is
>> compared to the resulting mbox?
>
> That's more complicated to answer.
> First, the "whole thread" view already materializes all the messages
> in a thread, excepting attachments, so whether we include those is the
> main (only?) issue. We can drop them (stripping them may cost some
> cpu) if all else fails.
>
> The average email length in -hackers was about 10k in August. The
> largest thread contained 91 messages, the median was 3. So, say it
> takes 1M to store an mbox file for a large thread, assuming August is
> a representative sample.
>
> Although python (py2, especially) is quite wasteful about strings, it
> has only constant overhead per object for bytes() objects, and
> psycopg2 returns buffer objects which have the same property. Both
> these assertions check out with `sys.getsizeof`. So we're left with
> pretty much 1:1 memory to final mbox size on the python side of
> things, which seems tolerable even with a safety factor of 5-10 (I
> don't know if python does zero-copy).
>
> If that's a problem, we can slap on a size limit and still keep the
> 98th percentile or so of threads accessible. We can also add a rate
> limit to mitigate this as a DOS vector, if you're really concerned
> about that. Or a captcha, if we must.

good data - will look into the entire archives to see what the "largest
thread ever" (in terms of octet bytes) was and whether the current
system would be able to cope. My concern mostly stems from operational
experience (on the sysadmin team) that some operations on the archives
currently are fairly computation- and memory-intensive, causing issues
with availability, and we would want to not add more vectors that can
cause that.

> On the postgres side (and psycopg2 internals), I'm less knowledgeable.
> Maybe someone else can jump in? Does postgres (or psycopg2) use
> inordinately large amounts of memory when fetching a meg or two of
> text? What's the simplest way to profile postgres's memory usage
> during a query?
>
> The large data lives in a field which is tellingly schema'd as
> "bytea".
> I bet that's the good stuff, considering who wrote this.

well, bytea is the only really useful datatype we can store strings of
bytes in (other than blobs) ;)

Stefan
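For reference, the django-side fix under discussion amounts to sending
a 401 whose realm string matches the prompt lighttpd already uses, and
checking the credentials on the way back in. A minimal stdlib sketch --
the function names are mine, and the exact realm wording is taken from
the discussion above, so treat it as an assumption rather than the
actual pgarchives code:

```python
import base64

# Assumed realm string, matching the lighttpd-side prompt discussed above.
REALM = "Please authenticate with user 'archives' and password 'antispam'"

def challenge_header() -> str:
    # Value for the WWW-Authenticate header on a 401 response.
    return 'Basic realm="%s"' % REALM

def check_basic_auth(auth_header, user="archives", password="antispam"):
    # True iff the Authorization header carries the expected credentials.
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:]).decode("utf-8")
    except Exception:
        return False
    return decoded == "%s:%s" % (user, password)
```

In django this would be wrapped in a view decorator that returns an
HttpResponse with status 401 and the challenge header when the check
fails.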
Stefan Kaltenbrunner wrote:
> On 09/29/2015 09:34 PM, Amir Rohan wrote:
>> BTW, there's something off with the mbox processing chain you use. I
>> think it is non-compliant with the spec (as per the qmail manpage of
>> yore), which requires so-called ">From quoting":
>>
>> http://qmail.org/man/man5/mbox.html
>>
>> See for example <20150802150506.GH11473@alap3.anarazel.de> in
>> pgsql-hackers.201508, which includes email messages as mime
>> attachments and triggers (I believe) "missing Message-Ids" warnings
>> from your tool, and is perhaps mangled in the archives. I've seen a
>> few dozen of those while testing.
>
> we have a number of current issues where data in the archives gets
> mangled/corrupted that we are looking into. We are currently working
> on some infrastructure to "test" parsing fixes across all the messages
> in the archives to get a better understanding of what kind of effect a
> change has.

We do? I didn't know we were trying to keep track of these things,
otherwise I would have pointed it out earlier. I think there's the same
problem in Majordomo2 as well -- or rather, it's a bug in itself that
Majordomo passes these messages through without complaining about the
broken From line. (A decade ago, I would have said that Majordomo
should have fixed the message itself, but nowadays that's probably not
workable due to hash-signatures of the message bodies etc.)

>> The average email length in -hackers was about 10k in August. The
>> largest thread contained 91 messages, the median was 3. So, say it
>> takes 1M to store an mbox file for a large thread, assuming August is
>> a representative sample.
>
> good data - will look into the entire archives to see what the
> "largest thread ever" (in terms of octet bytes) was and whether the
> current system would be able to cope.
> My concern mostly stems from operational experience (on the sysadmin
> team) that some operations on the archives currently are fairly
> computation- and memory-intensive, causing issues with availability,
> and we would want to not add more vectors that can cause that.

The problem is that some threads contain patchsets that become large,
and we don't mind posting them over and over as we revise them, so the
total byte count can become pretty large. See for instance
https://www.postgresql.org/message-id/20150215044814.GL3391%40alvh.no-ip.org
where I posted the DDL deparse patch series several times, about 400kB
apiece. It didn't last too long though, and I doubt that's the largest
one; maybe search for Andres' patch series for logical decoding, a
couple of years ago.

Also, if we use this as a basis to implement mbox-for-commitfest as I
proposed, it would become much worse, because a CF can contain 50-100
threads.

If this could be done using an iterator that processes one or a few
messages at a time, that would probably fix the issue completely.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
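The iterator idea above can be sketched as a generator that frames one
stored message at a time, so peak memory stays at roughly one message
rather than the whole thread; in django the generator could then be
handed to a streaming response. A minimal sketch (all names are mine,
not pgarchives code):

```python
def iter_mbox(messages):
    # messages: an iterable of (envelope_from, raw_bytes) pairs, e.g. a
    # server-side cursor over the thread. Yields mbox-framed chunks one
    # message at a time instead of materializing the whole thread.
    for envelope_from, raw in messages:
        yield envelope_from + b"\n"
        yield raw
        if not raw.endswith(b"\n"):
            yield b"\n"  # ensure the message ends with a newline
        yield b"\n"  # blank-line separator between messages
```

For testing, `b"".join(iter_mbox(rows))` reproduces the fully
materialized output; in production the chunks would be streamed
directly as the HTTP response body.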
Amir Rohan wrote:
> BTW, there's something off with the mbox processing chain you use. I
> think it is non-compliant with the spec (as per the qmail manpage of
> yore), which requires so-called ">From quoting":
>
> http://qmail.org/man/man5/mbox.html

FWIW I'm not sure that this stuff is applicable anymore; nowadays we
trust the mime headers to tell the length, rather than rely on the
(unreliable) "^From " separator.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
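For readers unfamiliar with the convention under discussion: ">From
quoting" escapes body lines beginning with "From " so that a reader
splitting the file on "^From " lines cannot mistake them for message
boundaries. A minimal sketch of the mboxrd-style variant described in
the qmail man page (my own code, not what the archives currently do):

```python
import re

# Match a line beginning with "From " -- or an already-quoted ">From ",
# ">>From ", etc. -- at the start of the body or after a newline.
_FROM_LINE = re.compile(rb"(^|\n)(>*From )")

def mbox_quote(body: bytes) -> bytes:
    # Prefix one more '>' to each matching line, per the mboxrd convention.
    return _FROM_LINE.sub(rb"\1>\2", body)
```

Quoting already-quoted lines too is what makes the transformation
reversible on read-back, which is the point of the mboxrd variant.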
Re: No easy way to join discussion in existing thread when not subscribed
From: Stefan Kaltenbrunner
On 09/29/2015 10:11 PM, Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
>> On 09/29/2015 09:34 PM, Amir Rohan wrote:
>>> BTW, there's something off with the mbox processing chain you use. I
>>> think it is non-compliant with the spec (as per the qmail manpage of
>>> yore), which requires so-called ">From quoting":
>>>
>>> http://qmail.org/man/man5/mbox.html
>>>
>>> See for example <20150802150506.GH11473@alap3.anarazel.de> in
>>> pgsql-hackers.201508, which includes email messages as mime
>>> attachments and triggers (I believe) "missing Message-Ids" warnings
>>> from your tool, and is perhaps mangled in the archives. I've seen a
>>> few dozen of those while testing.
>>
>> we have a number of current issues where data in the archives gets
>> mangled/corrupted that we are looking into. We are currently working
>> on some infrastructure to "test" parsing fixes across all the
>> messages in the archives to get a better understanding of what kind
>> of effect a change has.
>
> We do? I didn't know we were trying to keep track of these things,
> otherwise I would have pointed it out earlier. I think there's the
> same problem in Majordomo2 as well -- or rather, it's a bug in itself
> that Majordomo passes these messages through without complaining about
> the broken From line. (A decade ago, I would have said that Majordomo
> should have fixed the message itself, but nowadays that's probably not
> workable due to hash-signatures of the message bodies etc.)

well, magnush and I at least discussed that we need infrastructure to
test parsing fixes/changes in a sensible way - see
http://www.postgresql.org/message-id/CAEepm=1dKk2hG3qQi25GpzABnTir8CiW9TjocJj1x8XTcd6c6A@mail.gmail.com

The way the current archives work is that the archive system is
basically a subscriber to the mailinglists and gets fed the mails as
they are sent out to any other subscriber.

>>> The average email length in -hackers was about 10k in August. The
>>> largest thread contained 91 messages, the median was 3.
>>> So, say it takes 1M to store an mbox file for a large thread,
>>> assuming August is a representative sample.
>
>> good data - will look into the entire archives to see what the
>> "largest thread ever" (in terms of octet bytes) was and whether the
>> current system would be able to cope. My concern mostly stems from
>> operational experience (on the sysadmin team) that some operations on
>> the archives currently are fairly computation- and memory-intensive,
>> causing issues with availability, and we would want to not add more
>> vectors that can cause that.
>
> The problem is that some threads contain patchsets that become large,
> and we don't mind posting them over and over as we revise them, so the
> total byte count can become pretty large. See for instance
> https://www.postgresql.org/message-id/20150215044814.GL3391%40alvh.no-ip.org
> where I posted the DDL deparse patch series several times, about 400kB
> apiece. It didn't last too long though, and I doubt that's the largest
> one; maybe search for Andres' patch series for logical decoding, a
> couple of years ago.
>
> Also, if we use this as a basis to implement mbox-for-commitfest as I
> proposed, it would become much worse, because a CF can contain 50-100
> threads.
>
> If this could be done using an iterator that processes one or a few
> messages at a time, that would probably fix the issue completely.

maybe - but I also think we should look into the low-tech version of
this and build actual on-disk mbox files on a per-commitfest or
per-thread basis at the time we get the mail into the system. At that
time we can serialize the process sensibly to not overload the system,
and afterwards the "only" problem we have to solve is delivering (semi)
static files from a filesystem.

Stefan