Thread: Re: No easy way to join discussion in existing thread when not subscribed
On 09/29/2015 08:44 PM, Stefan Kaltenbrunner wrote:
> Hi Amir!

Hi Stefan, thanks for reviewing.

>> Please, see attached patch adding a "whole thread [as mbox]" link
>> to pgarchives, by popular request upthread.
>
> Thanks a lot for the patch - I took a quick look at the patch and have a
> few comments to make..
>
>> If you'd like changes (hard limits, strip attachments, etc.),
>> do let me know.
>
> some other things:
>
> * while this is a preexisting issue in the code (most of the http auth
> requests are handled directly in lighttpd so nobody noticed so far i
> guess) please use "Please authenticate with user 'archives' and
> 'password' antispam"

I don't follow. The auth handling is duplicated from the "raw" message
view, and I verified basic auth is required, prompt and all.

> * have you verified that the resulting mbox actually contains the
> newline separator after each message (I have not checked whether the
> source data has it)?

Yes. I also checked that both Mutt and Thunderbird can import the
generated files.

AFAICT, adding an extra newline actually ends up being one too many.
That may be a quirk of the test data I used from the loader scripts and
mboxes provided, but I can't be sure, so I kept it in, guessing that
email clients will tolerate this sort of benign spec violation. TB and
Mutt do.

BTW, there's something off with the mbox processing chain you use. I
think it is non-compliant with the spec (as per the qmail manpage of
yore), which requires so-called ">From quoting":

http://qmail.org/man/man5/mbox.html

See for example <20150802150506.GH11473@alap3.anarazel.de> in
pgsql-hackers.201508, which includes email messages as mime attachments
and triggers (I believe) "missing Message-Ids" warnings from your tool,
and is perhaps mangled in the archives. I've seen a few dozen of those
while testing.

> * are you sure that using unicode() for building the output is going to
> work on all input?
> I don't think you can assume that the source data is
> ASCII clean and/or has only valid unicode code points for mapping

Good catch, but I caught it earlier and sent v2, which keeps everything
in bytes() and avoids encoding issues altogether.

>> It checks hiddenstatus, but does materialize the entire raw thread
>> in memory (including attachments) to form the response, which
>> is unbounded in principle and can be sizable in practice.
>>
>> Perhaps django can do streaming requests, so we can bound
>> the memory usage + timeout. It's been a while.
>
> this is dangerous - the box we are running on has limited <RAM>
> <...>
> Have you done any (approximate) measurements on what the additional
> in-memory overhead in both pg (to build the response) and in django is
> compared to the resulting mbox?

That's more complicated to answer. First, the "whole thread" view
already materializes all the messages in a thread, excepting
attachments, so whether we include those is the main (only?) issue.
We can drop them (stripping them may cost some cpu) if all else fails.

The average email length in -hackers was about 10k in August. The
largest thread contained 91 messages, the median was 3. So, say it
takes 1M to store an mbox file for a large thread, assuming August is
a representative sample.

Although python (py2, especially) is quite wasteful about strings, it
has only constant overhead per object for bytes() objects, and psycopg2
returns buffer objects which have the same property. Both these
assertions check out with `sys.getsizeof`. So we're left with pretty
much 1:1 memory to final mbox size on the python side of things, which
seems tolerable even with a safety factor of 5-10 (I don't know if
python does zero-copy).

If that's a problem, we can slap on a size limit and still keep the
98th percentile or so of threads accessible. We can also add a rate
limit to mitigate this as a DOS vector, if you're really concerned
about that. Or a captcha, if we must.
On the postgres side (and psycopg2 internals), I'm less knowledgeable.
Maybe someone else can jump in? Does postgres (or psycopg2) use
inordinately large amounts of memory when fetching a meg or two of
text? What's the simplest way to profile postgres's memory usage
during a query?

The large data lives in a field which is tellingly schema'd as "bytea".
I bet that's the good stuff, considering who wrote this.

Amir
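The constant-overhead claim above is easy to check directly. A minimal
sketch (plain Python 3; the variable names are mine):

```python
import sys

# bytes() objects carry a fixed per-object header; the payload is stored
# inline, so getsizeof(b) - len(b) is the same constant for any length.
small = b"x" * 10
large = b"x" * 10_000_000

overhead_small = sys.getsizeof(small) - len(small)
overhead_large = sys.getsizeof(large) - len(large)

# Constant per-object overhead regardless of payload size.
assert overhead_small == overhead_large
```

The same check on a str holding the decoded payload shows the larger and
variable per-character cost alluded to above.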
Re: No easy way to join discussion in existing thread when not subscribed
From: Stefan Kaltenbrunner
On 09/29/2015 09:34 PM, Amir Rohan wrote:
> On 09/29/2015 08:44 PM, Stefan Kaltenbrunner wrote:
>> Hi Amir!
>
> Hi Stefan, thanks for reviewing.

np - you are very welcome ;)

>>> Please, see attached patch adding a "whole thread [as mbox]" link
>>> to pgarchives, by popular request upthread.
>>
>> Thanks a lot for the patch - I took a quick look at the patch and have a
>> few comments to make..
>>
>>> If you'd like changes (hard limits, strip attachments, etc.),
>>> do let me know.
>>
>> some other things:
>>
>> * while this is a preexisting issue in the code (most of the http auth
>> requests are handled directly in lighttpd so nobody noticed so far i
>> guess) please use "Please authenticate with user 'archives' and
>> 'password' antispam"
>
> I don't follow. The auth handling is duplicated from the "raw" message
> view, and I verified basic auth is required, prompt and all.

For most accesses to the archives, the string for the basic auth reply
quotes the "archives" and "password" strings with ' - see
http://www.postgresql.org/message-id/20150827215248.GC14857@momjian.us

That one is "mostly" enforced by lighttpd and not the django code - my
"fix" back then missed the django part (hence my comment on it being a
preexisting issue), but we can fix it now while we are at it.

>> * have you verified that the resulting mbox actually contains the
>> newline separator after each message (I have not checked whether the
>> source data has it)?
>
> Yes. I also checked that both Mutt and Thunderbird can import the
> generated files.
>
> AFAICT, adding an extra newline actually ends up being one too many.
> That may be a quirk of the test data I used from the loader scripts and
> mboxes provided, but I can't be sure, so I kept it in, guessing that
> email clients will tolerate this sort of benign spec violation. TB and
> Mutt do.

hmm ok - will see if I can do some tests on the real data in the next
days...
> BTW, there's something off with the mbox processing chain you use. I
> think it is non-compliant with the spec (as per the qmail manpage of
> yore), which requires so-called ">From quoting":
>
> http://qmail.org/man/man5/mbox.html
>
> See for example <20150802150506.GH11473@alap3.anarazel.de> in
> pgsql-hackers.201508, which includes email messages as mime attachments
> and triggers (I believe) "missing Message-Ids" warnings from your tool,
> and is perhaps mangled in the archives. I've seen a few dozen of those
> while testing.

we have a number of current issues where data in the archives gets
mangled/corrupted that we are looking into. We are currently working on
some infrastructure to "test" parsing fixes across all the messages in
the archives to get a better understanding of what kind of effect a
change has. For this specific message, I'm curious how you found it,
though?

>> * are you sure that using unicode() for building the output is going
>> to work on all input? I don't think you can assume that the source
>> data is ASCII clean and/or has only valid unicode code points for
>> mapping
>
> Good catch, but I caught it earlier and sent v2, which keeps everything
> in bytes() and avoids encoding issues altogether.

yeah, missed v2 because it was shown in a separate thread in my MUA -
will look into it soon if I can find time...

>>> It checks hiddenstatus, but does materialize the entire raw thread
>>> in memory (including attachments) to form the response, which
>>> is unbounded in principle and can be sizable in practice.
>>>
>>> Perhaps django can do streaming requests, so we can bound
>>> the memory usage + timeout. It's been a while.
>>
>> this is dangerous - the box we are running on has limited <RAM>
>> <...>
>> Have you done any (approximate) measurements on what the additional
>> in-memory overhead in both pg (to build the response) and in django is
>> compared to the resulting mbox?
>
> That's more complicated to answer.
> First, the "whole thread" view already materializes all the messages
> in a thread, excepting attachments, so whether we include those is the
> main (only?) issue. We can drop them (stripping them may cost some
> cpu) if all else fails.
>
> The average email length in -hackers was about 10k in August. The
> largest thread contained 91 messages, the median was 3. So, say it
> takes 1M to store an mbox file for a large thread, assuming August is
> a representative sample.
>
> Although python (py2, especially) is quite wasteful about strings, it
> has only constant overhead per object for bytes() objects, and
> psycopg2 returns buffer objects which have the same property. Both
> these assertions check out with `sys.getsizeof`. So we're left with
> pretty much 1:1 memory to final mbox size on the python side of
> things, which seems tolerable even with a safety factor of 5-10 (I
> don't know if python does zero-copy).
>
> If that's a problem, we can slap on a size limit and still keep the
> 98th percentile or so of threads accessible. We can also add a rate
> limit to mitigate this as a DOS vector, if you're really concerned
> about that. Or a captcha, if we must.

good data - will look into the entire archives to see what the "largest
thread ever" (in terms of octet bytes) was and whether the current
system would be able to cope. My concern mostly stems from operational
experience (on the sysadmin team) that some operations on the archives
currently are fairly computation- and memory-intensive, causing issues
with availability, and we would want to not add more vectors that can
cause that.

> On the postgres side (and psycopg2 internals), I'm less knowledgeable.
> Maybe someone else can jump in? Does postgres (or psycopg2) use
> inordinately large amounts of memory when fetching a meg or two of
> text? What's the simplest way to profile postgres's memory usage
> during a query?
>
> The large data lives in a field which is tellingly schema'd as
> "bytea".
> I bet that's the good stuff, considering who wrote this.

well, bytea is the only really useful datatype we can store strings of
bytes in (other than blobs) ;)

Stefan
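For reference, the django-side fix under discussion amounts to sending
a 401 whose realm string matches the prompt lighttpd already uses, and
checking the credentials on the way back in. A minimal stdlib sketch --
the function names are mine, and the exact realm wording is taken from
the discussion above, so treat it as an assumption rather than the
actual pgarchives code:

```python
import base64

# Assumed realm string, matching the lighttpd-side prompt discussed above.
REALM = "Please authenticate with user 'archives' and password 'antispam'"

def challenge_header() -> str:
    # Value for the WWW-Authenticate header on a 401 response.
    return 'Basic realm="%s"' % REALM

def check_basic_auth(auth_header, user="archives", password="antispam"):
    # True iff the Authorization header carries the expected credentials.
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:]).decode("utf-8")
    except Exception:
        return False
    return decoded == "%s:%s" % (user, password)
```

In django this would be wrapped in a view decorator that returns an
HttpResponse with status 401 and the challenge header when the check
fails.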
Stefan Kaltenbrunner wrote:
> On 09/29/2015 09:34 PM, Amir Rohan wrote:
>> BTW, there's something off with the mbox processing chain you use. I
>> think it is non-compliant with the spec (as per the qmail manpage of
>> yore), which requires so-called ">From quoting":
>>
>> http://qmail.org/man/man5/mbox.html
>>
>> See for example <20150802150506.GH11473@alap3.anarazel.de> in
>> pgsql-hackers.201508, which includes email messages as mime
>> attachments and triggers (I believe) "missing Message-Ids" warnings
>> from your tool, and is perhaps mangled in the archives. I've seen a
>> few dozen of those while testing.
>
> we have a number of current issues where data in the archives gets
> mangled/corrupted that we are looking into. We are currently working
> on some infrastructure to "test" parsing fixes across all the messages
> in the archives to get a better understanding of what kind of effect a
> change has.

We do? I didn't know we were trying to keep track of these things,
otherwise I would have pointed it out earlier. I think there's the same
problem in Majordomo2 as well -- or rather, it's a bug in itself that
Majordomo passes these messages through without complaining about the
broken From line. (A decade ago, I would have said that Majordomo
should have fixed the message itself, but nowadays that's probably not
workable due to hash-signatures of the message bodies etc.)

>> The average email length in -hackers was about 10k in August. The
>> largest thread contained 91 messages, the median was 3. So, say it
>> takes 1M to store an mbox file for a large thread, assuming August is
>> a representative sample.
>
> good data - will look into the entire archives to see what the
> "largest thread ever" (in terms of octet bytes) was and whether the
> current system would be able to cope.
> My concern mostly stems from operational experience (on the sysadmin
> team) that some operations on the archives currently are fairly
> computation- and memory-intensive, causing issues with availability,
> and we would want to not add more vectors that can cause that.

The problem is that some threads contain patchsets that become large,
and we don't mind posting them over and over as we revise them, so the
total byte count can become pretty large. See for instance
https://www.postgresql.org/message-id/20150215044814.GL3391%40alvh.no-ip.org
where I posted the DDL deparse patch series several times, about 400kB
apiece. It didn't last too long though, and I doubt that's the largest
one; maybe search for Andres' patch series for logical decoding, a
couple of years ago.

Also, if we use this as a basis to implement mbox-for-commitfest as I
proposed, it would become much worse, because a CF can contain 50-100
threads.

If this could be done using an iterator that processes one or a few
messages at a time, that would probably fix the issue completely.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
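The iterator idea above can be sketched as a generator that frames one
stored message at a time, so peak memory stays at roughly one message
rather than the whole thread; in django the generator could then be
handed to a streaming response. A minimal sketch (all names are mine,
not pgarchives code):

```python
def iter_mbox(messages):
    # messages: an iterable of (envelope_from, raw_bytes) pairs, e.g. a
    # server-side cursor over the thread. Yields mbox-framed chunks one
    # message at a time instead of materializing the whole thread.
    for envelope_from, raw in messages:
        yield envelope_from + b"\n"
        yield raw
        if not raw.endswith(b"\n"):
            yield b"\n"  # ensure the message ends with a newline
        yield b"\n"  # blank-line separator between messages
```

For testing, `b"".join(iter_mbox(rows))` reproduces the fully
materialized output; in production the chunks would be streamed
directly as the HTTP response body.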
Amir Rohan wrote:
> BTW, there's something off with the mbox processing chain you use. I
> think it is non-compliant with the spec (as per the qmail manpage of
> yore), which requires so-called ">From quoting":
>
> http://qmail.org/man/man5/mbox.html

FWIW I'm not sure that this stuff is applicable anymore; nowadays we
trust the mime headers to tell the length, rather than rely on the
(unreliable) "^From " separator.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
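For readers unfamiliar with the convention under discussion: ">From
quoting" escapes body lines beginning with "From " so that a reader
splitting the file on "^From " lines cannot mistake them for message
boundaries. A minimal sketch of the mboxrd-style variant described in
the qmail man page (my own code, not what the archives currently do):

```python
import re

# Match a line beginning with "From " -- or an already-quoted ">From ",
# ">>From ", etc. -- at the start of the body or after a newline.
_FROM_LINE = re.compile(rb"(^|\n)(>*From )")

def mbox_quote(body: bytes) -> bytes:
    # Prefix one more '>' to each matching line, per the mboxrd convention.
    return _FROM_LINE.sub(rb"\1>\2", body)
```

Quoting already-quoted lines too is what makes the transformation
reversible on read-back, which is the point of the mboxrd variant.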
Re: No easy way to join discussion in existing thread when not subscribed
From: Stefan Kaltenbrunner
On 09/29/2015 10:11 PM, Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
>> On 09/29/2015 09:34 PM, Amir Rohan wrote:
>>> BTW, there's something off with the mbox processing chain you use. I
>>> think it is non-compliant with the spec (as per the qmail manpage of
>>> yore), which requires so-called ">From quoting":
>>>
>>> http://qmail.org/man/man5/mbox.html
>>>
>>> See for example <20150802150506.GH11473@alap3.anarazel.de> in
>>> pgsql-hackers.201508, which includes email messages as mime
>>> attachments and triggers (I believe) "missing Message-Ids" warnings
>>> from your tool, and is perhaps mangled in the archives. I've seen a
>>> few dozen of those while testing.
>>
>> we have a number of current issues where data in the archives gets
>> mangled/corrupted that we are looking into. We are currently working
>> on some infrastructure to "test" parsing fixes across all the
>> messages in the archives to get a better understanding of what kind
>> of effect a change has.
>
> We do? I didn't know we were trying to keep track of these things,
> otherwise I would have pointed it out earlier. I think there's the
> same problem in Majordomo2 as well -- or rather, it's a bug in itself
> that Majordomo passes these messages through without complaining about
> the broken From line. (A decade ago, I would have said that Majordomo
> should have fixed the message itself, but nowadays that's probably not
> workable due to hash-signatures of the message bodies etc.)

well, magnush and I at least discussed that we need infrastructure to
test parsing fixes/changes in a sensible way - see
http://www.postgresql.org/message-id/CAEepm=1dKk2hG3qQi25GpzABnTir8CiW9TjocJj1x8XTcd6c6A@mail.gmail.com

The way the current archives work is that the archive system is
basically a subscriber to the mailinglists and gets fed the mails as
they are sent out to any other subscriber.

>>> The average email length in -hackers was about 10k in August. The
>>> largest thread contained 91 messages, the median was 3.
>>> So, say it takes 1M to store an mbox file for a large thread,
>>> assuming August is a representative sample.
>
>> good data - will look into the entire archives to see what the
>> "largest thread ever" (in terms of octet bytes) was and whether the
>> current system would be able to cope. My concern mostly stems from
>> operational experience (on the sysadmin team) that some operations on
>> the archives currently are fairly computation- and memory-intensive,
>> causing issues with availability, and we would want to not add more
>> vectors that can cause that.
>
> The problem is that some threads contain patchsets that become large,
> and we don't mind posting them over and over as we revise them, so the
> total byte count can become pretty large. See for instance
> https://www.postgresql.org/message-id/20150215044814.GL3391%40alvh.no-ip.org
> where I posted the DDL deparse patch series several times, about 400kB
> apiece. It didn't last too long though, and I doubt that's the largest
> one; maybe search for Andres' patch series for logical decoding, a
> couple of years ago.
>
> Also, if we use this as a basis to implement mbox-for-commitfest as I
> proposed, it would become much worse, because a CF can contain 50-100
> threads.
>
> If this could be done using an iterator that processes one or a few
> messages at a time, that would probably fix the issue completely.

maybe - but I also think we should look into the low-tech version of
this and build actual on-disk mbox files on a per-commitfest or
per-thread basis at the time we get the mail into the system. At that
time we can serialize the process sensibly to not overload the system,
and afterwards the "only" problem we have to solve is delivering (semi)
static files from a filesystem.

Stefan