Thread: Re: No easy way to join discussion in existing thread when not subscribed

Re: No easy way to join discussion in existing thread when not subscribed

From: "Amir Rohan"
Date: 30 September 2015, 01:27:55

On 09/29/2015 10:51 PM, Stefan Kaltenbrunner wrote:
> On 09/29/2015 09:34 PM, Amir Rohan wrote:
>
> for most accesses to the archives the string for the basic auth reply
> quotes the "archives" and "password" strings with ' - see

Fixed.

> we have a number of current issues where data in the archives gets
> mangled/corrupted we are looking into. We are currently working on some
> infrastructure to "test" parsing fixes across all the messages in the
> archives to get a better understanding of what kind of effect a change has.
> For this specific message I'm curious of how you found it though?
>

I made a prototype before looking at the repo, using
python's 'mailbox' parser module, and some asserts failed
when some messages parsed out as lacking Message-ID. I had
also read the mbox spec in order to write the patch, and
put the two together.
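
A minimal sketch of that kind of check, using the standard library's `mailbox` module. The helper name and the idea of pointing it at a local mbox file are assumptions for illustration; this is not the prototype itself.

```python
import mailbox


def messages_missing_message_id(mbox_path):
    """Return the keys of messages in an mbox file that lack a Message-ID header."""
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        # Header lookup returns None when the header is absent.
        return [key for key, msg in box.iteritems() if msg["Message-ID"] is None]
    finally:
        box.close()
```

A non-empty result points at messages that were either split at the wrong place or genuinely lack the header, which is the cue to compare the file against the mbox "From " line rules.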

>>> <...>
>>> Have you done any (approximate) measurements  on what the additional
>>> in-memory overhead in both pg (to build the response) and in django is
>>> compared to the resulting  mbox?
>>>
>>>> Amir Wrote:
>>>> <some napkins and mitigations>
> My concern mostly stems from operational
> experience (on the sysadmin team) that some operations on the archives
> currently are fairly computationally and memory intensive causing issues
> with availability and we would want to not add more vectors that can
> cause that.
>

You're right to be concerned, I raised the issue myself to begin with.
We can solve any particular problem, but how to optimize depends too
much on particulars I don't have.

If you have both cpu and memory shortage, we could trade storage.
You already serve monthly mboxes; having per-thread mboxes which are
updated in batch (say hourly) could be manageable, and that code
is practically written already. Serving static files is as cheap as it
gets on both cpu and memory.
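
A rough sketch of such a batch job, assuming the messages have already been grouped by thread. The function name, the `thread-<id>.mbox` file naming, and the input shape are hypothetical and not taken from the patch.

```python
import mailbox
import os


def write_thread_mboxes(threads, out_dir):
    """Write one static mbox file per thread.

    threads maps a thread id to a list of raw messages (bytes).
    """
    os.makedirs(out_dir, exist_ok=True)
    for thread_id, raw_messages in threads.items():
        final_path = os.path.join(out_dir, f"thread-{thread_id}.mbox")
        tmp_path = final_path + ".tmp"
        # Start from a clean temporary file on every run.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        box = mailbox.mbox(tmp_path, create=True)
        try:
            for raw in raw_messages:
                box.add(raw)  # mailbox adds the "From " separator line
            box.flush()
        finally:
            box.close()
        # Swap the finished file into place so the web server
        # never serves a half-written mbox.
        os.replace(tmp_path, final_path)
```

Run from cron, this keeps the per-request cost down to a static file read, at the price of the files lagging the list by up to one batch interval.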

But for now, see attached patch, which adds a tweakable for setting a
cap on the max size of the response. It still gets everything
from the database at once, so it may not be of much help except
perhaps as a metric for you to easily monitor.

There's also an EJECT button that turns all thread mbox requests into
403, so you can just throw this in production and flip the switch
if a problem appears. Also fixes the quoting in the message.
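
For readers without the attachment, the two switches amount to roughly the following. This is a framework-free sketch, not the patch: the setting names and the default cap are hypothetical, and returning 403 for an oversized thread is an assumption, since only the EJECT case is described above.

```python
from dataclasses import dataclass


@dataclass
class ThreadMboxPolicy:
    # Hypothetical settings, not the names used in the patch.
    enabled: bool = True                # False acts as the "EJECT button"
    max_bytes: int = 10 * 1024 * 1024   # illustrative cap on response size


def thread_mbox_response(policy, raw_messages):
    """Return (status, body) for a thread mbox request.

    raw_messages is a list of bytes already fetched from the database,
    so the cap limits the response size, not the memory used to fetch.
    """
    if not policy.enabled:
        return 403, b""
    total = sum(len(m) for m in raw_messages)
    if total > policy.max_bytes:
        # Assumption: oversized threads are refused with 403 as well.
        return 403, b""
    return 200, b"".join(raw_messages)
```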

Attachment

20150929.pgarchives.add_thread_mbox_dl_amir_v3.patch

Re: No easy way to join discussion in existing thread when not subscribed

From: Stefan Kaltenbrunner
Date: 30 September 2015, 06:53:36

On 09/30/2015 03:27 AM, Amir Rohan wrote:
> On 09/29/2015 10:51 PM, Stefan Kaltenbrunner wrote:
>> On 09/29/2015 09:34 PM, Amir Rohan wrote:
>>
>> for most accesses to the archives the string for the basic auth reply
>> quotes the "archives" and "password" strings with ' - see
>
> Fixed.

I think you missed at least one spot in the code you added and also at
least one occurrence in existing code.

>
>> we have a number of current issues where data in the archives gets
>> mangled/corrupted we are looking into. We are currently working on some
>> infrastructure to "test" parsing fixes across all the messages in the
>> archives to get a better understanding of what kind of effect a change has.
>> For this specific message I'm curious of how you found it though?
>>
>
> I made a prototype before looking at the repo, using
> python's 'mailbox' parser module, and some asserts failed
> when some messages parsed out as lacking Message-ID. I had
> also read the mbox spec in order to write the patch, and
> put the two together.

ah - nice effort!

>
>>>> <...>
>>>> Have you done any (approximate) measurements  on what the additional
>>>> in-memory overhead in both pg (to build the response) and in django is
>>>> compared to the resulting  mbox?
>>>>
>>>>> Amir Wrote:
>>>>> <some napkins and mitigations>
>> My concern mostly stems from operational
>> experience (on the sysadmin team) that some operations on the archives
>> currently are fairly computationally and memory intensive causing issues
>> with availability and we would want to not add more vectors that can
>> cause that.
>>
>
> You're right to be concerned, I raised the issue myself to begin with.
> We can solve any particular problem, but how to optimize depends too
> much on particulars I don't have.
>
> If you have both cpu and memory shortage, we could trade storage.
> You already serve monthly mboxes; having per-thread mboxes which are
> updated in batch (say hourly) could be manageable, and that code
> is practically written already. Serving static files is as cheap as it
> gets on both cpu and memory.

yeah that is what I was thinking - though I don't think we want hourly.
We went a long way to actually get the current system to be "almost
instant" in terms of having the archives in sync with the lists (at least
for the basic stuff). What I was thinking is doing the mbox creation
during the import - we already serialize the process (on the MTA/LDA
side) there so that only one message is imported at a time, so there is
way less risk of overwhelming the box.
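
A sketch of what an import-time hook could look like, assuming the importer already knows the thread id of the message it is handling. The function name and file layout are hypothetical. It takes no lock and relies entirely on the import being serialized as described above.

```python
import mailbox
import os


def append_to_thread_mbox(out_dir, thread_id, raw_message):
    """Append one freshly imported message (bytes) to its thread's mbox file."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"thread-{thread_id}.mbox")
    # create=True starts a new mbox for the first message of a thread.
    box = mailbox.mbox(path, create=True)
    try:
        box.add(raw_message)  # appended at the end of the file
        box.flush()
    finally:
        box.close()
```

Each import then costs one small append, and the per-thread download stays a static file that is as current as the rest of the archive.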

>
> But for now, see attached patch, which adds a tweakable for setting a
> cap on the max size of the response. It still gets everything
> from the database at once, so it may not be of much help except
> perhaps as a metric for you to easily monitor.
>
> There's also an EJECT button that turns all thread mbox requests into
> 403, so you can just throw this in production and flip the switch
> if a problem appears. Also fixes the quoting in the message.


thanks for the updated patch - will take a look and see whether I can 
find out what the worst case is in the archives later today.




Stefan