Re: pgsql-bugs mailing list dump? - Mailing list pgsql-www

From Magnus Hagander
Subject Re: pgsql-bugs mailing list dump?
Date
Msg-id CABUevEytobwva8k=jNeRWcRJcORqRP-DAkiOdy4fMw7ZS7vpxw@mail.gmail.com
In response to Re: pgsql-bugs mailing list dump?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: pgsql-bugs mailing list dump?
List pgsql-www
On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> Hello Magnus,
>
> On Wed, 16 Dec 2020 15:02:03 +0100
> Magnus Hagander <magnus@hagander.net> wrote:
>
> > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:
> [...]
> > > However, maybe some admins would agree to provide some pgsql dump or access
> > > to some json API if relevant? We would save some time and CPU :)
> >
> > There are mbox files available for download from the list archives --
> > would that work for you? It can be done on a per-thread basis as well,
> > I guess, but that's not something we have now (that is, we don't have
> > a unique listing of threads).
>
> The srht import API processes one JSON document per thread. That's why we try to
> gather one mbox per thread.

There must be something I'm missing here, because that sounds... insane?

Basically they take a raw mbox and wrap it in json? Just to make it
less efficient?

And they specifically need the "outside" to have done the one thing
that's actually hard, namely threading?

What are they actually trying to accomplish here?


> > But if you're building your own threading on it, then the monthly mbox files
> > at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
>
> Yes, we already got them to start poking around. We have a small
> Python script processing them, but the mbox format and/or Python lib and/or email
> format are a bit loose and we currently have 3k orphan emails out of 13697
> threads.

Oh, there is a lot of weirdness in the email archives, particularly in
history (it's gotten a bit better, but we still see really weird mime
combinations fairly often). And there have been many crappy
implementations of mbox over the years as well, which has led to a lot
of problems with imports :/
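
For anyone building their own threading on top of the monthly mbox files, here is a minimal sketch of grouping messages by thread root using the In-Reply-To and References headers. It is not the script discussed in this thread and not how the archives do threading; the function names and the mbox filename in the comment are made up for illustration, and it assumes the Message-ID headers are usable.

```python
import mailbox
import re
from collections import defaultdict

# Anything that looks like <id> inside a header value.
MSGID_RE = re.compile(r"<([^<>\s]+)>")


def extract_ids(header_value):
    """Return all <...> tokens found in a header value (empty list if none)."""
    return MSGID_RE.findall(str(header_value or ""))


def find_root(msgid, parent):
    """Walk up the parent links; stop at the top or when a cycle is detected."""
    seen = set()
    while msgid not in seen:
        seen.add(msgid)
        up = parent.get(msgid)
        if up is None:
            return msgid
        msgid = up
    return msgid


def group_threads(messages):
    """Group messages into threads, returning {root_id: [message_id, ...]}.

    The parent is taken from In-Reply-To, falling back to the last entry of
    References. A parent that is missing from the input still acts as the
    thread root, so siblings of a lost message stay together.
    """
    parent = {}
    for msg in messages:
        ids = extract_ids(msg.get("Message-ID"))
        if not ids:
            continue  # no usable Message-ID, cannot be threaded
        refs = extract_ids(msg.get("In-Reply-To"))
        if not refs:
            refs = extract_ids(msg.get("References"))[-1:]
        parent[ids[0]] = refs[0] if refs else None

    threads = defaultdict(list)
    for msgid in parent:
        threads[find_root(msgid, parent)].append(msgid)
    return dict(threads)


# Hypothetical usage with a downloaded monthly file:
#   threads = group_threads(mailbox.mbox("pgsql-bugs.202012"))
```

Messages with no parent header end up as single-message threads rather than being dropped, which matches how such messages are treated on the archive side.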

So the root question there is, why are we extracting more structured
data into a format that we know is worse?


> BTW, we found some orphan emails in the pgarchiver UI as well that might be fixed
> if you are interested. The in-reply-to field is malformed but a message-id is
> still available there, eg: https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.

I'm not sure we want to go down the route of manually editing
messages. It would work for a message like this from 1999 because
that's before DKIM which would prevent us from doing it at all. But
either way the archives should represent what things actually looked
like as much as possible. And from an archives perspective that is not
an orphaned thread, that is a single message sent in its own thread
(and we have plenty of those in general).
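
If a downstream importer wants to be more lenient than the archives, it can try to salvage a message-id from a malformed In-Reply-To value on its own side, without touching the archived message. The sketch below is only an illustration under assumptions: the function name is made up, the sample header values in the comments are invented and are not the actual header of the 1999 message, and the fallback heuristic (any token containing exactly one "@") can produce false positives.

```python
import re

# A well-formed reference: <local@domain>, possibly surrounded by free text.
ANGLE_ID = re.compile(r"<([^<>\s]+@[^<>\s]+)>")


def salvage_message_id(raw):
    """Best-effort extraction of a message-id from a loose In-Reply-To value.

    Returns the id without angle brackets, or None if nothing plausible
    is found.
    """
    if not raw:
        return None
    # Preferred case: an angle-bracketed id somewhere in the value, e.g.
    # "Your message of <date> <1234.935677480@example.org>" (invented sample).
    match = ANGLE_ID.search(raw)
    if match:
        return match.group(1)
    # Fallback: a bare token with exactly one "@" and text on both sides.
    for token in raw.split():
        token = token.strip("<>()[],;\"'")
        if (token.count("@") == 1
                and not token.startswith("@")
                and not token.endswith("@")):
            return token
    return None
```

A message for which this returns None is simply the start of its own thread as far as the importer can tell.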



> Without any better solution, maybe our current method is "good enough" for a
> simple PoC. We could tighten/rewrite this part of the procedure in a second
> round if it is worth it.

Probably.

But if you are somehow crawling the per-thread mbox urls please make
sure you rate limit yourself severely. They're really not meant to be
API endpoints...
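
As a rough sketch of what "rate limit yourself severely" could look like in a small crawler: fetch strictly one URL at a time and enforce a minimum delay between request starts. The 10 second default, the User-Agent string and the function names are assumptions for illustration, not limits or conventions published for the archives; the URL list is left to the caller.

```python
import time
import urllib.request


def _http_get(url):
    """Plain GET that identifies the crawler (placeholder contact address)."""
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "mbox-poc-crawler (contact: you@example.org)"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def fetch_all(urls, min_interval=10.0, fetch=_http_get,
              sleep=time.sleep, clock=time.monotonic):
    """Fetch URLs sequentially, at most one request per min_interval seconds.

    fetch, sleep and clock are injectable so the pacing logic can be
    exercised without touching the network.
    """
    results = {}
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever is left of the interval since the last request.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results[url] = fetch(url)
    return results
```

Keeping the crawl sequential and caching what has already been downloaded matters more than the exact interval chosen.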

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/


