Re: pgsql-bugs mailing list dump? - Mailing list pgsql-www

From Jehan-Guillaume de Rorthais
Subject Re: pgsql-bugs mailing list dump?
Date
Msg-id 20201223235400.226428d8@firost
In response to Re: pgsql-bugs mailing list dump?  (Magnus Hagander <magnus@hagander.net>)
List pgsql-www
On Tue, 22 Dec 2020 11:11:10 +0100
Magnus Hagander <magnus@hagander.net> wrote:

> On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> >
> > Hello Magnus,
> >
> > On Wed, 16 Dec 2020 15:02:03 +0100
> > Magnus Hagander <magnus@hagander.net> wrote:
> >  
> > > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > > <jgdr@dalibo.com> wrote:  
> > [...]  
> > > > However, maybe some admins would agree to provide some pgsql dump or
> > > > access to some json API if relevant? We would save some time and
> > > > CPU :)  
> > >
> > > There are mbox files available for download from the list archives --
> > > would that work for you? It can be done on a per-thread basis as well,
> > > i guess, but that's not something we have now (that is, we don't have
> > > a unique listing of threads).  
> >
> > The srht import API processes one JSON document per thread. That's why we
> > try to gather one mbox per thread.
> 
> There must be something I'm missing here, because that sounds.. Insane?
> 
> Basically they take a raw mbox and wrap it in json? Just to make it
> less efficient?
> 
> And they specifically need the "outside" to have done the one thing
> that's actually hard, namely threading?
> 
> What are they actually trying to accomplish here?

That would indeed be insane and crazy :) Such an approach would be a dead end
right from the start.

No, the sr.ht import script accepts a pure JSON document *only*. It does not
require wrapping an mbox in JSON: the whole thread must be expressed as JSON
following their **import/export** format.

When downloading mbox files from postgresql.org, we have to reinvent the wheel
ourselves to transform mbox into JSON.
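For illustration, here is a minimal sketch of that conversion step using
Python's stdlib `mailbox` module. The output field names are placeholders of
mine, not sr.ht's actual import/export format:

```python
# Hypothetical sketch: flatten a monthly mbox dump into JSON-ready dicts,
# one per message. Field names here are placeholders of mine and do NOT
# match sr.ht's real import/export format.
import mailbox

def mbox_to_json(path):
    """Return a list of dicts, one per message in the mbox file at `path`."""
    out = []
    for msg in mailbox.mbox(path):
        out.append({
            "message_id": (msg.get("Message-ID") or "").strip(),
            "in_reply_to": (msg.get("In-Reply-To") or "").strip(),
            "subject": msg.get("Subject") or "",
            "from": msg.get("From") or "",
            "date": msg.get("Date") or "",
        })
    return out
```

Feeding a monthly mbox through `mbox_to_json()` and then `json.dumps()` gives
a flat message list that still has to be regrouped into threads afterwards.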

Note that in production, the bug tracker relies on a mailing list managed by
sr.ht. Each mail is parsed and stored in pgsql.

> > > But if you're building your own threading on it, then the monthly mbox
> > > files at https://www.postgresql.org/list/pgsql-bugs/ should be enough?  
> >
> > Yes, we already got them to start poking around. We have a small
> > python script processing them, but the mbox format and/or the Python lib
> > and/or the email format are a bit loose and we currently have 3k orphan
> > emails out of 13697 threads.
> 
> Oh, there is a lot of weirdness in the email archives, particularly in
> history (it's gotten a bit better, but we still see really weird mime
> combinations fairly often). And there have been many crappy
> implementations of mbox over the years as well, which has led to a lot
> of problems of imports :/

Indeed. Anyway, my colleague's script already sorts out most of these
issues. Good enough for now.
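For the curious, the threading step looks roughly like this. It is a
simplified sketch of my own, not the actual script; the dict keys are
assumptions:

```python
# Simplified threading sketch (not our actual script). Each message dict
# is assumed to carry "message_id", "in_reply_to" and a "references" list.
def build_threads(messages):
    """Group messages into threads; return (threads, orphans).

    threads maps a root message-id to the list of member ids; orphans
    lists messages whose named parent never appears in the corpus."""
    by_id = {m["message_id"]: m for m in messages if m["message_id"]}

    def root_of(m, seen):
        # Prefer In-Reply-To, fall back on the last References entry.
        parent = m.get("in_reply_to") or (m.get("references") or [None])[-1]
        if parent in by_id and parent not in seen:
            seen.add(parent)
            return root_of(by_id[parent], seen)
        # Broken link: a parent was named but is missing from the corpus.
        return m["message_id"], parent is not None and parent not in by_id

    threads, orphans = {}, []
    for m in by_id.values():
        root, broken = root_of(m, set())
        threads.setdefault(root, []).append(m["message_id"])
        if broken and root == m["message_id"]:
            orphans.append(m["message_id"])
    return threads, orphans
```

A message whose named parent never shows up in the corpus ends up as an
orphan root, which is where our ~3k figure comes from.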

> So the root question there is, why are we extracting more structured
> data into a format that we know is worse?

The root question was me asking whether a database dump or access to some JSON
API would be possible. I should have explained up front that this was to
extract the data as JSON directly from there.

My bad, really. I hope the whole picture is clearer now.

> > BTW, we found some orphan emails in the pgarchiver UI as well that might be
> > fixed if you are interested. The In-Reply-To field is malformed but a
> > message-id is still available there, eg:
> > https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.
> 
> I'm not sure we want to go down the route of manually editing
> messages. It would work for a message like this from 1999 because
> that's before DKIM which would prevent us from doing it at all. But
> either way the archives should represent what things actually looked
> like as much as possible. And from an archives perspective that is not
> an orphaned thread, that is a single message sent on its own thread
> (and we have plenty of those in general).

Sure.
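That said, on the import side we can still salvage such messages ourselves: we
just pull the first angle-bracketed token that looks like a Message-ID out of
the malformed header. A hedged sketch (the regex and function name are mine,
not from any existing tool):

```python
# Hypothetical salvage helper: pull the first angle-bracketed token that
# looks like a Message-ID out of a malformed In-Reply-To header. The
# regex and the function name are illustrative, not from any real tool.
import re

MSGID_RE = re.compile(r"<[^<>\s]+@[^<>\s]+>")

def salvage_parent(in_reply_to):
    """Return the first Message-ID-looking token, or None."""
    match = MSGID_RE.search(in_reply_to or "")
    return match.group(0) if match else None
```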

> > Without any better solution, maybe our current method is "good enough" for a
> > simple PoC. We could tighten/rewrite this part of the procedure in a second
> > round if it's worth it.
> 
> Probably.
> 
> But if you are somehow crawling the per-thread mbox urls please make
> sure you rate limit yourself severely. They're really not meant to be
> API endpoints...

As far as I know, we now have enough data to move ahead and should not need to
crawl again soon. We will add rate limiting if needed in the future, but I hope
we will not have to deal with mbox files anymore.
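For the record, if we ever do have to crawl again, I have something like this
throttled loop in mind. The delay value is an arbitrary assumption of mine,
not a documented postgresql.org limit:

```python
# Throttled fetch loop sketch. The 5-second default delay is an arbitrary
# assumption of mine, not a documented postgresql.org rate limit.
import time
import urllib.request

def fetch_all(urls, delay_seconds=5.0):
    """Fetch each URL sequentially, sleeping between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay_seconds)  # throttle between consecutive requests
        with urllib.request.urlopen(url) as resp:
            results[url] = resp.read()
    return results
```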

Thanks!


