Re: pgsql-bugs mailing list dump? - Mailing list pgsql-www
| From | Jehan-Guillaume de Rorthais |
|---|---|
| Subject | Re: pgsql-bugs mailing list dump? |
| Date | |
| Msg-id | 20201223235400.226428d8@firost |
| In response to | Re: pgsql-bugs mailing list dump? (Magnus Hagander <magnus@hagander.net>) |
| List | pgsql-www |
On Tue, 22 Dec 2020 11:11:10 +0100
Magnus Hagander <magnus@hagander.net> wrote:

> On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> >
> > Hello Magnus,
> >
> > On Wed, 16 Dec 2020 15:02:03 +0100
> > Magnus Hagander <magnus@hagander.net> wrote:
> >
> > > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > > <jgdr@dalibo.com> wrote:
> > [...]
> > > > However, maybe some admins would agree to provide some pgsql dump or
> > > > access to some json API if relevant? We would save some time and
> > > > CPU :)
> > >
> > > There are mbox files available for download from the list archives --
> > > would that work for you? It can be done on a per-thread basis as well,
> > > I guess, but that's not something we have now (that is, we don't have
> > > a unique listing of threads).
> >
> > The srht import API processes one JSON document per thread. That's why
> > we try to gather one mbox per thread.
>
> There must be something I'm missing here, because that sounds... insane?
>
> Basically they take a raw mbox and wrap it in json? Just to make it
> less efficient?
>
> And they specifically need the "outside" to have done the one thing
> that's actually hard, namely threading?
>
> What are they actually trying to accomplish here?

This would be perfectly insane and crazy :) Such a story would be a dead end
right from the start.

No, the sr.ht import script accepts a pure JSON doc *only*. They do not
require you to wrap an mbox in JSON; the whole thread must be expressed as
JSON following their **import/export** format. When downloading mbox files
from postgresql.org, we have to reinvent the wheel to transform mbox into
JSON (a rough sketch of that transformation is included below).

Note that in production, the bug tracker relies on a mailing list managed by
sr.ht. Each mail is parsed and stored in pgsql.

> > > But if you're building your own threading on it, then the monthly mbox
> > > files at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
> >
> > Yes, we already got them to start poking around. We have a small Python
> > script processing them, but the mbox format and/or the Python lib and/or
> > the email format are a bit loose, and we currently have 3k orphan emails
> > out of 13697 threads.
>
> Oh, there is a lot of weirdness in the email archives, particularly in
> history (it's gotten a bit better, but we still see really weird MIME
> combinations fairly often). And there have been many crappy
> implementations of mbox over the years as well, which has led to a lot
> of problems with imports :/

Indeed. But anyway, my colleague's script is already able to sort out most of
the troubles. Good enough for now.

> So the root question there is, why are we extracting more structured
> data into a format that we know is worse?

The root question was me asking whether a database dump or access to some
JSON API would somehow be possible. I should have quickly explained that this
was to extract data as JSON from there. My bad, really. I hope the whole
picture is clearer now.
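For what it's worth, here is the rough shape of that mbox-to-JSON step. This
is a minimal sketch only: the threading is naive (a parent must be seen
before its children, so any message whose parent sits in another month, or
whose In-Reply-To is broken, falls out as its own root -- exactly where our
3k orphans come from), and the "subject"/"messages" layout is a placeholder,
not the actual sr.ht import/export format:

```python
# Minimal sketch: group the messages of a monthly mbox into threads by
# Message-Id / In-Reply-To / References, then dump one JSON document per
# thread. The "subject"/"messages" layout below is a placeholder, NOT the
# real sr.ht import/export format.
import json
import mailbox

def find_root(msg, roots):
    """Map a message to the root of its thread, if we saw the parent."""
    for header in ("In-Reply-To", "References"):
        value = msg.get(header)
        if value:
            # take the last message-id mentioned in the header
            parent = value.split()[-1].strip("<>")
            if parent in roots:
                return roots[parent]
    return None

def build_threads(path):
    threads = {}  # root message-id -> list of messages
    roots = {}    # any message-id -> its thread's root message-id
    for msg in mailbox.mbox(path):
        msgid = (msg.get("Message-Id") or "").strip("<> \n")
        # A message whose parent is unknown (previous month, malformed
        # In-Reply-To, ...) becomes its own root: that's an "orphan".
        root = find_root(msg, roots) or msgid
        roots[msgid] = root
        threads.setdefault(root, []).append(msg)
    return threads

for root, msgs in build_threads("pgsql-bugs.202012").items():
    doc = {
        "subject": msgs[0].get("Subject"),
        "messages": [
            {
                "from": m.get("From"),
                "message_id": m.get("Message-Id"),
                # multipart/MIME handling elided in this sketch
                "body": m.get_payload() if not m.is_multipart() else "",
            }
            for m in msgs
        ],
    }
    with open(root.replace("/", "_") + ".json", "w") as out:
        json.dump(doc, out)
```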
> > BTW, we found some orphan emails in the pgarchiver UI as well that might
> > be fixed if you are interested. The in-reply-to field is malformed but a
> > message-id is still available there, eg:
> > https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.
>
> I'm not sure we want to go down the route of manually editing
> messages. It would work for a message like this from 1999 because
> that's before DKIM which would prevent us from doing it at all. But
> either way the archives should represent what things actually looked
> like as much as possible. And from an archives perspective that is not
> an orphaned thread, that is a single message sent on its own thread
> (and we have plenty of those in general).

Sure.

> > Without any better solution, maybe our current method is "good enough"
> > for a simple PoC. We could tighten/rewrite this part of the procedure in
> > a second round if it's worth it.
>
> Probably.
>
> But if you are somehow crawling the per-thread mbox urls please make
> sure you rate limit yourself severely. They're really not meant to be
> API endpoints...

As far as I know, we now have enough data to move ahead. We should not need
to crawl again soon. We will add some rate limiting if needed in the future,
but I hope we will not have to deal with mbox files anymore.

Thanks!
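PS: should we ever have to hit the per-thread mbox URLs after all, we will
throttle hard. A trivial sketch of what I have in mind (the delay value is
made up, err on the slow side):

```python
# Politely fetch a list of URLs, at most one request every DELAY seconds.
import time
import urllib.request

DELAY = 10  # seconds between requests; deliberately conservative

def fetch_all(urls):
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            yield url, resp.read()
        time.sleep(DELAY)
```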