Re: pgsql-bugs mailing list dump? - Mailing list pgsql-www
From | Magnus Hagander |
---|---|
Subject | Re: pgsql-bugs mailing list dump? |
Date | |
Msg-id | CABUevEytobwva8k=jNeRWcRJcORqRP-DAkiOdy4fMw7ZS7vpxw@mail.gmail.com |
In response to | Re: pgsql-bugs mailing list dump? (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>) |
Responses | Re: pgsql-bugs mailing list dump? |
List | pgsql-www |
On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>
> Hello Magnus,
>
> On Wed, 16 Dec 2020 15:02:03 +0100
> Magnus Hagander <magnus@hagander.net> wrote:
>
> > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:
> [...]
> > > However, maybe some admins would agree to provide some pgsql dump or access
> > > to some json API if relevant? We would save some time and CPU :)
> >
> > There are mbox files available for download from the list archives --
> > would that work for you? It can be done on a per-thread basis as well,
> > i guess, but that's not something we have now (that is, we don't have
> > a unique listing of threads).
>
> The srht import API process one JSON documents per thread. That's why we try to
> gather one mbox per thread.

There must be something I'm missing here, because that sounds.. Insane?

Basically they take a raw mbox and wrap it in json? Just to make it less efficient? And they specifically need the "outside" to have done the one thing that's actually hard, namely threading?

What are they actually trying to accomplish here?

> > But if you're building your own threading on it, then the monthly mbox files
> > at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
>
> Yes, we already got them to start pocking around. We have a small
> python script processing them but mbox format and/or python lib and/or email
> format are a bit loose and we currently have 3k orphans emails out of 13697
> threads.

Oh, there is a lot of weirdness in the email archives, particularly in the older history (it's gotten a bit better, but we still see really weird mime combinations fairly often). And there have been many crappy implementations of mbox over the years as well, which has led to a lot of problems with imports :/

So the root question there is, why are we extracting more structured data into a format that we know is worse?

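As a rough illustration of what "building your own threading" on the monthly mbox files can look like, here is a minimal sketch using Python's standard `mailbox` module. It is not the script discussed in this thread: the function names (`extract_ids`, `thread_roots`, `threads_from_mbox`) and the file path argument are made up for the example. It follows In-Reply-To, falls back to the last References entry when In-Reply-To has no usable `<...>` token, and treats a message whose parent is not in the set as the start of its own thread.

```python
import mailbox
import re
from collections import defaultdict

# A message-id token is whatever sits between angle brackets.
MSGID_RE = re.compile(r"<([^<>\s]+)>")


def extract_ids(header_value):
    """Return every <message-id> token found in a header value."""
    # str() guards against Header objects returned for non-ASCII headers.
    return MSGID_RE.findall(str(header_value or ""))


def thread_roots(messages):
    """Map each message-id to the message-id of its thread root.

    `messages` is any iterable of objects with a .get(header_name) method,
    such as the messages yielded by mailbox.mbox.
    """
    parent = {}
    for msg in messages:
        ids = extract_ids(msg.get("Message-ID"))
        if not ids:
            continue  # nothing to key on, skip the message
        # Prefer In-Reply-To; fall back to the last References entry.
        candidates = (extract_ids(msg.get("In-Reply-To"))
                      or extract_ids(msg.get("References"))[-1:])
        parent[ids[0]] = candidates[0] if candidates else None

    def root_of(msgid):
        seen = {msgid}
        while True:
            up = parent.get(msgid)
            # Stop at a real root, a missing parent, or a reference loop.
            if up is None or up not in parent or up in seen:
                return msgid
            seen.add(up)
            msgid = up

    return {msgid: root_of(msgid) for msgid in parent}


def threads_from_mbox(path):
    """Group the message-ids of one mbox file by thread root."""
    roots = thread_roots(mailbox.mbox(path, create=False))
    threads = defaultdict(list)
    for msgid, root in roots.items():
        threads[root].append(msgid)
    return dict(threads)
```

A reply whose parent lives in a different monthly file shows up as its own root here, so threading across files would mean feeding all the files into one `thread_roots` call.
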
> BTW, we found some orphans emails in pgarchiver UI as well that might be fixed
> if you are interested. The in-reply-to field is malformed but a message-id is
> still available there, eg: https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.

I'm not sure we want to go down the route of manually editing messages. It would work for a message like this from 1999 because that's before DKIM, which would otherwise prevent us from doing it at all. But either way the archives should represent what things actually looked like as much as possible. And from an archives perspective that is not an orphaned thread, that is a single message sent on its own thread (and we have plenty of those in general).

> Without any better solution, maybe our current method is "good enough" for a
> simple PoC. We could tighten/rewrite this part of the procedure in a second
> round if it worth it.

Probably. But if you are somehow crawling the per-thread mbox urls please make sure you rate limit yourself severely. They're really not meant to be API endpoints...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
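
For anyone crawling anyway, a minimal sketch of client-side rate limiting in Python, using only the standard library. The `Throttle` class, the `fetch_all` helper and the interval you pass in are assumptions made for this example, not anything the archives publish; the caller supplies the URLs, since no URL scheme for per-thread mboxes is given here.

```python
import time
import urllib.request


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, throttle, opener=urllib.request.urlopen):
    """Fetch each URL in order, one at a time, pausing between requests."""
    bodies = []
    for url in urls:
        throttle.wait()
        with opener(url, timeout=30) as resp:
            bodies.append(resp.read())
    return bodies
```

Something like `fetch_all(urls, Throttle(5.0))` would then issue at most one request every five seconds; the right interval is a judgment call, and erring on the slow side fits the request above.
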