Thread: pgsql-bugs mailing list dump?
Hi guys,

Having a sourcehut[1] Debian packager as a colleague and being a small team
of people interested in this project, we would like to try to build an
instance and import pgsql-bugs into it. We would then report back to the
community and expose the instance if anything seems worthwhile.

We don't want to start a long discussion about srht right now and consume
precious contributor time before having a real PoC. If you are interested in
feedback, whatever the PoC result turns out to be, feel free to raise your
hand.

In the meantime, we are considering how to gather the pgsql-bugs mailing
list history, thread by thread. We could write an HTTP crawler that follows
the HTTP Location redirections of the flat pages and gathers all threads
afterwards.

However, maybe some admins would agree to provide a pgsql dump or access to
some JSON API, if relevant? We would save some time and CPU :)

Thanks!

[1] https://sourcehut.org/
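PS: for illustration, the crawler idea boils down to something like the
sketch below. The URL pattern and the redirect behavior are our working
assumptions, nothing verified, and the message-id is only an example.

    # Sketch only: request the flat page of one message and record where
    # the Location redirection finally lands, as a thread identifier.
    import urllib.request

    def thread_url(message_id):
        """Return the URL the archives redirect a flat page to."""
        url = "https://www.postgresql.org/message-id/flat/" + message_id
        with urllib.request.urlopen(url) as resp:
            return resp.url  # urlopen follows redirects for us

    print(thread_url("4454.935677480@sss.pgh.pa.us"))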
On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> Hi guys,
>
> Having a sourcehut[1] Debian packager as a colleague and being a small
> team of people interested in this project, we would like to try to build
> an instance and import pgsql-bugs into it. We would then report back to
> the community and expose the instance if anything seems worthwhile.
>
> We don't want to start a long discussion about srht right now and consume
> precious contributor time before having a real PoC. If you are interested
> in feedback, whatever the PoC result turns out to be, feel free to raise
> your hand.
>
> In the meantime, we are considering how to gather the pgsql-bugs mailing
> list history, thread by thread. We could write an HTTP crawler that
> follows the HTTP Location redirections of the flat pages and gathers all
> threads afterwards.
>
> However, maybe some admins would agree to provide a pgsql dump or access
> to some JSON API, if relevant? We would save some time and CPU :)

There are mbox files available for download from the list archives -- would
that work for you? It can be done on a per-thread basis as well, I guess,
but that's not something we have now (that is, we don't have a unique
listing of threads).

But if you're building your own threading on it, then the monthly mbox files
at https://www.postgresql.org/list/pgsql-bugs/ should be enough?

-- 
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hello Magnus,

On Wed, 16 Dec 2020 15:02:03 +0100
Magnus Hagander <magnus@hagander.net> wrote:

> On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
[...]
> > However, maybe some admins would agree to provide a pgsql dump or access
> > to some JSON API, if relevant? We would save some time and CPU :)
>
> There are mbox files available for download from the list archives --
> would that work for you? It can be done on a per-thread basis as well,
> I guess, but that's not something we have now (that is, we don't have
> a unique listing of threads).

The srht import API processes one JSON document per thread. That's why we
are trying to gather one mbox per thread.

> But if you're building your own threading on it, then the monthly mbox
> files at https://www.postgresql.org/list/pgsql-bugs/ should be enough?

Yes, we already fetched them to start poking around. We have a small Python
script processing them, but the mbox format and/or the Python lib and/or the
email format are a bit loose, and we currently have 3k orphan emails out of
13697 threads.

BTW, we found some orphan emails in the pgarchiver UI as well that might be
fixed, if you are interested. The In-Reply-To field is malformed, but a
message-id is still available there, eg:
https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.

Without any better solution, maybe our current method is "good enough" for a
simple PoC. We could tighten/rewrite this part of the procedure in a second
round if it is worth it.

Thanks!
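PS: the threading part boils down to something like the sketch below. This
is not our actual script, and the orphan handling is naive:

    # Sketch: group messages by Message-Id / In-Reply-To / References.
    # Messages whose parent never appears in the mbox end up as orphans.
    import mailbox
    import re

    MSGID_RE = re.compile(r"<[^>]+>")

    def build_threads(path):
        parent = {}  # message-id -> parent message-id (or None)
        for msg in mailbox.mbox(path):
            msgid = (msg["Message-Id"] or "").strip()
            # In-Reply-To is sometimes malformed; fall back to References,
            # whose last entry is normally the immediate parent.
            raw = msg.get("In-Reply-To") or msg.get("References") or ""
            refs = MSGID_RE.findall(raw)
            parent[msgid] = refs[-1] if refs else None

        def root(mid):
            seen = set()  # guard against reference cycles
            while parent.get(mid) and mid not in seen:
                seen.add(mid)
                mid = parent[mid]
            return mid

        threads = {}  # root message-id -> list of message-ids
        for mid in parent:
            threads.setdefault(root(mid), []).append(mid)
        return threads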
On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> Hello Magnus,
>
> On Wed, 16 Dec 2020 15:02:03 +0100
> Magnus Hagander <magnus@hagander.net> wrote:
>
> > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:
> [...]
> > > However, maybe some admins would agree to provide a pgsql dump or
> > > access to some JSON API, if relevant? We would save some time and
> > > CPU :)
> >
> > There are mbox files available for download from the list archives --
> > would that work for you? It can be done on a per-thread basis as well,
> > I guess, but that's not something we have now (that is, we don't have
> > a unique listing of threads).
>
> The srht import API processes one JSON document per thread. That's why we
> are trying to gather one mbox per thread.

There must be something I'm missing here, because that sounds... insane?

Basically they take a raw mbox and wrap it in JSON? Just to make it less
efficient?

And they specifically need the "outside" to have done the one thing that's
actually hard, namely threading?

What are they actually trying to accomplish here?

> > But if you're building your own threading on it, then the monthly mbox
> > files at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
>
> Yes, we already fetched them to start poking around. We have a small
> Python script processing them, but the mbox format and/or the Python lib
> and/or the email format are a bit loose, and we currently have 3k orphan
> emails out of 13697 threads.

Oh, there is a lot of weirdness in the email archives, particularly in the
history (it's gotten a bit better, but we still see really weird MIME
combinations fairly often). And there have been many crappy implementations
of mbox over the years as well, which has led to a lot of import problems :/

So the root question there is: why are we extracting more structured data
into a format that we know is worse?

> BTW, we found some orphan emails in the pgarchiver UI as well that might
> be fixed, if you are interested. The In-Reply-To field is malformed, but a
> message-id is still available there, eg:
> https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.

I'm not sure we want to go down the route of manually editing messages. It
would work for a message like this one from 1999, because it predates DKIM,
which would otherwise prevent us from doing it at all. But either way, the
archives should represent what things actually looked like as much as
possible. And from an archives perspective that is not an orphaned thread;
it is a single message sent on its own thread (and we have plenty of those
in general).

> Without any better solution, maybe our current method is "good enough" for
> a simple PoC. We could tighten/rewrite this part of the procedure in a
> second round if it is worth it.

Probably.

But if you are somehow crawling the per-thread mbox URLs, please make sure
you rate limit yourself severely. They're really not meant to be API
endpoints...

-- 
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
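[Illustration: a minimal client-side sketch of what "rate limit yourself
severely" could mean in practice; the delay figure is only an example.]

    # Sketch: fetch archive URLs with a hard self-imposed delay between
    # requests. The one-request-per-five-seconds figure is only an example.
    import time
    import urllib.request

    def polite_fetch(urls, delay=5.0):
        # Generator: callers iterate `for url, data in polite_fetch(urls)`.
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                yield url, resp.read()
            time.sleep(delay)  # never hammer the archives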
On Tue, 22 Dec 2020 11:11:10 +0100
Magnus Hagander <magnus@hagander.net> wrote:

> On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> >
> > Hello Magnus,
> >
> > On Wed, 16 Dec 2020 15:02:03 +0100
> > Magnus Hagander <magnus@hagander.net> wrote:
> >
> > > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > > <jgdr@dalibo.com> wrote:
> > [...]
> > > > However, maybe some admins would agree to provide a pgsql dump or
> > > > access to some JSON API, if relevant? We would save some time and
> > > > CPU :)
> > >
> > > There are mbox files available for download from the list archives --
> > > would that work for you? It can be done on a per-thread basis as well,
> > > I guess, but that's not something we have now (that is, we don't have
> > > a unique listing of threads).
> >
> > The srht import API processes one JSON document per thread. That's why
> > we are trying to gather one mbox per thread.
>
> There must be something I'm missing here, because that sounds... insane?
>
> Basically they take a raw mbox and wrap it in JSON? Just to make it
> less efficient?
>
> And they specifically need the "outside" to have done the one thing
> that's actually hard, namely threading?
>
> What are they actually trying to accomplish here?

This would be perfectly insane and crazy :) Such a story would be a dead end
right from the start.

No, the sr.ht import script accepts a pure JSON doc *only*. They do not
require you to wrap an mbox in JSON. The whole thread must be in JSON,
following their **import/export** format. When downloading mbox files from
postgresql.org, we have to write the conversion from mbox to JSON ourselves.

Note that in production, the bug tracker relies on a mailing list managed by
sr.ht. Each mail is parsed and stored in pgsql.

> > > But if you're building your own threading on it, then the monthly mbox
> > > files at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
> >
> > Yes, we already fetched them to start poking around. We have a small
> > Python script processing them, but the mbox format and/or the Python lib
> > and/or the email format are a bit loose, and we currently have 3k orphan
> > emails out of 13697 threads.
>
> Oh, there is a lot of weirdness in the email archives, particularly in the
> history (it's gotten a bit better, but we still see really weird MIME
> combinations fairly often). And there have been many crappy
> implementations of mbox over the years as well, which has led to a lot of
> import problems :/

Indeed. But anyway, my colleague's script is already able to sort out most
of the trouble. Good enough for now.

> So the root question there is: why are we extracting more structured data
> into a format that we know is worse?

The root question was me asking if a database dump or access to some JSON
API would somehow be possible. I should have quickly explained this was to
extract the data as JSON from there. My bad, really. I hope the whole
picture is clearer now.

> > BTW, we found some orphan emails in the pgarchiver UI as well that might
> > be fixed, if you are interested. The In-Reply-To field is malformed, but
> > a message-id is still available there, eg:
> > https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.
>
> I'm not sure we want to go down the route of manually editing messages. It
> would work for a message like this one from 1999, because it predates
> DKIM, which would otherwise prevent us from doing it at all. But either
> way, the archives should represent what things actually looked like as
> much as possible. And from an archives perspective that is not an orphaned
> thread; it is a single message sent on its own thread (and we have plenty
> of those in general).

Sure.

> > Without any better solution, maybe our current method is "good enough"
> > for a simple PoC. We could tighten/rewrite this part of the procedure in
> > a second round if it is worth it.
>
> Probably.
>
> But if you are somehow crawling the per-thread mbox URLs, please make sure
> you rate limit yourself severely. They're really not meant to be API
> endpoints...

As far as I know, we now have enough data to move ahead. We should not need
to crawl again soon. We will add some rate limiting if needed in the future,
but I hope we will not have to deal with mbox files anymore.

Thanks!
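PS: for completeness, the mbox-to-JSON step reduces to something like the
sketch below. The keys are purely hypothetical placeholders, *not* the
actual sr.ht import/export schema:

    # Sketch: serialize one thread (a list of email.message.Message
    # objects, oldest first) as one JSON document. All keys below are
    # hypothetical placeholders, not sr.ht's real import/export schema.
    import json

    def body_text(msg):
        # Naive body extraction: first text/plain part, attachments ignored.
        if msg.is_multipart():
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    return part.get_payload(decode=True).decode(errors="replace")
            return ""
        return msg.get_payload()

    def thread_to_json(messages):
        return json.dumps({
            "subject": messages[0]["Subject"],
            "messages": [
                {
                    "message_id": m["Message-Id"],
                    "from": m["From"],
                    "date": m["Date"],
                    "body": body_text(m),
                }
                for m in messages
            ],
        })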