Thread: pgsql-bugs mailing list dump?

pgsql-bugs mailing list dump?

From
Jehan-Guillaume de Rorthais
Date:
Hi guys,

Having a sourcehut[1] Debian packager as a colleague, and being a small team of
people interested in this project, we would like to try to build an instance
and import pgsql-bugs into it. We would then report back to the community and
expose it if anything seems worthy.

We don't want to start a long discussion about srht right now and consume
precious contributors' time before having a real PoC. If you are interested in
feedback, whatever the PoC result turns out to be, feel free to raise your hand.

In the meantime, we are currently considering how to gather the pgsql-bugs
mailing list history, thread-by-thread. We could write an HTTP crawler looking
for HTTP Location redirections of flat pages and gather all threads afterwards.

However, maybe some admins would agree to provide a pgsql dump or access to
some JSON API, if relevant? We would save some time and CPU :)
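For what it's worth, the redirect-chasing part could be sketched like this, with a plain dict standing in for the HTTP client; the postgr.es and flat-page URLs below are only illustrative:

```python
def follow_redirects(url, fetch, limit=5):
    """Chase HTTP 'Location' redirects until a final page is reached.
    `fetch` maps a URL to a (status, location) pair; in a real crawler it
    would wrap an HTTP client configured NOT to follow redirects itself."""
    for _ in range(limit):
        status, location = fetch(url)
        if status in (301, 302, 303, 307, 308) and location:
            url = location  # keep following the redirect chain
        else:
            return url
    raise RuntimeError("too many redirects for %s" % url)

# Toy 'archive': a short message link redirects to its flat thread page.
archive = {
    "https://postgr.es/m/abc%40example.com":
        (302, "https://www.postgresql.org/message-id/flat/abc%40example.com"),
    "https://www.postgresql.org/message-id/flat/abc%40example.com":
        (200, None),
}
flat_url = follow_redirects("https://postgr.es/m/abc%40example.com",
                            archive.__getitem__)
```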

Thanks!

[1] https://sourcehut.org/



Re: pgsql-bugs mailing list dump?

From
Magnus Hagander
Date: Wed, 16 Dec 2020 15:02:03 +0100
On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> Hi guys,
>
> Having a sourcehut[1] Debian packager as a colleague, and being a small team of
> people interested in this project, we would like to try to build an instance
> and import pgsql-bugs into it. We would then report back to the community and
> expose it if anything seems worthy.
>
> We don't want to start a long discussion about srht right now and consume
> precious contributors' time before having a real PoC. If you are interested in
> feedback, whatever the PoC result turns out to be, feel free to raise your hand.
>
> In the meantime, we are currently considering how to gather the pgsql-bugs
> mailing list history, thread-by-thread. We could write an HTTP crawler looking
> for HTTP Location redirections of flat pages and gather all threads afterwards.
>
> However, maybe some admins would agree to provide a pgsql dump or access to
> some JSON API, if relevant? We would save some time and CPU :)

There are mbox files available for download from the list archives --
would that work for you? It can be done on a per-thread basis as well,
I guess, but that's not something we have now (that is, we don't have
a unique listing of threads). But if you're building your own
threading on it, then the monthly mbox files at
https://www.postgresql.org/list/pgsql-bugs/ should be enough?
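If it helps, do-it-yourself threading over such mbox files boils down to walking In-Reply-To links back to a root. A minimal sketch (not the archive's own threading code), with each message reduced to a Message-ID -> In-Reply-To mapping:

```python
def thread_roots(messages):
    """Map each Message-ID to the root of its thread.
    `messages` maps Message-ID -> In-Reply-To value (None for a genuine
    thread starter). A message whose parent is absent from the batch
    becomes a root itself -- i.e. it shows up as an orphan."""
    roots = {}

    def resolve(mid, seen=()):
        if mid in roots:
            return roots[mid]
        parent = messages.get(mid)
        if parent is None or parent not in messages or parent in seen:
            roots[mid] = mid  # no reachable parent: starts its own thread
        else:
            roots[mid] = resolve(parent, seen + (mid,))
        return roots[mid]

    for mid in messages:
        resolve(mid)
    return roots

msgs = {"<a@x>": None, "<b@x>": "<a@x>", "<c@x>": "<b@x>",
        "<d@x>": "<lost@x>"}  # <d@x> replies to a message we never saw
roots = thread_roots(msgs)
```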

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: pgsql-bugs mailing list dump?

From
Jehan-Guillaume de Rorthais
Date:
Hello Magnus,

On Wed, 16 Dec 2020 15:02:03 +0100
Magnus Hagander <magnus@hagander.net> wrote:

> On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
[...]
> > However, maybe some admins would agree to provide a pgsql dump or access
> > to some JSON API, if relevant? We would save some time and CPU :)
> 
> There are mbox files available for download from the list archives --
> would that work for you? It can be done on a per-thread basis as well,
> I guess, but that's not something we have now (that is, we don't have
> a unique listing of threads).

The srht import API processes one JSON document per thread. That's why we are
trying to gather one mbox per thread.

> But if you're building your own threading on it, then the monthly mbox files
> at https://www.postgresql.org/list/pgsql-bugs/ should be enough?

Yes, we already got them to start poking around. We have a small
Python script processing them, but the mbox format and/or the Python lib
and/or the email format are a bit loose, and we currently have 3k orphaned
emails out of 13697 threads.

BTW, we found some orphaned emails in the pgarchiver UI as well that might be
fixed if you are interested. The In-Reply-To field is malformed but a
message-id is still available there, e.g.:
https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.
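For messages like that one, a heuristic such as the following can often recover a usable id from the broken header (a guess at what a fixer could do, not what pgarchiver actually does):

```python
import re

def recover_message_id(in_reply_to):
    """Pull something shaped like a Message-ID out of a malformed
    In-Reply-To value. Some old MUAs buried the id in free text, e.g.
    'Your message about bugs <4454.935677480@sss.pgh.pa.us>'."""
    if not in_reply_to:
        return None
    match = re.search(r"<([^<>\s]+@[^<>\s]+)>", in_reply_to)
    return "<%s>" % match.group(1) if match else None

mid = recover_message_id(
    "Your message about bugs <4454.935677480@sss.pgh.pa.us>")
```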

Without any better solution, maybe our current method is "good enough" for a
simple PoC. We could tighten/rewrite this part of the procedure in a second
round if it's worth it.

Thanks!



Re: pgsql-bugs mailing list dump?

From
Magnus Hagander
Date:
On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> Hello Magnus,
>
> On Wed, 16 Dec 2020 15:02:03 +0100
> Magnus Hagander <magnus@hagander.net> wrote:
>
> > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:
> [...]
> > > However, maybe some admins would agree to provide a pgsql dump or access
> > > to some JSON API, if relevant? We would save some time and CPU :)
> >
> > There are mbox files available for download from the list archives --
> > would that work for you? It can be done on a per-thread basis as well,
> > I guess, but that's not something we have now (that is, we don't have
> > a unique listing of threads).
>
> The srht import API processes one JSON document per thread. That's why we are
> trying to gather one mbox per thread.

There must be something I'm missing here, because that sounds... insane?

Basically they take a raw mbox and wrap it in json? Just to make it
less efficient?

And they specifically need the "outside" to have done the one thing
that's actually hard, namely threading?

What are they actually trying to accomplish here?


> > But if you're building your own threading on it, then the monthly mbox files
> > at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
>
> > Yes, we already got them to start poking around. We have a small
> > Python script processing them, but the mbox format and/or the Python lib
> > and/or the email format are a bit loose, and we currently have 3k orphaned
> > emails out of 13697 threads.

Oh, there is a lot of weirdness in the email archives, particularly in
the history (it's gotten a bit better, but we still see really weird MIME
combinations fairly often). And there have been many crappy
implementations of mbox over the years as well, which has led to a lot
of problems with imports :/

So the root question there is, why are we extracting more structured
data into a format that we know is worse?


> BTW, we found some orphaned emails in the pgarchiver UI as well that might be
> fixed if you are interested. The In-Reply-To field is malformed but a
> message-id is still available there, e.g.:
> https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.

I'm not sure we want to go down the route of manually editing
messages. It would work for a message like this one from 1999 because
that's from before DKIM, which would otherwise prevent us from doing it
at all. But either way the archives should represent what things actually
looked like as much as possible. And from an archives perspective that is
not an orphaned thread, that is a single message sent on its own thread
(and we have plenty of those in general).



> Without any better solution, maybe our current method is "good enough" for a
> simple PoC. We could tighten/rewrite this part of the procedure in a second
> round if it's worth it.

Probably.

But if you are somehow crawling the per-thread mbox urls please make
sure you rate limit yourself severely. They're really not meant to be
API endpoints...
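Rate limiting such a crawler can be as simple as a fixed pause between requests; a minimal sketch, where `fetch` and the delay value are placeholders:

```python
import time

def polite_fetch_all(urls, fetch, delay=5.0):
    """Fetch archive URLs one by one with a fixed pause between requests,
    so per-thread pages are not hammered as if they were API endpoints.
    `fetch` is any callable taking a URL, e.g. a urllib wrapper."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the very first request
            time.sleep(delay)
        pages.append(fetch(url))
    return pages

pages = polite_fetch_all(["u1", "u2", "u3"],
                         fetch=lambda u: "page:" + u,
                         delay=0.01)
```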

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: pgsql-bugs mailing list dump?

From
Jehan-Guillaume de Rorthais
Date:
On Tue, 22 Dec 2020 11:11:10 +0100
Magnus Hagander <magnus@hagander.net> wrote:

> On Wed, Dec 16, 2020 at 3:53 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> >
> > Hello Magnus,
> >
> > On Wed, 16 Dec 2020 15:02:03 +0100
> > Magnus Hagander <magnus@hagander.net> wrote:
> >  
> > > On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> > > <jgdr@dalibo.com> wrote:  
> > [...]  
> > > > However, maybe some admins would agree to provide a pgsql dump or
> > > > access to some JSON API, if relevant? We would save some time and
> > > > CPU :)
> > >
> > > There are mbox files available for download from the list archives --
> > > would that work for you? It can be done on a per-thread basis as well,
> > > I guess, but that's not something we have now (that is, we don't have
> > > a unique listing of threads).  
> >
> > The srht import API processes one JSON document per thread. That's why we
> > are trying to gather one mbox per thread.
> 
> There must be something I'm missing here, because that sounds... insane?
> 
> Basically they take a raw mbox and wrap it in json? Just to make it
> less efficient?
> 
> And they specifically need the "outside" to have done the one thing
> that's actually hard, namely threading?
> 
> What are they actually trying to accomplish here?

This would be perfectly insane and crazy :) Such a story would be a dead end
right from the start.

No, the sr.ht import script accepts a pure JSON doc *only*. They do not
require you to wrap mbox in JSON. The whole thread must be in JSON, following
their **import/export** format.

When downloading mbox files from postgresql.org, we have to do the conversion
ourselves, transforming mbox into JSON.

Note that in production, the bug tracker relies on a mailing list managed by
sr.ht. Each mail is parsed and stored in pgsql.

> > > But if you're building your own threading on it, then the monthly mbox
> > > files at https://www.postgresql.org/list/pgsql-bugs/ should be enough?  
> >
> > Yes, we already got them to start poking around. We have a small
> > Python script processing them, but the mbox format and/or the Python lib
> > and/or the email format are a bit loose, and we currently have 3k orphaned
> > emails out of 13697 threads.
> 
> Oh, there is a lot of weirdness in the email archives, particularly in
> the history (it's gotten a bit better, but we still see really weird MIME
> combinations fairly often). And there have been many crappy
> implementations of mbox over the years as well, which has led to a lot
> of problems with imports :/

Indeed. But anyway, my colleague's script is already able to work around most
of the problems. Good enough for now.

> So the root question there is, why are we extracting more structured
> data into a format that we know is worse?

The root question was me asking if a database dump or access to some JSON API
would somehow be possible. I should have explained up front that this was to
extract the data as JSON from there.

My bad, really. I hope the whole picture is clearer now.

> > BTW, we found some orphaned emails in the pgarchiver UI as well that might
> > be fixed if you are interested. The In-Reply-To field is malformed but a
> > message-id is still available there, e.g.:
> > https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.
> 
> I'm not sure we want to go down the route of manually editing
> messages. It would work for a message like this one from 1999 because
> that's from before DKIM, which would otherwise prevent us from doing it
> at all. But either way the archives should represent what things actually
> looked like as much as possible. And from an archives perspective that is
> not an orphaned thread, that is a single message sent on its own thread
> (and we have plenty of those in general).

Sure.

> > Without any better solution, maybe our current method is "good enough" for
> > a simple PoC. We could tighten/rewrite this part of the procedure in a
> > second round if it's worth it.
> 
> Probably.
> 
> But if you are somehow crawling the per-thread mbox urls please make
> sure you rate limit yourself severely. They're really not meant to be
> API endpoints...

As far as I know, we now have enough data to move ahead. We should not need to
crawl again soon. We will add some rate limiting if needed in the future, but
I hope we will not have to deal with mbox anymore.

Thanks!