Hello Magnus,
On Wed, 16 Dec 2020 15:02:03 +0100
Magnus Hagander <magnus@hagander.net> wrote:
> On Wed, Dec 16, 2020 at 2:57 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
[...]
> > However, maybe some admins would agree to provide some pgsql dump or access
> > to some json API if relevant? We would save some time and CPU :)
>
> There are mbox files available for download from the list archives --
> would that work for you? It can be done on a per-thread basis as well,
> i guess, but that's not something we have now (that is, we don't have
> a unique listing of threads).
The srht import API processes one JSON document per thread. That's why we are
trying to gather one mbox per thread.
> But if you're building your own threading on it, then the monthly mbox files
> at https://www.postgresql.org/list/pgsql-bugs/ should be enough?
Yes, we already fetched them to start poking around. We have a small
Python script processing them, but the mbox format and/or the Python lib and/or
the email format are a bit loose, and we currently have 3k orphan emails out of
13697 threads.
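
To make the threading problem concrete, here is a minimal sketch of that kind
of grouping. It is not our actual script, only an illustration under some
assumptions: it relies solely on the Message-ID, In-Reply-To and References
headers, and the names parent_id, find_root, group_threads and load_mbox are
made up for this example. An "orphan" here is a message whose declared parent
is not in the archive.

```python
import email
import mailbox
import re
from collections import defaultdict

# Anything that looks like <id> inside a header value.
MSGID_RE = re.compile(r"<([^<>\s]+)>")


def parent_id(msg):
    """Return the most likely parent Message-ID, or None for a thread root."""
    in_reply_to = MSGID_RE.findall(str(msg.get("In-Reply-To", "")))
    if in_reply_to:
        return in_reply_to[0]
    # Fall back to References when In-Reply-To is missing or malformed;
    # its last entry is normally the direct parent.
    references = MSGID_RE.findall(str(msg.get("References", "")))
    return references[-1] if references else None


def find_root(mid, parents):
    """Walk up the parent chain; return (root_id, chain_is_complete)."""
    seen = set()
    while mid not in seen:
        seen.add(mid)
        parent = parents.get(mid)
        if parent is None:
            return mid, True    # reached a genuine thread root
        if parent not in parents:
            return mid, False   # parent is missing from the archive
        mid = parent
    return mid, False           # reference cycle: treat as broken


def group_threads(messages):
    """Group messages into threads; return (threads, orphans)."""
    parents = {}
    for msg in messages:
        ids = MSGID_RE.findall(str(msg.get("Message-ID", "")))
        if ids:
            parents[ids[0]] = parent_id(msg)

    threads = defaultdict(list)
    orphans = set()
    for mid in parents:
        root, complete = find_root(mid, parents)
        threads[root].append(mid)
        if not complete:
            orphans.add(root)
    return dict(threads), sorted(orphans)


def load_mbox(path):
    """Iterate over the messages of one mbox file."""
    return iter(mailbox.mbox(path))


if __name__ == "__main__":
    demo = [
        email.message_from_string("Message-ID: <a@x>\n\nroot"),
        email.message_from_string(
            "Message-ID: <b@x>\nIn-Reply-To: <missing@x>\n\nreply"),
    ]
    print(group_threads(demo))
```

Feeding it several monthly files at once (for instance by chaining load_mbox
calls with itertools.chain) should reduce the orphan count, since a reply and
its parent often land in different months.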
BTW, we found some orphan emails in the pgarchiver UI as well that could be
fixed if you are interested. The In-Reply-To field is malformed but a message-id
is still available there, e.g.: https://postgr.es/m/4454.935677480%40sss.pgh.pa.us.
Without any better solution, maybe our current method is "good enough" for a
simple PoC. We could tighten/rewrite this part of the procedure in a second
round if it is worth it.
Thanks!