Re: mailing list archiver chewing patches - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: mailing list archiver chewing patches
Date
Msg-id 9837222c1001120130n67516f0fj317b5656b435f964@mail.gmail.com
Whole thread Raw
In response to Re: mailing list archiver chewing patches  (Dimitri Fontaine <dfontaine@hi-media.com>)
Responses Re: mailing list archiver chewing patches
List pgsql-hackers
On Tue, Jan 12, 2010 at 10:05, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> Dave Page <dpage@pgadmin.org> writes:
>> 2010/1/12 Matteo Beccati <php@beccati.com>:
>>> So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list
archivesreading from the AOX database: 
>>>
>>> http://archives.beccati.org/
>>
>> Seems to work.
>
> Hehe, nice a beginning!

The problem is usually with strange looking emails with 15 different
MIME types. If we can figure out the proper way to render that, the
rest really is just a SMOP.

(BTW, for something to actually be used In Production (TM), we want
something that uses one of our existing frameworks. So don't go
overboard in code-wise implementations on something else - proof of
concept on something else is always ok, of course)


>> So just to put this into perspective and give anyone paying attention
>> an idea of the pain that lies ahead should they decide to work on
>> this:
>>
>> - We need to import the old archives (of which there are hundreds of
>> thousands of messages, the first few years of which have, umm, minimal
>> headers.
>
> Anyone having a local copy of this in his mailboxes? At some point there
> were some NNTP gateway, so maybe there's a copy this way.

We have MBOX files.

IIRC, aox has an import function that can read MBOX files. The
interesting thing is what happens with the really old files that don't
have complete headers.

I don't think you can trust the NNTP gateway now or in the past,
messages are sometimes lost there. The mbox files are as complete as
anything we'll ever get.


>> - We need to generate thread indexes
>
> We have CTEs :)

Right. We still need the threading information, so we have something
to use our CTEs on :-)

But I assume that AOX already does this?


>> - We need to re-generate the original URLs for backwards compatibility
>
> I guess the message-id one ain't the tricky one... and it should be
> possible to fill a relation table like
>  monharc_compat(message_id, list, year, month, message_number);

Yeah. It's not so hard, you can just screen-scrape the current
archives the same way the search server does.


> Then we'd need some help from the webserver (rewrite rules I guess) so
> that the current URL is transformed to call a catch-all script:
>   http://archives.postgresql.org/pgsql-xxx/YYYY-MM/msg01234.php
> -> http://archives.postgresql.org/compat.php?l=xxx&y=YYYY&m=MM&n=01234

Or just a trivial regexp catch in any modern app platform.


> In that compat.php script you then issue the following query or the like
> to get the message_id, then use the newer infrastructure to get to
> display it:
>
>  SELECT message_id
>    FROM monharc_compat
>   WHERE list = ? and year = ? and month = ? and message_number = ?;

I'd rather see it redirect it to the new style URL, but it's the same
query, yes :-)


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Typed tables
Next
From: Andres Freund
Date:
Subject: Re: Hot Standy introduced problem with query cancel behavior