Re: mailing list archiver chewing patches - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: mailing list archiver chewing patches
Date
Msg-id 9837222c1001121204m52adbf21k6260609f1c8768b@mail.gmail.com
Whole thread Raw
In response to Re: mailing list archiver chewing patches  (Matteo Beccati <php@beccati.com>)
Responses Re: mailing list archiver chewing patches  (Matteo Beccati <php@beccati.com>)
List pgsql-hackers
On Tue, Jan 12, 2010 at 20:56, Matteo Beccati <php@beccati.com> wrote:
> Il 12/01/2010 10:30, Magnus Hagander ha scritto:
>>
>> The problem is usually with strange looking emails with 15 different
>> MIME types. If we can figure out the proper way to render that, the
>> rest really is just a SMOP.
>
> Yeah, I was expecting some, but all the message I've looked at seemed to be
> working ok.

Have you been looking at old or new messages? Try grabbing a couple of
MBOX files off archives.postgresql.org from several years back, you're
more likely to find weird MUAs then I think.


>> (BTW, for something to actually be used In Production (TM), we want
>> something that uses one of our existing frameworks. So don't go
>> overboard in code-wise implementations on something else - proof of
>> concept on something else is always ok, of course)
>
> OK, that's something I didn't know, even though I expected some kind of
> limitations. Could you please elaborate a bit more (i.e. where to find
> info)?

Well, the framework we're moving towards is built on top of django, so
that would be a good first start.

There is also whever the commitfest thing is built on, but I'm told
that's basically no framework.


> Having played with it, here's my feedback about AOX:
>
> pros:
> - seemed to be working reliably;
> - does most of the dirty job of parsing emails, splitting parts, etc
> - highly normalized schema
> - thread support (partial?)

A killer will be if that thread support is enough. If we have to build
that completely ourselves, it'll take a lot more work.


> cons:
> - directly publishing the live email feed might not be desirable

Why not?

> - queries might end up being a bit complicate for simple tasks

As long as we don't have to hit them too often, which is solve:able
with caching. And we do have a pretty good RDBMS to run the queries on
:)


>> I don't think you can trust the NNTP gateway now or in the past,
>> messages are sometimes lost there. The mbox files are as complete as
>> anything we'll ever get.
>
> Importing the whole pgsql-www archive with a perl script that bounces
> messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I
> haven't looked into it that much.

Um, yes. There is an MBOX import tool.


>>>> - We need to generate thread indexes
>>>
>>> We have CTEs :)
>>
>> Right. We still need the threading information, so we have something
>> to use our CTEs on :-)
>>
>> But I assume that AOX already does this?
>
> there are thread related tables and they seem to get filled when a SORT IMAP
> command is issued, however I haven't found a way to get the hierarchy out of
> them.
>
> What that means is that we'd need some kind of post processing to populate a
> thread hierarchy.
>
> If there isn't a fully usable thread hierarchy I was more thinking to ltree,
> mainly because I've successfully used it in past and I haven't had enough
> time yet to look at CTEs. But if performance is comparable I don't see a
> reason why we shouldn't use them.

I'd favor CTEs if they are fast enough. Great flexibility.


>>>> - We need to re-generate the original URLs for backwards compatibility
>>>
>>> I guess the message-id one ain't the tricky one... and it should be
>>> possible to fill a relation table like
>>>  monharc_compat(message_id, list, year, month, message_number);
>>
>> Yeah. It's not so hard, you can just screen-scrape the current
>> archives the same way the search server does.
>
> Definitely an easy enough task.
>
> With all that said, I can't promise anything as it all depends on how much
> spare time I have, but I can proceed with the evaluation if you think it's
> useful. I have a feeling that AOX is not truly the right tool for the job,
> but we might be able to customise it to suit our needs. Are there any other
> requirements that weren't specified?

Well, I think we want to avoid customizing it. Using a custom
frontend, sure. But we don't want to end up customizing the
parser/backend. That's the road to unmaintainability.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


pgsql-hackers by date:

Previous
From: Matteo Beccati
Date:
Subject: Re: mailing list archiver chewing patches
Next
From: Marko Kreen
Date:
Subject: Re: Streaming replication status