Thread: mailing list archiver chewing patches

mailing list archiver chewing patches

From
Andrew Dunstan
Date:
Tim Bunce's recent patch has been mangled apparently by the list 
archives. He sent it as an attachment, and that's how I have it in my 
mailbox, so why isn't it appearing as such in the web archive so that it 
can be nicely downloaded? See 
<http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>.  
It's happened to other people as well: 
<http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi>

Reviewers and others shouldn't have to c&p patches from web pages, 
especially when it will be horribly line wrapped etc. Can we stop this 
happening somehow?

cheers

andrew


Re: mailing list archiver chewing patches

From
Alvaro Herrera
Date:
Andrew Dunstan wrote:
> 
> Tim Bunce's recent patch has been mangled apparently by the list
> archives. He sent it as an attachment, and that's how I have it in
> my mailbox, so why isn't it appearing as such in the web archive so
> that it can be nicely downloaded? See <http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>.
> It's happened to other people as well:
> <http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi>
> 
> Reviewers and others shouldn't have to c&p patches from web pages,
> especially when it will be horribly line wrapped etc. Can we stop
> this happening somehow?

Try this

http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: mailing list archiver chewing patches

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Andrew Dunstan wrote:
> > 
> > Tim Bunce's recent patch has been mangled apparently by the list
> > archives. He sent it as an attachment, and that's how I have it in
> > my mailbox, so why isn't it appearing as such in the web archive so
> > that it can be nicely downloaded? See
<http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>.
> > It's happened to other people as well:
> > <http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi>
> > 
> > Reviewers and others shouldn't have to c&p patches from web pages,
> > especially when it will be horribly line wrapped etc. Can we stop
> > this happening somehow?
> 
> Try this
> 
> http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local

This was previously broken for a lot of emails, but I just fixed some of
it, and it seems to work for the vast majority of our emails (and
certainly for all emails that matter).

The other point related to this is that each email should have a link
pointing to its text/plain version.  This used to be present, but it got
broken (I think) at the same time that the anti-email-harvesting measure
got broken.  I'm going to look at that next.

Let me know if you find something broken with this style of link.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: mailing list archiver chewing patches

From
Tim Bunce
Date:
On Sat, Jan 09, 2010 at 02:17:27AM -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > Andrew Dunstan wrote:
> > > 
> > > Tim Bunce's recent patch has been mangled apparently by the list
> > > archives. He sent it as an attachment, and that's how I have it in
> > > my mailbox, so why isn't it appearing as such in the web archive so
> > > that it can be nicely downloaded? See
<http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>.
> > > It's happened to other people as well:
> > > <http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi>
> > > 
> > > Reviewers and others shouldn't have to c&p patches from web pages,
> > > especially when it will be horribly line wrapped etc. Can we stop
> > > this happening somehow?
> > 
> > Try this
> > 
> > http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local

That looks like it dumps the raw message. That'll cause problems for any
messages using quoted-printable encoding. I'd hazard a guess it also
won't do thing right thing for non-charset=us-ascii emails/attachments.

> This was previously broken for a lot of emails, but I just fixed some of
> it, and it seems to work for the vast majority of our emails (and
> certainly for all emails that matter).
> 
> The other point related to this is that each email should have a link
> pointing to its text/plain version.  This used to be present, but it got
> broken (I think) at the same time that the anti-email-harvesting measure
> got broken.  I'm going to look at that next.
>
> Let me know if you find something broken with this style of link.

What's needed is a) a download link for each attachment, regardless of the
kind of attachment, and b) the download link should download the content
of the attachment in a way that's directly usable.

For example, see http://archives.postgresql.org/pgsql-hackers/2010-01/msg00589.php
Looking at the raw version of the original
messagehttp://archives.postgresql.org/msgtxt.php?id=757953.70187.qm@web29001.mail.ird.yahoo.com
That message has a patch as an attachment:Content-Type: application/octet-stream;
name="patch_bit.patch"Content-Transfer-Encoding:base64Content-Disposition: attachment; filename="patch_bit.patch"
 

It gets a link in the archive (because it's a non-text content-type I
presume):http://archives.postgresql.org/pgsql-hackers/2010-01/bin5ThVOJC3jI.bin
but the link doesn't work well.  The url ends with .bin and the http
response content-type is Content-Type: application/octet-stream so
downloaders get a .bin file instead of the original .patch file.

It seems that people wanting to send in a patch have two options: send
it as text/(something) so it's readable on the archive web page but not
copy-n-paste'able because of wordwrapping, or set it as
application/octet-stream so it's downloadable but not readable on the
web page.

Let me know if I've misunderstood anything.

Some sugestions:
- Provide links for all attachments, whether text/* or not.
- For text/* types show the content inline verbatim, don't wrap the text.
- If the attachment has a Content-Disposition with a filename then append that to the url. It could simply be a fake
'pathinfo':.../2010-01/bin5ThVOJC3jI.bin/patch_bit.patch
 
- Instead of "Description: Binary data" on the web page, give the values of the Content-Type and Content-Disposition
headers.

Tim.

p.s. For background... I'm writing an email to the dbi-users &
dbi-announce mailing lists (~2000 & ~5000 users last time I checked)
asking anyone who might be interested to help review the plperl feature
patch and encouraging them to contribute to the commitfest review
process for other patches.  It's important that it's *very* easy for
these new-comers to follow simple instructions to get involved.
I was hoping to be able to use a archives.postgresql.org url to the
message with the patch to explain what's the patch does _and_ provide a
download link. It seems I'll have to upload the patch somewhere else.


Re: mailing list archiver chewing patches

From
Andrew Dunstan
Date:

Tim Bunce wrote:
> It seems that people wanting to send in a patch have two options: send
> it as text/(something) so it's readable on the archive web page but not
> copy-n-paste'able because of wordwrapping, or set it as
> application/octet-stream so it's downloadable but not readable on the
> web page.
>
>
>   

That is assuming that the MUA gives you the option of specifying the 
attachment MIME type. Many (including mine) do not. It would mean an 
extra step - I'd have to gzip each patch or something like that. That 
would be unfortunate,as well as imposing extra effort, because it would 
make the patch not display inline in many MUAs (again, like mine).

cheers

andrew




Re: mailing list archiver chewing patches

From
Alvaro Herrera
Date:
Tim Bunce wrote:

> > > Try this
> > > 
> > > http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local
> 
> That looks like it dumps the raw message. That'll cause problems for any
> messages using quoted-printable encoding. I'd hazard a guess it also
> won't do thing right thing for non-charset=us-ascii emails/attachments.

Yeah.  Grab it and open it as an mbox.

> What's needed is a) a download link for each attachment, regardless of the
> kind of attachment, and b) the download link should download the content
> of the attachment in a way that's directly usable.

Yeah, well, that's a bit outside what I am able to do, unless you can
get a MHonArc expert somewhere who can help us figure out how to
set it up for these requirements.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> That is assuming that the MUA gives you the option of specifying the
> attachment MIME type. Many (including mine) do not. It would mean an extra
> step - I'd have to gzip each patch or something like that. That would be
> unfortunate,as well as imposing extra effort, because it would make the
> patch not display inline in many MUAs (again, like mine).

Bad MUA, change MUA, or what they say…

More seriously though, it's not the first time we're having some
difficulties with the MHonArc setup, and I think it's also related to
the poor thread following on the archives website at month boundaries.

MHonArc (http://hydra.nac.uci.edu/indiv/ehood/mhonarc.html) seems to be
about converting the mails into some HTML pages, and offering the web
interface to get to use them, with some indexing and searches
facilities.

Are our indexing and searches provided by MHonArc or maintained by the
community? How helpful considering alternatives, such as AOX (which runs
atop PostgreSQL and would offer anonymous IMAP facility over the
archives) would be?

Of course it'll boil down to who's maintaining the current solution and
how much time is allocated to this, the solution research and migration
would have to fit in there I suppose. Same as pgfoundry. But still,
should we talk about it?

Regards,
--
dim


Re: mailing list archiver chewing patches

From
Alvaro Herrera
Date:
Dimitri Fontaine wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > That is assuming that the MUA gives you the option of specifying the
> > attachment MIME type. Many (including mine) do not. It would mean an extra
> > step - I'd have to gzip each patch or something like that. That would be
> > unfortunate,as well as imposing extra effort, because it would make the
> > patch not display inline in many MUAs (again, like mine).
> 
> Bad MUA, change MUA, or what they say…
> 
> More seriously though, it's not the first time we're having some
> difficulties with the MHonArc setup, and I think it's also related to
> the poor thread following on the archives website at month boundaries.

Absolutely.  The month boundary problem boils down to the fact that
Mhonarc does not scale very well, so we can't have mboxes that are too
large.  This is why most people split their archives per month, and then
each month is published as an independent Mhonarc output archive.  It's
a horrid solution.

> Are our indexing and searches provided by MHonArc or maintained by the
> community?

Searches are completely external to mhonarc.

> How helpful considering alternatives, such as AOX (which runs
> atop PostgreSQL and would offer anonymous IMAP facility over the
> archives) would be?
>
> Of course it'll boil down to who's maintaining the current solution and
> how much time is allocated to this, the solution research and migration
> would have to fit in there I suppose. Same as pgfoundry. But still,
> should we talk about it?

There's some talk about writing our own archiving system,
database-backed.  There have been a few false starts but no concrete
result so far.  We need a lot more manpower invested in this problem.
If there's interest, let's talk about it.

My daugher was born yesterday and I'm having a bit of a calm before the
storm because she's not coming home until Tuesday or so (at this time of
the day, that is, because I have to take care of the other daughter).
I'll be probably away for (at least) a week when she does; and I'll
probably have somewhat of a shortage of spare time after that.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
2010/1/11 Alvaro Herrera <alvherre@commandprompt.com>:
> Dimitri Fontaine wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>> > That is assuming that the MUA gives you the option of specifying the
>> > attachment MIME type. Many (including mine) do not. It would mean an extra
>> > step - I'd have to gzip each patch or something like that. That would be
>> > unfortunate,as well as imposing extra effort, because it would make the
>> > patch not display inline in many MUAs (again, like mine).
>>
>> Bad MUA, change MUA, or what they say…
>>
>> More seriously though, it's not the first time we're having some
>> difficulties with the MHonArc setup, and I think it's also related to
>> the poor thread following on the archives website at month boundaries.
>
> Absolutely.  The month boundary problem boils down to the fact that
> Mhonarc does not scale very well, so we can't have mboxes that are too
> large.  This is why most people split their archives per month, and then
> each month is published as an independent Mhonarc output archive.  It's
> a horrid solution.

Yeah.


>> Are our indexing and searches provided by MHonArc or maintained by the
>> community?
>
> Searches are completely external to mhonarc.

It is, but it's tied into the format of the URLs and the format of the
actual messages in order to be more efficient. But it should be fairly
easy to adapt it to some other base system if we want.


>> How helpful considering alternatives, such as AOX (which runs
>> atop PostgreSQL and would offer anonymous IMAP facility over the
>> archives) would be?
>>
>> Of course it'll boil down to who's maintaining the current solution and
>> how much time is allocated to this, the solution research and migration
>> would have to fit in there I suppose. Same as pgfoundry. But still,
>> should we talk about it?
>
> There's some talk about writing our own archiving system,
> database-backed.  There have been a few false starts but no concrete
> result so far.  We need a lot more manpower invested in this problem.
> If there's interest, let's talk about it.

Yeah, definitely, let's talk about it. Anything that gives us an
efficient backend with a good API is interesting (SQL is a reasonably
good API. Not so sure about IMAP, since it is a bit too focused on
single messages IIRC). Particularly, something that can separate
frontend and backend (can still be on the same machine of course, I'm
talking conceptually) seems to be a lot more flexible, which we'd
like.

As for AOX, my understanding is that it is no longer maintained, so
I'd be worried about choosing such a solution for a complex problem.
But it's open for discussion.


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Absolutely.  The month boundary problem boils down to the fact that
> Mhonarc does not scale very well, so we can't have mboxes that are too
> large.  This is why most people split their archives per month, and then
> each month is published as an independent Mhonarc output archive.  It's
> a horrid solution.
>
>> Are our indexing and searches provided by MHonArc or maintained by the
>> community?
>
> Searches are completely external to mhonarc.

Changing the MHonArc solution would probably mean adapting them, I
guess, or proposing a new solution with compatible output for the
searching to still work…

>> How helpful considering alternatives, such as AOX (which runs
>> atop PostgreSQL and would offer anonymous IMAP facility over the
>> archives) would be?
>>
>> Of course it'll boil down to who's maintaining the current solution and
>> how much time is allocated to this, the solution research and migration
>> would have to fit in there I suppose. Same as pgfoundry. But still,
>> should we talk about it?
>
> There's some talk about writing our own archiving system,
> database-backed.  There have been a few false starts but no concrete
> result so far.  We need a lot more manpower invested in this problem.
> If there's interest, let's talk about it.

AOX is already a database backed email solution, offering an archive
page with searching. I believe the searching is baked by tsearch
indexing. That's why I think it'd be suitable.

They already archive and offer search over one of our mailing lists, and
from there it seems like we'd only miss the user interface bits:
 http://archives.aox.org/archives/pgsql-announce

I hope the UI bits are not the most time demanding one.

Is there someone with enough time to install aox somewhere and have it
subscribed to our lists?

> My daugher was born yesterday and I'm having a bit of a calm before the
> storm because she's not coming home until Tuesday or so (at this time of
> the day, that is, because I have to take care of the other daughter).
> I'll be probably away for (at least) a week when she does; and I'll
> probably have somewhat of a shortage of spare time after that.

Ahaha :)
IME that's not the shortage of spare time which ruins you the most as
the lack of energy when you do have this little precious resource
again, very few piece of it atime.

Regards,
--
dim


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Hi,

Il 11/01/2010 11:18, Dimitri Fontaine ha scritto:
> AOX is already a database backed email solution, offering an archive
> page with searching. I believe the searching is baked by tsearch
> indexing. That's why I think it'd be suitable.
>
> They already archive and offer search over one of our mailing lists, and
> from there it seems like we'd only miss the user interface bits:
>
>    http://archives.aox.org/archives/pgsql-announce
>
> I hope the UI bits are not the most time demanding one.
>
> Is there someone with enough time to install aox somewhere and have it
> subscribed to our lists?

I recall having tried AOX a long time ago but I can't remember the 
reason why I was not satisfied. I guess I can give another try by 
setting up a test ML archive.


>> My daugher was born yesterday and I'm having a bit of a calm before the
>> storm because she's not coming home until Tuesday or so (at this time of
>> the day, that is, because I have to take care of the other daughter).
>> I'll be probably away for (at least) a week when she does; and I'll
>> probably have somewhat of a shortage of spare time after that.

BTW, congrats Alvaro!


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Dave Page
Date:
On Mon, Jan 11, 2010 at 5:23 PM, Matteo Beccati <php@beccati.com> wrote:
> Hi,
>
> Il 11/01/2010 11:18, Dimitri Fontaine ha scritto:
>>
>> AOX is already a database backed email solution, offering an archive
>> page with searching. I believe the searching is baked by tsearch
>> indexing. That's why I think it'd be suitable.
>>
>> They already archive and offer search over one of our mailing lists, and
>> from there it seems like we'd only miss the user interface bits:
>>
>>   http://archives.aox.org/archives/pgsql-announce
>>
>> I hope the UI bits are not the most time demanding one.
>>
>> Is there someone with enough time to install aox somewhere and have it
>> subscribed to our lists?
>
> I recall having tried AOX a long time ago but I can't remember the reason
> why I was not satisfied. I guess I can give another try by setting up a test
> ML archive.

I tried it too, before I started writing the new prototype archiver
from scratch. I too forget why I gave up on it, but it was a strong
enough reason for me to start coding from scratch.

BTW, we only need to replace the archiver/display code. The search
works well already.


--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 11/01/2010 12:58, Dave Page ha scritto:
> On Mon, Jan 11, 2010 at 5:23 PM, Matteo Beccati<php@beccati.com>  wrote:
>> I recall having tried AOX a long time ago but I can't remember the reason
>> why I was not satisfied. I guess I can give another try by setting up a test
>> ML archive.
>
> I tried it too, before I started writing the new prototype archiver
> from scratch. I too forget why I gave up on it, but it was a strong
> enough reason for me to start coding from scratch.
>
> BTW, we only need to replace the archiver/display code. The search
> works well already.

It took me no more than 10 minutes to set up AOX and hook it up to a 
domain. An email account is now subscribed to the hackers ML.

I'll try to estimate how hard it could be to write a web app that 
displays the archive from the db, even though I'm not sure that this is 
a good way to proceed.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Magnus Hagander <magnus@hagander.net> writes:
> As for AOX, my understanding is that it is no longer maintained, so
> I'd be worried about choosing such a solution for a complex problem.
> But it's open for discussion.

Ouch.
-- 
dim


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Dave Page <dpage@pgadmin.org> writes:
>> I recall having tried AOX a long time ago but I can't remember the reason
>> why I was not satisfied. I guess I can give another try by setting up a test
>> ML archive.
>
> I tried it too, before I started writing the new prototype archiver
> from scratch. I too forget why I gave up on it, but it was a strong
> enough reason for me to start coding from scratch.
>
> BTW, we only need to replace the archiver/display code. The search
> works well already.

What the current archiver looks like? A PG database containing the raw
mails and attachements? It that's the case the missing piece would be to
plug a browsing UI atop of that, right?

Regards,
-- 
dim


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
2010/1/11 Dimitri Fontaine <dfontaine@hi-media.com>:
> Dave Page <dpage@pgadmin.org> writes:
>>> I recall having tried AOX a long time ago but I can't remember the reason
>>> why I was not satisfied. I guess I can give another try by setting up a test
>>> ML archive.
>>
>> I tried it too, before I started writing the new prototype archiver
>> from scratch. I too forget why I gave up on it, but it was a strong
>> enough reason for me to start coding from scratch.
>>
>> BTW, we only need to replace the archiver/display code. The search
>> works well already.
>
> What the current archiver looks like? A PG database containing the raw
> mails and attachements? It that's the case the missing piece would be to
> plug a browsing UI atop of that, right?

No, the current archiver is a set of MBOX files that are processed
incrementally by mhonarc.

(yes, this is why it doesn't scale)

*search* is in a postgresql database, but it doesn't contain the
entire messages - doesn't have attachments for examples - only the
parts it has web-scraped off the the current archives.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Magnus Hagander <magnus@hagander.net> writes:
>> As for AOX, my understanding is that it is no longer maintained, so
>> I'd be worried about choosing such a solution for a complex problem.
>> But it's open for discussion.
>
> Ouch.

It seems that the company baking the development is dead, but the
developpers are still working on the product on their spare time. New
release ahead.

They're not working on the archive UI part.
-- 
dim


Re: mailing list archiver chewing patches

From
Andrew Dunstan
Date:

Magnus Hagander wrote:
>
> No, the current archiver is a set of MBOX files that are processed
> incrementally by mhonarc.
>
> (yes, this is why it doesn't scale)
>
> *search* is in a postgresql database, but it doesn't contain the
> entire messages - doesn't have attachments for examples - only the
> parts it has web-scraped off the the current archives.
>   

Fixing this mess and giving us decent archives with guaranteed 
downloadable patches and good search would be a nice job for someone who 
wants to contribute without having to cut or review core code.

cheers

andrew


Re: mailing list archiver chewing patches

From
Abhijit Menon-Sen
Date:
(Many thanks to Dimitri for bringing this thread to my attention.)

At 2010-01-11 10:46:10 +0100, magnus@hagander.net wrote:
>
> As for AOX, my understanding is that it is no longer maintained, so
> I'd be worried about choosing such a solution for a complex problem.

I'll keep this short: Oryx, the company behind Archiveopteryx (aox), is
no longer around, but the software is still maintained. The developers
(myself included) are still interested in keeping it alive. It's been a
while since the last release, but it'll be ready soon. If you're having
any sort of problems with it, write to me, and I'll help you.

(That said, we're not working on the web interface. It did work, in its
limited fashion, but it's not feature complete; and I need to find some
paying work, so it's not a priority. That, and some health problems, are
also why I haven't been active on the pg lists for a while.)

Feel free to write to me off-list for more.

-- ams


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 11/01/2010 15:00, Abhijit Menon-Sen ha scritto:
> I'll keep this short: Oryx, the company behind Archiveopteryx (aox), is
> no longer around, but the software is still maintained. The developers
> (myself included) are still interested in keeping it alive. It's been a
> while since the last release, but it'll be ready soon. If you're having
> any sort of problems with it, write to me, and I'll help you.

That's good news indeed for the project, AOX seems to be working fine on 
my server. I've had a few IMAP glitches, but it seems to live happily 
with my qmail and stores the emails on the db, fulfilling my current needs.

So, I've decided to spend a bit more time on this and here is a proof of 
concept web app that displays mailing list archives reading from the AOX 
database:

http://archives.beccati.org/

Please take it as an exercise I've made trying to learn how to use 
symfony this afternoon. It's not feature complete, nor probably very 
scalable, but at least it features attachment download ;)

http://archives.beccati.org/pgsql-hackers/message/37


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Dave Page
Date:
2010/1/12 Matteo Beccati <php@beccati.com>:
> So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list
archivesreading from the AOX database:
 
>
> http://archives.beccati.org/

Seems to work.

> Please take it as an exercise I've made trying to learn how to use symfony this afternoon. It's not feature complete,
norprobably very scalable, but at least it features attachment download ;)
 
>
> http://archives.beccati.org/pgsql-hackers/message/37

:-)

So just to put this into perspective and give anyone paying attention
an idea of the pain that lies ahead should they decide to work on
this:

- We need to import the old archives (of which there are hundreds of
thousands of messages, the first few years of which have, umm, minimal
headers.
- We need to generate thread indexes
- We need to re-generate the original URLs for backwards compatibility

Now there's encouragement :-)

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Dave Page <dpage@pgadmin.org> writes:
> 2010/1/12 Matteo Beccati <php@beccati.com>:
>> So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list
archivesreading from the AOX database:
 
>>
>> http://archives.beccati.org/
>
> Seems to work.

Hehe, nice a beginning!

> So just to put this into perspective and give anyone paying attention
> an idea of the pain that lies ahead should they decide to work on
> this:
>
> - We need to import the old archives (of which there are hundreds of
> thousands of messages, the first few years of which have, umm, minimal
> headers.

Anyone having a local copy of this in his mailboxes? At some point there
were some NNTP gateway, so maybe there's a copy this way.

> - We need to generate thread indexes

We have CTEs :)

> - We need to re-generate the original URLs for backwards compatibility

I guess the message-id one ain't the tricky one... and it should be
possible to fill a relation table like monharc_compat(message_id, list, year, month, message_number);

Then we'd need some help from the webserver (rewrite rules I guess) so
that the current URL is transformed to call a catch-all script:
http://archives.postgresql.org/pgsql-xxx/YYYY-MM/msg01234.php
-> http://archives.postgresql.org/compat.php?l=xxx&y=YYYY&m=MM&n=01234

In that compat.php script you then issue the following query or the like
to get the message_id, then use the newer infrastructure to get to
display it:
 SELECT message_id    FROM monharc_compat   WHERE list = ? and year = ? and month = ? and message_number = ?;

Regards,
-- 
dim


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Tue, Jan 12, 2010 at 10:05, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> Dave Page <dpage@pgadmin.org> writes:
>> 2010/1/12 Matteo Beccati <php@beccati.com>:
>>> So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list
archivesreading from the AOX database: 
>>>
>>> http://archives.beccati.org/
>>
>> Seems to work.
>
> Hehe, nice a beginning!

The problem is usually with strange looking emails with 15 different
MIME types. If we can figure out the proper way to render that, the
rest really is just a SMOP.

(BTW, for something to actually be used In Production (TM), we want
something that uses one of our existing frameworks. So don't go
overboard in code-wise implementations on something else - proof of
concept on something else is always ok, of course)


>> So just to put this into perspective and give anyone paying attention
>> an idea of the pain that lies ahead should they decide to work on
>> this:
>>
>> - We need to import the old archives (of which there are hundreds of
>> thousands of messages, the first few years of which have, umm, minimal
>> headers.
>
> Anyone having a local copy of this in his mailboxes? At some point there
> were some NNTP gateway, so maybe there's a copy this way.

We have MBOX files.

IIRC, aox has an import function that can read MBOX files. The
interesting thing is what happens with the really old files that don't
have complete headers.

I don't think you can trust the NNTP gateway now or in the past,
messages are sometimes lost there. The mbox files are as complete as
anything we'll ever get.


>> - We need to generate thread indexes
>
> We have CTEs :)

Right. We still need the threading information, so we have something
to use our CTEs on :-)

But I assume that AOX already does this?


>> - We need to re-generate the original URLs for backwards compatibility
>
> I guess the message-id one ain't the tricky one... and it should be
> possible to fill a relation table like
>  monharc_compat(message_id, list, year, month, message_number);

Yeah. It's not so hard, you can just screen-scrape the current
archives the same way the search server does.


> Then we'd need some help from the webserver (rewrite rules I guess) so
> that the current URL is transformed to call a catch-all script:
>   http://archives.postgresql.org/pgsql-xxx/YYYY-MM/msg01234.php
> -> http://archives.postgresql.org/compat.php?l=xxx&y=YYYY&m=MM&n=01234

Or just a trivial regexp catch in any modern app platform.


> In that compat.php script you then issue the following query or the like
> to get the message_id, then use the newer infrastructure to get to
> display it:
>
>  SELECT message_id
>    FROM monharc_compat
>   WHERE list = ? and year = ? and month = ? and message_number = ?;

I'd rather see it redirect it to the new style URL, but it's the same
query, yes :-)


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Mon, Jan 11, 2010 at 15:00, Abhijit Menon-Sen <ams@toroid.org> wrote:
> (Many thanks to Dimitri for bringing this thread to my attention.)
>
> At 2010-01-11 10:46:10 +0100, magnus@hagander.net wrote:
>>
>> As for AOX, my understanding is that it is no longer maintained, so
>> I'd be worried about choosing such a solution for a complex problem.
>
> I'll keep this short: Oryx, the company behind Archiveopteryx (aox), is
> no longer around, but the software is still maintained. The developers
> (myself included) are still interested in keeping it alive. It's been a
> while since the last release, but it'll be ready soon. If you're having
> any sort of problems with it, write to me, and I'll help you.

Hmm. So if this means that the system is actually something we can
rely on long-term for the parsing and importing of messages into the
database, it may be an interesting optino still, so we don't have to
write that part ourselves.

I just want to end up with a non-maintained system. I doubt we, as a
community, want to take on maintaining a message parser in C++. I'd be
much more inclined to end up having to maintain something written in
python or perl in that case, since they'd probably rely much on
external modules that a lot of others rely on --> somebody else would
help maintain large parts of it..


> (That said, we're not working on the web interface. It did work, in its
> limited fashion, but it's not feature complete; and I need to find some
> paying work, so it's not a priority. That, and some health problems, are
> also why I haven't been active on the pg lists for a while.)

As long as the db structure is easy enough to parse and generate stuff
from, this may actually be a feature, because it will make it easier
to integrate with our other website stuff. If it's very low level and
leaves too much to work, well, then it's the opposite of course :-)

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
"Joshua D. Drake"
Date:
On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
> 2010/1/12 Matteo Beccati <php@beccati.com>:
> > So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list
archivesreading from the AOX database: 
> >
> > http://archives.beccati.org/
>
> Seems to work.
>
> > Please take it as an exercise I've made trying to learn how to use symfony this afternoon. It's not feature
complete,nor probably very scalable, but at least it features attachment download ;) 
> >
> > http://archives.beccati.org/pgsql-hackers/message/37
>
> :-)
>
> So just to put this into perspective and give anyone paying attention
> an idea of the pain that lies ahead should they decide to work on
> this:
>
> - We need to import the old archives (of which there are hundreds of
> thousands of messages, the first few years of which have, umm, minimal
> headers.
> - We need to generate thread indexes
> - We need to re-generate the original URLs for backwards compatibility
>
> Now there's encouragement :-)

Or, we just leave the current infrastructure in place and use a new one
for all new messages going forward. We shouldn't limit our ability to
have a decent system due to decisions of the past.



Joshua D. Drake



--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.

Re: mailing list archiver chewing patches

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
>> So just to put this into perspective and give anyone paying attention
>> an idea of the pain that lies ahead should they decide to work on
>> this:
>> 
>> - We need to import the old archives (of which there are hundreds of
>> thousands of messages, the first few years of which have, umm, minimal
>> headers.
>> - We need to generate thread indexes
>> - We need to re-generate the original URLs for backwards compatibility
>> 
>> Now there's encouragement :-)

> Or, we just leave the current infrastructure in place and use a new one
> for all new messages going forward. We shouldn't limit our ability to
> have a decent system due to decisions of the past.

-1.  What's the point of having archives?  IMO the mailing list archives
are nearly as critical a piece of the project infrastructure as the CVS
repository.  We've already established that moving to a new SCM that
fails to preserve the CVS history wouldn't be acceptable.  I hardly
think that the bar is any lower for mailing list archives.

Now I think we could possibly skip the requirement suggested above for
URL compatibility, if we just leave the old archives on-line so that
those URLs all still resolve.  But if we can't load all the old messages
into the new infrastructure, it'll basically be useless for searching
purposes.

(Hmm, re-reading what you said, maybe we are suggesting the same thing,
but it's not clear.  Anyway my point is that Dave's first two
requirements are real.  Only the third might not be.)
        regards, tom lane


Re: mailing list archiver chewing patches

From
"Joshua D. Drake"
Date:
On Tue, 2010-01-12 at 11:54 -0500, Tom Lane wrote:

> > Or, we just leave the current infrastructure in place and use a new one
> > for all new messages going forward. We shouldn't limit our ability to
> > have a decent system due to decisions of the past.
>
> -1.  What's the point of having archives?  IMO the mailing list archives
> are nearly as critical a piece of the project infrastructure as the CVS
> repository.  We've already established that moving to a new SCM that
> fails to preserve the CVS history wouldn't be acceptable.  I hardly
> think that the bar is any lower for mailing list archives.
>
> Now I think we could possibly skip the requirement suggested above for
> URL compatibility, if we just leave the old archives on-line so that
> those URLs all still resolve.  But if we can't load all the old messages
> into the new infrastructure, it'll basically be useless for searching
> purposes.
>
> (Hmm, re-reading what you said, maybe we are suggesting the same thing,
> but it's not clear.  Anyway my point is that Dave's first two
> requirements are real.  Only the third might not be.)

We are saying the same thing. Sorry if I wasn't clear.

Joshua D. Drake



>
>             regards, tom lane
>


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.

Re: mailing list archiver chewing patches

From
Dave Page
Date:
On Tue, Jan 12, 2010 at 10:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
>>> So just to put this into perspective and give anyone paying attention
>>> an idea of the pain that lies ahead should they decide to work on
>>> this:
>>>
>>> - We need to import the old archives (of which there are hundreds of
>>> thousands of messages, the first few years of which have, umm, minimal
>>> headers.
>>> - We need to generate thread indexes
>>> - We need to re-generate the original URLs for backwards compatibility
>>>
>>> Now there's encouragement :-)
>
>> Or, we just leave the current infrastructure in place and use a new one
>> for all new messages going forward. We shouldn't limit our ability to
>> have a decent system due to decisions of the past.
>
> -1.  What's the point of having archives?  IMO the mailing list archives
> are nearly as critical a piece of the project infrastructure as the CVS
> repository.  We've already established that moving to a new SCM that
> fails to preserve the CVS history wouldn't be acceptable.  I hardly
> think that the bar is any lower for mailing list archives.
>
> Now I think we could possibly skip the requirement suggested above for
> URL compatibility, if we just leave the old archives on-line so that
> those URLs all still resolve.  But if we can't load all the old messages
> into the new infrastructure, it'll basically be useless for searching
> purposes.
>
> (Hmm, re-reading what you said, maybe we are suggesting the same thing,
> but it's not clear.  Anyway my point is that Dave's first two
> requirements are real.  Only the third might not be.)

The third actually isn't actually that hard to do in theory. The
message numbers are basically the zero-based position in the mbox
file, and the rest of the URL is obvious.


--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Tue, Jan 12, 2010 at 18:34, Dave Page <dpage@pgadmin.org> wrote:
> On Tue, Jan 12, 2010 at 10:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
>>>> So just to put this into perspective and give anyone paying attention
>>>> an idea of the pain that lies ahead should they decide to work on
>>>> this:
>>>>
>>>> - We need to import the old archives (of which there are hundreds of
>>>> thousands of messages, the first few years of which have, umm, minimal
>>>> headers.
>>>> - We need to generate thread indexes
>>>> - We need to re-generate the original URLs for backwards compatibility
>>>>
>>>> Now there's encouragement :-)
>>
>>> Or, we just leave the current infrastructure in place and use a new one
>>> for all new messages going forward. We shouldn't limit our ability to
>>> have a decent system due to decisions of the past.
>>
>> -1.  What's the point of having archives?  IMO the mailing list archives
>> are nearly as critical a piece of the project infrastructure as the CVS
>> repository.  We've already established that moving to a new SCM that
>> fails to preserve the CVS history wouldn't be acceptable.  I hardly
>> think that the bar is any lower for mailing list archives.
>>
>> Now I think we could possibly skip the requirement suggested above for
>> URL compatibility, if we just leave the old archives on-line so that
>> those URLs all still resolve.  But if we can't load all the old messages
>> into the new infrastructure, it'll basically be useless for searching
>> purposes.
>>
>> (Hmm, re-reading what you said, maybe we are suggesting the same thing,
>> but it's not clear.  Anyway my point is that Dave's first two
>> requirements are real.  Only the third might not be.)
>
> The third actually isn't actually that hard to do in theory. The
> message numbers are basically the zero-based position in the mbox
> file, and the rest of the URL is obvious.

The third part is trivial. The search system already does 95% of it.
I've already implemented exactly that kind of redirect thing on top of
the search code once just as a poc, and it was less than 30 minutes of
hacking. Can't seem to find the script ATM though, but you get the
idea.

Let's not focus on that part, we can easily solve that.


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 12/01/2010 10:30, Magnus Hagander ha scritto:
> The problem is usually with strange looking emails with 15 different
> MIME types. If we can figure out the proper way to render that, the
> rest really is just a SMOP.

Yeah, I was expecting some, but all the message I've looked at seemed to 
be working ok.

> (BTW, for something to actually be used In Production (TM), we want
> something that uses one of our existing frameworks. So don't go
> overboard in code-wise implementations on something else - proof of
> concept on something else is always ok, of course)

OK, that's something I didn't know, even though I expected some kind of 
limitations. Could you please elaborate a bit more (i.e. where to find 
info)?

Having played with it, here's my feedback about AOX:

pros:
- seemed to be working reliably;
- does most of the dirty job of parsing emails, splitting parts, etc
- highly normalized schema
- thread support (partial?)

cons:
- directly publishing the live email feed might not be desirable
- queries might end up being a bit complicate for simple tasks
- might be not easy to add additional processing in the workflow


>>> So just to put this into perspective and give anyone paying attention
>>> an idea of the pain that lies ahead should they decide to work on
>>> this:
>>>
>>> - We need to import the old archives (of which there are hundreds of
>>> thousands of messages, the first few years of which have, umm, minimal
>>> headers.
>>
>> Anyone having a local copy of this in his mailboxes? At some point there
>> were some NNTP gateway, so maybe there's a copy this way.
>
> We have MBOX files.
>
> IIRC, aox has an import function that can read MBOX files. The
> interesting thing is what happens with the really old files that don't
> have complete headers.
>
> I don't think you can trust the NNTP gateway now or in the past,
> messages are sometimes lost there. The mbox files are as complete as
> anything we'll ever get.

Importing the whole pgsql-www archive with a perl script that bounces 
messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, 
I haven't looked into it that much.

>>> - We need to generate thread indexes
>>
>> We have CTEs :)
>
> Right. We still need the threading information, so we have something
> to use our CTEs on :-)
>
> But I assume that AOX already does this?

there are thread related tables and they seem to get filled when a SORT 
IMAP command is issued, however I haven't found a way to get the 
hierarchy out of them.

What that means is that we'd need some kind of post processing to 
populate a thread hierarchy.

If there isn't a fully usable thread hierarchy I was more thinking to 
ltree, mainly because I've successfully used it in past and I haven't 
had enough time yet to look at CTEs. But if performance is comparable I 
don't see a reason why we shouldn't use them.

>>> - We need to re-generate the original URLs for backwards compatibility
>>
>> I guess the message-id one ain't the tricky one... and it should be
>> possible to fill a relation table like
>>   monharc_compat(message_id, list, year, month, message_number);
>
> Yeah. It's not so hard, you can just screen-scrape the current
> archives the same way the search server does.

Definitely an easy enough task.

With all that said, I can't promise anything as it all depends on how 
much spare time I have, but I can proceed with the evaluation if you 
think it's useful. I have a feeling that AOX is not truly the right tool 
for the job, but we might be able to customise it to suit our needs. Are 
there any other requirements that weren't specified?


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 12/01/2010 19:54, Magnus Hagander ha scritto:
> On Tue, Jan 12, 2010 at 18:34, Dave Page<dpage@pgadmin.org>  wrote:
>> On Tue, Jan 12, 2010 at 10:24 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> "Joshua D. Drake"<jd@commandprompt.com>  writes:
>>>> On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
>>>>> So just to put this into perspective and give anyone paying attention
>>>>> an idea of the pain that lies ahead should they decide to work on
>>>>> this:
>>>>>
>>>>> - We need to import the old archives (of which there are hundreds of
>>>>> thousands of messages, the first few years of which have, umm, minimal
>>>>> headers.
>>>>> - We need to generate thread indexes
>>>>> - We need to re-generate the original URLs for backwards compatibility
>>>>>
>>>>> Now there's encouragement :-)
>>>
>>>> Or, we just leave the current infrastructure in place and use a new one
>>>> for all new messages going forward. We shouldn't limit our ability to
>>>> have a decent system due to decisions of the past.
>>>
>>> -1.  What's the point of having archives?  IMO the mailing list archives
>>> are nearly as critical a piece of the project infrastructure as the CVS
>>> repository.  We've already established that moving to a new SCM that
>>> fails to preserve the CVS history wouldn't be acceptable.  I hardly
>>> think that the bar is any lower for mailing list archives.
>>>
>>> Now I think we could possibly skip the requirement suggested above for
>>> URL compatibility, if we just leave the old archives on-line so that
>>> those URLs all still resolve.  But if we can't load all the old messages
>>> into the new infrastructure, it'll basically be useless for searching
>>> purposes.
>>>
>>> (Hmm, re-reading what you said, maybe we are suggesting the same thing,
>>> but it's not clear.  Anyway my point is that Dave's first two
>>> requirements are real.  Only the third might not be.)
>>
>> The third actually isn't actually that hard to do in theory. The
>> message numbers are basically the zero-based position in the mbox
>> file, and the rest of the URL is obvious.
>
> The third part is trivial. The search system already does 95% of it.
> I've already implemented exactly that kind of redirect thing on top of
> the search code once just as a poc, and it was less than 30 minutes of
> hacking. Can't seem to find the script ATM though, but you get the
> idea.
>
> Let's not focus on that part, we can easily solve that.

Agreed. That's the part that worries me less.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Tue, Jan 12, 2010 at 20:56, Matteo Beccati <php@beccati.com> wrote:
> Il 12/01/2010 10:30, Magnus Hagander ha scritto:
>>
>> The problem is usually with strange looking emails with 15 different
>> MIME types. If we can figure out the proper way to render that, the
>> rest really is just a SMOP.
>
> Yeah, I was expecting some, but all the message I've looked at seemed to be
> working ok.

Have you been looking at old or new messages? Try grabbing a couple of
MBOX files off archives.postgresql.org from several years back, you're
more likely to find weird MUAs then I think.


>> (BTW, for something to actually be used In Production (TM), we want
>> something that uses one of our existing frameworks. So don't go
>> overboard in code-wise implementations on something else - proof of
>> concept on something else is always ok, of course)
>
> OK, that's something I didn't know, even though I expected some kind of
> limitations. Could you please elaborate a bit more (i.e. where to find
> info)?

Well, the framework we're moving towards is built on top of django, so
that would be a good first start.

There is also whever the commitfest thing is built on, but I'm told
that's basically no framework.


> Having played with it, here's my feedback about AOX:
>
> pros:
> - seemed to be working reliably;
> - does most of the dirty job of parsing emails, splitting parts, etc
> - highly normalized schema
> - thread support (partial?)

A killer will be if that thread support is enough. If we have to build
that completely ourselves, it'll take a lot more work.


> cons:
> - directly publishing the live email feed might not be desirable

Why not?

> - queries might end up being a bit complicate for simple tasks

As long as we don't have to hit them too often, which is solve:able
with caching. And we do have a pretty good RDBMS to run the queries on
:)


>> I don't think you can trust the NNTP gateway now or in the past,
>> messages are sometimes lost there. The mbox files are as complete as
>> anything we'll ever get.
>
> Importing the whole pgsql-www archive with a perl script that bounces
> messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I
> haven't looked into it that much.

Um, yes. There is an MBOX import tool.


>>>> - We need to generate thread indexes
>>>
>>> We have CTEs :)
>>
>> Right. We still need the threading information, so we have something
>> to use our CTEs on :-)
>>
>> But I assume that AOX already does this?
>
> there are thread related tables and they seem to get filled when a SORT IMAP
> command is issued, however I haven't found a way to get the hierarchy out of
> them.
>
> What that means is that we'd need some kind of post processing to populate a
> thread hierarchy.
>
> If there isn't a fully usable thread hierarchy I was more thinking to ltree,
> mainly because I've successfully used it in past and I haven't had enough
> time yet to look at CTEs. But if performance is comparable I don't see a
> reason why we shouldn't use them.

I'd favor CTEs if they are fast enough. Great flexibility.


>>>> - We need to re-generate the original URLs for backwards compatibility
>>>
>>> I guess the message-id one ain't the tricky one... and it should be
>>> possible to fill a relation table like
>>>  monharc_compat(message_id, list, year, month, message_number);
>>
>> Yeah. It's not so hard, you can just screen-scrape the current
>> archives the same way the search server does.
>
> Definitely an easy enough task.
>
> With all that said, I can't promise anything as it all depends on how much
> spare time I have, but I can proceed with the evaluation if you think it's
> useful. I have a feeling that AOX is not truly the right tool for the job,
> but we might be able to customise it to suit our needs. Are there any other
> requirements that weren't specified?

Well, I think we want to avoid customizing it. Using a custom
frontend, sure. But we don't want to end up customizing the
parser/backend. That's the road to unmaintainability.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Aidan Van Dyk
Date:
I'll note that the whole idea of a "email archive" interface might be a
very good "advocacy" project as well.  AOX might not be a perfect fit,
but it could be a good learning experience... Really, all the PG mail
archives need is:

1) A nice normalized DB schema representing mail messages and their  relations to other message and "recipients" (or
"folders")

2) A "injector" that can parse an email message, and de-compose it into  the various parts/tables of the DB schema, and
insertit
 

3) A nice set of SQL queries to return message, parts, threads,  folders based on $criteria (search, id, folder, etc)

4) A web interface to view the messages/thread/parts #3 returns

The largest part of this is #1, but a good schema would be a very good
candidate to show of some of PG's more powerful features in a way that
"others" could see (like the movie store sample somewhere) , such as: 1) full text search 2) text vs bytea handling
(thinkingof all the mime parts, and encoding,    etc) 3) CTEs, ltree, recursion, etc, for threading/searching 4)
Triggersfor "materialized views" (for quick threading/folder queries) 5) expression indexes
 

a.

* Matteo Beccati <php@beccati.com> [100112 14:56]:

> Having played with it, here's my feedback about AOX:
>
> pros:
> - seemed to be working reliably;
> - does most of the dirty job of parsing emails, splitting parts, etc
> - highly normalized schema
> - thread support (partial?)
>
> cons:
> - directly publishing the live email feed might not be desirable
> - queries might end up being a bit complicate for simple tasks
> - might be not easy to add additional processing in the workflow

> If there isn't a fully usable thread hierarchy I was more thinking to  
> ltree, mainly because I've successfully used it in past and I haven't  
> had enough time yet to look at CTEs. But if performance is comparable I  
> don't see a reason why we shouldn't use them.

> With all that said, I can't promise anything as it all depends on how  
> much spare time I have, but I can proceed with the evaluation if you  
> think it's useful. I have a feeling that AOX is not truly the right tool  
> for the job, but we might be able to customise it to suit our needs. Are  
> there any other requirements that weren't specified?

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 12/01/2010 21:04, Magnus Hagander ha scritto:
> On Tue, Jan 12, 2010 at 20:56, Matteo Beccati<php@beccati.com>  wrote:
>> Il 12/01/2010 10:30, Magnus Hagander ha scritto:
>>>
>>> The problem is usually with strange looking emails with 15 different
>>> MIME types. If we can figure out the proper way to render that, the
>>> rest really is just a SMOP.
>>
>> Yeah, I was expecting some, but all the message I've looked at seemed to be
>> working ok.
>
> Have you been looking at old or new messages? Try grabbing a couple of
> MBOX files off archives.postgresql.org from several years back, you're
> more likely to find weird MUAs then I think.

Both. pgsql-hacker and -general are subscribed and getting new emails 
and pgsql-www is just an import of the archives:

http://archives.beccati.org/pgsql-www/by/date (sorry, no paging)

(just fixed a 500 error that was caused by the fact that I've been 
playing with the db a bit and a required helper table was missing)

>>> (BTW, for something to actually be used In Production (TM), we want
>>> something that uses one of our existing frameworks. So don't go
>>> overboard in code-wise implementations on something else - proof of
>>> concept on something else is always ok, of course)
>>
>> OK, that's something I didn't know, even though I expected some kind of
>> limitations. Could you please elaborate a bit more (i.e. where to find
>> info)?
>
> Well, the framework we're moving towards is built on top of django, so
> that would be a good first start.
>
> There is also whever the commitfest thing is built on, but I'm told
> that's basically no framework.

I'm afraid that's outside on my expertise. But I can get as far as 
having a proof of concept and the required queries / php code.

>> Having played with it, here's my feedback about AOX:
>>
>> pros:
>> - seemed to be working reliably;
>> - does most of the dirty job of parsing emails, splitting parts, etc
>> - highly normalized schema
>> - thread support (partial?)
>
> A killer will be if that thread support is enough. If we have to build
> that completely ourselves, it'll take a lot more work.

Looks like we need to populate a helper table with hierarchy 
information, unless Ahijit has a better idea and knows how to get it 
from the aox main schema.

>> cons:
>> - directly publishing the live email feed might not be desirable
>
> Why not?

The scenario I was thinking at was the creation of a static snapshot and 
potential inconsistencies that might occur if the threads get updated 
during that time.

>> - queries might end up being a bit complicate for simple tasks
>
> As long as we don't have to hit them too often, which is solve:able
> with caching. And we do have a pretty good RDBMS to run the queries on
> :)

True :)

>>> I don't think you can trust the NNTP gateway now or in the past,
>>> messages are sometimes lost there. The mbox files are as complete as
>>> anything we'll ever get.
>>
>> Importing the whole pgsql-www archive with a perl script that bounces
>> messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I
>> haven't looked into it that much.
>
> Um, yes. There is an MBOX import tool.

Cool.


>> With all that said, I can't promise anything as it all depends on how much
>> spare time I have, but I can proceed with the evaluation if you think it's
>> useful. I have a feeling that AOX is not truly the right tool for the job,
>> but we might be able to customise it to suit our needs. Are there any other
>> requirements that weren't specified?
>
> Well, I think we want to avoid customizing it. Using a custom
> frontend, sure. But we don't want to end up customizing the
> parser/backend. That's the road to unmaintainability.

Sure. I guess my wording wasn't right... I was more thinking about 
adding new tables, materialized views or whatever else might be missing 
to make it fit out purpose.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:

> I'll note that the whole idea of a "email archive" interface might be a
> very good "advocacy" project as well.  AOX might not be a perfect fit,
> but it could be a good learning experience... Really, all the PG mail
> archives need is:
>
> 1) A nice normalized DB schema representing mail messages and their
>    relations to other message and "recipients" (or "folders")

We're now hoping that this one will fit:
 http://www.archiveopteryx.org/schema

> 2) A "injector" that can parse an email message, and de-compose it into
>    the various parts/tables of the DB schema, and insert it

aox has that either as a bulk importer or as a MDA.

> 3) A nice set of SQL queries to return message, parts, threads,
>    folders based on $criteria (search, id, folder, etc)

I guess Matteo's working on that…

> 4) A web interface to view the messages/thread/parts #3 returns

And that too.

> The largest part of this is #1, but a good schema would be a very good
> candidate to show of some of PG's more powerful features in a way that
> "others" could see (like the movie store sample somewhere) , such as:
>   1) full text search
>   2) text vs bytea handling (thinking of all the mime parts, and encoding,
>      etc)
>   3) CTEs, ltree, recursion, etc, for threading/searching
>   4) Triggers for "materialized views" (for quick threading/folder queries)
>   5) expression indexes

And Tsearch, too, maybe. Oh and pg_trgm might be quite good at providing
suggestion as you type or "Did you mean?" stuff.

Regards,
--
dim


Re: mailing list archiver chewing patches

From
Aidan Van Dyk
Date:
* Dimitri Fontaine <dfontaine@hi-media.com> [100112 16:28]:

> > 1) A nice normalized DB schema representing mail messages and their
> >    relations to other message and "recipients" (or "folders")
>
> We're now hoping that this one will fit:
>
>   http://www.archiveopteryx.org/schema

Yup, and it provides a lot more too, which could probably be safely
ignored.

> > 2) A "injector" that can parse an email message, and de-compose it into
> >    the various parts/tables of the DB schema, and insert it
>
> aox has that either as a bulk importer or as a MDA.

Yup, LMTP is ideally suited for that too.

> > 3) A nice set of SQL queries to return message, parts, threads,
> >    folders based on $criteria (search, id, folder, etc)
>
> I guess Matteo's working on that…

Right, but this is where I want to see the AOX schema "imporove"... In
ways like adding persistant tables for threading, which are updated by
triggers as new messages are delivered, etc.  Documented queries that
show how to use CTEs, ltree, etc to get threaded views, good FTS support
(with indexes and triggers managing them), etc.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:
>> aox has that either as a bulk importer or as a MDA.
>
> Yup, LMTP is ideally suited for that too.

Yes.

>> > 3) A nice set of SQL queries to return message, parts, threads,
>> >    folders based on $criteria (search, id, folder, etc)
>>
>> I guess Matteo's working on that…
>
> Right, but this is where I want to see the AOX schema "imporove"... In
> ways like adding persistant tables for threading, which are updated by
> triggers as new messages are delivered, etc.  Documented queries that
> show how to use CTEs, ltree, etc to get threaded views, good FTS support
> (with indexes and triggers managing them), etc.

+1.

I just didn't understand how much your proposal fit into current work :)
--
dim


Re: mailing list archiver chewing patches

From
"Joshua D. Drake"
Date:
On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
> 2010/1/12 Matteo Beccati <php@beccati.com>:
> > So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list
archivesreading from the AOX database:
 
> >
> > http://archives.beccati.org/
> 
> Seems to work.
> 
> > Please take it as an exercise I've made trying to learn how to use symfony this afternoon. It's not feature
complete,nor probably very scalable, but at least it features attachment download ;)
 
> >
> > http://archives.beccati.org/pgsql-hackers/message/37
> 
> :-)
> 
> So just to put this into perspective and give anyone paying attention
> an idea of the pain that lies ahead should they decide to work on
> this:
> 
> - We need to import the old archives (of which there are hundreds of
> thousands of messages, the first few years of which have, umm, minimal
> headers.
> - We need to generate thread indexes
> - We need to re-generate the original URLs for backwards compatibility
> 
> Now there's encouragement :-)

Or, we just leave the current infrastructure in place and use a new one
for all new messages going forward. We shouldn't limit our ability to
have a decent system due to decisions of the past.



Joshua D. Drake



-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.



Re: mailing list archiver chewing patches

From
"Joshua D. Drake"
Date:
On Tue, 2010-01-12 at 11:54 -0500, Tom Lane wrote:

> > Or, we just leave the current infrastructure in place and use a new one
> > for all new messages going forward. We shouldn't limit our ability to
> > have a decent system due to decisions of the past.
> 
> -1.  What's the point of having archives?  IMO the mailing list archives
> are nearly as critical a piece of the project infrastructure as the CVS
> repository.  We've already established that moving to a new SCM that
> fails to preserve the CVS history wouldn't be acceptable.  I hardly
> think that the bar is any lower for mailing list archives.
> 
> Now I think we could possibly skip the requirement suggested above for
> URL compatibility, if we just leave the old archives on-line so that
> those URLs all still resolve.  But if we can't load all the old messages
> into the new infrastructure, it'll basically be useless for searching
> purposes.
> 
> (Hmm, re-reading what you said, maybe we are suggesting the same thing,
> but it's not clear.  Anyway my point is that Dave's first two
> requirements are real.  Only the third might not be.)

We are saying the same thing. Sorry if I wasn't clear.

Joshua D. Drake



> 
>             regards, tom lane
> 


-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.



Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Hi,

>>>> 3) A nice set of SQL queries to return message, parts, threads,
>>>>     folders based on $criteria (search, id, folder, etc)
>>>
>>> I guess Matteo's working on that…
>>
>> Right, but this is where I want to see the AOX schema "imporove"... In
>> ways like adding persistant tables for threading, which are updated by
>> triggers as new messages are delivered, etc.  Documented queries that
>> show how to use CTEs, ltree, etc to get threaded views, good FTS support
>> (with indexes and triggers managing them), etc.
>
> +1.
>
> I just didn't understand how much your proposal fit into current work :)

I'm looking into it. The link I've previously sent will most likely 
return a 500 error for the time being.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 14/01/2010 08:22, Matteo Beccati ha scritto:
> Hi,
>
>>>>> 3) A nice set of SQL queries to return message, parts, threads,
>>>>> folders based on $criteria (search, id, folder, etc)
>>>>
>>>> I guess Matteo's working on that…
>>>
>>> Right, but this is where I want to see the AOX schema "imporove"... In
>>> ways like adding persistant tables for threading, which are updated by
>>> triggers as new messages are delivered, etc. Documented queries that
>>> show how to use CTEs, ltree, etc to get threaded views, good FTS support
>>> (with indexes and triggers managing them), etc.
>>
>> +1.
>>
>> I just didn't understand how much your proposal fit into current work :)
>
> I'm looking into it. The link I've previously sent will most likely
> return a 500 error for the time being.

A quick update:

I've extended AOX with a trigger that takes care of filling a separate 
table that's used to display the index pages. The new table also stores 
threading information (standard headers + Exchange headers support) and 
whether or not the email has attachments.

Please check the updated PoC: http://archives.beccati.org/

pgsql-hackers and -general are currently subscribed, while -www only has 
2003 history imported via aoximport (very fast!).

BTW, I've just noticed a bug in the attachment detection giving false 
positives, but have no time to check now.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Matteo Beccati <php@beccati.com> writes:
> I've extended AOX with a trigger that takes care of filling a separate table
> that's used to display the index pages. The new table also stores threading
> information (standard headers + Exchange headers support) and whether or not
> the email has attachments.
>
> Please check the updated PoC: http://archives.beccati.org/

Looks pretty good, even if some thread are still separated (this one for
example), and the ordering looks strange.

Seems to be right on tracks, that said :)

Thanks for your work,
-- 
dim


Re: mailing list archiver chewing patches

From
Dave Page
Date:
On Thu, Jan 14, 2010 at 7:09 PM, Dimitri Fontaine
<dfontaine@hi-media.com> wrote:
> Matteo Beccati <php@beccati.com> writes:
>> I've extended AOX with a trigger that takes care of filling a separate table
>> that's used to display the index pages. The new table also stores threading
>> information (standard headers + Exchange headers support) and whether or not
>> the email has attachments.
>>
>> Please check the updated PoC: http://archives.beccati.org/
>
> Looks pretty good, even if some thread are still separated (this one for
> example), and the ordering looks strange.
>
> Seems to be right on tracks, that said :)

Yup.

Matteo - Can you try loading up a lot more of the old mbox files,
particularly the very early ones from -hackers? It would be good to
see how it copes under load with a few hundred thousand messages in
the database.

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 14/01/2010 14:39, Dimitri Fontaine ha scritto:
> Matteo Beccati<php@beccati.com>  writes:
>> I've extended AOX with a trigger that takes care of filling a separate table
>> that's used to display the index pages. The new table also stores threading
>> information (standard headers + Exchange headers support) and whether or not
>> the email has attachments.
>>
>> Please check the updated PoC: http://archives.beccati.org/
>
> Looks pretty good, even if some thread are still separated (this one for
> example), and the ordering looks strange.

This one is separated as the first one is not in the archive yet, thus 
to the system there are multiple parent messages. It shouldn't happen 
with full archives. About sorting, here's the query I've used (my first 
try with CTEs incidentally):

WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments, 
parent_uid, idx, depth) AS (  SELECT mailbox, uid, date, subject, sender, has_attachments, 
parent_uid, uid::text, 1  FROM arc_messages  WHERE parent_uid IS NULL AND mailbox = 15  UNION ALL  SELECT a.mailbox,
a.uid,a.date, a.subject, a.sender, 
 
a.has_attachments, a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1  FROM t JOIN arc_messages a USING (mailbox)
WHEREt.uid = a.parent_uid
 
) SELECT * FROM t ORDER BY idx

Any improvements to sorting are welcome :)


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 14/01/2010 14:46, Dave Page ha scritto:
> On Thu, Jan 14, 2010 at 7:09 PM, Dimitri Fontaine
> <dfontaine@hi-media.com>  wrote:
>> Matteo Beccati<php@beccati.com>  writes:
>>> I've extended AOX with a trigger that takes care of filling a separate table
>>> that's used to display the index pages. The new table also stores threading
>>> information (standard headers + Exchange headers support) and whether or not
>>> the email has attachments.
>>>
>>> Please check the updated PoC: http://archives.beccati.org/
>>
>> Looks pretty good, even if some thread are still separated (this one for
>> example), and the ordering looks strange.
>>
>> Seems to be right on tracks, that said :)
>
> Yup.
>
> Matteo - Can you try loading up a lot more of the old mbox files,
> particularly the very early ones from -hackers? It would be good to
> see how it copes under load with a few hundred thousand messages in
> the database.

Sure, I will give it a try in the evening or tomorrow.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Dave Page
Date:
On Thu, Jan 14, 2010 at 8:02 PM, Matteo Beccati <php@beccati.com> wrote:
> Il 14/01/2010 14:46, Dave Page ha scritto:
>>
>> On Thu, Jan 14, 2010 at 7:09 PM, Dimitri Fontaine
>> <dfontaine@hi-media.com>  wrote:
>>>
>>> Matteo Beccati<php@beccati.com>  writes:
>>>>
>>>> I've extended AOX with a trigger that takes care of filling a separate
>>>> table
>>>> that's used to display the index pages. The new table also stores
>>>> threading
>>>> information (standard headers + Exchange headers support) and whether or
>>>> not
>>>> the email has attachments.
>>>>
>>>> Please check the updated PoC: http://archives.beccati.org/
>>>
>>> Looks pretty good, even if some thread are still separated (this one for
>>> example), and the ordering looks strange.
>>>
>>> Seems to be right on tracks, that said :)
>>
>> Yup.
>>
>> Matteo - Can you try loading up a lot more of the old mbox files,
>> particularly the very early ones from -hackers? It would be good to
>> see how it copes under load with a few hundred thousand messages in
>> the database.
>
> Sure, I will give it a try in the evening or tomorrow.

Thanks :-)


--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Matteo Beccati <php@beccati.com> writes:
> WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments,
> parent_uid, idx, depth) AS (
>   SELECT mailbox, uid, date, subject, sender, has_attachments, parent_uid,
> uid::text, 1
>   FROM arc_messages
>   WHERE parent_uid IS NULL AND mailbox = 15
>   UNION ALL
>   SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, a.has_attachments,
> a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1
>   FROM t JOIN arc_messages a USING (mailbox)
>   WHERE t.uid = a.parent_uid
> ) SELECT * FROM t ORDER BY idx
>
> Any improvements to sorting are welcome :)

What I'd like would be to have it sorted by activity, showing first the
thread which received the later messages. I'm yet to play with CTE and
window function myself so without a database example to play with I
won't come up with a nice query, but I guess a more educated reader will
solve this without a sweat, as it looks easier than sudoku-solving,
which has been done already :)

Regards,
-- 
dim


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 14/01/2010 15:47, Dimitri Fontaine ha scritto:
> Matteo Beccati<php@beccati.com>  writes:
>> WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments,
>> parent_uid, idx, depth) AS (
>>    SELECT mailbox, uid, date, subject, sender, has_attachments, parent_uid,
>> uid::text, 1
>>    FROM arc_messages
>>    WHERE parent_uid IS NULL AND mailbox = 15
>>    UNION ALL
>>    SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, a.has_attachments,
>> a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1
>>    FROM t JOIN arc_messages a USING (mailbox)
>>    WHERE t.uid = a.parent_uid
>> ) SELECT * FROM t ORDER BY idx
>>
>> Any improvements to sorting are welcome :)
>
> What I'd like would be to have it sorted by activity, showing first the
> thread which received the later messages. I'm yet to play with CTE and
> window function myself so without a database example to play with I
> won't come up with a nice query, but I guess a more educated reader will
> solve this without a sweat, as it looks easier than sudoku-solving,
> which has been done already :)

Eheh, that was my first try as well. CTEs look very nice even though I'm 
not yet very comfortable with the syntax. Anyway both for date and 
thread indexes sort is the other way around, with newer posts/threads at 
the bottom. Again I'll give it a try as soon as I find time to work 
again on it.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Thu, Jan 14, 2010 at 16:06, Matteo Beccati <php@beccati.com> wrote:
> Il 14/01/2010 15:47, Dimitri Fontaine ha scritto:
>>
>> Matteo Beccati<php@beccati.com>  writes:
>>>
>>> WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments,
>>> parent_uid, idx, depth) AS (
>>>   SELECT mailbox, uid, date, subject, sender, has_attachments,
>>> parent_uid,
>>> uid::text, 1
>>>   FROM arc_messages
>>>   WHERE parent_uid IS NULL AND mailbox = 15
>>>   UNION ALL
>>>   SELECT a.mailbox, a.uid, a.date, a.subject, a.sender,
>>> a.has_attachments,
>>> a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1
>>>   FROM t JOIN arc_messages a USING (mailbox)
>>>   WHERE t.uid = a.parent_uid
>>> ) SELECT * FROM t ORDER BY idx
>>>
>>> Any improvements to sorting are welcome :)
>>
>> What I'd like would be to have it sorted by activity, showing first the
>> thread which received the later messages. I'm yet to play with CTE and
>> window function myself so without a database example to play with I
>> won't come up with a nice query, but I guess a more educated reader will
>> solve this without a sweat, as it looks easier than sudoku-solving,
>> which has been done already :)
>
> Eheh, that was my first try as well. CTEs look very nice even though I'm not
> yet very comfortable with the syntax. Anyway both for date and thread
> indexes sort is the other way around, with newer posts/threads at the
> bottom. Again I'll give it a try as soon as I find time to work again on it.

Three tips around this,

1) don't be constrained by how things look now. Make something that's useful.

2) don't be constrained  by the fact that we have two ways to view it
now (thread + date). we can easily do three, if different people like
different ways. As long as it's not so much it becomes a maintenance
burden

3) Remember to run your tests with lots of emails, some designs just
tend to fall apart over that (say a thread with 200+ emails in it)


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
David Fetter
Date:
On Thu, Jan 14, 2010 at 03:08:22PM +0100, Matteo Beccati wrote:
> Il 14/01/2010 14:39, Dimitri Fontaine ha scritto:
> >Matteo Beccati<php@beccati.com>  writes:
> >>I've extended AOX with a trigger that takes care of filling a separate table
> >>that's used to display the index pages. The new table also stores threading
> >>information (standard headers + Exchange headers support) and whether or not
> >>the email has attachments.
> >>
> >>Please check the updated PoC: http://archives.beccati.org/
> >
> >Looks pretty good, even if some thread are still separated (this one for
> >example), and the ordering looks strange.
> 
> This one is separated as the first one is not in the archive yet,
> thus to the system there are multiple parent messages. It shouldn't
> happen with full archives. About sorting, here's the query I've used
> (my first try with CTEs incidentally):
> 
> WITH RECURSIVE t (mailbox, uid, date, subject, sender,
> has_attachments, parent_uid, idx, depth) AS (
>   SELECT mailbox, uid, date, subject, sender, has_attachments,
> parent_uid, uid::text, 1
>   FROM arc_messages
>   WHERE parent_uid IS NULL AND mailbox = 15
>   UNION ALL
>   SELECT a.mailbox, a.uid, a.date, a.subject, a.sender,
> a.has_attachments, a.parent_uid, t.idx || '.' || a.uid::text,
> t.depth + 1
>   FROM t JOIN arc_messages a USING (mailbox)
>   WHERE t.uid = a.parent_uid
> ) SELECT * FROM t ORDER BY idx

> Any improvements to sorting are welcome :)

This is probably better written as:

WITH RECURSIVE t (   mailbox,   uid,   date,   subject,   sender,   has_attachments,   "path"
)
AS (   SELECT       mailbox,       uid,       date,       subject,       sender,       has_attachments,
ARRAY[uid]  FROM       arc_messages   WHERE       parent_uid IS NULL AND       mailbox = 15
 
UNION ALL   SELECT       a.mailbox,       a.uid,       a.date,       a.subject,       a.sender,
a.has_attachments,      t."path" || a.uid,   FROM       t JOIN arc_messages a       ON (           a.mailbox =
t.mailboxAND           t.uid = a.parent_uid       )
 
)
SELECT *
FROM t
ORDER BY "path";

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Hi everyone,

Il 14/01/2010 19:36, David Fetter ha scritto:
> On Thu, Jan 14, 2010 at 03:08:22PM +0100, Matteo Beccati wrote:
>> Il 14/01/2010 14:39, Dimitri Fontaine ha scritto:
>>> Matteo Beccati<php@beccati.com>   writes:>
>> Any improvements to sorting are welcome :)
>
> ...
>          ARRAY[uid]
> ...

Thanks David, using an array rather than text concatenation is slightly
slower and uses a bit more memory, but you've been able to convince me
that it's The Right Way(TM) ;)

Anyway, I've made further changes and I would say that at this point the
PoC is feature complete. There surely are still some rough edges and a
few things to clean up, but I'd like to get your feedback once again:

http://archives.beccati.org

You will find that pgsql-general and -hackers are subscribed and getting
messages live, wihle -hackers-history and -www have been imported from
the archives (about 200k and 1.5k messages respectively at 50 messages/s).

Also, I'd need some help with the CTE query that was picking a wrong
plan and led me to forcibly disable merge joins inside the application
when executing it. Plans are attached.


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

Attachment

Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Matteo Beccati <php@beccati.com> writes:
> Anyway, I've made further changes and I would say that at this point the PoC
> is feature complete. There surely are still some rough edges and a few
> things to clean up, but I'd like to get your feedback once again:
>
> http://archives.beccati.org

I've been clicking around and like the speedy feeling and the Thread
index appearing under any mail. Also getting the attachments seems to be
just working™. I've also checked than this "local thread" works on month
boundaries, so that you're POC is in a way already better than the
current archives solution.

Only missing is the search, but we have tsearch and pg_trgm masters not
far away…

> You will find that pgsql-general and -hackers are subscribed and getting
> messages live, wihle -hackers-history and -www have been imported from the
> archives (about 200k and 1.5k messages respectively at 50 messages/s).

Tried clicking over there and very far in the past indexes show no
messages. Here's an example:
 http://archives.beccati.org/pgsql-hackers-history/1996-09/by/thread

> Also, I'd need some help with the CTE query that was picking a wrong plan
> and led me to forcibly disable merge joins inside the application when
> executing it. Plans are attached.

Sorry, not from me, still a CTE noob.
--
dim


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 16/01/2010 11:48, Dimitri Fontaine ha scritto:
> Matteo Beccati<php@beccati.com>  writes:
>> Anyway, I've made further changes and I would say that at this point the PoC
>> is feature complete. There surely are still some rough edges and a few
>> things to clean up, but I'd like to get your feedback once again:
>>
>> http://archives.beccati.org
>
> I've been clicking around and like the speedy feeling and the Thread
> index appearing under any mail. Also getting the attachments seems to be
> just working™. I've also checked than this "local thread" works on month
> boundaries, so that you're POC is in a way already better than the
> current archives solution.

Thanks for the feedback.

> Only missing is the search, but we have tsearch and pg_trgm masters not
> far away…

I haven't even looked at it as I was under the impression that the old 
engine could still be used. If not, adding search support should be 
fairly easy.


>> You will find that pgsql-general and -hackers are subscribed and getting
>> messages live, wihle -hackers-history and -www have been imported from the
>> archives (about 200k and 1.5k messages respectively at 50 messages/s).
>
> Tried clicking over there and very far in the past indexes show no
> messages. Here's an example:
>
>    http://archives.beccati.org/pgsql-hackers-history/1996-09/by/thread

Yeah, there are a few messages in the archives with a wrong date header. 
The list is generated using from min(date) to now(), so there are holes. 
At some point I'll run a few queries to fix that.


>> Also, I'd need some help with the CTE query that was picking a wrong plan
>> and led me to forcibly disable merge joins inside the application when
>> executing it. Plans are attached.
>
> Sorry, not from me, still a CTE noob.

Actually the live db doesn't suffer from that problem anymore, but I've 
able to reproduce the issue with a few days old backup running on a test 
8.5alpha3 instance that still has a stock postgresql.conf.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 16/01/2010 14:21, Matteo Beccati ha scritto:
> Il 16/01/2010 11:48, Dimitri Fontaine ha scritto:
>> Matteo Beccati<php@beccati.com> writes:
>>> Anyway, I've made further changes and I would say that at this point
>>> the PoC
>>> is feature complete. There surely are still some rough edges and a few
>>> things to clean up, but I'd like to get your feedback once again:
>
> [...]
>
>>> Also, I'd need some help with the CTE query that was picking a wrong
>>> plan
>>> and led me to forcibly disable merge joins inside the application when
>>> executing it. Plans are attached.
>>
>> Sorry, not from me, still a CTE noob.
>
> Actually the live db doesn't suffer from that problem anymore, but I've
> able to reproduce the issue with a few days old backup running on a test
> 8.5alpha3 instance that still has a stock postgresql.conf.

Following advice from Andrew "RodiumToad" Gierth, I raised cpu costs 
back to the defaults (I did lower them following some tuning guide) and 
that seems to have fixed the problem.

My question now is... what next? :)


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
2010/1/18 Matteo Beccati <php@beccati.com>:
> Il 16/01/2010 14:21, Matteo Beccati ha scritto:
>>
>> Il 16/01/2010 11:48, Dimitri Fontaine ha scritto:
>>>
>>> Matteo Beccati<php@beccati.com> writes:
>>>>
>>>> Anyway, I've made further changes and I would say that at this point
>>>> the PoC
>>>> is feature complete. There surely are still some rough edges and a few
>>>> things to clean up, but I'd like to get your feedback once again:
>>
>> [...]
>>
>>>> Also, I'd need some help with the CTE query that was picking a wrong
>>>> plan
>>>> and led me to forcibly disable merge joins inside the application when
>>>> executing it. Plans are attached.
>>>
>>> Sorry, not from me, still a CTE noob.
>>
>> Actually the live db doesn't suffer from that problem anymore, but I've
>> able to reproduce the issue with a few days old backup running on a test
>> 8.5alpha3 instance that still has a stock postgresql.conf.
>
> Following advice from Andrew "RodiumToad" Gierth, I raised cpu costs back to the defaults (I did lower them following
sometuning guide) and that seems to have fixed the problem.
 
>
> My question now is... what next? :)

If it wasn't for the fact that we're knee deep in two other major
projects for the infrastructure team right now, I'd be all over this
:-) But we really need to complete that before we put anything new in
production here.

What I'd like to see is one that integrates with our general layouts.

Also, I tink one of the main issues with the archives today that
people bring up is the inability to have threads cross months. I think
that should be fixed. Basically, get rid of the grouping by month for
a more dynamic way to browse.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Also, I tink one of the main issues with the archives today that
> people bring up is the inability to have threads cross months. I think
> that should be fixed. Basically, get rid of the grouping by month for
> a more dynamic way to browse.

Clic a mail in a thread within more than one given month. See the Thread
index for this email. It's complete, for both the month. Example here:
 http://archives.beccati.org/pgsql-hackers-history/message/191438.html
http://archives.beccati.org/pgsql-hackers-history/message/191334.html

That said, the month boundary is artificial, so maybe having a X
messages per page instead would be better?

Regards,
-- 
dim


Re: mailing list archiver chewing patches

From
Alvaro Herrera
Date:
Magnus Hagander wrote:
> 2010/1/18 Matteo Beccati <php@beccati.com>:

> > My question now is... what next? :)

Gee, I disappear for a week and look what happens -- we get streaming
replication, a revamped archives site, and maybe something else that I
haven't seen yet.  I love it :-)

> If it wasn't for the fact that we're knee deep in two other major
> projects for the infrastructure team right now, I'd be all over this
> :-) But we really need to complete that before we put anything new in
> production here.
> 
> What I'd like to see is one that integrates with our general layouts.

Yeah.  I think that should be relatively simple to add.  There are some
other things that need a bit of rejiggering too, like the thread index
at the bottom getting too wide with large threads, for example here:
http://archives.beccati.org/pgsql-hackers-history/message/90000.html

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 18/01/2010 15:55, Magnus Hagander ha scritto:
> If it wasn't for the fact that we're knee deep in two other major
> projects for the infrastructure team right now, I'd be all over this
> :-) But we really need to complete that before we put anything new in
> production here.

Sure, that's completely understandable.

> What I'd like to see is one that integrates with our general layouts.

Shoudln't bee too hard, but I wouldn't be very keen on spending time on 
layout related things that are going to be thrown away to due the 
framework and language being different from what is going to be used on 
production (symfony/php vs django/python).


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 18/01/2010 16:19, Dimitri Fontaine ha scritto:
> Magnus Hagander<magnus@hagander.net>  writes:
>> Also, I tink one of the main issues with the archives today that
>> people bring up is the inability to have threads cross months. I think
>> that should be fixed. Basically, get rid of the grouping by month for
>> a more dynamic way to browse.
>
> Clic a mail in a thread within more than one given month. See the Thread
> index for this email. It's complete, for both the month. Example here:
>
>    http://archives.beccati.org/pgsql-hackers-history/message/191438.html
>    http://archives.beccati.org/pgsql-hackers-history/message/191334.html

Thanks Dimitri, you beat me to it ;)


> That said, the month boundary is artificial, so maybe having a X
> messages per page instead would be better?

Not sure. Having date based pages helps out reducing the set of messages 
that need to be scanned and sorted, increasing the likeliness of an 
index scan. But I'm happy to examine other alternatives too.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Mon, Jan 18, 2010 at 18:31, Matteo Beccati <php@beccati.com> wrote:
> Il 18/01/2010 15:55, Magnus Hagander ha scritto:
>>
>> If it wasn't for the fact that we're knee deep in two other major
>> projects for the infrastructure team right now, I'd be all over this
>> :-) But we really need to complete that before we put anything new in
>> production here.
>
> Sure, that's completely understandable.
>
>> What I'd like to see is one that integrates with our general layouts.
>
> Shoudln't bee too hard, but I wouldn't be very keen on spending time on
> layout related things that are going to be thrown away to due the framework
> and language being different from what is going to be used on production
> (symfony/php vs django/python).

I don't know symfony, but as long as it's done in a template it is
probably pretty easy to move between different frameworks for the
layout part.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Mon, Jan 18, 2010 at 18:35, Matteo Beccati <php@beccati.com> wrote:
> Il 18/01/2010 16:19, Dimitri Fontaine ha scritto:
>>
>> Magnus Hagander<magnus@hagander.net>  writes:
>>>
>>> Also, I tink one of the main issues with the archives today that
>>> people bring up is the inability to have threads cross months. I think
>>> that should be fixed. Basically, get rid of the grouping by month for
>>> a more dynamic way to browse.
>>
>> Clic a mail in a thread within more than one given month. See the Thread
>> index for this email. It's complete, for both the month. Example here:
>>
>>   http://archives.beccati.org/pgsql-hackers-history/message/191438.html
>>   http://archives.beccati.org/pgsql-hackers-history/message/191334.html
>
> Thanks Dimitri, you beat me to it ;)
>
>
>> That said, the month boundary is artificial, so maybe having a X
>> messages per page instead would be better?
>
> Not sure. Having date based pages helps out reducing the set of messages
> that need to be scanned and sorted, increasing the likeliness of an index
> scan. But I'm happy to examine other alternatives too.

I think we need to get rid of the months based pages. We can keep them
as an option, but they're not a good root thing. I'd rather have
something where you start at a certain point and see <n> before and
<n> after, so we keep the page to a reasonably short time. Keeping "30
days" there somewhere may make sense, but arbitrarily splitting at the
1st of each month doesn't follow the flow of discussions very well.

I think the first step has to be to figure out how we'd like it
presented. Only after that should we consider how to implement it to
get fast scans in the database...


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 18/01/2010 18:42, Magnus Hagander ha scritto:
> On Mon, Jan 18, 2010 at 18:31, Matteo Beccati<php@beccati.com>  wrote:
>> Il 18/01/2010 15:55, Magnus Hagander ha scritto:
>>>
>>> If it wasn't for the fact that we're knee deep in two other major
>>> projects for the infrastructure team right now, I'd be all over this
>>> :-) But we really need to complete that before we put anything new in
>>> production here.
>>
>> Sure, that's completely understandable.
>>
>>> What I'd like to see is one that integrates with our general layouts.
>>
>> Shoudln't bee too hard, but I wouldn't be very keen on spending time on
>> layout related things that are going to be thrown away to due the framework
>> and language being different from what is going to be used on production
>> (symfony/php vs django/python).
>
> I don't know symfony, but as long as it's done in a template it is
> probably pretty easy to move between different frameworks for the
> layout part.

By default symfony uses plain PHP files as templates, but some plugins 
allow using a templating engine instead. I guess I can give them a try.


-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Tue, Jan 19, 2010 at 09:11, Matteo Beccati <php@beccati.com> wrote:
> Il 18/01/2010 18:42, Magnus Hagander ha scritto:
>>
>> On Mon, Jan 18, 2010 at 18:31, Matteo Beccati<php@beccati.com>  wrote:
>>>
>>> Il 18/01/2010 15:55, Magnus Hagander ha scritto:
>>>>
>>>> If it wasn't for the fact that we're knee deep in two other major
>>>> projects for the infrastructure team right now, I'd be all over this
>>>> :-) But we really need to complete that before we put anything new in
>>>> production here.
>>>
>>> Sure, that's completely understandable.
>>>
>>>> What I'd like to see is one that integrates with our general layouts.
>>>
>>> Shoudln't bee too hard, but I wouldn't be very keen on spending time on
>>> layout related things that are going to be thrown away to due the
>>> framework
>>> and language being different from what is going to be used on production
>>> (symfony/php vs django/python).
>>
>> I don't know symfony, but as long as it's done in a template it is
>> probably pretty easy to move between different frameworks for the
>> layout part.
>
> By default symfony uses plain PHP files as templates, but some plugins allow
> using a templating engine instead. I guess I can give them a try.

As long as the templating is separated from the code, it doesn't
matter if it's a dedicated templating engine or PHP. The point being,
focus on the contents and interface, porting the actual
HTML-generation is likely to be easy compared to that.


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
Il 19/01/2010 09:44, Magnus Hagander ha scritto:
> As long as the templating is separated from the code, it doesn't
> matter if it's a dedicated templating engine or PHP. The point being,
> focus on the contents and interface, porting the actual
> HTML-generation is likely to be easy compared to that.

I've been following the various suggestions. Please take a look at the 
updated archives proof of concept:

http://archives.beccati.org/

The PoC is now integrated with the website layout and has a working 
"Mailing lists" menu to navigate the available lists. The artificial 
monthly breakdown has been removed and both thread and date sorting use 
pagination instead.

The fancy tables are using the Ext JS framework as it was the only free 
one I could find that features column layout for trees. I'm not 
extremely happy about it, but it just works. Threads are loaded 
asynchronously (AJAX), while date sorting uses regular HTML tables with 
a bit of JS to get the fancy layout. This means that search engines 
still have a way to properly index all the messages.

Last but not least, it's backwards compatibile with the /message-id/* 
URI. The other one (/list/yyyy-mm/msg*.php) is implemented, but I just 
realized that it has problems dealing with the old archive weirdness 
(2009-12 shows also some messages dated aug 2009 nov 2009 or jan 2010 
for -hackers).

That said, there are still a few visual improvements to be done, but 
overall I'm pretty much satisfied.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Alvaro Herrera
Date:
Matteo Beccati wrote:
> Il 19/01/2010 09:44, Magnus Hagander ha scritto:
> >As long as the templating is separated from the code, it doesn't
> >matter if it's a dedicated templating engine or PHP. The point being,
> >focus on the contents and interface, porting the actual
> >HTML-generation is likely to be easy compared to that.
> 
> I've been following the various suggestions. Please take a look at
> the updated archives proof of concept:
> 
> http://archives.beccati.org/

I like this.

Sorry for being unable to get in touch with you on IM.  It's been a
hectic time here with only very few pauses.

Some things:

* the list of lists and groups of lists are stored in two JSON files.
Should I send you a copy of them so that you can tweak your code to use
them?  They are generated automatically from the wwwmaster database.

* We have a bunch of templates that you could perhaps have used, if you
hadn't already written all of it ... :-(

* While I don't personally care, some are going to insist that the site
works with Javascript disabled.  I didn't try but from your description
it doesn't seem like it would.  Is this easily fixable?

* The old monthly interface /list/yyyy-mm/msg*php is not really
necessary to keep, *except* that we need the existing URLs to redirect
to the corresponding new message page.  I think we should be able to
create a database of URL redirects from the old site, using the
Message-Id URL style.  So each message accessed using the old URL style
would require two redirects, but I don't think this is a problem.  Do
you agree?

* We're using Subversion to keep the current code.  Is your code
version-controlled?  We'd need to import your code there, I'm afraid.

> Last but not least, it's backwards compatibile with the
> /message-id/* URI. The other one (/list/yyyy-mm/msg*.php) is
> implemented, but I just realized that it has problems dealing with
> the old archive weirdness (2009-12 shows also some messages dated
> aug 2009 nov 2009 or jan 2010 for -hackers).

I'm surprised about the Aug 2009 ones, but the others are explained
because the site divides the mboxes using one timezone and the time
displayed is a different timezone.  We don't really control the first
one so there's nothing to do about it; but anyway it's not really
important.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: mailing list archiver chewing patches

From
Dimitri Fontaine
Date:
Matteo Beccati <php@beccati.com> writes:
> I've been following the various suggestions. Please take a look at the
> updated archives proof of concept:
>
> http://archives.beccati.org/

I like the features a lot, and the only remarks I can think about are
bikeschedding, so I'll let it to the web team when they integrate it. It
sure looks like a when rather than an if as far as I'm concerned.

In short, +1! And thanks a lot!
-- 
dim


Re: mailing list archiver chewing patches

From
Joe Conway
Date:
On 01/30/2010 01:14 PM, Dimitri Fontaine wrote:
> Matteo Beccati <php@beccati.com> writes:
>> I've been following the various suggestions. Please take a look at the
>> updated archives proof of concept:
>>
>> http://archives.beccati.org/
>
> I like the features a lot, and the only remarks I can think about are
> bikeschedding, so I'll let it to the web team when they integrate it. It
> sure looks like a when rather than an if as far as I'm concerned.
>
> In short, +1! And thanks a lot!

+1 here too. That looks wonderful!

Joe


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
On 30/01/2010 17:54, Alvaro Herrera wrote:
> Matteo Beccati wrote:
>> Il 19/01/2010 09:44, Magnus Hagander ha scritto:
>>> As long as the templating is separated from the code, it doesn't
>>> matter if it's a dedicated templating engine or PHP. The point being,
>>> focus on the contents and interface, porting the actual
>>> HTML-generation is likely to be easy compared to that.
>>
>> I've been following the various suggestions. Please take a look at
>> the updated archives proof of concept:
>>
>> http://archives.beccati.org/
>
> I like this.
>
> Sorry for being unable to get in touch with you on IM.  It's been a
> hectic time here with only very few pauses.

Thanks :)

And no worries, I'm pretty sure you must be quite busy lately!

> Some things:
>
> * the list of lists and groups of lists are stored in two JSON files.
> Should I send you a copy of them so that you can tweak your code to use
> them?  They are generated automatically from the wwwmaster database.
>
> * We have a bunch of templates that you could perhaps have used, if you
> hadn't already written all of it ... :-(

The templates and especially the integration with the current layout 
still need to be rewritten when porting the code to python/Django, so I 
I'm not sure if it's wise to spend more time on it at this stage.

Not sure about the JSON approach either. Maybe it's something that needs 
to be further discussed when/if planning the migration of the archives 
to Archiveopteryx.


> * While I don't personally care, some are going to insist that the site
> works with Javascript disabled.  I didn't try but from your description
> it doesn't seem like it would.  Is this easily fixable?

Date sorting works nicely even without JS, while thread sorting doesn't 
at all. I've just updated the PoC so that thread sorting is not 
available when JS is not available, while it still is the default 
otherwise. Hopefully that's enough to keep JS haters happy.

> * The old monthly interface /list/yyyy-mm/msg*php is not really
> necessary to keep, *except* that we need the existing URLs to redirect
> to the corresponding new message page.  I think we should be able to
> create a database of URL redirects from the old site, using the
> Message-Id URL style.  So each message accessed using the old URL style
> would require two redirects, but I don't think this is a problem.  Do
> you agree?

Sure. I was just hoping there was an even easier way (rescritct to 
month, order by uid limit 1 offset X). I guess it wouldn't be hard to 
write a script that populates a backward compatibility table. No need 
for double redirects, it'd be just a matter of adding a JOIN or two to 
the query.

> * We're using Subversion to keep the current code.  Is your code
> version-controlled?  We'd need to import your code there, I'm afraid.

I do have a local svn repository. Given it's just a PoC that is going to 
be rewritten I don't think it should live in the official repo, but if 
you think id does, I'll be glad to switch.

>> Last but not least, it's backwards compatibile with the
>> /message-id/* URI. The other one (/list/yyyy-mm/msg*.php) is
>> implemented, but I just realized that it has problems dealing with
>> the old archive weirdness (2009-12 shows also some messages dated
>> aug 2009 nov 2009 or jan 2010 for -hackers).
>
> I'm surprised about the Aug 2009 ones, but the others are explained
> because the site divides the mboxes using one timezone and the time
> displayed is a different timezone.  We don't really control the first
> one so there's nothing to do about it; but anyway it's not really
> important.

It's not a big deal, the BC-table approach will take care of those 
out-of-range messages. However there are a few messages in the hackers 
archive (and most likely others) that have wrong date headers (e.g. 
1980, 2036): we need to think what to do with them.



Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
On 30/01/2010 22:18, Joe Conway wrote:
> On 01/30/2010 01:14 PM, Dimitri Fontaine wrote:
>> Matteo Beccati<php@beccati.com>  writes:
>>> I've been following the various suggestions. Please take a look at the
>>> updated archives proof of concept:
>>>
>>> http://archives.beccati.org/
>>
>> I like the features a lot, and the only remarks I can think about are
>> bikeschedding, so I'll let it to the web team when they integrate it. It
>> sure looks like a when rather than an if as far as I'm concerned.
>>
>> In short, +1! And thanks a lot!
>
> +1 here too. That looks wonderful!

Thanks guys. Hopefully in the next few days I'll be able to catch up 
with Alvaro to see how we can proceed on this.

Incidentally, I've just found out that the mailing lists are dropping 
some messages. According to my qmail logs the AOX account never received 
Joe's message yesterday, nor quite a few others:

M156252, M156259, M156262, M156273, M156275

and I've verified that it also has happened before. I don't know why, 
but I'm pretty sure that my MTA was contacted only once for those 
messages, while normally I get two connections (my own address + aox 
address).


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Sat, Jan 30, 2010 at 22:43, Matteo Beccati <php@beccati.com> wrote:
> On 30/01/2010 17:54, Alvaro Herrera wrote:
>> * While I don't personally care, some are going to insist that the site
>> works with Javascript disabled.  I didn't try but from your description
>> it doesn't seem like it would.  Is this easily fixable?
>
> Date sorting works nicely even without JS, while thread sorting doesn't at
> all. I've just updated the PoC so that thread sorting is not available when
> JS is not available, while it still is the default otherwise. Hopefully
> that's enough to keep JS haters happy.

I haven't looked at how it actually works, but the general requirement
is that it has to *work* without JS. It doesn't have to work *as
well*. That means serving up a page with zero contents, or a page that
you can't navigate, is not acceptable. Requiring more clicks to get
around the navigation and things like that, are ok.


>> * The old monthly interface /list/yyyy-mm/msg*php is not really
>> necessary to keep, *except* that we need the existing URLs to redirect
>> to the corresponding new message page.  I think we should be able to
>> create a database of URL redirects from the old site, using the
>> Message-Id URL style.  So each message accessed using the old URL style
>> would require two redirects, but I don't think this is a problem.  Do
>> you agree?
>
> Sure. I was just hoping there was an even easier way (rescritct to month,
> order by uid limit 1 offset X). I guess it wouldn't be hard to write a
> script that populates a backward compatibility table. No need for double
> redirects, it'd be just a matter of adding a JOIN or two to the query.

Once we go into production on this, we'll need to do some serious
thinking about the caching issues. And in any such scenario we should
very much avoid serving up the same content under different URLs,
since it'll blow away cache space for no reason - it's much better to
throw a redirct.


>> * We're using Subversion to keep the current code.  Is your code
>> version-controlled?  We'd need to import your code there, I'm afraid.
>
> I do have a local svn repository. Given it's just a PoC that is going to be
> rewritten I don't think it should live in the official repo, but if you
> think id does, I'll be glad to switch.

Note that the plan is to switch pgweb to git as well. So if you just
want to push the stuff up during development so people can look at it,
register for a repository at git.postgresql.org - or just set one up
at github which is even easier.


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
On 31/01/2010 13:45, Magnus Hagander wrote:
> On Sat, Jan 30, 2010 at 22:43, Matteo Beccati<php@beccati.com>  wrote:
>> On 30/01/2010 17:54, Alvaro Herrera wrote:
>>> * While I don't personally care, some are going to insist that the site
>>> works with Javascript disabled.  I didn't try but from your description
>>> it doesn't seem like it would.  Is this easily fixable?
>>
>> Date sorting works nicely even without JS, while thread sorting doesn't at
>> all. I've just updated the PoC so that thread sorting is not available when
>> JS is not available, while it still is the default otherwise. Hopefully
>> that's enough to keep JS haters happy.
>
> I haven't looked at how it actually works, but the general requirement
> is that it has to *work* without JS. It doesn't have to work *as
> well*. That means serving up a page with zero contents, or a page that
> you can't navigate, is not acceptable. Requiring more clicks to get
> around the navigation and things like that, are ok.

As it currently stands, date sorting is the default and there are no 
links to the thread view, which would otherwise look empty. We can 
surely build a non-JS thread view as well, I'm just not sure if it's 
worth the effort.

>>> * The old monthly interface /list/yyyy-mm/msg*php is not really
>>> necessary to keep, *except* that we need the existing URLs to redirect
>>> to the corresponding new message page.  I think we should be able to
>>> create a database of URL redirects from the old site, using the
>>> Message-Id URL style.  So each message accessed using the old URL style
>>> would require two redirects, but I don't think this is a problem.  Do
>>> you agree?
>>
>> Sure. I was just hoping there was an even easier way (rescritct to month,
>> order by uid limit 1 offset X). I guess it wouldn't be hard to write a
>> script that populates a backward compatibility table. No need for double
>> redirects, it'd be just a matter of adding a JOIN or two to the query.
>
> Once we go into production on this, we'll need to do some serious
> thinking about the caching issues. And in any such scenario we should
> very much avoid serving up the same content under different URLs,
> since it'll blow away cache space for no reason - it's much better to
> throw a redirct.

Yes, that was my point. A single redirect to the only URL for the message.

>>> * We're using Subversion to keep the current code.  Is your code
>>> version-controlled?  We'd need to import your code there, I'm afraid.
>>
>> I do have a local svn repository. Given it's just a PoC that is going to be
>> rewritten I don't think it should live in the official repo, but if you
>> think id does, I'll be glad to switch.
>
> Note that the plan is to switch pgweb to git as well. So if you just
> want to push the stuff up during development so people can look at it,
> register for a repository at git.postgresql.org - or just set one up
> at github which is even easier.

The only reason why I used svn is that git support in netbeans is rather 
poor, or at least that was my impression. I think it won't be a problem 
to move to git, I probably just need some directions ;)


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
On Sun, Jan 31, 2010 at 15:09, Matteo Beccati <php@beccati.com> wrote:
> On 31/01/2010 13:45, Magnus Hagander wrote:
>>
>> On Sat, Jan 30, 2010 at 22:43, Matteo Beccati<php@beccati.com>  wrote:
>>>
>>> On 30/01/2010 17:54, Alvaro Herrera wrote:
>>>>
>>>> * While I don't personally care, some are going to insist that the site
>>>> works with Javascript disabled.  I didn't try but from your description
>>>> it doesn't seem like it would.  Is this easily fixable?
>>>
>>> Date sorting works nicely even without JS, while thread sorting doesn't
>>> at
>>> all. I've just updated the PoC so that thread sorting is not available
>>> when
>>> JS is not available, while it still is the default otherwise. Hopefully
>>> that's enough to keep JS haters happy.
>>
>> I haven't looked at how it actually works, but the general requirement
>> is that it has to *work* without JS. It doesn't have to work *as
>> well*. That means serving up a page with zero contents, or a page that
>> you can't navigate, is not acceptable. Requiring more clicks to get
>> around the navigation and things like that, are ok.
>
> As it currently stands, date sorting is the default and there are no links
> to the thread view, which would otherwise look empty. We can surely build a
> non-JS thread view as well, I'm just not sure if it's worth the effort.

Hmm. I personally think we need some level of thread support for
non-JS as well, that's at least not *too* much of a step backwards
from what we have now. But others may have other thoughts about that?


>>>> * We're using Subversion to keep the current code.  Is your code
>>>> version-controlled?  We'd need to import your code there, I'm afraid.
>>>
>>> I do have a local svn repository. Given it's just a PoC that is going to
>>> be
>>> rewritten I don't think it should live in the official repo, but if you
>>> think id does, I'll be glad to switch.
>>
>> Note that the plan is to switch pgweb to git as well. So if you just
>> want to push the stuff up during development so people can look at it,
>> register for a repository at git.postgresql.org - or just set one up
>> at github which is even easier.
>
> The only reason why I used svn is that git support in netbeans is rather
> poor, or at least that was my impression. I think it won't be a problem to
> move to git, I probably just need some directions ;)

:-)

Well, it doesn't matter what type of repo it's in at this point, only
once it goes into production. The reason I suggested git at this point
is that we (the postgresql project) do provide git hosting at
git.postgresql.org, but we don't provide subversion anywhere. And I'm
certainly not going to suggest you use pgfoundry and cvs....

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Alvaro Herrera
Date:
Matteo Beccati wrote:

> Incidentally, I've just found out that the mailing lists are
> dropping some messages. According to my qmail logs the AOX account
> never received Joe's message yesterday, nor quite a few others:
> 
> M156252, M156259, M156262, M156273, M156275
> 
> and I've verified that it also has happened before. I don't know
> why, but I'm pretty sure that my MTA was contacted only once for
> those messages, while normally I get two connections (my own address
> + aox address).

Hmm, I see it here:
http://archives.postgresql.org/message-id/4B64A238.1050307%40joeconway.com
Maybe it was just delayed?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
On 01/02/2010 03:27, Alvaro Herrera wrote:
> Matteo Beccati wrote:
>
>> Incidentally, I've just found out that the mailing lists are
>> dropping some messages. According to my qmail logs the AOX account
>> never received Joe's message yesterday, nor quite a few others:
>>
>> M156252, M156259, M156262, M156273, M156275
>>
>> and I've verified that it also has happened before. I don't know
>> why, but I'm pretty sure that my MTA was contacted only once for
>> those messages, while normally I get two connections (my own address
>> + aox address).
>
> Hmm, I see it here:
> http://archives.postgresql.org/message-id/4B64A238.1050307%40joeconway.com
> Maybe it was just delayed?

But not here:

http://archives.beccati.org/message-id/4B64A238.1050307%40joeconway.com

Anyway, I guess that on production we'll have a better way to inject 
emails into Archiveopteryx rather than relying on a email subscription, 
which seems a bit fragile. It's been ages since I last set up majordomo, 
but I guess there should be a way to also pipe outgoing messages through 
a script that performs the delivery to AOX.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
2010/2/1 Matteo Beccati <php@beccati.com>:
> On 01/02/2010 03:27, Alvaro Herrera wrote:
>>
>> Matteo Beccati wrote:
>>
>>> Incidentally, I've just found out that the mailing lists are
>>> dropping some messages. According to my qmail logs the AOX account
>>> never received Joe's message yesterday, nor quite a few others:
>>>
>>> M156252, M156259, M156262, M156273, M156275
>>>
>>> and I've verified that it also has happened before. I don't know
>>> why, but I'm pretty sure that my MTA was contacted only once for
>>> those messages, while normally I get two connections (my own address
>>> + aox address).
>>
>> Hmm, I see it here:
>> http://archives.postgresql.org/message-id/4B64A238.1050307%40joeconway.com
>> Maybe it was just delayed?
>
> But not here:
>
> http://archives.beccati.org/message-id/4B64A238.1050307%40joeconway.com
>
> Anyway, I guess that on production we'll have a better way to inject emails into Archiveopteryx rather than relying
ona email subscription, which seems a bit fragile. It's been ages since I last set up majordomo, but I guess there
shouldbe a way to also pipe outgoing messages through a script that performs the delivery to AOX. 

Does the MBOX importer support incremental loading? Because majordomo
spits out MBOX files for us already.

One option could be to use SMTP with a subscription as the primary way
(and we could set up a dedicated relaying from the mailserver for this
of course, so it's not subject to graylisting or anything like that),
and then daily or so load the MBOX files to cover anything that was
lost?

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
On 01/02/2010 10:26, Magnus Hagander wrote:
> Does the MBOX importer support incremental loading? Because majordomo
> spits out MBOX files for us already.

Unfortunately the aoximport shell command doesn't support incremental 
loading.

> One option could be to use SMTP with a subscription as the primary way
> (and we could set up a dedicated relaying from the mailserver for this
> of course, so it's not subject to graylisting or anything like that),
> and then daily or so load the MBOX files to cover anything that was
> lost?

I guess we could write a script that parses the mbox and adds whatever 
is missing, as long as we keep it as a last resort if we can't make the 
primary delivery a fail proof.

My main concern is that we'd need to overcomplicate the thread detection 
algorithm so that it better deals with delayed messages: as it currently 
works, the replies to a missing message get linked to the 
"grand-parent". Injecting the missing message afterwards will put it at 
the same level as its replies. If it happens only once in a while I 
guess we can live with it, but definitely not if it happens tens of 
times a day.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
2010/2/1 Matteo Beccati <php@beccati.com>:
> On 01/02/2010 10:26, Magnus Hagander wrote:
>>
>> Does the MBOX importer support incremental loading? Because majordomo
>> spits out MBOX files for us already.
>
> Unfortunately the aoximport shell command doesn't support incremental loading.
>
>> One option could be to use SMTP with a subscription as the primary way
>> (and we could set up a dedicated relaying from the mailserver for this
>> of course, so it's not subject to graylisting or anything like that),
>> and then daily or so load the MBOX files to cover anything that was
>> lost?
>
> I guess we could write a script that parses the mbox and adds whatever is missing, as long as we keep it as a last
resortif we can't make the primary delivery a fail proof. 
>
> My main concern is that we'd need to overcomplicate the thread detection algorithm so that it better deals with
delayedmessages: as it currently works, the replies to a missing message get linked to the "grand-parent". Injecting
themissing message afterwards will put it at the same level as its replies. If it happens only once in a while I guess
wecan live with it, but definitely not if it happens tens of times a day. 

That can potentially be a problem.

Consider the case where message A it sent. Mesasge B is a response to
A, and message C is a response to B. Now assume B is held for
moderation (because the poser is not on the list, or because it trips
some other thing), then message C will definitely arrive before
message B. Is that going to cause problems with this method?

Another case where the same thing will happen is if message delivery
of B gets for example graylisted, or is just slow from sender B, but
gets quickly delivered to the author of message A (because of a direct
CC). In this case, the author of message A may respond to it (making
message D),and this will again arrive before message B because author
A is not graylisted.

So the system definitely needs to deal with out-of-order delivery.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
On 01/02/2010 15:03, Magnus Hagander wrote:
> 2010/2/1 Matteo Beccati<php@beccati.com>:
>> My main concern is that we'd need to overcomplicate the thread detection algorithm so that it better deals with
delayedmessages: as it currently works, the replies to a missing message get linked to the "grand-parent". Injecting
themissing message afterwards will put it at the same level as its replies. If it happens only once in a while I guess
wecan live with it, but definitely not if it happens tens of times a day.
 
>
> That can potentially be a problem.
>
> Consider the case where message A it sent. Mesasge B is a response to
> A, and message C is a response to B. Now assume B is held for
> moderation (because the poser is not on the list, or because it trips
> some other thing), then message C will definitely arrive before
> message B. Is that going to cause problems with this method?
>
> Another case where the same thing will happen is if message delivery
> of B gets for example graylisted, or is just slow from sender B, but
> gets quickly delivered to the author of message A (because of a direct
> CC). In this case, the author of message A may respond to it (making
> message D),and this will again arrive before message B because author
> A is not graylisted.
>
> So the system definitely needs to deal with out-of-order delivery.

Hmm, it looks like I didn't factor in direct CCs when thinking about 
potential problems with the simplified algorithm. Thanks for raising that.

I'll be out of town for a few days, but I will see what I can do when I 
get back.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
2010/2/1 Matteo Beccati <php@beccati.com>:
> On 01/02/2010 15:03, Magnus Hagander wrote:
>>
>> 2010/2/1 Matteo Beccati<php@beccati.com>:
>>>
>>> My main concern is that we'd need to overcomplicate the thread detection algorithm so that it better deals with
delayedmessages: as it currently works, the replies to a missing message get linked to the "grand-parent". Injecting
themissing message afterwards will put it at the same level as its replies. If it happens only once in a while I guess
wecan live with it, but definitely not if it happens tens of times a day. 
>>
>> That can potentially be a problem.
>>
>> Consider the case where message A it sent. Mesasge B is a response to
>> A, and message C is a response to B. Now assume B is held for
>> moderation (because the poser is not on the list, or because it trips
>> some other thing), then message C will definitely arrive before
>> message B. Is that going to cause problems with this method?
>>
>> Another case where the same thing will happen is if message delivery
>> of B gets for example graylisted, or is just slow from sender B, but
>> gets quickly delivered to the author of message A (because of a direct
>> CC). In this case, the author of message A may respond to it (making
>> message D),and this will again arrive before message B because author
>> A is not graylisted.
>>
>> So the system definitely needs to deal with out-of-order delivery.
>
> Hmm, it looks like I didn't factor in direct CCs when thinking about potential problems with the simplified
algorithm.Thanks for raising that. 

That is a very common scenario. And even without that, email taking
different time to get delivered to majordomo is not at all uncomoon.


> I'll be out of town for a few days, but I will see what I can do when I get back.

No rush.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Tom Lane
Date:
Matteo Beccati <php@beccati.com> writes:
> My main concern is that we'd need to overcomplicate the thread detection 
> algorithm so that it better deals with delayed messages: as it currently 
> works, the replies to a missing message get linked to the 
> "grand-parent". Injecting the missing message afterwards will put it at 
> the same level as its replies. If it happens only once in a while I 
> guess we can live with it, but definitely not if it happens tens of 
> times a day.

That's quite common unfortunately --- I think you're going to need to
deal with the case.  Even getting a direct feed from the mail relays
wouldn't avoid it completely: consider cases like
* A sends a message* B replies, cc'ing A and the list* B's reply to list is delayed by greylisting* A replies to B's
reply(cc'ing list)* A's reply goes through immediately* B's reply shows up a bit later
 

That happens pretty frequently IME.
        regards, tom lane


Re: mailing list archiver chewing patches

From
Robert Haas
Date:
On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Matteo Beccati <php@beccati.com> writes:
>> My main concern is that we'd need to overcomplicate the thread detection
>> algorithm so that it better deals with delayed messages: as it currently
>> works, the replies to a missing message get linked to the
>> "grand-parent". Injecting the missing message afterwards will put it at
>> the same level as its replies. If it happens only once in a while I
>> guess we can live with it, but definitely not if it happens tens of
>> times a day.
>
> That's quite common unfortunately --- I think you're going to need to
> deal with the case.  Even getting a direct feed from the mail relays
> wouldn't avoid it completely: consider cases like
>
>        * A sends a message
>        * B replies, cc'ing A and the list
>        * B's reply to list is delayed by greylisting
>        * A replies to B's reply (cc'ing list)
>        * A's reply goes through immediately
>        * B's reply shows up a bit later
>
> That happens pretty frequently IME.

Yeah - and sometimes the delay can be DAYS.

...Robert


Re: mailing list archiver chewing patches

From
Robert Haas
Date:
On Mon, Feb 1, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>        * A sends a message
>>>        * B replies, cc'ing A and the list
>>>        * B's reply to list is delayed by greylisting
>>>        * A replies to B's reply (cc'ing list)
>>>        * A's reply goes through immediately
>>>        * B's reply shows up a bit later
>>>
>>> That happens pretty frequently IME.
>
>> Yeah - and sometimes the delay can be DAYS.
>
> Greylisting wouldn't explain a delay of more than an hour or so.
> OTOH, if B's reply got held for moderation for some reason, then
> yeah it could be days :-(.  But in that case the rest of the list
> didn't see it in real-time either, so having it show up out of
> "logical" sequence in the archive doesn't seem like a terrible
> reflection of reality.  I'm just concerned about the threading not
> being sensitive to skews on the order of a few minutes --- those
> are extremely common.

I not infrequently receive messages out of sequence by time periods
well in excess of a few minutes.

Don't know why, but I do.

...Robert


Re: mailing list archiver chewing patches

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>        * A sends a message
>>        * B replies, cc'ing A and the list
>>        * B's reply to list is delayed by greylisting
>>        * A replies to B's reply (cc'ing list)
>>        * A's reply goes through immediately
>>        * B's reply shows up a bit later
>> 
>> That happens pretty frequently IME.

> Yeah - and sometimes the delay can be DAYS.

Greylisting wouldn't explain a delay of more than an hour or so.
OTOH, if B's reply got held for moderation for some reason, then
yeah it could be days :-(.  But in that case the rest of the list
didn't see it in real-time either, so having it show up out of
"logical" sequence in the archive doesn't seem like a terrible
reflection of reality.  I'm just concerned about the threading not
being sensitive to skews on the order of a few minutes --- those
are extremely common.
        regards, tom lane


Re: mailing list archiver chewing patches

From
Magnus Hagander
Date:
2010/2/1 Robert Haas <robertmhaas@gmail.com>:
> On Mon, Feb 1, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>        * A sends a message
>>>>        * B replies, cc'ing A and the list
>>>>        * B's reply to list is delayed by greylisting
>>>>        * A replies to B's reply (cc'ing list)
>>>>        * A's reply goes through immediately
>>>>        * B's reply shows up a bit later
>>>>
>>>> That happens pretty frequently IME.
>>
>>> Yeah - and sometimes the delay can be DAYS.
>>
>> Greylisting wouldn't explain a delay of more than an hour or so.
>> OTOH, if B's reply got held for moderation for some reason, then
>> yeah it could be days :-(.  But in that case the rest of the list
>> didn't see it in real-time either, so having it show up out of
>> "logical" sequence in the archive doesn't seem like a terrible
>> reflection of reality.  I'm just concerned about the threading not
>> being sensitive to skews on the order of a few minutes --- those
>> are extremely common.
>
> I not infrequently receive messages out of sequence by time periods
> well in excess of a few minutes.
>
> Don't know why, but I do.

Quite often, it's stuck in the moderation queue.

Not quite as often, but still fairly frequently, it's stuck somewhere
in the hub.org relaying/antispam blackbox.


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: mailing list archiver chewing patches

From
Matteo Beccati
Date:
On 01/02/2010 17:28, Tom Lane wrote:
> Matteo Beccati<php@beccati.com>  writes:
>> My main concern is that we'd need to overcomplicate the thread detection
>> algorithm so that it better deals with delayed messages: as it currently
>> works, the replies to a missing message get linked to the
>> "grand-parent". Injecting the missing message afterwards will put it at
>> the same level as its replies. If it happens only once in a while I
>> guess we can live with it, but definitely not if it happens tens of
>> times a day.
>
> That's quite common unfortunately --- I think you're going to need to
> deal with the case.  Even getting a direct feed from the mail relays
> wouldn't avoid it completely: consider cases like
>
>     * A sends a message
>     * B replies, cc'ing A and the list
>     * B's reply to list is delayed by greylisting
>     * A replies to B's reply (cc'ing list)
>     * A's reply goes through immediately
>     * B's reply shows up a bit later
>
> That happens pretty frequently IME.

I've improved the threading algorithm by keeping an ordered backlog of 
unresolved references, i.e. when a message arrives:

1. Search for a parent message using:

1a. In-Reply-To header. If referenced message is not found insert its 
Message-Id to the backlog table with position 0

1b. References header. For each missing referenced message insert its 
Message-Id to the backlog table with position N

1c. MS Exchange Thread-Index and Thread-Topic headers

2. Message is stored along with its parent ID, if any.

3. Compare the Message-Id header with the backlog table. Update the 
parent field of any referencing message and clean up positions >= n in 
the references table.

Now I just need some time to do a final clean up and I'd be ready to 
publish the code, which hopefully will be clearer than my words ;)


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/