Thread: mailing list archiver chewing patches
Tim Bunce's recent patch has apparently been mangled by the list archives. He sent it as an attachment, and that's how I have it in my mailbox, so why isn't it appearing as such in the web archive so that it can be nicely downloaded? See <http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>. It's happened to other people as well: <http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi> Reviewers and others shouldn't have to c&p patches from web pages, especially when it will be horribly line wrapped etc. Can we stop this happening somehow? cheers andrew
Andrew Dunstan wrote: > > Tim Bunce's recent patch has been mangled apparently by the list > archives. He sent it as an attachment, and that's how I have it in > my mailbox, so why isn't it appearing as such in the web archive so > that it can be nicely downloaded? See <http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>. > It's happened to other people as well: > <http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi> > > Reviewers and others shouldn't have to c&p patches from web pages, > especially when it will be horribly line wrapped etc. Can we stop > this happening somehow? Try this http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Andrew Dunstan wrote: > > > > Tim Bunce's recent patch has been mangled apparently by the list > > archives. He sent it as an attachment, and that's how I have it in > > my mailbox, so why isn't it appearing as such in the web archive so > > that it can be nicely downloaded? See <http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>. > > It's happened to other people as well: > > <http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi> > > > > Reviewers and others shouldn't have to c&p patches from web pages, > > especially when it will be horribly line wrapped etc. Can we stop > > this happening somehow? > > Try this > > http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local This was previously broken for a lot of emails, but I just fixed some of it, and it seems to work for the vast majority of our emails (and certainly for all emails that matter). The other point related to this is that each email should have a link pointing to its text/plain version. This used to be present, but it got broken (I think) at the same time that the anti-email-harvesting measure got broken. I'm going to look at that next. Let me know if you find something broken with this style of link. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sat, Jan 09, 2010 at 02:17:27AM -0300, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > Andrew Dunstan wrote: > > > > > > Tim Bunce's recent patch has been mangled apparently by the list > > > archives. He sent it as an attachment, and that's how I have it in > > > my mailbox, so why isn't it appearing as such in the web archive so > > > that it can be nicely downloaded? See <http://archives.postgresql.org/message-id/20100108124613.GL2505@timac.local>. > > > It's happened to other people as well: > > > <http://archives.postgresql.org/message-id/4B02D3E4.1040107@hut.fi> > > > > > > Reviewers and others shouldn't have to c&p patches from web pages, > > > especially when it will be horribly line wrapped etc. Can we stop > > > this happening somehow? > > > > Try this > > > > http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local That looks like it dumps the raw message. That'll cause problems for any messages using quoted-printable encoding. I'd hazard a guess it also won't do the right thing for non-charset=us-ascii emails/attachments. > This was previously broken for a lot of emails, but I just fixed some of > it, and it seems to work for the vast majority of our emails (and > certainly for all emails that matter). > > The other point related to this is that each email should have a link > pointing to its text/plain version. This used to be present, but it got > broken (I think) at the same time that the anti-email-harvesting measure > got broken. I'm going to look at that next. > > Let me know if you find something broken with this style of link. What's needed is a) a download link for each attachment, regardless of the kind of attachment, and b) the download link should download the content of the attachment in a way that's directly usable.
For example, see http://archives.postgresql.org/pgsql-hackers/2010-01/msg00589.php

Looking at the raw version of the original message http://archives.postgresql.org/msgtxt.php?id=757953.70187.qm@web29001.mail.ird.yahoo.com that message has a patch as an attachment:

Content-Type: application/octet-stream; name="patch_bit.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="patch_bit.patch"

It gets a link in the archive (because it's a non-text content-type, I presume): http://archives.postgresql.org/pgsql-hackers/2010-01/bin5ThVOJC3jI.bin but the link doesn't work well. The URL ends with .bin and the HTTP response content-type is application/octet-stream, so downloaders get a .bin file instead of the original .patch file.

It seems that people wanting to send in a patch have two options: send it as text/(something) so it's readable on the archive web page but not copy-n-paste'able because of word wrapping, or set it as application/octet-stream so it's downloadable but not readable on the web page. Let me know if I've misunderstood anything.

Some suggestions:

- Provide links for all attachments, whether text/* or not.
- For text/* types show the content inline verbatim; don't wrap the text.
- If the attachment has a Content-Disposition with a filename then append that to the URL. It could simply be a fake 'pathinfo': .../2010-01/bin5ThVOJC3jI.bin/patch_bit.patch
- Instead of "Description: Binary data" on the web page, give the values of the Content-Type and Content-Disposition headers.

Tim.

p.s. For background... I'm writing an email to the dbi-users & dbi-announce mailing lists (~2000 & ~5000 users last time I checked) asking anyone who might be interested to help review the plperl feature patch and encouraging them to contribute to the commitfest review process for other patches. It's important that it's *very* easy for these newcomers to follow simple instructions to get involved.
I was hoping to be able to use an archives.postgresql.org URL to the message with the patch to explain what the patch does _and_ provide a download link. It seems I'll have to upload the patch somewhere else.
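[Editor's note: the filename suggestion above can be sketched with Python's standard email module. The message, boundary, and URL below are made-up examples, not real archive data; the point is that `get_filename()` already recovers the Content-Disposition filename an archiver would need for the fake-pathinfo link Tim describes.]

```python
from email import message_from_string

# Hypothetical message mirroring the headers Tim quotes; all values here
# are made up for illustration.
raw = (
    "From: sender@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="sep"\n'
    "\n"
    "--sep\n"
    "Content-Type: text/plain\n"
    "\n"
    "Patch attached.\n"
    "--sep\n"
    'Content-Type: application/octet-stream; name="patch_bit.patch"\n'
    "Content-Transfer-Encoding: base64\n"
    'Content-Disposition: attachment; filename="patch_bit.patch"\n'
    "\n"
    "LS0tIGEvZmlsZQorKysgYi9maWxl\n"
    "--sep--\n"
)

msg = message_from_string(raw)
attachments = {}
for part in msg.walk():
    filename = part.get_filename()  # parsed from Content-Disposition / name=
    if filename:
        # decode=True undoes the base64 transfer encoding
        attachments[filename] = part.get_payload(decode=True)

# The fake-pathinfo link: append the recovered filename to the download URL.
url = "/pgsql-hackers/2010-01/bin5ThVOJC3jI.bin/" + next(iter(attachments))
```

With the filename appended this way, a browser saving the link gets `patch_bit.patch` rather than an opaque `.bin` file.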
Tim Bunce wrote: > It seems that people wanting to send in a patch have two options: send > it as text/(something) so it's readable on the archive web page but not > copy-n-paste'able because of wordwrapping, or set it as > application/octet-stream so it's downloadable but not readable on the > web page. > > > That is assuming that the MUA gives you the option of specifying the attachment MIME type. Many (including mine) do not. It would mean an extra step - I'd have to gzip each patch or something like that. That would be unfortunate, as well as imposing extra effort, because it would make the patch not display inline in many MUAs (again, like mine). cheers andrew
Tim Bunce wrote: > > > Try this > > > > > > http://archives.postgresql.org/msgtxt.php?id=20100108124613.GL2505@timac.local > > That looks like it dumps the raw message. That'll cause problems for any > messages using quoted-printable encoding. I'd hazard a guess it also > won't do the right thing for non-charset=us-ascii emails/attachments. Yeah. Grab it and open it as an mbox. > What's needed is a) a download link for each attachment, regardless of the > kind of attachment, and b) the download link should download the content > of the attachment in a way that's directly usable. Yeah, well, that's a bit outside what I am able to do, unless you can get a MHonArc expert somewhere who can help us figure out how to set it up for these requirements. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
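[Editor's note: the quoted-printable concern can be illustrated with a short sketch using Python's standard email module; the message below is a made-up example, not one from the archives. A raw dump would show the `=C3=A9` escape verbatim, while a proper MIME decode recovers the intended character.]

```python
from email import message_from_string

# A made-up quoted-printable message for illustration.
raw = (
    "From: sender@example.com\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=C3=A9 is not plain ASCII\n"
)

msg = message_from_string(raw)
# get_payload(decode=True) undoes the transfer encoding; the charset
# parameter then tells us how to turn the raw bytes into text.
body = msg.get_payload(decode=True).decode(msg.get_content_charset())
print(body)  # prints: café is not plain ASCII
```

The same two-step decode applies to base64-encoded attachments, which is why serving the raw message bytes is not directly usable.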
Andrew Dunstan <andrew@dunslane.net> writes: > That is assuming that the MUA gives you the option of specifying the > attachment MIME type. Many (including mine) do not. It would mean an extra > step - I'd have to gzip each patch or something like that. That would be > unfortunate, as well as imposing extra effort, because it would make the > patch not display inline in many MUAs (again, like mine). Bad MUA, change MUA, or so they say… More seriously though, it's not the first time we've had some difficulties with the MHonArc setup, and I think it's also related to the poor thread following on the archives website at month boundaries. MHonArc (http://hydra.nac.uci.edu/indiv/ehood/mhonarc.html) seems to be about converting the mails into some HTML pages, and offering the web interface to get to use them, with some indexing and search facilities. Are our indexing and searches provided by MHonArc or maintained by the community? How helpful would considering alternatives be, such as AOX (which runs atop PostgreSQL and would offer an anonymous IMAP facility over the archives)? Of course it'll boil down to who's maintaining the current solution and how much time is allocated to this; the solution research and migration would have to fit in there, I suppose. Same as pgfoundry. But still, should we talk about it? Regards, -- dim
Dimitri Fontaine wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > That is assuming that the MUA gives you the option of specifying the > > attachment MIME type. Many (including mine) do not. It would mean an extra > > step - I'd have to gzip each patch or something like that. That would be > > unfortunate,as well as imposing extra effort, because it would make the > > patch not display inline in many MUAs (again, like mine). > > Bad MUA, change MUA, or what they say… > > More seriously though, it's not the first time we're having some > difficulties with the MHonArc setup, and I think it's also related to > the poor thread following on the archives website at month boundaries. Absolutely. The month boundary problem boils down to the fact that Mhonarc does not scale very well, so we can't have mboxes that are too large. This is why most people split their archives per month, and then each month is published as an independent Mhonarc output archive. It's a horrid solution. > Are our indexing and searches provided by MHonArc or maintained by the > community? Searches are completely external to mhonarc. > How helpful considering alternatives, such as AOX (which runs > atop PostgreSQL and would offer anonymous IMAP facility over the > archives) would be? > > Of course it'll boil down to who's maintaining the current solution and > how much time is allocated to this, the solution research and migration > would have to fit in there I suppose. Same as pgfoundry. But still, > should we talk about it? There's some talk about writing our own archiving system, database-backed. There have been a few false starts but no concrete result so far. We need a lot more manpower invested in this problem. If there's interest, let's talk about it. My daughter was born yesterday and I'm having a bit of a calm before the storm because she's not coming home until Tuesday or so (at this time of the day, that is, because I have to take care of the other daughter).
I'll probably be away for (at least) a week when she does, and I'll probably have somewhat of a shortage of spare time after that. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
2010/1/11 Alvaro Herrera <alvherre@commandprompt.com>: > Dimitri Fontaine wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >> > That is assuming that the MUA gives you the option of specifying the >> > attachment MIME type. Many (including mine) do not. It would mean an extra >> > step - I'd have to gzip each patch or something like that. That would be >> > unfortunate,as well as imposing extra effort, because it would make the >> > patch not display inline in many MUAs (again, like mine). >> >> Bad MUA, change MUA, or what they say… >> >> More seriously though, it's not the first time we're having some >> difficulties with the MHonArc setup, and I think it's also related to >> the poor thread following on the archives website at month boundaries. > > Absolutely. The month boundary problem boils down to the fact that > Mhonarc does not scale very well, so we can't have mboxes that are too > large. This is why most people split their archives per month, and then > each month is published as an independent Mhonarc output archive. It's > a horrid solution. Yeah. >> Are our indexing and searches provided by MHonArc or maintained by the >> community? > > Searches are completely external to mhonarc. It is, but it's tied into the format of the URLs and the format of the actual messages in order to be more efficient. But it should be fairly easy to adapt it to some other base system if we want. >> How helpful considering alternatives, such as AOX (which runs >> atop PostgreSQL and would offer anonymous IMAP facility over the >> archives) would be? >> >> Of course it'll boil down to who's maintaining the current solution and >> how much time is allocated to this, the solution research and migration >> would have to fit in there I suppose. Same as pgfoundry. But still, >> should we talk about it? > > There's some talk about writing our own archiving system, > database-backed. There have been a few false starts but no concrete > result so far. 
> We need a lot more manpower invested in this problem. > If there's interest, let's talk about it. Yeah, definitely, let's talk about it. Anything that gives us an efficient backend with a good API is interesting (SQL is a reasonably good API. Not so sure about IMAP, since it is a bit too focused on single messages IIRC). Particularly, something that can separate frontend and backend (can still be on the same machine of course, I'm talking conceptually) seems to be a lot more flexible, which we'd like. As for AOX, my understanding is that it is no longer maintained, so I'd be worried about choosing such a solution for a complex problem. But it's open for discussion. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Alvaro Herrera <alvherre@commandprompt.com> writes: > Absolutely. The month boundary problem boils down to the fact that > Mhonarc does not scale very well, so we can't have mboxes that are too > large. This is why most people split their archives per month, and then > each month is published as an independent Mhonarc output archive. It's > a horrid solution. > >> Are our indexing and searches provided by MHonArc or maintained by the >> community? > > Searches are completely external to mhonarc. Changing the MHonArc solution would probably mean adapting them, I guess, or proposing a new solution with compatible output for the searching to still work… >> How helpful considering alternatives, such as AOX (which runs >> atop PostgreSQL and would offer anonymous IMAP facility over the >> archives) would be? >> >> Of course it'll boil down to who's maintaining the current solution and >> how much time is allocated to this, the solution research and migration >> would have to fit in there I suppose. Same as pgfoundry. But still, >> should we talk about it? > > There's some talk about writing our own archiving system, > database-backed. There have been a few false starts but no concrete > result so far. We need a lot more manpower invested in this problem. > If there's interest, let's talk about it. AOX is already a database-backed email solution, offering an archive page with searching. I believe the searching is backed by tsearch indexing. That's why I think it'd be suitable. They already archive and offer search over one of our mailing lists, and from there it seems like we'd only miss the user interface bits: http://archives.aox.org/archives/pgsql-announce I hope the UI bits are not the most time-demanding ones. Is there someone with enough time to install aox somewhere and have it subscribed to our lists?
> My daughter was born yesterday and I'm having a bit of a calm before the > storm because she's not coming home until Tuesday or so (at this time of > the day, that is, because I have to take care of the other daughter). > I'll be probably away for (at least) a week when she does; and I'll > probably have somewhat of a shortage of spare time after that. Ahaha :) IME it's not the shortage of spare time that ruins you the most so much as the lack of energy when you do get that precious resource again, a few pieces of it at a time. Regards, -- dim
Hi, On 11/01/2010 11:18, Dimitri Fontaine wrote: > AOX is already a database-backed email solution, offering an archive > page with searching. I believe the searching is backed by tsearch > indexing. That's why I think it'd be suitable. > > They already archive and offer search over one of our mailing lists, and > from there it seems like we'd only miss the user interface bits: > > http://archives.aox.org/archives/pgsql-announce > > I hope the UI bits are not the most time-demanding ones. > > Is there someone with enough time to install aox somewhere and have it > subscribed to our lists? I recall having tried AOX a long time ago but I can't remember the reason why I was not satisfied. I guess I can give it another try by setting up a test ML archive. >> My daughter was born yesterday and I'm having a bit of a calm before the >> storm because she's not coming home until Tuesday or so (at this time of >> the day, that is, because I have to take care of the other daughter). >> I'll be probably away for (at least) a week when she does; and I'll >> probably have somewhat of a shortage of spare time after that. BTW, congrats Alvaro! Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
On Mon, Jan 11, 2010 at 5:23 PM, Matteo Beccati <php@beccati.com> wrote: > Hi, > > On 11/01/2010 11:18, Dimitri Fontaine wrote: >> >> AOX is already a database-backed email solution, offering an archive >> page with searching. I believe the searching is backed by tsearch >> indexing. That's why I think it'd be suitable. >> >> They already archive and offer search over one of our mailing lists, and >> from there it seems like we'd only miss the user interface bits: >> >> http://archives.aox.org/archives/pgsql-announce >> >> I hope the UI bits are not the most time demanding one. >> >> Is there someone with enough time to install aox somewhere and have it >> subscribed to our lists? > > I recall having tried AOX a long time ago but I can't remember the reason > why I was not satisfied. I guess I can give another try by setting up a test > ML archive. I tried it too, before I started writing the new prototype archiver from scratch. I too forget why I gave up on it, but it was a strong enough reason for me to start coding from scratch. BTW, we only need to replace the archiver/display code. The search works well already. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
On 11/01/2010 12:58, Dave Page wrote: > On Mon, Jan 11, 2010 at 5:23 PM, Matteo Beccati<php@beccati.com> wrote: >> I recall having tried AOX a long time ago but I can't remember the reason >> why I was not satisfied. I guess I can give another try by setting up a test >> ML archive. > > I tried it too, before I started writing the new prototype archiver > from scratch. I too forget why I gave up on it, but it was a strong > enough reason for me to start coding from scratch. > > BTW, we only need to replace the archiver/display code. The search > works well already. It took me no more than 10 minutes to set up AOX and hook it up to a domain. An email account is now subscribed to the hackers ML. I'll try to estimate how hard it could be to write a web app that displays the archive from the db, even though I'm not sure that this is a good way to proceed. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
Magnus Hagander <magnus@hagander.net> writes: > As for AOX, my understanding is that it is no longer maintained, so > I'd be worried about choosing such a solution for a complex problem. > But it's open for discussion. Ouch. -- dim
Dave Page <dpage@pgadmin.org> writes: >> I recall having tried AOX a long time ago but I can't remember the reason >> why I was not satisfied. I guess I can give another try by setting up a test >> ML archive. > > I tried it too, before I started writing the new prototype archiver > from scratch. I too forget why I gave up on it, but it was a strong > enough reason for me to start coding from scratch. > > BTW, we only need to replace the archiver/display code. The search > works well already. What does the current archiver look like? A PG database containing the raw mails and attachments? If that's the case, the missing piece would be to plug a browsing UI atop of that, right? Regards, -- dim
2010/1/11 Dimitri Fontaine <dfontaine@hi-media.com>: > Dave Page <dpage@pgadmin.org> writes: >>> I recall having tried AOX a long time ago but I can't remember the reason >>> why I was not satisfied. I guess I can give another try by setting up a test >>> ML archive. >> >> I tried it too, before I started writing the new prototype archiver >> from scratch. I too forget why I gave up on it, but it was a strong >> enough reason for me to start coding from scratch. >> >> BTW, we only need to replace the archiver/display code. The search >> works well already. > > What the current archiver looks like? A PG database containing the raw > mails and attachements? It that's the case the missing piece would be to > plug a browsing UI atop of that, right? No, the current archiver is a set of MBOX files that are processed incrementally by mhonarc. (yes, this is why it doesn't scale) *search* is in a postgresql database, but it doesn't contain the entire messages - doesn't have attachments, for example - only the parts it has web-scraped off the current archives. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Magnus Hagander <magnus@hagander.net> writes: >> As for AOX, my understanding is that it is no longer maintained, so >> I'd be worried about choosing such a solution for a complex problem. > > Ouch. It seems that the company backing the development is dead, but the developers are still working on the product in their spare time. New release ahead. They're not working on the archive UI part. -- dim
Magnus Hagander wrote: > > No, the current archiver is a set of MBOX files that are processed > incrementally by mhonarc. > > (yes, this is why it doesn't scale) > > *search* is in a postgresql database, but it doesn't contain the > entire messages - doesn't have attachments for examples - only the > parts it has web-scraped off the the current archives. > Fixing this mess and giving us decent archives with guaranteed downloadable patches and good search would be a nice job for someone who wants to contribute without having to cut or review core code. cheers andrew
(Many thanks to Dimitri for bringing this thread to my attention.) At 2010-01-11 10:46:10 +0100, magnus@hagander.net wrote: > > As for AOX, my understanding is that it is no longer maintained, so > I'd be worried about choosing such a solution for a complex problem. I'll keep this short: Oryx, the company behind Archiveopteryx (aox), is no longer around, but the software is still maintained. The developers (myself included) are still interested in keeping it alive. It's been a while since the last release, but it'll be ready soon. If you're having any sort of problems with it, write to me, and I'll help you. (That said, we're not working on the web interface. It did work, in its limited fashion, but it's not feature complete; and I need to find some paying work, so it's not a priority. That, and some health problems, are also why I haven't been active on the pg lists for a while.) Feel free to write to me off-list for more. -- ams
On 11/01/2010 15:00, Abhijit Menon-Sen wrote: > I'll keep this short: Oryx, the company behind Archiveopteryx (aox), is > no longer around, but the software is still maintained. The developers > (myself included) are still interested in keeping it alive. It's been a > while since the last release, but it'll be ready soon. If you're having > any sort of problems with it, write to me, and I'll help you. That's good news indeed for the project. AOX seems to be working fine on my server. I've had a few IMAP glitches, but it seems to live happily with my qmail and stores the emails in the db, fulfilling my current needs. So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list archives reading from the AOX database: http://archives.beccati.org/ Please take it as an exercise I've made trying to learn how to use symfony this afternoon. It's not feature complete, nor probably very scalable, but at least it features attachment download ;) http://archives.beccati.org/pgsql-hackers/message/37 Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
2010/1/12 Matteo Beccati <php@beccati.com>: > So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list archives reading from the AOX database: > > http://archives.beccati.org/ Seems to work. > Please take it as an exercise I've made trying to learn how to use symfony this afternoon. It's not feature complete, nor probably very scalable, but at least it features attachment download ;) > > http://archives.beccati.org/pgsql-hackers/message/37 :-) So just to put this into perspective and give anyone paying attention an idea of the pain that lies ahead should they decide to work on this: - We need to import the old archives (of which there are hundreds of thousands of messages, the first few years of which have, umm, minimal headers). - We need to generate thread indexes - We need to re-generate the original URLs for backwards compatibility Now there's encouragement :-) -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
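[Editor's note: the import step Dave mentions can be sketched with Python's standard mailbox module; the two-message mbox below is a made-up sample, not real archive data. Old messages with minimal headers simply come back with the missing fields as None, which any importer has to tolerate.]

```python
import mailbox
import os
import tempfile

# A made-up two-message mbox standing in for the project's monthly mbox files.
sample = (
    "From alice@example.com Mon Jan 11 10:00:00 2010\n"
    "From: alice@example.com\n"
    "Message-ID: <1@example.com>\n"
    "Subject: first\n"
    "\n"
    "hello\n"
    "\n"
    "From bob@example.com Mon Jan 11 11:00:00 2010\n"
    "From: bob@example.com\n"
    "Message-ID: <2@example.com>\n"
    "In-Reply-To: <1@example.com>\n"
    "Subject: Re: first\n"
    "\n"
    "hi\n"
)

path = os.path.join(tempfile.mkdtemp(), "sample.mbox")
with open(path, "w") as f:
    f.write(sample)

# mailbox.mbox iterates the messages in file order; absent headers
# (common in the earliest archives) are returned as None, not an error.
msgs = [(m["Message-ID"], m["In-Reply-To"]) for m in mailbox.mbox(path)]
```

The Message-ID / In-Reply-To pairs pulled out this way are exactly what the thread-index generation would need.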
Dave Page <dpage@pgadmin.org> writes: > 2010/1/12 Matteo Beccati <php@beccati.com>: >> So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list archives reading from the AOX database: >> >> http://archives.beccati.org/ > > Seems to work. Hehe, a nice beginning! > So just to put this into perspective and give anyone paying attention > an idea of the pain that lies ahead should they decide to work on > this: > > - We need to import the old archives (of which there are hundreds of > thousands of messages, the first few years of which have, umm, minimal > headers. Does anyone have a local copy of this in their mailboxes? At some point there was an NNTP gateway, so maybe there's a copy that way. > - We need to generate thread indexes We have CTEs :) > - We need to re-generate the original URLs for backwards compatibility I guess the message-id one ain't the tricky one... and it should be possible to fill a relation table like monharc_compat(message_id, list, year, month, message_number); Then we'd need some help from the webserver (rewrite rules I guess) so that the current URL is transformed to call a catch-all script: http://archives.postgresql.org/pgsql-xxx/YYYY-MM/msg01234.php -> http://archives.postgresql.org/compat.php?l=xxx&y=YYYY&m=MM&n=01234 In that compat.php script you then issue the following query or the like to get the message_id, then use the newer infrastructure to get to display it: SELECT message_id FROM monharc_compat WHERE list = ? and year = ? and month = ? and message_number = ?; Regards, -- dim
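[Editor's note: a minimal sketch of that catch-all lookup, using Python with an in-memory SQLite table standing in for Dimitri's hypothetical monharc_compat relation. The table name and columns come from his sketch; the sample row and URL pattern are made-up assumptions.]

```python
import re
import sqlite3

# In-memory SQLite stands in for the real database; monharc_compat is
# Dimitri's hypothetical schema, and the row below is made-up data.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE monharc_compat
              (message_id TEXT, list TEXT, year INT, month INT, message_number INT)""")
db.execute("INSERT INTO monharc_compat VALUES"
           " ('20100108124613.GL2505@timac.local', 'hackers', 2010, 1, 1234)")

# Pattern for the old mhonarc-style URLs: /pgsql-<list>/<YYYY>-<MM>/msg<N>.php
OLD_URL = re.compile(r"/pgsql-(\w+)/(\d{4})-(\d{2})/msg(\d+)\.php$")

def lookup(url):
    """Map an old archive URL to the stored message-id, or None."""
    m = OLD_URL.search(url)
    if not m:
        return None
    row = db.execute(
        "SELECT message_id FROM monharc_compat"
        " WHERE list = ? AND year = ? AND month = ? AND message_number = ?",
        (m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))).fetchone()
    return row[0] if row else None

mid = lookup("http://archives.postgresql.org/pgsql-hackers/2010-01/msg01234.php")
```

Once the message-id is found, the new infrastructure can either render the message directly or issue a redirect to the new-style URL.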
On Tue, Jan 12, 2010 at 10:05, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Dave Page <dpage@pgadmin.org> writes: >> 2010/1/12 Matteo Beccati <php@beccati.com>: >>> So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list archives reading from the AOX database: >>> >>> http://archives.beccati.org/ >> >> Seems to work. > > Hehe, a nice beginning! The problem is usually with strange looking emails with 15 different MIME types. If we can figure out the proper way to render that, the rest really is just a SMOP. (BTW, for something to actually be used In Production (TM), we want something that uses one of our existing frameworks. So don't go overboard in code-wise implementations on something else - proof of concept on something else is always ok, of course) >> So just to put this into perspective and give anyone paying attention >> an idea of the pain that lies ahead should they decide to work on >> this: >> >> - We need to import the old archives (of which there are hundreds of >> thousands of messages, the first few years of which have, umm, minimal >> headers. > > Anyone having a local copy of this in his mailboxes? At some point there > were some NNTP gateway, so maybe there's a copy this way. We have MBOX files. IIRC, aox has an import function that can read MBOX files. The interesting thing is what happens with the really old files that don't have complete headers. I don't think you can trust the NNTP gateway now or in the past; messages are sometimes lost there. The mbox files are as complete as anything we'll ever get. >> - We need to generate thread indexes > > We have CTEs :) Right. We still need the threading information, so we have something to use our CTEs on :-) But I assume that AOX already does this? >> - We need to re-generate the original URLs for backwards compatibility > > I guess the message-id one ain't the tricky one...
> and it should be > possible to fill a relation table like > monharc_compat(message_id, list, year, month, message_number); Yeah. It's not so hard; you can just screen-scrape the current archives the same way the search server does. > Then we'd need some help from the webserver (rewrite rules I guess) so > that the current URL is transformed to call a catch-all script: > http://archives.postgresql.org/pgsql-xxx/YYYY-MM/msg01234.php > -> http://archives.postgresql.org/compat.php?l=xxx&y=YYYY&m=MM&n=01234 Or just a trivial regexp catch in any modern app platform. > In that compat.php script you then issue the following query or the like > to get the message_id, then use the newer infrastructure to get to > display it: > > SELECT message_id > FROM monharc_compat > WHERE list = ? and year = ? and month = ? and message_number = ?; I'd rather see it redirect to the new-style URL, but it's the same query, yes :-) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
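[Editor's note: the "we have CTEs" quip about thread indexes can be made concrete with a recursive CTE. This sketch runs against an in-memory SQLite table; the messages table, parent-pointer column, and Message-IDs are made-up assumptions, not the real schema.]

```python
import sqlite3

# Hypothetical messages table keyed by Message-ID, with an In-Reply-To
# parent pointer; sample data is made up for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages"
           " (message_id TEXT PRIMARY KEY, in_reply_to TEXT, subject TEXT)")
db.executemany("INSERT INTO messages VALUES (?, ?, ?)", [
    ("<1@x>", None,    "archiver chewing patches"),
    ("<2@x>", "<1@x>", "Re: archiver chewing patches"),
    ("<3@x>", "<2@x>", "Re: archiver chewing patches"),
    ("<4@x>", "<1@x>", "Re: archiver chewing patches"),
])

# Walk the thread: start at messages with no parent, then repeatedly join
# replies onto the rows found so far, carrying a depth for indentation.
thread = db.execute("""
    WITH RECURSIVE thread(message_id, depth) AS (
        SELECT message_id, 0 FROM messages WHERE in_reply_to IS NULL
        UNION ALL
        SELECT m.message_id, t.depth + 1
        FROM messages m JOIN thread t ON m.in_reply_to = t.message_id
    )
    SELECT message_id, depth FROM thread ORDER BY depth, message_id
""").fetchall()
```

The same query runs unchanged on PostgreSQL 8.4+, which is where the "We have CTEs :)" remark comes from; the depth column is what a thread-index page would use for indentation.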
On Mon, Jan 11, 2010 at 15:00, Abhijit Menon-Sen <ams@toroid.org> wrote: > (Many thanks to Dimitri for bringing this thread to my attention.) > > At 2010-01-11 10:46:10 +0100, magnus@hagander.net wrote: >> >> As for AOX, my understanding is that it is no longer maintained, so >> I'd be worried about choosing such a solution for a complex problem. > > I'll keep this short: Oryx, the company behind Archiveopteryx (aox), is > no longer around, but the software is still maintained. The developers > (myself included) are still interested in keeping it alive. It's been a > while since the last release, but it'll be ready soon. If you're having > any sort of problems with it, write to me, and I'll help you. Hmm. So if this means that the system is actually something we can rely on long-term for the parsing and importing of messages into the database, it may be an interesting option still, so we don't have to write that part ourselves. I just don't want to end up with a non-maintained system. I doubt we, as a community, want to take on maintaining a message parser in C++. I'd be much more inclined to end up having to maintain something written in python or perl in that case, since they'd probably rely much on external modules that a lot of others rely on --> somebody else would help maintain large parts of it. > (That said, we're not working on the web interface. It did work, in its > limited fashion, but it's not feature complete; and I need to find some > paying work, so it's not a priority. That, and some health problems, are > also why I haven't been active on the pg lists for a while.) As long as the db structure is easy enough to parse and generate stuff from, this may actually be a feature, because it will make it easier to integrate with our other website stuff. If it's very low level and leaves too much work, well, then it's the opposite of course :-) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
> 2010/1/12 Matteo Beccati <php@beccati.com>:
> > So, I've decided to spend a bit more time on this and here is a proof
> > of concept web app that displays mailing list archives reading from
> > the AOX database:
> >
> > http://archives.beccati.org/
>
> Seems to work.
>
> > Please take it as an exercise I've made trying to learn how to use
> > symfony this afternoon. It's not feature complete, nor probably very
> > scalable, but at least it features attachment download ;)
> >
> > http://archives.beccati.org/pgsql-hackers/message/37
>
> :-)
>
> So just to put this into perspective and give anyone paying attention
> an idea of the pain that lies ahead should they decide to work on
> this:
>
> - We need to import the old archives (of which there are hundreds of
> thousands of messages, the first few years of which have, umm, minimal
> headers).
> - We need to generate thread indexes
> - We need to re-generate the original URLs for backwards compatibility
>
> Now there's encouragement :-)

Or, we just leave the current infrastructure in place and use a new one
for all new messages going forward. We shouldn't limit our ability to
have a decent system due to decisions of the past.

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use of Mr. or Sir.
"Joshua D. Drake" <jd@commandprompt.com> writes: > On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote: >> So just to put this into perspective and give anyone paying attention >> an idea of the pain that lies ahead should they decide to work on >> this: >> >> - We need to import the old archives (of which there are hundreds of >> thousands of messages, the first few years of which have, umm, minimal >> headers. >> - We need to generate thread indexes >> - We need to re-generate the original URLs for backwards compatibility >> >> Now there's encouragement :-) > Or, we just leave the current infrastructure in place and use a new one > for all new messages going forward. We shouldn't limit our ability to > have a decent system due to decisions of the past. -1. What's the point of having archives? IMO the mailing list archives are nearly as critical a piece of the project infrastructure as the CVS repository. We've already established that moving to a new SCM that fails to preserve the CVS history wouldn't be acceptable. I hardly think that the bar is any lower for mailing list archives. Now I think we could possibly skip the requirement suggested above for URL compatibility, if we just leave the old archives on-line so that those URLs all still resolve. But if we can't load all the old messages into the new infrastructure, it'll basically be useless for searching purposes. (Hmm, re-reading what you said, maybe we are suggesting the same thing, but it's not clear. Anyway my point is that Dave's first two requirements are real. Only the third might not be.) regards, tom lane
On Tue, 2010-01-12 at 11:54 -0500, Tom Lane wrote: > > Or, we just leave the current infrastructure in place and use a new one > > for all new messages going forward. We shouldn't limit our ability to > > have a decent system due to decisions of the past. > > -1. What's the point of having archives? IMO the mailing list archives > are nearly as critical a piece of the project infrastructure as the CVS > repository. We've already established that moving to a new SCM that > fails to preserve the CVS history wouldn't be acceptable. I hardly > think that the bar is any lower for mailing list archives. > > Now I think we could possibly skip the requirement suggested above for > URL compatibility, if we just leave the old archives on-line so that > those URLs all still resolve. But if we can't load all the old messages > into the new infrastructure, it'll basically be useless for searching > purposes. > > (Hmm, re-reading what you said, maybe we are suggesting the same thing, > but it's not clear. Anyway my point is that Dave's first two > requirements are real. Only the third might not be.) We are saying the same thing. Sorry if I wasn't clear. Joshua D. Drake > > regards, tom lane > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
On Tue, Jan 12, 2010 at 10:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote:
>>> So just to put this into perspective and give anyone paying attention
>>> an idea of the pain that lies ahead should they decide to work on
>>> this:
>>>
>>> - We need to import the old archives (of which there are hundreds of
>>> thousands of messages, the first few years of which have, umm, minimal
>>> headers).
>>> - We need to generate thread indexes
>>> - We need to re-generate the original URLs for backwards compatibility
>>>
>>> Now there's encouragement :-)
>
>> Or, we just leave the current infrastructure in place and use a new one
>> for all new messages going forward. We shouldn't limit our ability to
>> have a decent system due to decisions of the past.
>
> -1. What's the point of having archives? IMO the mailing list archives
> are nearly as critical a piece of the project infrastructure as the CVS
> repository. We've already established that moving to a new SCM that
> fails to preserve the CVS history wouldn't be acceptable. I hardly
> think that the bar is any lower for mailing list archives.
>
> Now I think we could possibly skip the requirement suggested above for
> URL compatibility, if we just leave the old archives on-line so that
> those URLs all still resolve. But if we can't load all the old messages
> into the new infrastructure, it'll basically be useless for searching
> purposes.
>
> (Hmm, re-reading what you said, maybe we are suggesting the same thing,
> but it's not clear. Anyway my point is that Dave's first two
> requirements are real. Only the third might not be.)

The third actually isn't that hard to do in theory. The message numbers
are basically the zero-based position in the mbox file, and the rest of
the URL is obvious.

--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
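Dave's point about message numbers can be sketched with Python's stdlib mailbox module; a hypothetical helper, not part of any existing archives tool:

```python
import mailbox

def message_numbers(mbox_path):
    """Map each Message-Id in an mbox file to its zero-based position.

    That position is what the old archive URLs encode as msgNNNNN,
    per Dave's description above.
    """
    mapping = {}
    for i, msg in enumerate(mailbox.mbox(mbox_path)):
        mid = (msg.get("Message-Id") or "").strip("<> \n")
        if mid:
            mapping[mid] = i
    return mapping
```

Running this over the monthly mbox files would give exactly the (message_id -> message_number) relation the compat table needs.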
On Tue, Jan 12, 2010 at 18:34, Dave Page <dpage@pgadmin.org> wrote: > On Tue, Jan 12, 2010 at 10:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> "Joshua D. Drake" <jd@commandprompt.com> writes: >>> On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote: >>>> So just to put this into perspective and give anyone paying attention >>>> an idea of the pain that lies ahead should they decide to work on >>>> this: >>>> >>>> - We need to import the old archives (of which there are hundreds of >>>> thousands of messages, the first few years of which have, umm, minimal >>>> headers. >>>> - We need to generate thread indexes >>>> - We need to re-generate the original URLs for backwards compatibility >>>> >>>> Now there's encouragement :-) >> >>> Or, we just leave the current infrastructure in place and use a new one >>> for all new messages going forward. We shouldn't limit our ability to >>> have a decent system due to decisions of the past. >> >> -1. What's the point of having archives? IMO the mailing list archives >> are nearly as critical a piece of the project infrastructure as the CVS >> repository. We've already established that moving to a new SCM that >> fails to preserve the CVS history wouldn't be acceptable. I hardly >> think that the bar is any lower for mailing list archives. >> >> Now I think we could possibly skip the requirement suggested above for >> URL compatibility, if we just leave the old archives on-line so that >> those URLs all still resolve. But if we can't load all the old messages >> into the new infrastructure, it'll basically be useless for searching >> purposes. >> >> (Hmm, re-reading what you said, maybe we are suggesting the same thing, >> but it's not clear. Anyway my point is that Dave's first two >> requirements are real. Only the third might not be.) > > The third actually isn't actually that hard to do in theory. The > message numbers are basically the zero-based position in the mbox > file, and the rest of the URL is obvious. 
The third part is trivial. The search system already does 95% of it.
I've already implemented exactly that kind of redirect thing on top of
the search code once just as a poc, and it was less than 30 minutes of
hacking. Can't seem to find the script ATM though, but you get the idea.

Let's not focus on that part, we can easily solve that.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 12/01/2010 10:30, Magnus Hagander wrote:
> The problem is usually with strange looking emails with 15 different
> MIME types. If we can figure out the proper way to render that, the
> rest really is just a SMOP.

Yeah, I was expecting some, but all the messages I've looked at seemed
to be working ok.

> (BTW, for something to actually be used In Production (TM), we want
> something that uses one of our existing frameworks. So don't go
> overboard in code-wise implementations on something else - proof of
> concept on something else is always ok, of course)

OK, that's something I didn't know, even though I expected some kind of
limitations. Could you please elaborate a bit more (i.e. where to find
info)?

Having played with it, here's my feedback about AOX:

pros:
- seemed to be working reliably;
- does most of the dirty job of parsing emails, splitting parts, etc
- highly normalized schema
- thread support (partial?)

cons:
- directly publishing the live email feed might not be desirable
- queries might end up being a bit complicated for simple tasks
- might not be easy to add additional processing in the workflow

>>> So just to put this into perspective and give anyone paying attention
>>> an idea of the pain that lies ahead should they decide to work on
>>> this:
>>>
>>> - We need to import the old archives (of which there are hundreds of
>>> thousands of messages, the first few years of which have, umm, minimal
>>> headers).
>>
>> Anyone having a local copy of this in his mailboxes? At some point there
>> were some NNTP gateway, so maybe there's a copy this way.
>
> We have MBOX files.
>
> IIRC, aox has an import function that can read MBOX files. The
> interesting thing is what happens with the really old files that don't
> have complete headers.
>
> I don't think you can trust the NNTP gateway now or in the past,
> messages are sometimes lost there. The mbox files are as complete as
> anything we'll ever get.
Importing the whole pgsql-www archive with a perl script that bounces
messages via SMTP took about 30m. Maybe there's even a way to skip SMTP,
I haven't looked into it that much.

>>> - We need to generate thread indexes
>>
>> We have CTEs :)
>
> Right. We still need the threading information, so we have something
> to use our CTEs on :-)
>
> But I assume that AOX already does this?

There are thread-related tables and they seem to get filled when a SORT
IMAP command is issued, however I haven't found a way to get the
hierarchy out of them. What that means is that we'd need some kind of
post-processing to populate a thread hierarchy.

If there isn't a fully usable thread hierarchy, I was thinking more of
ltree, mainly because I've successfully used it in the past and I
haven't had enough time yet to look at CTEs. But if performance is
comparable I don't see a reason why we shouldn't use them.

>>> - We need to re-generate the original URLs for backwards compatibility
>>
>> I guess the message-id one ain't the tricky one... and it should be
>> possible to fill a relation table like
>> monharc_compat(message_id, list, year, month, message_number);
>
> Yeah. It's not so hard, you can just screen-scrape the current
> archives the same way the search server does.

Definitely an easy enough task.

With all that said, I can't promise anything as it all depends on how
much spare time I have, but I can proceed with the evaluation if you
think it's useful. I have a feeling that AOX is not truly the right tool
for the job, but we might be able to customise it to suit our needs. Are
there any other requirements that weren't specified?

Cheers

--
Matteo Beccati
Development & Consulting - http://www.beccati.com/
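For what it's worth, the CTE side of the ltree-vs-CTE comparison above could look something like this. The sketch uses SQLite through Python's sqlite3 purely so it runs stand-alone; the same WITH RECURSIVE query works on PostgreSQL 8.4+, and the messages/parent_id names are illustrative, not the AOX schema:

```python
import sqlite3

def thread_tree(conn, root_id):
    """Walk one thread from parent links (as derived from In-Reply-To),
    returning (id, subject, depth) rows in depth-then-id order."""
    return conn.execute(
        """
        WITH RECURSIVE thread(id, subject, depth) AS (
            SELECT id, subject, 0 FROM messages WHERE id = ?
            UNION ALL
            SELECT m.id, m.subject, t.depth + 1
            FROM messages m JOIN thread t ON m.parent_id = t.id
        )
        SELECT id, subject, depth FROM thread ORDER BY depth, id
        """,
        (root_id,),
    ).fetchall()
```

The depth column is enough to indent a threaded index view; an ltree-based variant would instead store a materialized path per message and query with the `<@` operator.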
Il 12/01/2010 19:54, Magnus Hagander ha scritto: > On Tue, Jan 12, 2010 at 18:34, Dave Page<dpage@pgadmin.org> wrote: >> On Tue, Jan 12, 2010 at 10:24 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> "Joshua D. Drake"<jd@commandprompt.com> writes: >>>> On Tue, 2010-01-12 at 10:24 +0530, Dave Page wrote: >>>>> So just to put this into perspective and give anyone paying attention >>>>> an idea of the pain that lies ahead should they decide to work on >>>>> this: >>>>> >>>>> - We need to import the old archives (of which there are hundreds of >>>>> thousands of messages, the first few years of which have, umm, minimal >>>>> headers. >>>>> - We need to generate thread indexes >>>>> - We need to re-generate the original URLs for backwards compatibility >>>>> >>>>> Now there's encouragement :-) >>> >>>> Or, we just leave the current infrastructure in place and use a new one >>>> for all new messages going forward. We shouldn't limit our ability to >>>> have a decent system due to decisions of the past. >>> >>> -1. What's the point of having archives? IMO the mailing list archives >>> are nearly as critical a piece of the project infrastructure as the CVS >>> repository. We've already established that moving to a new SCM that >>> fails to preserve the CVS history wouldn't be acceptable. I hardly >>> think that the bar is any lower for mailing list archives. >>> >>> Now I think we could possibly skip the requirement suggested above for >>> URL compatibility, if we just leave the old archives on-line so that >>> those URLs all still resolve. But if we can't load all the old messages >>> into the new infrastructure, it'll basically be useless for searching >>> purposes. >>> >>> (Hmm, re-reading what you said, maybe we are suggesting the same thing, >>> but it's not clear. Anyway my point is that Dave's first two >>> requirements are real. Only the third might not be.) >> >> The third actually isn't actually that hard to do in theory. 
>> The message numbers are basically the zero-based position in the mbox
>> file, and the rest of the URL is obvious.
>
> The third part is trivial. The search system already does 95% of it.
> I've already implemented exactly that kind of redirect thing on top of
> the search code once just as a poc, and it was less than 30 minutes of
> hacking. Can't seem to find the script ATM though, but you get the
> idea.
>
> Let's not focus on that part, we can easily solve that.

Agreed. That's the part that worries me less.

Cheers

--
Matteo Beccati
Development & Consulting - http://www.beccati.com/
On Tue, Jan 12, 2010 at 20:56, Matteo Beccati <php@beccati.com> wrote:
> Il 12/01/2010 10:30, Magnus Hagander wrote:
>>
>> The problem is usually with strange looking emails with 15 different
>> MIME types. If we can figure out the proper way to render that, the
>> rest really is just a SMOP.
>
> Yeah, I was expecting some, but all the messages I've looked at seemed
> to be working ok.

Have you been looking at old or new messages? Try grabbing a couple of
MBOX files off archives.postgresql.org from several years back, you're
more likely to find weird MUAs then, I think.

>> (BTW, for something to actually be used In Production (TM), we want
>> something that uses one of our existing frameworks. So don't go
>> overboard in code-wise implementations on something else - proof of
>> concept on something else is always ok, of course)
>
> OK, that's something I didn't know, even though I expected some kind of
> limitations. Could you please elaborate a bit more (i.e. where to find
> info)?

Well, the framework we're moving towards is built on top of django, so
that would be a good first start.

There is also whatever the commitfest thing is built on, but I'm told
that's basically no framework.

> Having played with it, here's my feedback about AOX:
>
> pros:
> - seemed to be working reliably;
> - does most of the dirty job of parsing emails, splitting parts, etc
> - highly normalized schema
> - thread support (partial?)

A killer will be if that thread support is enough. If we have to build
that completely ourselves, it'll take a lot more work.

> cons:
> - directly publishing the live email feed might not be desirable

Why not?

> - queries might end up being a bit complicated for simple tasks

As long as we don't have to hit them too often, which is solvable with
caching. And we do have a pretty good RDBMS to run the queries on :)

>> I don't think you can trust the NNTP gateway now or in the past,
>> messages are sometimes lost there.
>> The mbox files are as complete as anything we'll ever get.
>
> Importing the whole pgsql-www archive with a perl script that bounces
> messages via SMTP took about 30m. Maybe there's even a way to skip
> SMTP, I haven't looked into it that much.

Um, yes. There is an MBOX import tool.

>>>> - We need to generate thread indexes
>>>
>>> We have CTEs :)
>>
>> Right. We still need the threading information, so we have something
>> to use our CTEs on :-)
>>
>> But I assume that AOX already does this?
>
> There are thread-related tables and they seem to get filled when a SORT
> IMAP command is issued, however I haven't found a way to get the
> hierarchy out of them.
>
> What that means is that we'd need some kind of post-processing to
> populate a thread hierarchy.
>
> If there isn't a fully usable thread hierarchy I was thinking more of
> ltree, mainly because I've successfully used it in the past and I
> haven't had enough time yet to look at CTEs. But if performance is
> comparable I don't see a reason why we shouldn't use them.

I'd favor CTEs if they are fast enough. Great flexibility.

>>>> - We need to re-generate the original URLs for backwards compatibility
>>>
>>> I guess the message-id one ain't the tricky one... and it should be
>>> possible to fill a relation table like
>>> monharc_compat(message_id, list, year, month, message_number);
>>
>> Yeah. It's not so hard, you can just screen-scrape the current
>> archives the same way the search server does.
>
> Definitely an easy enough task.
>
> With all that said, I can't promise anything as it all depends on how
> much spare time I have, but I can proceed with the evaluation if you
> think it's useful. I have a feeling that AOX is not truly the right
> tool for the job, but we might be able to customise it to suit our
> needs. Are there any other requirements that weren't specified?

Well, I think we want to avoid customizing it. Using a custom frontend,
sure.
But we don't want to end up customizing the parser/backend. That's the
road to unmaintainability.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
I'll note that the whole idea of an "email archive" interface might be a
very good "advocacy" project as well. AOX might not be a perfect fit,
but it could be a good learning experience... Really, all the PG mail
archives need is:

1) A nice normalized DB schema representing mail messages and their
   relations to other messages and "recipients" (or "folders")
2) An "injector" that can parse an email message, and de-compose it into
   the various parts/tables of the DB schema, and insert it
3) A nice set of SQL queries to return message, parts, threads, folders
   based on $criteria (search, id, folder, etc)
4) A web interface to view the messages/thread/parts #3 returns

The largest part of this is #1, but a good schema would be a very good
candidate to show off some of PG's more powerful features in a way that
"others" could see (like the movie store sample somewhere), such as:

1) full text search
2) text vs bytea handling (thinking of all the mime parts, and encoding, etc)
3) CTEs, ltree, recursion, etc, for threading/searching
4) Triggers for "materialized views" (for quick threading/folder queries)
5) expression indexes

a.

* Matteo Beccati <php@beccati.com> [100112 14:56]:
> Having played with it, here's my feedback about AOX:
>
> pros:
> - seemed to be working reliably;
> - does most of the dirty job of parsing emails, splitting parts, etc
> - highly normalized schema
> - thread support (partial?)
>
> cons:
> - directly publishing the live email feed might not be desirable
> - queries might end up being a bit complicated for simple tasks
> - might not be easy to add additional processing in the workflow

> If there isn't a fully usable thread hierarchy I was thinking more of
> ltree, mainly because I've successfully used it in the past and I
> haven't had enough time yet to look at CTEs. But if performance is
> comparable I don't see a reason why we shouldn't use them.
> With all that said, I can't promise anything as it all depends on how
> much spare time I have, but I can proceed with the evaluation if you
> think it's useful. I have a feeling that AOX is not truly the right tool
> for the job, but we might be able to customise it to suit our needs. Are
> there any other requirements that weren't specified?

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
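Point 2 of the list above (the "injector") is roughly what Python's stdlib email package already does; a minimal sketch with invented tuple shapes, not AOX's actual tables:

```python
from email import message_from_string

def decompose(raw):
    """Split a raw message into (headers, parts) lists -- roughly what
    an injector would turn into rows of a normalized schema.

    headers: list of (lowercased header name, value)
    parts:   list of (content type, decoded payload bytes),
             one per leaf MIME part.
    """
    msg = message_from_string(raw)
    headers = [(k.lower(), v) for k, v in msg.items()]
    parts = [(p.get_content_type(), p.get_payload(decode=True))
             for p in msg.walk() if not p.is_multipart()]
    return headers, parts
```

A real injector would additionally de-duplicate bodies and addresses across messages, which is where most of the schema's normalization pays off.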
On 12/01/2010 21:04, Magnus Hagander wrote:
> On Tue, Jan 12, 2010 at 20:56, Matteo Beccati<php@beccati.com> wrote:
>> Il 12/01/2010 10:30, Magnus Hagander wrote:
>>>
>>> The problem is usually with strange looking emails with 15 different
>>> MIME types. If we can figure out the proper way to render that, the
>>> rest really is just a SMOP.
>>
>> Yeah, I was expecting some, but all the messages I've looked at seemed
>> to be working ok.
>
> Have you been looking at old or new messages? Try grabbing a couple of
> MBOX files off archives.postgresql.org from several years back, you're
> more likely to find weird MUAs then, I think.

Both. pgsql-hackers and -general are subscribed and getting new emails
and pgsql-www is just an import of the archives:

http://archives.beccati.org/pgsql-www/by/date (sorry, no paging)

(just fixed a 500 error that was caused by the fact that I've been
playing with the db a bit and a required helper table was missing)

>>> (BTW, for something to actually be used In Production (TM), we want
>>> something that uses one of our existing frameworks. So don't go
>>> overboard in code-wise implementations on something else - proof of
>>> concept on something else is always ok, of course)
>>
>> OK, that's something I didn't know, even though I expected some kind of
>> limitations. Could you please elaborate a bit more (i.e. where to find
>> info)?
>
> Well, the framework we're moving towards is built on top of django, so
> that would be a good first start.
>
> There is also whatever the commitfest thing is built on, but I'm told
> that's basically no framework.

I'm afraid that's outside of my expertise. But I can get as far as
having a proof of concept and the required queries / php code.

>> Having played with it, here's my feedback about AOX:
>>
>> pros:
>> - seemed to be working reliably;
>> - does most of the dirty job of parsing emails, splitting parts, etc
>> - highly normalized schema
>> - thread support (partial?)
>
> A killer will be if that thread support is enough. If we have to build
> that completely ourselves, it'll take a lot more work.

Looks like we need to populate a helper table with hierarchy
information, unless Abhijit has a better idea and knows how to get it
from the aox main schema.

>> cons:
>> - directly publishing the live email feed might not be desirable
>
> Why not?

The scenario I was thinking of was the creation of a static snapshot and
the potential inconsistencies that might occur if the threads get
updated during that time.

>> - queries might end up being a bit complicated for simple tasks
>
> As long as we don't have to hit them too often, which is solvable with
> caching. And we do have a pretty good RDBMS to run the queries on :)

True :)

>>> I don't think you can trust the NNTP gateway now or in the past,
>>> messages are sometimes lost there. The mbox files are as complete as
>>> anything we'll ever get.
>>
>> Importing the whole pgsql-www archive with a perl script that bounces
>> messages via SMTP took about 30m. Maybe there's even a way to skip
>> SMTP, I haven't looked into it that much.
>
> Um, yes. There is an MBOX import tool.

Cool.

>> With all that said, I can't promise anything as it all depends on how
>> much spare time I have, but I can proceed with the evaluation if you
>> think it's useful. I have a feeling that AOX is not truly the right
>> tool for the job, but we might be able to customise it to suit our
>> needs. Are there any other requirements that weren't specified?
>
> Well, I think we want to avoid customizing it. Using a custom
> frontend, sure. But we don't want to end up customizing the
> parser/backend. That's the road to unmaintainability.

Sure. I guess my wording wasn't right... I was thinking more about
adding new tables, materialized views or whatever else might be missing
to make it fit our purpose.

Cheers

--
Matteo Beccati
Development & Consulting - http://www.beccati.com/
Aidan Van Dyk <aidan@highrise.ca> writes:
> I'll note that the whole idea of an "email archive" interface might be a
> very good "advocacy" project as well. AOX might not be a perfect fit,
> but it could be a good learning experience... Really, all the PG mail
> archives need is:
>
> 1) A nice normalized DB schema representing mail messages and their
> relations to other messages and "recipients" (or "folders")

We're now hoping that this one will fit:

  http://www.archiveopteryx.org/schema

> 2) An "injector" that can parse an email message, and de-compose it into
> the various parts/tables of the DB schema, and insert it

aox has that either as a bulk importer or as an MDA.

> 3) A nice set of SQL queries to return message, parts, threads,
> folders based on $criteria (search, id, folder, etc)

I guess Matteo's working on that…

> 4) A web interface to view the messages/thread/parts #3 returns

And that too.

> The largest part of this is #1, but a good schema would be a very good
> candidate to show off some of PG's more powerful features in a way that
> "others" could see (like the movie store sample somewhere), such as:
>
> 1) full text search
> 2) text vs bytea handling (thinking of all the mime parts, and encoding, etc)
> 3) CTEs, ltree, recursion, etc, for threading/searching
> 4) Triggers for "materialized views" (for quick threading/folder queries)
> 5) expression indexes

And Tsearch, too, maybe. Oh and pg_trgm might be quite good at providing
suggestions as you type or "Did you mean?" stuff.

Regards,

--
dim
* Dimitri Fontaine <dfontaine@hi-media.com> [100112 16:28]:
> > 1) A nice normalized DB schema representing mail messages and their
> > relations to other messages and "recipients" (or "folders")
>
> We're now hoping that this one will fit:
>
>   http://www.archiveopteryx.org/schema

Yup, and it provides a lot more too, which could probably be safely
ignored.

> > 2) An "injector" that can parse an email message, and de-compose it into
> > the various parts/tables of the DB schema, and insert it
>
> aox has that either as a bulk importer or as an MDA.

Yup, LMTP is ideally suited for that too.

> > 3) A nice set of SQL queries to return message, parts, threads,
> > folders based on $criteria (search, id, folder, etc)
>
> I guess Matteo's working on that…

Right, but this is where I want to see the AOX schema "improve"... In
ways like adding persistent tables for threading, which are updated by
triggers as new messages are delivered, etc. Documented queries that
show how to use CTEs, ltree, etc to get threaded views, good FTS support
(with indexes and triggers managing them), etc.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes:
>> aox has that either as a bulk importer or as an MDA.
>
> Yup, LMTP is ideally suited for that too.

Yes.

>> > 3) A nice set of SQL queries to return message, parts, threads,
>> > folders based on $criteria (search, id, folder, etc)
>>
>> I guess Matteo's working on that…
>
> Right, but this is where I want to see the AOX schema "improve"... In
> ways like adding persistent tables for threading, which are updated by
> triggers as new messages are delivered, etc. Documented queries that
> show how to use CTEs, ltree, etc to get threaded views, good FTS support
> (with indexes and triggers managing them), etc.

+1.

I just didn't understand how much your proposal fit into current work :)

--
dim
Hi,

>>>> 3) A nice set of SQL queries to return message, parts, threads,
>>>> folders based on $criteria (search, id, folder, etc)
>>>
>>> I guess Matteo's working on that…
>>
>> Right, but this is where I want to see the AOX schema "improve"... In
>> ways like adding persistent tables for threading, which are updated by
>> triggers as new messages are delivered, etc. Documented queries that
>> show how to use CTEs, ltree, etc to get threaded views, good FTS support
>> (with indexes and triggers managing them), etc.
>
> +1.
>
> I just didn't understand how much your proposal fit into current work :)

I'm looking into it. The link I've previously sent will most likely
return a 500 error for the time being.

Cheers

--
Matteo Beccati
Development & Consulting - http://www.beccati.com/
Il 14/01/2010 08:22, Matteo Beccati ha scritto: > Hi, > >>>>> 3) A nice set of SQL queries to return message, parts, threads, >>>>> folders based on $criteria (search, id, folder, etc) >>>> >>>> I guess Matteo's working on that… >>> >>> Right, but this is where I want to see the AOX schema "imporove"... In >>> ways like adding persistant tables for threading, which are updated by >>> triggers as new messages are delivered, etc. Documented queries that >>> show how to use CTEs, ltree, etc to get threaded views, good FTS support >>> (with indexes and triggers managing them), etc. >> >> +1. >> >> I just didn't understand how much your proposal fit into current work :) > > I'm looking into it. The link I've previously sent will most likely > return a 500 error for the time being. A quick update: I've extended AOX with a trigger that takes care of filling a separate table that's used to display the index pages. The new table also stores threading information (standard headers + Exchange headers support) and whether or not the email has attachments. Please check the updated PoC: http://archives.beccati.org/ pgsql-hackers and -general are currently subscribed, while -www only has 2003 history imported via aoximport (very fast!). BTW, I've just noticed a bug in the attachment detection giving false positives, but have no time to check now. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
Matteo Beccati <php@beccati.com> writes: > I've extended AOX with a trigger that takes care of filling a separate table > that's used to display the index pages. The new table also stores threading > information (standard headers + Exchange headers support) and whether or not > the email has attachments. > > Please check the updated PoC: http://archives.beccati.org/ Looks pretty good, even if some threads are still separated (this one for example), and the ordering looks strange. Seems to be right on track, that said :) Thanks for your work, -- dim
On Thu, Jan 14, 2010 at 7:09 PM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Matteo Beccati <php@beccati.com> writes: >> I've extended AOX with a trigger that takes care of filling a separate table >> that's used to display the index pages. The new table also stores threading >> information (standard headers + Exchange headers support) and whether or not >> the email has attachments. >> >> Please check the updated PoC: http://archives.beccati.org/ > > Looks pretty good, even if some thread are still separated (this one for > example), and the ordering looks strange. > > Seems to be right on tracks, that said :) Yup. Matteo - Can you try loading up a lot more of the old mbox files, particularly the very early ones from -hackers? It would be good to see how it copes under load with a few hundred thousand messages in the database. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
Il 14/01/2010 14:39, Dimitri Fontaine ha scritto: > Matteo Beccati<php@beccati.com> writes: >> I've extended AOX with a trigger that takes care of filling a separate table >> that's used to display the index pages. The new table also stores threading >> information (standard headers + Exchange headers support) and whether or not >> the email has attachments. >> >> Please check the updated PoC: http://archives.beccati.org/ > > Looks pretty good, even if some thread are still separated (this one for > example), and the ordering looks strange. This one is separated as the first one is not in the archive yet, thus to the system there are multiple parent messages. It shouldn't happen with full archives. About sorting, here's the query I've used (my first try with CTEs incidentally):

WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments, parent_uid, idx, depth) AS (
    SELECT mailbox, uid, date, subject, sender, has_attachments, parent_uid, uid::text, 1
    FROM arc_messages
    WHERE parent_uid IS NULL AND mailbox = 15
    UNION ALL
    SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, a.has_attachments,
           a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1
    FROM t JOIN arc_messages a USING (mailbox)
    WHERE t.uid = a.parent_uid
) SELECT * FROM t ORDER BY idx

Any improvements to sorting are welcome :) Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
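[Editor's note] Matteo's recursive query can be exercised end-to-end on a toy dataset. Below is a minimal, self-contained sketch using Python's sqlite3 (SQLite also supports WITH RECURSIVE); the table contents and uid values are hypothetical, and uid::text becomes CAST(uid AS TEXT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE arc_messages (
    mailbox INTEGER, uid INTEGER, date TEXT, subject TEXT, parent_uid INTEGER
);
-- two threads in mailbox 15: (1 -> 2 -> 4) and (3)
INSERT INTO arc_messages VALUES
    (15, 1, '2010-01-14', 'PoC',          NULL),
    (15, 2, '2010-01-14', 'Re: PoC',      1),
    (15, 4, '2010-01-15', 'Re: Re: PoC',  2),
    (15, 3, '2010-01-14', 'Other thread', NULL);
""")

# Same shape as the query in the mail: build a dotted materialized path
# (idx) per message, then sort by it to get depth-first thread order.
rows = conn.execute("""
WITH RECURSIVE t (mailbox, uid, subject, parent_uid, idx, depth) AS (
    SELECT mailbox, uid, subject, parent_uid, CAST(uid AS TEXT), 1
    FROM arc_messages
    WHERE parent_uid IS NULL AND mailbox = 15
    UNION ALL
    SELECT a.mailbox, a.uid, a.subject, a.parent_uid,
           t.idx || '.' || CAST(a.uid AS TEXT), t.depth + 1
    FROM t JOIN arc_messages a
      ON a.mailbox = t.mailbox AND t.uid = a.parent_uid
)
SELECT uid, depth FROM t ORDER BY idx
""").fetchall()
```

Each reply lands directly under its parent, with depth giving the indentation level.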
Il 14/01/2010 14:46, Dave Page ha scritto: > On Thu, Jan 14, 2010 at 7:09 PM, Dimitri Fontaine > <dfontaine@hi-media.com> wrote: >> Matteo Beccati<php@beccati.com> writes: >>> I've extended AOX with a trigger that takes care of filling a separate table >>> that's used to display the index pages. The new table also stores threading >>> information (standard headers + Exchange headers support) and whether or not >>> the email has attachments. >>> >>> Please check the updated PoC: http://archives.beccati.org/ >> >> Looks pretty good, even if some thread are still separated (this one for >> example), and the ordering looks strange. >> >> Seems to be right on tracks, that said :) > > Yup. > > Matteo - Can you try loading up a lot more of the old mbox files, > particularly the very early ones from -hackers? It would be good to > see how it copes under load with a few hundred thousand messages in > the database. Sure, I will give it a try in the evening or tomorrow. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
On Thu, Jan 14, 2010 at 8:02 PM, Matteo Beccati <php@beccati.com> wrote: > Il 14/01/2010 14:46, Dave Page ha scritto: >> >> On Thu, Jan 14, 2010 at 7:09 PM, Dimitri Fontaine >> <dfontaine@hi-media.com> wrote: >>> >>> Matteo Beccati<php@beccati.com> writes: >>>> >>>> I've extended AOX with a trigger that takes care of filling a separate >>>> table >>>> that's used to display the index pages. The new table also stores >>>> threading >>>> information (standard headers + Exchange headers support) and whether or >>>> not >>>> the email has attachments. >>>> >>>> Please check the updated PoC: http://archives.beccati.org/ >>> >>> Looks pretty good, even if some thread are still separated (this one for >>> example), and the ordering looks strange. >>> >>> Seems to be right on tracks, that said :) >> >> Yup. >> >> Matteo - Can you try loading up a lot more of the old mbox files, >> particularly the very early ones from -hackers? It would be good to >> see how it copes under load with a few hundred thousand messages in >> the database. > > Sure, I will give it a try in the evening or tomorrow. Thanks :-) -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
Matteo Beccati <php@beccati.com> writes: > WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments, > parent_uid, idx, depth) AS ( > SELECT mailbox, uid, date, subject, sender, has_attachments, parent_uid, > uid::text, 1 > FROM arc_messages > WHERE parent_uid IS NULL AND mailbox = 15 > UNION ALL > SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, a.has_attachments, > a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1 > FROM t JOIN arc_messages a USING (mailbox) > WHERE t.uid = a.parent_uid > ) SELECT * FROM t ORDER BY idx > > Any improvements to sorting are welcome :) What I'd like would be to have it sorted by activity, showing first the thread which received the later messages. I'm yet to play with CTE and window function myself so without a database example to play with I won't come up with a nice query, but I guess a more educated reader will solve this without a sweat, as it looks easier than sudoku-solving, which has been done already :) Regards, -- dim
Il 14/01/2010 15:47, Dimitri Fontaine ha scritto: > Matteo Beccati<php@beccati.com> writes: >> WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments, >> parent_uid, idx, depth) AS ( >> SELECT mailbox, uid, date, subject, sender, has_attachments, parent_uid, >> uid::text, 1 >> FROM arc_messages >> WHERE parent_uid IS NULL AND mailbox = 15 >> UNION ALL >> SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, a.has_attachments, >> a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1 >> FROM t JOIN arc_messages a USING (mailbox) >> WHERE t.uid = a.parent_uid >> ) SELECT * FROM t ORDER BY idx >> >> Any improvements to sorting are welcome :) > > What I'd like would be to have it sorted by activity, showing first the > thread which received the later messages. I'm yet to play with CTE and > window function myself so without a database example to play with I > won't come up with a nice query, but I guess a more educated reader will > solve this without a sweat, as it looks easier than sudoku-solving, > which has been done already :) Eheh, that was my first try as well. CTEs look very nice even though I'm not yet very comfortable with the syntax. Anyway, for both the date and thread indexes the sort is the other way around, with newer posts/threads at the bottom. Again, I'll give it a try as soon as I find time to work on it again. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
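[Editor's note] The activity ordering Dimitri asks for can be layered on the same recursive CTE: carry the thread root's uid down the recursion, aggregate max(date) per root, and order roots by that, newest activity first. A hedged sketch with hypothetical data (SQLite via Python so it's runnable; the real AOX-based schema would use the actual arc_messages columns, and PostgreSQL could keep the uid::text path):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE arc_messages (
    mailbox INTEGER, uid INTEGER, date TEXT, parent_uid INTEGER
);
-- thread rooted at 3 got a reply on the 16th; thread 1's last post is the 15th
INSERT INTO arc_messages VALUES
    (15, 1, '2010-01-14', NULL),
    (15, 2, '2010-01-14', 1),
    (15, 4, '2010-01-15', 2),
    (15, 3, '2010-01-14', NULL),
    (15, 5, '2010-01-16', 3);
""")

rows = conn.execute("""
WITH RECURSIVE t (uid, date, root_uid, idx) AS (
    SELECT uid, date, uid, printf('%010d', uid)
    FROM arc_messages WHERE parent_uid IS NULL AND mailbox = 15
    UNION ALL
    SELECT a.uid, a.date, t.root_uid, t.idx || '.' || printf('%010d', a.uid)
    FROM t JOIN arc_messages a ON t.uid = a.parent_uid AND a.mailbox = 15
),
activity AS (
    -- one row per thread: when it last saw a message
    SELECT root_uid, max(date) AS last_date FROM t GROUP BY root_uid
)
SELECT t.uid FROM t JOIN activity ON t.root_uid = activity.root_uid
ORDER BY activity.last_date DESC, activity.root_uid, t.idx
""").fetchall()
```

The thread with the most recent traffic (3, 5) now sorts ahead of the older one (1, 2, 4), while messages inside each thread stay in depth-first order.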
On Thu, Jan 14, 2010 at 16:06, Matteo Beccati <php@beccati.com> wrote: > Il 14/01/2010 15:47, Dimitri Fontaine ha scritto: >> >> Matteo Beccati<php@beccati.com> writes: >>> >>> WITH RECURSIVE t (mailbox, uid, date, subject, sender, has_attachments, >>> parent_uid, idx, depth) AS ( >>> SELECT mailbox, uid, date, subject, sender, has_attachments, >>> parent_uid, >>> uid::text, 1 >>> FROM arc_messages >>> WHERE parent_uid IS NULL AND mailbox = 15 >>> UNION ALL >>> SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, >>> a.has_attachments, >>> a.parent_uid, t.idx || '.' || a.uid::text, t.depth + 1 >>> FROM t JOIN arc_messages a USING (mailbox) >>> WHERE t.uid = a.parent_uid >>> ) SELECT * FROM t ORDER BY idx >>> >>> Any improvements to sorting are welcome :) >> >> What I'd like would be to have it sorted by activity, showing first the >> thread which received the later messages. I'm yet to play with CTE and >> window function myself so without a database example to play with I >> won't come up with a nice query, but I guess a more educated reader will >> solve this without a sweat, as it looks easier than sudoku-solving, >> which has been done already :) > > Eheh, that was my first try as well. CTEs look very nice even though I'm not > yet very comfortable with the syntax. Anyway both for date and thread > indexes sort is the other way around, with newer posts/threads at the > bottom. Again I'll give it a try as soon as I find time to work again on it. Three tips around this: 1) don't be constrained by how things look now. Make something that's useful. 2) don't be constrained by the fact that we have two ways to view it now (thread + date). We can easily do three, if different people like different ways.
As long as it's not so much that it becomes a maintenance burden. 3) Remember to run your tests with lots of emails; some designs just tend to fall apart over that (say a thread with 200+ emails in it) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Thu, Jan 14, 2010 at 03:08:22PM +0100, Matteo Beccati wrote: > Il 14/01/2010 14:39, Dimitri Fontaine ha scritto: > >Matteo Beccati<php@beccati.com> writes: > >>I've extended AOX with a trigger that takes care of filling a separate table > >>that's used to display the index pages. The new table also stores threading > >>information (standard headers + Exchange headers support) and whether or not > >>the email has attachments. > >> > >>Please check the updated PoC: http://archives.beccati.org/ > > > >Looks pretty good, even if some thread are still separated (this one for > >example), and the ordering looks strange. > > This one is separated as the first one is not in the archive yet, > thus to the system there are multiple parent messages. It shouldn't > happen with full archives. About sorting, here's the query I've used > (my first try with CTEs incidentally): > > WITH RECURSIVE t (mailbox, uid, date, subject, sender, > has_attachments, parent_uid, idx, depth) AS ( > SELECT mailbox, uid, date, subject, sender, has_attachments, > parent_uid, uid::text, 1 > FROM arc_messages > WHERE parent_uid IS NULL AND mailbox = 15 > UNION ALL > SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, > a.has_attachments, a.parent_uid, t.idx || '.' || a.uid::text, > t.depth + 1 > FROM t JOIN arc_messages a USING (mailbox) > WHERE t.uid = a.parent_uid > ) SELECT * FROM t ORDER BY idx > Any improvements to sorting are welcome :) This is probably better written as:

WITH RECURSIVE t (
    mailbox, uid, date, subject, sender, has_attachments, "path"
) AS (
    SELECT mailbox, uid, date, subject, sender, has_attachments, ARRAY[uid]
    FROM arc_messages
    WHERE parent_uid IS NULL AND mailbox = 15
    UNION ALL
    SELECT a.mailbox, a.uid, a.date, a.subject, a.sender, a.has_attachments,
           t."path" || a.uid
    FROM t JOIN arc_messages a ON (a.mailbox = t.mailbox AND t.uid = a.parent_uid)
) SELECT * FROM t ORDER BY "path";

Cheers, David.
-- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
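[Editor's note] One practical reason the array path is preferable to the text idx: dot-joined uid strings compare lexically, so a reply with uid 10 would sort before a sibling with uid 2. PostgreSQL integer arrays compare element-wise and avoid this. A zero-padded text path (a hypothetical workaround for engines without array types, not something from the thread) gives the same ordering:

```python
# Lexical comparison of plain dot-joined uids misorders multi-digit ids:
assert "1.10" < "1.2"   # wrong: reply 10 sorts before reply 2

# In PostgreSQL, ARRAY[1,2] < ARRAY[1,10] compares element-wise and is correct.
# Without arrays, zero-padding each path component restores numeric order:
def path_key(uids, width=10):
    """Build a sort key equivalent to an integer-array materialized path."""
    return ".".join(f"{u:0{width}d}" for u in uids)

assert path_key([1, 2]) < path_key([1, 10])   # correct order restored
```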
Hi everyone, Il 14/01/2010 19:36, David Fetter ha scritto: > On Thu, Jan 14, 2010 at 03:08:22PM +0100, Matteo Beccati wrote: >> Il 14/01/2010 14:39, Dimitri Fontaine ha scritto: >>> Matteo Beccati<php@beccati.com> writes:> >> Any improvements to sorting are welcome :) > > ... > ARRAY[uid] > ... Thanks David, using an array rather than text concatenation is slightly slower and uses a bit more memory, but you've been able to convince me that it's The Right Way(TM) ;) Anyway, I've made further changes and I would say that at this point the PoC is feature complete. There surely are still some rough edges and a few things to clean up, but I'd like to get your feedback once again: http://archives.beccati.org You will find that pgsql-general and -hackers are subscribed and getting messages live, while -hackers-history and -www have been imported from the archives (about 200k and 1.5k messages respectively at 50 messages/s). Also, I'd need some help with the CTE query that was picking a wrong plan and led me to forcibly disable merge joins inside the application when executing it. Plans are attached. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
Attachment
Matteo Beccati <php@beccati.com> writes: > Anyway, I've made further changes and I would say that at this point the PoC > is feature complete. There surely are still some rough edges and a few > things to clean up, but I'd like to get your feedback once again: > > http://archives.beccati.org I've been clicking around and like the speedy feeling and the Thread index appearing under any mail. Also, getting the attachments seems to be just working™. I've also checked that this "local thread" works on month boundaries, so your PoC is in a way already better than the current archives solution. The only thing missing is search, but we have tsearch and pg_trgm masters not far away… > You will find that pgsql-general and -hackers are subscribed and getting > messages live, wihle -hackers-history and -www have been imported from the > archives (about 200k and 1.5k messages respectively at 50 messages/s). Tried clicking over there, and very far back in the past the indexes show no messages. Here's an example: http://archives.beccati.org/pgsql-hackers-history/1996-09/by/thread > Also, I'd need some help with the CTE query that was picking a wrong plan > and led me to forcibly disable merge joins inside the application when > executing it. Plans are attached. Sorry, not from me, still a CTE noob. -- dim
Il 16/01/2010 11:48, Dimitri Fontaine ha scritto: > Matteo Beccati<php@beccati.com> writes: >> Anyway, I've made further changes and I would say that at this point the PoC >> is feature complete. There surely are still some rough edges and a few >> things to clean up, but I'd like to get your feedback once again: >> >> http://archives.beccati.org > > I've been clicking around and like the speedy feeling and the Thread > index appearing under any mail. Also getting the attachments seems to be > just working™. I've also checked than this "local thread" works on month > boundaries, so that you're POC is in a way already better than the > current archives solution. Thanks for the feedback. > Only missing is the search, but we have tsearch and pg_trgm masters not > far away… I haven't even looked at it as I was under the impression that the old engine could still be used. If not, adding search support should be fairly easy. >> You will find that pgsql-general and -hackers are subscribed and getting >> messages live, wihle -hackers-history and -www have been imported from the >> archives (about 200k and 1.5k messages respectively at 50 messages/s). > > Tried clicking over there and very far in the past indexes show no > messages. Here's an example: > > http://archives.beccati.org/pgsql-hackers-history/1996-09/by/thread Yeah, there are a few messages in the archives with a wrong date header. The list is generated from min(date) to now(), so there are holes. At some point I'll run a few queries to fix that. >> Also, I'd need some help with the CTE query that was picking a wrong plan >> and led me to forcibly disable merge joins inside the application when >> executing it. Plans are attached. > > Sorry, not from me, still a CTE noob. Actually the live db doesn't suffer from that problem anymore, but I've been able to reproduce the issue with a few-days-old backup running on a test 8.5alpha3 instance that still has a stock postgresql.conf.
Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
Il 16/01/2010 14:21, Matteo Beccati ha scritto: > Il 16/01/2010 11:48, Dimitri Fontaine ha scritto: >> Matteo Beccati<php@beccati.com> writes: >>> Anyway, I've made further changes and I would say that at this point >>> the PoC >>> is feature complete. There surely are still some rough edges and a few >>> things to clean up, but I'd like to get your feedback once again: > > [...] > >>> Also, I'd need some help with the CTE query that was picking a wrong >>> plan >>> and led me to forcibly disable merge joins inside the application when >>> executing it. Plans are attached. >> >> Sorry, not from me, still a CTE noob. > > Actually the live db doesn't suffer from that problem anymore, but I've > able to reproduce the issue with a few days old backup running on a test > 8.5alpha3 instance that still has a stock postgresql.conf. Following advice from Andrew "RodiumToad" Gierth, I raised cpu costs back to the defaults (I did lower them following some tuning guide) and that seems to have fixed the problem. My question now is... what next? :) Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
2010/1/18 Matteo Beccati <php@beccati.com>: > Il 16/01/2010 14:21, Matteo Beccati ha scritto: >> >> Il 16/01/2010 11:48, Dimitri Fontaine ha scritto: >>> >>> Matteo Beccati<php@beccati.com> writes: >>>> >>>> Anyway, I've made further changes and I would say that at this point >>>> the PoC >>>> is feature complete. There surely are still some rough edges and a few >>>> things to clean up, but I'd like to get your feedback once again: >> >> [...] >> >>>> Also, I'd need some help with the CTE query that was picking a wrong >>>> plan >>>> and led me to forcibly disable merge joins inside the application when >>>> executing it. Plans are attached. >>> >>> Sorry, not from me, still a CTE noob. >> >> Actually the live db doesn't suffer from that problem anymore, but I've >> able to reproduce the issue with a few days old backup running on a test >> 8.5alpha3 instance that still has a stock postgresql.conf. > > Following advice from Andrew "RodiumToad" Gierth, I raised cpu costs back to the defaults (I did lower them following some tuning guide) and that seems to have fixed the problem. > > My question now is... what next? :) If it wasn't for the fact that we're knee deep in two other major projects for the infrastructure team right now, I'd be all over this :-) But we really need to complete that before we put anything new in production here. What I'd like to see is one that integrates with our general layouts. Also, I think one of the main issues with the archives today that people bring up is the inability to have threads cross months. I think that should be fixed. Basically, get rid of the grouping by month for a more dynamic way to browse. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > Also, I tink one of the main issues with the archives today that > people bring up is the inability to have threads cross months. I think > that should be fixed. Basically, get rid of the grouping by month for > a more dynamic way to browse. Click a mail in a thread spanning more than one month, then look at the Thread index for that email: it's complete for both months. Example here: http://archives.beccati.org/pgsql-hackers-history/message/191438.html http://archives.beccati.org/pgsql-hackers-history/message/191334.html That said, the month boundary is artificial, so maybe having X messages per page instead would be better? Regards, -- dim
Magnus Hagander wrote: > 2010/1/18 Matteo Beccati <php@beccati.com>: > > My question now is... what next? :) Gee, I disappear for a week and look what happens -- we get streaming replication, a revamped archives site, and maybe something else that I haven't seen yet. I love it :-) > If it wasn't for the fact that we're knee deep in two other major > projects for the infrastructure team right now, I'd be all over this > :-) But we really need to complete that before we put anything new in > production here. > > What I'd like to see is one that integrates with our general layouts. Yeah. I think that should be relatively simple to add. There are some other things that need a bit of rejiggering too, like the thread index at the bottom getting too wide with large threads, for example here: http://archives.beccati.org/pgsql-hackers-history/message/90000.html -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Il 18/01/2010 15:55, Magnus Hagander ha scritto: > If it wasn't for the fact that we're knee deep in two other major > projects for the infrastructure team right now, I'd be all over this > :-) But we really need to complete that before we put anything new in > production here. Sure, that's completely understandable. > What I'd like to see is one that integrates with our general layouts. Shouldn't be too hard, but I wouldn't be very keen on spending time on layout-related things that are going to be thrown away due to the framework and language being different from what is going to be used on production (symfony/php vs django/python). Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
Il 18/01/2010 16:19, Dimitri Fontaine ha scritto: > Magnus Hagander<magnus@hagander.net> writes: >> Also, I tink one of the main issues with the archives today that >> people bring up is the inability to have threads cross months. I think >> that should be fixed. Basically, get rid of the grouping by month for >> a more dynamic way to browse. > > Clic a mail in a thread within more than one given month. See the Thread > index for this email. It's complete, for both the month. Example here: > > http://archives.beccati.org/pgsql-hackers-history/message/191438.html > http://archives.beccati.org/pgsql-hackers-history/message/191334.html Thanks Dimitri, you beat me to it ;) > That said, the month boundary is artificial, so maybe having a X > messages per page instead would be better? Not sure. Having date-based pages helps reduce the set of messages that need to be scanned and sorted, increasing the likelihood of an index scan. But I'm happy to examine other alternatives too. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
On Mon, Jan 18, 2010 at 18:31, Matteo Beccati <php@beccati.com> wrote: > Il 18/01/2010 15:55, Magnus Hagander ha scritto: >> >> If it wasn't for the fact that we're knee deep in two other major >> projects for the infrastructure team right now, I'd be all over this >> :-) But we really need to complete that before we put anything new in >> production here. > > Sure, that's completely understandable. > >> What I'd like to see is one that integrates with our general layouts. > > Shoudln't bee too hard, but I wouldn't be very keen on spending time on > layout related things that are going to be thrown away to due the framework > and language being different from what is going to be used on production > (symfony/php vs django/python). I don't know symfony, but as long as it's done in a template it is probably pretty easy to move between different frameworks for the layout part. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Mon, Jan 18, 2010 at 18:35, Matteo Beccati <php@beccati.com> wrote: > Il 18/01/2010 16:19, Dimitri Fontaine ha scritto: >> >> Magnus Hagander<magnus@hagander.net> writes: >>> >>> Also, I tink one of the main issues with the archives today that >>> people bring up is the inability to have threads cross months. I think >>> that should be fixed. Basically, get rid of the grouping by month for >>> a more dynamic way to browse. >> >> Clic a mail in a thread within more than one given month. See the Thread >> index for this email. It's complete, for both the month. Example here: >> >> http://archives.beccati.org/pgsql-hackers-history/message/191438.html >> http://archives.beccati.org/pgsql-hackers-history/message/191334.html > > Thanks Dimitri, you beat me to it ;) > > >> That said, the month boundary is artificial, so maybe having a X >> messages per page instead would be better? > > Not sure. Having date based pages helps out reducing the set of messages > that need to be scanned and sorted, increasing the likeliness of an index > scan. But I'm happy to examine other alternatives too. I think we need to get rid of the month-based pages. We can keep them as an option, but they're not a good root thing. I'd rather have something where you start at a certain point and see <n> before and <n> after, so we keep the page to a reasonably short length. Keeping "30 days" there somewhere may make sense, but arbitrarily splitting at the 1st of each month doesn't follow the flow of discussions very well. I think the first step has to be to figure out how we'd like it presented. Only after that should we consider how to implement it to get fast scans in the database... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
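[Editor's note] The "start at a certain point and see <n> before and <n> after" idea maps naturally onto keyset pagination: two index-friendly range scans around a cursor value instead of OFFSET. A minimal sketch with a hypothetical msgs table (SQLite via Python; a real implementation would key on (mailbox, date, uid) to break ties between messages with the same date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgs (uid INTEGER PRIMARY KEY, date TEXT)")
# thirty messages, one per day in January 2010
conn.executemany("INSERT INTO msgs VALUES (?, ?)",
                 [(i, f"2010-01-{i:02d}") for i in range(1, 31)])

def window(conn, cursor_date, n):
    """n messages strictly before the cursor plus n at/after it, in date order."""
    before = conn.execute(
        "SELECT uid FROM msgs WHERE date < ? ORDER BY date DESC LIMIT ?",
        (cursor_date, n)).fetchall()
    after = conn.execute(
        "SELECT uid FROM msgs WHERE date >= ? ORDER BY date ASC LIMIT ?",
        (cursor_date, n)).fetchall()
    # the "before" scan runs newest-first, so reverse it back into date order
    return [u for (u,) in reversed(before)] + [u for (u,) in after]

page = window(conn, "2010-01-15", 3)
```

Both scans are simple range lookups on the date index, so the cost stays flat no matter how deep into the archive the cursor points, unlike OFFSET-based paging.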
Il 18/01/2010 18:42, Magnus Hagander ha scritto: > On Mon, Jan 18, 2010 at 18:31, Matteo Beccati<php@beccati.com> wrote: >> Il 18/01/2010 15:55, Magnus Hagander ha scritto: >>> >>> If it wasn't for the fact that we're knee deep in two other major >>> projects for the infrastructure team right now, I'd be all over this >>> :-) But we really need to complete that before we put anything new in >>> production here. >> >> Sure, that's completely understandable. >> >>> What I'd like to see is one that integrates with our general layouts. >> >> Shoudln't bee too hard, but I wouldn't be very keen on spending time on >> layout related things that are going to be thrown away to due the framework >> and language being different from what is going to be used on production >> (symfony/php vs django/python). > > I don't know symfony, but as long as it's done in a template it is > probably pretty easy to move between different frameworks for the > layout part. By default symfony uses plain PHP files as templates, but some plugins allow using a templating engine instead. I guess I can give them a try. -- Matteo Beccati Development & Consulting - http://www.beccati.com/
On Tue, Jan 19, 2010 at 09:11, Matteo Beccati <php@beccati.com> wrote: > Il 18/01/2010 18:42, Magnus Hagander ha scritto: >> >> On Mon, Jan 18, 2010 at 18:31, Matteo Beccati<php@beccati.com> wrote: >>> >>> Il 18/01/2010 15:55, Magnus Hagander ha scritto: >>>> >>>> If it wasn't for the fact that we're knee deep in two other major >>>> projects for the infrastructure team right now, I'd be all over this >>>> :-) But we really need to complete that before we put anything new in >>>> production here. >>> >>> Sure, that's completely understandable. >>> >>>> What I'd like to see is one that integrates with our general layouts. >>> >>> Shoudln't bee too hard, but I wouldn't be very keen on spending time on >>> layout related things that are going to be thrown away to due the >>> framework >>> and language being different from what is going to be used on production >>> (symfony/php vs django/python). >> >> I don't know symfony, but as long as it's done in a template it is >> probably pretty easy to move between different frameworks for the >> layout part. > > By default symfony uses plain PHP files as templates, but some plugins allow > using a templating engine instead. I guess I can give them a try. As long as the templating is separated from the code, it doesn't matter if it's a dedicated templating engine or PHP. The point being, focus on the contents and interface, porting the actual HTML-generation is likely to be easy compared to that. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Il 19/01/2010 09:44, Magnus Hagander ha scritto: > As long as the templating is separated from the code, it doesn't > matter if it's a dedicated templating engine or PHP. The point being, > focus on the contents and interface, porting the actual > HTML-generation is likely to be easy compared to that. I've been following the various suggestions. Please take a look at the updated archives proof of concept: http://archives.beccati.org/ The PoC is now integrated with the website layout and has a working "Mailing lists" menu to navigate the available lists. The artificial monthly breakdown has been removed and both thread and date sorting use pagination instead. The fancy tables are using the Ext JS framework as it was the only free one I could find that features column layout for trees. I'm not extremely happy about it, but it just works. Threads are loaded asynchronously (AJAX), while date sorting uses regular HTML tables with a bit of JS to get the fancy layout. This means that search engines still have a way to properly index all the messages. Last but not least, it's backwards compatible with the /message-id/* URI. The other one (/list/yyyy-mm/msg*.php) is implemented, but I just realized that it has problems dealing with the old archive weirdness (2009-12 also shows some messages dated Aug 2009, Nov 2009 or Jan 2010 for -hackers). That said, there are still a few visual improvements to be done, but overall I'm pretty much satisfied. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
Matteo Beccati wrote: > Il 19/01/2010 09:44, Magnus Hagander ha scritto: > >As long as the templating is separated from the code, it doesn't > >matter if it's a dedicated templating engine or PHP. The point being, > >focus on the contents and interface, porting the actual > >HTML-generation is likely to be easy compared to that. > > I've been following the various suggestions. Please take a look at > the updated archives proof of concept: > > http://archives.beccati.org/ I like this. Sorry for being unable to get in touch with you on IM. It's been a hectic time here with only very few pauses. Some things: * the list of lists and groups of lists are stored in two JSON files. Should I send you a copy of them so that you can tweak your code to use them? They are generated automatically from the wwwmaster database. * We have a bunch of templates that you could perhaps have used, if you hadn't already written all of it ... :-( * While I don't personally care, some are going to insist that the site works with Javascript disabled. I didn't try but from your description it doesn't seem like it would. Is this easily fixable? * The old monthly interface /list/yyyy-mm/msg*php is not really necessary to keep, *except* that we need the existing URLs to redirect to the corresponding new message page. I think we should be able to create a database of URL redirects from the old site, using the Message-Id URL style. So each message accessed using the old URL style would require two redirects, but I don't think this is a problem. Do you agree? * We're using Subversion to keep the current code. Is your code version-controlled? We'd need to import your code there, I'm afraid. > Last but not least, it's backwards compatibile with the > /message-id/* URI. The other one (/list/yyyy-mm/msg*.php) is > implemented, but I just realized that it has problems dealing with > the old archive weirdness (2009-12 shows also some messages dated > aug 2009 nov 2009 or jan 2010 for -hackers). 
I'm surprised about the Aug 2009 ones, but the others are explained by the fact that the site splits the mboxes using one timezone while the displayed times use another. We don't really control the former, so there's nothing to do about it; but anyway it's not really important. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Matteo Beccati <php@beccati.com> writes: > I've been following the various suggestions. Please take a look at the > updated archives proof of concept: > > http://archives.beccati.org/ I like the features a lot, and the only remarks I can think of are bikeshedding, so I'll leave it to the web team when they integrate it. It sure looks like a when rather than an if as far as I'm concerned. In short, +1! And thanks a lot! -- dim
On 01/30/2010 01:14 PM, Dimitri Fontaine wrote: > Matteo Beccati <php@beccati.com> writes: >> I've been following the various suggestions. Please take a look at the >> updated archives proof of concept: >> >> http://archives.beccati.org/ > > I like the features a lot, and the only remarks I can think of are > bikeshedding, so I'll leave it to the web team when they integrate it. It > sure looks like a when rather than an if as far as I'm concerned. > > In short, +1! And thanks a lot! +1 here too. That looks wonderful! Joe
On 30/01/2010 17:54, Alvaro Herrera wrote: > Matteo Beccati wrote: >> On 19/01/2010 09:44, Magnus Hagander wrote: >>> As long as the templating is separated from the code, it doesn't >>> matter if it's a dedicated templating engine or PHP. The point being, >>> focus on the contents and interface, porting the actual >>> HTML-generation is likely to be easy compared to that. >> >> I've been following the various suggestions. Please take a look at >> the updated archives proof of concept: >> >> http://archives.beccati.org/ > > I like this. > > Sorry for being unable to get in touch with you on IM. It's been a > hectic time here with only very few pauses. Thanks :) And no worries, I'm pretty sure you must be quite busy lately! > Some things: > > * the list of lists and groups of lists are stored in two JSON files. > Should I send you a copy of them so that you can tweak your code to use > them? They are generated automatically from the wwwmaster database. > > * We have a bunch of templates that you could perhaps have used, if you > hadn't already written all of it ... :-( The templates and especially the integration with the current layout still need to be rewritten when porting the code to python/Django, so I'm not sure if it's wise to spend more time on it at this stage. Not sure about the JSON approach either. Maybe it's something that needs to be further discussed when/if planning the migration of the archives to Archiveopteryx. > * While I don't personally care, some are going to insist that the site > works with Javascript disabled. I didn't try but from your description > it doesn't seem like it would. Is this easily fixable? Date sorting works nicely even without JS, while thread sorting doesn't at all. I've just updated the PoC so that thread sorting is not available when JS is not available, while it still is the default otherwise. Hopefully that's enough to keep JS haters happy. 
> * The old monthly interface /list/yyyy-mm/msg*.php is not really > necessary to keep, *except* that we need the existing URLs to redirect > to the corresponding new message page. I think we should be able to > create a database of URL redirects from the old site, using the > Message-Id URL style. So each message accessed using the old URL style > would require two redirects, but I don't think this is a problem. Do > you agree? Sure. I was just hoping there was an even easier way (restrict to month, order by uid limit 1 offset X). I guess it wouldn't be hard to write a script that populates a backward compatibility table. No need for double redirects, it'd be just a matter of adding a JOIN or two to the query. > * We're using Subversion to keep the current code. Is your code > version-controlled? We'd need to import your code there, I'm afraid. I do have a local svn repository. Given it's just a PoC that is going to be rewritten I don't think it should live in the official repo, but if you think it does, I'll be glad to switch. >> Last but not least, it's backwards compatible with the >> /message-id/* URI. The other one (/list/yyyy-mm/msg*.php) is >> implemented, but I just realized that it has problems dealing with >> the old archive weirdness (2009-12 also shows some messages dated >> Aug 2009, Nov 2009 or Jan 2010 for -hackers). > > I'm surprised about the Aug 2009 ones, but the others are explained > because the site divides the mboxes using one timezone and the time > displayed is a different timezone. We don't really control the first > one so there's nothing to do about it; but anyway it's not really > important. It's not a big deal, the BC-table approach will take care of those out-of-range messages. However there are a few messages in the hackers archive (and most likely others) that have wrong date headers (e.g. 1980, 2036): we need to think about what to do with them. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
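The backward-compatibility script discussed here could work along these lines — a minimal sketch only: the function names, the URL template and the 0-based offset numbering are my assumptions, not the real pgweb schema.

```python
# Hypothetical sketch: build a map from old static-archive URLs
# (/<list>/<yyyy-mm>/msgNNNNN.php) to the canonical /message-id/ URLs,
# so the web server can answer each old URL with a single redirect.

def old_url(list_name, month, offset):
    # Illustrative reconstruction of the old per-month page naming.
    return "/%s/%s/msg%05d.php" % (list_name, month, offset)

def build_redirects(monthly_index):
    """monthly_index maps (list, 'yyyy-mm') -> Message-Ids, ordered the
    same way the old static pages were numbered within the month."""
    redirects = {}
    for (lst, month), message_ids in monthly_index.items():
        for offset, msgid in enumerate(message_ids):
            redirects[old_url(lst, month, offset)] = "/message-id/" + msgid
    return redirects
```

In production this would presumably be a table joined into the message query, as Matteo suggests, rather than an in-memory dict; the sketch only shows the mapping being populated once from the old site's ordering.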
On 30/01/2010 22:18, Joe Conway wrote: > On 01/30/2010 01:14 PM, Dimitri Fontaine wrote: >> Matteo Beccati <php@beccati.com> writes: >>> I've been following the various suggestions. Please take a look at the >>> updated archives proof of concept: >>> >>> http://archives.beccati.org/ >> >> I like the features a lot, and the only remarks I can think of are >> bikeshedding, so I'll leave it to the web team when they integrate it. It >> sure looks like a when rather than an if as far as I'm concerned. >> >> In short, +1! And thanks a lot! > > +1 here too. That looks wonderful! Thanks guys. Hopefully in the next few days I'll be able to catch up with Alvaro to see how we can proceed on this. Incidentally, I've just found out that the mailing lists are dropping some messages. According to my qmail logs the AOX account never received Joe's message yesterday, nor quite a few others: M156252, M156259, M156262, M156273, M156275 and I've verified that it also has happened before. I don't know why, but I'm pretty sure that my MTA was contacted only once for those messages, while normally I get two connections (my own address + aox address). Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
On Sat, Jan 30, 2010 at 22:43, Matteo Beccati <php@beccati.com> wrote: > On 30/01/2010 17:54, Alvaro Herrera wrote: >> * While I don't personally care, some are going to insist that the site >> works with Javascript disabled. I didn't try but from your description >> it doesn't seem like it would. Is this easily fixable? > > Date sorting works nicely even without JS, while thread sorting doesn't at > all. I've just updated the PoC so that thread sorting is not available when > JS is not available, while it still is the default otherwise. Hopefully > that's enough to keep JS haters happy. I haven't looked at how it actually works, but the general requirement is that it has to *work* without JS. It doesn't have to work *as well*. That means serving up a page with zero contents, or a page that you can't navigate, is not acceptable. Requiring more clicks to get around the navigation, and things like that, is ok. >> * The old monthly interface /list/yyyy-mm/msg*.php is not really >> necessary to keep, *except* that we need the existing URLs to redirect >> to the corresponding new message page. I think we should be able to >> create a database of URL redirects from the old site, using the >> Message-Id URL style. So each message accessed using the old URL style >> would require two redirects, but I don't think this is a problem. Do >> you agree? > > Sure. I was just hoping there was an even easier way (restrict to month, > order by uid limit 1 offset X). I guess it wouldn't be hard to write a > script that populates a backward compatibility table. No need for double > redirects, it'd be just a matter of adding a JOIN or two to the query. Once we go into production on this, we'll need to do some serious thinking about the caching issues. And in any such scenario we should very much avoid serving up the same content under different URLs, since it'll blow away cache space for no reason - it's much better to throw a redirect. 
>> * We're using Subversion to keep the current code. Is your code >> version-controlled? We'd need to import your code there, I'm afraid. > > I do have a local svn repository. Given it's just a PoC that is going to be > rewritten I don't think it should live in the official repo, but if you > think it does, I'll be glad to switch. Note that the plan is to switch pgweb to git as well. So if you just want to push the stuff up during development so people can look at it, register for a repository at git.postgresql.org - or just set one up at github which is even easier. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 31/01/2010 13:45, Magnus Hagander wrote: > On Sat, Jan 30, 2010 at 22:43, Matteo Beccati <php@beccati.com> wrote: >> On 30/01/2010 17:54, Alvaro Herrera wrote: >>> * While I don't personally care, some are going to insist that the site >>> works with Javascript disabled. I didn't try but from your description >>> it doesn't seem like it would. Is this easily fixable? >> >> Date sorting works nicely even without JS, while thread sorting doesn't at >> all. I've just updated the PoC so that thread sorting is not available when >> JS is not available, while it still is the default otherwise. Hopefully >> that's enough to keep JS haters happy. > > I haven't looked at how it actually works, but the general requirement > is that it has to *work* without JS. It doesn't have to work *as > well*. That means serving up a page with zero contents, or a page that > you can't navigate, is not acceptable. Requiring more clicks to get > around the navigation and things like that, are ok. As it currently stands, date sorting is the default and there are no links to the thread view, which would otherwise look empty. We can surely build a non-JS thread view as well, I'm just not sure if it's worth the effort. >>> * The old monthly interface /list/yyyy-mm/msg*.php is not really >>> necessary to keep, *except* that we need the existing URLs to redirect >>> to the corresponding new message page. I think we should be able to >>> create a database of URL redirects from the old site, using the >>> Message-Id URL style. So each message accessed using the old URL style >>> would require two redirects, but I don't think this is a problem. Do >>> you agree? >> >> Sure. I was just hoping there was an even easier way (restrict to month, >> order by uid limit 1 offset X). I guess it wouldn't be hard to write a >> script that populates a backward compatibility table. No need for double >> redirects, it'd be just a matter of adding a JOIN or two to the query. 
> > Once we go into production on this, we'll need to do some serious > thinking about the caching issues. And in any such scenario we should > very much avoid serving up the same content under different URLs, > since it'll blow away cache space for no reason - it's much better to > throw a redirect. Yes, that was my point. A single redirect to the only URL for the message. >>> * We're using Subversion to keep the current code. Is your code >>> version-controlled? We'd need to import your code there, I'm afraid. >> >> I do have a local svn repository. Given it's just a PoC that is going to be >> rewritten I don't think it should live in the official repo, but if you >> think it does, I'll be glad to switch. > > Note that the plan is to switch pgweb to git as well. So if you just > want to push the stuff up during development so people can look at it, > register for a repository at git.postgresql.org - or just set one up > at github which is even easier. The only reason I used svn is that git support in NetBeans is rather poor, or at least that was my impression. I think it won't be a problem to move to git, I probably just need some directions ;) Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
On Sun, Jan 31, 2010 at 15:09, Matteo Beccati <php@beccati.com> wrote: > On 31/01/2010 13:45, Magnus Hagander wrote: >> >> On Sat, Jan 30, 2010 at 22:43, Matteo Beccati <php@beccati.com> wrote: >>> >>> On 30/01/2010 17:54, Alvaro Herrera wrote: >>>> >>>> * While I don't personally care, some are going to insist that the site >>>> works with Javascript disabled. I didn't try but from your description >>>> it doesn't seem like it would. Is this easily fixable? >>> >>> Date sorting works nicely even without JS, while thread sorting doesn't >>> at >>> all. I've just updated the PoC so that thread sorting is not available >>> when >>> JS is not available, while it still is the default otherwise. Hopefully >>> that's enough to keep JS haters happy. >> >> I haven't looked at how it actually works, but the general requirement >> is that it has to *work* without JS. It doesn't have to work *as >> well*. That means serving up a page with zero contents, or a page that >> you can't navigate, is not acceptable. Requiring more clicks to get >> around the navigation and things like that, are ok. > > As it currently stands, date sorting is the default and there are no links > to the thread view, which would otherwise look empty. We can surely build a > non-JS thread view as well, I'm just not sure if it's worth the effort. Hmm. I personally think we need some level of thread support for non-JS as well, so that it's at least not *too* much of a step backwards from what we have now. But others may have other thoughts about that? >>>> * We're using Subversion to keep the current code. Is your code >>>> version-controlled? We'd need to import your code there, I'm afraid. >>> >>> I do have a local svn repository. Given it's just a PoC that is going to >>> be >>> rewritten I don't think it should live in the official repo, but if you >>> think it does, I'll be glad to switch. >> >> Note that the plan is to switch pgweb to git as well. 
So if you just >> want to push the stuff up during development so people can look at it, >> register for a repository at git.postgresql.org - or just set one up >> at github which is even easier. > > The only reason I used svn is that git support in NetBeans is rather > poor, or at least that was my impression. I think it won't be a problem to > move to git, I probably just need some directions ;) :-) Well, it doesn't matter what type of repo it's in at this point, only once it goes into production. The reason I suggested git at this point is that we (the postgresql project) do provide git hosting at git.postgresql.org, but we don't provide subversion anywhere. And I'm certainly not going to suggest you use pgfoundry and cvs.... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Matteo Beccati wrote: > Incidentally, I've just found out that the mailing lists are > dropping some messages. According to my qmail logs the AOX account > never received Joe's message yesterday, nor quite a few others: > > M156252, M156259, M156262, M156273, M156275 > > and I've verified that it also has happened before. I don't know > why, but I'm pretty sure that my MTA was contacted only once for > those messages, while normally I get two connections (my own address > + aox address). Hmm, I see it here: http://archives.postgresql.org/message-id/4B64A238.1050307%40joeconway.com Maybe it was just delayed? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 01/02/2010 03:27, Alvaro Herrera wrote: > Matteo Beccati wrote: > >> Incidentally, I've just found out that the mailing lists are >> dropping some messages. According to my qmail logs the AOX account >> never received Joe's message yesterday, nor quite a few others: >> >> M156252, M156259, M156262, M156273, M156275 >> >> and I've verified that it also has happened before. I don't know >> why, but I'm pretty sure that my MTA was contacted only once for >> those messages, while normally I get two connections (my own address >> + aox address). > > Hmm, I see it here: > http://archives.postgresql.org/message-id/4B64A238.1050307%40joeconway.com > Maybe it was just delayed? But not here: http://archives.beccati.org/message-id/4B64A238.1050307%40joeconway.com Anyway, I guess that on production we'll have a better way to inject emails into Archiveopteryx rather than relying on an email subscription, which seems a bit fragile. It's been ages since I last set up majordomo, but I guess there should be a way to also pipe outgoing messages through a script that performs the delivery to AOX. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
2010/2/1 Matteo Beccati <php@beccati.com>: > On 01/02/2010 03:27, Alvaro Herrera wrote: >> >> Matteo Beccati wrote: >> >>> Incidentally, I've just found out that the mailing lists are >>> dropping some messages. According to my qmail logs the AOX account >>> never received Joe's message yesterday, nor quite a few others: >>> >>> M156252, M156259, M156262, M156273, M156275 >>> >>> and I've verified that it also has happened before. I don't know >>> why, but I'm pretty sure that my MTA was contacted only once for >>> those messages, while normally I get two connections (my own address >>> + aox address). >> >> Hmm, I see it here: >> http://archives.postgresql.org/message-id/4B64A238.1050307%40joeconway.com >> Maybe it was just delayed? > > But not here: > > http://archives.beccati.org/message-id/4B64A238.1050307%40joeconway.com > > Anyway, I guess that on production we'll have a better way to inject emails into Archiveopteryx rather than relying on an email subscription, which seems a bit fragile. It's been ages since I last set up majordomo, but I guess there should be a way to also pipe outgoing messages through a script that performs the delivery to AOX. Does the MBOX importer support incremental loading? Because majordomo spits out MBOX files for us already. One option could be to use SMTP with a subscription as the primary way (and we could set up a dedicated relaying from the mailserver for this of course, so it's not subject to graylisting or anything like that), and then daily or so load the MBOX files to cover anything that was lost? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 01/02/2010 10:26, Magnus Hagander wrote: > Does the MBOX importer support incremental loading? Because majordomo > spits out MBOX files for us already. Unfortunately the aoximport shell command doesn't support incremental loading. > One option could be to use SMTP with a subscription as the primary way > (and we could set up a dedicated relaying from the mailserver for this > of course, so it's not subject to graylisting or anything like that), > and then daily or so load the MBOX files to cover anything that was > lost? I guess we could write a script that parses the mbox and adds whatever is missing, as long as we keep it as a last resort in case we can't make the primary delivery fail-proof. My main concern is that we'd need to overcomplicate the thread detection algorithm so that it better deals with delayed messages: as it currently works, the replies to a missing message get linked to the "grand-parent". Injecting the missing message afterwards will put it at the same level as its replies. If it happens only once in a while I guess we can live with it, but definitely not if it happens tens of times a day. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
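The catch-up script described here could be sketched with Python's stdlib mailbox module — the function name, the archive-side set of known Message-Ids, and the re-injection step itself are assumptions; only the mbox parsing uses a real API.

```python
import mailbox

def missing_from_archive(mbox_path, known_message_ids):
    """Yield messages from a majordomo mbox whose Message-Id the archive
    has not recorded yet, so a periodic job can re-inject just those
    (e.g. by handing them to the AOX delivery address over SMTP)."""
    for msg in mailbox.mbox(mbox_path, create=False):
        msgid = (msg.get("Message-Id") or "").strip().strip("<>")
        if msgid and msgid not in known_message_ids:
            yield msg
```

The known-id set would presumably come from a query against the archive database; deduplicating on Message-Id is what makes the pass idempotent, so it can run daily over the same growing mbox.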
2010/2/1 Matteo Beccati <php@beccati.com>: > On 01/02/2010 10:26, Magnus Hagander wrote: >> >> Does the MBOX importer support incremental loading? Because majordomo >> spits out MBOX files for us already. > > Unfortunately the aoximport shell command doesn't support incremental loading. > >> One option could be to use SMTP with a subscription as the primary way >> (and we could set up a dedicated relaying from the mailserver for this >> of course, so it's not subject to graylisting or anything like that), >> and then daily or so load the MBOX files to cover anything that was >> lost? > > I guess we could write a script that parses the mbox and adds whatever is missing, as long as we keep it as a last resort in case we can't make the primary delivery fail-proof. > > My main concern is that we'd need to overcomplicate the thread detection algorithm so that it better deals with delayed messages: as it currently works, the replies to a missing message get linked to the "grand-parent". Injecting the missing message afterwards will put it at the same level as its replies. If it happens only once in a while I guess we can live with it, but definitely not if it happens tens of times a day. That can potentially be a problem. Consider the case where message A is sent. Message B is a response to A, and message C is a response to B. Now assume B is held for moderation (because the poster is not on the list, or because it trips some other thing), then message C will definitely arrive before message B. Is that going to cause problems with this method? Another case where the same thing will happen is if message delivery of B gets for example graylisted, or is just slow from sender B, but gets quickly delivered to the author of message A (because of a direct CC). In this case, the author of message A may respond to it (making message D), and this will again arrive before message B because author A is not graylisted. So the system definitely needs to deal with out-of-order delivery. 
-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 01/02/2010 15:03, Magnus Hagander wrote: > 2010/2/1 Matteo Beccati <php@beccati.com>: >> My main concern is that we'd need to overcomplicate the thread detection algorithm so that it better deals with delayed messages: as it currently works, the replies to a missing message get linked to the "grand-parent". Injecting the missing message afterwards will put it at the same level as its replies. If it happens only once in a while I guess we can live with it, but definitely not if it happens tens of times a day. > > That can potentially be a problem. > > Consider the case where message A is sent. Message B is a response to > A, and message C is a response to B. Now assume B is held for > moderation (because the poster is not on the list, or because it trips > some other thing), then message C will definitely arrive before > message B. Is that going to cause problems with this method? > > Another case where the same thing will happen is if message delivery > of B gets for example graylisted, or is just slow from sender B, but > gets quickly delivered to the author of message A (because of a direct > CC). In this case, the author of message A may respond to it (making > message D), and this will again arrive before message B because author > A is not graylisted. > > So the system definitely needs to deal with out-of-order delivery. Hmm, it looks like I didn't factor in direct CCs when thinking about potential problems with the simplified algorithm. Thanks for raising that. I'll be out of town for a few days, but I will see what I can do when I get back. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
2010/2/1 Matteo Beccati <php@beccati.com>: > On 01/02/2010 15:03, Magnus Hagander wrote: >> >> 2010/2/1 Matteo Beccati <php@beccati.com>: >>> >>> My main concern is that we'd need to overcomplicate the thread detection algorithm so that it better deals with delayed messages: as it currently works, the replies to a missing message get linked to the "grand-parent". Injecting the missing message afterwards will put it at the same level as its replies. If it happens only once in a while I guess we can live with it, but definitely not if it happens tens of times a day. >> >> That can potentially be a problem. >> >> Consider the case where message A is sent. Message B is a response to >> A, and message C is a response to B. Now assume B is held for >> moderation (because the poster is not on the list, or because it trips >> some other thing), then message C will definitely arrive before >> message B. Is that going to cause problems with this method? >> >> Another case where the same thing will happen is if message delivery >> of B gets for example graylisted, or is just slow from sender B, but >> gets quickly delivered to the author of message A (because of a direct >> CC). In this case, the author of message A may respond to it (making >> message D), and this will again arrive before message B because author >> A is not graylisted. >> >> So the system definitely needs to deal with out-of-order delivery. > > Hmm, it looks like I didn't factor in direct CCs when thinking about potential problems with the simplified algorithm. Thanks for raising that. That is a very common scenario. And even without that, email taking different times to get delivered to majordomo is not at all uncommon. > I'll be out of town for a few days, but I will see what I can do when I get back. No rush. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Matteo Beccati <php@beccati.com> writes: > My main concern is that we'd need to overcomplicate the thread detection > algorithm so that it better deals with delayed messages: as it currently > works, the replies to a missing message get linked to the > "grand-parent". Injecting the missing message afterwards will put it at > the same level as its replies. If it happens only once in a while I > guess we can live with it, but definitely not if it happens tens of > times a day. That's quite common unfortunately --- I think you're going to need to deal with the case. Even getting a direct feed from the mail relays wouldn't avoid it completely: consider cases like

* A sends a message
* B replies, cc'ing A and the list
* B's reply to list is delayed by greylisting
* A replies to B's reply (cc'ing list)
* A's reply goes through immediately
* B's reply shows up a bit later

That happens pretty frequently IME. regards, tom lane
On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Matteo Beccati <php@beccati.com> writes: >> My main concern is that we'd need to overcomplicate the thread detection >> algorithm so that it better deals with delayed messages: as it currently >> works, the replies to a missing message get linked to the >> "grand-parent". Injecting the missing message afterwards will put it at >> the same level as its replies. If it happens only once in a while I >> guess we can live with it, but definitely not if it happens tens of >> times a day. > > That's quite common unfortunately --- I think you're going to need to > deal with the case. Even getting a direct feed from the mail relays > wouldn't avoid it completely: consider cases like > > * A sends a message > * B replies, cc'ing A and the list > * B's reply to list is delayed by greylisting > * A replies to B's reply (cc'ing list) > * A's reply goes through immediately > * B's reply shows up a bit later > > That happens pretty frequently IME. Yeah - and sometimes the delay can be DAYS. ...Robert
On Mon, Feb 1, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> * A sends a message >>> * B replies, cc'ing A and the list >>> * B's reply to list is delayed by greylisting >>> * A replies to B's reply (cc'ing list) >>> * A's reply goes through immediately >>> * B's reply shows up a bit later >>> >>> That happens pretty frequently IME. > >> Yeah - and sometimes the delay can be DAYS. > > Greylisting wouldn't explain a delay of more than an hour or so. > OTOH, if B's reply got held for moderation for some reason, then > yeah it could be days :-(. But in that case the rest of the list > didn't see it in real-time either, so having it show up out of > "logical" sequence in the archive doesn't seem like a terrible > reflection of reality. I'm just concerned about the threading not > being sensitive to skews on the order of a few minutes --- those > are extremely common. I not infrequently receive messages out of sequence by time periods well in excess of a few minutes. Don't know why, but I do. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> * A sends a message >> * B replies, cc'ing A and the list >> * B's reply to list is delayed by greylisting >> * A replies to B's reply (cc'ing list) >> * A's reply goes through immediately >> * B's reply shows up a bit later >> >> That happens pretty frequently IME. > Yeah - and sometimes the delay can be DAYS. Greylisting wouldn't explain a delay of more than an hour or so. OTOH, if B's reply got held for moderation for some reason, then yeah it could be days :-(. But in that case the rest of the list didn't see it in real-time either, so having it show up out of "logical" sequence in the archive doesn't seem like a terrible reflection of reality. I'm just concerned about the threading not being sensitive to skews on the order of a few minutes --- those are extremely common. regards, tom lane
2010/2/1 Robert Haas <robertmhaas@gmail.com>: > On Mon, Feb 1, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Mon, Feb 1, 2010 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> * A sends a message >>>> * B replies, cc'ing A and the list >>>> * B's reply to list is delayed by greylisting >>>> * A replies to B's reply (cc'ing list) >>>> * A's reply goes through immediately >>>> * B's reply shows up a bit later >>>> >>>> That happens pretty frequently IME. >> >>> Yeah - and sometimes the delay can be DAYS. >> >> Greylisting wouldn't explain a delay of more than an hour or so. >> OTOH, if B's reply got held for moderation for some reason, then >> yeah it could be days :-(. But in that case the rest of the list >> didn't see it in real-time either, so having it show up out of >> "logical" sequence in the archive doesn't seem like a terrible >> reflection of reality. I'm just concerned about the threading not >> being sensitive to skews on the order of a few minutes --- those >> are extremely common. > > I not infrequently receive messages out of sequence by time periods > well in excess of a few minutes. > > Don't know why, but I do. Quite often, it's stuck in the moderation queue. Not quite as often, but still fairly frequently, it's stuck somewhere in the hub.org relaying/antispam blackbox. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 01/02/2010 17:28, Tom Lane wrote: > Matteo Beccati <php@beccati.com> writes: >> My main concern is that we'd need to overcomplicate the thread detection >> algorithm so that it better deals with delayed messages: as it currently >> works, the replies to a missing message get linked to the >> "grand-parent". Injecting the missing message afterwards will put it at >> the same level as its replies. If it happens only once in a while I >> guess we can live with it, but definitely not if it happens tens of >> times a day. > > That's quite common unfortunately --- I think you're going to need to > deal with the case. Even getting a direct feed from the mail relays > wouldn't avoid it completely: consider cases like > > * A sends a message > * B replies, cc'ing A and the list > * B's reply to list is delayed by greylisting > * A replies to B's reply (cc'ing list) > * A's reply goes through immediately > * B's reply shows up a bit later > > That happens pretty frequently IME. I've improved the threading algorithm by keeping an ordered backlog of unresolved references, i.e. when a message arrives:

1. Search for a parent message using:
   1a. the In-Reply-To header. If the referenced message is not found, insert its Message-Id into the backlog table with position 0;
   1b. the References header. For each missing referenced message, insert its Message-Id into the backlog table with position N;
   1c. the MS Exchange Thread-Index and Thread-Topic headers.
2. The message is stored along with its parent ID, if any.
3. Compare the Message-Id header with the backlog table. Update the parent field of any referencing message and clean up positions >= N in the references table.

Now I just need some time to do a final clean up and I'd be ready to publish the code, which hopefully will be clearer than my words ;) Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
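The backlog scheme Matteo outlines can be sketched in Python — a simplified in-memory model only: plain dicts stand in for the real database tables, and the Exchange Thread-Index path (step 1c) is omitted.

```python
# In-memory sketch of the backlog-based threading described above.

messages = {}   # message-id -> parent message-id (None = thread root)
backlog = {}    # missing message-id -> list of (position, waiting child)

def arrive(msgid, in_reply_to=None, references=()):
    # Steps 1a/1b: candidate parents ordered closest-first
    # (In-Reply-To, then References walked from the last entry back).
    candidates = [in_reply_to] if in_reply_to else []
    for ref in reversed(references):
        if ref not in candidates:
            candidates.append(ref)
    parent = None
    for pos, ref in enumerate(candidates):
        if ref in messages:
            parent = ref              # closest ancestor already archived
            break
        # Ancestor not seen yet: queue it with its position.
        backlog.setdefault(ref, []).append((pos, msgid))
    # Step 2: store the message with the best parent found so far.
    messages[msgid] = parent
    # Step 3: this arrival may resolve earlier references: re-parent the
    # waiting children and drop their pending entries at positions >= pos.
    for pos, child in backlog.pop(msgid, []):
        messages[child] = msgid
        for waiting in backlog.values():
            waiting[:] = [(p, c) for p, c in waiting
                          if not (c == child and p >= pos)]
```

Because only references closer than the chosen parent are ever queued, a late arrival always re-parents its waiting children to a nearer ancestor, which is exactly the delayed-message case Tom describes.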