On Tue, Nov 21, 2017 at 4:43 PM, Stephen Frost <sfrost@snowman.net> wrote:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> ... I have no doubt at all that that's
> >> going to happen a *lot* during the list domain changeover, so I'd
> >> strongly recommend putting something in place to de-dup.
> >
> > Yeah, I'm already chatting w/ Magnus about this.
>
> Curiously, my replies to the same message seem to have been delivered
> only once, and that's not because I was awake enough to notice and
> remove the extra cc ;-). So my guess at this point is that you do
> have some de-dup in there, but it ain't working for gmail-originated
> messages.
As near as I can tell, GMail delivered the message to us in two independent runs with two connections to our mail server, while your server only delivered one message in one run to our server.
Yup, that's indeed what happened.
I'm guessing that your server realized it was the same MX for both postgresql.org and lists.postgresql.org and expected our server to handle delivering to the multiple addresses, but PGLister, for a given email that comes in, is only going to deliver once to each of the lists that are listed in the inbound email. On the other hand, GMail seems to split the email on the source side for each domain/subdomain and delivers them independently.
Unfortunately, we aren't going to be able to depend on the sender's MTA to always put the message into one email to us, as GMail has made clear, and also because relying on that isn't really "correct." We need to have a message-id cache in the PG database that will throw away dups when they come in, on a per-list basis. I don't anticipate it being too difficult to implement, really, but I think we'll need it to last at least a couple of days, which implies having a cleanup job for it, and so on.
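To make the idea concrete, here is a rough sketch of a per-list message-id cache with a cleanup step. It is only an illustration: it keeps the cache in memory rather than in the PG database, and the class name `MessageIdCache`, its methods, and the two-day default are all assumptions of mine, not anything PGLister actually has.

```python
import time


class MessageIdCache:
    """Hypothetical per-list message-id cache with time-based expiry."""

    def __init__(self, ttl_seconds=2 * 24 * 3600):
        # "A couple of days" is taken as two days here; that is an assumption.
        self.ttl = ttl_seconds
        self._seen = {}  # (list_name, msgid) -> time the message was first seen

    def is_duplicate(self, list_name, msgid, now=None):
        """Return True if this msgid was already seen for this list within the TTL."""
        now = time.time() if now is None else now
        key = (list_name, msgid)
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen < self.ttl:
            return True
        # First sighting (or the old entry expired): remember it and let it through.
        self._seen[key] = now
        return False

    def cleanup(self, now=None):
        """Drop expired entries; meant to be run periodically. Returns the number removed."""
        now = time.time() if now is None else now
        expired = [key for key, seen in self._seen.items() if now - seen >= self.ttl]
        for key in expired:
            del self._seen[key]
        return len(expired)
```

The key is the (list, message-id) pair rather than the message-id alone, so a message cross-posted to two lists is still delivered once to each.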
I have deployed what I think is the correct way to deal with this deduplication. Basically it tracks whether a given combination of (msgid, list) has been seen before, and if it has, the new copy is dropped on the floor (with a log entry, of course). We were already keeping track of that information (though in two different tables), so the extra check was easy and will be cheap.
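For anyone curious what a check like that looks like, here is a minimal sketch. It is not the deployed code: it uses an in-memory SQLite database and a single hypothetical table `seen` instead of the two existing tables, and the function names are made up for illustration.

```python
import logging
import sqlite3

log = logging.getLogger("dedup-sketch")


def make_db():
    """Create an in-memory database with a hypothetical (msgid, list_name) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seen ("
        " msgid TEXT NOT NULL,"
        " list_name TEXT NOT NULL,"
        " PRIMARY KEY (msgid, list_name))"
    )
    return conn


def accept(conn, msgid, list_name):
    """Return True if the message should be delivered to the list, False if it is a dup."""
    # The primary key makes the insert a no-op when the pair already exists.
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen (msgid, list_name) VALUES (?, ?)",
        (msgid, list_name),
    )
    conn.commit()
    if cur.rowcount == 0:
        log.info("dropping duplicate %s for list %s", msgid, list_name)
        return False
    return True
```

With this, a second copy of the same message for the same list is rejected, while the same message-id arriving for a different list still goes through:

```python
conn = make_db()
accept(conn, "<m1@mail.gmail.com>", "pgsql-general")   # first copy: delivered
accept(conn, "<m1@mail.gmail.com>", "pgsql-general")   # second copy: dropped and logged
accept(conn, "<m1@mail.gmail.com>", "pgsql-hackers")   # other list: delivered
```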
A db check shows that 33 emails have so far been delivered as duplicates across lists. Most went to general (22 of those mails), but a few went to other lists too.
No duplicate delivery has been attempted since I deployed the check, but they only show up once every few hours, so we'll wait a while to see if it works.