Re: PGLister fails to de-dup messages addressed twice to same list - Mailing list pgsql-www

From Magnus Hagander
Subject Re: PGLister fails to de-dup messages addressed twice to same list
Date
Msg-id CABUevEy5BXAA6_SUxzNcnLv+zgvbP8hMv8C5MTL+ugs5yBFRGA@mail.gmail.com
In response to Re: PGLister fails to de-dup messages addressed twice to same list  (Stephen Frost <sfrost@snowman.net>)
List pgsql-www
On Tue, Nov 21, 2017 at 4:43 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> ... I have no doubt at all that that's
> >> going to happen a *lot* during the list domain changeover, so I'd
> >> strongly recommend putting something in place to de-dup.
>
> > Yeah, I'm already chatting w/ Magnus about this.
>
> Curiously, my replies to the same message seem to have been delivered
> only once, and that's not because I was awake enough to notice and
> remove the extra cc ;-).  So my guess at this point is that you do
> have some de-dup in there, but it ain't working for gmail-originated
> messages.

As near as I can tell, GMail delivered the message to us in two
independent runs with two connections to our mail server, while your
server only delivered one message in one run to our server.

Yup, that's indeed what happened.

 
I'm guessing that your server realized it was the same MX for both
postgresql.org and lists.postgresql.org and expected our server to
handle delivering to the multiple addresses, but PGLister, for a given
email that comes in, is only going to deliver once to each of the lists
that are listed in the inbound email.  On the other hand, GMail seems to
split the email on the source side for each domain/subdomain and
delivers them independently.
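
The difference between the two delivery patterns described above can be sketched as follows. This is only an illustration of the guess made in this thread, not how any particular MTA is actually implemented: the function names, the `mx.example.org` hostname, and the MX mapping are all hypothetical.

```python
from collections import defaultdict


def group_by_domain(recipients):
    """Group recipients by address domain (one delivery per domain)."""
    groups = defaultdict(list)
    for addr in recipients:
        domain = addr.rsplit("@", 1)[1].lower()
        groups[domain].append(addr)
    return dict(groups)


def group_by_mx(recipients, mx_for_domain):
    """Group recipients by the mail exchanger serving their domain
    (one delivery per MX host)."""
    groups = defaultdict(list)
    for addr in recipients:
        domain = addr.rsplit("@", 1)[1].lower()
        groups[mx_for_domain[domain]].append(addr)
    return dict(groups)


# Hypothetical example: the same list addressed at both domains.
recipients = ["pgsql-www@postgresql.org", "pgsql-www@lists.postgresql.org"]
# Assumed mapping: both domains are served by the same (made-up) MX host.
mx_for_domain = {
    "postgresql.org": "mx.example.org",
    "lists.postgresql.org": "mx.example.org",
}

per_domain = group_by_domain(recipients)          # two separate deliveries
per_mx = group_by_mx(recipients, mx_for_domain)   # one combined delivery
print(len(per_domain), len(per_mx))
```

A sender that groups per domain hands the receiving server two independent copies of the same message, which is the case the list software then has to de-duplicate itself.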

Unfortunately, we aren't going to be able to depend on the sender's MTA
to always put the message into one email to us, as made clear by GMail
but also because it's not really "correct."  We need to have a
message-id cache in the PG database that will throw away dups when they
come in on a per-list basis.  I don't anticipate it being too difficult
to implement, really, but I think we'll need it to last at least a
couple of days which implies having a cleanup job for it, et al.
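
A minimal sketch of such a per-list message-id cache with a cleanup job might look like the following. It is not PGLister's actual code or schema: the table and column names are made up, the two-day retention is just one reading of "a couple of days", and an in-memory SQLite database stands in for the PG database so that the example is self-contained.

```python
import sqlite3
import time

# Assumed retention window: "a couple of days".
RETENTION_SECONDS = 2 * 24 * 3600


def open_cache():
    """Create the (hypothetical) cache table, keyed on (msgid, list)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seen_messages ("
        " msgid TEXT NOT NULL,"
        " list_name TEXT NOT NULL,"
        " seen_at REAL NOT NULL,"
        " PRIMARY KEY (msgid, list_name))"
    )
    return conn


def should_deliver(conn, msgid, list_name, now=None):
    """Record (msgid, list_name); return False if it was already seen."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_messages (msgid, list_name, seen_at)"
        " VALUES (?, ?, ?)",
        (msgid, list_name, now),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the key already existed.
    return cur.rowcount == 1


def cleanup(conn, now=None):
    """Drop cache entries older than the retention window."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM seen_messages WHERE seen_at < ?",
        (now - RETENTION_SECONDS,),
    )
    conn.commit()
    return cur.rowcount


conn = open_cache()
first = should_deliver(conn, "<abc@example.com>", "pgsql-www", now=0)
second = should_deliver(conn, "<abc@example.com>", "pgsql-www", now=10)
other = should_deliver(conn, "<abc@example.com>", "pgsql-general", now=10)
removed = cleanup(conn, now=RETENTION_SECONDS + 5)
```

Keying the cache on the pair rather than on the message-id alone means a message legitimately cross-posted to two different lists is still delivered once to each.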

I have deployed what I think is the correct way to deal with this deduplication. Basically, it checks whether a given combination of (msgid, list) has been seen before, and if it has, the new copy is dropped on the floor (with a log entry, of course). We were already keeping track of that information (though in two different tables), so the extra check was easy and will be cheap.

A db check shows that 33 emails have been delivered in duplicate so far. Most of them went to general (22 of those mails), but a few went to other lists too.

So far no duplicate delivery has been attempted since I deployed the check, but they only show up once every few hours, so we'll wait a while to see if it works.
