Thread: Mailing list subscription's mail delivery delays?
Hi,

For lack of a better place to ask: I've recently noticed that in several of the email threads that I follow over on -hackers@, some of the email messages have a very high time-to-delivery, and thus mails from the same thread arrive out of order. I've seen several occurrences of this with very long delays of over 10 hours, with at least one larger than 19 hours, assuming mail server clocks are accurate and receipt dates are correctly included in the mail headers.

I'm not sure if the issue is on my side (mail servers are gmail's) or on the mailing list server - all traces I've checked indicate that the delay is somewhere in the delivery from postgres' last mail server to the first gmail mail server.

I've only really noticed this sometime in the past few weeks. After sampling my mails, I found other examples of significant delays (>1h) for mails from well-respected hackers dating back to at least 2023-08-28.

Would you happen to know why this could be the case, and what I can do to fix it if it's something on my side?

I've attached three recently received mails from -hackers as .eml, to help with any debugging: one was delivered relatively quickly (91s), one for which the delivery took a long time (11h+), and one more with a very long delivery time (19h+). I haven't yet noticed any specific differences or commonalities between the fast and slow mails.

Kind regards,

Matthias van de Meent

[Attachment 1 of 3]

Hi,

On 2023-09-27 17:43:04 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > > Can you define "unfreeze"? I don't know if this newly invented term
> > > refers to unsetting a page that was marked all-frozen following (say)
> > > an UPDATE, or if it refers to choosing to not freeze when the option
> > > was available (in the sense that it was possible to do it and fully
> > > mark the page all-frozen in the VM). Or something else.
> > By "unfreeze", I mean unsetting a page all frozen in the visibility
> > map when modifying the page for the first time after it was last
> > frozen.
>
> I see. So I guess that Andres meant that you'd track that within all
> backends, using pgstats infrastructure (when he summarized your call
> earlier today)?

That call was just between Robert and me (and not dedicated just to this topic, fwiw). Yes, I was thinking of tracking that in pgstat. I can imagine occasionally rolling it over into pg_class, to better deal with crashes / failovers, but am fairly agnostic on whether that's really useful / necessary.

> And that that information would be an important input for VACUUM, as opposed
> to something that it maintained itself?

Yes. If the ratio of opportunistically frozen pages (which I'd define as pages that were frozen not because they strictly needed to be) vs the number of unfrozen pages increases, we need to make opportunistic freezing less aggressive, and vice versa.

> ISTM that the concept of "unfreezing" a page is equivalent to
> "opening" the page that was "closed" at some point (by VACUUM). It's
> not limited to freezing per se -- it's "closed for business until
> further notice", which is a slightly broader concept (and one not
> unique to Postgres). You don't just need to be concerned about updates
> and deletes -- inserts are also a concern.
>
> I would be sure to look out for new inserts that "unfreeze" pages, too
> -- ideally you'd have instrumentation that caught that, in order to
> get a general sense of the extent of the problem in each of your
> chosen representative workloads. This is particularly likely to be a
> concern when there is enough space on a heap page to fit one more heap
> tuple, that's smaller than most other tuples. The FSM will "helpfully"
> make sure of it. This problem isn't rare at all, unfortunately.

I'm not as convinced as you are that this is a problem / that the solution won't cause more problems than it solves.
Users are concerned when free space can't be used - you don't have to look further than the discussion in the last weeks about adding the ability to disable HOT to fight bloat.

I do agree that the FSM code tries way too hard to fit things onto early pages - it e.g. can slow down concurrent copy workloads by 3-4x due to contention in the FSM - and that it has more size classes than necessary, but I do think that just closing frozen pages against further insertions of small tuples will cause its own set of issues. I think at the very least there'd need to be something causing pages to reopen once the aggregate unused space in the table reaches some threshold.

Greetings,

Andres Freund

[Attachment 2 of 3]

On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> > Can you define "unfreeze"? I don't know if this newly invented term
> > refers to unsetting a page that was marked all-frozen following (say)
> > an UPDATE, or if it refers to choosing to not freeze when the option
> > was available (in the sense that it was possible to do it and fully
> > mark the page all-frozen in the VM). Or something else.
>
> By "unfreeze", I mean unsetting a page all frozen in the visibility
> map when modifying the page for the first time after it was last
> frozen.

I see. So I guess that Andres meant that you'd track that within all backends, using pgstats infrastructure (when he summarized your call earlier today)? And that that information would be an important input for VACUUM, as opposed to something that it maintained itself?

> I would probably call choosing not to freeze when the option is
> available "no freeze". I have been thinking of what to call it because
> I want to add some developer stats for myself indicating why a page
> that was freezable was not frozen.

I think that having that sort of information available via custom instrumentation (just for the performance validation side) makes a lot of sense.
ISTM that the concept of "unfreezing" a page is equivalent to "opening" the page that was "closed" at some point (by VACUUM). It's not limited to freezing per se -- it's "closed for business until further notice", which is a slightly broader concept (and one not unique to Postgres). You don't just need to be concerned about updates and deletes -- inserts are also a concern.

I would be sure to look out for new inserts that "unfreeze" pages, too -- ideally you'd have instrumentation that caught that, in order to get a general sense of the extent of the problem in each of your chosen representative workloads. This is particularly likely to be a concern when there is enough space on a heap page to fit one more heap tuple that's smaller than most other tuples. The FSM will "helpfully" make sure of it. This problem isn't rare at all, unfortunately.

> > The choice to freeze or not freeze pretty much always relies on
> > guesswork about what'll happen to the page in the future, no?
> > Obviously we wouldn't even apply the FPI trigger criteria if we could
> > somehow easily determine that it won't work out (to some degree that's
> > what conditioning it on being able to set the all-frozen VM bit
> > actually does).
>
> I suppose you are thinking of "opportunistic" as freezing whenever we
> aren't certain it is the right thing to do simply because we have the
> opportunity to do it?

I have heard the term "opportunistic freezing" used to refer to freezing that takes place outside of VACUUM before now. You know, something perfectly analogous to pruning in VACUUM versus opportunistic pruning. (I knew that you can't have meant that -- my point is that the terminology in this area has problems.)

> I want a way to express "freeze when freeze min age doesn't require it"

That makes sense when you consider where we are right now, but it'll sound odd in a world where freezing via min_freeze_age is the exception rather than the rule.
If anything, it would make more sense if the traditional min_freeze_age trigger criteria was the type of freezing that needed its own adjective.

--
Peter Geoghegan

[Attachment 3 of 3]

Andres Freund <andres@anarazel.de> writes:
> On 2023-09-27 16:52:44 -0400, Tom Lane wrote:
>> I think it doesn't, as long as all the relevant build targets
>> write their dependencies with "frontend_code" before "libpq".

> Hm, that's not great. I don't think that should be required. I'll try to take
> a look at why that's needed.

Well, it's only important on platforms where we can't restrict libpq.so from exporting all symbols. I don't know how close we are to deciding that such cases are no longer interesting to worry about. Makefile.shlib seems to know how to do it everywhere except Windows, and I imagine we know how to do it over in the MSVC scripts.

>> However, it's hard to test this, because the meson build
>> seems completely broken on current macOS:

> Looks like you need 1.2 for the new clang / ld output... Apparently apple's
> linker changed the format of its version output :/.

Ah, yeah, updating MacPorts again brought in meson 1.2.1, which seems to work. I now see a bunch of

    ld: warning: ignoring -e, not used for output type
    ld: warning: -undefined error is deprecated

which are unrelated. There's still one duplicate warning from the backend link:

    ld: warning: ignoring duplicate libraries: '-lpam'

I'm a bit baffled why that's showing up; there's no obvious double reference to pam.

regards, tom lane
On Thu, Sep 28, 2023 at 3:48 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> I'm not sure if the issue is on my side (mail servers are gmail's) or
> on the mailing list server - all traces I've checked indicate that the
> delay is somewhere in the delivery from postgres' last mail server to
> the first gmail mail server.
>
> I've only really noticed this sometime in the past few weeks. After
> sampling my mails, I found other examples of significant delays (>1h)
> for mails from well-respected hackers dating back to at least
> 2023-08-28.
I have noticed the same thing happening for the Gmail account that I use.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Thu, Sep 28, 2023 at 3:48 PM Matthias van de Meent <
> boekewurm+postgres@gmail.com> wrote:
>> I'm not sure if the issue is on my side (mail servers are gmail's) or
>> on the mailing list server - all traces I've checked indicate that the
>> delay is somewhere in the delivery from postgres' last mail server to
>> the first gmail mail server.
>>
>> I've only really noticed this sometime in the past few weeks. After
>> sampling my mails, I found other examples of significant delays (>1h)
>> for mails from well-respected hackers dating back to at least
>> 2023-08-28.

> I have noticed the same thing happening for the Gmail account that I use.

I have been seeing the same thing for a few days now, on my definitely-not-gmail personal server. Something's flaky in the PG mail infrastructure. It's gotten better since yesterday's outage, though I'm not convinced it's totally fixed.

regards, tom lane
On Fri, Sep 29, 2023 at 1:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
> > [...]
>
> > I have noticed the same thing happening for the Gmail account that I use.
>
> I have been seeing the same thing for a few days now, on my
> definitely-not-gmail personal server. Something's flaky in the
> PG mail infrastructure. It's gotten better since yesterday's
> outage, though I'm not convinced it's totally fixed.

There have been some pretty bad issues with gmail recently. Some changes have been deployed that will hopefully help mitigate those and make things better, but it takes time to recover.

The massive backlogs caused by gmail have been enough to spill over and affect other destinations as well, simply due to the load created, since we have such a huge number of gmail subscribers. But we're slowly seeing the backlogs shrink now and the load come down, so hopefully the changes made will continue to have effect and let us be back to normal soon.

-- 
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 29/09/2023 08:13, Magnus Hagander wrote:
> There have been some pretty bad issues with gmail recently. Some

Just curious - what sort of issues? I don't use gmail myself.

Ray.

-- 
Raymond O'Donnell // Galway // Ireland
ray@rodonnell.ie
Magnus Hagander <magnus@hagander.net> writes:
> On Fri, Sep 29, 2023 at 1:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have been seeing the same thing for a few days now, on my
>> definitely-not-gmail personal server. Something's flaky in the
>> PG mail infrastructure. It's gotten better since yesterday's
>> outage, though I'm not convinced it's totally fixed.

> There have been some pretty bad issues with gmail recently. Some
> changes have been deployed that will hopefully help mitigate those and
> make things better, but it takes time to recover.

> The massive backlogs caused by gmail have been enough to spill over
> and affect other destinations as well simply due to the load created
> since we have such a huge number of gmail subscribers. But we're
> slowly seeing the backlogs shrink now and the load come down so
> hopefully the changes made will continue to have effect and let us be
> back to normal soon.

I'm still seeing multi-hour delivery delays on a subset of traffic, like maybe half a dozen instances today.
Looking at the Received: timestamps shows pretty conclusively that the delays are within PG infra, for example this recent message from Heikki got hung up at two separate jumps:

Return-Path: <pgsql-hackers-owner+M15-507066@lists.postgresql.org>
Received: from malur.postgresql.org (malur.postgresql.org [217.196.149.56])
    by sss.pgh.pa.us (8.15.2/8.15.2) with ESMTPS id 392HruLZ2135620
    (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NOT)
    for <tgl@sss.pgh.pa.us>; Mon, 2 Oct 2023 13:53:57 -0400
Received: from localhost ([127.0.0.1] helo=malur.postgresql.org)
    by malur.postgresql.org with esmtp (Exim 4.94.2)
    (envelope-from <pgsql-hackers-owner+M15-507066@lists.postgresql.org>)
    id 1qnN7D-00GbGd-FB
    for tgl@sss.pgh.pa.us; Mon, 02 Oct 2023 17:53:55 +0000
Received: from makus.postgresql.org ([2001:4800:3e1:1::229])
    by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    (Exim 4.94.2)
    (envelope-from <hlinnaka@iki.fi>)
    id 1qnGcb-00AqOg-Ti
    for pgsql-hackers@lists.postgresql.org; Mon, 02 Oct 2023 10:57:53 +0000
Received: from meesny.iki.fi ([195.140.195.201])
    by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    (Exim 4.94.2)
    (envelope-from <hlinnaka@iki.fi>)
    id 1qnF5S-007kvc-AQ
    for pgsql-hackers@postgresql.org; Mon, 02 Oct 2023 09:19:35 +0000
Received: from [192.168.1.115] (dsl-hkibng22-54f8db-125.dhcp.inet.fi [84.248.219.125])
    (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
     key-exchange X25519 server-signature RSA-PSS (2048 bits))
    (No client certificate requested)
    (Authenticated sender: hlinnaka)
    by meesny.iki.fi (Postfix) with ESMTPSA id 4Rzb4d51FBzydx;
    Mon, 2 Oct 2023 12:19:29 +0300 (EEST)
Message-ID: <fe32d2a0-0998-d866-d6ee-2aed70b9be00@iki.fi>
Date: Mon, 2 Oct 2023 12:19:29 +0300
...

Also, my own message <2154347.1696278028@sss.pgh.pa.us> went out to -hackers about 25 minutes ago and hasn't come back, so based on other recent examples I'm betting I won't see it for hours.
Plenty of other traffic *is* coming through in normal-ish time, so I'm not sure I buy that there's still a massive logjam.

regards, tom lane
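[Editor's note] The per-hop latencies Tom reads off here can be computed mechanically from the Received: headers of a saved message. Below is a minimal sketch using only the Python standard library; `hop_delays` is a name invented for this example, and real-world Received: headers are sometimes malformed, so treat this as a debugging aid rather than a robust parser:

```python
# Sketch: compute per-hop delays from the Received: headers of an .eml file.
# Each relay prepends its own Received: header, so they appear newest-first;
# the hop's timestamp follows the final ';' in each header (RFC 5322 trace field).
from email import policy
from email.parser import BytesParser
from email.utils import parsedate_to_datetime

def hop_delays(eml_path):
    """Return a list of timedeltas, one per hop, oldest hop first."""
    with open(eml_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    stamps = []
    # Reverse so timestamps run in actual delivery order (oldest first).
    for hdr in reversed(msg.get_all("Received", [])):
        try:
            stamps.append(parsedate_to_datetime(str(hdr).rsplit(";", 1)[1].strip()))
        except (IndexError, TypeError, ValueError):
            continue  # header without a parseable date; skip it
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Run against the chain quoted above, the hops work out to roughly 6 seconds (meesny.iki.fi to makus), 1h38m (makus to malur), 6h56m (within malur, where the list delivery happened), and 2 seconds (malur to sss.pgh.pa.us), consistent with Tom's conclusion that the stalls were inside PG infrastructure.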
On Mon, Oct 2, 2023 at 4:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > There have been some pretty bad issues with gmail recently. Some
> > changes have been deployed that will hopefully help mitigate those and
> > make things better, but it takes time to recover.
> > [...]
>
> I'm still seeing multi-hour delivery delays on a subset of traffic,
> like maybe half a dozen instances today.
>
> Looking at the Received: timestamps shows pretty conclusively that
> the delays are within PG infra, for example this recent message from
> Heikki got hung up at two separate jumps:
>
> [full Received: chain quoted upthread]
>
> Also, my own message <2154347.1696278028@sss.pgh.pa.us> went
> out to -hackers about 25 minutes ago and hasn't come back,
> so based on other recent examples I'm betting I won't see it
> for hours.
>
> Plenty of other traffic *is* coming through in normal-ish time,
> so I'm not sure I buy that there's still a massive logjam.

There is still definitely a problem, but it is slowly recovering. It is *mostly* hitting gmail at this point, but there can be spillover to others in some cases (for example, there's a general throttling when the load on the server gets too high). In this particular case, it coincides timing-wise with our old friend the oom-killer nuking postgres on the machine, thereby stopping all incoming email for a while before it got moving again. That particular problem should have been taken care of completely by now; the general backlog/queueing problem is still ongoing but has been improving.

-- 
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Tue, Oct 3, 2023 at 2:31 PM Magnus Hagander <magnus@hagander.net> wrote:
> On Mon, Oct 2, 2023 at 4:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm still seeing multi-hour delivery delays on a subset of traffic,
> > like maybe half a dozen instances today.
> >
> > [...]
> >
> > Plenty of other traffic *is* coming through in normal-ish time,
> > so I'm not sure I buy that there's still a massive logjam.
>
> There is still definitely a problem, but it is slowly recovering. It
> is *mostly* hitting gmail at this point, but there can be spillover
> to others in some cases (for example, there's a general throttling
> when the load on the server gets too high). In this particular case,
> it coincides timing-wise with our old friend the oom-killer nuking
> postgres on the machine, thereby stopping all incoming email for a
> while before it got moving again. That particular problem should have
> been taken care of completely by now, but the general backlog/queueing
> problem is still ongoing but has been improving.

We *think* this issue has now been mostly resolved. We are still seeing some extra delays in deliveries to gmail right now, but that's due to *us* slowing down the deliveries so as not to trigger things. But we are now talking delays of minutes or tens of minutes, not hours or tens of hours. Non-gmail recipients should now be back to being mostly unaffected.

We're continuing to monitor the situation, of course, and to make careful modifications to bring us back to the quicker delivery times.

-- 
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/