Thread: pg_stat_io for the startup process
Hi, Regarding pg_stat_io for the startup process, I noticed that the counters are only incremented after the startup process exits, not during WAL replay in standby mode. This is because pgstat_flush_io() is only called when the startup process exits. Shouldn't it be called during WAL replay as well to report I/O statistics by the startup process even in standby mode? Also, the pg_stat_io view includes a row with backend_type=startup and context=vacuum, but it seems that the startup process doesn't perform any I/O operations with BAS_VACUUM. If this understanding is right, shouldn't we omit this row from the view? Additionally, I noticed that the view also includes rows with backend_type=startup and context=bulkread / bulkwrite. Do these operations actually occur in the startup process? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Tue, Apr 25, 2023 at 10:51:14PM +0900, Fujii Masao wrote: > Hi, > > Regarding pg_stat_io for the startup process, I noticed that the counters > are only incremented after the startup process exits, not during WAL replay > in standby mode. This is because pgstat_flush_io() is only called when > the startup process exits. Shouldn't it be called during WAL replay as well > to report I/O statistics by the startup process even in standby mode? Yes, we definitely want stats from the startup process on the standby. Elsewhere on the internet where you originally raised this, I mentioned that I hacked a pgstat_flush_io() into the redo apply loop in PerformWalRecovery() but that I wasn't sure that this was affordable. Andres Freund replied saying that it would be too expensive and suggested that we set up a regular timeout which sets a flag that's checked by HandleStartupProcInterrupts(). I'm wondering if this is something we consider a bug and thus would be under consideration for 16. > Also, the pg_stat_io view includes a row with backend_type=startup and > context=vacuum, but it seems that the startup process doesn't perform > any I/O operations with BAS_VACUUM. If this understanding is right, > shouldn't we omit this row from the view? Additionally, I noticed that > the view also includes rows with backend_type=startup and > context=bulkread / bulkwrite. Do these operations actually occur > in the startup process? Hmm. Yes, I remember posing this question on the thread and not getting an answer. I read some code and did some testing and can't see a way we would end up with the startup process doing IO in a non-normal context. Certainly I can't see how the startup process would ever use a BAS_VACUUM context given that it executes heap_xlog_vacuum(). I thought at some point I had encountered an assertion failure when I banned the startup process from tracking io operations in bulkread and bulkwrite contexts. But I'm not seeing how that could happen. - Melanie
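For concreteness, the hack described above amounts to roughly the following. This is a sketch reconstructed from the description, not the diff that was actually tried, and it illustrates why per-record flushing is costly:

    /* in PerformWalRecovery()'s main redo loop -- sketch only */
    do
    {
        /* ... existing checks, then apply the record ... */
        ApplyWalRecord(xlogreader, record, &replayTLI);

        /*
         * The hack: push the pending pg_stat_io counters to shared memory
         * after every record.  nowait = true skips the flush when the
         * stats lock is contended, but even so, doing this once per
         * record is what makes it too expensive.
         */
        pgstat_flush_io(true);

        record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
    } while (record != NULL);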
Hi, On 2023-04-25 13:54:43 -0400, Melanie Plageman wrote: > On Tue, Apr 25, 2023 at 10:51:14PM +0900, Fujii Masao wrote: > > Regarding pg_stat_io for the startup process, I noticed that the counters > > are only incremented after the startup process exits, not during WAL replay > > in standby mode. This is because pgstat_flush_io() is only called when > > the startup process exits. Shouldn't it be called during WAL replay as well > > to report I/O statistics by the startup process even in standby mode? > > Yes, we definitely want stats from the startup process on the standby. > Elsewhere on the internet where you originally raised this, I mentioned > that I hacked a pgstat_flush_io() into the redo apply loop in > PerformWalRecovery() but that I wasn't sure that this was affordable. > Andres Freund replied saying that it would be too expensive and > suggested that we set up a regular timeout which sets a flag that's > checked by HandleStartupProcInterrupts(). It's tempting to try to reuse the STARTUP_PROGRESS_TIMEOUT timer. But it's controlled by a GUC, so it's not really suitable. > I'm wondering if this is something we consider a bug and thus would be > under consideration for 16. I'm mildly inclined to not consider it a bug, given that this looks to have been true for other stats for quite a while? But it does still seem worth improving upon - I'd make the consideration when to apply the relevant patches depend on the complexity. I'm worried we'd need to introduce sufficiently new infrastructure that 16 doesn't seem like a good idea. Let's come up with a patch and judge it after? > > Also, the pg_stat_io view includes a row with backend_type=startup and > > context=vacuum, but it seems that the startup process doesn't perform > > any I/O operations with BAS_VACUUM. If this understanding is right, > > shouldn't we omit this row from the view? Additionally, I noticed that > > the view also includes rows with backend_type=startup and > > context=bulkread / bulkwrite. Do these operations actually occur > > in the startup process? > > Hmm. Yes, I remember posing this question on the thread and not getting > an answer. I read some code and did some testing and can't see a way we > would end up with the startup process doing IO in a non-normal context. > > Certainly I can't see how the startup process would ever use a BAS_VACUUM > context given that it executes heap_xlog_vacuum(). > > I thought at some point I had encountered an assertion failure when I > banned the startup process from tracking io operations in bulkread and > bulkwrite contexts. But I'm not seeing how that could happen. It's possible that we decided to not apply such restrictions because the startup process can be made to execute more code via the extensible rmgrs. Greetings, Andres Freund
On Tue, Apr 25, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote: > I'm mildly inclined to not consider it a bug, given that this looks to have > been true for other stats for quite a while? But it does still seem worth > improving upon - I'd make the consideration when to apply the relevant patches > depend on the complexity. I'm worried we'd need to introduce sufficiently new > infrastructure that 16 doesn't seem like a good idea. Let's come up with a > patch and judge it after? ISTM that it's pretty desirable to do something about this. If the process isn't going to report statistics properly, at least remove it from the view. If it can be made to report properly, that would be even better. But shipping a new view with information that will nearly always be zeroes instead of real data seems like a bad call, even if there are existing cases that have the same problem. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Apr 25, 2023 at 04:00:24PM -0400, Robert Haas wrote: > ISTM that it's pretty desirable to do something about this. If the > process isn't going to report statistics properly, at least remove it > from the view. If it can be made to report properly, that would be > even better. But shipping a new view with information that will nearly > always be zeroes instead of real data seems like a bad call, even if > there are existing cases that have the same problem. Agreed that reporting no information may be better than reporting incorrect information, even if it means an extra qual. As mentioned upthread, if this requires an extra design piece, adding the correct information had better be pushed to v17~. Perhaps an open item should be added? -- Michael
Attachment
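For what it's worth, the "extra qual" would not have to live in the view definition itself: the rows that pg_stat_get_io() emits are driven by the pgstat_tracks_io_*() checks in pgstat_io.c, so hiding the dubious combination could be a one-line gate there. A hypothetical sketch, not a posted patch:

    /* in pgstat_tracks_io_object(), or a similar gate -- hypothetical */
    if (bktype == B_STARTUP && io_context == IOCONTEXT_VACUUM)
        return false;       /* recovery never uses a BAS_VACUUM strategy */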
Hi, On 2023-04-25 16:00:24 -0400, Robert Haas wrote: > On Tue, Apr 25, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote: > > I'm mildly inclined to not consider it a bug, given that this looks to have > > been true for other stats for quite a while? But it does still seem worth > > improving upon - I'd make the consideration when to apply the relevant patches > > depend on the complexity. I'm worried we'd need to introduce sufficiently new > > infrastructure that 16 doesn't seem like a good idea. Let's come up with a > > patch and judge it after? > > ISTM that it's pretty desirable to do something about this. If the > process isn't going to report statistics properly, at least remove it > from the view. It's populated after crash recovery, when shutting down and at the time of promotion, so that isn't *completely* crazy. > If it can be made to report properly, that would be even better. But > shipping a new view with information that will nearly always be zeroes > instead of real data seems like a bad call, even if there are existing cases > that have the same problem. I refreshed my memory: The startup process has indeed behaved that way for much longer than pg_stat_io existed - but it's harder to spot, because the stats are more coarsely aggregated :/. And it's very oddly inconsistent: The startup process doesn't report per-relation read/hit (it might when we create a fake relcache entry, too lazy to see what happens exactly), because we key those stats by oid. However, it *does* report the read/write time. But only at process exit, of course. The weird part is that the startup process does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of basing those on pgBufferUsage.shared_blks_read etc, we compute them based on the relation level stats. pgBufferUsage is just used for EXPLAIN. This isn't recent, afaict. TL;DR: Currently the startup process maintains blk_read_time, blk_write_time, but doesn't maintain blks_read, blks_hit - which doesn't make sense. Yikes. Greetings, Andres Freund
At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund <andres@anarazel.de> wrote in > I refreshed my memory: The startup process has indeed behaved that way for > much longer than pg_stat_io existed - but it's harder to spot, because the > stats are more coarsely aggregated :/. And it's very oddly inconsistent: > > The startup process doesn't report per-relation read/hit (it might when we > create a fake relcache entry, too lazy to see what happens exactly), because we The key difference lies between relation-level and smgr-level; recovery doesn't call ReadBufferExtended. > key those stats by oid. However, it *does* report the read/write time. But > only at process exit, of course. The weird part is that the startup process > does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of > basing those on pgBufferUsage.shared_blks_read etc, we compute them based on > the relation level stats. pgBufferUsage is just used for EXPLAIN. This isn't > recent, afaict. I see four issues here. 1. The current database stats omit buffer fetches that don't originate from a relation. In this case pgstat_relation can't work since recovery isn't conscious of relids. We might be able to resolve a relfilenode into a relid, but it may not be that simple. Fortunately we already count fetches and hits process-wide using pgBufferUsage, so we can use this for database stats. 2. Even if we wanted to report stats for the startup process, pgstat_report_stat() wouldn't permit it since the transaction-end timestamp doesn't advance. I'm not certain if it's the correct approach, but perhaps we could use GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp() specifically for the startup process. 3. When should we call pgstat_report_stat() on the startup process? During recovery, I think we can call pgstat_report_stat() (or a subset of it) right before invoking WaitLatch and at segment boundaries. 4. In the existing ReadBuffer_common, there's an inconsistency in counting hits and reads between pgstat_io and pgBufferUsage. The difference comes from the case of RBM_ZERO pages. We should simply align them. > TL;DR: Currently the startup process maintains blk_read_time, blk_write_time, > but doesn't maintain blks_read, blks_hit - which doesn't make sense. As a result, the attached patch, which is meant for discussion, allows pg_stat_database to show fetches and reads by the startup process as the counts for the database with id 0. There's still some difference between pg_stat_io and pg_stat_database, but I haven't examined it in detail. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
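To make issue 1 above concrete, the pgBufferUsage-based accounting could look roughly like this. It is a loose sketch: the helper name and static bookkeeping are invented here, and whether pgstat_prep_database_pending() is the right entry point would need to be checked against the actual attachment.

    static PgStat_Counter last_fetched = 0;
    static PgStat_Counter last_hit = 0;

    static void
    flush_recovery_db_stats(void)       /* hypothetical helper */
    {
        PgStat_StatDBEntry *dbentry;
        PgStat_Counter fetched;
        PgStat_Counter hit;

        /* process-wide counters maintained by the buffer manager */
        fetched = pgBufferUsage.shared_blks_read + pgBufferUsage.shared_blks_hit;
        hit = pgBufferUsage.shared_blks_hit;

        /* recovery I/O isn't attributable to a single database: use id 0 */
        dbentry = pgstat_prep_database_pending(InvalidOid);
        dbentry->blocks_fetched += fetched - last_fetched;
        dbentry->blocks_hit += hit - last_hit;

        last_fetched = fetched;
        last_hit = hit;
    }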
On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > 3. When should we call pgstat_report_stat() on the startup process? > > During recovery, I think we can call pgstat_report_stat() (or a > subset of it) right before invoking WaitLatch and at segment > boundaries. I think this kind of idea is worth exploring. Andres mentioned timers, but it seems to me that something where we just do it at certain "convenient" points might be good enough and simpler. I'd much rather have statistics that were up to date as of the last time we finished a segment than have nothing at all. -- Robert Haas EDB: http://www.enterprisedb.com
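One such "convenient" point could be a segment crossing; a sketch, with placement and bookkeeping approximate:

    /* in the redo loop, after applying a record -- sketch */
    {
        static XLogSegNo last_seg = 0;
        XLogSegNo   seg;

        XLByteToSeg(xlogreader->EndRecPtr, seg, wal_segment_size);
        if (seg != last_seg)
        {
            last_seg = seg;
            pgstat_report_stat(false);  /* up to date as of the last segment */
        }
    }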
Hi all,
On Wed, Apr 26, 2023 at 15:34, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > 3. When should we call pgstat_report_stat() on the startup process?
> >
> > During recovery, I think we can call pgstat_report_stat() (or a
> > subset of it) right before invoking WaitLatch and at segment
> > boundaries.
> I think this kind of idea is worth exploring. Andres mentioned timers,
> but it seems to me that something where we just do it at certain
> "convenient" points might be good enough and simpler. I'd much rather
> have statistics that were up to date as of the last time we finished a
> segment than have nothing at all.
I created a rough prototype of a timer-based approach for comparison. Please see attached.
Registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The timeout's handler sets a flag indicating that I/O stats need to be flushed, and HandleStartupProcInterrupts checks the flag and flushes the stats.
It's enabled whenever a WAL record is read (in the main loop of PerformWalRecovery), and disabled before WaitLatch to avoid unnecessary wakeups if the startup process decides to sleep.
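In outline, the prototype's moving parts are roughly as follows (reconstructed from this description and the review quotes further down, so details may differ from the actual attachment):

    /* set from the timeout handler; signal-safe */
    static volatile sig_atomic_t startup_stat_need_flush = false;

    void
    startup_stat_flush_timeout_handler(void)
    {
        startup_stat_need_flush = true;
    }

    /* in HandleStartupProcInterrupts() */
    if (startup_stat_need_flush)
    {
        startup_stat_need_flush = false;
        pgstat_report_wal(true);    /* also flushes pending pg_stat_io counters */
    }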
With those changes, I see that startup-related rows in pg_stat_io are updated without waiting for the startup process to exit.
postgres=# select context, reads, extends, hits from pg_stat_io where backend_type = 'startup';
  context  | reads | extends | hits
-----------+-------+---------+------
 bulkread  |     0 |         |    0
 bulkwrite |     0 |       0 |    0
 normal    |     6 |       1 |   41
 vacuum    |     0 |       0 |    0
(4 rows)
I'm not sure this is the correct way to implement this approach though. I'd appreciate any comments.
Also, some questions about this implementation, if you think they're worth discussing:
1- I set an arbitrary timeout (1 sec) for testing. I don't know what the correct value should be. Does it make sense to use PGSTAT_MIN/MAX/IDLE_INTERVAL instead?
2- I'm also not sure whether this timeout should be registered at the beginning of StartupProcessMain, or whether it even makes any difference. I tried to do this just before the main redo loop in PerformWalRecovery, but that made the CI red.
Best,
Melih Mutlu
Microsoft
Attachment
Hi, On 2023-04-26 21:53:16 +0300, Melih Mutlu wrote: > On Wed, Apr 26, 2023 at 15:34, Robert Haas <robertmhaas@gmail.com> > wrote: > > > On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > 3. When should we call pgstat_report_stat() on the startup process? > > > > > > During recovery, I think we can call pgstat_report_stat() (or a > > > subset of it) right before invoking WaitLatch and at segment > > > boundaries. > > > > I think this kind of idea is worth exploring. Andres mentioned timers, > > but it seems to me that something where we just do it at certain > > "convenient" points might be good enough and simpler. I'd much rather > > have statistics that were up to date as of the last time we finished a > > segment than have nothing at all. > > > > I created a rough prototype of a timer-based approach for comparison. > Please see attached. Thanks! > 2- I'm also not sure whether this timeout should be registered at the beginning > of StartupProcessMain, or whether it even makes any difference. I tried to do > this just before the main redo loop in PerformWalRecovery, but that made > the CI red. Huh - do you have a link to the failure? That's how I would expect it to be done. > /* Unsupported old recovery command file names (relative to $PGDATA) */ > #define RECOVERY_COMMAND_FILE "recovery.conf" > @@ -1675,6 +1676,9 @@ PerformWalRecovery(void) > ereport_startup_progress("redo in progress, elapsed time: %ld.%02d s, current LSN: %X/%X", > LSN_FORMAT_ARGS(xlogreader->ReadRecPtr)); > > + /* Is this the right place to enable this? */ > + enable_startup_stat_flush_timeout(); > + I think we should try not to have additional code for this inside the loop - we should enable the timer once, when needed, not repeatedly. > #ifdef WAL_DEBUG > if (XLOG_DEBUG || > (record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) || > @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > /* Do background tasks that might benefit us later. */ > KnownAssignedTransactionIdsIdleMaintenance(); > > + /* > + * Need to disable flush timeout to avoid unnecessary > + * wakeups. Enable it again after a WAL record is read > + * in PerformWalRecovery. > + */ > + disable_startup_stat_flush_timeout(); > + > (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, > WL_LATCH_SET | WL_TIMEOUT | > WL_EXIT_ON_PM_DEATH, I think always disabling the timer here isn't quite right - we want to wake up *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending stats before waiting - potentially for a long time - for WAL. One way would be just explicitly report before the WaitLatch(). Another approach, I think better, would be to not use enable_timeout_every(), and to explicitly rearm the timer in HandleStartupProcInterrupts(). When called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't repeatedly fire between calls to HandleStartupProcInterrupts(). > diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c > index efc2580536..b250fa95f9 100644 > --- a/src/backend/postmaster/startup.c > +++ b/src/backend/postmaster/startup.c > @@ -72,6 +72,11 @@ static TimestampTz startup_progress_phase_start_time; > */ > static volatile sig_atomic_t startup_progress_timer_expired = false; > > +/* Indicates whether flushing stats is needed. */
> +static volatile sig_atomic_t startup_stat_need_flush = false; > + > +int pgstat_stat_flush_timeout = 1000; /* 1 sec ?? */ We probably should move the existing PGSTAT_MIN_INTERVAL constant from pgstat.c to pgstat.h. > +extern void enable_startup_stat_flush_timeout(void); > +extern void disable_startup_stat_flush_timeout(void); > +extern void startup_stat_flush_timeout_handler(void); > + > #endif /* _STARTUP_H */ > diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h > index e561a1cde9..a8d360e255 100644 > --- a/src/include/utils/timeout.h > +++ b/src/include/utils/timeout.h > @@ -35,6 +35,7 @@ typedef enum TimeoutId > IDLE_STATS_UPDATE_TIMEOUT, > CLIENT_CONNECTION_CHECK_TIMEOUT, > STARTUP_PROGRESS_TIMEOUT, > + STARTUP_STAT_FLUSH_TIMEOUT, > /* First user-definable timeout reason */ > USER_TIMEOUT, > /* Maximum number of timeout reasons */ I think we could just reuse IDLE_STATS_UPDATE_TIMEOUT? Greetings, Andres Freund
On Wed, Apr 26, 2023 at 2:53 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote: > > Hi all, > > On Wed, Apr 26, 2023 at 15:34, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi >> <horikyota.ntt@gmail.com> wrote: >> > 3. When should we call pgstat_report_stat() on the startup process? >> > >> > During recovery, I think we can call pgstat_report_stat() (or a >> > subset of it) right before invoking WaitLatch and at segment >> > boundaries. >> >> I think this kind of idea is worth exploring. Andres mentioned timers, >> but it seems to me that something where we just do it at certain >> "convenient" points might be good enough and simpler. I'd much rather >> have statistics that were up to date as of the last time we finished a >> segment than have nothing at all. > > > I created a rough prototype of a timer-based approach for comparison. Please see attached. > > Registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The timeout's > handler sets a flag indicating that I/O stats need to be flushed, and > HandleStartupProcInterrupts checks the flag and flushes the stats. It's enabled > whenever a WAL record is read (in the main loop of PerformWalRecovery), and it's > disabled before WaitLatch to avoid unnecessary wakeups if the startup process > decides to sleep. I was reviewing this and found that if I remove the calls to disable_startup_stat_flush_timeout() the number of calls to pgstat_report_wal() on a briefly active and then idle standby is about 8 in 100 seconds, whereas with timer disablement, the calls over the same period are about 40. I would have thought that disabling the timer would have caused us to try and flush the stats less often. With the calls to disable_startup_stat_flush_timeout(), we do have much fewer calls to setitimer(), of course. Given the below suggestion by Andres, I tried doing some traces of the various approaches. > > + disable_startup_stat_flush_timeout(); > > + > > (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, > > WL_LATCH_SET | WL_TIMEOUT | > > WL_EXIT_ON_PM_DEATH, > > I think always disabling the timer here isn't quite right - we want to wake up > *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending > stats before waiting - potentially for a long time - for WAL. One way would be > just explicitly report before the WaitLatch(). > > Another approach, I think better, would be to not use enable_timeout_every(), > and to explicitly rearm the timer in HandleStartupProcInterrupts(). When > called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so > at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't > repeatedly fire between calls to HandleStartupProcInterrupts(). After a quick example implementation of this, I found that it seemed to try and flush the stats less often on an idle standby (good) than using enable_timeout_every(). - Melanie
At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in > On Wed, Apr 26, 2023 at 2:53 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote: > > > > Hi all, > > > > Robert Haas <robertmhaas@gmail.com>, 26 Nis 2023 Çar, 15:34 tarihinde şunu yazdı: > >> > >> On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi > >> <horikyota.ntt@gmail.com> wrote: > >> > 3. When should we call pgstat_report_stats on the startup process? > >> > > >> > During recovery, I think we can call pgstat_report_stats() (or a > >> > subset of it) right before invoking WaitLatch and at segment > >> > boundaries. > >> > >> I think this kind of idea is worth exploring. Andres mentioned timers, > >> but it seems to me that something where we just do it at certain > >> "convenient" points might be good enough and simpler. I'd much rather > >> have statistics that were up to date as of the last time we finished a > >> segment than have nothing at all. > > > > > > I created a rough prototype of a timer-based approach for comparison. Please see attached. > > > > Registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT", The timeout's > > handler sets a flag indicating that io stats need to be flushed. > > HandleStartupProcInterrupts checks the flag to flush stats. It's enabled if > > any WAL record is read (in the main loop of PerformWalRecovery). And it's > > disabled before WaitLatch to avoid unnecessary wakeups if the startup process > > decides to sleep. > > I was reviewing this and found that if I remove the calls to > disable_startup_stat_flush_timeout() the number of calls to > pgstat_report_wal() on a briefly active and then idle standby are about > 8 in 100 seconds whereas with timer disablement, the calls over the same > period are about 40. I would have thought that disabling the timer would > have caused us to try and flush the stats less often. > > With the calls to disable_startup_stat_flush_timeout(), we do have much > fewer calls to setitimer(), of course. > > Given the below suggestion by Andres, I tried doing some traces of the > various approaches. > > > > + disable_startup_stat_flush_timeout(); > > > + > > > (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, > > > WL_LATCH_SET | WL_TIMEOUT | > > > WL_EXIT_ON_PM_DEATH, > > > > I think always disabling the timer here isn't quite right - we want to wake up > > *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending > > stats before waiting - potentially for a long time - for WAL. One way would be > > just explicitly report before the WaitLatch(). > > > > Another approach, I think better, would be to not use enable_timeout_every(), > > and to explicitly rearm the timer in HandleStartupProcInterrupts(). When > > called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so > > at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't > > repeatedly fire between calls to HandleStartupProcInterrupts(). > > After a quick example implementation of this, I found that it seemed to > try and flush the stats less often on an idle standby (good) than using > enable_timeout_every(). Just rearming with the full-interval will work that way. Our existing strategy for this is seen in PostgresMain(). stats_timeout = pgstat_report_stat(false); if (stats_timeout > 0) { if (!get_timeout_active(BLAH_TIMEOUT)) enable_timeout_after(BLAH_TIMEOUT, stats_timeout); } else { if (get_timeout_active(BLAH_TIMEOUT)) disable_timeout(BLAH_TIMEOUT, false); } WaitLatch(); regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 28 Apr 2023 11:15:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in > > After a quick example implementation of this, I found that it seemed to > > try and flush the stats less often on an idle standby (good) than using > > enable_timeout_every(). > > Just rearming with the full interval will work that way. Our existing > strategy for this is seen in PostgresMain(). > > stats_timeout = pgstat_report_stat(false); > if (stats_timeout > 0) > { > if (!get_timeout_active(BLAH_TIMEOUT)) > enable_timeout_after(BLAH_TIMEOUT, stats_timeout); > } > else > { > if (get_timeout_active(BLAH_TIMEOUT)) > disable_timeout(BLAH_TIMEOUT, false); > } > WaitLatch(); In my example, I left out idle-time flushing, but I realized we don't need the timeout mechanism here since we're already managing it. So the following should work (assuming the timestamp updates with GetCurrentTimestamp() in my last patch). @@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, /* Update pg_stat_recovery_prefetch before sleeping. */ XLogPrefetcherComputeStats(xlogprefetcher); + /* + * Report stats; if not time yet, set next WaitLatch to + * wake up at the next reporting time. + */ + wait_time = pgstat_report_stat(false); + + /* if no pending stats, sleep forever */ + if (wait_time == 0) + wait_time = -1L; + /* * Wait for more WAL to arrive, when we will be woken * immediately by the WAL receiver. */ (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, - -1L, + wait_time, WAIT_EVENT_RECOVERY_WAL_STREAM); regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
Hi,
On Thu, Apr 27, 2023 at 19:27, Andres Freund <andres@anarazel.de> wrote:
> Huh - do you have a link to the failure? That's how I would expect it to be
> done.
I think I should have registered it at the beginning of PerformWalRecovery() and not just before the main redo loop, since HandleStartupProcInterrupts is called before the loop too.
In any case, here's the error log [1].
> > #ifdef WAL_DEBUG
> > if (XLOG_DEBUG ||
> > (record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
> > @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
> > /* Do background tasks that might benefit us later. */
> > KnownAssignedTransactionIdsIdleMaintenance();
> >
> > + /*
> > + * Need to disable flush timeout to avoid unnecessary
> > + * wakeups. Enable it again after a WAL record is read
> > + * in PerformWalRecovery.
> > + */
> > + disable_startup_stat_flush_timeout();
> > +
> > (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
> > WL_LATCH_SET | WL_TIMEOUT |
> > WL_EXIT_ON_PM_DEATH,
> I think always disabling the timer here isn't quite right - we want to wake up
> *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
> stats before waiting - potentially for a long time - for WAL. One way would be
> just explicitly report before the WaitLatch().
> Another approach, I think better, would be to not use enable_timeout_every(),
> and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
> called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
> at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't
> repeatedly fire between calls to HandleStartupProcInterrupts().
The attached patch is probably not doing what you asked. IIUC, HandleStartupProcInterrupts should arm the timer, but shouldn't arm it if it's called from WaitForWALToBecomeAvailable. But the timer should be armed again at the very end of WaitForWALToBecomeAvailable. I've been thinking about how to do this properly. Should HandleStartupProcInterrupts take a parameter to decide whether the timer needs to be armed? Or do we need to add an additional global flag to rearm the timer? Any thoughts?
Best,
Melih Mutlu
Microsoft
Attachment
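For the first option Melih asks about above, a hypothetical sketch (the parameter name is invented, and the v2 patch quoted below took a different route):

    void
    HandleStartupProcInterrupts(bool rearm_stats_timer)
    {
        /* ... existing interrupt handling ... */

        if (idle_stats_update_pending)
        {
            idle_stats_update_pending = false;
            pgstat_report_wal(true);
        }
        else if (rearm_stats_timer &&
                 !get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
            enable_idle_stats_update_timeout();
    }

    /*
     * WaitForWALToBecomeAvailable() would pass rearm_stats_timer = false
     * at its call site and rearm the timer once itself, after its
     * WaitLatch() returns.
     */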
I've reviewed both your patch versions and responded to the ideas in both of them and the associated emails below. On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund <andres@anarazel.de> wrote in > > key those stats by oid. However, it *does* report the read/write time. But > > only at process exit, of course. The weird part is that the startup process > > does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of > > basing those on pgBufferUsage.shared_blks_read etc, we compute them based on > > the relation level stats. pgBufferUsage is just used for EXPLAIN. This isn't > > recent, afaict. > > I see four issues here. > > 1. The current database stats omit buffer fetches that don't > originate from a relation. > > In this case pgstat_relation can't work since recovery isn't conscious > of relids. We might be able to resolve a relfilenode into a relid, but > it may not be that simple. Fortunately we already count fetches and > hits process-wide using pgBufferUsage, so we can use this for database > stats. ... > > TL;DR: Currently the startup process maintains blk_read_time, blk_write_time, > > but doesn't maintain blks_read, blks_hit - which doesn't make sense. > > As a result, the attached patch, which is meant for discussion, allows > pg_stat_database to show fetches and reads by the startup process as > the counts for the database with id 0. I would put this in its own patch in a patchset. Of course it relies on having pgstat_report_stat() called at appropriate times by the startup process, but having pg_stat_database show read/hit counts is a separate issue from having pg_stat_io do so. I'm not suggesting we do this, but you could argue that, if we fix the startup process stats reporting, pg_stat_database not showing reads and hits for the startup process is a bug that also exists in 15. Not directly related, but I do not get why the existing stats counters for pg_stat_database count "fetches" and "hits" and then use subtraction to get the number of reads. I find it confusing, and it seems like it could lead to subtle inconsistencies, compared with counting reads closer to where they are actually happening. > 2. Even if we wanted to report stats for the startup process, > pgstat_report_stat() wouldn't permit it since the transaction-end > timestamp doesn't advance. > > I'm not certain if it's the correct approach, but perhaps we could use > GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp() > specifically for the startup process. In theory, since all of the approaches proposed in this thread would exercise rigid control over how often we flush stats in the startup process, I think it is okay to use GetCurrentTimestamp() when pgstat_report_stat() is called by the startup process (i.e. we don't have to worry about the overhead of doing it). But looking at it implemented in the patch made me feel unsettled for some reason. > 3. When should we call pgstat_report_stat() on the startup process? > > During recovery, I think we can call pgstat_report_stat() (or a > subset of it) right before invoking WaitLatch and at segment > boundaries. I see in the patch you call pgstat_report_stat() in XLogPageRead(). Will this only be called on segment boundaries? > 4. In the existing ReadBuffer_common, there's an inconsistency in > counting hits and reads between pgstat_io and pgBufferUsage. > > The difference comes from the case of RBM_ZERO pages. We should simply align them.
I would definitely make this a separate patch and probably a separate thread. It isn't related to the startup process and is worth a separate discussion. On Thu, Apr 27, 2023 at 10:43 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 28 Apr 2023 11:15:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in > > > After a quick example implementation of this, I found that it seemed to > > > try and flush the stats less often on an idle standby (good) than using > > > enable_timeout_every(). > > > > Just rearming with the full interval will work that way. Our existing > > strategy for this is seen in PostgresMain(). > > > > stats_timeout = pgstat_report_stat(false); > > if (stats_timeout > 0) > > { > > if (!get_timeout_active(BLAH_TIMEOUT)) > > enable_timeout_after(BLAH_TIMEOUT, stats_timeout); > > } > > else > > { > > if (get_timeout_active(BLAH_TIMEOUT)) > > disable_timeout(BLAH_TIMEOUT, false); > > } > > WaitLatch(); > > In my example, I left out idle-time flushing, but I realized we don't > need the timeout mechanism here since we're already managing it. So > the following should work (assuming the timestamp updates with > GetCurrentTimestamp() in my last patch). > > @@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > /* Update pg_stat_recovery_prefetch before sleeping. */ > XLogPrefetcherComputeStats(xlogprefetcher); > > + /* > + * Report stats; if not time yet, set next WaitLatch to > + * wake up at the next reporting time. > + */ > + wait_time = pgstat_report_stat(false); > + > + /* if no pending stats, sleep forever */ > + if (wait_time == 0) > + wait_time = -1L; > + > /* > * Wait for more WAL to arrive, when we will be woken > * immediately by the WAL receiver. > */ > (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, > WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, > - -1L, > + wait_time, > WAIT_EVENT_RECOVERY_WAL_STREAM); The idle-time flushing is a great point I did not think of. I think Andres did have some concern with unconditionally calling pgstat_report_stat() in WaitForWalToBecomeAvailable() before WaitLatch() -- I believe because it would be called too often and attempt flushing multiple times between calls to HandleStartupProcInterrupts(). - Melanie
On Wed, May 03, 2023 at 04:11:33PM +0300, Melih Mutlu wrote: > Andres Freund <andres@anarazel.de>, 27 Nis 2023 Per, 19:27 tarihinde şunu yazdı: > > #ifdef WAL_DEBUG > > > if (XLOG_DEBUG || > > > (record->xl_rmid == RM_XACT_ID && > > trace_recovery_messages <= DEBUG2) || > > > @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, > > bool randAccess, > > > /* Do background tasks > > that might benefit us later. */ > > > > > KnownAssignedTransactionIdsIdleMaintenance(); > > > > > > + /* > > > + * Need to disable flush > > timeout to avoid unnecessary > > > + * wakeups. Enable it > > again after a WAL record is read > > > + * in PerformWalRecovery. > > > + */ > > > + > > disable_startup_stat_flush_timeout(); > > > + > > > (void) > > WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, > > > > > WL_LATCH_SET | WL_TIMEOUT | > > > > > WL_EXIT_ON_PM_DEATH, > > > > I think always disabling the timer here isn't quite right - we want to > > wake up > > *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending > > stats before waiting - potentially for a long time - for WAL. One way > > would be > > just explicitly report before the WaitLatch(). > > > > Another approach, I think better, would be to not use > > enable_timeout_every(), > > and to explicitly rearm the timer in HandleStartupProcInterrupts(). When > > called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do > > so > > at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't > > repeatedly fire between calls to HandleStartupProcInterrupts(). > > > > Attached patch is probably not doing what you asked. IIUC > HandleStartupProcInterrupts should arm the timer but also shouldn't arm it > if it's called from WaitForWALToBecomeAvailable. But the timer should be > armed again at the very end of WaitForWALToBecomeAvailable. I've been > thinking about how to do this properly. Should HandleStartupProcInterrupts > take a parameter to decide whether the timer needs to be armed? Or need to > add an additional global flag to rearm the timer? Any thoughts? I had the same question about how to determine whether or not to rearm. > From 9be7360e49db424c45c53e85efe8a4f5359b5244 Mon Sep 17 00:00:00 2001 > From: Melih Mutlu <m.melihmutlu@gmail.com> > Date: Wed, 26 Apr 2023 18:21:32 +0300 > Subject: [PATCH v2] Add timeout to flush stats during startup's main replay > loop > diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c > index efc2580536..842394bc8f 100644 > --- a/src/backend/postmaster/startup.c > +++ b/src/backend/postmaster/startup.c > @@ -72,6 +72,9 @@ static TimestampTz startup_progress_phase_start_time; > */ > static volatile sig_atomic_t startup_progress_timer_expired = false; > > +/* Indicates whether flushing stats is needed. */ > +static volatile sig_atomic_t idle_stats_update_pending = false; > + > /* > * Time between progress updates for long-running startup operations. > */ > @@ -206,6 +209,18 @@ HandleStartupProcInterrupts(void) > /* Perform logging of memory contexts of this process */ > if (LogMemoryContextPending) > ProcessLogMemoryContextInterrupt(); > + > + if (idle_stats_update_pending) > + { > + /* It's time to report wal stats. */ If we want dbstats to be updated, we'll probably have to call pgstat_report_stat() here and fix the timestamp issue Horiguchi-san points out upthread. Perhaps you could review those changes and consider adding those as preliminary patches before this in a set. 
I think you will then need to handle the issue he mentions with partial flushes coming from pgstat_report_stat() and remembering to try and flush stats again in case of a partial flush. Though, perhaps we can just pass force=true. > + pgstat_report_wal(true); > + idle_stats_update_pending = false; > + } Good that you thought to check if the timeout was already active. > + else if (!get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT)) > + { > + /* Set the next timeout. */ > + enable_idle_stats_update_timeout(); > + } > } > > > @@ -385,3 +400,22 @@ has_startup_progress_timeout_expired(long *secs, int *usecs) > > return true; > } > + > +/* Set a flag indicating that it's time to flush wal stats. */ > +void > +idle_stats_update_timeout_handler(void) > +{ > + idle_stats_update_pending = true; Is WakeupRecovery() needed when the timer goes off and the startup process is waiting on a latch? Does this happen in WaitForWalToBecomeAvailable()? > + WakeupRecovery(); > +} > + > +/* Enable the timeout set for wal stat flush. */ > +void > +enable_idle_stats_update_timeout(void) > +{ > + TimestampTz fin_time; > + > + fin_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), > + PGSTAT_MIN_INTERVAL); It is a shame we have to end up calling GetCurrentTimestamp() since we are using enable_timeout_at(). Couldn't we use enable_timeout_after()? > + enable_timeout_at(IDLE_STATS_UPDATE_TIMEOUT, fin_time); > +} - Melanie
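What Melanie suggests at the end would shrink the enable function to a relative arm; enable_timeout_after() still reads the clock internally, but the call site no longer has to (sketch):

    void
    enable_idle_stats_update_timeout(void)
    {
        enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT, PGSTAT_MIN_INTERVAL);
    }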
Hi, On 2023-04-26 18:47:14 +0900, Kyotaro Horiguchi wrote: > I see four issues here. > > 1. The current database stats omit buffer fetches that don't > originate from a relation. > > In this case pgstat_relation can't work since recovery isn't conscious > of relids. We might be able to resolve a relfilenode into a relid, but > it may not be that simple. Fortunately we already count fetches and > hits process-wide using pgBufferUsage, so we can use this for database > stats. I don't think we need to do anything about that for 16 - they aren't updated at process end either. I think the fix here is to do the architectural change of maintaining most stats keyed by relfilenode as we've discussed in some other threads. Then we also can have relation-level write stats etc. > 2. Even if we wanted to report stats for the startup process, > pgstat_report_stat() wouldn't permit it since the transaction-end > timestamp doesn't advance. > > I'm not certain if it's the correct approach, but perhaps we could use > GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp() > specifically for the startup process. What about using GetCurrentTimestamp() when force == true? That'd make sense for other users as well, I think? > 3. When should we call pgstat_report_stat() on the startup process? > > During recovery, I think we can call pgstat_report_stat() (or a > subset of it) right before invoking WaitLatch and at segment > boundaries. I've pondered that as well. But I don't think it's great - it's not exactly intuitive that stats reporting gets far less common if you use a 1GB wal_segment_size. Greetings, Andres Freund
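Concretely, the force == true suggestion amounts to something like this inside pgstat_report_stat(), with the surrounding flush logic elided (sketch):

    /*
     * The transaction stop timestamp can be stale when a flush is forced --
     * and in the startup process it never advances at all -- so only rely
     * on it for the unforced path.
     */
    if (force)
        now = GetCurrentTimestamp();
    else
        now = GetCurrentTransactionStopTimestamp();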
Thanks for clarifying! At Mon, 8 May 2023 14:46:43 -0700, Andres Freund <andres@anarazel.de> wrote in > Hi, > > On 2023-04-26 18:47:14 +0900, Kyotaro Horiguchi wrote: > > I see four issues here. > > > > 1. The current database stats omit buffer fetches that don't > > originate from a relation. > > > > In this case pgstat_relation can't work since recovery isn't conscious > > of relids. We might be able to resolve a relfilenode into a relid, but > > it may not be that simple. Fortunately we already count fetches and > > hits process-wide using pgBufferUsage, so we can use this for database > > stats. > > I don't think we need to do anything about that for 16 - they aren't updated > at process end either. > > I think the fix here is to do the architectural change of maintaining most > stats keyed by relfilenode as we've discussed in some other threads. Then we > also can have relation-level write stats etc. I think so. > > 2. Even if we wanted to report stats for the startup process, > > pgstat_report_stat() wouldn't permit it since the transaction-end > > timestamp doesn't advance. > > > > I'm not certain if it's the correct approach, but perhaps we could use > > GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp() > > specifically for the startup process. > > What about using GetCurrentTimestamp() when force == true? That'd make sense > for other users as well, I think? I'm not sure if I got you right, but when force==true, it allows pgstat_report_stat() to flush without considering whether the interval has elapsed. In that case, there's no need to keep track of the last flush time, and the caller should handle the interval instead. > > 3. When should we call pgstat_report_stat() on the startup process? > > > > During recovery, I think we can call pgstat_report_stat() (or a > > subset of it) right before invoking WaitLatch and at segment > > boundaries. > > I've pondered that as well. But I don't think it's great - it's not exactly > intuitive that stats reporting gets far less common if you use a 1GB > wal_segment_size. If the segment size gets larger, the archive intervals become longer. So, I have a vague feeling that users wouldn't go for such a large segment size. I don't have a clear idea about the ideal length for stats reporting intervals in this case, but I think every few minutes or so would not be so bad for the startup process to report stats when recovery gets busy. Also, I think recovery will often wait for new data once it catches up to the primary. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, I don't really feel confident we're going to get the timeout approach solid enough for something going into the tree around beta 1. How about this, imo a lot simpler, approach: We flush stats whenever replaying a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at a regular interval. If the primary is idle, we don't need to flush stats in the startup process, because we'll not have done any io. We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't get regularly flushed if wal_level = minimal - but in that case the stats are also not accessible, so that's not a problem. It's not the prettiest solution, but I think the simplicity is worth a lot. Greetings, Andres Freund
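In code, the proposal is essentially a one-liner in the XLOG_RUNNING_XACTS branch of standby_redo(); a sketch, close to though not necessarily identical to what was eventually committed:

    /* in standby_redo(), once the record's snapshot has been applied */
    ProcArrayApplyRecoveryInfo(&running);

    /*
     * Unless the primary is idle, XLOG_RUNNING_XACTS is logged at a
     * regular interval (the bgwriter's LOG_SNAPSHOT_INTERVAL_MS), making
     * this a bounded-frequency spot for the startup process to flush its
     * pending stats.
     */
    pgstat_report_stat(true);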
At Sun, 21 May 2023 15:14:23 -0700, Andres Freund <andres@anarazel.de> wrote in > Hi, > > I don't really feel confident we're going to get the timeout approach solid > enough for something going into the tree around beta 1. > > How about this, imo a lot simpler, approach: We flush stats whenever replaying > a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at > a regular interval. If the primary is idle, we don't need to flush stats in > the startup process, because we'll not have done any io. > > We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't > get regularly flushed if wal_level = minimal - but in that case the stats are > also not accessible, so that's not a problem. I concur on all three aspects: interval regularity, reliability, and the minimal wal_level case. While I can't confidently claim it is the best, its simplicity gives us a solid reason to proceed with it. If necessary, we can switch to alternative timing sources in the future without causing major disruptions for users, I believe. > It's not the prettiest solution, but I think the simplicity is worth a lot. > > Greetings, +1. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, On 2023-05-22 13:36:42 +0900, Kyotaro Horiguchi wrote: > At Sun, 21 May 2023 15:14:23 -0700, Andres Freund <andres@anarazel.de> wrote in > > Hi, > > > > I don't really feel confident we're going to get the timeout approach solid > > enough for something going into the tree around beta 1. > > > > How about this, imo a lot simpler, approach: We flush stats whenever replaying > > a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at > > a regular interval. If the primary is idle, we don't need to flush stats in > > the startup process, because we'll not have done any io. > > > > We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't > > get regularly flushed if wal_level = minimal - but in that case the stats are > > also not accessible, so that's not a problem. > > I concur on all three aspects: interval regularity, reliability, and > the minimal wal_level case. While I can't confidently > claim it is the best, its simplicity gives us a solid reason to > proceed with it. If necessary, we can switch to alternative timing > sources in the future without causing major disruptions for users, I > believe. > > > It's not the prettiest solution, but I think the simplicity is worth a lot. > > > > Greetings, > > +1. Attached is a patch implementing this approach. It's imo always a separate bug that we were using GetCurrentTransactionStopTimestamp() when force was passed in - that timestamp can be quite out of date in some cases. But I don't immediately see a noticeable consequence, so ... Greetings, Andres Freund
Attachment
At Fri, 9 Jun 2023 21:28:23 -0700, Andres Freund <andres@anarazel.de> wrote in > Hi, > > On 2023-05-22 13:36:42 +0900, Kyotaro Horiguchi wrote: > > At Sun, 21 May 2023 15:14:23 -0700, Andres Freund <andres@anarazel.de> wrote in > > > Hi, > > > > > > I don't really feel confident we're going to get the timeout approach solid > > > enough for something going into the tree around beta 1. > > > > > > How about this, imo a lot simpler, approach: We flush stats whenever replaying > > > a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at > > > a regular interval. If the primary is idle, we don't need to flush stats in > > > the startup process, because we'll not have done any io. > > > > > > We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't > > > get regularly flushed if wal_level = minimal - but in that case the stats are > > > also not accessible, so that's not a problem. > > > > I concur on all three aspects: interval regularity, reliability, and > > the minimal wal_level case. While I can't confidently > > claim it is the best, its simplicity gives us a solid reason to > > proceed with it. If necessary, we can switch to alternative timing > > sources in the future without causing major disruptions for users, I > > believe. > > > > > It's not the prettiest solution, but I think the simplicity is worth a lot. > > > > > > Greetings, > > > > +1. > > Attached is a patch implementing this approach. > > It's imo always a separate bug that we were using > GetCurrentTransactionStopTimestamp() when force was passed in - that timestamp > can be quite out of date in some cases. But I don't immediately see a > noticeable consequence, so ... Thanks! I think the reason is that we only pass true when we're in a sort of idle time. In addition, XLOG_RUNNING_XACTS comes infrequently enough not to cause noticeable degradation. As long as it produces sufficiently frequent updates, we will be satisfied with it. The patch is quite simple and looks good to me. (At least way better than always using GetCurrentTimestamp():) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, On 2023-06-13 12:54:19 +0900, Kyotaro Horiguchi wrote: > I think that the reason is that we only pass true only when we're in a > sort of idle time. Addition to that XLOG_RUNNING_XACTS comes so > infrequently to cause noticeable degradation. If it causes > sufficiently frequent updates, we will be satisfied with it. > > The patch is quite simple and looks good to me. (At least way better > than always using GetCurrentTimestamp():) Thanks for looking - I already had pushed the commit by the time I read your email, otherwise I'd have mentioned you reviewing it... Greetings, Andres Freund
At Mon, 12 Jun 2023 21:13:35 -0700, Andres Freund <andres@anarazel.de> wrote in > Thanks for looking - I already had pushed the commit by the time I read your > email, otherwise I'd have mentioned you reviewing it... Oops! But I anticipated that and that's no problem (I should have checked the commit log). Thanks for the reply! regards. -- Kyotaro Horiguchi NTT Open Source Software Center