Thread: Apparent bug in WAL summarizer process (hung state)

Apparent bug in WAL summarizer process (hung state)

From
Israel Barth Rubio
Date:
Hello,

Hope you are doing well.

I've been playing a bit with the incremental backup feature which might come as
part of the 17 release, and I think I hit a possible bug in the WAL summarizer
process.

The issue that I face refers to the summarizer process getting into a hung state.
When the issue is triggered, it keeps in an infinite loop trying to process a WAL
file that no longer exists.  It apparently comes up only when I perform changes to
`wal_summarize` GUC and reload Postgres, while there is some load in Postgres
which makes it recycle WAL files.

I'm running Postgres 17 in a Rockylinux 9 VM. In order to have less WAL files
available in `pg_wal` and make it easier to reproduce the issue, I'm using a low
value for `max_wal_size` ('100MB'). You can find below the steps that I took to
reproduce this problem, assuming this small `max_wal_size`, and `summarize_wal`
initially enabled:

```bash
# Assume we initially have max_wal_size = '100MB' and summarize_wal = on

# Create a table of ~ 100MB
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_1

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest

# Disable summarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO off"
psql -c "SELECT pg_reload_conf()"

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"

# Re-enable sumarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO on"
psql -c "SELECT pg_reload_conf()"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_2

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest
```

I'm able to reproduce the issue most of the time when running these steps
manually. It's harder to reproduce if I attempt to run those commands as a
bash script.

This is the sample output of a run of those commands:

```console
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"
SELECT 3000000
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_1
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331785/331785 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"
SELECT 3000000
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
111263/331720 kB (33%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO off"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
----------------
 t
(1 row)

(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"
SELECT 3000000
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO on"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
----------------
 t
(1 row)

(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_2
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331734/331734 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest
WARNING:  still waiting for WAL summarization through 2/C1000028 after 10 seconds
DETAIL:  Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C1000028 after 20 seconds
DETAIL:  Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C1000028 after 30 seconds
DETAIL:  Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C1000028 after 40 seconds
DETAIL:  Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C1000028 after 50 seconds
DETAIL:  Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C1000028 after 60 seconds
DETAIL:  Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.
WARNING:  aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR:  WAL summarization is not progressing
DETAIL:  Summarization is needed through 2/C1000028, but is stuck at 2/B30000D8 on disk and 2/B30000D8 in memory.
pg_basebackup: removing data directory "incremental_backup_2"
```

I took an `ls` output from `pg_wal` as well as `strace` and `gdb` from the WAL
summarizer process. I'm attaching that to this email hoping that can help
somehow.

FWIW, once I restart Postgres the WAL summarizer process gets back to normal
functioning. It seems to me there is some race condition between when a WAL file
is removed and when `summarize_wal` is re-enabled, causing the process to keep
looking for a WAL file that is the past.

Best regards,
Israel.

Re: Apparent bug in WAL summarizer process (hung state)

From
Israel Barth Rubio
Date:
I'm attaching the files which I missed in the original email.
Attachment

Re: Apparent bug in WAL summarizer process (hung state)

From
Michael Paquier
Date:
On Mon, Jun 24, 2024 at 02:56:00PM -0300, Israel Barth Rubio wrote:
> I've been playing a bit with the incremental backup feature which might
> come as
> part of the 17 release, and I think I hit a possible bug in the WAL
> summarizer
> process.

Thanks for testing new features and for this report!

> FWIW, once I restart Postgres the WAL summarizer process gets back to normal
> functioning. It seems to me there is some race condition between when a WAL
> file
> is removed and when `summarize_wal` is re-enabled, causing the process to
> keep
> looking for a WAL file that is the past.

I am adding an open item to track this issue, to make sure that this
is looked at.
--
Michael

Attachment

Re: Apparent bug in WAL summarizer process (hung state)

From
Robert Haas
Date:
On Mon, Jun 24, 2024 at 1:56 PM Israel Barth Rubio
<barthisrael@gmail.com> wrote:
> I've been playing a bit with the incremental backup feature which might come as
> part of the 17 release, and I think I hit a possible bug in the WAL summarizer
> process.
>
> The issue that I face refers to the summarizer process getting into a hung state.
> When the issue is triggered, it keeps in an infinite loop trying to process a WAL
> file that no longer exists.  It apparently comes up only when I perform changes to
> `wal_summarize` GUC and reload Postgres, while there is some load in Postgres
> which makes it recycle WAL files.

Yeah, this is a bug. It seems that the WAL summarizer process, when
restarted, wants to restart from wherever it was previously
summarizing WAL, which is correct if that WAL is still around, but if
summarize_wal has been turned off in the meanwhile, it might not be
correct. Here's a patch to fix that.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Apparent bug in WAL summarizer process (hung state)

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, this is a bug. It seems that the WAL summarizer process, when
> restarted, wants to restart from wherever it was previously
> summarizing WAL, which is correct if that WAL is still around, but if
> summarize_wal has been turned off in the meanwhile, it might not be
> correct. Here's a patch to fix that.

This comment seems to be truncated:

+    /*
+     * If we're the WAL summarizer, we always want to store the values we
+     * just computed into shared memory, because those are the values we're
+     * going to use to drive our operation, and so they are the authoritative
+     * values. Otherwise, we only store values into shared memory if they are
+     */
+    LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+    if (am_wal_summarizer|| !WalSummarizerCtl->initialized)
+    {

            regards, tom lane



Re: Apparent bug in WAL summarizer process (hung state)

From
Robert Haas
Date:
On Tue, Jun 25, 2024 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This comment seems to be truncated:

Thanks. New version attached.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Apparent bug in WAL summarizer process (hung state)

From
Israel Barth Rubio
Date:
> Yeah, this is a bug. It seems that the WAL summarizer process, when
> restarted, wants to restart from wherever it was previously
> summarizing WAL, which is correct if that WAL is still around, but if
> summarize_wal has been turned off in the meanwhile, it might not be
> correct. Here's a patch to fix that.

Thanks for checking this!

> Thanks. New version attached.

And besides that, thanks for the patch, of course!

I compiled Postgres locally with your patch. I attempted to break it several
times, both manually and through a shell script.

No success on that -- which in this case is actually success :)
The WAL summarizer seems able to always resume from a valid point,
so `pg_basebackup` isn't failing anymore.