Thread: Assertion failure when autovacuum drops orphan temp indexes.
Hi, I got the following assertion failure from an autovaucum worker: 2024-10-26 00:32:03.685 PDT [830672] LOG: autovacuum: dropping orphan temp table "postgres.pg_temp_0.hoge" TRAP: failed Assert("HaveRegisteredOrActiveSnapshot()"), File: "index.c", Line: 2345, PID: 830672 Here is the stack trace I got: (lldb) bt * thread #1, name = 'postgres', stop reason = signal SIGABRT * frame #0: 0x00007f8ab215952f libc.so.6`raise + 271 frame #1: 0x00007f8ab212ce65 libc.so.6`abort + 295 frame #2: 0x0000000000bd636e postgres`ExceptionalCondition(conditionName="", fileName="", lineNumber=2345) at assert.c:66:2 frame #3: 0x00000000005de1fb postgres`index_drop(indexId=16389, concurrent=false, concurrent_lock_mode=false) at index.c:2345:3 frame #4: 0x00000000005d0d7d postgres`doDeletion(object=0x0000000002a96ab0, flags=21) at dependency.c:1367:6 frame #5: 0x00000000005d088c postgres`deleteOneObject(object=0x0000000002a96ab0, depRel=0x00007ffe04079260, flags=21) at dependency.c:1274:2 frame #6: 0x00000000005cf243 postgres`deleteObjectsInList(targetObjects=0x0000000002a9cf48, depRel=0x00007ffe04079260, flags=21) at dependency.c:229:3 frame #7: 0x00000000005cf2f8 postgres`performDeletion(object=0x00007ffe04079374, behavior=DROP_CASCADE, flags=21) at dependency.c:314:2 frame #8: 0x000000000092302e postgres`do_autovacuum at autovacuum.c:2215:3 frame #9: 0x00000000009220ee postgres`AutoVacWorkerMain(startup_data=0x0000000000000000, startup_data_len=0) at autovacuum.c:1571:3 frame #10: 0x0000000000928742 postgres`postmaster_child_launch(child_type=B_AUTOVAC_WORKER, startup_data=0x0000000000000000, startup_data_len=0, client_sock=0x0000000000000000)at launch_backend.c:274:3 frame #11: 0x000000000092e0c9 postgres`StartChildProcess(type=B_AUTOVAC_WORKER) at postmaster.c:3736:8 frame #12: 0x000000000092e1a7 postgres`StartAutovacuumWorker at postmaster.c:3790:14 frame #13: 0x000000000092df72 postgres`process_pm_pmsignal at postmaster.c:3628:3 frame #14: 0x000000000092b5b8 postgres`ServerLoop at postmaster.c:1646:5 frame #15: 0x000000000092afed postgres`PostmasterMain(argc=5, argv=0x00000000029c83e0) at postmaster.c:1351:11 frame #16: 0x00000000007eaa4d postgres`main(argc=5, argv=0x00000000029c83e0) at main.c:197:3 frame #17: 0x00007f8ab21457e5 libc.so.6`__libc_start_main + 229 frame #18: 0x00000000004a44be postgres`_start + 46 The reproducible steps are: 1. execute "create temp table test (a int primary key)" and keep the psql session open. 2. shutdown the server in immediate mode. 3. restart the server. 4. when cleaning up the orphan temp table, autovacuum worker raises the assertion failure. This assertion was recently introduced in commit b52adbad467. cc-ed Nathan as committer of the related commit. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
> IMHO the best way to handle this is to just unconditionally push a snapshot
> in this code path instead of making assumptions about what callers will do.
Yes, I agree! I have found the same solution. I attempted to write Perl tests to verify the patch fix, but the autovacuum process is not triggered in my tests. Could you please assist me?
```
use strict;
use warnings;
use threads;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::RecursiveCopy;
use PostgreSQL::Test::Utils;
use Test::More;
use Data::Dumper;
my $node = PostgreSQL::Test::Cluster->new('main');
# Create a data directory with initdb
$node->init;
$node->append_conf(
'postgresql.conf', qq[
autovacuum = on
track_counts=on
autovacuum_naptime = 1s
autovacuum_max_workers = 1
autovacuum_vacuum_threshold = 1
]);
# Start the PostgreSQL server
$node->start;
my $psql1 = $node->interactive_psql('postgres');
$psql1->query("create temp table test (a int primary key);");
$node->stop('immediate');
$node->start();
sleep(5);
$node->restart();
ok(1);
done_testing();
```
```
Best Regards, Stepan Neretin!
On Sat, Oct 26, 2024 at 12:13 PM Stepan Neretin <sndcppg@gmail.com> wrote: > > > IMHO the best way to handle this is to just unconditionally push a snapshot > > in this code path instead of making assumptions about what callers will do. +1 > > Yes, I agree! I have found the same solution. I attempted to write Perl tests to verify the patch fix, but the autovacuumprocess is not triggered in my tests. Could you please assist me? > > ``` > use strict; > use warnings; > use threads; > use PostgreSQL::Test::Cluster; > use PostgreSQL::Test::RecursiveCopy; > use PostgreSQL::Test::Utils; > use Test::More; > use Data::Dumper; > > my $node = PostgreSQL::Test::Cluster->new('main'); > > # Create a data directory with initdb > $node->init; > > $node->append_conf( > 'postgresql.conf', qq[ > autovacuum = on > track_counts=on > autovacuum_naptime = 1s > autovacuum_max_workers = 1 > autovacuum_vacuum_threshold = 1 > ]); > > # Start the PostgreSQL server > $node->start; > > my $psql1 = $node->interactive_psql('postgres'); > $psql1->query("create temp table test (a int primary key);"); > > $node->stop('immediate'); > $node->start(); IIUC after an immediate shutdown all pgstat entries are wiped out so the server doesn't have any pgstat entries for databases at this point. And since we don't run autovacuum on databases that have no pg_stat entries, no autovacuum worker worked on the 'postgres' database. Please try executing any query (e.g. 'select 1') on the 'postgres' database after the restart, which creates a pgstat entry for the database. > sleep(5); While the test script sleeps for 5 seconds, the server restarts after a crash. So even if the assertion failure happens, the test would appear to be successful. I think you can set 'restart_after_crash = off' and execute another query using safe_psql() after the sleep. That way, the test ends up with safe_psql() failure because the database server is not running. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
IIUC after an immediate shutdown all pgstat entries are wiped out so
the server doesn't have any pgstat entries for databases at this
point. And since we don't run autovacuum on databases that have no
pg_stat entries, no autovacuum worker worked on the 'postgres'
database. Please try executing any query (e.g. 'select 1') on the
'postgres' database after the restart, which creates a pgstat entry
for the database.
> sleep(5);
While the test script sleeps for 5 seconds, the server restarts after
a crash. So even if the assertion failure happens, the test would
appear to be successful. I think you can set 'restart_after_crash =
off' and execute another query using safe_psql() after the sleep. That
way, the test ends up with safe_psql() failure because the database
server is not running.
Hi, thank you for your suggestions! But they did not help me. Autovacuum does not want to start :(
```
```
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
autovacuum = on
autovacuum_naptime = 1s
autovacuum_max_workers = 1
restart_after_crash = off
]);
$node->start;
my $psql1 = $node->interactive_psql('postgres');
$psql1->query("create temp table test (a int primary key);");
$node->stop('immediate');
sleep(5);
$node->start;
sleep(3);
$node->restart;
my $psql2 = $node->interactive_psql('postgres');
$psql2->query('SELECT 1;');
$psql2->query('SELECT 1;');
my $regexp = qr/autovacuum/;
my $offset = 0;
$node->wait_for_log($regexp, $offset);
done_testing();
```
Best Regards, Stepan Neretin.
The assertion failure happens in an autovacuum worker. So if you are
looking for a test that can be integrated in the tree, you could get
some inspiration from 006_signal_autovacuum.pl and rely on an
injection point wait with the existing autovacuum-worker-start. My
2c, as it looks easy enough to avoid the hardcoded waits.
--
Greetings, I appreciate your proposal. I have attempted to implement your suggestion, but unfortunately it did not yield the desired results.
```
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
if ($ENV{enable_injection_points} ne 'yes') {
plan skip_all => 'Injection points not supported by this build';
}
my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
autovacuum = on
autovacuum_naptime = 1
autovacuum_max_workers = 1
]);
$node->start;
if (!$node->check_extension('injection_points')) {
plan skip_all => 'Extension injection_points not installed';
}
my $psql1 = $node->interactive_psql('postgres');
$psql1->query("create temp table test (a int primary key);");
$node->stop('immediate');
$node->start;
$node->restart;
my $psql2 = $node->interactive_psql('postgres');
$psql2->query('CREATE EXTENSION injection_points;');
$psql2->query("SELECT injection_points_attach('autovacuum-worker-start', 'wait');");
$node->wait_for_event('autovacuum worker', 'autovacuum-worker-start');
$psql2->query('select 1');
my $regexp = qr/autovacuum:/;
my $offset = 0;
ok( $node->log_contains(
$regexp,
$offset),
"autovacuum not started");
done_testing();
```
Best Regards, Stepan Neretin!
On Sun, Oct 27, 2024 at 5:50 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Sun, Oct 27, 2024 at 05:19:45PM -0700, Masahiko Sawada wrote: > > my $psql1 = $node->interactive_psql('postgres'); > > $psql1->query("create temp table test (a int primary key);"); > > sleep(3); > > > > $node->stop('immediate'); > > $node->start(); > > $node->safe_psql('postgres', 'select 1'); > > > > sleep(5); > > $node->safe_psql('postgres', 'select 1'); > > > > ok(1); > > done_testing(); > > The assertion failure happens in an autovacuum worker. So if you are > looking for a test that can be integrated in the tree, you could get > some inspiration from 006_signal_autovacuum.pl and rely on an > injection point wait with the existing autovacuum-worker-start. My > 2c, as it looks easy enough to avoid the hardcoded waits. Thank you for your suggestion. IMHO I'm not sure we need to have a regression test for this bug as it's just an oversight of recently committed code. My suggestion was just to help Stepan reproduce this failure using TAP tests. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Thank you for your suggestion. IMHO I'm not sure we need to have a
regression test for this bug as it's just an oversight of recently
committed code. My suggestion was just to help Stepan reproduce this
failure using TAP tests.
Yes, I have reproduced the issue manually before your changes, but I am unable to do so in the perl test.
Best Regards, Stepan Neretin!
On Sun, Oct 27, 2024 at 09:36:38PM -0700, Masahiko Sawada wrote: > Thank you for your suggestion. IMHO I'm not sure we need to have a > regression test for this bug as it's just an oversight of recently > committed code. Agreed. FWIW cfbot seems to catch this already (CTRL+F for "index_drop" in the highlights page [0]), and I saw it on my laptop shortly after the issue was reported. [0] http://cfbot.cputube.org/highlights/all.html -- nathan
Committed. -- nathan