Thread: Assertion failure when autovacuum drops orphan temp indexes.

Assertion failure when autovacuum drops orphan temp indexes.

From
Masahiko Sawada
Date:
Hi,

I got the following assertion failure from an autovaucum worker:

2024-10-26 00:32:03.685 PDT [830672] LOG:  autovacuum: dropping orphan
temp table "postgres.pg_temp_0.hoge"
TRAP: failed Assert("HaveRegisteredOrActiveSnapshot()"), File:
"index.c", Line: 2345, PID: 830672

Here is the stack trace I got:

(lldb) bt
* thread #1, name = 'postgres', stop reason = signal SIGABRT
  * frame #0: 0x00007f8ab215952f libc.so.6`raise + 271
    frame #1: 0x00007f8ab212ce65 libc.so.6`abort + 295
    frame #2: 0x0000000000bd636e
postgres`ExceptionalCondition(conditionName="", fileName="",
lineNumber=2345) at assert.c:66:2
    frame #3: 0x00000000005de1fb postgres`index_drop(indexId=16389,
concurrent=false, concurrent_lock_mode=false) at index.c:2345:3
    frame #4: 0x00000000005d0d7d
postgres`doDeletion(object=0x0000000002a96ab0, flags=21) at
dependency.c:1367:6
    frame #5: 0x00000000005d088c
postgres`deleteOneObject(object=0x0000000002a96ab0,
depRel=0x00007ffe04079260, flags=21) at dependency.c:1274:2
    frame #6: 0x00000000005cf243
postgres`deleteObjectsInList(targetObjects=0x0000000002a9cf48,
depRel=0x00007ffe04079260, flags=21) at dependency.c:229:3
    frame #7: 0x00000000005cf2f8
postgres`performDeletion(object=0x00007ffe04079374,
behavior=DROP_CASCADE, flags=21) at dependency.c:314:2
    frame #8: 0x000000000092302e postgres`do_autovacuum at autovacuum.c:2215:3
    frame #9: 0x00000000009220ee
postgres`AutoVacWorkerMain(startup_data=0x0000000000000000,
startup_data_len=0) at autovacuum.c:1571:3
    frame #10: 0x0000000000928742
postgres`postmaster_child_launch(child_type=B_AUTOVAC_WORKER,
startup_data=0x0000000000000000, startup_data_len=0,
client_sock=0x0000000000000000)at launch_backend.c:274:3
    frame #11: 0x000000000092e0c9
postgres`StartChildProcess(type=B_AUTOVAC_WORKER) at
postmaster.c:3736:8
    frame #12: 0x000000000092e1a7 postgres`StartAutovacuumWorker at
postmaster.c:3790:14
    frame #13: 0x000000000092df72 postgres`process_pm_pmsignal at
postmaster.c:3628:3
    frame #14: 0x000000000092b5b8 postgres`ServerLoop at postmaster.c:1646:5
    frame #15: 0x000000000092afed postgres`PostmasterMain(argc=5,
argv=0x00000000029c83e0) at postmaster.c:1351:11
    frame #16: 0x00000000007eaa4d postgres`main(argc=5,
argv=0x00000000029c83e0) at main.c:197:3
    frame #17: 0x00007f8ab21457e5 libc.so.6`__libc_start_main + 229
    frame #18: 0x00000000004a44be postgres`_start + 46

The reproducible steps are:

1. execute "create temp table test (a int primary key)" and keep the
psql session open.
2. shutdown the server in immediate mode.
3. restart the server.
4. when cleaning up the orphan temp table, autovacuum worker raises
the assertion failure.

This assertion was recently introduced in commit b52adbad467. cc-ed
Nathan as committer of the related commit.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Stepan Neretin
Date:
> IMHO the best way to handle this is to just unconditionally push a snapshot
> in this code path instead of making assumptions about what callers will do.

Yes, I agree! I have found the same solution. I attempted to write Perl tests to verify the patch fix, but the autovacuum process is not triggered in my tests. Could you please assist me?

```
use strict;
use warnings;
use threads;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::RecursiveCopy;
use PostgreSQL::Test::Utils;
use Test::More;
use Data::Dumper;

my $node = PostgreSQL::Test::Cluster->new('main');

# Create a data directory with initdb
$node->init;

$node->append_conf(
'postgresql.conf', qq[
autovacuum = on
track_counts=on
autovacuum_naptime = 1s
autovacuum_max_workers = 1
autovacuum_vacuum_threshold = 1
]);

# Start the PostgreSQL server
$node->start;

my $psql1 = $node->interactive_psql('postgres');
$psql1->query("create temp table test (a int primary key);");

$node->stop('immediate');
$node->start();
sleep(5);

$node->restart();

ok(1);
done_testing();
```

Best Regards, Stepan Neretin!

Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Masahiko Sawada
Date:
On Sat, Oct 26, 2024 at 12:13 PM Stepan Neretin <sndcppg@gmail.com> wrote:
>
> > IMHO the best way to handle this is to just unconditionally push a snapshot
> > in this code path instead of making assumptions about what callers will do.

+1

>
> Yes, I agree! I have found the same solution. I attempted to write Perl tests to verify the patch fix, but the
autovacuumprocess is not triggered in my tests. Could you please assist me? 
>
> ```
> use strict;
> use warnings;
> use threads;
> use PostgreSQL::Test::Cluster;
> use PostgreSQL::Test::RecursiveCopy;
> use PostgreSQL::Test::Utils;
> use Test::More;
> use Data::Dumper;
>
> my $node = PostgreSQL::Test::Cluster->new('main');
>
> # Create a data directory with initdb
> $node->init;
>
> $node->append_conf(
> 'postgresql.conf', qq[
> autovacuum = on
> track_counts=on
> autovacuum_naptime = 1s
> autovacuum_max_workers = 1
> autovacuum_vacuum_threshold = 1
> ]);
>
> # Start the PostgreSQL server
> $node->start;
>
> my $psql1 = $node->interactive_psql('postgres');
> $psql1->query("create temp table test (a int primary key);");
>
> $node->stop('immediate');
> $node->start();

IIUC after an immediate shutdown all pgstat entries are wiped out so
the server doesn't have any pgstat entries for databases at this
point. And since we don't run autovacuum on databases that have no
pg_stat entries, no autovacuum worker worked on the 'postgres'
database. Please try executing any query (e.g. 'select 1') on the
'postgres' database after the restart, which creates a pgstat entry
for the database.

> sleep(5);

While the test script sleeps for 5 seconds, the server restarts after
a crash. So even if the assertion failure happens, the test would
appear to be successful. I think you can set 'restart_after_crash =
off' and execute another query using safe_psql() after the sleep. That
way, the test ends up with safe_psql() failure because the database
server is not running.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Stepan Neretin
Date:



IIUC after an immediate shutdown all pgstat entries are wiped out so
the server doesn't have any pgstat entries for databases at this
point. And since we don't run autovacuum on databases that have no
pg_stat entries, no autovacuum worker worked on the 'postgres'
database. Please try executing any query (e.g. 'select 1') on the
'postgres' database after the restart, which creates a pgstat entry
for the database.

> sleep(5);

While the test script sleeps for 5 seconds, the server restarts after
a crash. So even if the assertion failure happens, the test would
appear to be successful. I think you can set 'restart_after_crash =
off' and execute another query using safe_psql() after the sleep. That
way, the test ends up with safe_psql() failure because the database
server is not running.

Hi, thank you for your suggestions!  But they did not help me. Autovacuum does not want to start :(

```
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;


my $node = PostgreSQL::Test::Cluster->new('main');

$node->init;

$node->append_conf(
'postgresql.conf', qq[
autovacuum = on
autovacuum_naptime = 1s
autovacuum_max_workers = 1
restart_after_crash = off
]);


$node->start;

my $psql1 = $node->interactive_psql('postgres');
$psql1->query("create temp table test (a int primary key);");

$node->stop('immediate');
sleep(5);

$node->start;

sleep(3);

$node->restart;

my $psql2 = $node->interactive_psql('postgres');
$psql2->query('SELECT 1;');
$psql2->query('SELECT 1;');

my $regexp = qr/autovacuum/;
my $offset = 0;

$node->wait_for_log($regexp, $offset);

done_testing();

```

Best Regards, Stepan Neretin.

Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Stepan Neretin
Date:




The assertion failure happens in an autovacuum worker.  So if you are
looking for a test that can be integrated in the tree, you could get
some inspiration from 006_signal_autovacuum.pl and rely on an
injection point wait with the existing autovacuum-worker-start.  My
2c, as it looks easy enough to avoid the hardcoded waits.
--

Greetings, I appreciate your proposal. I have attempted to implement your suggestion, but unfortunately it did not yield the desired results.

```
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

if ($ENV{enable_injection_points} ne 'yes') {
plan skip_all => 'Injection points not supported by this build';
}

my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;

$node->append_conf(
'postgresql.conf', qq[
autovacuum = on
autovacuum_naptime = 1
autovacuum_max_workers = 1
]);

$node->start;

if (!$node->check_extension('injection_points')) {
plan skip_all => 'Extension injection_points not installed';
}



my $psql1 = $node->interactive_psql('postgres');
$psql1->query("create temp table test (a int primary key);");
$node->stop('immediate');
$node->start;
$node->restart;

my $psql2 = $node->interactive_psql('postgres');
$psql2->query('CREATE EXTENSION injection_points;');
$psql2->query("SELECT injection_points_attach('autovacuum-worker-start', 'wait');");

$node->wait_for_event('autovacuum worker', 'autovacuum-worker-start');
$psql2->query('select 1');

my $regexp = qr/autovacuum:/;
my $offset = 0;

ok( $node->log_contains(
$regexp,
$offset),
"autovacuum not started");


done_testing();

```
Best Regards, Stepan Neretin!

Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Masahiko Sawada
Date:
On Sun, Oct 27, 2024 at 5:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, Oct 27, 2024 at 05:19:45PM -0700, Masahiko Sawada wrote:
> > my $psql1 = $node->interactive_psql('postgres');
> > $psql1->query("create temp table test (a int primary key);");
> > sleep(3);
> >
> > $node->stop('immediate');
> > $node->start();
> > $node->safe_psql('postgres', 'select 1');
> >
> > sleep(5);
> > $node->safe_psql('postgres', 'select 1');
> >
> > ok(1);
> > done_testing();
>
> The assertion failure happens in an autovacuum worker.  So if you are
> looking for a test that can be integrated in the tree, you could get
> some inspiration from 006_signal_autovacuum.pl and rely on an
> injection point wait with the existing autovacuum-worker-start.  My
> 2c, as it looks easy enough to avoid the hardcoded waits.

Thank you for your suggestion. IMHO I'm not sure we need to have a
regression test for this bug as it's just an oversight of recently
committed code. My suggestion was just to help Stepan reproduce this
failure using TAP tests.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Stepan Neretin
Date:

Thank you for your suggestion. IMHO I'm not sure we need to have a
regression test for this bug as it's just an oversight of recently
committed code. My suggestion was just to help Stepan reproduce this
failure using TAP tests.


Yes, I have reproduced the issue manually before your changes, but I am unable to do so in the perl test.
Best Regards, Stepan Neretin!

Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Nathan Bossart
Date:
On Sun, Oct 27, 2024 at 09:36:38PM -0700, Masahiko Sawada wrote:
> Thank you for your suggestion. IMHO I'm not sure we need to have a
> regression test for this bug as it's just an oversight of recently
> committed code.

Agreed.  FWIW cfbot seems to catch this already (CTRL+F for "index_drop" in
the highlights page [0]), and I saw it on my laptop shortly after the issue
was reported.

[0] http://cfbot.cputube.org/highlights/all.html

-- 
nathan



Re: Assertion failure when autovacuum drops orphan temp indexes.

From
Nathan Bossart
Date:
Committed.

-- 
nathan