Re: [HACKERS] Restricting maximum keep segments by repslots - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [HACKERS] Restricting maximum keep segments by repslots
Date Wed, 08 Apr 2020 09:37:10 +0900
Msg-id 20200408.093710.447591748588426656.horikyota.ntt@gmail.com
In response to Re: [HACKERS] Restricting maximum keep segments by repslots  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: [HACKERS] Restricting maximum keep segments by repslots  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Thank you for committing this.

At Tue, 7 Apr 2020 18:45:22 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Apr-07, Kyotaro Horiguchi wrote:
> 
> > > Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot?
> > 
> > I did that in the attached. The invalidated flag is a shared-but-not-saved
> > member of a slot; it is initialized to false and irreversibly set to
> > true when the slot loses a required segment.
> > 
> > It is checked by the new function CheckReplicationSlotInvalidated() when
> > acquiring a slot and when a standby reply message updates the slot. This
> > change stops the walsender without explicitly killing it, but I didn't
> > remove that code.
> 
> This change didn't work well with my proposed change to make
> checkpointer acquire slots before marking them invalid.  When I
> incorporated your patch in the last version I posted yesterday, there
> was a problem that when checkpointer attempted to acquire the slot, it
> would fail with "the slot is invalidated"; also if you try to drop the
> slot, it would obviously fail.  I think it would work to remove the
> SlotIsInvalidated check from the Acquire routine, and instead move it to
> the routines that need it (ie. not the InvalidateObsolete one, and also
> not the routine to drop slots).
> 
> I pushed version 26, with a few further adjustments.
> 
> I think what we have now is sufficient, but if you want to attempt this
> "invalidated" flag on top of what I pushed, be my guest.

I don't think the invalidation flag is essential, but it can prevent
unanticipated behavior; in other words, it makes us feel at ease :p
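
To be clear about what I mean, here is a rough sketch. Only the 'invalidated'
member and CheckReplicationSlotInvalidated() come from the patch I described
above; the locking and the error wording below are just illustration, not the
patch as posted.

#include "postgres.h"
#include "replication/slot.h"
#include "storage/spin.h"

/* Sketch only: assumes a bool "invalidated" has been added to
 * ReplicationSlot in shared memory (never written to disk), set once
 * when the slot loses required WAL and never cleared afterwards.
 */
static void
CheckReplicationSlotInvalidated(ReplicationSlot *slot)
{
	bool		invalidated;

	SpinLockAcquire(&slot->mutex);
	invalidated = slot->invalidated;
	SpinLockRelease(&slot->mutex);

	if (invalidated)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("replication slot \"%s\" has been invalidated",
						NameStr(slot->data.name)),
				 errdetail("The slot exceeded max_slot_wal_keep_size and its WAL has been removed.")));
}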

As of the current master/HEAD, the following steps cause an assertion failure in xlogreader.c.

P(ublisher) $ vi $PGDATA/postgresql.conf
wal_level=logical
max_slot_wal_keep_size=0
^Z
(start publisher and subscriber)

P=> create table t(a int);
P=> create publication p1 for table t;
S=> create table t(a int);
S=> create subscription s1 connection '<publisher conninfo>' publication p1;
P=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;
(publisher crashes)

2020-04-08 09:20:16.893 JST [9582] LOG:  invalidating slot "s1" because its restart_lsn 0/1571770 exceeds max_slot_wal_keep_size
2020-04-08 09:20:16.897 JST [9496] LOG:  database system is ready to accept connections
2020-04-08 09:20:21.472 JST [9597] LOG:  starting logical decoding for slot "s1"
2020-04-08 09:20:21.472 JST [9597] DETAIL:  Streaming transactions committing after 0/1571770, reading WAL from 0/0.
TRAP: FailedAssertion("!XLogRecPtrIsInvalid(RecPtr)", File: "xlogreader.c", Line: 235)
postgres: walsender horiguti [local] idle(ExceptionalCondition+0xa8)[0xaac4c1]
postgres: walsender horiguti [local] idle(XLogBeginRead+0x30)[0x588dbf]
postgres: walsender horiguti [local] idle[0x8c938b]
postgres: walsender horiguti [local] idle(exec_replication_command+0x311)[0x8c9c75]
postgres: walsender horiguti [local] idle(PostgresMain+0x79a)[0x92f091]
postgres: walsender horiguti [local] idle[0x87eec3]
postgres: walsender horiguti [local] idle[0x87e69a]
postgres: walsender horiguti [local] idle[0x87abc2]
postgres: walsender horiguti [local] idle(PostmasterMain+0x11cd)[0x87a48f]
postgres: walsender horiguti [local] idle[0x7852cb]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7fc190958873]
postgres: walsender horiguti [local] idle(_start+0x2e)[0x48169e]
2020-04-08 09:20:22.255 JST [9496] LOG:  server process (PID 9597) was terminated by signal 6: Aborted
2020-04-08 09:20:22.255 JST [9496] LOG:  terminating any other active server processes
2020-04-08 09:20:22.256 JST [9593] WARNING:  terminating connection because of crash of another server process
2020-04-08 09:20:22.256 JST [9593] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

The DETAIL line above shows decoding trying to read WAL from 0/0: the
invalidated slot's restart_lsn has been reset to InvalidXLogRecPtr, which is
exactly what the assertion in XLogBeginRead() rejects. I will look into it.
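
One direction I'm considering (just a sketch under the assumption above, not a
worked-out fix) is to have the walsender refuse to start decoding from such a
slot instead of reaching the assertion:

/* Hypothetical guard, placed roughly where the walsender starts logical
 * decoding on an already-acquired slot; errors out instead of letting
 * XLogBeginRead() hit Assert(!XLogRecPtrIsInvalid(RecPtr)).
 */
if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
	ereport(ERROR,
			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			 errmsg("cannot start logical decoding from replication slot \"%s\"",
					NameStr(MyReplicationSlot->data.name)),
			 errdetail("This slot has lost required WAL because it exceeded max_slot_wal_keep_size.")));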


On the other hand, physical replication is not broken by invalidation.

Primary: postgresql.conf
max_slot_wal_keep_size=0
Standby: postgresql.conf
primary_conninfo='connect to master'
primary_slot_name='x1'

(start the primary)
P=> select pg_create_physical_replication_slot('x1');
(start the standby)
P=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

(primary log)
2020-04-08 09:35:09.719 JST [10064] LOG:  terminating walsender 10076 because replication slot "x1" is too far behind
2020-04-08 09:35:09.719 JST [10076] FATAL:  terminating connection due to administrator command
2020-04-08 09:35:09.720 JST [10064] LOG:  invalidating slot "x1" because its restart_lsn 0/B9F2000 exceeds max_slot_wal_keep_size
(standby)
[10075] 2020-04-08 09:35:09.723 JST FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
[10101] 2020-04-08 09:35:09.734 JST LOG:  started streaming WAL from primary at 0/C000000 on timeline 1

It does no harm, but something seems strange. I'll look into it, too.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


