Re: Failed to delete old ReorderBuffer spilled files - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Failed to delete old ReorderBuffer spilled files
Date
Msg-id CAD21AoA+5X2W9oTihXhXj=XnizOFSyDVTTEytDN-9XskaoSbkA@mail.gmail.com
Whole thread Raw
In response to Re: Failed to delete old ReorderBuffer spilled files  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Failed to delete old ReorderBuffer spilled files  (atorikoshi <torikoshi_atsushi_z2@lab.ntt.co.jp>)
List pgsql-hackers
On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
> <torikoshi_atsushi_z2@lab.ntt.co.jp> wrote:
>> Hi,
>>
>> I put many queries into one transaction and made ReorderBuffer spill
>> data to disk, and sent SIGKILL to postgres before the end of the
>> transaction.
>>
>> After starting up postgres again, I observed the files spilled to
>> data wasn't deleted.
>>
>> I think these files should be deleted because its transaction was no
>> more valid, so no one can use these files.
>>
>>
>> Below is a reproduction instructions.
>>
>> ------------------------------------------------
>> 1. Create table and publication at publiser.
>>
>>   @pub =# CREATE TABLE t1(
>>       id INT PRIMARY KEY,
>>       name TEXT);
>>
>>   @pub =# CREATE PUBLICATION pub FOR TABLE t1;
>>
>> 2. Create table and subscription at subscriber.
>>
>>   @sub =# CREATE TABLE t1(
>>       id INT PRIMARY KEY,
>>       name TEXT
>>       );
>>
>>   @sub =# CREATE SUBSCRIPTION sub
>>       CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
>>       PUBLICATION pub;
>>
>> 3. Put many queries into one transaction.
>>
>>   @pub =# BEGIN;
>>         INSERT INTO t1
>>         SELECT
>>           i,
>>           'aaaaaaaaaa'
>>         FROM
>>         generate_series(1, 1000000) as i;
>>
>> 4. Then we can see spilled files.
>>
>>   @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
>>     state
>>     xid-561-lsn-0-1000000.snap
>>     xid-561-lsn-0-2000000.snap
>>     xid-561-lsn-0-3000000.snap
>>     xid-561-lsn-0-4000000.snap
>>     xid-561-lsn-0-5000000.snap
>>     xid-561-lsn-0-6000000.snap
>>     xid-561-lsn-0-7000000.snap
>>     xid-561-lsn-0-8000000.snap
>>     xid-561-lsn-0-9000000.snap
>>
>> 5. Kill publisher's postgres process before COMMIT.
>>
>>   @pub $ kill -s SIGKILL [pid of postgres]
>>
>> 6. Start publisher's postgres process.
>>
>>   @pub $ pg_ctl start -D ${PGDATA}
>>
>> 7. After a while, we can see the files remaining.
>>   (Immediately after starting publiser, we can not see these files.)
>>
>>   @pub $ pg_ctl start -D ${PGDATA}
>>
>>   When I configured with '--enable-cassert', below assertion error
>>   was appeared.
>>
>>     TRAP: FailedAssertion("!(txn->final_lsn != 0)", File: "reorderbuffer.c",
>> Line: 2576)
>> ------------------------------------------------
>>
>> Attached patch sets final_lsn to the last ReorderBufferChange if
>> final_lsn == 0.
>
> Thank you for the report. I could reproduce this issue with the above
> step. My analysis is, the cause of that a serialized reorder buffer
> isn't cleaned up is that the aborted transaction without an abort WAL
> record has no chance to set ReorderBufferTXN->final_lsn. So if there
> is such serialized transactions ReorderBufferRestoreCleanup cleanups
> no files, which is cause of the assertion failure (or a file being
> orphaned). What do you think?
>
> On detail of your patch, I'm not sure it's safe if we set the lsn of
> other than commit record or abort record to final_lsn. The comment in
> reorderbuffer.h says,
>
> typedef trcut ReorderBufferTXN
> {
> (snip)
>
>     /* ----
>      * LSN of the record that lead to this xact to be committed or
>      * aborted. This can be a
>      * * plain commit record
>      * * plain commit record, of a parent transaction
>      * * prepared transaction commit
>      * * plain abort record
>      * * prepared transaction abort
>      * * error during decoding
>      * ----
>      */
>     XLogRecPtr  final_lsn;
>
> But with your patch, we could set a lsn of a record that is other than
> what listed above to final_lsn. One way I came up with is to make
> ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
> regards it as a aborted transaction that doesn't has a abort WAL
> record. So we can cleanup all serialized files if final_lsn of a
> transaction is invalid.

After more thought, since there are some codes cosmetically setting
final_lsn when the fate of transaction is determined possibly we
should not accept a invalid value of final_lsn even in the case.

> Since I'm not very familiar with snapshot
> building part please check it.

Sorry I wanted to say the reorderbuffer module. The snapshot building
isn't relevant.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


pgsql-hackers by date:

Previous
From: Amit Khandekar
Date:
Subject: Re: [HACKERS] Parallel Append implementation
Next
From: Antonin Houska
Date:
Subject: Re: [HACKERS] CLUSTER command progress monitor