Re: WAL logging problem in 9.4.3? - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: WAL logging problem in 9.4.3?
Date
Msg-id CAHGQGwEXta1tMF6o-FcDcjMgCF=jnZLj3v3EoePvsEN4aexJnA@mail.gmail.com
In response to Re: WAL logging problem in 9.4.3?  (Martijn van Oosterhout <kleptog@svana.org>)
Responses Re: WAL logging problem in 9.4.3?  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
>> Hi,
>>
>> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
>> > === Start with an empty database
>>
>> My guess is you have wal_level = minimal?
>
> Default config, was just initdb'd. So yes, the default wal_level =
> minimal.
>
>> > === Note the index file is 8KB.
>> > === At this point nuke the database server (in this case it was simply
>> > === destroying the container it was running in.
>>
>> How did you continue from there? The container has persistent storage?
>> Or are you reapplying the WAL to somewhere else?
>
> The container has persistent storage on the host. What I think is
> actually unusual is that the script that started postgres was missing
> an 'exec' so postgres never gets the signal to shut down.
>
>> > martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16389|16387|16393'
>> > rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
>> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
>> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
>> > pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740
>>
>> Note that the truncate will lead to a new, different, relfilenode.
>
> Really? Comparing the relfilenodes gives the same values before and
> after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and
TRUNCATE were executed in the same transaction block.
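For illustration, this can be observed from psql by inspecting pg_class inside the transaction (a sketch against a scratch database; the table name and the specific relfilenode value are examples, not from the original report):

```sql
begin;
create table test(id serial primary key);
-- note the relfilenode assigned at creation, e.g. 16389
select relfilenode from pg_class where relname = 'test';
truncate table test;
-- same value as before: because the table was created in this same
-- transaction, TRUNCATE reuses the existing file rather than
-- assigning a new relfilenode
select relfilenode from pg_class where relname = 'test';
commit;
```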

>> > ctmp=# select * from test;
>> > ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
>>
>> Hm. I can't reproduce this. Can you include a bit more details about how
>> to reproduce?
>
> Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
> can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing
the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements:
       begin;
       create table test(id serial primary key);
       truncate table test;
       commit;
3. shut down the server in immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement:
       select * from test;
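Assuming the server binaries are on PATH and PGDATA points at a scratch cluster, the steps above can be sketched roughly as follows (the use of pg_ctl/psql and the paths are my assumptions, not part of the original report):

```shell
# 1. start the server with wal_level = minimal (the default)
pg_ctl -D "$PGDATA" -o "-c wal_level=minimal" -w start

# 2. create and truncate the table in one transaction block
psql -c "begin; create table test(id serial primary key); truncate table test; commit;"

# 3. simulate a crash: immediate shutdown skips the shutdown checkpoint
pg_ctl -D "$PGDATA" -m immediate stop

# 4. restart; crash recovery replays the WAL, including the file truncate
pg_ctl -D "$PGDATA" -w start

# 5. reading the table should now fail with
#    ERROR:  could not read block 0 in file "...": read only 0 of 8192 bytes
psql -c "select * from test;"
```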

The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only the index file truncation is
logged, and the index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, the index file is truncated
to 0 bytes... A very simple fix is to log the index creation in that case,
but I'm not sure if that's OK to do..

Regards,

--
Fujii Masao


