Thread: inconsistent state after crash recovery

inconsistent state after crash recovery

From
Satoshi Nagayasu
Date:
Hi,

I received a question about inconsistent state after crash recovery.

When a table file is broken (or just lost), PostgreSQL can not recover
a whole table, and does not show any notice while recoverying.
I think it means "inconsistent" state.

(1) create a table, and fill records.
(2) process a checkpoint.
(3) fill more records.
(4) force a crash, and delete the table file.
(5) run recovery on restarting.
(6) only records after the checkpoint can be recoverd.

For example, the attached log shows that PostgreSQL can recover
only 1058 records in the table which contains 2000 records
before the crash, and does not tell anything in the server log.

----------------------------------------------------------
insert into t1 values (trim(to_char(generate_series(1,1000),
'00000000000000000000')) );
INSERT 0 1000
select count(*) from t1;
 count
-------
  1000
(1 row)

checkpoint;
CHECKPOINT
insert into t1 values (trim(to_char(generate_series(1001,2000),
'00000000000000000000')) );
INSERT 0 1000
select count(*) from t1;
 count
-------
  2000
(1 row)

(delete the table file)
(kill postgres)
(restart postgres with recovery)

select count(*) from t1;
 count
-------
  1058
(1 row)
----------------------------------------------------------

Is this expected or acceptable?

I think, at least, PostgreSQL should say something about this
situation in the server log, because DBA can not recognize
this situation if no server log exists.

To reproduce it, please check the attached test script.

Any comments?

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachment

Re: inconsistent state after crash recovery

From
Andres Freund
Date:
Hi,

On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
> I received a question about inconsistent state after crash recovery.
> 
> When a table file is broken (or just lost), PostgreSQL can not recover
> a whole table, and does not show any notice while recoverying.
> I think it means "inconsistent" state.
> 
> (1) create a table, and fill records.
> (2) process a checkpoint.
> (3) fill more records.
> (4) force a crash, and delete the table file.
> (5) run recovery on restarting.
> (6) only records after the checkpoint can be recoverd.
> 
> For example, the attached log shows that PostgreSQL can recover
> only 1058 records in the table which contains 2000 records
> before the crash, and does not tell anything in the server log.
> 
> Is this expected or acceptable?

I'd say it's both. WAL replay doesn't have the knowledge to detect that
in all too many cases. Nearly always a page's contents will be restored
by a full page image the first time it gets changed so they will
individually look completely normal.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: inconsistent state after crash recovery

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
>> Is this expected or acceptable?

> I'd say it's both.

Postgres is built on the assumption that the underlying filesystem is
reliable, ie, once you've successfully fsync'd some data that data won't
disappear.  If the filesystem fails to honor that contract, it's a
filesystem bug not a Postgres bug.  Nor is it reasonable to expect
Postgres to be able to detect every such violation.  As an example,
would you expect crash recovery to notice the disappearance of a file
that was touched nowhere in the replayed actions?
        regards, tom lane



Re: inconsistent state after crash recovery

From
KONDO Mitsumasa
Date:
Hi Satoshi,

I was wondering about this problem. Please tell us about your system enviroment
which is postgresql version ,OS, raid card, and file system.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center



Re: inconsistent state after crash recovery

From
Robert Haas
Date:
On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
>>> Is this expected or acceptable?
>
>> I'd say it's both.
>
> Postgres is built on the assumption that the underlying filesystem is
> reliable, ie, once you've successfully fsync'd some data that data won't
> disappear.  If the filesystem fails to honor that contract, it's a
> filesystem bug not a Postgres bug.  Nor is it reasonable to expect
> Postgres to be able to detect every such violation.  As an example,
> would you expect crash recovery to notice the disappearance of a file
> that was touched nowhere in the replayed actions?

Eh, maybe not.  But should we try harder to detect the unexpected
disappearance of one that is?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: inconsistent state after crash recovery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> would you expect crash recovery to notice the disappearance of a file
>> that was touched nowhere in the replayed actions?

> Eh, maybe not.  But should we try harder to detect the unexpected
> disappearance of one that is?

We do, don't we?  The replay stuff should complain unless it sees a drop
or truncate covering any unaccounted-for pages.
        regards, tom lane



Re: inconsistent state after crash recovery

From
Robert Haas
Date:
On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> would you expect crash recovery to notice the disappearance of a file
>>> that was touched nowhere in the replayed actions?
>
>> Eh, maybe not.  But should we try harder to detect the unexpected
>> disappearance of one that is?
>
> We do, don't we?  The replay stuff should complain unless it sees a drop
> or truncate covering any unaccounted-for pages.

Hmm.  Yeah.  But the OP seems to think it doesn't work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: inconsistent state after crash recovery

From
Tomasz Nowak
Date:
Hi,
  I'm very new here on this mailing list, but I've been using PostgreSQL for a while, and it scares me a little, that it's a real pain to try to recover data from corrupted table.

  Situations like file being lost following server crash (after fsck) or page corruption happens quite often.
 
  Having corruption at the row level for example, system could mark the page as corrupted in the system catalog.
  Giving that page knows last change to this page, can we use archived WAL-s to recover the page?
  We can have a table inside pg_catalog like pg_corrupted_pages with information of page corruption detected by backend.

  Similar to tables, in a case of lost file, once system notice that, we should have a column in pg_class called relneedrecovery to record that.
  Will it be possible to recover this file from last hot backup and apply redo to it based on WAL records?

  Similar functionality is provider by Oracle (media and block recovery).
  I assume we will need some extra DDL commands to handle file/block recovery.
  Also, It would be nice to have a command to quickly cross check files between pg_class and filesystem - just simple open/close system call for each relation - it is always faster that to run vaccum to check that.
Tomasz

On 2 August 2013 13:19, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> would you expect crash recovery to notice the disappearance of a file
>>> that was touched nowhere in the replayed actions?
>
>> Eh, maybe not.  But should we try harder to detect the unexpected
>> disappearance of one that is?
>
> We do, don't we?  The replay stuff should complain unless it sees a drop
> or truncate covering any unaccounted-for pages.

Hmm.  Yeah.  But the OP seems to think it doesn't work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



On 2 August 2013 13:19, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> would you expect crash recovery to notice the disappearance of a file
>>> that was touched nowhere in the replayed actions?
>
>> Eh, maybe not.  But should we try harder to detect the unexpected
>> disappearance of one that is?
>
> We do, don't we?  The replay stuff should complain unless it sees a drop
> or truncate covering any unaccounted-for pages.

Hmm.  Yeah.  But the OP seems to think it doesn't work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: inconsistent state after crash recovery

From
Satoshi Nagayasu
Date:
(2013/08/02 21:19), Robert Haas wrote:
> On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> would you expect crash recovery to notice the disappearance of a file
>>>> that was touched nowhere in the replayed actions?
>>
>>> Eh, maybe not.  But should we try harder to detect the unexpected
>>> disappearance of one that is?
>>
>> We do, don't we?  The replay stuff should complain unless it sees a drop
>> or truncate covering any unaccounted-for pages.
>
> Hmm.  Yeah.  But the OP seems to think it doesn't work.

Yes. I'm afraid that.

My attached script shows that crash recovery re-creates the lost
table file implicitly, and fills some of those blocks (maybe
lower ones) with zero without any notice. We can easily observe
it by using pg_filedump.

Thus, the table file can lose records, but DBA cannot recognize
it because no message is left in the server log.

I agree that this is not a PostgreSQL bug.

However, DBA still needs to detect this table corruption,
brought by several components which PostgreSQL relys on, to
consider restoring from database backup.

If PostgreSQL can detect and tell something about that, it
would be really helpful for DBA to make some critical decision.
I think PostgreSQL will be able to do that.

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachment