Thread: Race-condition with failed block-write?

Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

10 September 2005, 16:27:19

Hi list,

I noticed a development machine was responding terribly slow, and found
out postgresql had consumed all available memory:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres  7465  0.0  0.0  16636   928 ?        S    Sep05   0:00
/usr/bin/postmaster -D /var/lib/postgresql/data
postgres  7472  0.0 81.0 3144408 832032 ?      D    Sep05   0:56  \_
postgres: writer process
postgres  7473  0.0  0.0   7552   916 ?        S    Sep05   0:00  \_
postgres: stats buffer process
postgres  7474  0.0  0.0   6848   900 ?        S    Sep05   0:00      \_
postgres: stats collector process

It had taken over 800MB of memory. The machine is a P4 3.0 Ghz (with HT
enabled), 1GB of memory a few SATA disks and runs with a recent Gentoo +
2.6.12.5 kernel.
The postgresql version that failed was 8.0.3, it may or may not be worth
knowing that it runs a 8.1devel on another port and from another directory.

The 8.0.3-database was not actively in use, since our current
development work was using the 8.1devel.

In the postgresql.log a write-failure messages was repeated enough to
make the log file 50MB larger:

This was the first one:
[ - 2005-09-07 13:03:47 CEST @] ERROR:  xlog flush request 29/67713428
is not satisfied --- flushed only to 29/2E73E488
[ - 2005-09-07 13:03:47 CEST @] CONTEXT:  writing block 21 of relation
1663/2013826/9975789
[ - 2005-09-07 13:03:48 CEST @] ERROR:  xlog flush request 29/67713428
is not satisfied --- flushed only to 29/2E73E488
[ - 2005-09-07 13:03:48 CEST @] CONTEXT:  writing block 21 of relation
1663/2013826/9975789
[ - 2005-09-07 13:03:48 CEST @] WARNING:  could not write block 21 of
1663/2013826/9975789
[ - 2005-09-07 13:03:48 CEST @] DETAIL:  Multiple failures --- write
error may be permanent.


And those 4 lines were repeated up untill the memory was empty, as it
seems from the log:


TopMemoryContext: -1095880208 total in 264213 blocks; 537938888 free
(924739 chunks); -1633819096 used
MdSmgr: 8192 total in 1 blocks; 8024 free (0 chunks); 168 used
Pending Ops Table: 8192 total in 1 blocks; 6112 free (0 chunks); 2080 used
DynaHash: 8192 total in 1 blocks; 7488 free (0 chunks); 704 used
smgr relation table: 8192 total in 1 blocks; 4048 free (0 chunks); 4144 used
LockTable (locallock hash): 8192 total in 1 blocks; 6112 free (0
chunks); 2080 used
ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
[ - 2005-09-09 02:42:22 CEST @] ERROR:  out of memory
[ - 2005-09-09 02:42:22 CEST @] DETAIL:  Failed on request of size 16000.


After issueing an immediate-shutdown, it starts and runs just fine:

[ - 2005-09-09 10:19:51 CEST @] LOG:  received fast shutdown request
[ - 2005-09-09 10:23:01 CEST @] LOG:  received immediate shutdown request
[ - 2005-09-09 10:27:20 CEST @] LOG:  database system was interrupted at
2005-09-05 11:24:08 CEST
[ - 2005-09-09 10:27:20 CEST @] LOG:  checkpoint record is at 29/2E73E44C
[ - 2005-09-09 10:27:20 CEST @] LOG:  redo record is at 29/2E73E44C;
undo record is at 0/0; shutdown TRUE
[ - 2005-09-09 10:27:20 CEST @] LOG:  next transaction ID: 166361; next
OID: 55768251
[ - 2005-09-09 10:27:20 CEST @] LOG:  database system was not properly
shut down; automatic recovery in progress
[ - 2005-09-09 10:27:20 CEST @] LOG:  record with zero length at 29/2E73E488
[ - 2005-09-09 10:27:20 CEST @] LOG:  redo is not required
[ - 2005-09-09 10:27:20 CEST @] LOG:  database system is ready


At 2005-09-05 the machine was rebooted and the only query since then
(which may have triggered the issue) was:
[2005-09-07 13:03:35 CEST - 2005-09-07 13:03:37 CEST acm@acm] LOG:
statement: SELECT c.oid,
           n.nspname,
           c.relname
         FROM pg_catalog.pg_class c
              LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
         WHERE pg_catalog.pg_table_is_visible(c.oid)
               AND c.relname ~ '^pwprijs$'
         ORDER BY 2, 3;

I shut down the postgresql-server after the restart again to prevent
further changes to the data-files of it.

Best regards,

Arjen van der Meijden

Re: Race-condition with failed block-write?

From

Tom Lane

Date:

10 September 2005, 21:18:28

Arjen van der Meijden <acm@tweakers.net> writes:
> In the postgresql.log a write-failure messages was repeated enough to
> make the log file 50MB larger:

> [ - 2005-09-07 13:03:47 CEST @] ERROR:  xlog flush request 29/67713428
> is not satisfied --- flushed only to 29/2E73E488
> [ - 2005-09-07 13:03:47 CEST @] CONTEXT:  writing block 21 of relation
> 1663/2013826/9975789
> ...

> TopMemoryContext: -1095880208 total in 264213 blocks; 537938888 free
> (924739 chunks); -1633819096 used
> MdSmgr: 8192 total in 1 blocks; 8024 free (0 chunks); 168 used
> Pending Ops Table: 8192 total in 1 blocks; 6112 free (0 chunks); 2080 used
> DynaHash: 8192 total in 1 blocks; 7488 free (0 chunks); 704 used
> smgr relation table: 8192 total in 1 blocks; 4048 free (0 chunks); 4144 used
> LockTable (locallock hash): 8192 total in 1 blocks; 6112 free (0
> chunks); 2080 used
> ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
> [ - 2005-09-09 02:42:22 CEST @] ERROR:  out of memory
> [ - 2005-09-09 02:42:22 CEST @] DETAIL:  Failed on request of size 16000.

The pending-ops table only exists in the bgwriter, so it's evidently the
bgwriter that was out of memory.  I have an old note to myself

: Doesn't bgwriter risk leaking memory if ereport out of a checkpoint?
: Seems we should have it run checkpoints in a short-term context.
: Don't the other daemons have similar issues?

It looks to me like you have a case of this actually happening: it kept
failing to execute a checkpoint and leaking some more memory each time.

I'll move the priority of fixing that up a bit ...

The other question is why the failure was occurring in the first place
--- corrupt LSN value, apparently, but how'd it get that way?

            regards, tom lane

Re: Race-condition with failed block-write?

From

Tom Lane

Date:

12 September 2005, 19:26:11

Arjen van der Meijden <acm@tweakers.net> writes:
> It had taken over 800MB of memory. The machine is a P4 3.0 Ghz (with HT
> enabled), 1GB of memory a few SATA disks and runs with a recent Gentoo +
> 2.6.12.5 kernel.
> The postgresql version that failed was 8.0.3, it may or may not be worth
> knowing that it runs a 8.1devel on another port and from another directory.

> In the postgresql.log a write-failure messages was repeated enough to
> make the log file 50MB larger:

I was able to replicate the memory leak in 8.0 by artificially poking a
bad value into the LSN field of a dirty page.  CVS tip did not seem to
leak anything under the same conditions --- probably the difference is
that BufferSync() uses palloc'd temporary space in 8.0 but not in HEAD.
Nonetheless, we ought to have a positive defense against such leaks in
future, so I've applied the attached patch to both branches.

There is still the question of how the LSN value got corrupted, but
we've probably missed any chance of finding that out now.

            regards, tom lane


Index: bgwriter.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.19
diff -c -r1.19 bgwriter.c
*** bgwriter.c    20 Aug 2005 23:26:17 -0000    1.19
--- bgwriter.c    12 Sep 2005 21:42:48 -0000
***************
*** 160,165 ****
--- 160,166 ----
  BackgroundWriterMain(void)
  {
      sigjmp_buf    local_sigjmp_buf;
+     MemoryContext bgwriter_context;

      Assert(BgWriterShmem != NULL);
      BgWriterShmem->bgwriter_pid = MyProcPid;
***************
*** 208,213 ****
--- 209,227 ----
      last_checkpoint_time = time(NULL);

      /*
+      * Create a memory context that we will do all our work in.  We do this
+      * so that we can reset the context during error recovery and thereby
+      * avoid possible memory leaks.  Formerly this code just ran in
+      * TopMemoryContext, but resetting that would be a really bad idea.
+      */
+     bgwriter_context = AllocSetContextCreate(TopMemoryContext,
+                                              "Background Writer",
+                                              ALLOCSET_DEFAULT_MINSIZE,
+                                              ALLOCSET_DEFAULT_INITSIZE,
+                                              ALLOCSET_DEFAULT_MAXSIZE);
+     MemoryContextSwitchTo(bgwriter_context);
+
+     /*
       * If an exception is encountered, processing resumes here.
       *
       * See notes in postgres.c about the design of this coding.
***************
*** 247,255 ****
           * Now return to normal top-level context and clear ErrorContext
           * for next time.
           */
!         MemoryContextSwitchTo(TopMemoryContext);
          FlushErrorState();

          /* Now we can allow interrupts again */
          RESUME_INTERRUPTS();

--- 261,272 ----
           * Now return to normal top-level context and clear ErrorContext
           * for next time.
           */
!         MemoryContextSwitchTo(bgwriter_context);
          FlushErrorState();

+         /* Flush any leaked data in the top-level context */
+         MemoryContextResetAndDeleteChildren(bgwriter_context);
+
          /* Now we can allow interrupts again */
          RESUME_INTERRUPTS();

Re: Race-condition with failed block-write?

From

Tom Lane

Date:

13 September 2005, 11:26:37

Arjen van der Meijden <acm@tweakers.net> writes:
> As a matter of fact, I just tested it again. Issuing that same query
> will trigger the issue again when I execute it. I don't know whether the
> query matters, but I know this one will trigger it, so I didn't try others.

It's highly unlikely that that query has anything to do with it, since
it's not touching anything but system catalogs and not trying to write
them either.

> Since its a development machine, access to it is a possibility. But if
> you can give me a few pointers how to gather usefull information without
> any "stranger" accessing the machine, that'd be nice.

The first thing you ought to find out is which table
1663/2013826/9975789 is, and look to see if the corrupted LSN value is
already present on disk in that block.  If it is, then we've probably
not got much chance of finding out how it got there.  If it is *not* on
disk, but you have a repeatable way of causing this to happen starting
from a clean postmaster start, then that's pretty interesting --- but
I don't know any way of figuring it out short of groveling through the
code with a debugger.  If you're not already pretty familiar with the PG
code, coaching you remotely isn't going to work very well :-(.  I'd be
glad to look into it if you can get me access to the machine though.

            regards, tom lane

Re: Race-condition with failed block-write?

From

Tom Lane

Date:

13 September 2005, 15:07:29

Arjen van der Meijden <acm@tweakers.net> writes:
> On 13-9-2005 16:25, Tom Lane wrote:
>> The first thing you ought to find out is which table
>> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
>> already present on disk in that block.

> Well, its an index, not a table. It was the index:
> "pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

Ah.  So you've reindexed pg_class at some point.  Reindexing it again
would likely get you out of this.

> Using pg_filedump I extracted the LSN for block 21 and indeed, that was
> already 67713428 instead of something below 2E73E53C. It wasn't that
> block alone though, here are a few LSN-lines from it:

>   LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
>   LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
>   LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>   LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)

logid is the high-order half of the LSN, so there's nothing wrong with
those other pages --- it's only the first one you show there that seems
to be past the current end of WAL.

> On that day I did some active query-tuning, but a few times it took too
> long, so I issued immediate shut downs when the selects took too long.
> There were no warnings about broken records afterwards in the log
> though, so I don't believe anything got damaged afterwards.

I have a feeling something may have gone wrong here, though it's hard to
say what.  If the bogus pages in the other tables all have LSNs close to
this one then that makes it less likely that this is a random corruption
event --- what would be more plausible is that end of WAL really was
that high and somehow the WAL counter got reset back during one of those
forced restarts.

Can you show us ls -l output for the pg_xlog directory?  I'm interested
to see the file names and mod dates there.

            regards, tom lane

Re: Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

13 September 2005, 15:45:11

On 13-9-2005 20:04, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
>>On 13-9-2005 16:25, Tom Lane wrote:
>>
>>Well, its an index, not a table. It was the index:
>>"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
>
> Ah.  So you've reindexed pg_class at some point.  Reindexing it again
> would likely get you out of this.

Unless reindexing is part of other commands, I didn't do that. The last
time 'grep' was able to find an reference to something being reindexed
was in June, something (maybe me, but I doubt it, I'd also reindex the
user-tables, I suppose) was reindexing all system tables back then.
Besides, its not just the index, on pg_class, pg_class itself (and
pg_index) have wrong LSN's as well.

>>Using pg_filedump I extracted the LSN for block 21 and indeed, that was
>>already 67713428 instead of something below 2E73E53C. It wasn't that
>>block alone though, here are a few LSN-lines from it:
>
>
>>  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
>>  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
>>  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>>  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)
>
>
> logid is the high-order half of the LSN, so there's nothing wrong with
> those other pages --- it's only the first one you show there that seems
> to be past the current end of WAL.

There were 3 blocks of 40 with a LSN like the first one above in that
index-file. So with high-order 41, recoff 0x67[67]something.
In the pg_class-file there were 6 blocks, of which 5 LSN's were like the
above in that index. And for pg_index 3 blocks, with 1 wrong.

>>On that day I did some active query-tuning, but a few times it took too
>>long, so I issued immediate shut downs when the selects took too long.
>>There were no warnings about broken records afterwards in the log
>>though, so I don't believe anything got damaged afterwards.
>
> I have a feeling something may have gone wrong here, though it's hard to
> say what.  If the bogus pages in the other tables all have LSNs close to
> this one then that makes it less likely that this is a random corruption
> event --- what would be more plausible is that end of WAL really was
> that high and somehow the WAL counter got reset back during one of those
> forced restarts.
>
> Can you show us ls -l output for the pg_xlog directory?  I'm interested
> to see the file names and mod dates there.

Here you go:

l /var/lib/postgresql/data/pg_xlog/
total 145M
drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
-rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
-rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E

During data-load it was warning about too frequent checkpoints, but I do
hope thats mostly performance-related, not stability?

Best regards,

Arjen van der Meijden

Re: Race-condition with failed block-write?

From

Tom Lane

Date:

13 September 2005, 16:36:53

Arjen van der Meijden <acm@tweakers.net> writes:
> Unless reindexing is part of other commands, I didn't do that. The last
> time 'grep' was able to find an reference to something being reindexed
> was in June, something (maybe me, but I doubt it, I'd also reindex the
> user-tables, I suppose) was reindexing all system tables back then.

Well, that would have done it.  But if you have such complete logs, do
you have the postmaster log from the whole period?  Look for sequences
approximately like this:

2005-09-13 14:48:33.450 EDT 25857 LOG:  database system was interrupted at 2005-09-13 14:42:08 EDT
2005-09-13 14:48:33.450 EDT 25857 LOG:  checkpoint record is at 0/12B65A40
2005-09-13 14:48:33.450 EDT 25857 LOG:  redo record is at 0/12B65A40; undo record is at 0/0; shutdown FALSE
2005-09-13 14:48:33.450 EDT 25857 LOG:  next transaction ID: 67789; next OID: 240494
2005-09-13 14:48:33.450 EDT 25857 LOG:  next MultiXactId: 1; next MultiXactOffset: 0
2005-09-13 14:48:33.450 EDT 25857 LOG:  database system was not properly shut down; automatic recovery in progress
2005-09-13 14:48:33.452 EDT 25857 LOG:  record with zero length at 0/12B65A88
2005-09-13 14:48:33.452 EDT 25857 LOG:  redo is not required
2005-09-13 14:48:33.474 EDT 25857 LOG:  database system is ready

and note whether there's any one that shows a WAL end address (here
0/12B65A88) that's less than the one before.

> l /var/lib/postgresql/data/pg_xlog/
> total 145M
> drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
> drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
> -rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
> -rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
> -rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
> -rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E

This is definitely ugly: there's no way all those higher-numbered
segments would be in place if the WAL counter hadn't been up to that
range.  So I think we are definitely looking at a WAL-counter-went-
backward problem, not some random data corruption.

Judging from the file dates, the WAL counter was still OK (ie, above
67xxx) on Sep 1, and the problem must have occurred since then.

            regards, tom lane

Re: Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

13 September 2005, 17:13:02

On 13-9-2005 20:53, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
>>Unless reindexing is part of other commands, I didn't do that. The last
>>time 'grep' was able to find an reference to something being reindexed
>>was in June, something (maybe me, but I doubt it, I'd also reindex the
>>user-tables, I suppose) was reindexing all system tables back then.
>
>
> Well, that would have done it.  But if you have such complete logs, do
> you have the postmaster log from the whole period?
I have logs from the time I started using it seriously indeed. It was
just never rotated or something. I did adjust it to have proper prefixes
only at 23th of august, so everything before that is a bit less easy to
read.
Besides that it contains lots of "debug" stuff, like logged queries and
their statistics.

It is a lot of text and the file is currently 52MB. But without all
those repeated failure-messages its about 862KB. So I can e-mail a
compressed version to you, if you're interested in it.

> Look for sequences
> approximately like this:
>
> and note whether there's any one that shows a WAL end address (here
> 0/12B65A88) that's less than the one before.

I take it you mean something like this:
[ - 2005-08-23 15:29:26 CEST @] LOG:  checkpoint record is at 29/2E73E410
[ - 2005-08-23 15:37:27 CEST @] LOG:  checkpoint record is at 29/2E73E410
[ - 2005-08-25 10:26:27 CEST @] LOG:  checkpoint record is at 29/2E8D26B4
[ - 2005-08-25 10:47:59 CEST @] LOG:  checkpoint record is at 29/2E8D26F0
[ - 2005-09-05 11:24:08 CEST @] LOG:  checkpoint record is at 29/2E73E44C
[ - 2005-09-09 10:27:20 CEST @] LOG:  checkpoint record is at 29/2E73E44C

The last start-up at 25th of August has a higher second value than the
start-up at the 5th of September. Which was a start since it was closed
at sep 1.

>>l /var/lib/postgresql/data/pg_xlog/
>>total 145M
>>drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
>>drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
>>-rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
>>-rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E
>
>
> This is definitely ugly: there's no way all those higher-numbered
> segments would be in place if the WAL counter hadn't been up to that
> range.  So I think we are definitely looking at a WAL-counter-went-
> backward problem, not some random data corruption.
>
> Judging from the file dates, the WAL counter was still OK (ie, above
> 67xxx) on Sep 1, and the problem must have occurred since then.

That's odd. But it must have been my hard attempts to shut it down then.
These are all the lines there are for Sep 1:

[ - 2005-09-01 12:36:50 CEST @] LOG:  received fast shutdown request
[ - 2005-09-01 12:36:50 CEST @] LOG:  shutting down
[ - 2005-09-01 12:36:50 CEST @] LOG:  database system is shut down
[ - 2005-09-01 12:37:01 CEST @] LOG:  received smart shutdown request
[ - 2005-09-01 12:37:01 CEST @] LOG:  shutting down
[ - 2005-09-01 12:37:01 CEST @] LOG:  database system is shut down
[ - 2005-09-05 11:24:08 CEST @] LOG:  database system was shut down at
2005-09-01 12:37:01 CEST
[ - 2005-09-05 11:24:08 CEST @] LOG:  checkpoint record is at 29/2E73E44C
[ - 2005-09-05 11:24:08 CEST @] LOG:  redo record is at 29/2E73E44C;
undo record is at 0/0; shutdown TRUE
[ - 2005-09-05 11:24:08 CEST @] LOG:  next transaction ID: 166361; next
OID: 55768251
[ - 2005-09-05 11:24:08 CEST @] LOG:  database system is ready

I'm not sure why I needed two shut down-calls at that time. It may have
been due to a psql on a terminal I'd forgot to close first.

Best regards,

Arjen

Re: Race-condition with failed block-write?

From

Tom Lane

Date:

13 September 2005, 18:01:44

Arjen van der Meijden <acm@tweakers.net> writes:
> These are all the lines there are for Sep 1:

> [ - 2005-09-01 12:36:50 CEST @] LOG:  received fast shutdown request
> [ - 2005-09-01 12:36:50 CEST @] LOG:  shutting down
> [ - 2005-09-01 12:36:50 CEST @] LOG:  database system is shut down
> [ - 2005-09-01 12:37:01 CEST @] LOG:  received smart shutdown request
> [ - 2005-09-01 12:37:01 CEST @] LOG:  shutting down
> [ - 2005-09-01 12:37:01 CEST @] LOG:  database system is shut down

That's all?  There's something awfully suspicious about that.  You're
sure this is 8.0.3?  AFAICS it is absolutely impossible for the 8.0
postmaster.c code to emit "received smart shutdown request" after
emitting "received fast shutdown request".  The SIGINT code looks like

            if (Shutdown >= FastShutdown)
                break;
            Shutdown = FastShutdown;
            ereport(LOG,
                    (errmsg("received fast shutdown request")));

and the SIGTERM code looks like

            if (Shutdown >= SmartShutdown)
                break;
            Shutdown = SmartShutdown;
            ereport(LOG,
                    (errmsg("received smart shutdown request")));

and there are no other places that change the value of Shutdown, and
certainly FastShutdown > SmartShutdown.  So I wonder if something got
lost in the log entries.

Another question is why the postmaster didn't exit at 12:36:50.  It was
not waiting on any backends, else it would not have launched the
shutdown process (which is what emits the other two messages).

[ thinks for a bit ... ]  I wonder if Shutdown ought to be marked
volatile, since it is after all changed by a signal handler.  But given
the way the postmaster is coded, this doesn't seem likely to be an issue.
Basically all of the code runs with signals blocked.

Can you try to reconstruct what you did on Sep 1, and see whether you
can reproduce the above behavior?

            regards, tom lane

Re: Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

13 September 2005, 18:41:38

On 13-9-2005 23:01, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
> That's all?  There's something awfully suspicious about that.  You're
> sure this is 8.0.3?

When I do "select version();"

PostgreSQL 8.0.3 on i686-pc-linux-gnu, compiled by GCC
i686-pc-linux-gnu-gcc (GCC) 3.3.5-20050130 (Gentoo Linux
3.3.5.20050130-r1, ssp-3.3.5.20050130-1, pie-8.7.7.1)

I checked the emerge.log from Gentoo and that sais I installed version
8.0.3 at the 15th of August. So well before that September the 1st.

> AFAICS it is absolutely impossible for the 8.0
> postmaster.c code to emit "received smart shutdown request" after
> emitting "received fast shutdown request".  The SIGINT code looks like
>
>
> and the SIGTERM code looks like
>
> and there are no other places that change the value of Shutdown, and
> certainly FastShutdown > SmartShutdown.  So I wonder if something got
> lost in the log entries.

That'd surprise me, but it would explain this behaviour. I doubt though
that much happened in those 11 seconds that are missing. It can't have
been a start-up without logging, since it wouldn't have logged the
shut-down then, would it?
Besides that, I normally start it and shut it down using the
/etc/init.d-scripts. And that script issues a fast-shutdown, so a
smart-shutdown should not be necessary anymore.

> Another question is why the postmaster didn't exit at 12:36:50.  It was
> not waiting on any backends, else it would not have launched the
> shutdown process (which is what emits the other two messages).
>
> [ thinks for a bit ... ]  I wonder if Shutdown ought to be marked
> volatile, since it is after all changed by a signal handler.  But given
> the way the postmaster is coded, this doesn't seem likely to be an issue.
> Basically all of the code runs with signals blocked.
>
> Can you try to reconstruct what you did on Sep 1, and see whether you
> can reproduce the above behavior?

The only time I really recall having trouble with shutting it down was
when the memory had leaked up all system memory (at 9-9). I don't know
what happened at 1-9 anymore, as far as I remember and can read back
from the log I just (tried to) shut it down. Most likely I tried to shut
it down to free up some extra memory for the postgres 8.1, running at
that time, the 8.0.3 wasn't in use anyway.
I'll try and see if I can dig up more from the logs and see if I can
test a few reasonable scenario's tomorrow though.

Best regards,

Arjen

Re: Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

14 September 2005, 05:58:43

Arjen van der Meijden wrote:
> On 13-9-2005 23:01, Tom Lane wrote:
>
> I'll try and see if I can dig up more from the logs and see if I can
> test a few reasonable scenario's tomorrow though.

I wasn't able to find anything suspicious at all. A smart-shutdown
should've come from the user postgres and I can trace all the 'pg_ctl
...'-calls for that postgres-installation to other dates than Sep 1 or
the 8.1-installation.
Besides that, if the init.d-script with its fast shutdown wouldn't have
responded, it seems very unlikely to me that I'd try a normal shutdown
after the fast shutdown. I would've tried an immediate shutdown instead.

Besides that the settings hadn't been changed in days at that time, so
any query issued should've been logged (log statement was enabled), even
if it got cancelled before it finished. I also can't find kill's from
root or similar statements, pointing to other incorrect usages.

Nor was I silly enough to start postgres 8.1 on the 8.0.3-data-store, or
postgresql 8.0.3 with pg_ctl instead of the init.d-script.

So why it was able to do two shutdowns in a row, I have no clue, not
even how to think of a reasonable scenario what happened. I wasn't
actively using/testing the database at that time. The machine itself
wasn't shut down as well at that date.

Best regards,

Arjen

Re: Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

14 September 2005, 11:01:01

Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>>> I'll try and see if I can dig up more from the logs and see if I can
>>> test a few reasonable scenario's tomorrow though.
>
>> I wasn't able to find anything suspicious at all.
>
> Is it conceivable that the init-script itself did something funny?
> There are so many different init scripts around ...

Its the default gentoo-init-script, but I checked it whether it could do
weird things. Anyway, this is the stop-code:

stop() {
         ebegin "Stopping PostgreSQL"
         su - $PGUSER -c "/usr/bin/pg_ctl stop -D '$PGDATA' -s -m fast"
         eend $?
}

start and restart are similarily simple.
ebegin and eend are afaik simple macro's to pretty print to the console.

Best regards,

Arjen

Re: Race-condition with failed block-write?

From

Tom Lane

Date:

14 September 2005, 11:09:11

Arjen van der Meijden <acm@tweakers.net> writes:
>> I'll try and see if I can dig up more from the logs and see if I can
>> test a few reasonable scenario's tomorrow though.

> I wasn't able to find anything suspicious at all.

Is it conceivable that the init-script itself did something funny?
There are so many different init scripts around ...

            regards, tom lane

Re: Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

14 September 2005, 21:31:48

On 13-9-2005 0:26, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
> There is still the question of how the LSN value got corrupted, but
> we've probably missed any chance of finding that out now.

As a matter of fact, I just tested it again. Issuing that same query
will trigger the issue again when I execute it. I don't know whether the
query matters, but I know this one will trigger it, so I didn't try others.

This one has the issue as a result:
SELECT c.oid,
           n.nspname,
           c.relname
         FROM pg_catalog.pg_class c
              LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
         WHERE pg_catalog.pg_table_is_visible(c.oid)
               AND c.relname ~ '^pwprijs$'
         ORDER BY 2, 3;

After the above query nothing happens for a few seconds and then these
errors show up again:
[ - 2005-09-13 08:23:10 CEST @] ERROR:  xlog flush request 29/67713428
is not satisfied --- flushed only to 29/2E73E53C
[ - 2005-09-13 08:23:10 CEST @] CONTEXT:  writing block 21 of relation
1663/2013826/9975789
[ - 2005-09-13 08:23:10 CEST @] WARNING:  could not write block 21 of
1663/2013826/9975789

I stopped postgresql again to prevent changes to the on-disk data. But I
did start postgresql a second time to see if that query would trigger it
again and it did.

So what can I do to help you find the issue?
Dumping the data to file and sending it to you would probably be a waste
of bandwidth and trouble. Apart from the fact that I'd have permission
of my employer to do so anyway.
Since its a development machine, access to it is a possibility. But if
you can give me a few pointers how to gather usefull information without
any "stranger" accessing the machine, that'd be nice.

Best regards,

Arjen van der Meijden

Re: Race-condition with failed block-write?

From

Arjen van der Meijden

Date:

14 September 2005, 21:40:20

On 13-9-2005 16:25, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
> It's highly unlikely that that query has anything to do with it, since
> it's not touching anything but system catalogs and not trying to write
> them either.

Indeed, other things trigger it as well.

> The first thing you ought to find out is which table
> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
> already present on disk in that block.

Well, its an index, not a table. It was the index:
"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

Using pg_filedump I extracted the LSN for block 21 and indeed, that was
already 67713428 instead of something below 2E73E53C. It wasn't that
block alone though, here are a few LSN-lines from it:

  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)

I tried other files and each one I tried only had LSN's of 0.

When trying (\d indexname in psql) to determine to which table that
index belonged I noticed it got the errors again, but for another file
(pg_index this time). And another try (oid2name ...) after that, yet
another file (the pg_class-table). All those files where last changed
somewhere August 25, so now new changes.

On that day I did some active query-tuning, but a few times it took too
long, so I issued immediate shut downs when the selects took too long.
There were no warnings about broken records afterwards in the log
though, so I don't believe anything got damaged afterwards.

After that I loaded some fresh data from a production-database using
either pg_restore or psql < some-file-from-pg_dump.sql (I don't know
which one anymore). A few days later I shut down that postgres,
installed 8.1-beta and used that (in another directory of course), this
8.0.3 only came back up because of a reboot and wasn't used since that
reboot.

I guess, during that reloading those system tables got mixed up?

> If it is, then we've probably
> not got much chance of finding out how it got there.  If it is *not* on
> disk, but you have a repeatable way of causing this to happen starting
> from a clean postmaster start, then that's pretty interesting --- but
> I don't know any way of figuring it out short of groveling through the
> code with a debugger.  If you're not already pretty familiar with the PG
> code, coaching you remotely isn't going to work very well :-(.  I'd be
> glad to look into it if you can get me access to the machine though.

Well, I can very probably give you that access. But as you say, finding
out was went wrong is very hard to do.

Best regards,

Arjen van der Meijden