Shared memory corruption? - Mailing list pgsql-hackers

From Tom I Helbekkmo
Subject Shared memory corruption?
Msg-id 980212192150.5990A@barsoom.Hamartun.Priv.NO
Responses Re: [HACKERS] Shared memory corruption?  (Bruce Momjian <maillist@candle.pha.pa.us>)
List pgsql-hackers
[similar report submitted previously, but this is more complete]

There is something that looks like shared memory corruption going on,
which I first noticed by accident the other day, in the 1998-02-09
snapshot.  It's still there today, with the 1998-02-12 one, and looks
like the following on my Sun SS2 under NetBSD/sparc 1.3 (I've created
a simple test case here, for easy testing elsewhere):

First, I run initdb, start a postmaster, create a user 'tih', stop the
postmaster, and restart it with '-d', thus:

 barsoom:postgres> postmaster -i -d
 FindBackend: searching PATH ...
 FindBackend: found "/usr/local/pgsql/bin/postgres" using PATH

Next, I create a database 'words', thus:

 barsoom:tih> createdb words
 barsoom:tih>

The postmaster says:

 postmaster: BackendStartup: pid 6542 user tih db template1 socket 5
 postmaster: reaping dead processes...
 postmaster: CleanupProc: pid 6542 exited with status 0

I fire up psql, thus:

 barsoom:tih> psql words
 words=>

The postmaster goes:

 postmaster: BackendStartup: pid 6549 user tih db words socket 5

In psql, I then do the following:

 words=> create table dictionary (entry char(64));
 CREATE
 words=> create unique index dict_by_entry on dictionary (entry);
 CREATE
 words=> copy dictionary from '/usr/share/dict/words';

The postmaster generates no output at this, and the copy starts as it
should.  There is much disk activity.  Next, while this is running, in
another terminal window, as the same user 'tih', I do:

 barsoom:tih> createdb
 Connection to database 'template1' failed.
 PQexec() -- There is no connection to the backend.
 createdb: database creation failed on tih.
 barsoom:tih>

When this happens, the postmaster generates the following output:

 postmaster: BackendStartup: pid 6560 user tih db template1 socket 5
 ERROR:  cannot write block 171 of dict_by_entry [words] blind
 postmaster: reaping dead processes...
 postmaster: CleanupProc: pid 6560 exited with status 0

Looking at processes running on the system at this time, I see:

  6549 p6  R+ 2:01.88 /usr/local/pgsql/bin/postgres -p -Q -P5 -v 65536 words

This is the backend doing the copy.  It is spinning furiously, eating
CPU like there was no tomorrow -- but there is no more disk activity.
The terminal window where I initiated the copy operation looks as
though it were proceeding normally.  So now I attempt to perform the
database creation again, thus (in the second terminal):

 barsoom:tih> createdb

Nothing happens -- it just hangs there.  The postmaster says:

 postmaster: BackendStartup: pid 6595 user tih db template1 socket 5

Looking with ps again, I can see that this backend is now also running
wild, sharing the CPU half and half with the one with PID 6549...
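
For anyone trying to reproduce this elsewhere, a quick way to spot the
stuck backends is to compare the BackendStartup lines against the
CleanupProc lines in the postmaster's '-d' output.  The following is
only a rough sketch: the function name unreaped_backends and the
SAMPLE_LOG variable are made up for illustration, and the patterns
assume the debug output has exactly the format quoted in this message.

```python
import re

# Patterns for the postmaster '-d' debug lines quoted in this message.
# Assumption: other snapshots may format these lines differently.
STARTED = re.compile(r"BackendStartup: pid (\d+) user (\S+) db (\S+)")
REAPED = re.compile(r"CleanupProc: pid (\d+) exited with status (-?\d+)")


def unreaped_backends(log_text):
    """Return {pid: database} for backends that started but were never reaped."""
    live = {}
    for line in log_text.splitlines():
        m = STARTED.search(line)
        if m:
            # Remember which database this backend was started for.
            live[int(m.group(1))] = m.group(3)
            continue
        m = REAPED.search(line)
        if m:
            # The postmaster has cleaned up this backend; forget it.
            live.pop(int(m.group(1)), None)
    return live


# The postmaster output from this report, in the order it appeared.
SAMPLE_LOG = """\
postmaster: BackendStartup: pid 6542 user tih db template1 socket 5
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 6542 exited with status 0
postmaster: BackendStartup: pid 6549 user tih db words socket 5
postmaster: BackendStartup: pid 6560 user tih db template1 socket 5
ERROR:  cannot write block 171 of dict_by_entry [words] blind
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 6560 exited with status 0
postmaster: BackendStartup: pid 6595 user tih db template1 socket 5
"""

if __name__ == "__main__":
    print(unreaped_backends(SAMPLE_LOG))
    # prints {6549: 'words', 6595: 'template1'}
```

Run against the output above, this leaves exactly the two PIDs that ps
shows spinning: 6549 (the copy) and 6595 (the second createdb).  Note
that a backend that is merely still working will also show up here, so
the result needs to be checked against ps before drawing conclusions.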

Note that I'm trying to create a different database when it breaks;
the only possible interaction is through the shared memory that I
understand is maintained by the postmaster on behalf of the backends.
As for seeing this on other platforms, I certainly hope it's
repeatable elsewhere, but it's not unreasonable to assume that it
could cause different symptoms on other platforms, including quiet
data corruption...

The whole thing is completely repeatable here -- any ideas can be
verified quickly and easily -- and with enthusiasm.  :-)

-tih
--
Popularity is the hallmark of mediocrity.  --Niles Crane, "Frasier"

