Thread: Shared memory corruption?

Shared memory corruption?

From

Tom I Helbekkmo

Date:

12 February 1998, 13:31:11

[similar report submitted previously, but this is more complete]

There is something that looks like shared memory corruption going on,
which I first noticed by accident the other day, in the 1998-02-09
snapshot.  It's still there today, with the 1998-02-12 one, and looks
like the following on my Sun SS2 under NetBSD/sparc 1.3 (I've created
a simple test case here, for easy testing elsewhere):

First, I run initdb, start a postmaster, create a user 'tih', stop the
postmaster, restart the postmaster with '-d', thus:

 barsoom:postgres> postmaster -i -d
 FindBackend: searching PATH ...
 FindBackend: found "/usr/local/pgsql/bin/postgres" using PATH

Next, I create a database 'words', thus:

 barsoom:tih> createdb words
 barsoom:tih>

The postmaster says:

 postmaster: BackendStartup: pid 6542 user tih db template1 socket 5
 postmaster: reaping dead processes...
 postmaster: CleanupProc: pid 6542 exited with status 0

I fire up psql, thus:

 barsoom:tih> psql words
 words=>

The postmaster goes:

 postmaster: BackendStartup: pid 6549 user tih db words socket 5

In psql, I then do the following:

 words=> create table dictionary (entry char(64));
 CREATE
 words=> create unique index dict_by_entry on dictionary (entry);
 CREATE
 words=> copy dictionary from '/usr/share/dict/words';

The postmaster generates no output at this, and the copy starts as it
should.  There is much disk activity.  Next, while this is running, in
another terminal window, as the same user 'tih', I do:

 barsoom:tih> createdb
 Connection to database 'template1' failed.
 PQexec() -- There is no connection to the backend.
 createdb: database creation failed on tih.
 barsoom:tih>

When this happens, the postmaster generates the following output:

 postmaster: BackendStartup: pid 6560 user tih db template1 socket 5
 ERROR:  cannot write block 171 of dict_by_entry [words] blind
 postmaster: reaping dead processes...
 postmaster: CleanupProc: pid 6560 exited with status 0

Looking at processes running on the system at this time, I see:

  6549 p6  R+ 2:01.88 /usr/local/pgsql/bin/postgres -p -Q -P5 -v 65536 words

This is the backend doing the copy.  It is spinning furiously, eating
CPU like there was no tomorrow -- but there is no more disk activity.
The terminal window where I initiated the copy operation looks as
though it were proceeding normally.  So now I attempt to perform the
database creation again, thus (in the second terminal):

 barsoom:tih> createdb

Nothing happens -- it just hangs there.  The postmaster says:

 postmaster: BackendStartup: pid 6595 user tih db template1 socket 5

Looking with ps again, I can see that this backend is now also running
wild, sharing the CPU half and half with the one with PID 6549...

Note that I'm trying to create a different database when it breaks;
the only possible interaction is through the shared memory that I
understand is maintained by the postmaster on behalf of the backends.
As for seeing this on other platforms, I certainly hope it's
repeatable elsewhere, but it's not unreasonable to assume that it
could cause different symptoms on other platforms, including quiet
data corruption...

The whole thing is completely repeatable here -- any ideas can be
verified quickly and easily -- and with enthusiasm.  :-)
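
For easier testing elsewhere, the steps above can be sketched as a small
script. This is only a sketch: the names DBNAME, WORDS, DELAY, load_sql and
reproduce are illustrative and not part of the report, the SQL is fed to psql
on standard input rather than typed interactively, and the sleep is a guess
at how long the copy needs to get going.

```shell
#!/bin/sh
# Sketch of the test case described above. DBNAME, WORDS and DELAY are
# hypothetical knobs added for illustration only.
DBNAME=${DBNAME:-words}
WORDS=${WORDS:-/usr/share/dict/words}
DELAY=${DELAY:-5}

# Print the three statements from the psql session in the report.
load_sql() {
    cat <<EOF
create table dictionary (entry char(64));
create unique index dict_by_entry on dictionary (entry);
copy dictionary from '$WORDS';
EOF
}

# Run the sequence: create the database, start the load in the background,
# then attempt a second createdb while the copy is (presumably) still running.
reproduce() {
    createdb "$DBNAME" || return 1
    load_sql | psql "$DBNAME" &
    sleep "$DELAY"   # assumed to be long enough for the copy to start
    createdb         # in the report, this is the command that fails
}
```

Calling reproduce as the database user, against a postmaster started with
'-d', should then show whether the "cannot write block" message appears.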

-tih
--
Popularity is the hallmark of mediocrity.  --Niles Crane, "Frasier"

Re: [HACKERS] Shared memory corruption?

From

Bruce Momjian

Date:

12 February 1998, 15:31:32

Vadim, I may need your help on this one.  I can reproduce it by running
the regression test, and doing a shell 'while' loop that continuously
creates databases:

    while :
    do
        sh -c 'createdb $$'
    done

I get the errors too.  I have no idea on a cause.  I would hope it is
not the new deadlock code, or locking fixes I did.  I think the message
comes from smgrblindwrt.  Is it possible our new speedups are causing
it?
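
The loop above works because each 'sh -c' child expands $$ to its own PID,
which gives every createdb call a fresh database name. A bounded variant that
also counts failures might look like the sketch below. The CREATEDB variable
and the create_many function are illustrative additions, and the failure
count assumes createdb exits non-zero when creation fails.

```shell
#!/bin/sh
# Bounded variant of the createdb loop. CREATEDB and create_many are
# hypothetical names added for illustration.
CREATEDB=${CREATEDB:-createdb}
export CREATEDB

# create_many N: run the command N times, each in a fresh child shell so
# that $$ (the child's PID) serves as a throwaway database name.
# Prints the number of invocations that returned a non-zero status.
create_many() {
    count=$1
    failures=0
    i=0
    while [ "$i" -lt "$count" ]
    do
        if ! sh -c '"$CREATEDB" $$' >/dev/null 2>&1
        then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, 'create_many 50' run alongside the regression test prints how
many of the 50 attempts failed.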





--
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] Shared memory corruption?

From

Bruce Momjian

Date:

12 February 1998, 15:31:41

I saw this here too.  I ran the regression tests, and while doing it,
tried to create a database.  No idea on a cause.



--
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] Shared memory corruption?

From

"Vadim B. Mikheev"

Date:

12 February 1998, 21:18:56

Bruce Momjian wrote:
>
> Vadim, I may need your help on this one.  I can reproduce it by running
> the regression test, and doing a shell 'while' loop that continuously
> creates databases:
>
>         while :
>         do
>                 sh -c 'createdb $$'
>         done
>
> I get the errors too.  I have no idea on a cause.  I would hope it is
> not the new deadlock code, or locking fixes I did.  I think the message
> comes from smgrblindwrt.  Is it possible our new speedups are causing
> it?

I'll try to deal with this in the next week.
I'm going to update CVS with subselect support right now
and I'll try to fix bugs after this.

Vadim

Re: [HACKERS] Shared memory corruption?

From

Bruce Momjian

Date:

12 February 1998, 21:35:57

>
> Bruce Momjian wrote:
> >
> > Vadim, I may need your help on this one.  I can reproduce it by running
> > the regression test, and doing a shell 'while' loop that continuously
> > creates databases:
> >
> >         while :
> >         do
> >                 sh -c 'createdb $$'
> >         done
> >
> > I get the errors too.  I have no idea on a cause.  I would hope it is
> > not the new deadlock code, or locking fixes I did.  I think the message
> > comes from smgrblindwrt.  Is it possible our new speedups are causing
> > it?
>
> I'll try to deal with this in the next week.
> I'm going to update CVS with subselect support right now
> and I'll try to fix bugs after this.

Great.  Thanks.

--
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] Shared memory corruption?

From

"Vadim B. Mikheev"

Date:

19 February 1998, 03:02:48

Bruce Momjian wrote:
>
> Vadim, I may need your help on this one.  I can reproduce it by running
> the regression test, and doing a shell 'while' loop that continuously
> creates databases:
>
>         while :
>         do
>                 sh -c 'createdb $$'
>         done
>
> I get the errors too.  I have no idea on a cause.  I would hope it is
> not the new deadlock code, or locking fixes I did.  I think the message
> comes from smgrblindwrt.  Is it possible our new speedups are causing
> it?

I can reproduce it. Still looking...
BTW, did you compile without --enable-cassert?
(It should be ON by default in betas...)
I got an interesting assertion failure from BufferAlloc; without CASSERT you
should get a dead spinlock from there.

Vadim

Re: [HACKERS] Shared memory corruption?

From

Bruce Momjian

Date:

19 February 1998, 09:31:50

>
> Bruce Momjian wrote:
> >
> > Vadim, I may need your help on this one.  I can reproduce it by running
> > the regression test, and doing a shell 'while' loop that continuously
> > creates databases:
> >
> >         while :
> >         do
> >                 sh -c 'createdb $$'
> >         done
> >
> > I get the errors too.  I have no idea on a cause.  I would hope it is
> > not the new deadlock code, or locking fixes I did.  I think the message
> > comes from smgrblindwrt.  Is it possible our new speedups are causing
> > it?
>
> I can reproduce it. Still looking...
> BTW, did you compile without --enable-cassert?
> (It should be ON by default in betas...)
> I got an interesting assertion failure from BufferAlloc; without CASSERT you
> should get a dead spinlock from there.

I always have asserts on.

--
Bruce Momjian
maillist@candle.pha.pa.us