Thread: Shared memory corruption?

[similar report submitted previously, but this is more complete]

There is something that looks like shared memory corruption going on,
which I first noticed by accident the other day, in the 1998-02-09
snapshot.  It's still there today, with the 1998-02-12 one, and looks
like the following on my Sun SS2 under NetBSD/sparc 1.3 (I've created
a simple test case here, for easy testing elsewhere):

First, I run initdb, start a postmaster, create a user 'tih', stop the
postmaster, restart the postmaster with '-d', thus:

  barsoom:postgres> postmaster -i -d
  FindBackend: searching PATH ...
  FindBackend: found "/usr/local/pgsql/bin/postgres" using PATH

Next, I create a database 'words', thus:

  barsoom:tih> createdb words
  barsoom:tih>

The postmaster says:

  postmaster: BackendStartup: pid 6542 user tih db template1 socket 5
  postmaster: reaping dead processes...
  postmaster: CleanupProc: pid 6542 exited with status 0

I fire up psql, thus:

  barsoom:tih> psql words
  words=>

The postmaster goes:

  postmaster: BackendStartup: pid 6549 user tih db words socket 5

In psql, I then do the following:

  words=> create table dictionary (entry char(64));
  CREATE
  words=> create unique index dict_by_entry on dictionary (entry);
  CREATE
  words=> copy dictionary from '/usr/share/dict/words';

The postmaster generates no output at this, and the copy starts as it
should.  There is much disk activity.  Next, while this is running, in
another terminal window, as the same user 'tih', I do:

  barsoom:tih> createdb
  Connection to database 'template1' failed.
  PQexec() -- There is no connection to the backend.
  createdb: database creation failed on tih.
  barsoom:tih>

When this happens, the postmaster generates the following output:

  postmaster: BackendStartup: pid 6560 user tih db template1 socket 5
  ERROR:  cannot write block 171 of dict_by_entry [words] blind
  postmaster: reaping dead processes...
  postmaster: CleanupProc: pid 6560 exited with status 0

Looking at processes running on the system at this time, I see:

  6549 p6  R+     2:01.88 /usr/local/pgsql/bin/postgres -p -Q -P5 -v 65536 words

This is the backend doing the copy.  It is spinning furiously, eating
CPU like there was no tomorrow -- but there is no more disk activity.
The terminal window where I initiated the copy operation looks as
though it were proceeding normally.  So now I attempt to perform the
database creation again, thus (in the second terminal):

  barsoom:tih> createdb

Nothing happens -- it just hangs there.  The postmaster says:

  postmaster: BackendStartup: pid 6595 user tih db template1 socket 5

Looking with ps again, I can see that this backend is now also running
wild, sharing the CPU half and half with the one with PID 6549...

Note that I'm trying to create a different database when it breaks;
the only possible interaction is through the shared memory that I
understand is maintained by the postmaster on behalf of the backends.
As for seeing this on other platforms, I certainly hope it's
repeatable elsewhere, but it's not unreasonable to assume that it
could cause different symptoms on other platforms, including quiet
data corruption...

The whole thing is completely repeatable here -- any ideas can be
verified quickly and easily -- and with enthusiasm.  :-)

-tih
--
Popularity is the hallmark of mediocrity.  --Niles Crane, "Frasier"
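
For anyone who wants to replay the steps in the report without typing them
by hand, here is a minimal sketch as a shell function.  It assumes a
postmaster is already running and that the current user may create
databases.  The function name `repro`, the `RUN` and `DELAY` variables, and
the use of `psql -c` to pass each statement are assumptions made for
illustration; none of them come from the original report.  By default the
function only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: replays the reported steps.  Assumes a running postmaster.
# RUN and DELAY are hypothetical knobs, not PostgreSQL settings:
#   RUN unset -> commands are only printed (dry run via echo)
#   RUN=""    -> commands are really executed
#   DELAY     -> seconds to let the copy run before the second createdb
repro() {
    run=${RUN-echo}
    delay=${DELAY-5}

    $run createdb words
    $run psql -c 'create table dictionary (entry char(64));' words
    $run psql -c 'create unique index dict_by_entry on dictionary (entry);' words

    # Start the bulk load in the background, as in the first terminal.
    $run psql -c "copy dictionary from '/usr/share/dict/words';" words &
    copy_pid=$!

    # Give the copy time to get going, then create a second database
    # (no name given, as in the second terminal).
    sleep "$delay"
    $run createdb

    # Wait for the background load to finish.
    wait "$copy_pid"
}
```

Calling `repro` prints the command sequence; `RUN= repro` executes it.  If
the problem shows up as described, the copy backend never finishes, so the
final `wait` may hang and need to be interrupted.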

Vadim, I may need your help on this one.  I can reproduce it by running
the regression test, and doing a shell 'while' loop that continuously
creates databases:

  while :
  do
    sh -c 'createdb $$'
  done

I get the errors too.  I have no idea on a cause.  I would hope it is
not the new deadlock code, or locking fixes I did.  I think the message
comes from smgrblindwrt.  Is it possible our new speedups are causing
it?

--
Bruce Momjian
maillist@candle.pha.pa.us
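
A note on the 'while' loop in the message above: it runs until interrupted,
and it relies on each `sh -c` starting a new shell, so that `$$` expands to
a fresh process ID and every createdb call gets a different database name.
A bounded variant might look like the sketch below.  The function name
`stress_createdb` and the `CREATE_CMD` override are hypothetical, added only
so the loop can be tried with a harmless command first.

```shell
#!/bin/sh
# Sketch only: bounded version of the createdb loop.
# stress_createdb and CREATE_CMD are hypothetical names; CREATE_CMD lets
# you substitute a harmless command (e.g. echo) for createdb.
stress_createdb() {
    count=$1
    cmd=${CREATE_CMD-createdb}
    i=0
    while [ "$i" -lt "$count" ]
    do
        # A new sh per iteration, so $$ is that shell's PID and serves
        # as the database name.
        sh -c "$cmd \$\$" || echo "iteration $i: $cmd failed" >&2
        i=$((i + 1))
    done
}
```

For example, `CREATE_CMD=echo stress_createdb 3` just prints three process
IDs, while `stress_createdb 100` run alongside the regression tests attempts
100 database creations and reports each failure on stderr.  Every successful
iteration leaves a database behind.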

I saw this here too.  I ran the regression tests and, while they were
running, tried to create a database.  No idea on a cause.

--
Bruce Momjian
maillist@candle.pha.pa.us

Bruce Momjian wrote:
>
> Vadim, I may need your help on this one.  I can reproduce it by running
> the regression test, and doing a shell 'while' loop that continuously
> creates databases:
>
>   while :
>   do
>     sh -c 'createdb $$'
>   done
>
> I get the errors too.  I have no idea on a cause.  I would hope it is
> not the new deadlock code, or locking fixes I did.  I think the message
> comes from smgrblindwrt.  Is it possible our new speedups are causing
> it?

I'll try to deal with this in the next week.
I'm going to update CVS with subselect support right now
and I'll try to fix bugs after this.

Vadim

> I'll try to deal with this in the next week.
> I'm going to update CVS with subselect support right now
> and I'll try to fix bugs after this.

Great.  Thanks.

--
Bruce Momjian
maillist@candle.pha.pa.us

Bruce Momjian wrote:
>
> Vadim, I may need your help on this one.  I can reproduce it by running
> the regression test, and doing a shell 'while' loop that continuously
> creates databases:
>
>   while :
>   do
>     sh -c 'createdb $$'
>   done
>
> I get the errors too.  I have no idea on a cause.  I would hope it is
> not the new deadlock code, or locking fixes I did.  I think the message
> comes from smgrblindwrt.  Is it possible our new speedups are causing
> it?

I can reproduce it.  Still looking...
BTW, did you compile without --enable-cassert?
(It should be ON by default in betas...)
I got an interesting assertion from BufferAlloc; without CASSERT you
should get a dead spinlock from there.

Vadim

> I can reproduce it.  Still looking...
> BTW, did you compile without --enable-cassert?
> (It should be ON by default in betas...)
> I got an interesting assertion from BufferAlloc; without CASSERT you
> should get a dead spinlock from there.

I always have asserts on.

--
Bruce Momjian
maillist@candle.pha.pa.us