Shared memory corruption? - Mailing list pgsql-hackers
From | Tom I Helbekkmo |
---|---|
Subject | Shared memory corruption? |
Date | |
Msg-id | 980212192150.5990A@barsoom.Hamartun.Priv.NO |
Responses |
Re: [HACKERS] Shared memory corruption?
Re: [HACKERS] Shared memory corruption? |
List | pgsql-hackers |
[similar report submitted previously, but this is more complete]

There is something that looks like shared memory corruption going on, which I first noticed by accident the other day, in the 1998-02-09 snapshot. It's still there today, with the 1998-02-12 one, and looks like the following on my Sun SS2 under NetBSD/sparc 1.3 (I've created a simple test case here, for easy testing elsewhere):

First, I run initdb, start a postmaster, create a user 'tih', stop the postmaster, restart the postmaster with '-d', thus:

    barsoom:postgres> postmaster -i -d
    FindBackend: searching PATH ...
    FindBackend: found "/usr/local/pgsql/bin/postgres" using PATH

Next, I create a database 'words', thus:

    barsoom:tih> createdb words
    barsoom:tih>

The postmaster says:

    postmaster: BackendStartup: pid 6542 user tih db template1 socket 5
    postmaster: reaping dead processes...
    postmaster: CleanupProc: pid 6542 exited with status 0

I fire up psql, thus:

    barsoom:tih> psql words
    words=>

The postmaster goes:

    postmaster: BackendStartup: pid 6549 user tih db words socket 5

In psql, I then do the following:

    words=> create table dictionary (entry char(64));
    CREATE
    words=> create unique index dict_by_entry on dictionary (entry);
    CREATE
    words=> copy dictionary from '/usr/share/dict/words';

The postmaster generates no output at this, and the copy starts as it should. There is much disk activity.

Next, while this is running, in another terminal window, as the same user 'tih', I do:

    barsoom:tih> createdb
    Connection to database 'template1' failed.
    PQexec() -- There is no connection to the backend.
    createdb: database creation failed on tih.
    barsoom:tih>

When this happens, the postmaster generates the following output:

    postmaster: BackendStartup: pid 6560 user tih db template1 socket 5
    ERROR: cannot write block 171 of dict_by_entry [words] blind
    postmaster: reaping dead processes...
    postmaster: CleanupProc: pid 6560 exited with status 0

Looking at processes running on the system at this time, I see:

    6549 p6 R+ 2:01.88 /usr/local/pgsql/bin/postgres -p -Q -P5 -v 65536 words

This is the backend doing the copy. It is spinning furiously, eating CPU like there was no tomorrow -- but there is no more disk activity. The terminal window where I initiated the copy operation looks as though it were proceeding normally.

So now I attempt to perform the database creation again, thus (in the second terminal):

    barsoom:tih> createdb

Nothing happens -- it just hangs there. The postmaster says:

    postmaster: BackendStartup: pid 6595 user tih db template1 socket 5

Looking with ps again, I can see that this backend is now also running wild, sharing the CPU half and half with the one with PID 6549...

Note that I'm trying to create a different database when it breaks; the only possible interaction is through the shared memory that I understand is maintained by the postmaster on behalf of the backends.

As for seeing this on other platforms, I certainly hope it's repeatable elsewhere, but it's not unreasonable to assume that it could cause different symptoms on other platforms, including quiet data corruption...

The whole thing is completely repeatable here -- any ideas can be verified quickly and easily -- and with enthusiasm. :-)

-tih
--
Popularity is the hallmark of mediocrity. --Niles Crane, "Frasier"
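
For anyone trying the test case on another platform, the client-side steps in the report could be scripted roughly as follows. This is only a sketch, not part of the original message: it assumes a postmaster is already running as described (`postmaster -i -d`), that the script runs as the user 'tih', and that `createdb` and `psql` are on the PATH. The names `repro_steps` and `run_repro` are hypothetical, and the `delay` and `timeout` values are guesses, since the report gives no timings. The SQL is fed to `psql words` on standard input.

```python
import subprocess
import time

# SQL statements taken from the report, sent to psql on stdin.
COPY_SQL = (
    "create table dictionary (entry char(64));\n"
    "create unique index dict_by_entry on dictionary (entry);\n"
    "copy dictionary from '/usr/share/dict/words';\n"
)


def repro_steps():
    """Return the client-side steps as (argv, stdin_text, wait) tuples."""
    return [
        (["createdb", "words"], None, True),   # create the test database
        (["psql", "words"], COPY_SQL, False),  # start the copy, do not wait for it
        (["createdb"], None, True),            # concurrent createdb (fails in the report)
        (["createdb"], None, True),            # second attempt (hangs in the report)
    ]


def run_repro(execute=False, delay=5.0, timeout=60.0):
    """Print each step; with execute=True, also run it against a live postmaster.

    delay and timeout are assumed values, not taken from the report.
    """
    outcomes = []
    background = []
    for argv, stdin_text, wait in repro_steps():
        print("$", " ".join(argv))
        if not execute:
            continue  # dry run: only show what would be run
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE if stdin_text is not None else subprocess.DEVNULL,
            text=True,
        )
        if stdin_text is not None:
            proc.stdin.write(stdin_text)
            proc.stdin.close()  # EOF lets psql exit once the copy finishes
        if not wait:
            background.append(proc)
            time.sleep(delay)  # assumed pause so the copy is under way
            continue
        try:
            outcomes.append((argv, proc.wait(timeout=timeout)))
        except subprocess.TimeoutExpired:
            proc.kill()  # treat a step that never returns as a hang
            outcomes.append((argv, "hung"))
    for proc in background:
        proc.terminate()
    return outcomes
```

Calling `run_repro()` only prints the commands; `run_repro(execute=True)` runs them and returns each waited-for command with its exit status, or `"hung"` if it did not finish within the timeout. The postmaster output still has to be watched separately in its own terminal.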