Thread: Tom Lane's fixes in v6.4.3
From Tom Lane's horror story...

>I spent an hour tracing through startup of 6.4.x, and I now understand
>why the thing doesn't crash despite the horrible bugs in ShmemInitHash.
>Read on, if you have a strong stomach.

Are Tom Lane's fixes included in 6.4.3 beta? I think his findings are
so important.
---
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> From Tom Lane's horror story...
>> I spent an hour tracing through startup of 6.4.x, and I now understand
>> why the thing doesn't crash despite the horrible bugs in ShmemInitHash.
>> Read on, if you have a strong stomach.

> Are Tom Lane's fixes included in 6.4.3 beta? I think his findings are
> so important.

I have made a patch for 6.4.x which I intend to commit into the REL6_4
tree, but it hasn't gotten as much testing as I would like. The patch
is attached if you care to try it for a while first. (These changes
are already in the development tree, BTW.)

regards, tom lane
>Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>>> From Tom Lane's horror story...
>>> I spent an hour tracing through startup of 6.4.x, and I now understand
>>> why the thing doesn't crash despite the horrible bugs in ShmemInitHash.
>>> Read on, if you have a strong stomach.
>
>> Are Tom Lane's fixes included in 6.4.3 beta? I think his findings are
>> so important.
>
>I have made a patch for 6.4.x which I intend to commit into the REL6_4
>tree, but it hasn't gotten as much testing as I would like. The patch
>is attached if you care to try it for a while first. (These changes
>are already in the development tree, BTW.)

Thanks. Your patches work fine with the fresh REL6_4 sources I got
this morning.

One thing I noticed: when the backend runs out of semaphores, the
postmaster dies with the following messages:

IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.

Is this normal? I thought the postmaster would try to re-initialize
the shared buffers and resume normal operation in this case.

BTW, the 6.4 tree does not have the max backend patch I posted, so even
if there are enough resources, the backend will crash if the number of
connections exceeds MaxBackends.
--
Tatsuo Ishii
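A minimal C sketch of the MaxBackends-related idea discussed here, under stated assumptions: the postmaster sizes its semaphore sets from MaxBackends at startup, checks them against the kernel's SEMMNI (maximum number of semaphore sets) and SEMMNS (maximum semaphores system-wide) limits, and refuses a connection once MaxBackends backends are running, rather than letting a new backend fail inside semget(). The set size of 16 is taken from num=16 in the log above; one semaphore per backend, and the names SEMS_PER_SET, sem_sets_needed, sems_fit_kernel and can_accept_backend, are hypothetical and are not taken from the PostgreSQL source or from Tatsuo's patch.

```c
#include <stdbool.h>

/* Assumption: semaphores are created in sets of 16, as num=16 in the log suggests. */
#define SEMS_PER_SET 16

/* Hypothetical helper: number of sets needed if each backend uses one semaphore. */
static int sem_sets_needed(int max_backends)
{
    if (max_backends <= 0)
        return 0;
    /* Ceiling division: 17 backends need 2 sets. */
    return (max_backends + SEMS_PER_SET - 1) / SEMS_PER_SET;
}

/* Hypothetical startup check against the kernel limits:
 * semmni = max number of semaphore sets, semmns = max semaphores system-wide.
 * Ignores semaphores already held by other programs on the host. */
static bool sems_fit_kernel(int max_backends, int semmni, int semmns)
{
    int sets = sem_sets_needed(max_backends);
    return sets <= semmni && sets * SEMS_PER_SET <= semmns;
}

/* Hypothetical admission check: refuse the connection up front instead of
 * starting a backend that cannot get a semaphore. */
static bool can_accept_backend(int active_backends, int max_backends)
{
    return active_backends < max_backends;
}
```

For example, MaxBackends = 64 needs four sets of 16 semaphores. If SEMMNI or SEMMNS is too small for that, semget() fails with ENOSPC, which the C library reports as "No space left on device", the same text that appears in the log.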
When I tried to start postmaster as:

postmaster -d 3 -B 1024

I got a core dump:

FindExec: searching PATH ...
ValidateBinary: can't stat "/home/httpd/html/users/t-ishii/bin/postgres"
ValidateBinary: can't stat "/usr/local/bin/postgres"
ValidateBinary: can't stat "/bin/postgres"
ValidateBinary: can't stat "/usr/bin/postgres"
ValidateBinary: can't stat "/home/httpd/html/users/t-ishii/src/pgsql/postgresql-6.4.2/src/backend./postgres"
ValidateBinary: can't stat "/usr/X11R6/bin/postgres"
FindExec: found "/usr/local/pgsql/bin/postgres" using PATH
binding ShmemCreate(key=52e2c1, size=9859300)
ERROR: InitMultiLocks: couldnt initialize lock table
Quit (core dumped)

InitMultiLocks calls LockMethodTableInit, so I inspected
LockMethodTableInit and found that it returned
lockMethodTable->ctl->lockmethod with a value of 0, which made
InitMultiLocks conclude that something had gone wrong.

Note that ipcs -m -l says:

max number of segments = 128
max seg size (kbytes) = 16384
max total shared memory (kbytes) = 16777216
min seg size (bytes) = 1

So there should be enough shared memory. Also note that -B 1023 runs
fine, but -B 1024 does not. Any idea?

This is 6.4.2 + Tom Lane's fix, running on Linux/MIPS (kernel 2.0.33)
with 32MB of memory.
--
Tatsuo Ishii
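The numbers in this report can be checked directly. Below is a minimal C sketch with a hypothetical helper, segment_fits (not part of PostgreSQL), that compares the segment size from the ShmemCreate line with the limits printed by ipcs -m -l. It assumes ipcs's "kbytes" means 1024 bytes and ignores segments already held by other processes.

```c
#include <stdbool.h>

/* Limits as printed by `ipcs -m -l`; "kbytes" assumed to be 1024 bytes.
 * unsigned long long avoids overflow when converting the total to bytes. */
struct shm_limits {
    unsigned long long max_seg_kb;    /* "max seg size (kbytes)" */
    unsigned long long max_total_kb;  /* "max total shared memory (kbytes)" */
    unsigned long long min_seg_bytes; /* "min seg size (bytes)" */
};

/* Hypothetical check: does one segment of `size` bytes respect the
 * per-segment and total limits? Existing segments are not counted. */
static bool segment_fits(unsigned long long size, const struct shm_limits *lim)
{
    const unsigned long long max_seg_bytes = lim->max_seg_kb * 1024ULL;
    const unsigned long long max_total_bytes = lim->max_total_kb * 1024ULL;
    return size >= lim->min_seg_bytes
        && size <= max_seg_bytes
        && size <= max_total_bytes;
}
```

With the values from this report, the 9859300-byte request is well under the 16777216-byte (16384 KB) per-segment limit, so the kernel limits alone do not explain why -B 1024 fails while -B 1023 works.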
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> One thing I noticed: when the backend runs out of semaphores, the
> postmaster dies with the following messages:

> IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
> NOTICE: Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
> I have rolled back the current transaction and am going to terminate your database system connection and exit.
> Please reconnect to the database system and repeat your query.

> Is this normal?

Yes, that's the behavior that we decided we'd better fix for 6.5.
I think retrofitting the various MaxBackends-related changes into 6.4.x
would be risky --- the changes are fairly widespread and have not gotten
all that much testing so far.

regards, tom lane
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > One thing I noticed: when the backend runs out of semaphores, the
> > postmaster dies with the following messages:
> >
> > IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
> > NOTICE: Message from PostgreSQL backend:
> > The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
> > I have rolled back the current transaction and am going to terminate your database system connection and exit.
> > Please reconnect to the database system and repeat your query.
> >
> > Is this normal?
>
> Yes, that's the behavior that we decided we'd better fix for 6.5.

Glad to hear that.

> I think retrofitting the various MaxBackends-related changes into 6.4.x
> would be risky --- the changes are fairly widespread and have not gotten
> all that much testing so far.

OK. I will keep your patches in case I run into trouble with many
backends. I think it should be noted somewhere (in a known bugs
section?) that 6.4.3 is not very stable with many backends.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> When I tried to start postmaster as:
> postmaster -d 3 -B 1024
> I got a core dump:

Can't duplicate that here, using either 6.4+fixes or current source.
Some platform dependency involved perhaps??

It seems possible that this indicates some further bugs in the shared
memory allocation stuff, so I think it needs to be pursued.

regards, tom lane