Thread: Tom Lane's fixes in v6.4.3

Tom Lane's fixes in v6.4.3

From
Tatsuo Ishii
Date:
>From Tom Lane's horror story...

>I spent an hour tracing through startup of 6.4.x, and I now understand
>why the thing doesn't crash despite the horrible bugs in ShmemInitHash.
>Read on, if you have a strong stomach.

Are Tom Lane's fixes included in 6.4.3 beta? I think his findings are
so important.
---
Tatsuo Ishii


Re: [HACKERS] Tom Lane's fixes in v6.4.3

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> From Tom Lane's horror story...
>> I spent an hour tracing through startup of 6.4.x, and I now understand
>> why the thing doesn't crash despite the horrible bugs in ShmemInitHash.
>> Read on, if you have a strong stomach.

> Are Tom Lane's fixes included in 6.4.3 beta? I think his findings are
> so important.

I have made a patch for 6.4.x which I intend to commit into the REL6_4
tree, but it hasn't gotten as much testing as I would like.  The patch
is attached if you care to try it for a while first.  (These changes
are already in the development tree, BTW.)
        regards, tom lane


Re: [HACKERS] Tom Lane's fixes in v6.4.3

From
Tatsuo Ishii
Date:
>Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>>> From Tom Lane's horror story...
>>> I spent an hour tracing through startup of 6.4.x, and I now understand
>>> why the thing doesn't crash despite the horrible bugs in ShmemInitHash.
>>> Read on, if you have a strong stomach.
>
>> Are Tom Lane's fixes included in 6.4.3 beta? I think his findings are
>> so important.
>
>I have made a patch for 6.4.x which I intend to commit into the REL6_4
>tree, but it hasn't gotten as much testing as I would like.  The patch
>is attached if you care to try it for a while first.  (These changes
>are already in the development tree, BTW.)

Thanks. Your patches work fine with the fresh REL6_4 sources I got this
morning.  One thing I noticed: when the backend runs out of
semaphores, the postmaster dies with the following messages:

IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
NOTICE:  Message from PostgreSQL backend:
    The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
    I have rolled back the current transaction and am going to terminate your database system connection and exit.
    Please reconnect to the database system and repeat your query.
 

Is this normal?

I thought the postmaster tried to re-initialize the shared buffers and
resume normal operation in this case.
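
For readers puzzled by the "No space left on device" text in the log above: that string is the standard description of errno ENOSPC, which semget() returns when the system-wide limit on semaphore sets or semaphores would be exceeded. It has nothing to do with disk space. Below is a minimal sketch that rebuilds a line of the same shape. The function name format_semget_failure is hypothetical, and printing the permission bits in octal is an assumption based on the "permission=600" in the quoted log; the exact description text comes from the C library and may differ by platform.

```python
import errno
import os


def format_semget_failure(key: int, num: int, permission: int, err: int) -> str:
    """Build a log line shaped like the IpcSemaphoreCreate message in the thread.

    Hypothetical helper for illustration; not PostgreSQL's own code.
    """
    # os.strerror() maps an errno value to the C library's description,
    # the same text that the C function strerror() would return.
    reason = os.strerror(err)
    # Assumption: permission bits are shown in octal, so 0o600 prints as 600.
    return (
        f"IpcSemaphoreCreate: semget failed ({reason}) "
        f"key={key}, num={num}, permission={permission:o}"
    )


if __name__ == "__main__":
    # ENOSPC is what semget() reports when the semaphore limits are exhausted.
    print(format_semget_failure(5432017, 16, 0o600, errno.ENOSPC))
```

Raising the kernel's semaphore limits, or lowering the number of concurrent backends, is the usual way to avoid the ENOSPC case.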

BTW, the 6.4 tree does not have the max backend patch I posted. So even if
there are enough resources, the backend will crash if connections >
MaxBackends.
--
Tatsuo Ishii


Re: [HACKERS] Tom Lane's fixes in v6.4.3

From
Tatsuo Ishii
Date:
When I tried to start postmaster as:

postmaster -d 3 -B 1024

I got a core dump:

FindExec: searching PATH ...
ValidateBinary: can't stat "/home/httpd/html/users/t-ishii/bin/postgres"
ValidateBinary: can't stat "/usr/local/bin/postgres"
ValidateBinary: can't stat "/bin/postgres"
ValidateBinary: can't stat "/usr/bin/postgres"
ValidateBinary: can't stat "/home/httpd/html/users/t-ishii/src/pgsql/postgresql-6.4.2/src/backend./postgres"
ValidateBinary: can't stat "/usr/X11R6/bin/postgres"
FindExec: found "/usr/local/pgsql/bin/postgres" using PATH
binding ShmemCreate(key=52e2c1, size=9859300)
ERROR:  InitMultiLocks: couldnt initialize lock table
Quit (core dumped)

InitMultiLocks calls LockMethodTableInit. So I inspected
LockMethodTableInit and found that it returned
lockMethodTable->ctl->lockmethod with value 0, which made
InitMultiLocks conclude that something went wrong.
Note that ipcs -m -l says:

max number of segments = 128
max seg size (kbytes) = 16384
max total shared memory (kbytes) = 16777216
min seg size (bytes) = 1

So there should be enough shared memory. Also note that -B 1023 runs
fine, but -B 1024 does not.
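
The arithmetic behind "there should be enough shared memory" can be checked with a short sketch. The request size (9859300 bytes) and the per-segment limit (16384 kbytes) are taken from the output above. The helper names are hypothetical, and the 8192-byte buffer block size is an assumption about the default build, not something stated in this thread.

```python
KIB = 1024
ASSUMED_BLOCK_SIZE = 8192  # assumption: default disk block size of the build


def segment_limit_bytes(max_seg_kbytes: int) -> int:
    """Convert the 'max seg size (kbytes)' figure from ipcs into bytes."""
    return max_seg_kbytes * KIB


def segment_fits(request_bytes: int, max_seg_kbytes: int) -> bool:
    """True if a single request is within the per-segment limit."""
    return request_bytes <= segment_limit_bytes(max_seg_kbytes)


def headroom_bytes(request_bytes: int, max_seg_kbytes: int) -> int:
    """Bytes left under the per-segment limit after the request."""
    return segment_limit_bytes(max_seg_kbytes) - request_bytes


def buffer_pool_bytes(n_buffers: int, block_size: int = ASSUMED_BLOCK_SIZE) -> int:
    """Space taken by the buffer blocks alone for a given -B value."""
    return n_buffers * block_size


if __name__ == "__main__":
    request = 9859300       # size logged by ShmemCreate
    max_seg_kbytes = 16384  # from ipcs -m -l
    print(segment_limit_bytes(max_seg_kbytes))      # 16777216
    print(segment_fits(request, max_seg_kbytes))    # True
    print(headroom_bytes(request, max_seg_kbytes))  # 6917916
    print(buffer_pool_bytes(1023), buffer_pool_bytes(1024))
```

Since the request is well under the segment limit, the kernel limits alone do not explain the failure, which is consistent with looking for the cause inside the lock table initialization instead.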

Any idea?

This is 6.4.2 + Tom Lane's fix running on Linux/MIPS (kernel 2.0.33) with
32MB of memory.
--
Tatsuo Ishii


Re: [HACKERS] Tom Lane's fixes in v6.4.3

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> One thing I noticed: when the backend runs out of
> semaphores, the postmaster dies with the following messages:

> IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
> NOTICE:  Message from PostgreSQL backend:
>     The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
>     I have rolled back the current transaction and am going to terminate your database system connection and exit.
>     Please reconnect to the database system and repeat your query.

> Is this normal?

Yes, that's the behavior that we decided we'd better fix for 6.5.

I think retrofitting the various MaxBackends-related changes into 6.4.x
would be risky --- the changes are fairly widespread and have not gotten
all that much testing so far.
        regards, tom lane


Re: [HACKERS] Tom Lane's fixes in v6.4.3

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > One thing I noticed: when the backend runs out of
> > semaphores, the postmaster dies with the following messages:
> 
> > IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
> > NOTICE:  Message from PostgreSQL backend:
> >     The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
> >     I have rolled back the current transaction and am going to terminate your database system connection and exit.
> >     Please reconnect to the database system and repeat your query.
> 
> > Is this normal?
> 
> Yes, that's the behavior that we decided we'd better fix for 6.5.

Glad to hear that.

> I think retrofitting the various MaxBackends-related changes into 6.4.x
> would be risky --- the changes are fairly widespread and have not gotten
> all that much testing so far.

OK. I will keep your patches in case I have trouble with many
backends. I think it should be noted somewhere that 6.4.3 is not very
stable with many backends (known bugs section?).
--
Tatsuo Ishii


Re: [HACKERS] Tom Lane's fixes in v6.4.3

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> When I tried to start postmaster as:
> postmaster -d 3 -B 1024
> I got a core dump:

Can't duplicate that here, using either 6.4+fixes or current source.
Some platform dependency involved perhaps??

It seems possible that this indicates some further bugs in the shared
memory allocation stuff, so I think it needs to be pursued.
        regards, tom lane