Re: Anyone understand shared-memory space usage? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Anyone understand shared-memory space usage?
Msg-id 10605.919730969@sss.pgh.pa.us
In response to Re: Anyone understand shared-memory space usage?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Re: Anyone understand shared-memory space usage?
List pgsql-hackers
I wrote:
> I would like someone to check my work; if the code was really as
> broken as I think it was, we should have been seeing more problems
> than we were.

I spent an hour tracing through startup of 6.4.x, and I now understand
why the thing doesn't crash despite the horrible bugs in ShmemInitHash.
Read on, if you have a strong stomach.

First off, ShmemInitHash allocates too small a chunk of space for
the hash header + directory (because it computes the size of the
directory as log2(max_size) *bytes* not longwords).  Then, it computes
the wrong address for the directory --- the expression
    infoP->dir = (long *) (location + sizeof(HHDR));
looks good until you remember that location is a pointer to long not
a pointer to char.  Upshot: the address computed for "dir" is typically
168 bytes past the end of the space actually allocated for it.
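
To make the pointer-arithmetic slip concrete, here is a minimal sketch.
DemoHeader is a hypothetical stand-in for HHDR (the real struct has
different fields and a different size), and both helper functions are
invented for illustration; only the shape of the address expression is
taken from the code described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hash header; NOT the real HHDR layout. */
typedef struct {
    long fields[8];
} DemoHeader;

/*
 * Buggy form: location is a long *, so adding sizeof(DemoHeader)
 * advances by that many longs, not that many bytes.
 * The caller must pass a buffer of at least sizeof(DemoHeader) longs
 * so the arithmetic stays within the array.
 */
size_t buggy_dir_offset(long *location)
{
    assert(location != NULL);
    long *dir = (long *) (location + sizeof(DemoHeader));
    return (size_t) ((char *) dir - (char *) location);
}

/* Intended form: do the arithmetic in bytes, then cast the result. */
size_t intended_dir_offset(long *location)
{
    assert(location != NULL);
    long *dir = (long *) ((char *) location + sizeof(DemoHeader));
    return (size_t) ((char *) dir - (char *) location);
}
```

The buggy form lands sizeof(long) times further out than the intended
one.  How far past the allocation that is depends on the real
sizeof(HHDR) and on the platform's long, which this sketch does not try
to reproduce.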

Why is this not fatal?  Well, the very next ShmemAlloc call is always
to create the first "segment" of the hashtable; this is always for 1024
bytes, so the dir pointer is no longer pointing to nowhere.  It is in
fact pointing at the 42'nd entry of its own first segment.  (HHGTTG fans
can find deep significance in this.)  In other words entry 42 of the
hash segment points back at the segment itself.

When you work through the logic in dynahash.c, you discover that the
upshot of this is that (a) the segment appears to be the first item on
its own 42'nd hash-bucket chain, and (b) the 0'th and 42'nd hash-bucket
chains are therefore the same list, or more accurately the 0'th chain is
the cdr of the 42'nd chain since it doesn't appear to contain the
segment itself.

As long as no searched-for hash key with a hash value of 0 or 42
happens to match whatever the first few words of the segment are,
things pretty much work.  The only way you'd really notice is that
hash_seq() will report some of the hashtable records twice, and will
also report one completely bogus "record" that is the hash segment.
Our uses of hash_seq() are apparently robust enough not to be bothered.

Things don't go to hell in a handbasket until and unless the hashtable
is expanded past 256 entries.  At that point another segment is allocated
and its pointer is stored in slot 43 of the old segment, causing all the
table entries that were in hashbucket 43 to instantly disappear from
view --- they can't be found by searching the table anymore.  Also,
hashchain 43 now appears to be the same as hashchain 256 (the first 
of the new segment), but that's not going to bother anyone any worse
than the first duplicated chain did.
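
The aliasing can be modelled in a few lines.  This is only a toy model
under stated assumptions: DemoTable, demo_init, demo_expand and
demo_bucket are invented names, the two segments are plain arrays rather
than shared-memory allocations, and the bucket lookup is a simplified
directory-then-segment index, not the actual dynahash.c code.  The
numbers 256 and 42 are the ones given above.

```c
#include <assert.h>
#include <stddef.h>

#define SEG_SLOTS  256  /* slots per segment (1024 bytes of 4-byte entries) */
#define ALIAS_SLOT 42   /* segment entry the miscomputed dir points at */

typedef void *Slot;     /* one bucket-chain head, or one directory entry */

/* Toy layout: the "directory" is not separate storage, it overlays seg0. */
typedef struct {
    Slot  seg0[SEG_SLOTS];
    Slot  seg1[SEG_SLOTS];
    Slot *dir;
} DemoTable;

/* Set up the table the way the bug leaves it after the first segment. */
void demo_init(DemoTable *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < SEG_SLOTS; i++) {
        t->seg0[i] = NULL;
        t->seg1[i] = NULL;
    }
    t->dir = &t->seg0[ALIAS_SLOT];  /* dir aliases entry 42 of seg0 */
    t->dir[0] = t->seg0;            /* so seg0[42] now points at seg0 */
}

/* Grow past 256 buckets: the new segment pointer goes into dir[1]. */
void demo_expand(DemoTable *t)
{
    assert(t != NULL);
    t->dir[1] = t->seg1;            /* this clobbers seg0[43] */
}

/* Simplified lookup: directory entry picks the segment, remainder the slot. */
Slot *demo_bucket(DemoTable *t, size_t bucket)
{
    assert(t != NULL);
    Slot *segment = (Slot *) t->dir[bucket / SEG_SLOTS];
    return &segment[bucket % SEG_SLOTS];
}
```

In this model, whatever chain head was stored in seg0[43] before
demo_expand() is overwritten by the pointer to seg1, and following
bucket 43 afterwards leads straight to the slot for bucket 256.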

I think it's entirely likely that this set of bugs can account for flaky
behavior seen in installations with more than 256 shared-memory buffers
(postmaster -B > 256), more than 256 simultaneously held locks (have no
idea how to translate that into user terms), or more than 256 concurrent
backends.  I'm still wondering whether that might describe Daryl
Dunbar's problem with locks not getting released, for example.

        regards, tom lane
