Thread: Synchronized scans versus relcache reinitialization
I've been poking at Jeff Frost's and Greg Mullane's recent reports of high load due to many processes getting "stuck" in relcache init file rebuild operations. I can reproduce a similar behavior here by creating a database containing a whole lot of many-column views, thereby bloating pg_attribute to the gigabyte range, then manually removing the pg_internal.init file (simulating what would happen after a relcache inval on any system catalog), and then throwing a bunch of new connections at the database simultaneously. Each new connection tries to rebuild the init file, and they basically saturate the machine. I don't believe that this case quite matches what happened to either Jeff or Greg, but nonetheless it's quite reproducible and it needs to be fixed.

I can identify three sub-issues:

1. If pg_attribute is larger than 1/4th of shared_buffers, the synchronized scan logic kicks in when we do seqscans to fill the tuple descriptors for core system catalogs. For this particular use case that's not merely unhelpful, it's positively disastrous. The reason is that the desired rows are almost always in the first couple dozen blocks of pg_attribute, and the reading code in RelationBuildTupleDesc knows this and is coded to stop once it's collected the expected number of pg_attribute rows for the particular catalog. So even with a very large pg_attribute, not much work should be expended here. But the syncscan logic causes some of the heapscans to start from points later than block zero, causing them to miss the rows they need, so that the scan has to run to the end and wrap around before it finds all the rows it needs. In my test case on HEAD, this happens just once out of the eleven heapscans that occur in this phase, if a single backend is doing this in isolation. That increases the startup time from a few milliseconds to about eight-tenths of a second, due to having to scan all of pg_attribute. (In my test case, pg_attribute is fully cached in RAM, but most of it is in kernel buffers not PG buffers.) Bad as that is, it gets rapidly worse if there are multiple incoming new connections. All of them get swept up in the full-table syncscan started by the first arrival, so that now all rather than only some of their heapscans start from a point later than block zero, meaning that all eleven rather than just one of their heapscans are unduly expensive.

It seems clear to me that we should just disable syncscans for the relcache reload heapscans. There is lots of downside due to breaking the early-exit optimization in RelationBuildTupleDesc, and basically no upside. I'm inclined to just modify systable_beginscan to prevent use of syncscan whenever indexOK is false. If we wanted to change its API we could make this happen only for RelationBuildTupleDesc's calls, but I don't see any upside for allowing syncscans for other forced-heapscan callers either. (The early-exit loop and the proposed change are both sketched below.)

2. The larger problem here is that when we have N incoming connections we let all N of them try to rebuild the init file independently. This doesn't make things faster for any one of them, and once N gets large enough it makes things slower for all of them. We would be better off letting the first arrival do the rebuild work while the others just sleep waiting for it. I believe that this fix would probably have ameliorated Jeff and Greg's cases, even though those do not seem to have triggered the syncscan logic. (A sketch of this approach appears further below.)
3. Having now spent a good deal of time poking at this, I think that the syncscan logic is in need of more tuning, and I am wondering whether we should even have it turned on by default. It appears to be totally useless for fully-cached-in-RAM scenarios, even if most of the relation is out in kernel buffers rather than in shared buffers. The best case I saw was less than 2X speedup compared to N-times-the-single-client case, and that wasn't very reproducible, and it didn't happen at all unless I hacked BAS_BULKREAD mode to use a ring buffer size many times larger than the current 256K setting (otherwise the timing requirements are too tight for multiple backends to stay in sync --- a seqscan can blow through that much data in a fraction of a millisecond these days, if it's reading from kernel buffers). The current tuning may be all right for cases where you're actually reading from spinning rust, but that seems to be a decreasing fraction of real-world use cases.

Anyway, I think we definitely need to fix systable_beginscan to not use syncscans; that's about a one-line change and seems plenty safe to backpatch. I also intend to look at avoiding concurrent relcache rebuilds, which I think should also be simple enough if we are willing to introduce an additional LWLock. (That would prevent concurrent init file rebuilds in different databases, but it's not clear that very many people care about such scenarios.) I am inclined to back-patch that as well; it's a bit riskier than the first change, but the first change is apparently not going to fix the two cases reported from the field.

Comments?

			regards, tom lane
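[For reference, a condensed sketch (not verbatim source) of the early-exit pattern in RelationBuildTupleDesc mentioned in point 1, in the shape the relcache.c code of this era has: the scan knows exactly how many pg_attribute rows it needs and bails out as soon as it has them, which is cheap only when the underlying heapscan starts at block zero.]

    /*
     * Condensed sketch of RelationBuildTupleDesc's scan over pg_attribute
     * (relcache.c, details trimmed).
     */
    SysScanDesc pg_attribute_scan;
    HeapTuple   pg_attribute_tuple;
    int         need;

    /* We know up front how many rows we expect to find. */
    need = RelationGetNumberOfAttributes(relation);

    pg_attribute_scan = systable_beginscan(pg_attribute_desc,
                                           AttributeRelidNumIndexId,
                                           criticalRelcachesBuilt, /* indexOK */
                                           SnapshotNow,
                                           2, skey);

    while (HeapTupleIsValid(pg_attribute_tuple =
                            systable_getnext(pg_attribute_scan)))
    {
        /* ... copy this row into the relation's tuple descriptor ... */
        need--;
        if (need == 0)
            break;      /* early exit: cheap only if the scan began at block 0 */
    }

    systable_endscan(pg_attribute_scan);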
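[And a minimal sketch of the proposed change itself, in the no-index branch of systable_beginscan (genam.c), assuming the heap_beginscan_strat() API that has been available since 8.3; the "one line" is the allow_sync argument.]

    else
    {
        /*
         * Plain heapscan (indexOK was false or no usable index).  Passing
         * allow_sync = false forces the scan to start at block zero, which
         * preserves callers' assumption --- notably RelationBuildTupleDesc's
         * early exit --- that rows near the start of the table come first.
         */
        sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
                                             nkeys, key,
                                             true,   /* allow_strat */
                                             false); /* allow_sync */
        sysscan->iscan = NULL;
    }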
On Sat, May 26, 2012 at 03:14:18PM -0400, Tom Lane wrote:
> It seems clear to me that we should just disable syncscans for the
> relcache reload heapscans. There is lots of downside due to breaking
> the early-exit optimization in RelationBuildTupleDesc, and basically no
> upside. I'm inclined to just modify systable_beginscan to prevent use
> of syncscan whenever indexOK is false. If we wanted to change its API
> we could make this happen only for RelationBuildTupleDesc's calls, but
> I don't see any upside for allowing syncscans for other forced-heapscan
> callers either.

Looks harmless enough, though it's only targeting a symptom. No matter how you cut it, the system is in a bad state when many backends simultaneously heapscan a large system catalog.

> 2. The larger problem here is that when we have N incoming connections
> we let all N of them try to rebuild the init file independently. This
> doesn't make things faster for any one of them, and once N gets large
> enough it makes things slower for all of them. We would be better off
> letting the first arrival do the rebuild work while the others just
> sleep waiting for it. I believe that this fix would probably have
> ameliorated Jeff and Greg's cases, even though those do not seem to
> have triggered the syncscan logic.

This strikes me as the clearer improvement; it fixes the root cause.

Thanks,
nm
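[To make that proposal concrete, a hypothetical sketch of serializing the rebuild behind one additional LWLock. The lock name and the load/write helpers are illustrative stand-ins for relcache.c's existing init-file routines, not actual patch code.]

    /*
     * Hypothetical sketch: let the first arrival rebuild pg_internal.init
     * while later arrivals sleep on the lock instead of duplicating the
     * work.  RelCacheRebuildLock is an assumed new LWLock;
     * load_relcache_init_file() and write_relcache_init_file() stand in
     * for relcache.c's existing load/write routines.
     */
    static void
    load_or_rebuild_init_file(void)
    {
        /* Fast path: the file may already exist and be valid. */
        if (load_relcache_init_file())
            return;

        LWLockAcquire(RelCacheRebuildLock, LW_EXCLUSIVE);

        /*
         * Re-check after acquiring the lock: the previous holder may have
         * just rebuilt the file for us.
         */
        if (!load_relcache_init_file())
            write_relcache_init_file(); /* rebuild via catalog scans */

        LWLockRelease(RelCacheRebuildLock);
    }

[Note that a single lock serializes rebuilds across all databases, which is the caveat raised above.]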
Noah Misch <noah@leadboat.com> writes:
> On Sat, May 26, 2012 at 03:14:18PM -0400, Tom Lane wrote:
>> It seems clear to me that we should just disable syncscans for the
>> relcache reload heapscans. There is lots of downside due to breaking
>> the early-exit optimization in RelationBuildTupleDesc, and basically no
>> upside. I'm inclined to just modify systable_beginscan to prevent use
>> of syncscan whenever indexOK is false. If we wanted to change its API
>> we could make this happen only for RelationBuildTupleDesc's calls, but
>> I don't see any upside for allowing syncscans for other forced-heapscan
>> callers either.

> Looks harmless enough, though it's only targeting a symptom. No matter how
> you cut it, the system is in a bad state when many backends simultaneously
> heapscan a large system catalog.

Agreed, but actually this isn't just a symptom: the syncscan code is *causing* full-table heapscans that would not occur otherwise.

>> 2. The larger problem here is that when we have N incoming connections
>> we let all N of them try to rebuild the init file independently. This
>> doesn't make things faster for any one of them, and once N gets large
>> enough it makes things slower for all of them. We would be better off
>> letting the first arrival do the rebuild work while the others just
>> sleep waiting for it. I believe that this fix would probably have
>> ameliorated Jeff and Greg's cases, even though those do not seem to
>> have triggered the syncscan logic.

> This strikes me as the clearer improvement; it fixes the root cause.

As I noted in the other thread, I've had second thoughts about this proposal: it would serialize incoming sessions even in cases where no benefit would be gained. Given the lack of previous reports I'm inclined to think that fixing the misapplication of syncscan logic should be enough to cure the problem, and hence we shouldn't take a risk of de-optimizing behavior that has generally worked fine for the last fifteen years.

			regards, tom lane
On Sat, 2012-05-26 at 15:14 -0400, Tom Lane wrote:
> 3. Having now spent a good deal of time poking at this, I think that the
> syncscan logic is in need of more tuning, and I am wondering whether we
> should even have it turned on by default. It appears to be totally
> useless for fully-cached-in-RAM scenarios, even if most of the relation
> is out in kernel buffers rather than in shared buffers. The best case
> I saw was less than 2X speedup compared to N-times-the-single-client
> case, and that wasn't very reproducible, and it didn't happen at all
> unless I hacked BAS_BULKREAD mode to use a ring buffer size many times
> larger than the current 256K setting (otherwise the timing requirements
> are too tight for multiple backends to stay in sync --- a seqscan can
> blow through that much data in a fraction of a millisecond these days,
> if it's reading from kernel buffers). The current tuning may be all
> right for cases where you're actually reading from spinning rust, but
> that seems to be a decreasing fraction of real-world use cases.

Do you mean that the best case you saw ever was 2X, or the best case when the table is mostly in kernel buffers was 2X? I clearly saw better than 2X when the table was on disk, so if you aren't, we should investigate.

One thing we could do is drive the threshold from effective_cache_size rather than shared_buffers, which was discussed during 8.3 development.

Regards,
	Jeff Davis
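[A rough sketch of that suggestion, applied to the threshold test in heapam.c's initscan(). Using effective_cache_size (an existing planner GUC, measured in pages) verbatim as the cutoff is an assumption about how this might be wired up, not settled design.]

    /*
     * In initscan(): today the cutoff is NBuffers / 4, i.e. a quarter of
     * shared_buffers.  The suggestion is to gate on total expected caching
     * instead; effective_cache_size is used here purely illustratively.
     */
    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > (BlockNumber) effective_cache_size)
    {
        allow_strat = scan->rs_allow_strat;
        allow_sync = scan->rs_allow_sync;
    }
    else
        allow_strat = allow_sync = false;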
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2012-05-26 at 15:14 -0400, Tom Lane wrote:
>> 3. Having now spent a good deal of time poking at this, I think that the
>> syncscan logic is in need of more tuning, and I am wondering whether we
>> should even have it turned on by default. It appears to be totally
>> useless for fully-cached-in-RAM scenarios, even if most of the relation
>> is out in kernel buffers rather than in shared buffers. The best case
>> I saw was less than 2X speedup compared to N-times-the-single-client
>> case, and that wasn't very reproducible, and it didn't happen at all
>> unless I hacked BAS_BULKREAD mode to use a ring buffer size many times
>> larger than the current 256K setting (otherwise the timing requirements
>> are too tight for multiple backends to stay in sync --- a seqscan can
>> blow through that much data in a fraction of a millisecond these days,
>> if it's reading from kernel buffers). The current tuning may be all
>> right for cases where you're actually reading from spinning rust, but
>> that seems to be a decreasing fraction of real-world use cases.

> Do you mean that the best case you saw ever was 2X, or the best case
> when the table is mostly in kernel buffers was 2X?

I was only examining a fully-cached-in-RAM case.

> I clearly saw better than 2X when the table was on disk, so if you
> aren't, we should investigate.

I don't doubt that syncscan can provide better than 2X speedup if you have more than 2 concurrent readers for a syncscan traversing data that's too big to fit in RAM. What I'm questioning is whether such cases represent a sufficiently large fraction of our userbase to justify having syncscan on by default. I would be happier about having it on if it seemed to be useful for fully-cached scenarios, but it doesn't.

> One thing we could do is drive the threshold from effective_cache_size
> rather than shared_buffers, which was discussed during 8.3 development.

If we were going to do that, I think that we'd need to consider having different thresholds for using bulkread access strategy and using syncscan, because not using bulkread is going to blow out the shared_buffers cache. We originally avoided that on the grounds of not wanting to have to optimize more than 2 behaviors, but maybe it's time to investigate more.

			regards, tom lane
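[Extending the previous sketch along those lines, the two decisions would get separate cutoffs: the bulk-read ring buffer exists to protect shared_buffers, so its threshold stays tied to NBuffers, while syncscan's threshold moves to effective_cache_size. Again illustrative only.]

    /*
     * Illustrative only: decouple the two thresholds.  Keep the bulk-read
     * strategy keyed to shared_buffers (which the ring buffer protects),
     * but key synchronized scanning to total expected caching.
     */
    if (!RelationUsesLocalBuffers(scan->rs_rd))
    {
        allow_strat = scan->rs_allow_strat &&
            scan->rs_nblocks > NBuffers / 4;
        allow_sync = scan->rs_allow_sync &&
            scan->rs_nblocks > (BlockNumber) effective_cache_size;
    }
    else
        allow_strat = allow_sync = false;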