Thread: Synchronized scans versus relcache reinitialization
I've been poking at Jeff Frost's and Greg Mullane's recent reports of high load due to many processes getting "stuck" in relcache init file rebuild operations. I can reproduce a similar behavior here by creating a database containing a whole lot of many-column views, thereby bloating pg_attribute to the gigabyte range, then manually removing the pg_internal.init file (simulating what would happen after a relcache inval on any system catalog), and then throwing a bunch of new connections at the database simultaneously. Each new connection tries to rebuild the init file, and they basically saturate the machine. I don't believe that this case quite matches what happened to either Jeff or Greg, but nonetheless it's quite reproducible and it needs to be fixed.

I can identify three sub-issues:

1. If pg_attribute is larger than 1/4th of shared_buffers, the synchronized scan logic kicks in when we do seqscans to fill the tuple descriptors for core system catalogs. For this particular use case that's not merely unhelpful, it's positively disastrous. The reason is that the desired rows are almost always in the first couple dozen blocks of pg_attribute, and the reading code in RelationBuildTupleDesc knows this and is coded to stop once it's collected the expected number of pg_attribute rows for the particular catalog. So even with a very large pg_attribute, not much work should be expended here. But the syncscan logic causes some of the heapscans to start from points later than block zero, causing them to miss the rows they need, so that the scan has to run to the end and wrap around before it finds all the rows it needs. In my test case on HEAD, this happens just once out of the eleven heapscans that occur in this phase, if a single backend is doing this in isolation. That increases the startup time from a few milliseconds to about eight-tenths of a second, due to having to scan all of pg_attribute. (In my test case, pg_attribute is fully cached in RAM, but most of it is in kernel buffers not PG buffers.) Bad as that is, it gets rapidly worse if there are multiple incoming new connections. All of them get swept up in the full-table syncscan started by the first arrival, so that now all rather than only some of their heapscans start from a point later than block zero, meaning that all eleven rather than just one of their heapscans are unduly expensive.

It seems clear to me that we should just disable syncscans for the relcache reload heapscans. There is lots of downside due to breaking the early-exit optimization in RelationBuildTupleDesc, and basically no upside. I'm inclined to just modify systable_beginscan to prevent use of syncscan whenever indexOK is false. If we wanted to change its API we could make this happen only for RelationBuildTupleDesc's calls, but I don't see any upside for allowing syncscans for other forced-heapscan callers either. (The early-exit loop and the proposed change are both sketched below.)

2. The larger problem here is that when we have N incoming connections we let all N of them try to rebuild the init file independently. This doesn't make things faster for any one of them, and once N gets large enough it makes things slower for all of them. We would be better off letting the first arrival do the rebuild work while the others just sleep waiting for it. I believe that this fix would probably have ameliorated Jeff and Greg's cases, even though those do not seem to have triggered the syncscan logic. (A sketch of this approach appears further below.)
3. Having now spent a good deal of time poking at this, I think that the syncscan logic is in need of more tuning, and I am wondering whether we should even have it turned on by default. It appears to be totally useless for fully-cached-in-RAM scenarios, even if most of the relation is out in kernel buffers rather than in shared buffers. The best case I saw was less than 2X speedup compared to N-times-the-single-client case, and that wasn't very reproducible, and it didn't happen at all unless I hacked BAS_BULKREAD mode to use a ring buffer size many times larger than the current 256K setting (otherwise the timing requirements are too tight for multiple backends to stay in sync --- a seqscan can blow through that much data in a fraction of a millisecond these days, if it's reading from kernel buffers). The current tuning may be all right for cases where you're actually reading from spinning rust, but that seems to be a decreasing fraction of real-world use cases.

Anyway, I think we definitely need to fix systable_beginscan to not use syncscans; that's about a one-line change and seems plenty safe to backpatch. I also intend to look at avoiding concurrent relcache rebuilds, which I think should also be simple enough if we are willing to introduce an additional LWLock. (That would prevent concurrent init file rebuilds in different databases, but it's not clear that very many people care about such scenarios.) I am inclined to back-patch that as well; it's a bit riskier than the first change, but the first change is apparently not going to fix the two cases reported from the field.

Comments?

			regards, tom lane
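[For reference, a condensed sketch (not verbatim source) of the early-exit pattern in RelationBuildTupleDesc mentioned in point 1, in the shape the relcache.c code of this era has: the scan knows exactly how many pg_attribute rows it needs and bails out as soon as it has them, which is cheap only when the underlying heapscan starts at block zero.]

    /*
     * Condensed sketch of RelationBuildTupleDesc's scan over pg_attribute
     * (relcache.c, details trimmed).
     */
    SysScanDesc pg_attribute_scan;
    HeapTuple   pg_attribute_tuple;
    int         need;

    /* We know up front how many rows we expect to find. */
    need = RelationGetNumberOfAttributes(relation);

    pg_attribute_scan = systable_beginscan(pg_attribute_desc,
                                           AttributeRelidNumIndexId,
                                           criticalRelcachesBuilt, /* indexOK */
                                           SnapshotNow,
                                           2, skey);

    while (HeapTupleIsValid(pg_attribute_tuple =
                            systable_getnext(pg_attribute_scan)))
    {
        /* ... copy this row into the relation's tuple descriptor ... */
        need--;
        if (need == 0)
            break;      /* early exit: cheap only if the scan began at block 0 */
    }

    systable_endscan(pg_attribute_scan);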
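[And a minimal sketch of the proposed change itself, in the no-index branch of systable_beginscan (genam.c), assuming the heap_beginscan_strat() API that has been available since 8.3; the "one line" is the allow_sync argument.]

    else
    {
        /*
         * Plain heapscan (indexOK was false or no usable index).  Passing
         * allow_sync = false forces the scan to start at block zero, which
         * preserves callers' assumption --- notably RelationBuildTupleDesc's
         * early exit --- that rows near the start of the table come first.
         */
        sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
                                             nkeys, key,
                                             true,   /* allow_strat */
                                             false); /* allow_sync */
        sysscan->iscan = NULL;
    }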
On Sat, May 26, 2012 at 03:14:18PM -0400, Tom Lane wrote:
> It seems clear to me that we should just disable syncscans for the
> relcache reload heapscans. There is lots of downside due to breaking
> the early-exit optimization in RelationBuildTupleDesc, and basically no
> upside. I'm inclined to just modify systable_beginscan to prevent use
> of syncscan whenever indexOK is false. If we wanted to change its API
> we could make this happen only for RelationBuildTupleDesc's calls, but
> I don't see any upside for allowing syncscans for other forced-heapscan
> callers either.

Looks harmless enough, though it's only targeting a symptom. No matter how you cut it, the system is in a bad state when many backends simultaneously heapscan a large system catalog.

> 2. The larger problem here is that when we have N incoming connections
> we let all N of them try to rebuild the init file independently. This
> doesn't make things faster for any one of them, and once N gets large
> enough it makes things slower for all of them. We would be better off
> letting the first arrival do the rebuild work while the others just
> sleep waiting for it. I believe that this fix would probably have
> ameliorated Jeff and Greg's cases, even though those do not seem to
> have triggered the syncscan logic.

This strikes me as the clearer improvement; it fixes the root cause.

Thanks,
nm
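[To make that proposal concrete, a hypothetical sketch of serializing the rebuild behind one additional LWLock. The lock name and the load/write helpers are illustrative stand-ins for relcache.c's existing init-file routines, not actual patch code.]

    /*
     * Hypothetical sketch: let the first arrival rebuild pg_internal.init
     * while later arrivals sleep on the lock instead of duplicating the
     * work.  RelCacheRebuildLock is an assumed new LWLock;
     * load_relcache_init_file() and write_relcache_init_file() stand in
     * for relcache.c's existing load/write routines.
     */
    static void
    load_or_rebuild_init_file(void)
    {
        /* Fast path: the file may already exist and be valid. */
        if (load_relcache_init_file())
            return;

        LWLockAcquire(RelCacheRebuildLock, LW_EXCLUSIVE);

        /*
         * Re-check after acquiring the lock: the previous holder may have
         * just rebuilt the file for us.
         */
        if (!load_relcache_init_file())
            write_relcache_init_file(); /* rebuild via catalog scans */

        LWLockRelease(RelCacheRebuildLock);
    }

[Note that a single lock serializes rebuilds across all databases, which is the caveat raised above.]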
Noah Misch <noah@leadboat.com> writes:
> On Sat, May 26, 2012 at 03:14:18PM -0400, Tom Lane wrote:
>> It seems clear to me that we should just disable syncscans for the
>> relcache reload heapscans. There is lots of downside due to breaking
>> the early-exit optimization in RelationBuildTupleDesc, and basically no
>> upside. I'm inclined to just modify systable_beginscan to prevent use
>> of syncscan whenever indexOK is false. If we wanted to change its API
>> we could make this happen only for RelationBuildTupleDesc's calls, but
>> I don't see any upside for allowing syncscans for other forced-heapscan
>> callers either.

> Looks harmless enough, though it's only targeting a symptom. No matter how
> you cut it, the system is in a bad state when many backends simultaneously
> heapscan a large system catalog.

Agreed, but actually this isn't just a symptom: the syncscan code is *causing* full-table heapscans that would not occur otherwise.

>> 2. The larger problem here is that when we have N incoming connections
>> we let all N of them try to rebuild the init file independently. This
>> doesn't make things faster for any one of them, and once N gets large
>> enough it makes things slower for all of them. We would be better off
>> letting the first arrival do the rebuild work while the others just
>> sleep waiting for it. I believe that this fix would probably have
>> ameliorated Jeff and Greg's cases, even though those do not seem to
>> have triggered the syncscan logic.

> This strikes me as the clearer improvement; it fixes the root cause.

As I noted in the other thread, I've had second thoughts about this proposal: it would serialize incoming sessions even in cases where no benefit would be gained. Given the lack of previous reports I'm inclined to think that fixing the misapplication of syncscan logic should be enough to cure the problem, and hence we shouldn't take a risk of de-optimizing behavior that has generally worked fine for the last fifteen years.

			regards, tom lane
On Sat, 2012-05-26 at 15:14 -0400, Tom Lane wrote:
> 3. Having now spent a good deal of time poking at this, I think that the
> syncscan logic is in need of more tuning, and I am wondering whether we
> should even have it turned on by default. It appears to be totally
> useless for fully-cached-in-RAM scenarios, even if most of the relation
> is out in kernel buffers rather than in shared buffers. The best case
> I saw was less than 2X speedup compared to N-times-the-single-client
> case, and that wasn't very reproducible, and it didn't happen at all
> unless I hacked BAS_BULKREAD mode to use a ring buffer size many times
> larger than the current 256K setting (otherwise the timing requirements
> are too tight for multiple backends to stay in sync --- a seqscan can
> blow through that much data in a fraction of a millisecond these days,
> if it's reading from kernel buffers). The current tuning may be all
> right for cases where you're actually reading from spinning rust, but
> that seems to be a decreasing fraction of real-world use cases.

Do you mean that the best case you saw ever was 2X, or the best case when the table is mostly in kernel buffers was 2X? I clearly saw better than 2X when the table was on disk, so if you aren't, we should investigate.

One thing we could do is drive the threshold from effective_cache_size rather than shared_buffers, which was discussed during 8.3 development.

Regards,
	Jeff Davis
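[A rough sketch of that suggestion, applied to the threshold test in heapam.c's initscan(). Using effective_cache_size (an existing planner GUC, measured in pages) verbatim as the cutoff is an assumption about how this might be wired up, not settled design.]

    /*
     * In initscan(): today the cutoff is NBuffers / 4, i.e. a quarter of
     * shared_buffers.  The suggestion is to gate on total expected caching
     * instead; effective_cache_size is used here purely illustratively.
     */
    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > (BlockNumber) effective_cache_size)
    {
        allow_strat = scan->rs_allow_strat;
        allow_sync = scan->rs_allow_sync;
    }
    else
        allow_strat = allow_sync = false;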
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2012-05-26 at 15:14 -0400, Tom Lane wrote:
>> 3. Having now spent a good deal of time poking at this, I think that the
>> syncscan logic is in need of more tuning, and I am wondering whether we
>> should even have it turned on by default. It appears to be totally
>> useless for fully-cached-in-RAM scenarios, even if most of the relation
>> is out in kernel buffers rather than in shared buffers. The best case
>> I saw was less than 2X speedup compared to N-times-the-single-client
>> case, and that wasn't very reproducible, and it didn't happen at all
>> unless I hacked BAS_BULKREAD mode to use a ring buffer size many times
>> larger than the current 256K setting (otherwise the timing requirements
>> are too tight for multiple backends to stay in sync --- a seqscan can
>> blow through that much data in a fraction of a millisecond these days,
>> if it's reading from kernel buffers). The current tuning may be all
>> right for cases where you're actually reading from spinning rust, but
>> that seems to be a decreasing fraction of real-world use cases.

> Do you mean that the best case you saw ever was 2X, or the best case
> when the table is mostly in kernel buffers was 2X?

I was only examining a fully-cached-in-RAM case.

> I clearly saw better than 2X when the table was on disk, so if you
> aren't, we should investigate.

I don't doubt that syncscan can provide better than 2X speedup if you have more than 2 concurrent readers for a syncscan traversing data that's too big to fit in RAM. What I'm questioning is whether such cases represent a sufficiently large fraction of our userbase to justify having syncscan on by default. I would be happier about having it on if it seemed to be useful for fully-cached scenarios, but it doesn't.

> One thing we could do is drive the threshold from effective_cache_size
> rather than shared_buffers, which was discussed during 8.3 development.

If we were going to do that, I think that we'd need to consider having different thresholds for using bulkread access strategy and using syncscan, because not using bulkread is going to blow out the shared_buffers cache. We originally avoided that on the grounds of not wanting to have to optimize more than 2 behaviors, but maybe it's time to investigate more.

			regards, tom lane
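[Extending the previous sketch along those lines, the two decisions would get separate cutoffs: the bulk-read ring buffer exists to protect shared_buffers, so its threshold stays tied to NBuffers, while syncscan's threshold moves to effective_cache_size. Again illustrative only.]

    /*
     * Illustrative only: decouple the two thresholds.  Keep the bulk-read
     * strategy keyed to shared_buffers (which the ring buffer protects),
     * but key synchronized scanning to total expected caching.
     */
    if (!RelationUsesLocalBuffers(scan->rs_rd))
    {
        allow_strat = scan->rs_allow_strat &&
            scan->rs_nblocks > NBuffers / 4;
        allow_sync = scan->rs_allow_sync &&
            scan->rs_nblocks > (BlockNumber) effective_cache_size;
    }
    else
        allow_strat = allow_sync = false;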