Thread: Cache invalidation bug in RelationGetIndexAttrBitmap()
On 2014-05-14 15:17:39 +0200, Andres Freund wrote: > On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote: > > Apparently there's something wrong with 'test-decoding-check': > > Man. I shouldn't have asked... My code. There's some output in there > that's probably triggered by the extraordinarily long runtimes, but > there's definitely something else wrong. > My gut feeling says it's in RelationGetIndexList(). Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached. Tomas, thanks for that. I've never (and probably will never) run CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that regularly is really helpful. How long does a single testrun take? It takes hundreds of seconds here to do a single UPDATE? There were some more differences but those are all harmless and caused by the extraordinarily long runtime (autovacuums). I think we need to add a feature to test_decoding to suppress displaying transactions without changes. Ick. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-05-14 15:17:39 +0200, Andres Freund wrote: >> My gut feeling says it's in RelationGetIndexList(). > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached. TBH, I don't believe this patch at all. Where exactly is rd_replidindex reset? If it's supposed to have similar lifespan to, say, rd_oidindex, why isn't it being handled like rd_oidindex? And why does the header comment for RelationGetIndexList make no mention of this new side-effect? Somebody did a seriously poor job of adding this functionality to relcache. regards, tom lane
On 2014-05-14 12:15:27 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-05-14 15:17:39 +0200, Andres Freund wrote: > >> My gut feeling says it's in RelationGetIndexList(). > > > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached. > > TBH, I don't believe this patch at all. Where exactly is rd_replidindex > reset? If it's supposed to have similar lifespan to, say, rd_oidindex, > why isn't it being handled like rd_oidindex? I don't see why it'd have a different lifespan than rd_oidindex at all? If the latter were used inside the loop it'd be a bug as well. > And why does the header > comment for RelationGetIndexList make no mention of this new side-effect? > Somebody did a seriously poor job of adding this functionality to > relcache. It's not like it's not documented: There's a comment about it in struct RelationData. I must have missed that rd_oidindex has a comment abou it's lifetime over RelationGetIndexList(). I personally actually prefer the struct as the location for the lifetime. I can send a patch to commonalize the location in either place with the other location pointing to it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-05-14 12:15:27 -0400, Tom Lane wrote: >> And why does the header >> comment for RelationGetIndexList make no mention of this new side-effect? >> Somebody did a seriously poor job of adding this functionality to >> relcache. > It's not like it's not documented: There's a comment about it in struct > RelationData. I must have missed that rd_oidindex has a comment abou > it's lifetime over RelationGetIndexList(). If rd_replidindex is being managed like rd_oidindex, then it should be managed just like rd_oidindex, including getting reset in all the places rd_oidindex is. This might be just a matter of cleanliness but I think it's important for readability and debuggability. I notice also that rd_keyattr and rd_idattr have been implemented with bad copies of the logic for rd_indexattr. This is at least leading to a permanent memory leak in CacheMemoryContext during every relcache flush, and maybe worse things. The bugs for rd_keyattr appear to predate your patch though. Working on a patch for this now. One thing I'm wondering about is RelationSetIndexList. It's probably okay for it not to touch rd_keyattr and rd_idattr, but I'm not too clear on what the use cases for those attnum sets are. regards, tom lane
On 2014-05-14 13:32:43 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-05-14 12:15:27 -0400, Tom Lane wrote: > >> And why does the header > >> comment for RelationGetIndexList make no mention of this new side-effect? > >> Somebody did a seriously poor job of adding this functionality to > >> relcache. > > > It's not like it's not documented: There's a comment about it in struct > > RelationData. I must have missed that rd_oidindex has a comment abou > > it's lifetime over RelationGetIndexList(). > > If rd_replidindex is being managed like rd_oidindex, then it should be > managed just like rd_oidindex, including getting reset in all the places > rd_oidindex is. This might be just a matter of cleanliness but I think > it's important for readability and debuggability. Agreed. I am not against resetting it. I think I might not have been aware of rd_oidindex when writing that code... > I notice also that rd_keyattr and rd_idattr have been implemented with > bad copies of the logic for rd_indexattr. This is at least leading > to a permanent memory leak in CacheMemoryContext during every relcache > flush, and maybe worse things. The bugs for rd_keyattr appear to predate > your patch though. Hm. Yes, the bitmapsets should be freed. I guess I copied the logic for keyattr and didn't find any relevant places that touch it. rd_keyattr should go back to 9.3. > Working on a patch for this now. One thing I'm wondering about is > RelationSetIndexList. It's probably okay for it not to touch rd_keyattr > and rd_idattr, but I'm not too clear on what the use cases for those > attnum sets are. rd_keyattr is used to determine whether a heap_update() changed any keys that could be referenced by a foreign key. That's then used to determine which locklevel an update requires. rd_idattr does something similar. It decides whether the configured REPLICA IDENTITY key has changed so whether to log the old primary key for logical decoding or not. I can't see why either would need to care about forced index lists right now, but will do a scan of the sources to see if I am wrong. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 14.5.2014 17:52, Andres Freund wrote: > On 2014-05-14 15:17:39 +0200, Andres Freund wrote: >> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote: >>> Apparently there's something wrong with 'test-decoding-check': >> >> Man. I shouldn't have asked... My code. There's some output in there >> that's probably triggered by the extraordinarily long runtimes, but >> there's definitely something else wrong. >> My gut feeling says it's in RelationGetIndexList(). > > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached. > > Tomas, thanks for that. I've never (and probably will never) run > CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that > regularly is really helpful. How long does a single testrun take? It > takes hundreds of seconds here to do a single UPDATE? Don't know yet, as it fails at the beginning. But I suppose it will be tens or possibly hundreds of hours. For example these are the logs from regular build (no clobber etc.) May 14 19:00 SCM-checkout.log May 14 19:00 githead.log May 14 19:00 configure.log May 14 19:00 config.log May14 19:05 make.log May 14 19:05 check.log May 14 19:06 make-contrib.log May 14 19:06 make-install.log May 14 19:06install-contrib.log May 14 19:07 check-pg_upgrade.log May 14 19:08 test-decoding-check.log while these are the logs from recursive clobber: May 14 00:19 SCM-checkout.log May 14 00:20 configure.log May 14 00:20 config.log May 14 00:26 make.log May 1403:12 check.log May 14 03:13 make-contrib.log May 14 03:13 make-install.log May 14 03:13 install-contrib.log May14 08:25 check-pg_upgrade.log May 14 09:07 test-decoding-check.log May 14 09:07 web-txn.data So with the regular build, it took <1 minute to do 'make check' and ~1 minute to test pg_upgrade, with recursive clobber it takes ~3 hours and ~5 hours. That's a factor of ~300, although it's a very rough estimate. Without clobber the whole run (for a "C" locale) takes ~10 minutes, so my estimate is ~50 hours for the recursive one. But I wouldn't be surprised by 100 hours. > > There were some more differences but those are all harmless and caused > by the extraordinarily long runtime (autovacuums). I think we need to > add a feature to test_decoding to suppress displaying transactions > without changes. Ick. > I expect to hit more timing-related issues with the recursive clobber tests - not necessarily in the code/tests itself, but I guess the buildfarm tooling doesn't really expect runs that long. regards Tomas
Hi, On 2014-05-14 21:04:41 +0200, Tomas Vondra wrote: > On 14.5.2014 17:52, Andres Freund wrote: > > On 2014-05-14 15:17:39 +0200, Andres Freund wrote: > >> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote: > >>> Apparently there's something wrong with 'test-decoding-check': > >> > >> Man. I shouldn't have asked... My code. There's some output in there > >> that's probably triggered by the extraordinarily long runtimes, but > >> there's definitely something else wrong. > >> My gut feeling says it's in RelationGetIndexList(). > > > > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached. > > > > Tomas, thanks for that. I've never (and probably will never) run > > CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that > > regularly is really helpful. How long does a single testrun take? It > > takes hundreds of seconds here to do a single UPDATE? > > Don't know yet, as it fails at the beginning. test decoding is at the beginning? That's somewhat odd? > But I suppose it will be > tens or possibly hundreds of hours. For example these are the logs from > regular build (no clobber etc.) > May 14 19:00 SCM-checkout.log > May 14 19:00 githead.log > May 14 19:00 configure.log > May 14 19:00 config.log > May 14 19:05 make.log > May 14 19:05 check.log > May 14 19:06 make-contrib.log > May 14 19:06 make-install.log > May 14 19:06 install-contrib.log > May 14 19:07 check-pg_upgrade.log > May 14 19:08 test-decoding-check.log > > while these are the logs from recursive clobber: > > May 14 00:19 SCM-checkout.log > May 14 00:20 configure.log > May 14 00:20 config.log > May 14 00:26 make.log > May 14 03:12 check.log > May 14 03:13 make-contrib.log > May 14 03:13 make-install.log > May 14 03:13 install-contrib.log > May 14 08:25 check-pg_upgrade.log > May 14 09:07 test-decoding-check.log > May 14 09:07 web-txn.data > > > So with the regular build, it took <1 minute to do 'make check' and ~1 > minute to test pg_upgrade, with recursive clobber it takes ~3 hours and > ~5 hours. That's a factor of ~300, although it's a very rough > estimate. I seriously doubt that's recursive clobber. That should take *way* much longer. And indeed you have: > -DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY -DMEMORY_CONTEXT_CHECKING > -DRANDOMIZE_ALLOCATED_MEMORY -DCLOBBER_CACHE_RECURSIVELY > > it does not happen with > > CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY > -DMEMORY_CONTEXT_CHECKING -DRANDOMIZE_ALLOCATED_MEMORY', #if defined(CLOBBER_CACHE_ALWAYS){ static bool in_recursion = false; if (!in_recursion) { in_recursion = true; InvalidateSystemCaches(); in_recursion = false; }} #elif defined(CLOBBER_CACHE_RECURSIVELY)InvalidateSystemCaches(); #endif i.e. you can't specifiy -DCLOBBER_CACHE_ALWAYS and -DCLOBBER_CACHE_RECURSIVELY together. The former will take precedence. > Without clobber the whole run (for a "C" locale) takes ~10 minutes, so > my estimate is ~50 hours for the recursive one. But I wouldn't be > surprised by 100 hours. I'm afraid it's more in the year range from what i've seen. I.e. not practical. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 14.5.2014 22:29, Andres Freund wrote: > Hi, > > On 2014-05-14 21:04:41 +0200, Tomas Vondra wrote: >> On 14.5.2014 17:52, Andres Freund wrote: >>> On 2014-05-14 15:17:39 +0200, Andres Freund wrote: >>>> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote: >>>>> Apparently there's something wrong with 'test-decoding-check': >>>> >>>> Man. I shouldn't have asked... My code. There's some output in there >>>> that's probably triggered by the extraordinarily long runtimes, but >>>> there's definitely something else wrong. >>>> My gut feeling says it's in RelationGetIndexList(). >>> >>> Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached. >>> >>> Tomas, thanks for that. I've never (and probably will never) run >>> CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that >>> regularly is really helpful. How long does a single testrun take? It >>> takes hundreds of seconds here to do a single UPDATE? >> >> Don't know yet, as it fails at the beginning. > > test decoding is at the beginning? That's somewhat odd? Judging from the timestamps of log files, it seems to be running after pg_upgrade checks. Or maybe I'm reading that wrong. >> But I suppose it will be >> tens or possibly hundreds of hours. For example these are the logs from >> regular build (no clobber etc.) > >> May 14 19:00 SCM-checkout.log >> May 14 19:00 githead.log >> May 14 19:00 configure.log >> May 14 19:00 config.log >> May 14 19:05 make.log >> May 14 19:05 check.log >> May 14 19:06 make-contrib.log >> May 14 19:06 make-install.log >> May 14 19:06 install-contrib.log >> May 14 19:07 check-pg_upgrade.log >> May 14 19:08 test-decoding-check.log >> >> while these are the logs from recursive clobber: >> >> May 14 00:19 SCM-checkout.log >> May 14 00:20 configure.log >> May 14 00:20 config.log >> May 14 00:26 make.log >> May 14 03:12 check.log >> May 14 03:13 make-contrib.log >> May 14 03:13 make-install.log >> May 14 03:13 install-contrib.log >> May 14 08:25 check-pg_upgrade.log >> May 14 09:07 test-decoding-check.log >> May 14 09:07 web-txn.data >> >> >> So with the regular build, it took <1 minute to do 'make check' and ~1 >> minute to test pg_upgrade, with recursive clobber it takes ~3 hours and >> ~5 hours. That's a factor of ~300, although it's a very rough >> estimate. > > I seriously doubt that's recursive clobber. That should take *way* much > longer. And indeed you have: > >> -DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY -DMEMORY_CONTEXT_CHECKING >> -DRANDOMIZE_ALLOCATED_MEMORY -DCLOBBER_CACHE_RECURSIVELY >> >> it does not happen with >> >> CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY >> -DMEMORY_CONTEXT_CHECKING -DRANDOMIZE_ALLOCATED_MEMORY', > > #if defined(CLOBBER_CACHE_ALWAYS) > { > static bool in_recursion = false; > > if (!in_recursion) > { > in_recursion = true; > InvalidateSystemCaches(); > in_recursion = false; > } > } > #elif defined(CLOBBER_CACHE_RECURSIVELY) > InvalidateSystemCaches(); > #endif > > i.e. you can't specifiy -DCLOBBER_CACHE_ALWAYS and > -DCLOBBER_CACHE_RECURSIVELY together. The former will take precedence. Oh, I see :-/ >> Without clobber the whole run (for a "C" locale) takes ~10 minutes, so >> my estimate is ~50 hours for the recursive one. But I wouldn't be >> surprised by 100 hours. > > I'm afraid it's more in the year range from what i've seen. I.e. not > practical. Yeah, that wouldn't be very practical. I'll try to run it though and if it'd run more than a few days, I'll switch it to CLOBBER_CACHE_ALWAYS. Tomas
Tomas Vondra <tv@fuzzy.cz> writes: > On 14.5.2014 22:29, Andres Freund wrote: >> I'm afraid it's more in the year range from what i've seen. I.e. not >> practical. > Yeah, that wouldn't be very practical. I'll try to run it though and if > it'd run more than a few days, I'll switch it to CLOBBER_CACHE_ALWAYS. I think it might be interesting to try running CLOBBER_CACHE_RECURSIVELY on some subset of the regression tests. But I'm not sure which subset, nor whether the buildfarm scripts could readily be coaxed to do that. Let's see how far you get with a full run before you get bored ... regards, tom lane