Re: why partition pruning doesn't work? - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: why partition pruning doesn't work? |
| Date | |
| Msg-id | 17850.1528755844@sss.pgh.pa.us |
| In response to | Re: why partition pruning doesn't work? (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: why partition pruning doesn't work? (×4) |
| List | pgsql-hackers |
I wrote:
> * I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
> perform_pruning_base_step.  Those seem likely to leak memory, and
> for sure they destroy any opportunity for the called comparison
> function to cache info in fn_extra --- something that's critical
> for performance for some comparison functions such as array_eq.
> Why is it necessary to suppose that those functions could change
> from one execution to the next?

After looking closer, that code isn't just inefficient, it's flat out
broken.  The reason is that ExecSetupPartitionPruneState thinks it can
store some pointers into the target relation's relcache entry in the
PartitionPruneContext, and then continue to use those pointers after
closing the relcache entry.  Nope, you can't do that.

The only reason this code has appeared to pass buildfarm testing is that
when we do

    if (cmpfn != context->partsupfunc[keyno].fn_oid)
        fmgr_info(cmpfn, &partsupfunc[keyno]);
    else ...

if the relcache entry that context->partsupfunc is pointing into has been
freed (and overwritten thanks to CLOBBER_FREED_MEMORY), then the function
OID comparison generally fails so that we do a fresh fmgr_info call.  In
the field it's quite likely that we'd accept and attempt to use a
partially-clobbered FmgrInfo; but it would only happen if a relcache flush
had caused the data to get released, so it could be awhile before anybody
saw it happen, let alone reproduced it enough to figure it out.

It's easy to demonstrate that there's a problem if you instrument this
code to log when the OID comparison fails, and then run the regression
tests with -DRELCACHE_FORCE_RELEASE: you get lots of reports like

2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG: replace partsupfunc 2139062143 with 351

showing that context->partsupfunc has been overwritten by
CLOBBER_FREED_MEMORY.

If we had any buildfarm critters running valgrind on
RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
detected use of uninitialized memory here ... but I don't think we have
any.  (The combination of valgrind and CCA would probably be too slow to
be practical :-(, though maybe somebody with a fast machine could do the
other thing.)

Not sure about a good fix for this.  It seems annoying to copy the rel's
whole partkey data structure into query-local storage, but I'm not sure
we have any choice.

On the bright side, there might be an opportunity to get rid of repeated
runtime fmgr_info lookups in cross-type comparison situations.

			regards, tom lane
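To make the failure mode concrete, here is a minimal, self-contained C sketch (plain C, not PostgreSQL source: CacheEntry, SupportFn, PruneContext, setup_safe, and choose_cmp_fn are hypothetical stand-ins for the relcache entry, FmgrInfo, and PartitionPruneContext). It contrasts keeping a pointer into cache-owned memory with copying the partkey support-function data into query-local storage, as suggested above:

```c
/*
 * Illustrative sketch only; all names are hypothetical stand-ins for the
 * PostgreSQL structures discussed in the message above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct SupportFn
{
    unsigned int fn_oid;    /* which comparison function this slot holds */
    void        *fn_extra;  /* per-call cache, like FmgrInfo.fn_extra */
} SupportFn;

typedef struct CacheEntry
{
    SupportFn  *supfuncs;   /* owned by the cache; freed on invalidation */
    int         nkeys;
} CacheEntry;

typedef struct PruneContext
{
    SupportFn  *supfuncs;   /* private copy owned by this context */
    int         nkeys;
} PruneContext;

/* Copy what we need out of the cache entry before it can go away. */
static void
setup_safe(PruneContext *ctx, const CacheEntry *entry)
{
    size_t      nbytes = sizeof(SupportFn) * entry->nkeys;

    ctx->nkeys = entry->nkeys;
    ctx->supfuncs = malloc(nbytes);
    memcpy(ctx->supfuncs, entry->supfuncs, nbytes);
}

/*
 * Pick the comparison function for one partition key.  With a private copy,
 * the fn_oid test reads valid memory, so a fresh lookup happens only for a
 * genuine cross-type comparison, and fn_extra survives across executions.
 */
static void
choose_cmp_fn(PruneContext *ctx, int keyno, unsigned int cmpfn)
{
    if (cmpfn != ctx->supfuncs[keyno].fn_oid)
    {
        ctx->supfuncs[keyno].fn_oid = cmpfn;    /* stand-in for fmgr_info() */
        ctx->supfuncs[keyno].fn_extra = NULL;
    }
}

int
main(void)
{
    SupportFn    cached[1] = {{351, NULL}};
    CacheEntry   entry = {cached, 1};
    PruneContext ctx;

    setup_safe(&ctx, &entry);

    /* Simulate a cache flush: the cache's memory is overwritten with 0x7F
     * bytes, the same pattern CLOBBER_FREED_MEMORY uses. */
    memset(cached, 0x7f, sizeof(cached));

    choose_cmp_fn(&ctx, 0, 351);            /* still sees fn_oid == 351 */
    printf("fn_oid after flush: %u\n", ctx.supfuncs[0].fn_oid);

    free(ctx.supfuncs);
    return 0;
}
```

Had the context kept entry->supfuncs directly instead of copying it, the fn_oid test would read the clobbered value 0x7f7f7f7f (2139062143, the number in the log lines above) and redo the lookup on every execution, or, in a production build, risk using a partially-clobbered FmgrInfo.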