Re: why partition pruning doesn't work? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: why partition pruning doesn't work?
Date
Msg-id 17850.1528755844@sss.pgh.pa.us
In response to Re: why partition pruning doesn't work?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> * I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
> perform_pruning_base_step.  Those seem likely to leak memory, and
> for sure they destroy any opportunity for the called comparison
> function to cache info in fn_extra --- something that's critical
> for performance for some comparison functions such as array_eq.
> Why is it necessary to suppose that those functions could change
> from one execution to the next?

After looking closer, that code isn't just inefficient, it's flat
out broken.  The reason is that ExecSetupPartitionPruneState thinks
it can store some pointers into the target relation's relcache entry
in the PartitionPruneContext, and then continue to use those pointers
after closing the relcache entry.  Nope, you can't do that.

The only reason this code has appeared to pass buildfarm testing is
that when we do

                if (cmpfn != context->partsupfunc[keyno].fn_oid)
                    fmgr_info(cmpfn, &partsupfunc[keyno]);
                else ...

if the relcache entry that context->partsupfunc is pointing into
has been freed (and overwritten thanks to CLOBBER_FREED_MEMORY), then the
function OID comparison generally fails so that we do a fresh fmgr_info
call.  In the field it's quite likely that we'd accept and attempt to
use a partially-clobbered FmgrInfo; but it would only happen if a relcache
flush had caused the data to get released, so it could be a while before
anybody saw it happen, let alone reproduced it enough to figure it out.

It's easy to demonstrate that there's a problem if you instrument this
code to log when the OID comparison fails, and then run the regression
tests with -DRELCACHE_FORCE_RELEASE: you get lots of reports like

2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351

showing that context->partsupfunc has been overwritten by
CLOBBER_FREED_MEMORY.
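
(2139062143 is 0x7F7F7F7F, i.e. four of the 0x7F bytes that
CLOBBER_FREED_MEMORY writes over released chunks, so the fn_oid being
compared is coming straight out of freed memory.)  The instrumentation is
nothing fancier than an elog in the branch that refreshes the FmgrInfo,
roughly:

                if (cmpfn != context->partsupfunc[keyno].fn_oid)
                {
                    elog(LOG, "replace partsupfunc %u with %u",
                         context->partsupfunc[keyno].fn_oid, cmpfn);
                    fmgr_info(cmpfn, &partsupfunc[keyno]);
                }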

If we had any buildfarm critters running valgrind on
RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
detected use of uninitialized memory here ... but I don't think we have
any.  (The combination of valgrind and CCA would probably be too slow to
be practical :-(, though maybe somebody with a fast machine could try
the other combination, valgrind plus RELCACHE_FORCE_RELEASE.)

Not sure about a good fix for this.  It seems annoying to copy the
rel's whole partkey data structure into query-local storage, but
I'm not sure we have any choice.  On the bright side, there might
be an opportunity to get rid of repeated runtime fmgr_info lookups
in cross-type comparison situations.
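
If we go that way, the minimal version is probably just to have
ExecSetupPartitionPruneState copy the FmgrInfos (and whatever else it
currently borrows) out of the relcache entry while the rel is still open,
along these lines (sketch only, untested):

    /* give the context its own copies, owned by query-local memory */
    context->partsupfunc = (FmgrInfo *)
        palloc(sizeof(FmgrInfo) * partkey->partnatts);
    for (i = 0; i < partkey->partnatts; i++)
        fmgr_info_copy(&context->partsupfunc[i],
                       &partkey->partsupfunc[i],
                       CurrentMemoryContext);

Doing it that way would also give the comparison functions a stable
FmgrInfo to hang fn_extra state on.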

            regards, tom lane

