Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Msg-id: 20200425124024.hsv7z6bia752uymz@development
In response to: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
List: pgsql-hackers

On Sat, Apr 25, 2020 at 12:21:06AM +0200, Tomas Vondra wrote:
>On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote:
>>On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>>>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>>><tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>>>>>Over in "execExprInterp() questions / How to improve scalar array op
>>>>>expr eval?" [1] I'd mused about how we might be able to optimize
>>>>>scalar array ops with OR'd semantics.
>>>>>
>>>>>This patch implements a binary search for such expressions when the
>>>>>array argument is a constant, so that we can avoid needing to teach
>>>>>expression execution to cache stable values or know when a param has
>>>>>changed.
>>>>>
>>>>>The speed-up for the target case can be pretty impressive: my
>>>>>admittedly contrived and relatively unscientific test with a query
>>>>>of the form
>>>>>
>>>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>>>>>random integers in the series>)
>>>>>
>>>>>shows ~30ms for the patch versus ~640ms on master.
>>>>>
>>>>
>>>>Nice improvement, although 1000 items is probably a bit unusual. The
>>>>threshold used in the patch (9 elements) seems a bit too low - what
>>>>results have you seen with smaller arrays?
>>>
>>>At least in our systems we regularly work with batches of 1000 items,
>>>which means you get IN clauses of identifiers of that size. Admittedly
>>>the most common case sees those IN clauses as simple index scans
>>>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>>>broader query that merely filters additionally on something like
>>>"... AND <some foreign key> IN (...)", where it makes sense for the
>>>rest of the quals to take precedence in generating a reasonable plan.
>>>In that case the IN becomes a regular filter, hence the idea behind
>>>the patch.
>>>
>>>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>>>way... but as noted in the other thread, that's an extremely large
>>>amount of work, I think. But similarly you could use a hash here
>>>instead of a binary search... but this seems quite good.
>>>
>>>As to the choice of 9 elements: I just picked that as a starting
>>>point; Andres had previously commented offhand that at 8 elements
>>>serial scanning was faster, so I figured this was a reasonable
>>>starting point for discussion.
>>>
>>>Perhaps it would make sense to determine that minimum not as a pure
>>>constant but scaled based on how many rows the planner expects us to
>>>see? Of course that'd be a more invasive patch... so it may or may
>>>not be as feasible as a reasonable default.
>>>
>>
>>Not sure. That seems a bit overcomplicated, and I don't think it
>>depends very much on the number of rows the planner expects to see. I
>>think we usually assume the linear search is cheaper for small arrays
>>and at some point the binary search starts winning. The question is
>>where this "break-even" point is.
>>
>>I think we usually use something like 64 or so in other places, but
>>maybe I'm wrong. The current limit of 9 seems a bit too low, but I may
>>be wrong. Let's not obsess about this too much; let's do some
>>experiments and pick a value based on that.
>>
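For reference, the mechanism under discussion amounts to roughly the
following standalone sketch. This is plain C with ints and libc
bsearch(), not the actual patch (which works on Datum arrays with the
type's comparison proc inside the executor); the function names and the
BINSEARCH_THRESHOLD value are just placeholders for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    /* Array sizes below this are assumed cheaper to scan linearly. */
    #define BINSEARCH_THRESHOLD 9

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    /*
     * One-time setup: sort the array. Because the patch requires the
     * array to be a Const, this can happen once per query rather than
     * once per row.
     */
    static void
    saop_setup(int *arr, size_t nelems)
    {
        if (nelems >= BINSEARCH_THRESHOLD)
            qsort(arr, nelems, sizeof(int), cmp_int);
    }

    /* Per-row test: binary search for large arrays, else linear scan. */
    static int
    saop_match(int key, const int *arr, size_t nelems)
    {
        if (nelems >= BINSEARCH_THRESHOLD)
            return bsearch(&key, arr, nelems, sizeof(int), cmp_int) != NULL;

        for (size_t i = 0; i < nelems; i++)
        {
            if (arr[i] == key)
                return 1;
        }
        return 0;
    }

    int
    main(void)
    {
        int arr[] = {42, 7, 13, 99, 1, 56, 23, 88, 4, 61};
        size_t n = sizeof(arr) / sizeof(arr[0]);

        saop_setup(arr, n);     /* sorts, since n = 10 >= threshold */
        printf("13 -> %d\n", saop_match(13, arr, n));   /* prints 1 */
        printf("14 -> %d\n", saop_match(14, arr, n));   /* prints 0 */
        return 0;
    }

The break-even threshold being debated above is the BINSEARCH_THRESHOLD
constant here: below it, the per-comparison overhead of the sort plus
the branchier binary search loses to a simple linear scan.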
>>
>>>>Another idea - would a bloom filter be useful here, as a second
>>>>optimization? That is, for large arrays build a small bloom filter,
>>>>allowing us to skip even the binary search.
>>>
>>>That's an interesting idea. I actually haven't personally worked with
>>>bloom filters, so didn't think about that.
>>>
>>>Are you thinking that you'd also build the filter *and* presort the
>>>array? Or try to get away with using only the bloom filter and not
>>>expanding and sorting the array at all?
>>>
>>
>>Yeah, something like that. My intuition is the bloom filter is useful
>>only above some number of items, and the number is higher than for the
>>binary search. So we'd end up with two thresholds, the first one
>>enabling binary search, the second one enabling the bloom filter.
>>
>>Of course, the "unknown" variable here is how often we actually find
>>the value in the array. If 100% of the queries have a match, then the
>>bloom filter is a waste of time. If there are no matches, it can make
>>a significant difference.
>>
>
>I did experiment with this a bit, both to get a bit more familiar with
>this code and to see if the bloom filter might help. The short answer
>is the bloom filter does not seem to help at all, so I wouldn't bother
>about it too much.
>
>Attached is an updated patch series, a script I used to collect some
>performance measurements, and a spreadsheet with results. The patch
>series is broken into four parts:
>
>  0001 - the original patch with binary search
>  0002 - adds GUCs to enable binary search / tweak the threshold
>  0003 - allows using a bloom filter + binary search
>  0004 - tries using murmurhash
>
>The test script runs a wide range of queries with different numbers of
>lookups, keys in the array, and match probabilities (i.e. the fraction
>of lookups that find a match) ranging from 1% to 100%. And of course,
>it runs this with the binsearch/bloom either enabled or disabled (the
>threshold was lowered to 1, so it's the on/off GUCs that determine
>whether the binsearch/bloom is used).
>
>The results are summarized in the spreadsheet, demonstrating how
>useless the bloom filter is. There's not a single case where it would
>beat the binary search. I believe this is because the access to the
>bloom filter is random (determined by the hash function), and we don't
>save much compared to the log(K) lookups in the sorted array.
>
>That makes sense, I think - bloom filters are meant to be used in
>cases when the main data doesn't fit into memory, which is not the
>case here. But I wonder how this would change for cases with more
>expensive comparisons - this was using just integers, so maybe strings
>would result in different behavior.

OK, I tried the same test with text columns (with md5 strings), and the
results are about as I predicted - the bloom filter actually makes a
difference in this case. Depending on the number of lookups and the
selectivity (i.e. how many lookups have a match in the array), it can
mean an additional speedup of up to ~5x compared to binary search
alone. For the case with 100% selectivity (i.e. all rows have a match)
it can't really save any time - it's usually still much faster than
master, but a bit slower than binary search alone.

So I think this might be worth investigating further, once the simple
binary search gets committed. We'll probably need to factor in the cost
of the comparison (higher cost -> BF more useful), the selectivity of
the filter (fewer matches -> BF more useful), and the number of
lookups.

This reminds me of our attempts to add bloom filters to hash joins,
which I think ran into mostly the same challenge of deciding when the
bloom filter can be useful and is worth the extra work.
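To make the two-step lookup concrete, a rough standalone sketch of the
filter-then-binsearch idea follows. Again plain C and purely
illustrative: the filter size and the two ad-hoc multiplicative hashes
are made up for the example (the 0004 patch mentioned above tries
murmurhash instead), and none of the names correspond to the actual
patches:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define FILTER_NBITS 1024   /* illustrative; would be sized to fit */

    typedef struct BloomFilter
    {
        uint8_t bits[FILTER_NBITS / 8];
    } BloomFilter;

    /* Two cheap multiplicative hashes, chosen arbitrarily here. */
    static uint32_t hash1(uint32_t x) { return (x * 2654435761u) % FILTER_NBITS; }
    static uint32_t hash2(uint32_t x) { return (x * 2246822519u + 1u) % FILTER_NBITS; }

    /* Build step: set both bits for each array element. */
    static void
    bloom_add(BloomFilter *bf, uint32_t key)
    {
        uint32_t h1 = hash1(key);
        uint32_t h2 = hash2(key);

        bf->bits[h1 / 8] |= (uint8_t) (1u << (h1 % 8));
        bf->bits[h2 / 8] |= (uint8_t) (1u << (h2 % 8));
    }

    /* false = definitely absent; true = maybe present, so fall through. */
    static bool
    bloom_maybe(const BloomFilter *bf, uint32_t key)
    {
        uint32_t h1 = hash1(key);
        uint32_t h2 = hash2(key);

        return (bf->bits[h1 / 8] & (1u << (h1 % 8))) &&
               (bf->bits[h2 / 8] & (1u << (h2 % 8)));
    }

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    /* Per-row lookup: hash probe first, binary search only on "maybe". */
    static bool
    saop_match(const BloomFilter *bf, int key, const int *sorted, size_t n)
    {
        if (!bloom_maybe(bf, (uint32_t) key))
            return false;       /* the common path when few rows match */
        return bsearch(&key, sorted, n, sizeof(int), cmp_int) != NULL;
    }

    int
    main(void)
    {
        int arr[] = {1, 4, 7, 13, 23, 42, 56, 61, 88, 99};  /* presorted */
        size_t n = sizeof(arr) / sizeof(arr[0]);
        BloomFilter bf;

        memset(&bf, 0, sizeof(bf));
        for (size_t i = 0; i < n; i++)
            bloom_add(&bf, (uint32_t) arr[i]);

        printf("13 -> %d\n", saop_match(&bf, 13, arr, n));  /* 1 */
        printf("14 -> %d\n", saop_match(&bf, 14, arr, n));  /* 0 */
        return 0;
    }

The sketch also makes the asymmetry in the measurements visible: a miss
can be rejected after two cheap hash probes, while a hit pays for the
hashing *and* the full binary search. That is consistent with the
results above, where the filter helps most with expensive comparisons
and low match rates, and can only lose at 100% selectivity.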
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services