Thread: Re: Bloom index cost model seems to be wrong
I've moved this to the hackers list, and added Teodor and Alexander of the bloom extension, as I would like to hear their opinions on the costing.
On Tue, Feb 12, 2019 at 4:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's possible that a good cost model for bloom is so far outside
> genericcostestimate's ideas that trying to use it is not a good
> idea anyway.
I'm attaching a crude patch built on top of your refactored header files.
I just copied genericcostestimate into bloom, and made a few changes.
I think one change should be conceptually uncontroversial, which is to change the IO cost from random_page_cost to seq_page_cost. Bloom indexes are always scanned in their entirety.
The other change is not to charge any cpu_operator_cost per tuple. Bloom doesn't go through the ADT machinery, it just does very fast bit-twiddling. I could assign a fraction of a cpu_operator_cost, multiplied by bloomLength rather than list_length(indexQuals), to this bit-twiddling. But I think that that fraction would need to be quite close to zero, so I just removed it.
When everything is in memory, Bloom still gets way overcharged for CPU usage even without the cpu_operator_cost. This patch doesn't do anything about that. I don't know if the default value of cpu_index_tuple_cost is way too high, or if Bloom just shouldn't be charged the full value of it in the first place, given the way it accesses index tuples. For comparison, when using a btree as an equality filter on a non-leading column, most of the time goes to index_getattr. Should the time spent there be charged to cpu_operator_cost or to cpu_index_tuple_cost? It is not strictly spent in the operator, but fetching the parameter to be used in an operator is more nearly a per-operator problem than a per-tuple problem.
Most of genericcostestimate still applies, for example the ScalarArrayOpExpr handling and the Mackert-Lohman formula. It is a shame that all of that has to be copied.
There are some other parts of genericcostestimate that probably don't apply (OrderBy, for example) but I left them untouched for now to make it easier to reconcile changes to the real genericcostestimate with the copy.
For ScalarArrayOpExpr, it would be nice to scan the index once and add to the bitmap all branches of the =ANY in one index scan, but I don't see the machinery to do that. It would be a matter for another patch anyway, other than the way it would change the cost estimate.
Cheers,
Jeff
On Sun, Feb 24, 2019 at 11:09 AM Jeff Janes <jeff.janes@gmail.com> wrote:
> I've moved this to the hackers list, and added Teodor and Alexander of the bloom extension, as I would like to hear their opinions on the costing.
My previous patch had accidentally included a couple of lines from a different thing I was working on (a memory-leak fix, now committed), so this patch removes that diff.
I'm adding it to the commitfest targeting v13. I'm more interested in feedback on the conceptual issues rather than stylistic ones, as I would probably merge the two functions together before proposing something to actually be committed.
Should we be trying to estimate the false-positive rate and charging cpu_tuple_cost, cpu_operator_cost, and the IO costs for visiting the table to recheck and reject those tuples? I don't think other index types do that, and I'm inclined to think the burden should be on the user not to create silly indexes that produce an overwhelming number of false positives.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> Should we be trying to estimate the false-positive rate and charging
> cpu_tuple_cost, cpu_operator_cost, and the IO costs for visiting the
> table to recheck and reject those tuples? I don't think other index
> types do that, and I'm inclined to think the burden should be on the
> user not to create silly indexes that produce an overwhelming number
> of false positives.

Heap-access costs are added on in costsize.c, not in the index cost estimator. I don't remember at the moment whether there's any explicit accounting for lossy indexes (i.e. false positives). Up to now, there haven't been cases where we could estimate the false-positive rate with any accuracy, so we may just be ignoring the effect. But if we decide to account for it, I'd rather have costsize.c continue to add on the actual cost, perhaps based on a false-positive-rate fraction returned by the index estimator.

regards, tom lane
On Fri, Mar 1, 2019 at 7:11 AM Jeff Janes <jeff.janes@gmail.com> wrote:
> I'm adding it to the commitfest targeting v13. I'm more interested in
> feedback on the conceptual issues rather than stylistic ones, as I would
> probably merge the two functions together before proposing something to
> actually be committed.

From the trivialities department, this patch shows up as a CI failure with -Werror, because there is no declaration for genericcostestimate2(). I realise that's just a temporary name in a WIP patch anyway, so this isn't useful feedback; but for the benefit of anyone going through CI failures in bulk looking for things to complain about: this isn't a real one.

--
Thomas Munro
https://enterprisedb.com
It's not clear to me what the next action should be on this patch. I think Jeff got some feedback from Tom, but was that enough to expect a new version to be posted? That was in February; should we now (in late September) close this as Returned with Feedback?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 25, 2019 at 05:12:26PM -0300, Alvaro Herrera wrote:
> It's not clear to me what the next action should be on this patch. I
> think Jeff got some feedback from Tom, but was that enough to expect a
> new version to be posted? That was in February; should we now (in late
> September) close this as Returned with Feedback?

That sounds rather right to me.

--
Michael