On 24.01.2013 22:35, Tom Lane wrote:
> Alexander Korotkov<aekorotkov@gmail.com> writes:
>> There is another cause of overhead when use randomization in gistchoose:
>> extra penalty calls. It could be significant when index fits to cache. In
>> order evade it I especially change behaviour of my patch from "look
>> sequentially and choose random" to "look in random order". I think we need
>> to include comparison of CPU time.
>
> Hmm ... actually, isn't that an argument in favor of Heikki's method?
> The way he's doing it, we can exit without making additional penalty
> calls once we've found a zero-penalty tuple and decided not to look
> further (which is something with a pretty high probability).
No, I think Alexander is right, although I believe the difference is
minimal in practice.
If we assume that the there are no zero-penalty tuples on the page, with
both approaches you have to call penalty on every tuple to know which is
best. If there are zero-penalty tuples, then there is a small
difference. With Alexander's method, you can stop looking as soon as you
find a zero-penalty tuple (the order you check the tuples is random).
With my method, you can stop looking as soon as you find a zero-penalty
tuple, *and* you decide to not look further. With the 1/2 probability to
stop looking further, you give up on average after 2 tuples.
So if I'm doing my math right, my patch does on average between 1x - 2x
as many penalty calls as Alexander's, depending on how many zero-penalty
tuples there are. The 2x case is when the page is full of zero-penalty
tuples, in which case the difference isn't big in absolute terms; 2
penalty calls per page versus 1.
BTW, one thing that I wondered about this: How expensive is random()?
I'm assuming not very, but I don't really know. Alexander's patch called
random() for every tuple on the page, while I call it only once for each
equal-penalty tuple. If it's really cheap, I think my/Tom's patch could
be slightly simplified by always initializing keep_current_best with
random() at top of the function, and eliminating the -1 "haven't decided
yet" state. OTOH, if random() is expensive, I note that we only need one
bit of randomness, so we could avoid calling random() so often if we
utilized all the bits from each random() call.
- Heikki