Re: gistchoose vs. bloat - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: gistchoose vs. bloat
Date
Msg-id 51019F41.1060204@vmware.com
Whole thread Raw
In response to Re: gistchoose vs. bloat  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: gistchoose vs. bloat
List pgsql-hackers
On 24.01.2013 22:35, Tom Lane wrote:
> Alexander Korotkov<aekorotkov@gmail.com>  writes:
>> There is another cause of overhead when use randomization in gistchoose:
>> extra penalty calls. It could be significant when index fits to cache. In
>> order evade it I especially change behaviour of my patch from "look
>> sequentially and choose random" to "look in random order". I think we need
>> to include comparison of CPU time.
>
> Hmm ... actually, isn't that an argument in favor of Heikki's method?
> The way he's doing it, we can exit without making additional penalty
> calls once we've found a zero-penalty tuple and decided not to look
> further (which is something with a pretty high probability).

No, I think Alexander is right, although I believe the difference is 
minimal in practice.

If we assume that the there are no zero-penalty tuples on the page, with 
both approaches you have to call penalty on every tuple to know which is 
best. If there are zero-penalty tuples, then there is a small 
difference. With Alexander's method, you can stop looking as soon as you 
find a zero-penalty tuple (the order you check the tuples is random). 
With my method, you can stop looking as soon as you find a zero-penalty 
tuple, *and* you decide to not look further. With the 1/2 probability to 
stop looking further, you give up on average after 2 tuples.

So if I'm doing my math right, my patch does on average between 1x - 2x 
as many penalty calls as Alexander's, depending on how many zero-penalty 
tuples there are. The 2x case is when the page is full of zero-penalty 
tuples, in which case the difference isn't big in absolute terms; 2 
penalty calls per page versus 1.

BTW, one thing that I wondered about this: How expensive is random()? 
I'm assuming not very, but I don't really know. Alexander's patch called 
random() for every tuple on the page, while I call it only once for each 
equal-penalty tuple. If it's really cheap, I think my/Tom's patch could 
be slightly simplified by always initializing keep_current_best with 
random() at top of the function, and eliminating the -1 "haven't decided 
yet" state. OTOH, if random() is expensive, I note that we only need one 
bit of randomness, so we could avoid calling random() so often if we 
utilized all the bits from each random() call.

- Heikki



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: Materialized views WIP patch
Next
From: Bruce Momjian
Date:
Subject: Re: text search: restricting the number of parsed words in headline generation