Re: patch for geqo tweaks - Mailing list pgsql-hackers

From Nathan Wagner
Subject Re: patch for geqo tweaks
Date
Msg-id 20151105174155.GA547@granicus.if.org
Whole thread Raw
In response to Re: patch for geqo tweaks  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: patch for geqo tweaks  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Nov 04, 2015 at 12:51:52PM -0500, Tom Lane wrote:
> Nathan Wagner <nw+pg@hydaspes.if.org> writes:
> > I have two patches to make the geqo initialization and mutation
> > slightly better.
> 
> > The first adjusts the mutation swaps to avoid having to re-pick
> > ties.  The second changes the initialization and shuffling algorithm
> > for the gene array to use an in-place Fisher-Yates shuffling
> > algorithm.
> 
> I took a quick look at this.
> 
> I'm not very impressed with the first patch: it might save a few
> geqo_randint() calls, but it seems to do so at the price of making the
> swap choices less random --- for instance it sure looks to me like the
> last array element is now less likely to participate in swaps than
> other elements.  Unless you can prove that actually the swapping is
> still unbiased, I'm inclined to reject this part.

If I have understood the original code correctly, we need to select two
different random integers between 0 and num_gene-1, inclusive.  That
happens to be num_gene possible results.

Having chosen the first one, which I will call "swap1", we now only have
num_gene-1 possible results, which need to range from either 0 to
swap1-1 or from swap1+1 to num_gene-1, which is num_gene-1 possible
results.  I treat this as a single range from 0 to num_gene-2 and
generate a number within that range, which I will call "swap2".

If swap2 is between 0 and swap1-1, it is in the first range, and no
adjustment is necessary.  If it is greater than or equal to swap1, then
it is in the second range.  However the generated swap2 in the second
range will be between swap1 and num_gene-2, whereas we need it to be
between swap1+1 and num_gene-1, so I add one to swap2, adjusting the
range to the needed range.

It would be equivalent to set swap2 to num_gene-1 in that case,
effectively remapping the first value to the last, but an increment was
more intuitive to me.

> As for the second part, I had to look up Fisher-Yates ;-) but after
> having read Wikipedia's entry about it I think this is a good change.
> The code's shorter and more efficient, and it should mathematically
> provide an equally-unbiased initial shuffle.  It could do with a
> better comment, and I'd be inclined to handle the first element
> outside the loop rather than uselessly computing geqo_randint(0,0),

I could do that.  It will make the code slightly harder to read.  I
wonder if it would be worth having geqo_randint() handle the special
case instead.

> Having said that, though, I believe that it's also probably a
> *different* initial shuffle, which may well mean that GEQO gives
> different plans in some cases.

Yes, I would expect it to be different in the general case.  I think my
commit message noted that, but perhaps it could be more explicit.

> That doesn't bother me as long as we only make the change in HEAD, but
> does anyone want to complain?

I don't see any need to backport either of these patches.

-- 
nw



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: CustomScan support on readfuncs.c
Next
From: José Luis Tallón
Date:
Subject: Re: Note about comparation PL/SQL packages and our schema/extensions