Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 4fc51dca-c165-167b-6468-a668552a4a93@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Mark Dilger <hornschnorter@gmail.com>)
List pgsql-hackers
Hi,

On 11/25/2017 09:23 PM, Mark Dilger wrote:
> 
>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Attached is an updated version of the patch, adopting the psql describe
>> changes introduced by 471d55859c11b.
>>
>> regards
>>
>> -- 
>> Tomas Vondra                  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> <0001-multivariate-MCV-lists.patch.gz><0002-multivariate-histograms.patch.gz>
> 
> Hello Tomas,
> 
> After applying both your patches, I get a warning:
> 
> histogram.c:1284:10: warning: taking the absolute value of unsigned type 'uint32' (aka 'unsigned int') has no effect
[-Wabsolute-value]
>         delta = fabs(data->numrows);
>                 ^
> histogram.c:1284:10: note: remove the call to 'fabs' since unsigned values cannot be negative
>         delta = fabs(data->numrows);
>                 ^~~~
> 1 warning generated.
> 

Hmm, yeah. The fabs() call is unnecessary, and probably a remnant from
some previous version where the field was not uint32.

I wonder why you're getting the warning and I don't, though. What
compiler are you using?

> 
> Looking closer at this section, there is some odd integer vs. floating point arithmetic happening
> that is not necessarily wrong, but might be needlessly inefficient:
> 
>     delta = fabs(data->numrows);
>     split_value = values[0].value;
> 
>     for (i = 1; i < data->numrows; i++)
>     {
>         if (values[i].value != values[i - 1].value)
>         {
>             /* are we closer to splitting the bucket in half? */
>             if (fabs(i - data->numrows / 2.0) < delta)
>             {
>                 /* let's assume we'll use this value for the split */
>                 split_value = values[i].value;
>                 delta = fabs(i - data->numrows / 2.0);
>                 nrows = i;
>             }
>         }
>     }
> 
> I'm not sure the compiler will be able to optimize out the recomputation of data->numrows / 2.0
> each time through the loop, since the compiler might not be able to prove to itself that data->numrows
> does not get changed.  Perhaps you should compute it just once prior to entering the outer loop,
> store it in a variable of integer type, round 'delta' off and store in an integer, and do integer comparisons
> within the loop?  Just a thought....
> 

Yeah, that's probably right. But I wonder if the loop is needed at all,
or whether we should start at i=(data->numrows/2.0) instead, and walk to
the closest change of value in both directions. That would probably save
more CPU than computing numrows/2.0 only once.

The other issue in that block of code seems to be that we compare the
values using simple inequality. That probably works for passbyval data
types, but we should use proper comparator (e.g. compare_datums_simple).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Code cleanup patch submission for extended_stats.c
Next
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists