Thread: Export user visible function to make use of convert_to_scalar

Export user visible function to make use of convert_to_scalar

From
Gregory Stark
Date:
Attached is a patch which implements, as discussed briefly on -hackers, a
user-visible function to get at the information that convert_to_scalar uses to
generate selectivity estimates.

The main use case for this is for tools such as pgadmin which want to make
sense of the histograms, they need to know where and how tall to draw each
region of the histogram.

I didn't add any documentation because I couldn't find a reasonable place to
put such a low level function. I don't think we would want to encourage
applications to use it unless they're trying to work with statistics tables
anyways so documenting it on the comparison operators page -- the only page
where it seems it would make sense -- would be a bad idea.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Attachment

Re: Export user visible function to make use of convert_to_scalar

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Attached is a patch which implements, as discussed briefly on -hackers, a
> user-visible function to get at the information that convert_to_scalar uses to
> generate selectivity estimates.

This is an astonishingly bad idea, as it exposes and thereby sets in
stone one of the worst, most in-need-of-rethinking areas in selfuncs.c.

> I didn't add any documentation because I couldn't find a reasonable place to
> put such a low level function.

No kidding.

> I don't think we would want to encourage
> applications to use it

The way to not encourage it is to not provide it.

            regards, tom lane

Re: Export user visible function to make use of convert_to_scalar

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Attached is a patch which implements, as discussed briefly on -hackers, a
>> user-visible function to get at the information that convert_to_scalar uses to
>> generate selectivity estimates.
>
> This is an astonishingly bad idea, as it exposes and thereby sets in
> stone one of the worst, most in-need-of-rethinking areas in selfuncs.c.

No, it sets in stone only the concept that at some point somehow we'll have to
come up with an estimate of where a value lies in a range. That's inevitable
if we hope to be able to make any kind of useful estimates at all. How the
internals go about doing it isn't set in stone at all.

What it seems ought to happen here eventually is that each scalar type should
provide a function to implement convert_to_scalar for itself. That would let
new user-defined functions implement the function as appropriate.

> The way to not encourage it is to not provide it.

Well _some_ applications do need to use it. Applications which provide a
graphic view of the pg_statistics information. How would you suggest drawing a
chart of histogram values without access to this information?

The other method I thought of was to run EXPLAIN repeatedly with different
bounds and pick out the estimates from the output. This requires hard-wiring
into the application understanding of every data type and how to generate a
range of values between a range (or implementing width_bucket for every data
type) and executing a whole bucketload of EXPLAINs. And then the resulting
graphs would be a worse representation of the statistics since they wouldn't
correspond exactly to the buckets in the histograms.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Export user visible function to make use of convert_to_scalar

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Gregory Stark <stark@enterprisedb.com> writes:
>>> Attached is a patch which implements, as discussed briefly on -hackers, a
>.> user-visible function to get at the information that convert_to_scalar uses to
>>> generate selectivity estimates.
>>
>> This is an astonishingly bad idea, as it exposes and thereby sets in
>> stone one of the worst, most in-need-of-rethinking areas in selfuncs.c.

> No, it sets in stone only the concept that at some point somehow we'll have to
> come up with an estimate of where a value lies in a range.

The problem is how much of the actual behavior of that function gets
exposed, and in particular what context it needs to make the estimate.
Further down the road, for example, we might need a locale as an input.

When complaining I hadn't read the pghackers thread in which you
suggested this, and now that I'm caught up on email I remain
unconvinced.  What do you need convert_to_scalar for in order to display
the pg_statistic histogram?  You've already got the underlying data.

            regards, tom lane

Re: Export user visible function to make use of convert_to_scalar

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> When complaining I hadn't read the pghackers thread in which you
> suggested this, and now that I'm caught up on email I remain
> unconvinced.  What do you need convert_to_scalar for in order to display
> the pg_statistic histogram?  You've already got the underlying data.

Then the charting tool would have to duplicate all the same code that
convert_to_scalar has to figure out what to do with the data.

What I can generate already is a bar chart for integer data types. But I think
it would be most useful for string data types. In one of the cases I've helped
someone with before they had comment spam constituting a sizable fraction of
their database throwing off statistics and forcing a sequential scan.

It might be simplest to expose an extern definition of the C function with a
comment saying it's internal and changes are planned.

Incidentally, I wonder if it makes more sense to make this an optional btree
operator class support procedure rather than a type function.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Export user visible function to make use of convert_to_scalar

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> When complaining I hadn't read the pghackers thread in which you
>> suggested this, and now that I'm caught up on email I remain
>> unconvinced.  What do you need convert_to_scalar for in order to display
>> the pg_statistic histogram?  You've already got the underlying data.

> Then the charting tool would have to duplicate all the same code that
> convert_to_scalar has to figure out what to do with the data.

How so?  The entries in the histogram are equidistant by definition.

            regards, tom lane

Re: Export user visible function to make use of convert_to_scalar

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> When complaining I hadn't read the pghackers thread in which you
>>> suggested this, and now that I'm caught up on email I remain
>>> unconvinced.  What do you need convert_to_scalar for in order to display
>>> the pg_statistic histogram?  You've already got the underlying data.
>
>> Then the charting tool would have to duplicate all the same code that
>> convert_to_scalar has to figure out what to do with the data.
>
> How so?  The entries in the histogram are equidistant by definition.

Huh? They have equal number of values between them, they're not equidistant in
the scalar space. So the area of each bar should be the same but the width
would be proportional to the width of the bucket and the height inversely
proportional.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Export user visible function to make use of convert_to_scalar

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> How so?  The entries in the histogram are equidistant by definition.

> Huh? They have equal number of values between them, they're not equidistant in
> the scalar space. So the area of each bar should be the same but the width
> would be proportional to the width of the bucket and the height inversely
> proportional.

You're presuming there exists a linear scalar space to reference the
results to.  I'm unconvinced that that's a good assumption to make.
Consider for instance a column of URLs; practically all the entries will
start with http or ftp or one of a small number of other possibilities.
If you try to represent this with some hypothetical linear x-axis you
are going to produce an unusable display.

convert_to_scalar partially avoids this problem because of the way it's
used within the backend: we don't actually assume that there is any
global linear space, only that the values within any single histogram
bucket are fairly uniformly distributed.  (This assumption tends to
become true in the limit as the buckets get smaller, even if the global
distribution is pretty crazy.  And even if it doesn't become true, the
absolute error is bounded by the bucket size.)  You can't use the thing
to measure the "positions" of histogram boundary values, because those
are part of its frame of reference not something it outputs.

            regards, tom lane

Re: Export user visible function to make use of convert_to_scalar

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> How so?  The entries in the histogram are equidistant by definition.
>
>> Huh? They have equal number of values between them, they're not equidistant in
>> the scalar space. So the area of each bar should be the same but the width
>> would be proportional to the width of the bucket and the height inversely
>> proportional.
>
> You're presuming there exists a linear scalar space to reference the
> results to.  I'm unconvinced that that's a good assumption to make.
> Consider for instance a column of URLs; practically all the entries will
> start with http or ftp or one of a small number of other possibilities.
> If you try to represent this with some hypothetical linear x-axis you
> are going to produce an unusable display.

I was planning to use the first and last histogram values for the frame of
reference. It could still produce some weird graphs but those cases are
precisely the cases where users might want to look at it and go, "hm, that
looks weird". In most sane cases I expect it'll make a nice pretty graph. It
remains to be seen though. All I have so far are synthetic integer domain
graphs which look just like what my synthetic distribution should look like.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Export user visible function to make use of convert_to_scalar

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> You're presuming there exists a linear scalar space to reference the
>> results to.

> I was planning to use the first and last histogram values for the frame of
> reference. It could still produce some weird graphs but those cases are
> precisely the cases where users might want to look at it and go, "hm, that
> looks weird".

It will produce unusable graphs in precisely the cases where users might
want to know what's going on...

            regards, tom lane

Re: Export user visible function to make use of convert_to_scalar

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>> I was planning to use the first and last histogram values for the frame of
>> reference. It could still produce some weird graphs but those cases are
>> precisely the cases where users might want to look at it and go, "hm, that
>> looks weird".
>
> It will produce unusable graphs in precisely the cases where users might
> want to know what's going on...

Well that's possible but I think it's worth experimenting. I'm pretty
confident there would be ways to clean the graphs up so they're sensible and
even if not they would still be a lot more helpful than nothing user-visible.

Would it be reasonable to just make the function extern with a comment that
it's an internal function subject to change in the future. That would at least
let us experiment with an add-on function for 8.3 so we know what we really
need for 8.4. Tools like this would be very version-dependent anyways.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com