Thread: Export user visible function to make use of convert_to_scalar
Attached is a patch which implements, as discussed briefly on -hackers, a
user-visible function to get at the information that convert_to_scalar uses
to generate selectivity estimates.

The main use case for this is tools such as pgAdmin which want to make sense
of the histograms; they need to know where and how tall to draw each region
of the histogram.

I didn't add any documentation because I couldn't find a reasonable place to
put such a low-level function. I don't think we would want to encourage
applications to use it unless they're trying to work with the statistics
tables anyway, so documenting it on the comparison operators page -- the only
page where it seems it would make sense -- would be a bad idea.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> Attached is a patch which implements, as discussed briefly on -hackers, a
> user-visible function to get at the information that convert_to_scalar uses to
> generate selectivity estimates.

This is an astonishingly bad idea, as it exposes and thereby sets in
stone one of the worst, most in-need-of-rethinking areas in selfuncs.c.

> I didn't add any documentation because I couldn't find a reasonable place to
> put such a low level function.

No kidding.

> I don't think we would want to encourage applications to use it

The way to not encourage it is to not provide it.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> Attached is a patch which implements, as discussed briefly on -hackers, a
>> user-visible function to get at the information that convert_to_scalar uses to
>> generate selectivity estimates.
>
> This is an astonishingly bad idea, as it exposes and thereby sets in
> stone one of the worst, most in-need-of-rethinking areas in selfuncs.c.

No, it sets in stone only the concept that at some point, somehow, we'll have
to come up with an estimate of where a value lies in a range. That's
inevitable if we hope to be able to make any kind of useful estimates at all.
How the internals go about doing it isn't set in stone at all.

What seems like it ought to happen here eventually is that each scalar type
should provide a function implementing convert_to_scalar for itself. That
would let new user-defined types implement the function as appropriate.

> The way to not encourage it is to not provide it.

Well, _some_ applications do need it: applications which provide a graphical
view of the pg_statistic information. How would you suggest drawing a chart
of histogram values without access to this information?

The other method I thought of was to run EXPLAIN repeatedly with different
bounds and pick the estimates out of the output. That requires hard-wiring
into the application an understanding of every data type and how to generate
a range of values between two bounds (or implementing width_bucket for every
data type), and executing a whole bucketload of EXPLAINs. And then the
resulting graphs would be a worse representation of the statistics, since
they wouldn't correspond exactly to the buckets in the histograms.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
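The mapping being argued over can be sketched numerically. This Python sketch is hypothetical (the real convert_to_scalar is type-aware C inside the backend, and `to_scalar` here is an invented numeric-only stand-in): given histogram bounds as a tool might read them from pg_stats, a convert_to_scalar-style function lets it place every bucket boundary on a linear x-axis.

```python
def to_scalar(value, low, high):
    """Map a numeric value to a position in [0, 1] relative to the range
    low..high -- a numeric-only stand-in for convert_to_scalar."""
    if high == low:
        return 0.5
    return (value - low) / (high - low)

# histogram_bounds as a charting tool might read them from pg_stats
bounds = [1, 3, 4, 7, 15, 100]
low, high = bounds[0], bounds[-1]

# each adjacent pair of positions delimits one equi-depth bucket on the x-axis
positions = [to_scalar(b, low, high) for b in bounds]
```

The point of contention is precisely whether such a mapping can be exposed without freezing its current behavior.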
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Gregory Stark <stark@enterprisedb.com> writes:
>>> Attached is a patch which implements, as discussed briefly on -hackers, a
>>> user-visible function to get at the information that convert_to_scalar uses to
>>> generate selectivity estimates.
>>
>> This is an astonishingly bad idea, as it exposes and thereby sets in
>> stone one of the worst, most in-need-of-rethinking areas in selfuncs.c.

> No, it sets in stone only the concept that at some point somehow we'll have to
> come up with an estimate of where a value lies in a range.

The problem is how much of the actual behavior of that function gets
exposed, and in particular what context it needs to make the estimate.
Further down the road, for example, we might need a locale as an input.

When complaining I hadn't read the pghackers thread in which you
suggested this, and now that I'm caught up on email I remain
unconvinced. What do you need convert_to_scalar for in order to display
the pg_statistic histogram? You've already got the underlying data.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> When complaining I hadn't read the pghackers thread in which you
> suggested this, and now that I'm caught up on email I remain
> unconvinced. What do you need convert_to_scalar for in order to display
> the pg_statistic histogram? You've already got the underlying data.

Then the charting tool would have to duplicate all the same code that
convert_to_scalar has in order to figure out what to do with the data.

What I can generate already is a bar chart for integer data types. But I
think it would be most useful for string data types. In one of the cases I've
helped someone with before, they had comment spam constituting a sizable
fraction of their database, throwing off the statistics and forcing a
sequential scan.

It might be simplest to expose an extern declaration of the C function with a
comment saying it's internal and changes are planned.

Incidentally, I wonder if it makes more sense to make this an optional btree
operator class support procedure rather than a type function.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
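The string case shows what "duplicate all the same code" would mean in practice. This is a rough, simplified Python illustration, not the backend's actual algorithm (convert_string_to_scalar in selfuncs.c is considerably more careful): strip the prefix that the range bounds share, then treat the remaining leading bytes as a base-256 fraction.

```python
def string_to_scalar(s, lo, hi, nbytes=4):
    """Place string s in [0, 1] relative to the bounds lo..hi -- a crude
    sketch of what a charting tool would have to reimplement for strings."""

    def val(x):
        # interpret the first nbytes characters as a base-256 fraction
        v = 0.0
        for i, ch in enumerate(x[:nbytes]):
            v += min(ord(ch), 255) / 256.0 ** (i + 1)
        return v

    # strip the prefix common to both bounds, so that e.g. a column of URLs
    # all starting with "http" still spreads out across the axis
    p = 0
    while p < min(len(lo), len(hi)) and lo[p] == hi[p]:
        p += 1

    s_v, lo_v, hi_v = val(s[p:]), val(lo[p:]), val(hi[p:])
    if hi_v <= lo_v:
        return 0.5
    return max(0.0, min(1.0, (s_v - lo_v) / (hi_v - lo_v)))
```

Even this toy version hard-wires assumptions (byte ordering, no locale) that the backend may want to change, which is exactly the maintenance burden Greg is arguing applications shouldn't each carry.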
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> When complaining I hadn't read the pghackers thread in which you
>> suggested this, and now that I'm caught up on email I remain
>> unconvinced. What do you need convert_to_scalar for in order to display
>> the pg_statistic histogram? You've already got the underlying data.

> Then the charting tool would have to duplicate all the same code that
> convert_to_scalar has to figure out what to do with the data.

How so? The entries in the histogram are equidistant by definition.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> Then the charting tool would have to duplicate all the same code that
>> convert_to_scalar has to figure out what to do with the data.
>
> How so? The entries in the histogram are equidistant by definition.

Huh? They have an equal number of values between them; they're not
equidistant in the scalar space. So the area of each bar should be the same,
but the width would be proportional to the width of the bucket and the height
inversely proportional.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
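The drawing rule Greg describes can be sketched in a few lines (a hypothetical illustration with made-up boundary values, not code from the patch): in an equi-depth histogram every bucket holds the same fraction of the rows, so every bar gets equal area, with width taken from the bucket's extent in scalar space and height derived as area over width.

```python
bounds = [0.0, 10.0, 15.0, 17.0, 50.0]  # hypothetical bucket boundaries
nbuckets = len(bounds) - 1
frac = 1.0 / nbuckets                   # fraction of rows per bucket (equi-depth)

bars = []
for lo, hi in zip(bounds, bounds[1:]):
    width = hi - lo
    height = frac / width               # so that area = width * height = frac
    bars.append((lo, width, height))
```

Narrow buckets (densely populated regions) come out as tall thin bars, wide buckets as low flat ones, which is what makes the chart readable as a density plot.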
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> How so? The entries in the histogram are equidistant by definition.

> Huh? They have equal number of values between them, they're not equidistant in
> the scalar space. So the area of each bar should be the same but the width
> would be proportional to the width of the bucket and the height inversely
> proportional.

You're presuming there exists a linear scalar space to reference the results
to. I'm unconvinced that that's a good assumption to make. Consider for
instance a column of URLs: practically all the entries will start with http
or ftp or one of a small number of other possibilities. If you try to
represent this with some hypothetical linear x-axis, you are going to produce
an unusable display.

convert_to_scalar partially avoids this problem because of the way it's used
within the backend: we don't actually assume that there is any global linear
space, only that the values within any single histogram bucket are fairly
uniformly distributed. (This assumption tends to become true in the limit as
the buckets get smaller, even if the global distribution is pretty crazy. And
even if it doesn't become true, the absolute error is bounded by the bucket
size.) You can't use the thing to measure the "positions" of histogram
boundary values, because those are part of its frame of reference, not
something it outputs.

			regards, tom lane
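Tom's distinction can be made concrete with a sketch of how the backend actually uses convert_to_scalar (in the spirit of ineq_histogram_selectivity in selfuncs.c; this Python version is a numeric-only stand-in for the real, type-aware code): locate the bucket containing the probe value, interpolate linearly *within that bucket only*, and combine the result with the bucket index. No global linear axis is ever assumed.

```python
def histogram_fraction(value, bounds):
    """Estimate the fraction of rows below `value` from equi-depth
    histogram bounds, interpolating only inside the matched bucket."""
    nbuckets = len(bounds) - 1
    if value <= bounds[0]:
        return 0.0
    if value >= bounds[-1]:
        return 1.0
    for i in range(nbuckets):
        lo, hi = bounds[i], bounds[i + 1]
        if lo <= value < hi:
            # per-bucket uniformity assumption: error bounded by bucket size
            within = (value - lo) / (hi - lo) if hi > lo else 0.5
            return (i + within) / nbuckets
```

The boundary values themselves only ever appear as the frame of reference for the interpolation, which is Tom's point: the function answers "what fraction lies below this value", not "where does this boundary sit on an axis".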
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> You're presuming there exists a linear scalar space to reference the
> results to. I'm unconvinced that that's a good assumption to make.
> Consider for instance a column of URLs; practically all the entries will
> start with http or ftp or one of a small number of other possibilities.
> If you try to represent this with some hypothetical linear x-axis you
> are going to produce an unusable display.

I was planning to use the first and last histogram values for the frame of
reference. It could still produce some weird graphs, but those cases are
precisely the cases where users might want to look at it and go, "hm, that
looks weird."

In most sane cases I expect it'll make a nice pretty graph. It remains to be
seen, though. All I have so far are graphs of synthetic integer-domain data,
which look just like what my synthetic distribution should look like.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> You're presuming there exists a linear scalar space to reference the
>> results to.

> I was planning to use the first and last histogram values for the frame of
> reference. It could still produce some weird graphs but those cases are
> precisely the cases where users might want to look at it and go, "hm, that
> looks weird".

It will produce unusable graphs in precisely the cases where users might
want to know what's going on...

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> I was planning to use the first and last histogram values for the frame of
>> reference. It could still produce some weird graphs but those cases are
>> precisely the cases where users might want to look at it and go, "hm, that
>> looks weird".
>
> It will produce unusable graphs in precisely the cases where users might
> want to know what's going on...

Well, that's possible, but I think it's worth experimenting. I'm pretty
confident there would be ways to clean the graphs up so they're sensible,
and even if not, they would still be a lot more helpful than nothing
user-visible.

Would it be reasonable to just make the function extern with a comment saying
it's an internal function subject to change in the future? That would at
least let us experiment with an add-on function for 8.3 so we know what we
really need for 8.4. Tools like this would be very version-dependent anyway.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com