Thread: Choosing values for multivariate MCV lists
Hi,

The current implementation of multi-column MCV lists (added in this
cycle) uses a fairly simple algorithm to pick combinations to include in
the MCV list. We just compute a minimum number of occurrences, and then
include all entries sampled more often. See get_mincount_for_mcv_list().

By coincidence I received a real-world data set where this does not work
particularly well, so I'm wondering if we can improve this somehow. It
does not make the estimates worse than without the MCV lists, so I don't
think we have an issue, but I wonder if we could do better.

The data set is very simple - it's a single table of valid addresses
(city/street name and street number):

  CREATE TABLE addresses (
      city_name    text,
      street_name  text,
      street_no    int
  );

Data with Czech addresses are available here (it's ~9.3MB, so I'm not
attaching it directly)

  https://drive.google.com/file/d/1EiZGsk6s5hqzZrL7t5KiaqByJnBsndfA

and attached is a SQL script I used to compute a "virtual" MCV list.

There are about 3M records, so let's query for a street name in one of
the cities:

  EXPLAIN ANALYZE SELECT * FROM addresses
   WHERE city_name = 'Most' AND street_name = 'Pionýrů'

                                 QUERY PLAN
  ----------------------------------------------------------------------------
   Seq Scan on addresses  (cost=0.00..62645.38 rows=1 width=25) (actual time=117.225..204.291 rows=779 loops=1)
     Filter: ((city_name = 'Most'::text) AND (street_name = 'Pionýrů'::text))
     Rows Removed by Filter: 2923846
   Planning Time: 0.065 ms
   Execution Time: 204.525 ms
  (5 rows)

It's true 779 rows is only a tiny fraction of the data set (~0.025%),
but this data set is a bit weird in one other aspect - about half of the
records have an empty (NULL) street_name, it's just city + number. Small
villages simply don't have streets at all, and large cities probably
started as small villages, so they have such addresses too.

This however means that by choosing the MCV entries solely based on the
number of occurrences in the sample, we end up with MCV lists where the
vast majority of entries have a NULL street name.

That's why we got such a poor estimate in the example query, despite the
fact that the city/street combination is the most frequent in the data
set (with a non-NULL street name).

The other weird thing is that the frequency of NULL street names is
fairly uniform in the whole data set. In total about 50% of addresses
match that, and for individual cities it's generally between 25% and
100%, so the estimate is less than 2x off in those cases.

But for addresses with non-NULL street names, the situation is very
different. Some street names are unique to a single city, etc.

Overall, this means we end up with an MCV list with entries representing
the mostly-uniform part of the data set, instead of preferring the
entries that are truly skewed.

So I'm wondering if we could/should rethink the algorithm, to look at
the frequency and base_frequency, and maybe pick entries based on their
ratio, or something like that.

For example, we might sort the entries by

  abs(freq - base_freq) / freq

which seems like a reasonable "measure of mis-estimate". Or maybe we
might look just at abs(freq - base_freq)? I think the first option would
be better, because (1 row vs. 100,000 rows) is probably worse than
(1M rows vs. 2M rows).

Of course, this is a challenging problem, for a couple of reasons.

Firstly, picking simply the most frequent groups is very simple and it
gives us additional information about the largest group (which may be
useful elsewhere, e.g. the incremental sort patch).
Secondly, if the correlated combinations are less frequent, how reliably
can we even estimate the frequency from a sample? The combination in the
example query was ~0.02% of the data, so how likely is it to be sampled?

Alternatively, it's possible to increase the statistics target to make
the sample larger, but that also keeps more entries in the MCV list,
including the correlated ones. So maybe the best thing we can do is to
just rely on that?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
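To make the freq/base_freq terminology used throughout this thread concrete: the base frequency of a combination is the product of the per-column frequencies, i.e. what independent columns would predict. Below is a small standalone C sketch (hypothetical struct and made-up numbers loosely based on the example above, not code from PostgreSQL) of that quantity and of the proposed mis-estimate measure abs(freq - base_freq) / freq; a strongly correlated combination scores high, a mostly-independent one scores near zero.

/*
 * Illustrative sketch only - hypothetical names and numbers, not the
 * PostgreSQL source. Shows the base frequency (product of per-column
 * frequencies, i.e. the independence assumption) and the proposed
 * "measure of mis-estimate" abs(freq - base_freq) / freq.
 */
#include <math.h>
#include <stdio.h>

typedef struct SampleGroup
{
    double  freq;           /* observed frequency of the combination */
    double  col_freq[2];    /* per-column frequencies of its values */
} SampleGroup;

/* base frequency = what we'd estimate if the columns were independent */
static double
base_frequency(const SampleGroup *g)
{
    return g->col_freq[0] * g->col_freq[1];
}

/* proposed ranking measure from the post above */
static double
misestimate(const SampleGroup *g)
{
    double  base = base_frequency(g);

    return fabs(g->freq - base) / g->freq;
}

int
main(void)
{
    /* (city = 'Most', street = 'Pionýrů'): rare, but strongly correlated */
    SampleGroup correlated = {0.00025, {0.02, 0.0005}};

    /* (city = 'Most', street = NULL): frequent, but close to independent */
    SampleGroup uniform = {0.01, {0.02, 0.5}};

    printf("correlated: %g\n", misestimate(&correlated));   /* ~0.96 */
    printf("uniform:    %g\n", misestimate(&uniform));      /* 0 */

    return 0;
}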
On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> The current implementation of multi-column MCV lists (added in this
> cycle) uses a fairly simple algorithm to pick combinations to include in
> the MCV list. We just compute a minimum number of occurrences, and then
> include all entries sampled more often. See get_mincount_for_mcv_list().
>
> [snip]
>
> This however means that by choosing the MCV entries solely based on the
> number of occurrences in the sample, we end up with MCV lists where the
> vast majority of entries have a NULL street name.
>
> That's why we got such a poor estimate in the example query, despite the
> fact that the city/street combination is the most frequent in the data
> set (with a non-NULL street name).
>

I think the fact that they're NULL is a bit of a red herring because
we're treating NULL just like any other value. The same thing would
happen if there were some other very common non-NULL value that
dominated the dataset.

> The other weird thing is that the frequency of NULL street names is
> fairly uniform in the whole data set. In total about 50% of addresses
> match that, and for individual cities it's generally between 25% and
> 100%, so the estimate is less than 2x off in those cases.
>
> But for addresses with non-NULL street names, the situation is very
> different. Some street names are unique to a single city, etc.
>
> Overall, this means we end up with an MCV list with entries representing
> the mostly-uniform part of the data set, instead of preferring the
> entries that are truly skewed.
>
> So I'm wondering if we could/should rethink the algorithm, to look at
> the frequency and base_frequency, and maybe pick entries based on their
> ratio, or something like that.
>

Hmm, interesting. I think I would like to see a more rigorous
justification for changing the algorithm deciding which values to
keep.

If I've understood correctly, I think the problem is this: The
mincount calculation is a good way of identifying MCV candidates to
keep, because it ensures that we don't keep values that don't appear
sufficiently often to produce accurate estimates, and ideally we'd
keep everything with count >= mincount. However, in the case where
there are more than stats_target items with count >= mincount, simply
ordering by count and keeping the most commonly seen values isn't
necessarily the best strategy in the case of multivariate statistics.

To identify what the best strategy might be, I think you need to
examine the errors that would occur as a result of *not* keeping a
value in the multivariate MCV list. Given a value that appears with
count >= mincount, N*freq ought to be a reasonable estimate for the
actual number of occurrences of that value in the table, and
N*base_freq ought to be a reasonable estimate for the
univariate-stats-based estimate that it would be given if it weren't
kept in the multivariate MCV list. So the absolute error resulting
from not keeping that value would be

  N * Abs(freq - base_freq)

But then I think we ought to take into account how often we're likely
to get that error. If we're simply picking values at random, the
likelihood of getting that value is just its frequency, so the average
absolute error would be

  Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )

which suggests that, if we wanted to reduce the average absolute error
of the estimates, we should order by freq*Abs(freq-base_freq) and keep
the top n in the MCV list.
On the other hand, if we wanted to reduce the average *relative* error
of the estimates, we might instead order by Abs(freq-base_freq).

> For example, we might sort the entries by
>
>   abs(freq - base_freq) / freq
>

I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
though, because that would seem likely to put too much weight on the
least commonly occurring values.

> Of course, this is a challenging problem, for a couple of reasons.
>
> Firstly, picking simply the most frequent groups is very simple and it
> gives us additional information about the largest group (which may be
> useful elsewhere, e.g. the incremental sort patch).
>

Yes, you would have to keep in mind that changing the algorithm would
mean that the MCV list no longer represented all the most common
values. For example, it would no longer be valid to assume that no
value appeared more often than the first value in the MCV list. I'm
not sure that we currently do that though.

> Secondly, if the correlated combinations are less frequent, how reliably
> can we even estimate the frequency from a sample? The combination in the
> example query was ~0.02% of the data, so how likely is it to be sampled?
>

I think that's OK as long as we keep the mincount filter in the new
algorithm. I think some experimentation is definitely worthwhile here,
but it looks plausible that a decent approach might be:

1). Discard all values with count < mincount.
2). Order by freq*Abs(freq-base_freq) (or possibly just
Abs(freq-base_freq)) and keep the top n, where n is the stats target.
3). Perhaps re-sort by count, so the final list is in frequency order
again? Not sure if that's a desirable property to keep.

Regards,
Dean
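A compact standalone sketch of that three-step strategy (hypothetical struct and function names, not the actual ANALYZE code), assuming freq and base_freq have already been computed for each sampled group:

#include <math.h>
#include <stdlib.h>

typedef struct McvCandidate
{
    int     count;          /* occurrences in the sample */
    double  freq;           /* count / sample size */
    double  base_freq;      /* product of per-column frequencies */
} McvCandidate;

/* descending by freq * Abs(freq - base_freq), i.e. expected absolute error */
static int
cmp_by_error(const void *a, const void *b)
{
    const McvCandidate *ca = (const McvCandidate *) a;
    const McvCandidate *cb = (const McvCandidate *) b;
    double  ea = ca->freq * fabs(ca->freq - ca->base_freq);
    double  eb = cb->freq * fabs(cb->freq - cb->base_freq);

    return (ea < eb) ? 1 : (ea > eb) ? -1 : 0;
}

/* descending by count, to restore frequency order */
static int
cmp_by_count(const void *a, const void *b)
{
    return ((const McvCandidate *) b)->count - ((const McvCandidate *) a)->count;
}

/* Returns the number of items kept; candidates[0 .. result-1] is the list. */
static int
choose_mcv_items(McvCandidate *candidates, int ngroups,
                 int mincount, int stats_target)
{
    int     nkeep = 0;

    /* 1) discard everything below mincount */
    for (int i = 0; i < ngroups; i++)
    {
        if (candidates[i].count >= mincount)
            candidates[nkeep++] = candidates[i];
    }

    /* 2) keep the stats_target groups with the largest expected error */
    qsort(candidates, nkeep, sizeof(McvCandidate), cmp_by_error);
    if (nkeep > stats_target)
        nkeep = stats_target;

    /* 3) optionally re-sort by count, so the final list is in frequency order */
    qsort(candidates, nkeep, sizeof(McvCandidate), cmp_by_count);

    return nkeep;
}

Whether step 3 is worth keeping is exactly the open question raised above.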
On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> The current implementation of multi-column MCV lists (added in this
>> cycle) uses a fairly simple algorithm to pick combinations to include in
>> the MCV list. We just compute a minimum number of occurrences, and then
>> include all entries sampled more often. See get_mincount_for_mcv_list().
>>
>> [snip]
>>
>> This however means that by choosing the MCV entries solely based on the
>> number of occurrences in the sample, we end up with MCV lists where the
>> vast majority of entries have a NULL street name.
>>
>> That's why we got such a poor estimate in the example query, despite the
>> fact that the city/street combination is the most frequent in the data
>> set (with a non-NULL street name).
>>
>
>I think the fact that they're NULL is a bit of a red herring because
>we're treating NULL just like any other value. The same thing would
>happen if there were some other very common non-NULL value that
>dominated the dataset.
>

I wasn't really suggesting the NULL itself is an issue - sorry if that
wasn't clear. It might be any other value, as long as it's very common
(and roughly uniform) in all cities. So yes, I agree with you here.

>> The other weird thing is that the frequency of NULL street names is
>> fairly uniform in the whole data set. In total about 50% of addresses
>> match that, and for individual cities it's generally between 25% and
>> 100%, so the estimate is less than 2x off in those cases.
>>
>> But for addresses with non-NULL street names, the situation is very
>> different. Some street names are unique to a single city, etc.
>>
>> Overall, this means we end up with an MCV list with entries representing
>> the mostly-uniform part of the data set, instead of preferring the
>> entries that are truly skewed.
>>
>> So I'm wondering if we could/should rethink the algorithm, to look at
>> the frequency and base_frequency, and maybe pick entries based on their
>> ratio, or something like that.
>>
>
>Hmm, interesting. I think I would like to see a more rigorous
>justification for changing the algorithm deciding which values to
>keep.
>

Sure, I'm not going to pretend my proposals were particularly rigorous -
it was more a collection of random ideas.

>If I've understood correctly, I think the problem is this: The
>mincount calculation is a good way of identifying MCV candidates to
>keep, because it ensures that we don't keep values that don't appear
>sufficiently often to produce accurate estimates, and ideally we'd
>keep everything with count >= mincount. However, in the case where
>there are more than stats_target items with count >= mincount, simply
>ordering by count and keeping the most commonly seen values isn't
>necessarily the best strategy in the case of multivariate statistics.
>

Yes.

>To identify what the best strategy might be, I think you need to
>examine the errors that would occur as a result of *not* keeping a
>value in the multivariate MCV list. Given a value that appears with
>count >= mincount, N*freq ought to be a reasonable estimate for the
>actual number of occurrences of that value in the table, and
>N*base_freq ought to be a reasonable estimate for the
>univariate-stats-based estimate that it would be given if it weren't
>kept in the multivariate MCV list. So the absolute error resulting
>from not keeping that value would be
>
>  N * Abs(freq - base_freq)
>

Yeah.
Considering N is the same for all groups in the sample, this would mean
the same thing as Abs(freq - base_freq).

>But then I think we ought to take into account how often we're likely
>to get that error. If we're simply picking values at random, the
>likelihood of getting that value is just its frequency, so the average
>absolute error would be
>
>  Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )
>
>which suggests that, if we wanted to reduce the average absolute error
>of the estimates, we should order by freq*Abs(freq-base_freq) and keep
>the top n in the MCV list.
>

Interesting idea. But I'm not sure it makes much sense to assume the
rows are picked randomly - OTOH we don't really know anything about how
the data will be queried, so we may just as well do that.

>On the other hand, if we wanted to reduce the average *relative* error
>of the estimates, we might instead order by Abs(freq-base_freq).
>

Hmmm, yeah. I don't know what the right choice is here, TBH.

>> For example, we might sort the entries by
>>
>>   abs(freq - base_freq) / freq
>>
>
>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>though, because that would seem likely to put too much weight on the
>least commonly occurring values.
>

But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

   ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)

and

   ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by
six orders of magnitude. I think the MCV algorithm should reflect this.

>> Of course, this is a challenging problem, for a couple of reasons.
>>
>> Firstly, picking simply the most frequent groups is very simple and it
>> gives us additional information about the largest group (which may be
>> useful elsewhere, e.g. the incremental sort patch).
>>
>
>Yes, you would have to keep in mind that changing the algorithm would
>mean that the MCV list no longer represented all the most common
>values. For example, it would no longer be valid to assume that no
>value appeared more often than the first value in the MCV list. I'm
>not sure that we currently do that though.
>

We don't, but I've proposed using some of this knowledge in the
incremental sort patch. But I think we might actually extend the MCV
list to store some extra information (e.g. the frequency of the largest
groups, even if we don't store those groups themselves).

>> Secondly, if the correlated combinations are less frequent, how reliably
>> can we even estimate the frequency from a sample? The combination in the
>> example query was ~0.02% of the data, so how likely is it to be sampled?
>>
>
>I think that's OK as long as we keep the mincount filter in the new
>algorithm. I think some experimentation is definitely worthwhile here,
>but it looks plausible that a decent approach might be:
>
>1). Discard all values with count < mincount.
>2). Order by freq*Abs(freq-base_freq) (or possibly just
>Abs(freq-base_freq)) and keep the top n, where n is the stats target.
>3). Perhaps re-sort by count, so the final list is in frequency order
>again? Not sure if that's a desirable property to keep.
>

Will try. Thanks for the feedback.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
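For reference, a trivial standalone program working through the two misestimates mentioned above: both have the same absolute error (one million rows), but the ratio differs by six orders of magnitude, which is the argument for a relative measure.

#include <math.h>
#include <stdio.h>

/* Compare estimated and actual row counts by absolute difference and ratio. */
static void
report(double est_rows, double act_rows)
{
    printf("est=%.0f act=%.0f  abs diff=%.0f  ratio=%.0f\n",
           est_rows, act_rows,
           fabs(act_rows - est_rows),
           fmax(act_rows / est_rows, est_rows / act_rows));
}

int
main(void)
{
    report(1, 1000001);         /* abs diff=1000000  ratio=1000001 */
    report(1000000, 2000000);   /* abs diff=1000000  ratio=2 */

    return 0;
}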
On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>
> >I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
> >though, because that would seem likely to put too much weight on the
> >least commonly occurring values.
>
> But would that be an issue, or a good thing? I mean, as long as the item
> is above mincount, we take the counts as reliable. As I explained, my
> motivation for proposing that was that both
>
>    ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>
> and
>
>    ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>
> have exactly the same Abs(freq - base_freq), but I think we both agree
> that the first misestimate is much more dangerous, because it's off by
> six orders of magnitude.
>

Hmm, that's a good example. That definitely suggests that we should be
trying to minimise the relative error, but also perhaps that what we
should be looking at is actually just the ratio freq / base_freq,
rather than their difference.

Regards,
Dean
On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
>On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>>
>> >I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>> >though, because that would seem likely to put too much weight on the
>> >least commonly occurring values.
>>
>> But would that be an issue, or a good thing? I mean, as long as the item
>> is above mincount, we take the counts as reliable. As I explained, my
>> motivation for proposing that was that both
>>
>>    ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>>
>> and
>>
>>    ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>>
>> have exactly the same Abs(freq - base_freq), but I think we both agree
>> that the first misestimate is much more dangerous, because it's off by
>> six orders of magnitude.
>>
>
>Hmm, that's a good example. That definitely suggests that we should be
>trying to minimise the relative error, but also perhaps that what we
>should be looking at is actually just the ratio freq / base_freq,
>rather than their difference.
>

Attached are patches that implement this (well, the first one optimizes
how the base frequency is computed, which helps for high statistic
target values). The 0002 part picks the items based on

   Max(freq/base_freq, base_freq/freq)

It did help with the addresses data set quite a bit, but I'm sure it
needs more testing. I've also tried using

   freq * abs(freq - base_freq)

but the estimates were not as good.

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange,
though, because with statistic target set to 10k the smallest frequency
for a single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I
think is something the float8 can represent).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
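The item-picking rule described above, paraphrased as a small helper (a restatement of the idea, not code taken from the 0002 patch); it could replace the freq * fabs(freq - base_freq) score in the selection sketch shown earlier in the thread:

#include <math.h>

/*
 * Symmetric ratio between the multivariate frequency and the frequency
 * implied by per-column statistics; penalizes over- and under-estimates
 * alike. Assumes both freq and base_freq are greater than zero.
 */
static double
mcv_item_score(double freq, double base_freq)
{
    return fmax(freq / base_freq, base_freq / freq);
}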
On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> One annoying thing I noticed is that the base_frequency tends to end up
> being 0, most likely due to getting too small. It's a bit strange, though,
> because with statistic target set to 10k the smallest frequency for a
> single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
> something the float8 can represent).
>

Yeah, it should be impossible for the base frequency to underflow to 0.
However, it looks like the problem is with mcv_list_items()'s use of %f
to convert to text, which is pretty ugly.

Regards,
Dean
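A standalone illustration of the %f problem (the 1/9e12 value is borrowed from the earlier back-of-the-envelope estimate of the smallest possible two-column base frequency): the default six-digit precision rounds the value to zero in the text output even though the underlying float8 is fine, whereas %.15f keeps it.

#include <stdio.h>

int
main(void)
{
    double  base_freq = 1.0 / 9e12;   /* ~1.1e-13, well within float8 range */
    char    buf[64];

    snprintf(buf, sizeof(buf), "%f", base_freq);
    printf("%%f    -> %s\n", buf);    /* prints 0.000000 */

    snprintf(buf, sizeof(buf), "%.15f", base_freq);
    printf("%%.15f -> %s\n", buf);    /* prints 0.000000000000111 */

    return 0;
}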
On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> One annoying thing I noticed is that the base_frequency tends to end up
>> being 0, most likely due to getting too small. It's a bit strange, though,
>> because with statistic target set to 10k the smallest frequency for a
>> single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>> something the float8 can represent).
>>
>
>Yeah, it should be impossible for the base frequency to underflow to
>0. However, it looks like the problem is with mcv_list_items()'s use
>of %f to convert to text, which is pretty ugly.
>

Yeah, I realized that too, eventually. One way to fix that would be
adding %.15f to the sprintf() call, but that just adds ugliness. It's
probably time to rewrite the function to build the tuple from datums,
instead of relying on BuildTupleFromCStrings.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
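A rough sketch of that direction - building the result tuple from datums so the frequencies never pass through a text representation. The column layout and helper name here are hypothetical, intended only to illustrate the idea; this is not the actual patch.

#include "postgres.h"

#include "access/htup_details.h"
#include "funcapi.h"

/*
 * Build one result row for an MCV item directly from datums, instead of
 * snprintf()ing the values and calling BuildTupleFromCStrings(). The
 * frequencies stay exact float8 values all the way to the output.
 */
static Datum
build_mcv_item_tuple(TupleDesc tupdesc, int index,
                     Datum values_arr, Datum nulls_arr,
                     double frequency, double base_frequency)
{
    Datum       values[5];
    bool        nulls[5] = {false, false, false, false, false};
    HeapTuple   tuple;

    values[0] = Int32GetDatum(index);
    values[1] = values_arr;     /* item values, built elsewhere */
    values[2] = nulls_arr;      /* null flags, built elsewhere */
    values[3] = Float8GetDatum(frequency);       /* no %f round-trip */
    values[4] = Float8GetDatum(base_frequency);  /* no %f round-trip */

    tuple = heap_form_tuple(tupdesc, values, nulls);

    return HeapTupleGetDatum(tuple);
}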
On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>One annoying thing I noticed is that the base_frequency tends to end up
>>>being 0, most likely due to getting too small. It's a bit strange, though,
>>>because with statistic target set to 10k the smallest frequency for a
>>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>>>something the float8 can represent).
>>>
>>
>>Yeah, it should be impossible for the base frequency to underflow to
>>0. However, it looks like the problem is with mcv_list_items()'s use
>>of %f to convert to text, which is pretty ugly.
>>
>
>Yeah, I realized that too, eventually. One way to fix that would be
>adding %.15f to the sprintf() call, but that just adds ugliness. It's
>probably time to rewrite the function to build the tuple from datums,
>instead of relying on BuildTupleFromCStrings.
>

OK, attached is a patch doing this. It's pretty simple, and it does
resolve the issue with frequency precision.

There's one issue with the signature, though - currently the function
returns null flags as bool array, but values are returned as simple
text value (formatted in array-like way, but still just a text).

In the attached patch I've reworked both to proper arrays, but obviously
that'd require a CATVERSION bump - and there's not much appetite for that
past beta2, I suppose. So I'll just undo this bit.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 22, 2019 at 04:10:52PM +0200, Tomas Vondra wrote:
>On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
>>On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>>>
>>>>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>>>>though, because that would seem likely to put too much weight on the
>>>>least commonly occurring values.
>>>
>>>But would that be an issue, or a good thing? I mean, as long as the item
>>>is above mincount, we take the counts as reliable. As I explained, my
>>>motivation for proposing that was that both
>>>
>>>   ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>>>
>>>and
>>>
>>>   ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>>>
>>>have exactly the same Abs(freq - base_freq), but I think we both agree
>>>that the first misestimate is much more dangerous, because it's off by
>>>six orders of magnitude.
>>>
>>
>>Hmm, that's a good example. That definitely suggests that we should be
>>trying to minimise the relative error, but also perhaps that what we
>>should be looking at is actually just the ratio freq / base_freq,
>>rather than their difference.
>>
>
>Attached are patches that implement this (well, the first one optimizes
>how the base frequency is computed, which helps for high statistic
>target values). The 0002 part picks the items based on
>
>   Max(freq/base_freq, base_freq/freq)
>
>It did help with the addresses data set quite a bit, but I'm sure it
>needs more testing. I've also tried using
>
>   freq * abs(freq - base_freq)
>
>but the estimates were not as good.
>
>One annoying thing I noticed is that the base_frequency tends to end up
>being 0, most likely due to getting too small. It's a bit strange,
>though, because with statistic target set to 10k the smallest frequency
>for a single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I
>think is something the float8 can represent).
>

FWIW while doing more tests on this, I've noticed a rather annoying
behavior when increasing the statistics target.

With the current algorithm, picking values merely based on frequency,
the MCV list is expanded in a stable way. Increasing the statistics
target means the MCV list may grow, but the larger MCV list will contain
the smaller one (ignoring changes due to differences in the sample).

After switching to the two-phase algorithm (first picking candidates
based on mincount, then picking the items based on error) that's no
longer true. I've repeatedly seen cases where increasing the target
lowered mincount, adding candidates with larger frequency errors and
removing some of the items from the "smaller" MCV list.

In practice this means that increasing the statistics target may easily
make the estimates much worse (for some of the values).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 24, 2019 at 01:42:32AM +0200, Tomas Vondra wrote:
>On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>>On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>>>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>One annoying thing I noticed is that the base_frequency tends to end up
>>>>being 0, most likely due to getting too small. It's a bit strange, though,
>>>>because with statistic target set to 10k the smallest frequency for a
>>>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>>>>something the float8 can represent).
>>>>
>>>
>>>Yeah, it should be impossible for the base frequency to underflow to
>>>0. However, it looks like the problem is with mcv_list_items()'s use
>>>of %f to convert to text, which is pretty ugly.
>>>
>>
>>Yeah, I realized that too, eventually. One way to fix that would be
>>adding %.15f to the sprintf() call, but that just adds ugliness. It's
>>probably time to rewrite the function to build the tuple from datums,
>>instead of relying on BuildTupleFromCStrings.
>>
>
>OK, attached is a patch doing this. It's pretty simple, and it does
>resolve the issue with frequency precision.
>
>There's one issue with the signature, though - currently the function
>returns null flags as bool array, but values are returned as simple
>text value (formatted in array-like way, but still just a text).
>
>In the attached patch I've reworked both to proper arrays, but obviously
>that'd require a CATVERSION bump - and there's not much appetite for that
>past beta2, I suppose. So I'll just undo this bit.
>

Meh, I forgot to attach the patch, of course ...

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Mon, 24 Jun 2019 at 00:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
> >On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
> >>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >>>One annoying thing I noticed is that the base_frequency tends to end up
> >>>being 0, most likely due to getting too small. It's a bit strange, though,
> >>>because with statistic target set to 10k the smallest frequency for a
> >>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
> >>>something the float8 can represent).
> >>>
> >>
> >>Yeah, it should be impossible for the base frequency to underflow to
> >>0. However, it looks like the problem is with mcv_list_items()'s use
> >>of %f to convert to text, which is pretty ugly.
> >>
> >
> >Yeah, I realized that too, eventually. One way to fix that would be
> >adding %.15f to the sprintf() call, but that just adds ugliness. It's
> >probably time to rewrite the function to build the tuple from datums,
> >instead of relying on BuildTupleFromCStrings.
> >
>
> OK, attached is a patch doing this. It's pretty simple, and it does
> resolve the issue with frequency precision.
>
> There's one issue with the signature, though - currently the function
> returns null flags as bool array, but values are returned as simple
> text value (formatted in array-like way, but still just a text).
>
> In the attached patch I've reworked both to proper arrays, but obviously
> that'd require a CATVERSION bump - and there's not much appetite for that
> past beta2, I suppose. So I'll just undo this bit.
>

Hmm, I didn't spot that the old code was using a single text value
rather than a text array. That's clearly broken, especially since it
wasn't even necessarily constructing a valid textual representation of
an array (e.g., if an individual value's textual representation
included the array markers "{" or "}").

IMO fixing this to return a text array is worth doing, even though it
means a catversion bump.

Regards,
Dean
On Mon, Jun 24, 2019 at 02:54:01PM +0100, Dean Rasheed wrote:
>On Mon, 24 Jun 2019 at 00:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>> >On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>> >>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >>>One annoying thing I noticed is that the base_frequency tends to end up
>> >>>being 0, most likely due to getting too small. It's a bit strange, though,
>> >>>because with statistic target set to 10k the smallest frequency for a
>> >>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>> >>>something the float8 can represent).
>> >>>
>> >>
>> >>Yeah, it should be impossible for the base frequency to underflow to
>> >>0. However, it looks like the problem is with mcv_list_items()'s use
>> >>of %f to convert to text, which is pretty ugly.
>> >>
>> >
>> >Yeah, I realized that too, eventually. One way to fix that would be
>> >adding %.15f to the sprintf() call, but that just adds ugliness. It's
>> >probably time to rewrite the function to build the tuple from datums,
>> >instead of relying on BuildTupleFromCStrings.
>> >
>>
>> OK, attached is a patch doing this. It's pretty simple, and it does
>> resolve the issue with frequency precision.
>>
>> There's one issue with the signature, though - currently the function
>> returns null flags as bool array, but values are returned as simple
>> text value (formatted in array-like way, but still just a text).
>>
>> In the attached patch I've reworked both to proper arrays, but obviously
>> that'd require a CATVERSION bump - and there's not much appetite for that
>> past beta2, I suppose. So I'll just undo this bit.
>>
>
>Hmm, I didn't spot that the old code was using a single text value
>rather than a text array. That's clearly broken, especially since it
>wasn't even necessarily constructing a valid textual representation of
>an array (e.g., if an individual value's textual representation
>included the array markers "{" or "}").
>
>IMO fixing this to return a text array is worth doing, even though it
>means a catversion bump.
>

Yeah :-(

It used to be just a "debugging" function, but now that we're using it
e.g. in the pg_stats_ext definition, we need to be more careful about
the output. Presumably we could keep the text output and make sure it's
escaped properly etc. We could even build an array internally and then
run it through an output function. That would not require a catversion
bump.

I'll clean up the patch changing the function signature. If others think
the catversion bump would be too significant an annoyance at this point,
I will switch back to the text output (with proper formatting).

Opinions?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 25, 2019 at 11:18:19AM +0200, Tomas Vondra wrote:
>On Mon, Jun 24, 2019 at 02:54:01PM +0100, Dean Rasheed wrote:
>>On Mon, 24 Jun 2019 at 00:42, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
>>>>On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
>>>>>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>>>One annoying thing I noticed is that the base_frequency tends to end up
>>>>>>being 0, most likely due to getting too small. It's a bit strange, though,
>>>>>>because with statistic target set to 10k the smallest frequency for a
>>>>>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
>>>>>>something the float8 can represent).
>>>>>>
>>>>>
>>>>>Yeah, it should be impossible for the base frequency to underflow to
>>>>>0. However, it looks like the problem is with mcv_list_items()'s use
>>>>>of %f to convert to text, which is pretty ugly.
>>>>>
>>>>
>>>>Yeah, I realized that too, eventually. One way to fix that would be
>>>>adding %.15f to the sprintf() call, but that just adds ugliness. It's
>>>>probably time to rewrite the function to build the tuple from datums,
>>>>instead of relying on BuildTupleFromCStrings.
>>>>
>>>
>>>OK, attached is a patch doing this. It's pretty simple, and it does
>>>resolve the issue with frequency precision.
>>>
>>>There's one issue with the signature, though - currently the function
>>>returns null flags as bool array, but values are returned as simple
>>>text value (formatted in array-like way, but still just a text).
>>>
>>>In the attached patch I've reworked both to proper arrays, but obviously
>>>that'd require a CATVERSION bump - and there's not much appetite for that
>>>past beta2, I suppose. So I'll just undo this bit.
>>>
>>
>>Hmm, I didn't spot that the old code was using a single text value
>>rather than a text array. That's clearly broken, especially since it
>>wasn't even necessarily constructing a valid textual representation of
>>an array (e.g., if an individual value's textual representation
>>included the array markers "{" or "}").
>>
>>IMO fixing this to return a text array is worth doing, even though it
>>means a catversion bump.
>>
>
>Yeah :-(
>
>It used to be just a "debugging" function, but now that we're using it
>e.g. in the pg_stats_ext definition, we need to be more careful about
>the output. Presumably we could keep the text output and make sure it's
>escaped properly etc. We could even build an array internally and then
>run it through an output function. That would not require a catversion
>bump.
>
>I'll clean up the patch changing the function signature. If others think
>the catversion bump would be too significant an annoyance at this point,
>I will switch back to the text output (with proper formatting).
>
>Opinions?
>

Attached is a cleaned-up version of that patch. The main difference is
that instead of using construct_md_array() this uses ArrayBuildState to
construct the arrays, which is much easier. The docs don't need any
update because those were already using text[] for the parameter - the
code was inconsistent with it.

This does require a catversion bump, but as annoying as it is, I think
it's worth it (and there's also the thread discussing the serialization
issues). Barring objections, I'll get it committed later next week, once
I get back from PostgresLondon.
As I mentioned before, if we don't want any additional catversion bumps,
it's possible to pass the arrays through output functions - that would
allow us to keep the text output (but correct, unlike what we have now).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
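For reference, a simplified sketch of the ArrayBuildState approach described above (hypothetical helper, not the committed code): each non-null value is converted with its column's output function and accumulated into a text[] datum, while NULLs become proper array NULLs instead of being mangled into a hand-built string.

#include "postgres.h"

#include "catalog/pg_type.h"
#include "fmgr.h"
#include "utils/array.h"
#include "utils/builtins.h"

/* Build a text[] datum holding the textual form of one MCV item's values. */
static Datum
mcv_item_values_to_text_array(Datum *values, bool *isnull,
                              FmgrInfo *outfuncs, int natts)
{
    ArrayBuildState *astate = NULL;

    for (int i = 0; i < natts; i++)
    {
        Datum   txt = (Datum) 0;

        if (!isnull[i])
        {
            char   *str = OutputFunctionCall(&outfuncs[i], values[i]);

            txt = CStringGetTextDatum(str);
        }

        /* NULLs are stored as array NULLs, not as a literal string */
        astate = accumArrayResult(astate, txt, isnull[i],
                                  TEXTOID, CurrentMemoryContext);
    }

    return makeArrayResult(astate, CurrentMemoryContext);
}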
On Sat, 29 Jun 2019 at 14:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> >>>>>However, it looks like the problem is with mcv_list_items()'s use
> >>>>>of %f to convert to text, which is pretty ugly.
> >>>>
> >>>There's one issue with the signature, though - currently the function
> >>>returns null flags as bool array, but values are returned as simple
> >>>text value (formatted in array-like way, but still just a text).
> >>>
> >>IMO fixing this to return a text array is worth doing, even though it
> >>means a catversion bump.
> >>
> Attached is a cleaned-up version of that patch. The main difference is
> that instead of using construct_md_array() this uses ArrayBuildState to
> construct the arrays, which is much easier. The docs don't need any
> update because those were already using text[] for the parameter - the
> code was inconsistent with it.
>

Cool, this looks a lot neater and fixes both of the issues discussed:
floating point values are no longer converted to text, and the values
are returned as a proper text array.

One minor nitpick -- it doesn't seem to be necessary to build the
arrays *outfuncs and *fmgrinfo. You may as well just fetch the
individual column output function information at the point where it's
used (in the "if (!item->isnull[i])" block) rather than building those
arrays.

> This does require a catversion bump, but as annoying as it is, I think
> it's worth it (and there's also the thread discussing the serialization
> issues). Barring objections, I'll get it committed later next week, once
> I get back from PostgresLondon.
>
> As I mentioned before, if we don't want any additional catversion bumps,
> it's possible to pass the arrays through output functions - that would
> allow us to keep the text output (but correct, unlike what we have now).
>

I think this is a clear bug fix, so I'd vote for fixing it properly
now, with a catversion bump.

Regards,
Dean
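The alternative suggested above - looking up the column's output function at the point of use instead of from prebuilt outfuncs/fmgrinfo arrays - might look roughly like this (hypothetical helper, shown only to illustrate the suggestion):

#include "postgres.h"

#include "utils/builtins.h"
#include "utils/lsyscache.h"

/*
 * Convert one non-null column value to a text datum, fetching the output
 * function information right where it is needed.
 */
static Datum
column_value_as_text(Datum value, Oid typid)
{
    Oid     outfunc;
    bool    isvarlena;
    char   *str;

    getTypeOutputInfo(typid, &outfunc, &isvarlena);
    str = OidOutputFunctionCall(outfunc, value);

    return CStringGetTextDatum(str);
}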
On Mon, Jul 01, 2019 at 12:02:28PM +0100, Dean Rasheed wrote:
>On Sat, 29 Jun 2019 at 14:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> >>>>>However, it looks like the problem is with mcv_list_items()'s use
>> >>>>>of %f to convert to text, which is pretty ugly.
>> >>>>
>> >>>There's one issue with the signature, though - currently the function
>> >>>returns null flags as bool array, but values are returned as simple
>> >>>text value (formatted in array-like way, but still just a text).
>> >>>
>> >>IMO fixing this to return a text array is worth doing, even though it
>> >>means a catversion bump.
>> >>
>> Attached is a cleaned-up version of that patch. The main difference is
>> that instead of using construct_md_array() this uses ArrayBuildState to
>> construct the arrays, which is much easier. The docs don't need any
>> update because those were already using text[] for the parameter - the
>> code was inconsistent with it.
>>
>
>Cool, this looks a lot neater and fixes both of the issues discussed:
>floating point values are no longer converted to text, and the values
>are returned as a proper text array.
>
>One minor nitpick -- it doesn't seem to be necessary to build the
>arrays *outfuncs and *fmgrinfo. You may as well just fetch the
>individual column output function information at the point where it's
>used (in the "if (!item->isnull[i])" block) rather than building those
>arrays.
>

OK, thanks for the feedback. I'll clean up the function lookup.

>
>> This does require a catversion bump, but as annoying as it is, I think
>> it's worth it (and there's also the thread discussing the serialization
>> issues). Barring objections, I'll get it committed later next week, once
>> I get back from PostgresLondon.
>>
>> As I mentioned before, if we don't want any additional catversion bumps,
>> it's possible to pass the arrays through output functions - that would
>> allow us to keep the text output (but correct, unlike what we have now).
>>
>
>I think this is a clear bug fix, so I'd vote for fixing it properly
>now, with a catversion bump.
>

I agree.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services