Thread: comment regarding double timestamps; and, infinite timestamps and NaN
selfuncs.c convert_to_scalar() says:

|* The several datatypes representing absolute times are all converted
|* to Timestamp, which is actually a double, and then we just use that
|* double value.  Note this will give correct results even for the "special"
|* values of Timestamp, since those are chosen to compare correctly;
|* see timestamp_cmp.

But: https://www.postgresql.org/docs/10/release-10.html

|Remove support for floating-point timestamps and intervals (Tom Lane)
|This removes configure's --disable-integer-datetimes option. Floating-point
|timestamps have few advantages and have not been the default since PostgreSQL 8.3.

|b6aa17e De-support floating-point timestamps.
|configure                                                    | 18 ++++++------------
|configure.in                                                 | 12 ++++++------
|doc/src/sgml/config.sgml                                     |  8 +++-----
|doc/src/sgml/datatype.sgml                                   | 55 +++++++++++--------------------------------------------
|doc/src/sgml/installation.sgml                               | 22 ----------------------
|src/include/c.h                                              |  7 ++++---
|src/include/pg_config.h.in                                   |  4 ----
|src/include/pg_config.h.win32                                |  4 ----
|src/interfaces/ecpg/include/ecpg_config.h.in                 |  4 ----
|src/interfaces/ecpg/include/pgtypes_interval.h               |  2 --
|src/interfaces/ecpg/test/expected/pgtypeslib-dt_test2.c      |  6 ++----
|src/interfaces/ecpg/test/expected/pgtypeslib-dt_test2.stdout |  2 ++
|src/interfaces/ecpg/test/pgtypeslib/dt_test2.pgc             |  6 ++----
|src/tools/msvc/Solution.pm                                   |  9 ---------
|src/tools/msvc/config_default.pl                             |  1 -
|15 files changed, 36 insertions(+), 124 deletions(-)

It's true that convert_to_scalar sees doubles:

|static double
|convert_timevalue_to_scalar(Datum value, Oid typid, bool *failure)
|{
|    switch (typid)
|    {
|        case TIMESTAMPOID:
|            return DatumGetTimestamp(value);

But:

$ git grep DatumGetTimestamp src/include/
src/include/utils/timestamp.h:#define DatumGetTimestamp(X)  ((Timestamp) DatumGetInt64(X))

So I propose it should say something like:

|* The several datatypes representing absolute times are all converted
|* to Timestamp, which is actually an int64, and then we just promote that
|* to double.  Note this will give correct results even for the "special"
|* values of Timestamp, since those are chosen to compare correctly;
|* see timestamp_cmp.

That seems to be only used for ineq_histogram_selectivity() interpolation of
histogram bins.  It looks to me like that, at least, isn't working for "special
values", and needs to use something other than isnan().  I added debugging code
and tested the attached like:

DROP TABLE t;
CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
INSERT INTO t VALUES('-infinity');
ALTER TABLE t ALTER t SET STATISTICS 1;
ANALYZE t;
explain SELECT * FROM t WHERE t>='2010-12-29';
Justin Pryzby <pryzby@telsasoft.com> writes:
> selfuncs.c convert_to_scalar() says:
> |* The several datatypes representing absolute times are all converted
> |* to Timestamp, which is actually a double, and then we just use that
> |* double value.

> So I propose it should say something like:
> |* The several datatypes representing absolute times are all converted
> |* to Timestamp, which is actually an int64, and then we just promote that
> |* to double.

Check, obviously this comment never got updated.

> That seems to be only used for ineq_histogram_selectivity() interpolation of
> histogram bins.  It looks to me that at least isn't working for "special
> values", and needs to use something other than isnan().

Uh, what?  This seems completely wrong to me.  We could possibly
promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
I don't really see the point.  They'll compare to other timestamp
values correctly without that, cf timestamp_cmp_internal().

The example you give seems to me to be working sanely, or at least
as sanely as it can given the number of histogram points available,
with the existing code.  In any case, shoving NaNs into the
computation is not going to make anything better.

			regards, tom lane
On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > That seems to be only used for ineq_histogram_selectivity() interpolation of
> > histogram bins.  It looks to me that at least isn't working for "special
> > values", and needs to use something other than isnan().
>
> Uh, what?  This seems completely wrong to me.  We could possibly
> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
> I don't really see the point.  They'll compare to other timestamp
> values correctly without that, cf timestamp_cmp_internal().
>
> The example you give seems to me to be working sanely, or at least
> as sanely as it can given the number of histogram points available,
> with the existing code.  In any case, shoving NaNs into the
> computation is not going to make anything better.

As I see it, the problem is that the existing code tests for isnan(), but
infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so there's
absurdly large values being used as if they were isnormal().

src/include/datatype/timestamp.h:#define DT_NOBEGIN	PG_INT64_MIN
src/include/datatype/timestamp.h-#define DT_NOEND	PG_INT64_MAX

On v12, my test gives:

|DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
|INSERT INTO t VALUES('-infinity');
|ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
|explain analyze SELECT * FROM t WHERE t>='2010-12-29';
| Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)

vs patched master:

|DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
|INSERT INTO t VALUES('-infinity');
|ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
|explain analyze SELECT * FROM t WHERE t>='2010-12-29';
| Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)

IMO 146 rows is a reasonable estimate given a single histogram bucket of
infinite width, and 3 rows is a less reasonable result of returning INT64_MAX
in one place and then handling it as a normal value.

The comments in ineq_histogram_selectivity() seem to indicate that this case
is intended to get binfrac=0.5:

| Watch out for the possibility that we got a NaN or Infinity from the
| division.  This can happen despite the previous checks, if for example
| "low" is -Infinity.

I changed to use INFINITY, -INFINITY and !isnormal() rather than nan() and
isnan() (although binfrac is actually NAN at that point so the existing test
is ok).

Justin
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
>> Uh, what?  This seems completely wrong to me.  We could possibly
>> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
>> I don't really see the point.  They'll compare to other timestamp
>> values correctly without that, cf timestamp_cmp_internal().
>> The example you give seems to me to be working sanely, or at least
>> as sanely as it can given the number of histogram points available,
>> with the existing code.  In any case, shoving NaNs into the
>> computation is not going to make anything better.

> As I see it, the problem is that the existing code tests for isnan(), but
> infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so there's
> absurdly large values being used as if they were isnormal().

I still say that (1) you're confusing NaN with Infinity, and (2)
you haven't actually shown that there's a problem to fix.
These endpoint values are *not* NaNs.

> On v12, my test gives:
> |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> |INSERT INTO t VALUES('-infinity');
> |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> | Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)

This is what it should do.  There's only one histogram bucket, and
it extends down to -infinity, so the conclusion is going to be that
the WHERE clause excludes all but a small part of the bucket.  This
is the correct answer based on the available stats; the problem is
not with the calculation, but with the miserable granularity of the
available stats.

> vs patched master:
> |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> |INSERT INTO t VALUES('-infinity');
> |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> | Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)

This answer is simply broken.  You've caused it to estimate half
of the bucket, which is an insane estimate for the given bucket
boundaries and WHERE constraint.

> IMO 146 rows is a reasonable estimate given a single histogram bucket of
> infinite width,

No, it isn't.

			regards, tom lane
infinite histogram bounds and nan (Re: comment regarding double timestamps; and, infinite timestamps and NaN)
From: Justin Pryzby
On Mon, Dec 30, 2019 at 02:18:17PM -0500, Tom Lane wrote:
> > On v12, my test gives:
> > |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> > |INSERT INTO t VALUES('-infinity');
> > |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> > |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> > | Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)
>
> This is what it should do.  There's only one histogram bucket, and
> it extends down to -infinity, so the conclusion is going to be that
> the WHERE clause excludes all but a small part of the bucket.  This
> is the correct answer based on the available stats; the problem is
> not with the calculation, but with the miserable granularity of the
> available stats.
>
> > vs patched master:
> > |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> > |INSERT INTO t VALUES('-infinity');
> > |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> > |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> > | Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)
>
> This answer is simply broken.  You've caused it to estimate half
> of the bucket, which is an insane estimate for the given bucket
> boundaries and WHERE constraint.
>
> > IMO 146 rows is a reasonable estimate given a single histogram bucket of
> > infinite width,
>
> No, it isn't.

When using floats, v12 also returns half the histogram:

DROP TABLE t;
CREATE TABLE t(t) AS SELECT generate_series(0, 99, 1)::float;
INSERT INTO t VALUES('-Infinity');
ALTER TABLE t ALTER t SET STATISTICS 1;
ANALYZE t;
explain analyze SELECT * FROM t WHERE t>='50';
 Seq Scan on t  (cost=0.00..2.26 rows=51 width=8) (actual time=0.014..0.020 rows=50 loops=1)

I'm fine if the isnan() logic changes, but the comment indicates it's intended
to be hit for an infinite histogram bound, and that doesn't work for timestamps
(convert_to_scalar() should return (double)INFINITY and not
(double)INT64_MIN/MAX).

On Mon, Dec 30, 2019 at 02:18:17PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
> >> Uh, what?  This seems completely wrong to me.  We could possibly
> >> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
> >> I don't really see the point.  They'll compare to other timestamp
> >> values correctly without that, cf timestamp_cmp_internal().
> >> The example you give seems to me to be working sanely, or at least
> >> as sanely as it can given the number of histogram points available,
> >> with the existing code.  In any case, shoving NaNs into the
> >> computation is not going to make anything better.
>
> > As I see it, the problem is that the existing code tests for isnan(), but
> > infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so there's
> > absurdly large values being used as if they were isnormal().
>
> I still say that (1) you're confusing NaN with Infinity, and (2)
> you haven't actually shown that there's a problem to fix.
> These endpoint values are *not* NaNs.

I probably did confuse it while trying to make the behavior match the comment
for timestamps.  The Subject says NAN since isnan(binfrac) is what's supposed
to be hit for that case.  The NAN is intended to come from:

|binfrac = (val - low) / (high - low);

which is some variation of -inf / inf.

Justin
Re: infinite histogram bounds and nan (Re: comment regarding double timestamps; and, infinite timestamps and NaN)
From: Tom Lane
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Mon, Dec 30, 2019 at 02:18:17PM -0500, Tom Lane wrote:
>> This answer is simply broken.  You've caused it to estimate half
>> of the bucket, which is an insane estimate for the given bucket
>> boundaries and WHERE constraint.

> I'm fine if the isnan() logic changes, but the comment indicates it's intended
> to be hit for an infinite histogram bound, but that doesn't work for timestamps
> (convert_to_scalar() should return (double)INFINITY and not
> (double)INT64_MIN/MAX).

I suppose the code you're looking at is

    binfrac = (val - low) / (high - low);

    /*
     * Watch out for the possibility that we got a NaN or
     * Infinity from the division.  This can happen
     * despite the previous checks, if for example "low"
     * is -Infinity.
     */
    if (isnan(binfrac) ||
        binfrac < 0.0 || binfrac > 1.0)
        binfrac = 0.5;

This doesn't really have any goals beyond "make sure we get a result
between 0.0 and 1.0, even if the calculation went pear-shaped for some
reason".  You could make an argument that it should be like

    if (isnan(binfrac))
        binfrac = 0.5;      /* throw up our hands for NaN */
    else if (binfrac <= 0.0)
        binfrac = 0.0;      /* clamp in case of -Inf or -0 */
    else if (binfrac > 1.0)
        binfrac = 1.0;      /* clamp in case of +Inf */

which would probably produce saner results in edge cases like these.
I think it'd also obviate the need for fooling with the conversion in
convert_to_scalar: while DT_NOBEGIN/DT_NOEND wouldn't produce exactly
the same result (hard 0.0 or 1.0) as an infinity, they'd produce
results very close to that.

			regards, tom lane