Thread: tstzrange on large table gives poor estimate of expected rows

tstzrange on large table gives poor estimate of expected rows

From: Tom Dearman
Hi,

We have a fairly big table (22 million rows) which has start and end
timestamp with time zone columns, among others.  Some of the columns
plus the start timestamp make up the primary key.  The end timestamp
is exactly one day ahead of the start timestamp for each row, and
there are approximately 10000 rows per day, so each day there will be
about 10000 inserts (and many updates) where the start is yyyy-mm-dd
00:00:00.0 and the end is one day ahead of that.  We have created a
GiST index on tstzrange(start, end) and then run a query which looks
for exactly a one-day range:

explain (analyze, buffers)
select tstzrange_test.interval_start_date as interval_start_date
from tstzrange_test tstzrange_test
where tstzrange('2021-07-20 00:00:00.0', '2021-07-21 00:00:00.0', '[]')
      @> tstzrange(tstzrange_test.interval_start_date,
                   tstzrange_test.interval_end_date, '[]');

In the real query this result is then grouped by other columns and
joined to another table which holds similar data, but by the hour
instead of by the day.
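
For illustration only (the thread does not show the real query, so
the hourly table and join column below are invented), the full query
has roughly this shape:

select d.interval_start_date, count(*)
from tstzrange_test d
join hourly_test h  -- hypothetical hourly table, not in the thread
  on h.user_registration_id = d.user_registration_id
where tstzrange('2021-07-20 00:00:00.0', '2021-07-21 00:00:00.0', '[]')
      @> tstzrange(d.interval_start_date, d.interval_end_date, '[]')
group by d.interval_start_date;

A rows=1 estimate on the range predicate can lead the planner to pick
a nested-loop style join, which is what turns the 10000 actual rows
into a slow query.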

The planner always underestimates the number of rows:

 Index Scan using tstzrange_test_ix01 on tstzrange_test  (cost=0.41..8.43 rows=1 width=8) (actual time=0.347..8.889 rows=10000 loops=1)
   Index Cond: (tstzrange(interval_start_date, interval_end_date, '[]'::text) <@ '["2021-07-20 00:00:00+00","2021-07-21 00:00:00+00"]'::tstzrange)
   Buffers: shared hit=815
 Planning Time: 0.120 ms
 Execution Time: 9.591 ms
(5 rows)

and when this is aggregated and then joined to another table it leads
to a slow query.  If I change the query so the start of the range is
12 hours earlier (i.e. no extra rows would match, since the previous
day's start date would be 24 hours earlier) it gives a better
estimate and uses a different plan:

explain (analyze, buffers)
select tstzrange_test.interval_start_date as interval_start_date
from tstzrange_test tstzrange_test
where tstzrange('2021-07-19 12:00:00.0', '2021-07-21 00:00:00.0', '[]')
      @> tstzrange(tstzrange_test.interval_start_date,
                   tstzrange_test.interval_end_date, '[]');

 Bitmap Heap Scan on tstzrange_test  (cost=199.89..11672.03 rows=4577 width=8) (actual time=6.880..7.844 rows=10000 loops=1)
   Recheck Cond: ('["2021-07-19 12:00:00+00","2021-07-21 00:00:00+00"]'::tstzrange @> tstzrange(interval_start_date, interval_end_date, '[]'::text))
   Heap Blocks: exact=65
   Buffers: shared hit=330
   ->  Bitmap Index Scan on tstzrange_test_ix01  (cost=0.00..198.74 rows=4577 width=0) (actual time=6.866..6.866 rows=10000 loops=1)
         Index Cond: (tstzrange(interval_start_date, interval_end_date, '[]'::text) <@ '["2021-07-19 12:00:00+00","2021-07-21 00:00:00+00"]'::tstzrange)
         Buffers: shared hit=265
 Planning Time: 0.157 ms
 Execution Time: 8.186 ms
(9 rows)

I have tried increasing the statistics target to 10000 on each of the
timestamp columns, and also setting up an extended statistics object
declaring that start and end are related (dependencies).
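
For reference, those two attempts look like this (standard PostgreSQL
commands, shown here against the test table below):

-- per-column statistics target
ALTER TABLE tstzrange_test ALTER COLUMN interval_start_date SET STATISTICS 10000;
ALTER TABLE tstzrange_test ALTER COLUMN interval_end_date SET STATISTICS 10000;

-- extended statistics declaring the columns functionally dependent
CREATE STATISTICS tstzrange_test_dep (dependencies)
    ON interval_start_date, interval_end_date
    FROM tstzrange_test;

ANALYZE tstzrange_test;  -- rebuild statistics with the new settings

Neither helps here, which may be because the index condition is on
the tstzrange(...) expression, which carries its own statistics,
rather than on the underlying columns.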

This is how I created and populated the table:

CREATE TABLE tstzrange_test (
    interval_start_date  timestamp with time zone,
    interval_end_date    timestamp with time zone,
    user_registration_id bigint
);

insert into tstzrange_test (interval_start_date, interval_end_date, user_registration_id)
select '2021-01-01 00:00:00.0'::timestamp with time zone + ((psid)/10000 || ' day')::interval,
       '2021-01-02 00:00:00.0'::timestamp with time zone + ((psid)/10000 || ' day')::interval,
       floor(random() * (500) + 1)::int
from generate_series(1,3600000) as s(psid);

CREATE INDEX tstzrange_test_ix01 ON tstzrange_test USING gist
(tstzrange(interval_start_date, interval_end_date, '[]'::text));


Any help on how I can make the planner estimate better would be much
appreciated.

Thanks.



Re: tstzrange on large table gives poor estimate of expected rows

From: Michael Lewis
Date: Mon, 24 Jan 2022 17:43
If interval_end_date is always 1 day ahead, why store it at all?

Dependencies on a custom stats object wouldn't do anything, I don't think, because they are offset. They are 100% correlated, but not in a way that any of the existing stat types capture, as far as I can figure.
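
A sketch of that suggestion (not shown in the thread): drop the
stored end column and derive the range from the start alone, relying
on the fixed one-day offset. Queries must then use the same
expression for the index to apply.

-- assumes the one-day offset always holds
CREATE INDEX tstzrange_test_ix02 ON tstzrange_test USING gist
    (tstzrange(interval_start_date,
               interval_start_date + interval '1 day', '[]'));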

Re: tstzrange on large table gives poor estimate of expected rows

From: Tom Dearman
Thanks for your help.  It is true we could get rid of it, but we
still want to use the functional index on the date range, as we
understand it is supposed to be a better lookup - we also have other
date-range lookups on tables that seem to be degrading.  I have found
a solution to the problem.  The Postgres default_statistics_target is
100, and when we upped it to 10000 the estimate was good.  We could
not set the default to 10000 in production, and there appeared to be
no way to change the value for the functional index, as the
statistics target is set per column.  However, in a post from 2012,
Tom Lane gives a way to set the statistics target on a functional
index:
https://www.postgresql.org/message-id/6668.1351105908%40sss.pgh.pa.us
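
For anyone finding this later, a sketch of that approach: since
PostgreSQL 11 the statistics target of an expression index column can
be set directly, addressed by column number (the 2012 post uses ALTER
TABLE on the index name for the same effect on older versions):

-- raise the target on the index's (only) expression column
ALTER INDEX tstzrange_test_ix01 ALTER COLUMN 1 SET STATISTICS 10000;
ANALYZE tstzrange_test;  -- re-analyze so the new target takes effect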

Thanks.

On Mon, 24 Jan 2022 at 17:43, Michael Lewis <mlewis@entrata.com> wrote:
>
> If interval_end_date is always 1 day ahead, why store it at all?
>
> Dependencies on a custom stats object wouldn't do anything, I don't
> think, because they are offset. They are 100% correlated, but not in
> a way that any of the existing stat types capture, as far as I can
> figure.