Re: How to properly query lots of rows based on timestamps? - Mailing list pgsql-general
From: Alban Hertroys
Subject: Re: How to properly query lots of rows based on timestamps?
Date:
Msg-id: 32CEEF29-ED8C-4300-B663-7B3D31A6B2CF@gmail.com
In response to: How to properly query lots of rows based on timestamps? (Thorsten Schöning <tschoening@am-soft.de>)
Responses: Re: How to properly query lots of rows based on timestamps?
List: pgsql-general
> On 29 Aug 2020, at 10:24, Thorsten Schöning <tschoening@am-soft.de> wrote:
>
> Hi all,
>
> I have a table containing around 95 million rows, pretty much only
> storing a timestamp and further IDs of related tables containing the
> actual data in the end.
>
>> CREATE TABLE clt_rec
>> (
>>   id          BIGSERIAL                   NOT NULL,
>>   oms_rec     BIGINT                      NOT NULL,
>>   captured_at TIMESTAMP(6) WITH TIME ZONE NOT NULL,
>>   rssi        SMALLINT                    NOT NULL,
>>   CONSTRAINT pk_clt_rec PRIMARY KEY (id),
>>   CONSTRAINT fk_clt_rec_oms_rec FOREIGN KEY (oms_rec) REFERENCES "oms_rec" ("id"),
>>   CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec)
>> );
>
> In many use cases I need to search all of those rows based on their
> timestamp to find rows arbitrarily far in the past: sometimes it's only
> 15 minutes back, sometimes it's 2 years, sometimes it's finding the
> first day of each month over 15 months for some of those telegrams, etc.
> I pretty often need to compare those timestamps, and some queries simply
> take multiple seconds, which adds up when multiple slightly different
> queries need to be executed one after another. The following are two
> abstracts of Postgres' query plans:
>
> Plan 1:
>
>> ->  Nested Loop  (cost=1.14..343169.49 rows=43543 width=20) (actual time=0.313..113.974 rows=34266 loops=3)
>>       ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..3437.90 rows=43543 width=24) (actual time=0.153..20.192 rows=34266 loops=3)
>>             Index Cond: ((captured_at >= ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)) AND (captured_at <= ('2020-08-01 00:00:00+02'::timestamp with time zone + '1 day'::interval)))
>>       ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.80 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=102799)
>>             Index Cond: (id = clt_rec.oms_rec)

What happens here is that the planner looks up the lower and upper boundaries; everything in between those index nodes is a candidate record.
Next, it loops over those to match the other condition of your query (id = clt_rec.oms_rec). You didn't tell whether there's an index on that column.

You'd probably see a performance improvement were you to create an index on (captured_at, id). If your Postgres version is somewhat recent, that could even lead to an Index Only Scan.

> Plan 2:
>
>> ->  Nested Loop  (cost=1.14..836381.50 rows=111934 width=20) (actual time=0.379..911.697 rows=334465 loops=3)
>>       ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..8819.57 rows=111934 width=24) (actual time=0.193..154.176 rows=334465 loops=3)
>>             Index Cond: ((captured_at >= ('2020-08-28 10:21:06.968+02'::timestamp with time zone - '14 days'::interval)) AND (captured_at <= ('2020-08-28 10:21:06.968+02'::timestamp with time zone + '00:00:00'::interval)))
>>       ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.39 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=1003394)
>>             Index Cond: (id = clt_rec.oms_rec)

And this situation is very much the same issue, apart from the larger number of candidate records.

> Postgres seems to properly use available indexes, parallel workers and
> stuff like that. But looking at the actual times and compared to all
> the other parts of the query, comparing those timestamps simply takes
> the most time.

It only needs to compare 2 timestamps.

> I've looked into this topic and found statements that one shouldn't put
> too many rows into the index[1] and stuff like that, or it will be
> ignored entirely. But that doesn't seem to be the case for me according
> to the plan. OTOH, my index really is simply on the column containing
> the timestamp, with no function reducing things to dates or the like to
> reduce the number of rows.
>
>> CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at );

Try this:

CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at, id );

Alban Hertroys
--
There is always an exception to always.
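[Editorial addendum] The suggested composite index can be sketched as follows; the index name and the sample date range are illustrative, not taken from the thread:

```sql
-- Hypothetical composite index: captured_at first so the range condition can
-- seek to the lower boundary, id second so a query touching only these two
-- columns can be answered from the index alone (an Index Only Scan).
CREATE INDEX idx_clt_rec_captured_at_id
    ON clt_rec USING btree (captured_at, id);

-- Keep the visibility map current; index-only scans fall back to heap
-- fetches for pages not marked all-visible.
VACUUM ANALYZE clt_rec;

-- Check whether the planner now chooses an Index Only Scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, captured_at
FROM   clt_rec
WHERE  captured_at >= timestamptz '2020-08-01 00:00:00+02'
AND    captured_at <  timestamptz '2020-08-01 00:00:00+02' + interval '1 day';
```

Note that whether an Index Only Scan actually fires depends on which columns the full query references: if it also needs clt_rec.oms_rec for the join, that column would have to be part of the index as well (or listed in an INCLUDE clause on Postgres 11 and later).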