Thread: Proposition for better performance
Hi,
I'm trying to create an availability table for existing products. I'm looking for a suitable schema design that gives good performance when I fetch products from a table containing millions of rows. I'm thinking of a schema like this:

create table availability (product_id integer, product_name varchar(255), start_date date, end_date date);

product_id | product_name | start_date | end_date
-----------+--------------+------------+-----------
         1 | product-1    | 27-03-2018 | 31-03-2018
However, if the product is not available on 29/03/2018, I have to divide the period into 2 intervals:
product_id | product_name | start_date | end_date
-----------+--------------+------------+-----------
         1 | product-1    | 27-03-2018 | 28-03-2018
         1 | product-1    | 30-03-2018 | 31-03-2018
With a table containing millions of products, dividing the availability into intervals will produce a huge number of rows, and fetching data will degrade performance. Is there any solution, such as range types or anything else, that can represent the unavailable dates?

Based on the example above, is there a data type, like a range type, that takes the start_date and end_date and excludes the unavailable dates?
On 03/27/2018 07:42 AM, hmidi slim wrote:
> Hi,
> I'm trying to create an availability table for existing products. I'm
> fetching the suitable schema to design in order to get good performance
> when I fetch products in a table contains millions of rows.
> I think to make a schema like this:
> *create table availability (product_id integer, product_name
> varchar(255), start_date date, end_date date)*.

I would use a tstzrange (or daterange) instead of separate start_date and end_date columns. Then you can create an exclusion constraint that has `EXCLUDE USING gist (id WITH =, available_during WITH &&)`. That will automatically add a GiST index on those columns that should help with faster lookups. (It will also prevent contradictions where a product has two overlapping rows.)

You didn't mention the queries you want to be fast, but that index should cover the reasonable ones, I think.

> Is there any solution to use such as range types or anything which
> mentions the unavailable dates.

For any product, there should be the same number of unavailable periods as available, right---or often one more? So I don't see any performance benefit in doing it that way, and it certainly seems less intuitive to store when something is *not* available.

--
Paul ~{:-)
pj@illuminatedcomputing.com
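A minimal sketch of the schema Paul is describing, with the EXCLUDE clause he quotes; the table and column names here are illustrative, and note that the btree_gist extension is needed for plain-integer equality inside a GiST exclusion constraint:

-- btree_gist lets GiST handle the integer equality in the constraint.
create extension if not exists btree_gist;

create table availability (
    id               integer not null,
    product_name     varchar(255),
    available_during daterange not null,
    -- No two rows for the same product may overlap in time;
    -- this also builds a GiST index that range queries can use.
    exclude using gist (id with =, available_during with &&)
);

-- Example lookup: products available throughout the requested period.
select *
from availability
where available_during @> daterange('2018-03-27', '2018-03-31');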
The query that I used to fetch products was:

select * from availability
where ('27-03-2018' between start_date and end_date)
and ('31-03-2018' between start_date and end_date);

I added another column named during of type daterange and I created a gist index:

create index idx on availability(during);

select * from availability
where during @> daterange('27-03-2018', '31-03-2018');

With a table of 15M rows I got an execution time of 1 minute. That's why I asked if it is a good solution to divide the availability of a product into intervals.

If I assume that a product has 5 availability intervals in every month, in 12 months we will have 60 intervals. For 1M products that will be 60M rows. This will affect the performance.
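For reference, the arithmetic behind that estimate (assuming the stated 5 intervals per month and 1M products):

-- 5 intervals/month * 12 months = 60 rows per product;
-- 60 rows * 1,000,000 products = 60,000,000 rows.
select 5 * 12                     as intervals_per_product,
       (5 * 12)::bigint * 1000000 as estimated_total_rows;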
On 03/27/2018 10:04 AM, hmidi slim wrote:
> the query that I used to fetch products was:
> select * from availability
> where ('27-03-2018' between start_date and end_date)
> and ('31-03-2018' between start_date and end_date);
>
> I added another column named during of type daterange and I created a
> gist index:
> create index idx on availability(during);

That creates a b-tree index. Also, what did you set `during` to?

> select * from availability
> where during @> daterange('27-03-2018', '31-03-2018');
>
> With a table of 15M rows I got an execution time of 1 minute.

What was the query plan? Did you confirm that it used the index?

--
Paul ~{:-)
pj@illuminatedcomputing.com
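The index method is visible in the catalogs, so a check like the following (using the names from the thread; idx_gist is an illustrative name chosen to avoid clashing with the existing idx) distinguishes the b-tree from the GiST version:

-- indexdef shows "USING btree" or "USING gist" for each index.
select indexname, indexdef
from pg_indexes
where tablename = 'availability';

-- The range-aware index must be created with an explicit USING gist:
create index idx_gist on availability using gist (during);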
I updated the example:
create table product_availabilities(product_id integer, start_date date, end_date date);
insert into product_availabilities(product_id, start_date, end_date)
select a, '2018-01-01', '2018-05-03' from generate_series(1,15000000) as a;
alter table product_availabilities add column during daterange;
update product_availabilities set during = daterange(start_date, end_date);
CREATE INDEX idx_time ON product_availabilities USING gist(during);
explain analyze SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01');
Query plan:
Bitmap Heap Scan on product_availabilities  (cost=33728.79..236086.04 rows=878500 width=26) (actual time=2775.058..5792.842 rows=15000000 loops=1)
  Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
  Heap Blocks: exact=31040 lossy=79255
  ->  Bitmap Index Scan on idx_time  (cost=0.00..33509.17 rows=878500 width=0) (actual time=2767.262..2767.262 rows=15000000 loops=1)
        Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 0.063 ms
Execution time: 6408.456 ms
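Two things stand out in that plan. First, it actually returns all 15M rows (the generated data gives every product the same range, which contains the queried one), so no index can reduce the work here. Second, "Heap Blocks: exact=31040 lossy=79255" means the bitmap outgrew work_mem and fell back to lossy per-page mode, which adds recheck overhead. A session-local experiment for the latter (the memory value is illustrative):

-- With more bitmap memory the scan can stay in exact (per-tuple) mode.
set work_mem = '256MB';

explain analyze SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01');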
SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01')

returns query runtime: 2 min
(Including the list....)

On 03/27/2018 10:49 AM, hmidi slim wrote:
> Sorry I didn't copy it very well:
> create index idx on availability using gist(during);
>
> and during = daterange(start_date,end_date)
>
> And the query plan used was seq scan not index scan.

It sounds like there must be some important information missing. Here is my attempt to mock up some fake data:

insert into availability
select  p.id,
        concat('prod ', p.id::text),
        daterange((now() + concat(r.v, ' days')::interval)::date,
                  (now() + concat((r.v + 1 + random() * 21)::int, ' days')::interval)::date)
from    (select * from generate_series(1, 1000000)) p(id)
cross join (select * from generate_series(1, 15)) n(i)
join lateral (
  select p2.id, n2.i, (random() * 600 - 300)::int AS v
  from generate_series(1, 1000000) p2(id),
       generate_series(1, 15) n2(i)
) r
on r.id = p.id and r.i = n.i
;

Then

explain select * from availability where during @> daterange('2018-03-27', '2018-03-31');
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using idx on availability  (cost=0.39..1644.41 rows=1 width=552)
   Index Cond: (during @> '[2018-03-27,2018-03-31)'::daterange)
(2 rows)

Running that query for real I get:

...
(215044 rows)

Time: 1450.099 ms (00:01.450)

So if the index isn't being used you'll have to do some digging to find out why.

--
Paul ~{:-)
pj@illuminatedcomputing.com
On 03/27/2018 11:14 AM, hmidi slim wrote:
> Query plan:
> Bitmap Heap Scan on product_availabilities  (cost=33728.79..236086.04
> rows=878500 width=26) (actual time=2775.058..5792.842 rows=15000000 loops=1)
>   Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
>   Heap Blocks: exact=31040 lossy=79255
>   ->  Bitmap Index Scan on idx_time  (cost=0.00..33509.17 rows=878500
> width=0) (actual time=2767.262..2767.262 rows=15000000 loops=1)
>         Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
> Planning time: 0.063 ms
> Execution time: 6408.456 ms
>
> SELECT product_id, start_date, end_date, during
> FROM product_availabilities
> WHERE during @> daterange('2018-02-01', '2018-04-01')
>
> returns query runtime: 2 min

Is it 6 seconds or 2 minutes? Where is the 2 minutes number coming from? Are you sure that's all Postgres? With 878k rows even sending them over the network is going to take a while, and then more time to JSONify them or whatever else you need to do.

--
Paul ~{:-)
pj@illuminatedcomputing.com
This is the message that I got:
Successfully run. Total query runtime: 2 min. 15000000 rows affected.
On 03/27/2018 11:42 AM, hmidi slim wrote:
> This is the message that I got:
>
> Successfully run. Total query runtime: 2 min.
> 15000000 rows affected.

Sorry, I don't think this is enough information to suggest anything.

--
Paul ~{:-)
pj@illuminatedcomputing.com
> On Mar 27, 2018, at 4:25 PM, Paul Jungwirth <pj@illuminatedcomputing.com> wrote:
>
> On 03/27/2018 11:42 AM, hmidi slim wrote:
>> This is the message that I got:
>> Successfully run. Total query runtime: 2 min.
>> 15000000 rows affected.
>
> Sorry, I don't think this is enough information to suggest anything.
>
> --
> Paul ~{:-)
> pj@illuminatedcomputing.com

I took the case presented and ran it in a sandbox:

UPDATE 15000000

explain analyze SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01');

Bitmap Heap Scan on product_availabilities  (cost=1156.67..46856.67 rows=75000 width=44) (actual time=5413.792..11367.379 rows=15000000 loops=1)
  Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
  Heap Blocks: exact=110295
  ->  Bitmap Index Scan on idx_time  (cost=0.00..1137.92 rows=75000 width=0) (actual time=5325.844..5325.844 rows=15000000 loops=1)
        Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 0.145 ms
Execution time: 14055.666 ms
(7 rows)

But considering the update (I did not check bloating or anything, but still):

vacuum full product_availabilities;
analyze product_availabilities;

The plan changes to:

Seq Scan on product_availabilities  (cost=0.00..242647.91 rows=15000033 width=26) (actual time=0.034..7207.697 rows=15000000 loops=1)
  Filter: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 6.701 ms
Execution time: 9238.285 ms

And the runtime does get into the two-minute range:

psql -U csidba -d armandp <f.sql > /dev/null

real    2m39.767s
user    1m45.576s
sys     0m12.324s

Not sure if that confirms the OP's findings, but to me a first question would be whether it is expected that the execution time reported by explain analyze is not even close to the actual run time.

BTW, what was the postgres version reported? I ran the presented case on 9.5.8.

—
Armand
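For what it's worth, that gap is expected: explain analyze executes the query on the server but discards the result rows, so transferring ~15M rows to the client and formatting them there are not counted in its reported execution time. A rough way to separate server work from transfer cost, using the thread's table:

\timing on

-- Server-side execution only: a single row comes back,
-- so network transfer and client rendering are negligible.
select count(*)
from (
  select product_id, start_date, end_date, during
  from product_availabilities
  where during @> daterange('2018-02-01', '2018-04-01')
) t;

-- Compare against the full query, which sends every matching row
-- to the client; the difference is transfer and rendering time.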