Thread: Proposition for better performance

Proposition for better performance

From
hmidi slim
Date:
Hi,
I'm trying to create an availability table for existing products. I'm trying to find a suitable schema design in order to get good performance when I fetch products from a table containing millions of rows.
I'm thinking of a schema like this:
create table availability (product_id integer, product_name varchar(255), start_date date, end_date date);

E.g.:
Given an example 'product-1' which has product_id = 1 and is available from 27/03/2018 to 31/03/2018:

product_id    product_name    start_date    end_date
         1    product-1       27-03-2018    31-03-2018


However, if the product is not available on 29/03/2018, I have to divide the period into 2 intervals:

product_id    product_name    start_date    end_date
         1    product-1       27-03-2018    28-03-2018
         1    product-1       30-03-2018    31-03-2018


With a table containing millions of products, if I divide the availability into intervals I will get a huge number of rows, and fetching data will degrade performance. Is there any solution, such as range types or anything else, that keeps track of the unavailable dates?
Based on the example mentioned above, is there a data type, like a range type, that takes the start_date and end_date and excludes the unavailable dates?


Re: Proposition for better performance

From
Paul Jungwirth
Date:
On 03/27/2018 07:42 AM, hmidi slim wrote:
> Hi,
> I'm trying to create an availability table for existing products. I'm
> trying to find a suitable schema design in order to get good performance
> when I fetch products from a table containing millions of rows.
> I'm thinking of a schema like this:
> create table availability (product_id integer, product_name
> varchar(255), start_date date, end_date date);

I would use a tstzrange (or daterange) instead of separate start_date 
and end_date columns. Then you can create an exclusion constraint that 
has `EXCLUDE USING gist (id WITH =, available_during WITH &&)`. That 
will automatically add a GiST index on those columns that should help 
with faster lookups. (It will also prevent contradictions where a 
product has two overlapping rows.)
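
Here's a minimal sketch of what I mean (column names assumed, and note
that the btree_gist extension is required for GiST to support plain `=`
on integers):

     CREATE EXTENSION IF NOT EXISTS btree_gist;

     CREATE TABLE availability (
         product_id       integer NOT NULL,
         product_name     varchar(255),
         available_during daterange NOT NULL,
         EXCLUDE USING gist (product_id WITH =, available_during WITH &&)
     );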

You didn't mention the queries you want to be fast, but that index 
should cover the reasonable ones I think.
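
For example (assuming the range column is named available_during), a
typical lookup should be able to use it:

     -- products available for the whole requested stay:
     select product_id
     from availability
     where available_during @> daterange('2018-03-27', '2018-03-31');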

> Is there any solution, such as range types or anything else, that
> keeps track of the unavailable dates?

For any product, there should be the same number of unavailable periods 
as available, right---or often one more? So I don't see any performance 
benefit in doing it that way, and it certainly seems less intuitive to 
store when something is *not* available.

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com


Re: Proposition for better performance

From
hmidi slim
Date:
The query that I used to fetch products was:
select * from availability
where ('27-03-2018' between start_date and end_date)
and ('31-03-2018' between start_date and end_date);

I added another column named during, of type daterange, and I created a gist index:
create index idx on availability(during);

select * from availability
where during @> daterange('27-03-2018', '31-03-2018');

With a table of 15M rows I got an execution time of 1 minute. That's why I asked if it is a good solution to divide the availability of a product into intervals.
If I assume that a product has 5 availability intervals in every month, in 12 months we will have 60 intervals. For 1M products that will be 60M rows. This will affect the performance.

Re: Proposition for better performance

From
Paul Jungwirth
Date:
On 03/27/2018 10:04 AM, hmidi slim wrote:
> The query that I used to fetch products was:
> select * from availability
> where ('27-03-2018' between start_date and end_date)
> and ('31-03-2018' between start_date and end_date);
> 
> I added another column named during, of type daterange, and I created a
> gist index:
> create index idx on availability(during);

That creates a b-tree index.
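For a GiST index it would need to be something like:

     create index idx on availability using gist (during);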

Also, what did you set `during` to?

> select * from availability
 > where during @> daterange('27-03-2018', '31-03-2018');
> 
> With a table of 15M rows I got an execution time of 1 minute.

What was the query plan? Did you confirm that it used the index?

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com


Re: Proposition for better performance

From
hmidi slim
Date:
I updated the example:
    create table product_availabilities(product_id integer, start_date date, end_date date);

    insert into product_availabilities(product_id, start_date, end_date)
    select a, '2018-01-01', '2018-05-03' from generate_series(1,15000000) as a;

    alter table product_availabilities add column during daterange;

    update product_availabilities set during = daterange(start_date, end_date);

    CREATE INDEX idx_time ON product_availabilities USING gist(during);

    explain analyze SELECT product_id, start_date, end_date, during
    FROM product_availabilities
    WHERE during @> daterange('2018-02-01', '2018-04-01');

Query plan:
Bitmap Heap Scan on product_availabilities  (cost=33728.79..236086.04 rows=878500 width=26) (actual time=2775.058..5792.842 rows=15000000 loops=1)
  Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
  Heap Blocks: exact=31040 lossy=79255
  ->  Bitmap Index Scan on idx_time  (cost=0.00..33509.17 rows=878500 width=0) (actual time=2767.262..2767.262 rows=15000000 loops=1)
        Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 0.063 ms
Execution time: 6408.456 ms


SELECT product_id, start_date, end_date, during
    FROM product_availabilities
    WHERE during @> daterange('2018-02-01', '2018-04-01')

returns: query runtime: 2 min

Re: Proposition for better performance

From
Paul Jungwirth
Date:
(Including the list....)

On 03/27/2018 10:49 AM, hmidi slim wrote:
> Sorry I didn't copy it very well:
> create index idx on availability using gist(during);
> 
> and during = daterange(start_date,end_date)
> 
> And the query plan used was a seq scan, not an index scan.

It sounds like there must be some important information missing. Here is 
my attempt to mock up some fake data:

     insert into availability
     select p.id, concat('prod ', p.id::text),
            daterange((now() + concat(r.v, ' days')::interval)::date,
                      (now() + concat((r.v + 1 + random() * 21)::int, ' days')::interval)::date)
     from (select * from generate_series(1, 1000000)) p(id)
     cross join (select * from generate_series(1, 15)) n(i)
     join lateral (
       select p2.id, n2.i, (random() * 600 - 300)::int AS v
       from generate_series(1, 1000000) p2(id),
            generate_series(1, 15) n2(i)
     ) r
     on r.id = p.id and r.i = n.i
     ;

Then

     explain select * from availability
     where during @> daterange('2018-03-27', '2018-03-31');
                                 QUERY PLAN
     -----------------------------------------------------------------------
      Index Scan using idx on availability  (cost=0.39..1644.41 rows=1 width=552)
        Index Cond: (during @> '[2018-03-27,2018-03-31)'::daterange)
     (2 rows)

Running that query for real I get:

     ...
     (215044 rows)

     Time: 1450.099 ms (00:01.450)

So if the index isn't being used you'll have to do some digging to find 
out why.    
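
One quick check (as a debugging aid only, not something to leave on) is
whether the planner will pick the index once seq scans are heavily
penalized:

     set enable_seqscan = off;
     explain select * from availability
     where during @> daterange('2018-03-27', '2018-03-31');
     reset enable_seqscan;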
    
-- 
Paul              ~{:-)
pj@illuminatedcomputing.com


Re: Proposition for better performance

From
Paul Jungwirth
Date:
On 03/27/2018 11:14 AM, hmidi slim wrote:
> Query plan:
> Bitmap Heap Scan on product_availabilities  (cost=33728.79..236086.04
> rows=878500 width=26) (actual time=2775.058..5792.842 rows=15000000 loops=1)
>    Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
>    Heap Blocks: exact=31040 lossy=79255
>    ->  Bitmap Index Scan on idx_time  (cost=0.00..33509.17 rows=878500
> width=0) (actual time=2767.262..2767.262 rows=15000000 loops=1)
>          Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
> Planning time: 0.063 ms
> Execution time: 6408.456 ms
>
>
> SELECT product_id, start_date, end_date, during
>      FROM product_availabilities
>      WHERE during @> daterange('2018-02-01', '2018-04-01')
>
> returns: query runtime: 2 min

Is it 6 seconds or 2 minutes? Where is the 2 minutes number coming from?
Are you sure that's all Postgres? With 878k rows even sending them over 
the network is going to take a while, and then more time to JSONify them 
or whatever else you need to do.
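
For what it's worth, `EXPLAIN (ANALYZE)` runs the query but discards the
result set, so it only measures server-side execution; timing the same
query from a client includes transferring every row. E.g. in psql:

     -- server-side execution only; rows are never sent to the client:
     explain (analyze)
     select product_id, start_date, end_date, during
     from product_availabilities
     where during @> daterange('2018-02-01', '2018-04-01');

     -- full round trip, including fetching all the rows:
     \timing on
     select product_id, start_date, end_date, during
     from product_availabilities
     where during @> daterange('2018-02-01', '2018-04-01');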

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com


Re: Proposition for better performance

From
hmidi slim
Date:
This is the message that I got:
Successfully run. Total query runtime: 2 min.
15000000 rows affected.

Re: Proposition for better performance

From
Paul Jungwirth
Date:
On 03/27/2018 11:42 AM, hmidi slim wrote:
> This is the message that I got:
> 
> Successfully run. Total query runtime: 2 min.
> 15000000 rows affected.

Sorry, I don't think this is enough information to suggest anything.

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com


Re: Proposition for better performance

From
armand pirvu
Date:
> On Mar 27, 2018, at 4:25 PM, Paul Jungwirth <pj@illuminatedcomputing.com> wrote:
>
> On 03/27/2018 11:42 AM, hmidi slim wrote:
>> This is the message that I got:
>> Successfully run. Total query runtime: 2 min.
>> 15000000 rows affected.
>
> Sorry, I don't think this is enough information to suggest anything.
>
> --
> Paul              ~{:-)
> pj@illuminatedcomputing.com
>

I took the case presented and ran it in a sandbox:

UPDATE 15000000

explain analyze SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01')

 Bitmap Heap Scan on product_availabilities  (cost=1156.67..46856.67 rows=75000 width=44) (actual time=5413.792..11367.379 rows=15000000 loops=1)
   Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
   Heap Blocks: exact=110295
   ->  Bitmap Index Scan on idx_time  (cost=0.00..1137.92 rows=75000 width=0) (actual time=5325.844..5325.844 rows=15000000 loops=1)
         Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
 Planning time: 0.145 ms
 Execution time: 14055.666 ms
(7 rows)

But

considering the update (I did not check bloating or anything but still)

vacuum full product_availabilities;
analyze product_availabilities;

The plan changes to

 Seq Scan on product_availabilities  (cost=0.00..242647.91 rows=15000033 width=26) (actual time=0.034..7207.697 rows=15000000 loops=1)
   Filter: (during @> '[2018-02-01,2018-04-01)'::daterange)
 Planning time: 6.701 ms
 Execution time: 9238.285 ms

And the runtime does get into the two-minute range:
time psql -U csidba -d armandp <f.sql > /dev/null
real    2m39.767s
user    1m45.576s
sys    0m12.324s


Not sure if that confirms the OP's findings, but to me a first question would be whether it is expected that the execution time reported by explain analyze is nowhere near the actual run time.

BTW, was the Postgres version reported? I ran the presented case on 9.5.8.


— Armand