Thread: Partitioned tables and SELECT ... ORDER BY ... LIMIT

Partitioned tables and SELECT ... ORDER BY ... LIMIT

From
Дмитрий Шалашов
Date:
Hi,

let's imagine that we have some table, partitioned by a timestamp field, and we query it with a SELECT ordered by that field (DESC, for example), with some modest LIMIT.
Let's further say that the required number of rows is found in the first table the query encounters (say, the latest one).
I am just wondering: why does PostgreSQL nevertheless read a couple of buffers from each of the older tables?
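
Schematically, the setup looks something like this (a made-up minimal example, not my real schema):

-- parent table plus monthly children with CHECK constraints
CREATE TABLE events (added timestamp NOT NULL, payload text);
CREATE TABLE events_201405 (CHECK (added >= '2014-05-01' AND added < '2014-06-01')) INHERITS (events);
CREATE TABLE events_201406 (CHECK (added >= '2014-06-01' AND added < '2014-07-01')) INHERITS (events);

-- and the kind of query in question:
SELECT * FROM events ORDER BY added DESC LIMIT 100;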

Best regards,
Dmitriy Shalashov

Re: Partitioned tables and SELECT ... ORDER BY ... LIMIT

From
François Beausoleil
Date:
Hello!

On 2014-10-16 at 08:35, Дмитрий Шалашов <skaurus@gmail.com> wrote:
> let's imagine that we have some table, partitioned by a timestamp field, and we query it with a SELECT ordered by that field (DESC, for example), with some modest LIMIT.
> Let's further say that the required number of rows is found in the first table the query encounters (say, the latest one).
> I am just wondering: why does PostgreSQL nevertheless read a couple of buffers from each of the older tables?

Could you share a specific plan with us, as well as your PostgreSQL version? It would make the conversation much easier.

Can you also confirm your constraint_exclusion parameter is set to either 'partition' or 'on'?
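
You can check it like this in psql:

SHOW constraint_exclusion;
-- and, if needed, change it for the session:
SET constraint_exclusion = partition;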

Thanks!
François Beausoleil



Re: Partitioned tables and SELECT ... ORDER BY ... LIMIT

From
Дмитрий Шалашов
Date:
Hi!

So I guess this is not obviously normal behavior, then? :)

My version is 9.2.9, constraint_exclusion set to 'partition'.


# \d user_feed_master
                                     Table "public.user_feed_master"
   Column   |            Type             |                           Modifiers
------------+-----------------------------+---------------------------------------------------------------
 id         | bigint                      | not null default nextval('user_feed_master_id_seq'::regclass)
 user_id    | integer                     | not null
 type       | smallint                    | not null
 added      | timestamp without time zone | not null
 active_id  | integer                     | not null
 url_id     | integer                     | not null
 channel_id | integer                     |
 updated    | timestamp without time zone | default now()
 activity   | text                        |
Number of child tables: 11 (Use \d+ to list them.)


# \d user_feed_201406 -- one of partitions
                                     Table "public.user_feed_201406"
   Column   |            Type             |                           Modifiers
------------+-----------------------------+---------------------------------------------------------------
 id         | bigint                      | not null default nextval('user_feed_master_id_seq'::regclass)
 user_id    | integer                     | not null
 type       | smallint                    | not null
 added      | timestamp without time zone | not null
 active_id  | integer                     | not null
 url_id     | integer                     | not null
 channel_id | integer                     |
 updated    | timestamp without time zone | default now()
 activity   | text                        |
Indexes:
    "user_feed_201406_pkey" PRIMARY KEY, btree (id)
    "user_feed_201406_url_id_user_id_idx" btree (url_id, user_id)
    "user_feed_201406_user_id_active_id_added_idx" btree (user_id, active_id, added DESC)
    "user_feed_201406_user_id_added_idx" btree (user_id, added DESC)
Check constraints:
    "user_feed_201406_added_check" CHECK (added >= '2014-06-01'::date AND added < '2014-07-01'::date)
Inherits: user_feed_master


# SELECT count(*) FROM user_feed_201406 WHERE user_id = 83586;
 count
-------
   909
(1 row)


EXPLAIN (ANALYZE,BUFFERS) SELECT url_id FROM user_feed_master WHERE user_id = 83586 AND added <= '2014-06-30 23:59:59.63551' ORDER BY added DESC LIMIT 100;
                                                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.13..397.23 rows=100 width=12) (actual time=107.442..107.706 rows=100 loops=1)
   Buffers: shared hit=104 read=18
   I/O Timings: read=107.131
   ->  Result  (cost=0.13..19664.74 rows=4952 width=12) (actual time=107.442..107.695 rows=100 loops=1)
         Buffers: shared hit=104 read=18
         I/O Timings: read=107.131
         ->  Merge Append  (cost=0.13..19664.74 rows=4952 width=12) (actual time=107.440..107.683 rows=100 loops=1)
               Sort Key: public.user_feed_master.added
               Buffers: shared hit=104 read=18
               I/O Timings: read=107.131
               ->  Sort  (cost=0.01..0.02 rows=1 width=12) (actual time=0.006..0.006 rows=0 loops=1)
                     Sort Key: public.user_feed_master.added
                     Sort Method: quicksort  Memory: 25kB
                     ->  Seq Scan on user_feed_master  (cost=0.00..0.00 rows=1 width=12) (actual time=0.001..0.001 rows=0 loops=1)
                           Filter: ((added <= '2014-06-30 23:59:59.63551'::timestamp without time zone) AND (user_id = 83586))
               ->  Index Scan using user_feed_201312_user_id_added_idx on user_feed_201312 user_feed_master  (cost=0.00..1525.71 rows=392 width=12) (actual time=15.020..15.020 rows=1 loops=1)
                     Index Cond: ((user_id = 83586) AND (added <= '2014-06-30 23:59:59.63551'::timestamp without time zone))
                     Buffers: shared hit=1 read=3
                     I/O Timings: read=14.980
               ->  Index Scan using user_feed_201401_user_id_added_idx on user_feed_201401 user_feed_master  (cost=0.00..966.92 rows=272 width=12) (actual time=8.703..8.703 rows=1 loops=1)
                     Index Cond: ((user_id = 83586) AND (added <= '2014-06-30 23:59:59.63551'::timestamp without time zone))
                     Buffers: shared hit=2 read=2
                     I/O Timings: read=8.667
               ->  Index Scan using user_feed_201402_user_id_added_idx on user_feed_201402 user_feed_master  (cost=0.00..1356.38 rows=396 width=12) (actual time=12.818..12.818 rows=1 loops=1)
                     Index Cond: ((user_id = 83586) AND (added <= '2014-06-30 23:59:59.63551'::timestamp without time zone))
                     Buffers: shared hit=2 read=2
                     I/O Timings: read=12.782
               ->  Index Scan using user_feed_201403_user_id_added_idx on user_feed_201403 user_feed_master  (cost=0.00..4400.92 rows=1116 width=12) (actual time=16.959..16.959 rows=1 loops=1)
                     Index Cond: ((user_id = 83586) AND (added <= '2014-06-30 23:59:59.63551'::timestamp without time zone))
                     Buffers: shared hit=2 read=3
                     I/O Timings: read=16.921
               ->  Index Scan using user_feed_201404_user_id_added_idx on user_feed_201404 user_feed_master  (cost=0.00..5576.73 rows=1375 width=12) (actual time=15.534..15.534 rows=1 loops=1)
                     Index Cond: ((user_id = 83586) AND (added <= '2014-06-30 23:59:59.63551'::timestamp without time zone))
                     Buffers: shared hit=2 read=3
                     I/O Timings: read=15.485
               ->  Index Scan using user_feed_201405_user_id_added_idx on user_feed_201405 user_feed_master  (cost=0.00..2895.72 rows=714 width=12) (actual time=17.328..17.328 rows=1 loops=1)
                     Index Cond: ((user_id = 83586) AND (added <= '2014-06-30 23:59:59.63551'::timestamp without time zone))
                     Buffers: shared hit=2 read=3
                     I/O Timings: read=17.281
               ->  Index Scan using user_feed_201406_user_id_added_idx on user_feed_201406 user_feed_master  (cost=0.00..2781.28 rows=686 width=12) (actual time=21.064..21.276 rows=100 loops=1)
                     Index Cond: ((user_id = 83586) AND (added <= '2014-06-30 23:59:59.63551'::timestamp without time zone))
                     Buffers: shared hit=93 read=2
                     I/O Timings: read=21.015
 Total runtime: 107.797 ms
(44 rows)


Best regards,
Dmitriy Shalashov



Re: Partitioned tables and SELECT ... ORDER BY ... LIMIT

From
Jeff Janes
Date:
On Thu, Oct 16, 2014 at 5:35 AM, Дмитрий Шалашов <skaurus@gmail.com> wrote:
Hi,

let's imagine that we have some table, partitioned by a timestamp field, and we query it with a SELECT ordered by that field (DESC, for example), with some modest LIMIT.
Let's further say that the required number of rows is found in the first table the query encounters (say, the latest one).
I am just wondering: why does PostgreSQL nevertheless read a couple of buffers from each of the older tables?

The planner only does partition pruning statically, not dynamically. The LIMIT has to be implemented dynamically--it cannot prove absolutely that the "first" partition will have enough rows, so it cannot eliminate the others.

The "Merge Append" does a priority queue merge, and so needs to read the "first" row (according to the ORDER BY) from each partition in order to seed the priority queue.  I guess what it could be made to do in the case where there are suitable check constraints on a partition, is seed the priority queue with a dummy value constructed from the constraint.  If the merge never gets far enough to draw upon that dummy value, then that whole plan node never needs to get started up.

In your case that would save very little, as reading a few blocks for each partition is not much of a burden.  Especially as it is the same few blocks every time, so they should be well cached.  There may be other cases where this would be more helpful.  But it isn't clear to me how the planner could build such a feature into its cost estimates, and the whole thing would be a rather complex and esoteric optimization to make for uncertain gain.

Cheers,

Jeff

Re: Partitioned tables and SELECT ... ORDER BY ... LIMIT

From
Felipe Santos
Date:


Like Jeff said, it shouldn't be much of a burden.

If you think it is, then you can query only the last partition directly (since partitions are tables themselves).
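
For example, with the schema you posted (a sketch reusing your table and values):

SELECT url_id
FROM user_feed_201406
WHERE user_id = 83586
  AND added <= '2014-06-30 23:59:59.63551'
ORDER BY added DESC
LIMIT 100;

That touches only the newest partition's index and skips the per-partition probes entirely.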

It seems to me that your application is querying some recent data to show something to the user, and this approach would do for that purpose.

Re: Partitioned tables and SELECT ... ORDER BY ... LIMIT

From
Дмитрий Шалашов
Date:
Hi Jeff,

Thanks for clarifications!

In my case, yes, it's just a few blocks, but they are different blocks every time I change the user_id value in my WHERE clause. When I change user_id, the buffers are no longer "shared hit" in EXPLAIN. That is a bit more worrying.

But if there is no easy fix, well, OK.


Best regards,
Dmitriy Shalashov
