Re: [HACKERS] Secondary index access optimizations - Mailing list pgsql-hackers

From: Antonin Houska
Subject: Re: [HACKERS] Secondary index access optimizations
Date:
Msg-id: 25337.1515516505@localhost
In response to: Re: [HACKERS] Secondary index access optimizations (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses: Re: [HACKERS] Secondary index access optimizations
List: pgsql-hackers
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

> On 14.08.2017 19:33, Konstantin Knizhnik wrote:
>
>  On 14.08.2017 12:37, Konstantin Knizhnik wrote:
>
>  Hi hackers,
>
>  I am trying to compare different ways of optimizing work with huge append-only tables in PostgreSQL, where the
>  primary key is something like a timestamp and queries usually access the most recent data through some secondary
>  keys. The size of the secondary index is one of the most critical factors limiting insert/search performance.
>  Since data is inserted in timestamp-ascending order, access to the primary key is well localized and the accessed
>  pages stay in memory. But if we create a secondary index over the whole table, access to it requires random reads
>  from disk and significantly decreases performance.
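>
>  For illustration, a minimal sketch of such a schema (table and column names here are hypothetical):
>
>  create table events (ts timestamptz primary key, device_id integer, payload text);
>  create index on events(device_id);   -- secondary index whose random I/O becomes the bottleneck
>  -- data arrives in ts order; queries mostly touch recent data, e.g.:
>  -- select * from events where device_id = 42 and ts >= now() - interval '1 hour';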
>
>  There are two well-known solutions to this problem:
>  1. Table partitioning
>  2. Partial indexes
>
>  These are the approaches I want to compare. First of all, I want to check whether the optimizer is able to generate
>  an efficient query execution plan covering different time intervals.
>  Unfortunately, in both cases the generated plan is not optimal.
>
>  1. Table partitioning:
>
>  create table base (k integer primary key, v integer);
>  create table part1 (check (k between 1 and 10000)) inherits (base);
>  create table part2 (check (k between 10001 and 20000)) inherits (base);
>  create index pi1 on part1(v);
>  create index pi2 on part2(v);
>  insert into part1 values (generate_series(1,10000), random());
>  insert into part2 values (generate_series(10001,20000), random());
>  explain select * from base where k between 1 and 20000 and v = 100;
>  QUERY PLAN
>  -----------------------------------------------------------------------
>   Append  (cost=0.00..15.65 rows=3 width=8)
>     ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>           Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
>     ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
>           Index Cond: (v = 100)
>           Filter: ((k >= 1) AND (k <= 20000))
>     ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>           Index Cond: (v = 100)
>           Filter: ((k >= 1) AND (k <= 20000))
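>
>  For comparison, a hypothetical narrower query (not part of the original test) for which constraint exclusion can
>  prune part2, while the empty parent is still scanned:
>
>  explain select * from base where k between 1 and 5000 and v = 100;
>  -- part2 should be excluded because its check constraint contradicts k <= 5000;
>  -- only the (empty) parent and part1 remain under the Append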
>
>  Questions:
>  - Is there some way to avoid the sequential scan of the parent table? Yes, it is empty, so the sequential scan will
>  not take much time, but it still requires some additional work and thus increases query execution time.
>  - Why does the index scan on a partition's index include the filter condition when the partition's check constraint
>  guarantees that all records of the partition match the search predicate?
>
>  2. Partial indexes:
>
>  create table t (k integer primary key, v integer);
>  insert into t values (generate_series(1,20000),random());
>  create index i1 on t(v) where k between 1 and 10000;
>  create index i2 on t(v) where k between 10001 and 20000;
>  postgres=# explain select * from t where k between 1 and 10000 and v = 100;
>  QUERY PLAN
>  ------------------------------------------------------------
>   Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
>     Index Cond: (v = 100)
>  (2 rows)
>
>  Here we get the perfect plan. Let's try to extend the search interval:
>
>  postgres=# explain select * from t where k between 1 and 20000 and v = 100;
>  QUERY PLAN
>  ------------------------------------------------------------------
>   Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
>     Index Cond: ((k >= 1) AND (k <= 20000))
>     Filter: (v = 100)
>  (3 rows)
>
>  Unfortunately, in this case Postgres is not able to apply the partial indexes: the query's range does not imply
>  either index predicate by itself, and the planner will not combine several partial indexes to cover it.
>  And this is what I expected to get:
>
>  postgres=# explain select * from t where k between 1 and 10000 and v = 100 union all select * from t where k between 10001 and 20000 and v = 100;
>  QUERY PLAN
>  ----------------------------------------------------------------------
>   Append  (cost=0.29..14.58 rows=2 width=8)
>     ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
>           Index Cond: (v = 100)
>     ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
>           Index Cond: (v = 100)
>
>  I wonder if there are some fundamental problems with supporting these two things in the optimizer:
>  1. Remove a search condition on the primary key if it is fully satisfied by the derived table's check constraint.
>  2. Append index scans of several partial indexes if the specified interval is covered by their conditions.
>
>  I wonder if someone who is familiar with this part of the optimizer can easily fix it.
>  Otherwise I am going to spend some time on solving these problems (if the community thinks that such optimizations
>  would be useful).
>
>  Replying to myself: the following small patch removes redundant checks from index scans for derived tables:
>
>  diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
>  index 939045d..1f7c9cf 100644
>  --- a/src/backend/optimizer/util/plancat.c
>  +++ b/src/backend/optimizer/util/plancat.c
>  @@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
>      if (predicate_refuted_by(safe_constraints, rel->baserestrictinfo, false))
>          return true;
>
>  +    /*
>  +     * Remove from restriction list items implied by table constraints
>  +     */
>  +    safe_restrictions = NULL;
>  +    foreach(lc, rel->baserestrictinfo)
>  +    {
>  +        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
>  +        if (!predicate_implied_by(list_make1(rinfo->clause), safe_constraints, false)) {
>  +            safe_restrictions = lappend(safe_restrictions, rinfo);
>  +        }
>  +    }
>  +    rel->baserestrictinfo = safe_restrictions;
>  +
>  +
>      return false;
>   }
>
>  =================================================
>
>  I am not sure whether this is the right place to adjust the restriction list, or whether such a transformation is
>  always correct. But for the queries I have tested, the produced plans are correct:
>
>  postgres=# explain select * from base where k between 1 and 20000 and v = 100;
>  QUERY PLAN
>  -----------------------------------------------------------------------
>   Append  (cost=0.00..15.64 rows=3 width=8)
>     ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>           Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
>     ->  Index Scan using pi1 on part1  (cost=0.29..8.30 rows=1 width=8)
>           Index Cond: (v = 100)
>     ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>           Index Cond: (v = 100)
>  (7 rows)
>
>  postgres=# explain select * from base where k between 1 and 15000 and v = 100;
>  QUERY PLAN
>  -----------------------------------------------------------------------
>   Append  (cost=0.00..15.64 rows=3 width=8)
>     ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>           Filter: ((k >= 1) AND (k <= 15000) AND (v = 100))
>     ->  Index Scan using pi1 on part1  (cost=0.29..8.30 rows=1 width=8)
>           Index Cond: (v = 100)
>     ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>           Index Cond: (v = 100)
>           Filter: (k <= 15000)
>  (8 rows)
>
> There is one more problem, with the predicate_implied_by function and standard Postgres partitions.
> Right now Postgres specifies constraints for range partitions using intervals with an open upper boundary:
>
> postgres=# create table bt (k integer, v integer) partition by range (k);
> postgres=# create table dt1 partition of bt for values from (1) to (10001);
> postgres=# create table dt2 partition of bt for values from (10001) to (20001);
> postgres=# create index dti1 on dt1(v);
> postgres=# create index dti2 on dt2(v);
> postgres=# insert into bt values (generate_series(1,20000), random());
>
> postgres=# \d+ dt2
>                                     Table "public.dt2"
>  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> --------+---------+-----------+----------+---------+---------+--------------+-------------
>  k      | integer |           |          |         | plain   |              |
>  v      | integer |           |          |         | plain   |              |
> Partition of: bt FOR VALUES FROM (10001) TO (20001)
> Partition constraint: ((k IS NOT NULL) AND (k >= 10001) AND (k < 20001))
> Indexes:
>     "dti2" btree (v)
>
> If I now run the query with the predicate "between 1 and 20000", I still see an extra check in the plan:
>
> postgres=# explain select * from bt where k between 1 and 20000 and v = 100;
> QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..15.63 rows=2 width=8)
>    ->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: (k <= 20000)
> (6 rows)
>
> It is because operator_predicate_proof is not able to understand that k < 20001 and k <= 20000 are equivalent for
> the integer type.

> If I rewrite the query as (k >= 1 and k < 20001), then the plan is optimal:
>
> postgres=# explain select * from bt where k >= 1 and k < 20001 and v = 100;
> QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..15.63 rows=2 width=8)
>    ->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
>          Index Cond: (v = 100)
> (5 rows)
>
> It is fixed by one more patch to predtest.c:

Have you considered using range types (internally in operator_predicate_proof()) instead of hard-coding operator
OIDs? The range types do have the knowledge that k < 20001 and k <= 20000 are equivalent for the integer type:

postgres=# SELECT int4range '(, 20001)' = int4range '(, 20000]';
 ?column?
----------
 t
(1 row)
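
Incidentally, what makes this work is that discrete ranges are canonicalized, so an inclusive upper bound is
normalized to the equivalent exclusive one:

postgres=# SELECT int4range '[1,20000]';
 int4range
-----------
 [1,20001)
(1 row)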


--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

