Home > mailing lists

Batch update of indexes - Mailing list pgsql-hackers

From	Konstantin Knizhnik
Subject	Batch update of indexes
Date	January 20, 2016 12:28:45
Msg-id	569F5346.1010005@postgrespro.ru Whole thread Raw
Responses	Re: Batch update of indexes (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>) Re: Batch update of indexes (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>) Re: Batch update of indexes (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers

Tree view

Hi hackers,<br /><br /> I want to know opinion of community about possible ways of solving quite common problem:
increasinginsert speed while still providing indexes for efficient execution of queries.<br /><br /> Many applications
haveto deal with high input stream of data. Most of the time while record inserting in the database is taken for update
ofindexes. And without indexes we are not able to efficiently execute queries. <br /> So in many cases it is desirable
tohave "batch or concurrent" index update. And it is acceptable that an index is slightly behind current state of the
table.<br/><br /> One interesting approach of solving this problem is discussed in this article:<br /><br /><a
class="moz-txt-link-freetext"
href="https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb">https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb</a><br
/><br/> Them are using materialized views to build indexes in background.<br /> Interesting idea, but copying content
ofthe whole table just to be able to build index concurrently seems to be overkill.<br /><br /> I thought about more
straightforwardways of solving this problem. It will be nice if we can preserve of of them main postulates of Postgres
andother RDBMSes: indexes are just optimization and result of query should not depend on presence of indexes.<br /><br
/>First idea is to use inheritance. I have investigated different ways of splitting table into "archival" and
"operational"parts, but all of them requiring physical copying of data from one table to another.<br /><br /> Another
ideais to use partial indexes (<a class="moz-txt-link-freetext"
href="http://www.postgresql.org/docs/current/static/indexes-partial.html">http://www.postgresql.org/docs/current/static/indexes-partial.html</a>)<br
/>Assume that we have stream of input data where each record have increased timestamp:<br /><br />   create table t(<br
/>     ts timestamp primary key,<br />      c1 real,<br />      c2 integer,<br />      c3 varchar,<br />      ...<br />
    cN char(5)<br />   );<br /><br /> We want to provide the highest insert speed for "t" but provide indexes for
c1..cNfields.<br /> We can declared partial indexes:<br /><br />   create index idx1 on t(c1) where ts <
'20/01/2016';<br/>   create index idx2 on t(c2) where ts < '20/01/2016';<br />   ...<br />   create index idxN on
t(cN)where ts < '20/01/2016';<br /><br /> As far as this indexes do not cover current date, them will not be
affectedduring insert operations.<br /> But we can still efficiently run queries like<br /><br />   select * from t
wherec1>100 and ts < '20/01/2016';<br /><br /> Then, in background, may be at night, we can do<br /><br />  
alterindex idx1 where <font color="#ff0000">ts < '</font><font color="#ff0000">21/01/2016'</font>;<br /><br />
Pleasenotice that such alter table statement, changing condition for partial index, is not supported now.<br /> But I
donot see any principle problems with supporting such construction.<br /> We should just include in the index all
recordswhich match new condition and do not match old condition:<br /><br />    ts < '21/01/2016' and not (ts <
'20/01/2016')<br/><br /> If there is index for "ts" field it can be done quite efficiently.<br /> This approach doesn't
causecontradictions with concepts of indexes in RDBMS.<br /><br /> But there is one more problem with this approach
withI think should be addressed.<br /> Right now optimizer builds the following execution plan for query with partial
indexes:<br/><br />  postgres=# explain select * from t where c1 < 10 and ts < '20/01/2016'::timestamp;<br />
                                                 QUERY PLAN                                                   <br />
 ---------------------------------------------------------------------------------------------------------------<br/>
 BitmapHeap Scan on t  (cost=7.20..732.14 rows=12263 width=12)<br />    Recheck Cond: ((c1 < '10'::double precision)
AND(ts < '2016-01-20 00:00:00'::timestamp without time zone))<br />    ->  Bitmap Index Scan on idx1 
(cost=0.00..4.13rows=12263 width=0)<br />          Index Cond: (c1 < '10'::double precision)<br /> (4 rows)<br /><br
/>As you can see optimizer insert recheck in query execution plan while it is not needed in this case: search condition
isexactly the same as partial index condition.<br /> Optimal plan should be:<br /><br />    Index Scan using idx1 on t
(cost=0.00..4.13rows=12263 width=0)<br />    Index Cond: (c1 < '10'::double precision)<br /><br /><br /> What do you
thinkabout this approach? Will it be useful to work in this direction?<br /> Or there are some better solutions for the
problem?<br/><br /><pre class="moz-signature" cols="72">-- 
 
Konstantin Knizhnik
Postgres Professional: <a class="moz-txt-link-freetext"
href="http://www.postgrespro.com">http://www.postgrespro.com</a>
The Russian Postgres Company </pre>

pgsql-hackers by date:

From: Pavel Stehule
Date: 20 January 2016, 12:20:47
Subject: Re: Why format() adds double quote?

From: Etsuro Fujita
Date: 20 January 2016, 12:53:16
Subject: Re: Odd behavior in foreign table modification (Was: Re: Optimization for updating foreign tables in Postgres FDW)

Batch update of indexes - Mailing list pgsql-hackers

Previous

Next