Home > mailing lists
Re: Batch update of indexes - Mailing list pgsql-hackers

From	Konstantin Knizhnik
Subject	Re: Batch update of indexes
Date	January 26, 2016 18:13:33
Msg-id	56A78D16.8010607@postgrespro.ru Whole thread Raw
In response to	Batch update of indexes (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List	pgsql-hackers
Tree view
Hi hackers,<br /><br /> I have implemented "ALTER INDEX ... WHERE ..." clause allowing to change condition for partial
index.<br/> Actually it allows us to append index without fully rebuilding it.<br /> As I explained in the previous
mails,partial indexes can be used to increase insert speed. <br /> Right now I get the following results (with one
insertstream):<br /><br /> Insert with 1 index (primary key, monotonically ascending):                  324275 TPS<br
/>Insert with 9 indexes (primary key + 8 indexes with random keys):         52495 TPS<br /> Insert with primary key and
8concurrently updated partial indexes:     194458 TPS<br /> Insert with primary key and 8 "frozen" partial
indexes:                         278446 TPS<br /><br /> So, as you can see insert with indexes is about 6 times slower
thaninsert without indexes.<br /> And partial indexes allows to eliminate  this gap.<br /> When partial indexes are not
affected(assuming that them will be reconstructed "at night"), <br /> performance is almost the same, as without
indexes.<br/> And if "ALTER INDEX" is done concurrently with inserts, it certainly decrease insert speed, <br /> but
stillit is 4 times faster than with normal indexes.<br /><br /> Such high TPS values were obtained using "insert from
select"to bypass libpq overhead.<br /> With libpq (when each insert is sent as independent statement) results are less
impressive:<br/><br /> Insert with 1 index (primary key, monotonically ascending):                  37892 TPS<br />
Insertwith 9 indexes (primary key + 8 indexes with random keys):       20231 TPS<br /> Insert with primary key and 8
concurrentlyupdated partial indexes:     26934 TPS<br /> Insert with primary key and 8 "frozen" partial
indexes:                         28863 TPS<br /><br /> But still partial indexes allows to almost eliminate two times
differences...<br/><br /> This results can be reproduced using our public repository: <br /><a
class="moz-txt-link-freetext"
href="https://github.com/postgrespro/postgres_cluster">https://github.com/postgrespro/postgres_cluster</a><br/><br />
Mostof the code related with support of "ALTER INDEX .. WHERE" is in AlterIndex function in <br />
postgres_cluster/src/backend/commands/indexcmds.c<br/> I have also added insbench utility for measuring insert
performance,using which this results were obtained.<br /> It is located in postgres_cluster/src/bin/insbench
directory.<br/><br /> Known issues:<br /> 1. I do not handle case when new condition for partial index is more
restrictedthan original.<br /> There is no way in Postgres to exclude records from index (except VACUUM), so in this
caseindex has to be reconstructed from scratch.<br /> 2. Currently I am using SPI to locate records which should be
includedin index.<br /> 3. I am not  completely sure that  there are no synchronization/isolation problems in
AlterIndexfunction<br /><br /> If this approach is considered to be interesting by community, I will try to address
theseissues.<br /><br /><br /><div class="moz-cite-prefix">On 20.01.2016 12:28, Konstantin Knizhnik wrote:<br
/></div><blockquotecite="mid:569F5346.1010005@postgrespro.ru" type="cite"> Hi hackers,<br /><br /> I want to know
opinionof community about possible ways of solving quite common problem: increasing insert speed while still providing
indexesfor efficient execution of queries.<br /><br /> Many applications have to deal with high input stream of data.
Mostof the time while record inserting in the database is taken for update of indexes. And without indexes we are not
ableto efficiently execute queries. <br /> So in many cases it is desirable to have "batch or concurrent" index update.
Andit is acceptable that an index is slightly behind current state of the table.<br /><br /> One interesting approach
ofsolving this problem is discussed in this article:<br /><br /><a class="moz-txt-link-freetext"
href="https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb"
moz-do-not-send="true">https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb</a><br
/><br/> Them are using materialized views to build indexes in background.<br /> Interesting idea, but copying content
ofthe whole table just to be able to build index concurrently seems to be overkill.<br /><br /> I thought about more
straightforwardways of solving this problem. It will be nice if we can preserve of of them main postulates of Postgres
andother RDBMSes: indexes are just optimization and result of query should not depend on presence of indexes.<br /><br
/>First idea is to use inheritance. I have investigated different ways of splitting table into "archival" and
"operational"parts, but all of them requiring physical copying of data from one table to another.<br /><br /> Another
ideais to use partial indexes (<a class="moz-txt-link-freetext"
href="http://www.postgresql.org/docs/current/static/indexes-partial.html"
moz-do-not-send="true">http://www.postgresql.org/docs/current/static/indexes-partial.html</a>)<br/> Assume that we have
streamof input data where each record have increased timestamp:<br /><br />   create table t(<br />      ts timestamp
primarykey,<br />      c1 real,<br />      c2 integer,<br />      c3 varchar,<br />      ...<br />      cN char(5)<br
/>  );<br /><br /> We want to provide the highest insert speed for "t" but provide indexes for c1..cN fields.<br /> We
candeclared partial indexes:<br /><br />   create index idx1 on t(c1) where ts < '20/01/2016';<br />   create index
idx2on t(c2) where ts < '20/01/2016';<br />   ...<br />   create index idxN on t(cN) where ts < '20/01/2016';<br
/><br/> As far as this indexes do not cover current date, them will not be affected during insert operations.<br /> But
wecan still efficiently run queries like<br /><br />   select * from t where c1>100 and ts < '20/01/2016';<br
/><br/> Then, in background, may be at night, we can do<br /><br />   alter index idx1 where <font color="#ff0000">ts
<'</font><font color="#ff0000">21/01/2016'</font>;<br /><br /> Please notice that such alter table statement,
changingcondition for partial index, is not supported now.<br /> But I do not see any principle problems with
supportingsuch construction.<br /> We should just include in the index all records which match new condition and do not
matchold condition:<br /><br />    ts < '21/01/2016' and not (ts < '20/01/2016')<br /><br /> If there is index
for"ts" field it can be done quite efficiently.<br /> This approach doesn't cause contradictions with concepts of
indexesin RDBMS.<br /><br /> But there is one more problem with this approach with I think should be addressed.<br />
Rightnow optimizer builds the following execution plan for query with partial indexes:<br /><br />  postgres=# explain
select* from t where c1 < 10 and ts < '20/01/2016'::timestamp;<br />
                                                 QUERY PLAN                                                   <br />
 ---------------------------------------------------------------------------------------------------------------<br/>
 BitmapHeap Scan on t  (cost=7.20..732.14 rows=12263 width=12)<br />    Recheck Cond: ((c1 < '10'::double precision)
AND(ts < '2016-01-20 00:00:00'::timestamp without time zone))<br />    ->  Bitmap Index Scan on idx1 
(cost=0.00..4.13rows=12263 width=0)<br />          Index Cond: (c1 < '10'::double precision)<br /> (4 rows)<br /><br
/>As you can see optimizer insert recheck in query execution plan while it is not needed in this case: search condition
isexactly the same as partial index condition.<br /> Optimal plan should be:<br /><br />    Index Scan using idx1 on t
(cost=0.00..4.13rows=12263 width=0)<br />    Index Cond: (c1 < '10'::double precision)<br /><br /><br /> What do you
thinkabout this approach? Will it be useful to work in this direction?<br /> Or there are some better solutions for the
problem?<br/><br /><pre class="moz-signature" cols="72">-- 
 
Konstantin Knizhnik
Postgres Professional: <a class="moz-txt-link-freetext" href="http://www.postgrespro.com"
moz-do-not-send="true">http://www.postgrespro.com</a>
The Russian Postgres Company </pre></blockquote><br /><pre class="moz-signature" cols="72">-- 
Konstantin Knizhnik
Postgres Professional: <a class="moz-txt-link-freetext"
href="http://www.postgrespro.com">http://www.postgrespro.com</a>
The Russian Postgres Company </pre>
pgsql-hackers by date:
From: Yury Zhuravlev
Date: 26 January 2016, 18:07:16
Subject: Proposal:Use PGDLLEXPORT for libpq
From: Craig Ringer
Date: 26 January 2016, 18:43:26
Subject: Re: pglogical - logical replication contrib module
Re: Batch update of indexes - Mailing list pgsql-hackers

Previous

Next