Re: Batch update of indexes - Mailing list pgsql-hackers

From Anastasia Lubennikova
Subject Re: Batch update of indexes
Date
Msg-id 569F980A.6080305@postgrespro.ru
Whole thread Raw
In response to Batch update of indexes  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Batch update of indexes  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
20.01.2016 12:28, Konstantin Knizhnik :<br /><blockquote cite="mid:569F5346.1010005@postgrespro.ru" type="cite"> Hi
hackers,<br/><br /> I want to know opinion of community about possible ways of solving quite common problem: increasing
insertspeed while still providing indexes for efficient execution of queries.<br /><br /> Many applications have to
dealwith high input stream of data. Most of the time while record inserting in the database is taken for update of
indexes.And without indexes we are not able to efficiently execute queries. <br /> So in many cases it is desirable to
have"batch or concurrent" index update. And it is acceptable that an index is slightly behind current state of the
table.<br/></blockquote><br /> Hi, I glad to see that you interested in that too.<br /> I think this is a good feature
andI think it will be very useful to have.<br /> I have already mentioned some related problems and possible
improvementsin my presentation.<br /><a
href="http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts"><aclass="moz-txt-link-freetext"
href="http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts">http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts</a></a><br
/>Last two slides concern to this thread. Briefly, I've suggested to think about insertion buffer. Something very
similarto it is already implemented in BRIN. It does not index last data from heap, while the number of last pages is
lessthan pages_per_block.<br /> The next point, I've thought about is a bulk update. Problem is that update like
"UPDATEmytable set a = a+1;" causes N searches from the root of B-tree. I looks very strange to me, and I'd like to fix
itsomehow. The obvious solution is to update all tuples on the page at a time, and keep the number of last updated
page. But, maybe it's a bit off-thread here.<br /><br /><blockquote cite="mid:569F5346.1010005@postgrespro.ru"
type="cite">One interesting approach of solving this problem is discussed in this article:<br /><br /><a
class="moz-txt-link-freetext"
href="https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb"
moz-do-not-send="true">https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb</a><br
/><br/> Them are using materialized views to build indexes in background.<br /> Interesting idea, but copying content
ofthe whole table just to be able to build index concurrently seems to be overkill.<br /></blockquote><br /> This
approachseems like a tricky crutch to me. And I agree that it requires a lot of extra work.<br /><br /><blockquote
cite="mid:569F5346.1010005@postgrespro.ru"type="cite"> I thought about more straightforward ways of solving this
problem.It will be nice if we can preserve of of them main postulates of Postgres and other RDBMSes: indexes are just
optimizationand result of query should not depend on presence of indexes.<br /><br /> First idea is to use inheritance.
Ihave investigated different ways of splitting table into "archival" and "operational" parts, but all of them requiring
physicalcopying of data from one table to another.<br /><br /></blockquote><blockquote
cite="mid:569F5346.1010005@postgrespro.ru"type="cite"> Another idea is to use partial indexes (<a
class="moz-txt-link-freetext"href="http://www.postgresql.org/docs/current/static/indexes-partial.html"
moz-do-not-send="true">http://www.postgresql.org/docs/current/static/indexes-partial.html</a>)<br/> Assume that we have
streamof input data where each record have increased timestamp:<br /><br />   create table t(<br />      ts timestamp
primarykey,<br />      c1 real,<br />      c2 integer,<br />      c3 varchar,<br />      ...<br />      cN char(5)<br
/>  );<br /><br /> We want to provide the highest insert speed for "t" but provide indexes for c1..cN fields.<br /> We
candeclared partial indexes:<br /><br />   create index idx1 on t(c1) where ts < '20/01/2016';<br />   create index
idx2on t(c2) where ts < '20/01/2016';<br />   ...<br />   create index idxN on t(cN) where ts < '20/01/2016';<br
/><br/> As far as this indexes do not cover current date, them will not be affected during insert operations.<br /> But
wecan still efficiently run queries like<br /><br />   select * from t where c1>100 and ts < '20/01/2016';<br
/><br/> Then, in background, may be at night, we can do<br /><br />   alter index idx1 where <font color="#ff0000">ts
<'</font><font color="#ff0000">21/01/2016'</font>;<br /></blockquote><br /> This idea sounds very interesting to me.
<br/><blockquote cite="mid:569F5346.1010005@postgrespro.ru" type="cite"><br /> Please notice that such alter table
statement,changing condition for partial index, is not supported now.<br /></blockquote><br /> Don't you think, that
thisfeature could be used in a very wrong way? Do not take it as criticism, just a bit of thoughts.<br /><br
/><blockquotecite="mid:569F5346.1010005@postgrespro.ru" type="cite"> But I do not see any principle problems with
supportingsuch construction.<br /> We should just include in the index all records which match new condition and do not
matchold condition:<br /><br />    ts < '21/01/2016' and not (ts < '20/01/2016')<br /><br /> If there is index
for"ts" field it can be done quite efficiently.<br /> This approach doesn't cause contradictions with concepts of
indexesin RDBMS.<br /><br /> But there is one more problem with this approach with I think should be addressed.<br />
Rightnow optimizer builds the following execution plan for query with partial indexes:<br /><br />  postgres=# explain
select* from t where c1 < 10 and ts < '20/01/2016'::timestamp;<br />
                                                 QUERY PLAN                                                   <br />
 ---------------------------------------------------------------------------------------------------------------<br/>
 BitmapHeap Scan on t  (cost=7.20..732.14 rows=12263 width=12)<br />    Recheck Cond: ((c1 < '10'::double precision)
AND(ts < '2016-01-20 00:00:00'::timestamp without time zone))<br />    ->  Bitmap Index Scan on idx1 
(cost=0.00..4.13rows=12263 width=0)<br />          Index Cond: (c1 < '10'::double precision)<br /> (4 rows)<br /><br
/>As you can see optimizer insert recheck in query execution plan while it is not needed in this case: search condition
isexactly the same as partial index condition.<br /> Optimal plan should be:<br /><br />    Index Scan using idx1 on t
(cost=0.00..4.13rows=12263 width=0)<br />    Index Cond: (c1 < '10'::double precision)<br /></blockquote><br />
Therewas the discussion of the patch for partial indexes.<br />  <a
href="http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html"><a
class="moz-txt-link-freetext"
href="http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html">http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html</a></a><br
/> Since I haven't watched it closely, It seems to be open still. I think it'll be interesting to you.<br /><br /><pre
class="moz-signature"cols="72">-- 
 
Anastasia Lubennikova
Postgres Professional: <a class="moz-txt-link-freetext"
href="http://www.postgrespro.com">http://www.postgrespro.com</a>
The Russian Postgres Company</pre>

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: GIN pending list clean up exposure to SQL
Next
From: Fujii Masao
Date:
Subject: removal of unused argument in ginInsertCleanup()