Re: Batch update of indexes - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: Batch update of indexes
Date
Msg-id 570029B7.3020109@postgrespro.ru
Whole thread Raw
In response to Re: Batch update of indexes  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Batch update of indexes  (David Steele <david@pgmasters.net>)
List pgsql-hackers
On 04/02/2016 09:57 PM, Tom Lane wrote:
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
>> Attached please find patch for "ALTER INDEX ... WHERE ..." clause.
>> It is now able to handle all three possible situations:
>> 1. Making index partial (add WHERE condition to the ordinary index)
>> 2. Extend partial index range (less restricted index predicate)
>> 3. Arbitrary change of partial index predicate
> I've not been following this thread previously, but this proposal
> scares me quite a lot.  I am certain there are places in our code
> that assume that the properties of an index don't change after it's
> been created.  One area that this almost certainly breaks is HOT updates:
> adding a previously-unindexed column to an index predicate might break
> existing HOT chains, and I see nothing in this patch that could deal
> with that.  I seem to recall there are other places that would be
> broken by changing an index's DDL definition after creation, but can't
> recall specifics right now.
>
> I am also, frankly, not seeing a use-case for this functionality that
> would justify trying to find and remove those assumptions.
>
> There's a lot of things I don't care for about the way the patch is
> written, in particular its willingness to use SPI (which opens a lot of
> potential for search-path problems, failure to see uncommitted tuples,
> etc).  But we need not get to that if we don't believe the functionality
> can work.

Thank you for review, Tom.

I completely agree with all your arguments against this patch.
I have proposed this patch mostly as prove of concept.
Yes, I have not take in account hot updates and may be there are other possible issues which I not considered.

The main question is whether the proposed way of batch update of indexes is viable or it is conceptually wrong
approach
(because it beaks assumption that index properties can't be changed or because it is not convenient to use...).

I hope that everybody agree that maintaining of indexes is the main limiting factor for insert speed.
If table has no indexes, then insert speed can be as high as disk write speed (100Mb/sec or 1000 for SSD).
So if size of record is about 10 bytes, then we can get about 10 millions TPS.
But presence of indexes will dramatically change this picture: if database is large enough so that even index can not
fitin memory
 
and records are inserted in random key order, then each insert in index will require reading of 3-4 pages from random
locationson the disk.
 
With average HDD positioning time 10 msec, we get 100 reads per second and ... 20-30 TPS. It is just with one index.
If we have 10 indexes, then TPS can be less than fingers on a hand.

Certainly it is very pessimistic estimation.
But still it is true that we can not provide good insert speed if we have to update indexes immediately.
And without indexes we can not efficiently execute most of queries.

I do not see any way in Postgres to solve this problem now. The hack with creating materialized views requires a lot of
extratime and space.
 
It will not work for really large table.

So we need some way to postpone insertion of new records in the index. Then we can do such insertion in background or
inidle time (at night), try to use bulk insert if index implementation supports it (for example sorting records by key
beforeinsert can 
 
significantly increase locality and so improve speed of insert in index). But the principle moment here is that such
delayedupdate of index violates the main RDBMS rule that results of query execution with and without indexes should be
thesame. The trick 
 
with partial indexes allows to eliminate this contradiction. But it requires more actions from user. So are users ready
todo some exatra job just because of "idealogical" reasons? Because if user wants to have delayed update of indexes,
thenhe actually 
 
approves that it is ok for him that query results may not include some most recent updates.

Another aspect is which database objects are allowed to be altered and which not. Right now with tables we can alter
almosteverything.
 
With indexes - almost nothing. It is assumed that index can always be reconstructed. But for very big table
reconstructionof indexes from scratch will take unacceptable amount of time. So should we make it possible to alter
someindex characteristics 
 
which do not require to rebuild index from scratch (and it is definitely true for partial index predicate)? Or price of
supportingit is so high, that it can not be compensated by obtained benefits?
 

So what do you think?
1. Should I continue work in this direction and fix all possible issues with hot updates,... to make it possible to
alterpartial index predicates and support batch inserts i this way?
 
2. Or it is better to just add extra option to the index, allowing it to be slightly out-of-sync? It will allow, for
example,to eliminate pending list for GIN which can cause very significant degradation of query speed, while for most
full-textsearch 
 
engine it is acceptable that changes are not immediately visible.
Certainly much more work is required here except of just adding a new index option...
3. Or both of the approaches are wrong and we should leave everything as it is?
4. Or may be there is some other approach which is more acceptable?





>
>> This patch includes src/bin/insbench utility for testing insert
>> performance. It can be easily excluded from the patch to reduce it size.
> C++ is not likely to get accepted into our tree ...
>
>             regards, tom lane


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




pgsql-hackers by date:

Previous
From: "Shulgin, Oleksandr"
Date:
Subject: Re: More stable query plans via more predictable column statistics
Next
From: Tom Lane
Date:
Subject: Re: Add schema-qualified relnames in constraint error messages.