Re: GIN fast insert - Mailing list pgsql-hackers

From Teodor Sigaev
Subject Re: GIN fast insert
Date
Msg-id 4992E86F.20702@sigaev.ru
Whole thread Raw
In response to Re: GIN fast insert  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: GIN fast insert
List pgsql-hackers
>> What would be wrong with letting it degrade to lossy?  I suppose the
>> reason it's trying to avoid that is to avoid having to recheck all the
>> rows on that page when it comes time to do the index insertion; but
>> surely having to do that is better than having arbitrary, unpredictable
>> failure conditions.
> 
> No, I don't think that's it.  See here, beginning with "the problem
> with lossy tbm has two aspects":
> 
> http://archives.postgresql.org/message-id/4974B002.3040202@sigaev.ru
Right. Some comments to that points:
 - amgettuple interface hasn't possibility to work with page-wide result instead   of exact ItemPointer. amgettuple can
notreturn just a block number as   amgetbitmap can.
 

It's not so difficult to teach GIN to return ItemPointer one by one from pending 
list and eliminate this point. GIN will not collect matched ItemPointers in 
tidbitmap and will return them immediately. But:
 - Because of concurrent vacuum process: while we scan pending list, it's   content could be transferred into regular
structureof index and then we will   find the same tuple twice. Again, amgettuple hasn't protection from that,   only
amgetbitmaphas it. So, we need to filter results from regular GIN   by results from pending list. ANd for filtering we
can'tuse lossy tbm.
 

Again, we talk about amgettuple interface. We need to filter results from 
regular GIN by results from pending list. Now GIN does it by lookup matched 
ItemPointer in tidbitmap constructed from pending list. We could use ordered 
array of ItemPointers instead of tidbitmap, but I don't believe that it will 
take less memory. It's impossible to rescan pending list for two reasons: a) too 
slow, b) concurrent cleanup process (vacuum or insert now), because they could 
move tuples into regular structure.

Is it acceptable to add option to tidbitmap which will forbid tidbitmap to 
become lossy? That removes disabling index scan in gincostestimate and checking 
of non-lossy tidbitmap.

In current version, cleanup of pending list starts much earlier than non-lossy 
limit is reached in typical use-cases. Insertion process will start cleanup with 
at least one fired trigger: - number of heap tuples in pending list could produce lossy tidbitmap - total size of
pendinglist is greater than work_mem. This trigger is 
 
developed to speedup bulk insertion (which is used in cleanup), because it will 
collect whole pending list in memory at once. And this trigger is more strict 
than first one because in typical use-case of GIN heap tuple is rather big.

I believe that user could get GIN's error about work_mem only intentionally:
- turn off autovacuum
- set big work_mem
- populate table with GIN index (by needed number of insertion)
- prepare query which will return a lot of results (possibly, with seqscan=off 
because cost of scan of pending list grows fast)
- decrease work_mem for at least ten times
- execute query


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


pgsql-hackers by date:

Previous
From: Teodor Sigaev
Date:
Subject: Re: [PATCHES] GIN improvements
Next
From: Robert Haas
Date:
Subject: Re: GIN fast insert