Re: FSM versus GIN pending list bloat - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: FSM versus GIN pending list bloat
Date
Msg-id CANP8+jJsjj8HOzVKbLB4+Bc+B1tkzymJf3O3K5BFS=zpXbTX1Q@mail.gmail.com
In response to Re: FSM versus GIN pending list bloat  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: FSM versus GIN pending list bloat  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On 4 August 2015 at 21:04, Jeff Janes <jeff.janes@gmail.com> wrote:
 
Couple of questions here...

* the docs say "it's desirable to have pending-list cleanup occur in the background", but there is no way to invoke that, except via VACUUM. I think we need a separate function to be able to call this as a background action. If we had that, we wouldn't need much else, would we?

I thought maybe the new bgworker framework would be a way to have a backend signal a bgworker to do the cleanup when it notices the pending list is getting large.  But that wouldn't directly fix this issue, because the bgworker still wouldn't recycle that space (without further changes); only vacuum workers do that currently.

But I don't think this could be implemented as an extension, because the signalling code has to be in core, so (not having studied the matter at all) I don't know if it is a good fit for bgworker.

We need to expose two functions:

1. a function to perform the recycling directly (BRIN has an equivalent function)

2. a function to see how big the pending list is for a particular index, i.e. do we need to run function 1? 

We can then build a bgworker that polls the pending list and issues a recycle if and when needed - which is how autovac started.
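As a concrete sketch of the interface being proposed, the two functions might look like this from SQL (the names below are illustrative, not an existing API; the BRIN precedent, brin_summarize_new_values(), does exist):

```sql
-- Function 2 (hypothetical name): how big is the pending list
-- for a particular GIN index?
SELECT gin_pending_list_size('my_gin_index'::regclass);

-- Function 1 (hypothetical name): recycle the pending list
-- directly, analogous to BRIN's brin_summarize_new_values().
SELECT gin_clean_pending_list('my_gin_index'::regclass);
```

A polling bgworker would then just loop over GIN indexes, call the first function, and invoke the second whenever the size crosses a threshold.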
 
* why do we have two parameters: gin_pending_list_limit and fastupdate? What happens if we set gin_pending_list_limit but don't set fastupdate?

Fastupdate is on by default.  If it were turned off, then gin_pending_list_limit would be mostly irrelevant for those tables. Fastupdate could have been implemented as a magic value (0 or -1) for gin_pending_list_limit but that would break backwards compatibility (and arguably would not be a better way of doing things, anyway).
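For concreteness, the two knobs are set independently as storage parameters (table and column names here are illustrative; the WITH syntax and parameter names are standard PostgreSQL, with gin_pending_list_limit new in 9.5):

```sql
-- fastupdate toggles whether the pending list is used at all;
-- gin_pending_list_limit caps its size, in kilobytes.
CREATE INDEX idx_docs_fts ON docs USING gin (tsv)
  WITH (fastupdate = on, gin_pending_list_limit = 1024);

-- With fastupdate = off, new entries go straight into the main
-- index structure, so the limit becomes mostly irrelevant:
ALTER INDEX idx_docs_fts SET (fastupdate = off);
```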
 
* how do we know how to set that parameter? Is there a way of knowing gin_pending_list_limit has been reached?

I don't think there is an easy answer to that.  The trade-offs are complex and depend on things like how well cached the parts of the index needing insertions are, how many lexemes/array elements are in an average document, and how many documents inserted near the same time as each other share lexemes in common.  And of course they depend on what you need to optimize for: latency or throughput, and if latency, search latency or insert latency.

So we also need a way to count the number of times the pending list is flushed. Perhaps record that on the metapage, so we can see how often it has happened, and add another function to view those stats.
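Some of that visibility already exists: the pgstattuple extension's pgstatginindex() reports the current pending-list size (though not a flush counter, which would indeed need new metapage bookkeeping). The index name below is illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Returns version, pending_pages, and pending_tuples for a GIN index.
SELECT * FROM pgstatginindex('my_gin_index'::regclass);
```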

This and the OP seem like 9.5 open items to me.

I don't think so.  Freeing gin_pending_list_limit from being forcibly tied to work_mem is a good thing.  Even if I don't know exactly how to set gin_pending_list_limit, I know I don't want it to be 4GB just because work_mem was set there for some temporary reason.  I'm happy to leave it at its default and let its fine tuning be a topic for people who really care about every microsecond of performance.

OK, I accept this.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
