GIN fast-insert vs autovacuum scheduling - Mailing list pgsql-hackers

From Tom Lane
Subject GIN fast-insert vs autovacuum scheduling
Msg-id 29127.1237830982@sss.pgh.pa.us
List pgsql-hackers
I'm looking again at the fast-insert patch, and I find myself still
desperately unhappy about the mechanism for scheduling autovacuum
cleanup of pending insertions.  I complained about that before, but
I think I only cited a worry about adding overhead to statistics
tracking in order to have the "recently inserted tuples" counts.
It's got worse problems though:

1. The "recently inserted tuples" count is simply the wrong measurement
if the index is partial --- it could be a drastic overestimate.

2. Since the patch has pgstats unconditionally resetting the count to
zero after every vacuum, it's not safe for an index AM to use any other
cleanup policy except "flush all pending insertions on every vacuum".
This doesn't seem particularly optimal to me; isn't the idea to make
sure we insert lots of tuples at once?  Seems like if there's not very
much in the pending list it'd be better to do nothing.

3. Given that ginHeapTupleFastInsert forces a cleanup cycle whenever
the pending list gets too big, it's far from clear why we should have
to force autovacuum just because of pending list size at all.  I also
note that such cleanups aren't being accounted for in the "recently
inserted tuples" stat, anyhow.
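
To make point 1 concrete, here is a toy model in C, not PostgreSQL code. Every name in it (ToyPartialIndex, ToyTableStats, toy_insert, is_multiple_of_100) is hypothetical and invented for illustration. It assumes a table-level "recently inserted" counter that is bumped on every heap insert, while a partial index only queues the rows that satisfy its predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in PostgreSQL. */
typedef struct ToyPartialIndex
{
    bool   (*predicate)(int value); /* stand-in for the index's WHERE clause */
    size_t pending_tuples;          /* tuples actually queued in the pending list */
} ToyPartialIndex;

typedef struct ToyTableStats
{
    size_t recently_inserted;       /* table-level counter, bumped on every insert */
} ToyTableStats;

/* Example predicate: only one row in a hundred qualifies for the index. */
static bool
is_multiple_of_100(int value)
{
    return value % 100 == 0;
}

/* Insert one row: the stats counter always moves, the pending list only
 * grows when the row satisfies the partial index's predicate. */
static void
toy_insert(ToyTableStats *stats, ToyPartialIndex *idx, int value)
{
    assert(idx->predicate != NULL);

    stats->recently_inserted++;
    if (idx->predicate(value))
        idx->pending_tuples++;
}
```

Feeding the values 1 through 1000 into toy_insert leaves the table counter at 1000 while the pending list holds only 10 tuples, so a scheduler keyed on the counter would overestimate the pending work a hundredfold in this model.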

On top of those issues, there are implementation problems in the
proposed relation_has_pending_indexes() check: it has hard-wired
knowledge about GIN indexes, which means the feature cannot be
extended to add-on index AMs; and it's examining indexes without any
lock whatsoever on either the indexes or their parent table.  (And
we really would rather not let autovacuum take a lock here.)

So I'm fairly strongly tempted to just rip out the whole mechanism,
and rely on existing autovacuum rules plus the ginHeapTupleFastInsert-
driven cleanups.

The only case that I can see where this is really any step backwards
is that following a bulk insert operation, autovacuum will only think
it needs to ANALYZE the table, but we would like it to clean out the
pending insertion lists too.  But even then, the patch's mechanism
isn't all that desirable because it forces a useless VACUUM pass over
the heap.  ISTM what might be a better, more flexible approach is to
allow the amvacuumcleanup hook to be called at the end of ANALYZE too,
letting the index AM make its own decision about whether it needs
to do anything then.  A decision at that point could be made on the
actual size of the index's pending list, rather than any stats-driven
guess.
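
A minimal sketch of that idea in C follows. It is a self-contained toy, not the real index AM API: ToyIndex, toy_gin_analyze_cleanup, toy_analyze_finish, and the cleanup_threshold field are all hypothetical names, and the real amvacuumcleanup hook has a different signature. The sketch only shows the shape of the decision: the caller knows nothing about GIN, and each AM inspects its own pending list size when called at the end of ANALYZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in PostgreSQL. */
typedef struct ToyIndex
{
    size_t pending_pages;       /* actual size of the pending list */
    size_t cleanup_threshold;   /* assumed per-index tuning knob */
    /* Per-AM callback; NULL for AMs that have no pending-insertion concept. */
    bool   (*analyze_cleanup)(struct ToyIndex *index);
} ToyIndex;

/* GIN-like policy: do nothing while the pending list is small, so that a
 * later flush still gets to insert many tuples at once. */
static bool
toy_gin_analyze_cleanup(ToyIndex *index)
{
    if (index->pending_pages < index->cleanup_threshold)
        return false;

    index->pending_pages = 0;   /* stand-in for flushing into the main index */
    return true;
}

/* End-of-ANALYZE step: give every index AM the chance to clean up, with no
 * AM-specific knowledge in the caller. Returns how many indexes flushed. */
static size_t
toy_analyze_finish(ToyIndex *indexes, size_t nindexes)
{
    size_t cleaned = 0;

    assert(indexes != NULL || nindexes == 0);

    for (size_t i = 0; i < nindexes; i++)
    {
        if (indexes[i].analyze_cleanup != NULL &&
            indexes[i].analyze_cleanup(&indexes[i]))
            cleaned++;
    }
    return cleaned;
}
```

Because the decision lives behind a callback, an add-on index AM could supply its own policy, and an index with a short pending list is simply left alone.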

Comments?
        regards, tom lane

