Thread: GIN fast-insert vs autovacuum scheduling
I'm looking again at the fast-insert patch, and I find myself still
desperately unhappy about the mechanism for scheduling autovacuum
cleanup of pending insertions.  I complained about that before,
but I think I only cited a worry about adding overhead to statistics
tracking in order to have the "recently inserted tuples" counts.
It's got worse problems though:

1. The "recently inserted tuples" count is simply the wrong measurement
if the index is partial --- it could be a drastic overestimate.

2. Since the patch has pgstats unconditionally resetting the count to
zero after every vacuum, it's not safe for an index AM to use any other
cleanup policy except "flush all pending insertions on every vacuum".
This doesn't seem particularly optimal to me; isn't the idea to make
sure we insert lots of tuples at once?  Seems like if there's not very
much in the pending list it'd be better to do nothing.

3. Given that ginHeapTupleFastInsert forces a cleanup cycle whenever
the pending list gets too big, it's far from clear why we should have
to force autovacuum just because of pending list size at all.  I also
note that such cleanups aren't being accounted for in the "recently
inserted tuples" stat, anyhow.

On top of those issues, there are implementation problems in the
proposed relation_has_pending_indexes() check: it has hard-wired
knowledge about GIN indexes, which means the feature cannot be
extended to add-on index AMs; and it's examining indexes without any
lock whatsoever on either the indexes or their parent table.  (And
we really would rather not let autovacuum take a lock here.)

So I'm fairly strongly tempted to just rip out the whole mechanism,
and rely on existing autovacuum rules plus the
ginHeapTupleFastInsert-driven cleanups.  The only case that I can see
where this is really any step backwards is that following a bulk insert
operation, autovacuum will only think it needs to ANALYZE the table,
but we would like it to clean out the pending insertion lists too.
But even then, the patch's mechanism isn't all that desirable because
it forces a useless VACUUM pass over the heap.

ISTM what might be a better, more flexible approach is to allow the
amvacuumcleanup hook to be called at the end of ANALYZE too, letting
the index AM make its own decision about whether it needs to do
anything then.  A decision at that point could be made on the actual
size of the index's pending list, rather than any stats-driven guess.

Comments?

			regards, tom lane
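To make the post-ANALYZE idea concrete, here is a minimal sketch in plain C. It is not PostgreSQL code and it is not taken from the patch: the struct `PendingListState`, its fields, and the function `post_analyze_should_flush` are all hypothetical names invented for illustration. It only shows the shape of the decision, namely that the index AM looks at the actual size of its pending list and does nothing when there is not much to flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for what an index AM could learn by inspecting
 * its own pending list.  Neither the struct nor its fields exist in
 * PostgreSQL; they are assumptions made for this sketch.
 */
typedef struct PendingListState
{
    size_t pending_pages;         /* pages currently in the pending list */
    size_t flush_threshold_pages; /* below this, flushing is not worthwhile */
} PendingListState;

/*
 * Decide, at the end of ANALYZE, whether the pending list should be
 * flushed into the main index structure.  The decision uses the measured
 * list size, not a statistics-collector estimate of inserted tuples.
 */
bool
post_analyze_should_flush(const PendingListState *state)
{
    assert(state != NULL);

    /* A short list is left alone so that later inserts can be batched. */
    return state->pending_pages >= state->flush_threshold_pages;
}
```

With a threshold of 16 pages, a list of 3 pages would be left alone and a list of 64 pages would be flushed. The threshold itself is an assumption; the thread does not say what value, or even what unit, an AM would use.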
Tom Lane wrote:

> On top of those issues, there are implementation problems in the
> proposed relation_has_pending_indexes() check: it has hard-wired
> knowledge about GIN indexes, which means the feature cannot be
> extended to add-on index AMs; and it's examining indexes without any
> lock whatsoever on either the indexes or their parent table.  (And
> we really would rather not let autovacuum take a lock here.)

I wonder if it's workable to have GIN send pgstats a message with number
of fast-inserted tuples, and have autovacuum check that number as well
as dead/live tuples.

ISTM this shouldn't be considered part of either vacuum or analyze at
all, and have autovacuum invoke it separately from both, with its own
decision equations and such.  We could even have a scan over pg_class
just for GIN indexes to implement this.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> On top of those issues, there are implementation problems in the
>> proposed relation_has_pending_indexes() check:

> I wonder if it's workable to have GIN send pgstats a message with number
> of fast-inserted tuples, and have autovacuum check that number as well
> as dead/live tuples.

> ISTM this shouldn't be considered part of either vacuum or analyze at
> all, and have autovacuum invoke it separately from both, with its own
> decision equations and such.  We could even have a scan over pg_class
> just for GIN indexes to implement this.

That's going in the wrong direction IMHO, because it's building
GIN-specific infrastructure into the core system.  There is no need
for any such infrastructure if we just drive it off a post-ANALYZE
callback.

			regards, tom lane
On Mon, 2009-03-23 at 15:23 -0400, Tom Lane wrote:
> There is no need for any such infrastructure if we just drive it off a
> post-ANALYZE callback.

That sounds reasonable, although it does seem a little strange for
analyze to actually perform cleanup.

Now that we have FSM, the cost of VACUUMing insert-only tables is a lot
less.  Does that possibly justify running VACUUM on insert-only tables?
On tables without GIN indexes, that wouldn't be a complete waste,
because it could set hint bits, which needs to be done sometime anyway.

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2009-03-23 at 15:23 -0400, Tom Lane wrote:
>> There is no need for any such infrastructure if we just drive it off a
>> post-ANALYZE callback.

> That sounds reasonable, although it does seem a little strange for
> analyze to actually perform cleanup.

My thought was to have GIN do cleanup only in an autovacuum-driven
ANALYZE, not in a client-issued ANALYZE.  You could argue it either
way I suppose, but I agree that if a user says ANALYZE he's probably
not expecting index cleanup to happen.

> Now that we have FSM, the cost of VACUUMing insert-only tables is a lot
> less.

Well, not if you just did a huge pile of inserts, which is the case
that we need to worry about here.

> On tables without GIN indexes, that wouldn't be a complete waste,
> because it could set hint bits, which needs to be done sometime anyway.

True, but we have not chosen to make autovacuum do that, and whether
we should or not seems to me to be orthogonal to when GIN index cleanup
should happen.

			regards, tom lane
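The rule described in this last message (cleanup in an autovacuum-driven ANALYZE, none in a client-issued one) could be sketched roughly as follows. This is plain C for illustration, not PostgreSQL source: `AnalyzeOrigin`, `AnalyzeCleanupRequest`, `sketch_cleanup_after_analyze`, and the `min_tuples_to_flush` threshold are all hypothetical, and how a real callback would learn who issued the ANALYZE is not settled anywhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: who started the ANALYZE that is now finishing. */
typedef enum AnalyzeOrigin
{
    ANALYZE_BY_CLIENT,      /* user typed ANALYZE */
    ANALYZE_BY_AUTOVACUUM   /* launched by the autovacuum daemon */
} AnalyzeOrigin;

/* Hypothetical inputs an index AM would need to make its decision. */
typedef struct AnalyzeCleanupRequest
{
    AnalyzeOrigin origin;
    size_t        pending_tuples;       /* measured pending-list size */
    size_t        min_tuples_to_flush;  /* assumed AM-chosen threshold */
} AnalyzeCleanupRequest;

/*
 * Return true if the pending list should be cleaned up at the end of
 * this ANALYZE.  A client-issued ANALYZE never triggers cleanup; an
 * autovacuum-driven one does so only when the list is large enough.
 */
bool
sketch_cleanup_after_analyze(const AnalyzeCleanupRequest *req)
{
    assert(req != NULL);

    if (req->origin != ANALYZE_BY_AUTOVACUUM)
        return false;   /* the user asked for statistics, nothing more */

    return req->pending_tuples >= req->min_tuples_to_flush;
}
```

Keeping the origin check separate from the size check mirrors the point made above: whether a user-issued ANALYZE should clean up is a policy question that can be argued either way, while the size test is the AM's own business.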