RE: Plans for solving the VACUUM problem - Mailing list pgsql-hackers
From: Mikheev, Vadim
Subject: RE: Plans for solving the VACUUM problem
Msg-id: 3705826352029646A3E91C53F7189E3201662C@sectorbase2.sectorbase.com
In response to: Plans for solving the VACUUM problem (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Plans for solving the VACUUM problem (Bruce Momjian <pgman@candle.pha.pa.us>)
           Re: Plans for solving the VACUUM problem (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
> I have been thinking about the problem of VACUUM and how we
> might fix it for 7.2. Vadim has suggested that we should
> attack this by implementing an overwriting storage manager
> and transaction UNDO, but I'm not totally comfortable with
> that approach: it seems to me that it's an awfully large
> change in the way Postgres works.

I'm not sure we should implement an overwriting smgr at all.
I was, and still am, going to solve the space-reuse problem with the
non-overwriting one, though I'm sure we'll have to reimplement it
(more than one table per data file, an FSM stored on disk, etc).

> Second: if VACUUM can run in the background, then there's no
> reason not to run it fairly frequently. In fact, it could become
> an automatically scheduled activity like CHECKPOINT is now,
> or perhaps even a continuously running daemon (which was the
> original conception of it at Berkeley, BTW).

And the original authors concluded that the daemon was very slow
in reclaiming dead space, BTW.

> 3. Lazy VACUUM processes a table in five stages:
>    A. Scan relation looking for dead tuples;...
>    B. Remove index entries for the dead tuples...
>    C. Physically delete dead tuples and compact free space...
>    D. Truncate any completely-empty pages at relation's end.
>    E. Create/update FSM entry for the table.
> ...
> If a tuple is dead, we care not whether its index entries are still
> around or not; so there's no risk to logical consistency.

What does this sentence mean? We canNOT remove a dead heap tuple
until we know that there are no index tuples referencing it, and your
A, B, C reflect this, so ..?

> Another place where lazy VACUUM may be unable to do its job completely
> is in compaction of space on individual disk pages. It can physically
> move tuples to perform compaction only if there are not currently any
> other backends with pointers into that page (which can be tested by
> looking to see if the buffer reference count is one).
> Again, we punt and leave the space to be compacted next time if
> we can't do it right away.

We could keep the shared buffer lock (or add some other kind of lock)
until the tuple is projected: after projection we no longer need to
read the fetched tuple's data from the shared buffer, and the time
between fetching a tuple and projecting it is very short, so keeping
the lock on the buffer would not hurt concurrency significantly.
Or we could register a cleanup callback function with the buffer, so
that bufmgr would call it when the refcount drops to 0.

> Presently, VACUUM deletes index tuples by doing a standard index
> scan and checking each returned index tuple to see if it points
> at any of the tuples to be deleted. If so, the index AM is called
> back to delete the tested index tuple. This is horribly inefficient:
> ...
> This is mainly a problem of a poorly chosen API. The index AMs
> should offer a "bulk delete" call, which is passed a sorted array
> of main-table TIDs. The loop over the index tuples should happen
> internally to the index AM.

I agree with others who think that the main cost of index cleanup is
reading all the index data pages just to remove some index tuples.
You yourself mentioned partial heap scanning, so for each scanned part
of the table you'll have to read all the index pages again and again:
a very good way to trash the buffer pool with big indices.
Well, it's probably OK for a first implementation, and you'll win some
CPU with "bulk delete" - I'm not sure how much, though - and there is a
more significant issue with index cleanup if the table is not locked
exclusively: a concurrent index scan returns a tuple (and unlocks the
index page), heap_fetch reads the table row and finds that it's dead;
now the index scan *must* find its current index tuple to continue, but
a background vacuum could already have removed that index tuple =>
elog(FATAL, "_bt_restscan: my bits moved...");

Two ways out: hold the index page lock until the heap tuple is checked,
or (rough schema) store info in shmem (just IndexTupleData.t_tid and a
flag) that an index tuple is in use by some scan, so that the cleaner
could change the stored TID (taking one from the previous index tuple)
and set the flag to help the scan restore its current position when it
returns. I'm particularly interested in discussing this issue because
it must be resolved for UNDO, and the chosen way will affect to what
extent we'll be able to implement dirty reads (the first way doesn't
allow implementing them in full - i.e. selects with joins - but is good
enough to resolve the RI constraints concurrency issue).

> There you have it. If people like this, I'm prepared to commit to
> making it happen for 7.2. Comments, objections, better ideas?

Well, my current TODO looks like (ORDER BY PRIORITY DESC):

1. UNDO;
2. New SMGR;
3. Space reusing.

and at this point I cannot commit to anything about 3. So, why not
refine vacuum if you want it? I, personally, was never able to
convince myself to spend time on this.

Vadim