Re: optimizing vacuum truncation scans - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: optimizing vacuum truncation scans
Date
Msg-id CAMkU=1yY_7jjhvSZ-SsA2MVn1+nSYo3VNRgMVQmr_XGrmzvxeQ@mail.gmail.com
In response to Re: optimizing vacuum truncation scans  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: optimizing vacuum truncation scans  (Haribabu Kommi <kommi.haribabu@gmail.com>)
Re: optimizing vacuum truncation scans  (Amit Kapila <amit.kapila16@gmail.com>)
Re: optimizing vacuum truncation scans  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, May 26, 2015 at 12:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:

    For that matter, why do we scan backwards anyway? The comments don't
    explain it, and we have nonempty_pages as a starting point, so why
    don't we just scan forward? I suspect that eons ago we didn't have
    that and just blindly reverse-scanned until we finally hit a
    non-empty buffer...


nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed ahead of the truncation scan.  But maybe the combination of the
two?  If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets the VM bit back on, so once cleared it
would stay cleared.

Right.

In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.

Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.

Right now all the interlocks to make this work seem to be in place (only vacuum and startup can set visibility map bits, and only one vacuum can be in a table at a time).  But as far as I can tell, those assumptions are not "baked in" and we have pondered loosening them before.

For example, letting HOT cleanup mark a page as all-visible if it finds it to be such.  In that specific case it would be OK, as HOT cleanup would not cause a page to become empty, and so the page would have to be below nonempty_pages.  (Or could it?  If an insert on a table with no indexes was rolled back, and HOT cleanup found it and cleaned it up, the page could conceptually become empty, unless we add special code to prevent it.)  But there may be other cases.

And I know other people have mentioned making VACUUM concurrent (although I don't see the value in that myself).

So doing it this way would be hard to beat (scanning a bitmap vs the table itself), but it would also introduce a modularity violation that I am not sure is worth it.  

Of course this could always be reverted if its requirements became a problem for a more important change (assuming of course that we detected the problem).

Attached is a patch that implements the vm scan for truncation.  It introduces a variable to hold the last blkno which was skipped during the forward portion.  Any blocks after both this blkno and the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum.  Any other process rendering such a page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime.  So if the bit is still set, the page is still empty without needing to inspect it.
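
To make the bookkeeping concrete, here is a toy Python model of the forward pass.  It is only a sketch, not the patch: the names forward_pass, first_verified_empty and last_skipped are made up for illustration, a relation is modelled as a list of booleans, and it assumes that vacuum sets the vm bit on every empty page it inspects and skips every block whose bit is already set (the real code is pickier about when it skips).

```python
def forward_pass(pages, vm_all_visible):
    """Toy model of vacuum's forward heap scan.

    pages[i] is True when block i holds tuples; vm_all_visible[i] is that
    block's visibility map bit. Returns (nonempty_pages, last_skipped),
    with last_skipped == -1 when no block was skipped.
    """
    nonempty_pages = 0   # one past the last block seen to hold tuples
    last_skipped = -1    # hypothetical name: last block skipped via the vm
    for blkno, has_tuples in enumerate(pages):
        if vm_all_visible[blkno]:
            # Bit was set by an earlier vacuum; this vacuum never looks inside.
            last_skipped = blkno
            continue
        if has_tuples:
            nonempty_pages = blkno + 1
        else:
            # Assumption of this model: an inspected empty page gets its bit set.
            vm_all_visible[blkno] = True
    return nonempty_pages, last_skipped


def first_verified_empty(nonempty_pages, last_skipped):
    """First block from which every later block was seen empty by this vacuum."""
    return max(nonempty_pages, last_skipped + 1)
```

For a six-block relation where block 4 was skipped and block 2 was the last one found to hold tuples, only block 5 is covered by the guarantee, because block 4 was never inspected by this vacuum.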

There is still the case of pages which had their visibility bit set by a prior vacuum and then were not inspected by the current one.  Once the truncation scan runs into these pages, it falls back to the previous behavior of reading block by block backwards.  So there could still be reason to optimize that fallback using forward-reading prefetch.
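
The backward scan with its fallback might be modelled like this.  Again this is an illustrative Python sketch under assumptions, not the patch itself: truncation_scan and verified_from are hypothetical names, verified_from stands for the first block that the current vacuum is known to have seen empty, and the model assumes that a block whose vm bit has been cleared is simply read like any other.

```python
def truncation_scan(pages, vm_all_visible, verified_from):
    """Toy model of the backward truncation scan.

    pages[i] is True when block i holds tuples; vm_all_visible[i] is its
    visibility map bit. Blocks at or above verified_from were seen empty by
    the current vacuum, so a still-set bit proves they are still empty.
    Returns (new_length, pages_read).
    """
    pages_read = 0
    new_length = len(pages)
    while new_length > 0:
        blkno = new_length - 1
        if blkno >= verified_from and vm_all_visible[blkno]:
            # Still empty: trust the bit, no heap page read needed.
            new_length = blkno
            continue
        # Fallback: bit cleared since vacuum passed, or block not covered
        # by the guarantee. Read the page itself, as the old code does.
        pages_read += 1
        if pages[blkno]:
            break
        new_length = blkno
    return new_length, pages_read
```

In this model, a trailing run of verified-empty blocks costs no heap reads at all, while a block whose bit was cleared by a concurrent insert stops the truncation at the right place after a single read.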

Using the previously shown test case, this patch reduces the truncation part of the vacuum to 2 seconds.

Cheers,

Jeff
Attachment
