Re: Proposal: Another attempt at vacuum improvements

From Pavan Deolasee
Subject Re: Proposal: Another attempt at vacuum improvements
Date
Msg-id BANLkTimKSzkUPck6ghm-Er3YTU8jE86JCA@mail.gmail.com
In response to Re: Proposal: Another attempt at vacuum improvements  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Proposal: Another attempt at vacuum improvements  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers


On Tue, May 24, 2011 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, first of all, thanks for putting some effort and thought into
> this.  Despite the large number of improvements in this area in 8.3
> and 8.4, this is still a pain point, and it would be really nice to
> find a way to make some further improvements.


Thanks for bringing up the idea during PGCon; that helped me get interested in this again. I hope we will be able to take this to a logical conclusion this time and do something to alleviate the pain.
 
> On Tue, May 24, 2011 at 2:58 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>> So the idea is to separate the index vacuum (removing index pointers to dead
>> tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
>> or using regular vacuum), we can spool the dead line pointers somewhere. To
>> avoid any hot-spots during normal processing, the spooling can be done
>> periodically like the stats collection.

> What happens if the system crashes after a line pointer becomes dead
> but before the record of its death is safely on disk?  The fact that a
> previous index vacuum has committed is only sufficient justification
> for reclaiming the dead line pointers if you're positive that the
> index vacuum killed the index pointers for *every* dead line pointer.
> I'm not sure we want to go there; any operation that wants to make a
> line pointer dead will need to be XLOG'd.  Instead, I think we should
> stick with your original idea and just try to avoid the second heap
> pass.


I would not mind keeping the design simple for the first release; even just avoiding the second heap scan in vacuum would be a big win. In the long term, though, I think it will pay off to keep track of dead line pointers as they are generated. The only way they are generated, though, is while cleaning up a page with the cleanup lock held, and that operation is WAL-logged, so spooling dead line pointers during WAL replay should be possible.

Anyway, I would rather not pursue that idea for now. I am OK with a simplified version to start with, where every heap vacuum is followed by an index vacuum, collecting and holding the dead line pointers in maintenance memory (i.e. within maintenance_work_mem).
 
> So to do that, as you say, we can have every operation that creates a
> dead line pointer note the LSN of the operation in the page.

Yes.
 
> But instead of allocating permanent space in the page header, which would
> both reduce (admittedly only by 8 bytes) the amount of space available
> for tuples, and more significantly have the effect of breaking on-disk
> compatibility, I'm wondering if we could get by with making space for
> that extra LSN only when it's actually present. In other words, when
> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
> increment pd_upper, and use the extra space to store the LSN.  There
> is an alignment problem to worry about there but that shouldn't be a
> huge issue.


That might work, but it would require us to move tuples around when the first dead line pointer gets generated in the page. You may argue that we would be holding a cleanup lock when that happens, and that dead line pointer creation is always followed by a call to PageRepairFragmentation(), so it should be easy enough to make space for the LSN.

Instead of storing the LSN after the page header, would it be easier to set pd_special and store the LSN at the end of the page?
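
To make that concrete, here is a very rough sketch of the special-space approach. Everything named HeapPage* below is made up for illustration; only the bufpage.h macros are existing code, and it assumes heap pages get initialized with enough special space to hold one LSN (existing heap pages have none, so the getter just reports "nothing recorded"):

/*
 * Rough sketch only -- HeapPageSpecialData and these accessors are
 * hypothetical; PageGetSpecialPointer()/PageGetSpecialSize() are the
 * existing bufpage.h macros.
 */
#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"

typedef struct HeapPageSpecialData
{
    XLogRecPtr  last_dead_lp_lsn;   /* LSN of the last operation that
                                     * created a dead line pointer */
} HeapPageSpecialData;

/* Fetch the stored LSN, or InvalidXLogRecPtr if the page has none. */
static inline XLogRecPtr
HeapPageGetDeadLpLSN(Page page)
{
    if (PageGetSpecialSize(page) >= sizeof(HeapPageSpecialData))
        return ((HeapPageSpecialData *)
                PageGetSpecialPointer(page))->last_dead_lp_lsn;
    return InvalidXLogRecPtr;       /* no special area on this page */
}

/* Record the LSN of the operation that just created dead line pointers. */
static inline void
HeapPageSetDeadLpLSN(Page page, XLogRecPtr lsn)
{
    ((HeapPageSpecialData *)
     PageGetSpecialPointer(page))->last_dead_lp_lsn = lsn;
}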
 
> When we vacuum, we remember the LSN before we start.  When we finish,
> if we scanned the indexes and everything completed without error, then
> we bump the heap's notion (wherever we store it) of the last
> successful index vacuum.  When we vacuum or do HOT cleanup on a page,
> if the page has a most-recent-dead-line pointer LSN and it precedes
> the start-of-last-successful-index-vacuum LSN, then we mark all the
> LP_DEAD tuples as LP_UNUSED and throw away the
> most-recent-dead-line-pointer LSN.


Right. And if the cleanup generates new dead line pointers, the LSN will be replaced with the LSN of the current operation.
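
A rough sketch of that reclaim check, reusing the hypothetical accessors from the sketch above. The function and variable names are made up, and it glosses over the WAL logging, which is discussed further down; the pruning caller would afterwards stamp the page with the LSN of the current operation if it created new dead line pointers:

/*
 * last_index_vacuum_lsn is whatever per-heap value records the LSN taken
 * just before the last successfully completed index vacuum.
 */
static void
reclaim_dead_line_pointers(Page page, XLogRecPtr last_index_vacuum_lsn)
{
    XLogRecPtr   page_dead_lsn = HeapPageGetDeadLpLSN(page);
    OffsetNumber offnum;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    /*
     * Only reclaim if every dead line pointer on the page was created
     * before the last successful index vacuum started; otherwise index
     * pointers to them may still exist.
     */
    if (XLogRecPtrIsInvalid(page_dead_lsn) ||
        page_dead_lsn >= last_index_vacuum_lsn)
        return;

    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))
            ItemIdSetUnused(itemid);    /* no index pointers remain */
    }

    /* Forget the LSN; the next pruning operation will set it again. */
    HeapPageSetDeadLpLSN(page, InvalidXLogRecPtr);
}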
 
> One downside of this approach is that, if we do something like this,
> it'll become slightly more complicated to figure out where the item
> pointer array ends.  Another issue is that we might find ourselves
> wanting to extend the item pointer array to add a new item, and unable
> to do so easily because this most-recent-dead-line-pointer LSN is in
> the way.

I think that should not be too difficult to handle, though handling it via the special-space mechanism might be less complicated.
 
> If the LSN stored in the page precedes the
> start-of-last-successful-index-vacuum LSN, and if, further, we can get
> a buffer cleanup lock on the page, then we can do a HOT cleanup and
> life is good.  Otherwise, we can either (1) just forget about the
> most-recent-dead-line-pointer LSN - not ideal but not catastrophic
> either - or (2) if the start-of-last-successful-vacuum-LSN is old
> enough, we could overwrite an LP_DEAD line pointer in place.


I don't think we need the cleanup lock to turn LP_DEAD line pointers into LP_UNUSED, since that does not involve moving tuples around; a simple EXCLUSIVE lock should be enough. But we would need to WAL-log the operation of turning DEAD into UNUSED, so it would be simpler to consolidate this into HOT pruning. There could be exceptions, such as when, say, a large number of DEAD line pointers have accumulated towards the end and reclaiming them would free up substantial space in the page. But maybe we can use those conditions to invoke HOT pruning instead of handling them separately.
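
Purely as an illustration of that last point (and reading "towards the end" as the tail of the line pointer array), a hypothetical check that counts trailing DEAD entries and uses them as an extra trigger for HOT pruning; nothing like this exists today:

static bool
dead_tail_worth_pruning(Page page, int threshold)
{
    OffsetNumber off = PageGetMaxOffsetNumber(page);
    int          ndead_tail = 0;

    /* Count the run of LP_DEAD items at the tail of the array. */
    while (off >= FirstOffsetNumber &&
           ItemIdIsDead(PageGetItemId(page, off)))
    {
        ndead_tail++;
        off--;
    }

    /* Reclaiming (and truncating) the tail would free real space. */
    return ndead_tail >= threshold;
}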

 
> Another issue is that this causes problems for temporary and unlogged
> tables, because no WAL records are generated and, therefore, the LSN
> does not advance.  This is also a problem for GIST indexes; Heikki
> fixed temporary GIST indexes by generating fake LSNs off of a
> backend-local counter.  Unlogged GIST indexes are currently not
> supported.  I think what we need to do is create an API to which you
> can pass a relation and get an LSN.  If it's a permanent relation, you
> get a regular LSN.  If it's a temporary relation, you get a fake LSN
> based on a backend-local counter.  If it's an unlogged relation, you
> get a fake LSN based on a shared-memory counter that is reset on
> restart.  If we can encapsulate that properly, it should provide both
> what we need to make this idea work and allow a somewhat graceful fix
> for GIST-vs-unlogged problem.


Can you explain in more detail how things would work for unlogged tables? Do we use the same shared-memory counter for tracking the last successful index vacuum? If so, how do we handle the case where, after a restart, a page may get an LSN less than the index-vacuum LSN recorded before the crash/stop? We might be fooled into believing that the index pointers have all been removed, even for dead line pointers generated after the restart. We could possibly handle that by resetting the index-vacuum LSN so that nothing gets reclaimed until one full cycle of heap and index vacuum is done, but I am not sure how easy it would be to reset the index-vacuum LSNs for all unlogged relations at the end of recovery.
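
For reference, a rough sketch of the kind of relation-to-LSN API being described. Every name here is hypothetical except relpersistence and GetXLogInsertRecPtr(); the unlogged counter is shown as a plain static only for brevity, although it would have to live in shared memory, and its starting over after a restart is exactly where the problem above comes from:

#include "postgres.h"
#include "access/xlog.h"
#include "utils/rel.h"

static XLogRecPtr temp_fake_lsn;        /* backend-local counter, like
                                         * Heikki's temporary-GiST fix */
static XLogRecPtr unlogged_fake_lsn;    /* would live in shared memory and
                                         * be reset after a restart */

static XLogRecPtr
RelationGetVacuumLSN(Relation rel)
{
    switch (rel->rd_rel->relpersistence)
    {
        case RELPERSISTENCE_PERMANENT:
            return GetXLogInsertRecPtr();   /* a real WAL position */
        case RELPERSISTENCE_TEMP:
            return ++temp_fake_lsn;         /* fake, backend-local */
        case RELPERSISTENCE_UNLOGGED:
            return ++unlogged_fake_lsn;     /* fake, shared, non-durable */
        default:
            return InvalidXLogRecPtr;
    }
}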

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
