Re: BTree vacuum before page splitting - Mailing list pgsql-patches

From Junji TERAMOTO
Subject Re: BTree vacuum before page splitting
Date
Msg-id 43E090F5.3060603@lab.ntt.co.jp
Whole thread Raw
In response to Re: BTree vacuum before page splitting  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-patches
Thanks!

I had misunderstood the content of the comment. I think that I can
understand the problem now.

I might be able to present the solution of the restarting an index scan.
I try to correct the patch.

Simon Riggs wrote:
> On Tue, 2006-01-31 at 15:18 +0900, Junji TERAMOTO wrote:
>> Tom Lane wrote:
>>> I think this is quite likely to break things :-(.  What sort of
>>> conditions have you tested it under?  (If this were safe, we'd
>>> not have invented the LP_DELETE flag to begin with, but just have
>>> deleted known-dead items immediately.)
>> I apologize for my insufficient description(and bad english).
>> I explain the operation of this patch in detail again.
>>
>> In _bt_insertonpg(), we insert the tupple by the following methods.
>>
>>  (1) Finds the right place(page) to insert the tuple.
>>  (2) If free space is insufficient, we operate as follows.
>> + (2.1) Confirm whether the lock to the found page is considerable
>>         super-exclusive lock in BufferSuperLockHeldByMe()[new]. We
>>         check bufHdr->refcount(=pin), and if pin == 1, we know this
>>         lock is super-exclusive lock.
>>         If our lock is not super-exclusive lock, goto (2.4).
>> + (2.2) If our lock is super-exclusive lock, and the found page is
>>         leaf, we look for tupples marked as "LP_DELETE" from the found
>>         page, and remove found tuples in _bt_vacuum_page()[new].
>> + (2.3) If free space is sufficient after removal, goto (4).
>>   (2.4) Step right to next non-dead page.
>>   (2.5) (2) is repeated until an enough free space is found or it
>>         reaches a right edge at last.
>>  (3) When an enough free space is not found by processing (2), splits
>>      the target page (making sure that the split is equitable as far as
>>      post-insert free space goes).
>>  (4) Inserts the tuple.
>>
>> The numbers from 2.1) to 2.3) are new processing.
>>
>> The pointer's shifting by removing "LP_DELETE" tuples as shown in the
>> above-mentioned becomes a problem, when we resume a stopped indexscan.
>> Therefore, I am having it confirm the state of the lock by (2.1), and
>> process only at super-exclucive lock, same as btbulkdelete().
>>
>> However, because own indexscan might be lost because of this removal,
>> I also modify _bt_restscan() as follows.
>>
>>  (1) Check for index tuple we stopped on.
>>  (2) If index tuple is moved, first of all, we search in this page
>>      right.
>> +(3) If not found, we try to look for from the head of this page
>>      because index tuple may moved left due to removal.
>>  (4) If still not found, we look for index tuple right.
>>
>> The number (3) is new processing.
>>
>> We do not delete the empty page. Therefore, there is no necessity for
>> tracing a left page.
>>
>> I think that no problem occurs by this logic.
>> Please point it out if there is a wrong point. Thanks.
>
> Please read comments in nbtree.c p.557-566 which describe the required
> locking protocol for btree pages.
>
> Just because you hold the lock does *not* mean you are allowed to delete
> tuples marked as deleted. This patch doesn't work in all cases, which is
> what is required, since the failure cases don't raise an ERROR - they
> just miss data - which is unacceptable. So your patch works 99.9% of the
> time, but sometimes gives wrong answers to queries without you knowing.
> So, patch rejected, but good thinking all the same.
>
> You will need to argue for a change in the locking protocol before it
> could be accepted. That would speed up VACUUM also, so if it were
> possible it would be gratefully accepted.
>
> The only help I can give is this: Why does an index scan ever want to
> stop on a deleted tuple? Currently, the scan only remembers one tuple
> its seen. If the tuple was dead, but wasn't yet marked as dead the index
> scan will have read it and stopped there. Some while later, the index
> scan will return and begin scanning right for the tuple. If you delete
> it, the index scan *may* not be able to work out where to restart --
> ERROR. So, if you can work out a way of not restarting an index scan on
> a deleted tuple (and yet still marking them as deleted when it sees
> them) then you may be able to solve the problem.
>
> I'm looking at the same problem domain also (table bloat), but focusing
> on a different approach that avoids this issue.
>
> Best Regards, Simon Riggs


--
Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp

pgsql-patches by date:

Previous
From: Simon Riggs
Date:
Subject: Re: BTree vacuum before page splitting
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] FW: PGBuildfarm member snake Branch REL8_0_STABLE Status