Thread: optimizing vacuum truncation scans

optimizing vacuum truncation scans

From
Jeff Janes
Date:
After a large bulk load aborted near the end, I decided to vacuum the main table so as to not leave a huge chunk of free space in the middle of it, before re-running the bulk load.  This vacuum took a frustratingly long time, as the backwards scan over the table to truncate the space did not trigger read-ahead (or read-behind) in the file system (which was ext4).

In hindsight, the best thing to do would have been to open a psql session and take an access-share lock on the main table, so that it would prevent the truncation from occurring, leaving cleaned, reusable space at the end of the table to be refilled upon the next (successful) run of the bulk load.  But it would be nice if the system could do some optimization for me, as hindsight never comes in time.

I did literally the simplest thing I could think of as a proof of concept patch, to see if it would actually fix things.  I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the regular backward scan.  The patch and driving script are attached.  
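
In outline, the trick is just this (a standalone sketch for illustration, not the attached patch; PrefetchBuffer() ends up as posix_fadvise(POSIX_FADV_WILLNEED) on Linux, so plain fadvise shows the same effect):

/*
 * Sketch (illustration only): read a file backwards one 8 kB block at a
 * time, but at each stride boundary issue a forward fadvise(WILLNEED)
 * over the chunk we are about to walk through, so the kernel's
 * read-ahead machinery can kick in.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLCKSZ  8192
#define JJFETCH 128             /* prefetch stride, in blocks */

int
main(int argc, char **argv)
{
    char        buf[BLCKSZ];
    int         fd;
    struct stat st;
    off_t       blk,
                nblocks;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
        return 1;
    nblocks = st.st_size / BLCKSZ;

    for (blk = nblocks - 1; blk >= 0; blk--)
    {
        /* entering a new chunk: prefetch it forward before reading back */
        if (blk == nblocks - 1 || blk % JJFETCH == JJFETCH - 1)
        {
            off_t       start = blk - (blk % JJFETCH);

            (void) posix_fadvise(fd, start * BLCKSZ,
                                 (blk - start + 1) * BLCKSZ,
                                 POSIX_FADV_WILLNEED);
        }
        if (pread(fd, buf, BLCKSZ, blk * BLCKSZ) != BLCKSZ)
            return 1;
    }
    close(fd);
    return 0;
}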

It was very successful in this test.  I've reported the fastest time to do the truncation for each setting of the prefetch size, out of about 10 runs each, after removing a handful of cases where buffer pins prevented a full-size truncation.  (I used the minimum because the server has other duties that would occasionally kick in and cause a spike in time for a few runs.)  A setting of 0 disables the feature.

JJFETCH   trunc_time(s)
0         355.78
1         350.5
2         277.61
4         213.9
8         172.79
16        138.87
32        133.02
64        76.25
128       47.63
256       35.28
512       20.44
1024      18.6
2048      15.63
4096      15.34
8192      14.72
16384     13.45
32768     13.13

For context, the forward-scan part of this vacuum took about 80 sec (IQR 75 to 88).

Do we want something like this at all?  If so, what would we have to do to implement it for real?  I thought that maybe the back-off amount could start at one block and double each time, until it hits a maximum.  And that the maximum would be either seg_size, or 1/8 of effective_cache_size, whichever is lower.  (I don't think we would want to prefetch more than a fraction of effective_cache_size, or else it could have already been displaced by the time the scan gets back to it).
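
The doubling itself could be as trivial as this (a sketch of the idea only; nothing here is in the patch, and the cap is just the one proposed above):

/* Sketch: grow the prefetch distance, capped as described above */
static BlockNumber
next_prefetch_distance(BlockNumber current)
{
    BlockNumber cap = Min((BlockNumber) RELSEG_SIZE,
                          (BlockNumber) (effective_cache_size / 8));

    return Min(current * 2, cap);
}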

Also, what would be the best way to drill a hole through bufmgr.c into md.c so that the prefetch could specify an entire range, rather than looping over each individual block?
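
For illustration only, the hole might look like this (both signatures are hypothetical; neither function exists today), where mdprefetch_range() could issue a single posix_fadvise() covering the whole range instead of one call per block:

/* Hypothetical range variants -- nothing like these exists yet */
extern void PrefetchBufferRange(Relation reln, ForkNumber forkNum,
                                BlockNumber firstBlock, BlockNumber nblocks);
extern void mdprefetch_range(SMgrRelation reln, ForkNumber forknum,
                             BlockNumber firstBlock, BlockNumber nblocks);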

What would have to be done to detect people running on SSD and disable the feature, if anything?

I'll add this to next commitfest as WIP patch.

Cheers,

Jeff
Attachment

Re: optimizing vacuum truncation scans

From
Jim Nasby
Date:
On 4/19/15 9:09 PM, Jeff Janes wrote:
> I did literally the simplest thing I could think of as a proof of
> concept patch, to see if it would actually fix things.  I just jumped
> back a certain number of blocks occasionally and prefetched them
> forward, then resumed the regular backward scan.  The patch and driving
> script are attached.

Shouldn't completely empty pages be set as all-visible in the VM? If so, 
can't we just find the largest not-all-visible page and move forward 
from there, instead of moving backwards like we currently do?

For that matter, why do we scan backwards anyway? The comments don't 
explain it, and we have nonempty_pages as a starting point, so why don't 
we just scan forward? I suspect that eons ago we didn't have that and 
just blindly reverse-scanned until we finally hit a non-empty buffer...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: optimizing vacuum truncation scans

From
Jeff Janes
Date:

On Sun, Apr 19, 2015 at 10:38 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/19/15 9:09 PM, Jeff Janes wrote:
I did literally the simplest thing I could think of as a proof of
concept patch, to see if it would actually fix things.  I just jumped
back a certain number of blocks occasionally and prefetched them
forward, then resumed the regular backward scan.  The patch and driving
script are attached.

Shouldn't completely empty pages be set as all-visible in the VM? If so, can't we just find the largest not-all-visible page and move forward from there, instead of moving backwards like we currently do?

If the entire table is all-visible, we would be starting from the beginning, even though the beginning of the table still has read only tuples present.
 
For that matter, why do we scan backwards anyway? The comments don't explain it, and we have nonempty_pages as a starting point, so why don't we just scan forward? I suspect that eons ago we didn't have that and just blindly reverse-scanned until we finally hit a non-empty buffer...

nonempty_pages is not concurrency safe, as the pages could become used after vacuum passed them over but before the access exclusive lock was grabbed before the truncation scan.  But maybe the combination of the two?  If it is above nonempty_pages, then anyone who wrote into the page after vacuum passed it must have cleared the VM bit. And currently I think no one but vacuum ever sets VM bit back on, so once cleared it would stay cleared.

In any event nonempty_pages could be used to set the guess as to how many pages (if any) might be worth prefetching, as that is not needed for correctness.

Cheers,

Jeff

Re: optimizing vacuum truncation scans

From
Jim Nasby
Date:
On 4/20/15 1:50 AM, Jeff Janes wrote:
>     Shouldn't completely empty pages be set as all-visible in the VM? If
>     so, can't we just find the largest not-all-visible page and move
>     forward from there, instead of moving backwards like we currently do?
>
>
> If the entire table is all-visible, we would be starting from the
> beginning, even though the beginning of the table still has read only
> tuples present.

Except we'd use it in conjunction with nonempty_pages. IIRC, that's set 
to the last page that vacuum saw data on. If any page after that got 
written to after vacuum visited it, the VM bit would be cleared. So 
after we acquire the exclusive lock, AFAICT it's safe to just scan the 
VM starting with nonempty_pages.

>     For that matter, why do we scan backwards anyway? The comments don't
>     explain it, and we have nonempty_pages as a starting point, so why
>     don't we just scan forward? I suspect that eons ago we didn't have
>     that and just blindly reverse-scanned until we finally hit a
>     non-empty buffer...
>
>
> nonempty_pages is not concurrency safe, as the pages could become used
> after vacuum passed them over but before the access exclusive lock was
> grabbed before the truncation scan.  But maybe the combination of the
> two?  If it is above nonempty_pages, then anyone who wrote into the page
> after vacuum passed it must have cleared the VM bit. And currently I
> think no one but vacuum ever sets VM bit back on, so once cleared it
> would stay cleared.

Right.

> In any event nonempty_pages could be used to set the guess as to how
> many pages (if any) might be worth prefetching, as that is not needed
> for correctness.

Yeah, but I think we'd do a LOT better with the VM idea, because we 
could immediately truncate without scanning anything.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: optimizing vacuum truncation scans

From
Qingqing Zhou
Date:
On Sun, Apr 19, 2015 at 7:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I did literally the simplest thing I could think of as a proof of concept patch, to see if it would actually fix
things. I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the
regularbackward scan.
 
IIRC, this patch introduces optional prefetch for the backward scan
where file system prefetch may not work well. The result is promising!

> Also, what would be the best way to drill a hole through bufmgr.c into md.c so that the prefetch could specify an
> entire range, rather than looping over each individual block?
 
>
Unlike mdread(), which requires a buffer as an argument, extending
mdprefetch() looks natural and safe to me. This would also lay a
foundation for other scan-style optimizations.

> What would have to be done to detect people running on SSD and disable the feature, if anything?
>
There are other call sites of prefetch, so this is a topic not
just for this optimization.

Regards,
Qingqing



Re: optimizing vacuum truncation scans

From
Jeff Janes
Date:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:

    For that matter, why do we scan backwards anyway? The comments don't
    explain it, and we have nonempty_pages as a starting point, so why
    don't we just scan forward? I suspect that eons ago we didn't have
    that and just blindly reverse-scanned until we finally hit a
    non-empty buffer...


nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan.  But maybe the combination of the
two?  If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.

Right.

In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.

Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.

Right now all the interlocks to make this work seem to be in place (only vacuum and startup can set visibility map bits, and only one vacuum can be in a table at a time).  But as far as I can tell, those assumptions are not "baked in" and we have pondered loosening them before.

For example, letting HOT cleanup mark a page as all-visible if it finds it to be such.  Now in that specific case it would be OK, as HOT cleanup would not cause a page to become empty (or could it?  If an insert on a table with no indexes was rolled back, and HOT cleanup found it and cleaned it up, it could conceptually become empty--unless we add special code to prevent it), and so the page would have to be below nonempty_pages.  But there may be other cases.

And I know other people have mentioned making VACUUM concurrent (although I don't see the value in that myself).

So doing it this way would be hard to beat (scanning a bitmap vs the table itself), but it would also introduce a modularity violation that I am not sure is worth it.  

Of course this could always be reverted if its requirements became a problem for a more important change (assuming, of course, that we detected the problem).

Cheers,

Jeff

Re: optimizing vacuum truncation scans

From
Jeff Janes
Date:
On Tue, May 26, 2015 at 12:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:

    For that matter, why do we scan backwards anyway? The comments don't
    explain it, and we have nonempty_pages as a starting point, so why
    don't we just scan forward? I suspect that eons ago we didn't have
    that and just blindly reverse-scanned until we finally hit a
    non-empty buffer...


nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan.  But maybe the combination of the
two?  If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.

Right.

In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.

Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.

Right now all the interlocks to make this work seem to be in place (only vacuum and startup can set visibility map bits, and only one vacuum can be in a table at a time).  But as far as I can tell, those assumptions are not "baked in" and we have pondered loosening them before.

For example, letting HOT cleanup mark a page as all-visible if it finds it to be such.  Now in that specific case it would be OK, as HOT cleanup would not cause a page to become empty (or could it?  If an insert on a table with no indexes was rolled back, and HOT cleanup found it and cleaned it up, it could conceptually become empty--unless we add special code to prevent it), and so the page would have to be below nonempty_pages.  But there may be other cases.

And I know other people have mentioned making VACUUM concurrent (although I don't see the value in that myself).

So doing it this way would be hard to beat (scanning a bitmap vs the table itself), but it would also introduce a modularity violation that I am not sure is worth it.  

Of course this could always be reverted if its requirements became a problem for a more important change (assuming, of course, that we detected the problem).

The fatal problem here is that nonempty_pages is unreliable.  If vacuum skips all-visible pages, it doesn't necessarily increment nonempty_pages beyond those skipped pages.  So if you just rely on nonempty_pages, you will truncate away pages that were already all-visible but are not empty.  If we changed it so that it did increment nonempty_pages past the skipped ones, then pages which were all empty and got marked as all-visible without being truncated (say, because a lock could not be acquired, or because there was a non-empty page after them which later became empty) would never get truncated away.

As it is currently, it is not clear what purpose nonempty_pages serves.  It is a guardian value which doesn't seem to actually guard anything.  At best it prevents you from needing to inspect one page (the page of the guardian value itself) to see if that page is actually empty, and finding that it is not.  That hardly seems worthwhile.

We could adopt two nonempty_pages counters, one that fails low on skipped all-visible pages, and one that fails high on them.  And then fast-truncate down to the high one, and do the current page-by-page scan between the low and the high.  That seems rather grotesque.

Cheers,

Jeff

Re: optimizing vacuum truncation scans

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> As it is currently, it is not clear what purpose nonempty_pages serves.

It's to allow an early exit at vacuumlazy.c:271 in case there is no chance
of deleting a useful number of pages.
        regards, tom lane



Re: optimizing vacuum truncation scans

From
Jeff Janes
Date:
On Tue, May 26, 2015 at 12:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:

    For that matter, why do we scan backwards anyway? The comments don't
    explain it, and we have nonempty_pages as a starting point, so why
    don't we just scan forward? I suspect that eons ago we didn't have
    that and just blindly reverse-scanned until we finally hit a
    non-empty buffer...


nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan.  But maybe the combination of the
two?  If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.

Right.

In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.

Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.

Right now all the interlocks to make this work seem to be in place (only vacuum and startup can set visibility map bits, and only one vacuum can be in a table at a time).  But as far as I can tell, those assumptions are not "baked in" and we have pondered loosening them before.

For example, letting HOT cleanup mark a page as all-visible if it finds it to be such.  Now in that specific case it would be OK, as HOT cleanup would not cause a page to become empty (or could it?  If an insert on a table with no indexes was rolled back, and HOT cleanup found it and cleaned it up, it could conceptually become empty--unless we add special code to prevent it), and so the page would have to be below nonempty_pages.  But there may be other cases.

And I know other people have mentioned making VACUUM concurrent (although I don't see the value in that myself).

So doing it this way would be hard to beat (scanning a bitmap vs the table itself), but it would also introduce a modularity violation that I am not sure is worth it.  

Of course this could always be reverted if its requirements became a problem for a more important change (assuming, of course, that we detected the problem).

Attached is a patch that implements the vm scan for truncation.  It introduces a variable to hold the last blkno which was skipped during the forward portion.  Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum.  Any other process rendering the page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime.  So if the bit is still set, the page is still empty without needing to inspect it.
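
The check boils down to roughly this (a sketch only; trust_vm and skipped_pages are the names the patch itself uses, and the visibility check is shown with the visibilitymap_test() interface of this era -- see the attached patch for the real code):

/* Sketch: inside count_nondeletable_pages()'s backward loop.
 * Every block at or above both counters was seen to be empty by this
 * vacuum, so if its vm bit is still set it must still be empty. */
BlockNumber trust_vm = Max(vacrelstats->nonempty_pages,
                           vacrelstats->skipped_pages);

if (blkno >= trust_vm &&
    visibilitymap_test(onerel, blkno, &vmbuffer))
    continue;           /* still empty; no need to read the block */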

There is still the case of pages which had their visibility bit set by a prior vacuum and then were not inspected by the current one.  Once the truncation scan runs into these pages, it falls back to the previous behavior of reading block by block backwards.  So there could still be reason to optimize that fallback using forward-reading prefetch.

Using the previously shown test case, this patch reduces the truncation part of the vacuum to 2 seconds.

Cheers,

Jeff
Attachment

Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> Attached is a patch that implements the vm scan for truncation.  It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion.  Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum.  Any other process rendering the
> page nonempty is required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime.  So if the bit is still set, the
> page is still empty without needing to inspect it.

The patch looks good and it improves the vacuum truncation speed significantly.

> There is still the case of pages which had their visibility bit set by a
> prior vacuum and then were not inspected by the current one.  Once the
> truncation scan runs into these pages, it falls back to the previous
> behavior of reading block by block backwards.  So there could still be
> reason to optimize that fallback using forward-reading prefetch.

The case I didn't understand is how the current vacuum misses a page
that was set by the prior vacuum.

The page should be counted either in skipped_pages or in
nonempty_pages. Is it possible that a page doesn't come under either
of these two categories and is also not empty?

If the above doesn't exist, then we can directly truncate the relation
from the higher of nonempty_pages and skipped_pages to the end of
the relation.

Regards,
Hari Babu
Fujitsu Australia



Re: optimizing vacuum truncation scans

From
Jeff Janes
Date:
On Wed, Jul 8, 2015 at 9:46 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> Attached is a patch that implements the vm scan for truncation.  It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion.  Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum.  Any other process rendering the
> page nonempty is required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime.  So if the bit is still set, the
> page is still empty without needing to inspect it.

The patch looks good and it improves the vacuum truncation speed significantly.

> There is still the case of pages which had their visibility bit set by a
> prior vacuum and then were not inspected by the current one.  Once the
> truncation scan runs into these pages, it falls back to the previous
> behavior of reading block by block backwards.  So there could still be
> reason to optimize that fallback using forward-reading prefetch.

The case I didn't understand is how the current vacuum misses a page
that was set by the prior vacuum.

The prior vacuum set them to all-visible, but then didn't delete them.  Either because it got interrupted or because there were still some pages after them (at the time) that were not all empty.

The current vacuum skips them entirely on the forward scan because it thinks it is a waste of time to vacuum all-visible pages.  Since it skips them, it doesn't know if they are empty as well as being all-visible.  There is no permanent indicator of the pages being all-empty; there is only the inference based on the current vacuum's counters, protected by the lock held on the table.
 

The page should be counted either in skipped_pages or in
nonempty_pages. Is it possible that a page doesn't come under either
of these two categories and is also not empty?

If the above doesn't exist, then we can directly truncate the relation
from the higher of nonempty_pages and skipped_pages to the end of
the relation.

Right, and that is what this does (provided the vm bit is still set, so it does still have to loop over the vm to verify that it is still set, while it holds the access exclusive lock).

The problem is that the pages between the two counters are not known to be empty, and also not known to be nonempty.  Someone has to be willing to go check on those pages at some point, or else they will never be eligible for truncation.

Cheers,

Jeff

Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Thu, Jul 9, 2015 at 4:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jul 8, 2015 at 9:46 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> >
>> > There is still the case of pages which had their visibility bit set by a
>> > prior vacuum and then were not inspected by the current one.  Once the
>> > truncation scan runs into these pages, it falls back to the previous
>> > behavior of reading block by block backwards.  So there could still be
>> > reason to optimize that fallback using forward-reading prefetch.
>>
>> The case I didn't understand is how the current vacuum misses a page
>> that was set by the prior vacuum.
>
>
> The prior vacuum set them to all-visible, but then didn't delete them.
> Either because it got interrupted or because there were still some pages
> after them (at the time) that were not all empty.
>
> The current vacuum skips them entirely on the forward scan because it thinks
> it is a waste of time to vacuum all-visible pages.  Since it skips them, it
> doesn't know if they are empty as well as being all-visible.  There is no
> permanent indicator of the pages being all-empty; there is only the
> inference based on the current vacuum's counters, protected by the lock
> held on the table.
>
>>
>>
>> The page should be counted either in skipped_pages or in
>> nonempty_pages. Is it possible that a page doesn't come under either
>> of these two categories and is also not empty?
>>
>> If the above doesn't exist, then we can directly truncate the relation
>> from the higher of nonempty_pages and skipped_pages to the end of
>> the relation.
>
>
> Right, and that is what this does (provided the vm bit is still set, so it
> does still have to loop over the vm to verify that it is still set, while it
> holds the access exclusive lock).

Thanks, I got it. To re-verify the vm bit of a page after getting the
access exclusive lock, we have to do the backward scan.

I also feel that your prefetch buffer patch is needed to improve the
performance, because there is still a need for a backward scan to
re-verify the vm bit.

I will do some performance tests and send you the results.

> The problem is that the pages between the two counters are not known to be
> empty, and also not known to be nonempty.  Someone has to be willing to go
> check on those pages at some point, or else they will never be eligible for
> truncation.

Yes, there is no way to identify whether a page is empty or not
without reading it.

Regards,
Hari Babu
Fujitsu Australia



Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> I will do some performance tests and send you the results.

Here are the performance results tested on my machine.


                             Head          vm patch            vm+prefetch patch

First vacuum         120 sec       <1 sec              <1 sec
Second vacuum        180 sec       180 sec             30 sec

I modified the code so that the first vacuum command skips the
vacuum truncation. This is how I collected the timings for the
second vacuum.

I combined your vm and prefetch patches into a single vm+prefetch
patch without a GUC. I kept the prefetch size at 32 and ran the
performance test. I chose that value to match the current buffer
access strategy ring, which is presently 32 buffers for vacuum,
rather than making it a user option.
Here I have attached the modified patch with both the vm and prefetch logic.

I will do some tests on a machine with an SSD and let you know the
results. Based on those, we can decide whether we need a GUC,
depending on the impact of prefetch on SSD machines.

Regards,
Hari Babu
Fujitsu Australia

Attachment

Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Mon, Jul 13, 2015 at 12:06 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
>> I will do some performance tests and send you the results.
>
> Here are the performance results tested on my machine.
>
>
>                              Head          vm patch            vm+prefetch patch
>
> First vacuum         120 sec       <1 sec              <1 sec
> Second vacuum        180 sec       180 sec             30 sec
>
> I modified the code so that the first vacuum command skips the
> vacuum truncation. This is how I collected the timings for the
> second vacuum.
>
> I combined your vm and prefetch patches into a single vm+prefetch
> patch without a GUC. I kept the prefetch size at 32 and ran the
> performance test. I chose that value to match the current buffer
> access strategy ring, which is presently 32 buffers for vacuum,
> rather than making it a user option.
> Here I have attached the modified patch with both the vm and prefetch logic.
>
> I will do some tests on a machine with an SSD and let you know the
> results. Based on those, we can decide whether we need a GUC,
> depending on the impact of prefetch on SSD machines.

Following are the performance readings on a machine with an SSD.
I increased the pgbench scale factor to 1000 in the test instead of 500
to show better performance numbers.

                             Head           vm patch        vm+prefetch patch

First vacuum         6.24 sec       2.91 sec           2.91 sec
Second vacuum        6.66 sec       6.66 sec           7.19 sec

There is a small performance impact on SSD with prefetch.

Regards,
Hari Babu
Fujitsu Australia



Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Mon, Jul 13, 2015 at 5:16 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 12:06 PM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>>
>>> I will do some performance tests and send you the results.
>>
>> Here are the performance results tested on my machine.
>>
>>
>>                              Head          vm patch            vm+prefetch patch
>>
>> First vacuum         120 sec       <1 sec              <1 sec
>> Second vacuum        180 sec       180 sec             30 sec
>>
>> I modified the code so that the first vacuum command skips the
>> vacuum truncation. This is how I collected the timings for the
>> second vacuum.
>>
>> I combined your vm and prefetch patches into a single vm+prefetch
>> patch without a GUC. I kept the prefetch size at 32 and ran the
>> performance test. I chose that value to match the current buffer
>> access strategy ring, which is presently 32 buffers for vacuum,
>> rather than making it a user option.
>> Here I have attached the modified patch with both the vm and prefetch logic.
>>
>> I will do some tests on a machine with an SSD and let you know the
>> results. Based on those, we can decide whether we need a GUC,
>> depending on the impact of prefetch on SSD machines.
>
> Following are the performance readings on a machine with an SSD.
> I increased the pgbench scale factor to 1000 in the test instead of 500
> to show better performance numbers.
>
>                              Head           vm patch        vm+prefetch patch
>
> First vacuum         6.24 sec       2.91 sec           2.91 sec
> Second vacuum        6.66 sec       6.66 sec           7.19 sec
>
> There is a small performance impact on SSD with prefetch.

The above prefetch overhead was observed while prefetching 1639345 pages.
I feel this overhead is small.

Hi Jeff,

If you are fine with the earlier attached patch, I will mark it as
ready for committer, to get a committer's view on it.


Regards,
Hari Babu
Fujitsu Australia



Re: optimizing vacuum truncation scans

From
Amit Kapila
Date:
On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> Attached is a patch that implements the vm scan for truncation.  It introduces a variable to hold the last blkno which was skipped during the forward portion.  Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum.  Any other process rendering the page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime.  So if the bit is still set, the page is still empty without needing to inspect it.
>

One case where this patch can behave differently than the current
code is when, before taking the Exclusive lock on the relation for
truncation, some backend clears the vm bit and then inserts, deletes,
and prunes that page.  I think with the patch it will not consider
truncating such pages, whereas the current code will truncate them,
since it re-verifies each page.  Such a case would be rare and we
might not want to bother about it, but it is still worth thinking
about once.

Some minor points about patch:

count_nondeletable_pages()
{
..
if (PageIsNew(page) || PageIsEmpty(page))
{
/* PageIsNew probably shouldn't happen... */
UnlockReleaseBuffer(buf);
continue;
}
..
}

Why is vmbuffer not freed in the above loop?


count_nondeletable_pages()
{
..
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set.  Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;

..
}

I think it is better to have spaces in the above assignment statement
(near '=' and near '>').


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>>
>> Attached is a patch that implements the vm scan for truncation.  It
>> introduces a variable to hold the last blkno which was skipped during the
>> forward portion.  Any blocks after both this blkno and after the last
>> inspected nonempty page (which the code is already tracking) must have been
>> observed to be empty by the current vacuum.  Any other process rendering the
>> page nonempty is required to clear the vm bit, and no other process can set
>> the bit again during the vacuum's lifetime.  So if the bit is still set, the
>> page is still empty without needing to inspect it.
>>
>
> One case where this patch can behave differently than the current
> code is when, before taking the Exclusive lock on the relation for
> truncation, some backend clears the vm bit and then inserts, deletes,
> and prunes that page.  I think with the patch it will not consider
> truncating such pages, whereas the current code will truncate them,
> since it re-verifies each page.  Such a case would be rare and we
> might not want to bother about it, but it is still worth thinking
> about once.

Thanks for your review.

Corrected the code: instead of returning the blkno after a visibility
map check failure, the code now just continues to verify the page
contents, as in the earlier approach.


> Some minor points about patch:
>
> count_nondeletable_pages()
> {
> ..
> if (PageIsNew(page) || PageIsEmpty(page))
> {
> /* PageIsNew probably shouldn't happen... */
> UnlockReleaseBuffer(buf);
> continue;
> }
> ..
> }
>
> Why is vmbuffer not freed in the above loop?
>

In the above check, we are just continuing to the next blkno; the
vmbuffer is freed at the end of the loop.


> count_nondeletable_pages()
> {
> ..
> + /*
> + * Pages that were inspected and found to be empty by this vacuum run
> + * must still be empty if the vm bit is still set.  Only vacuum sets
> + * vm bits, and only one vacuum can exist in a table at one time.
> + */
> + trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
> + vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
>
> ..
> }
>
> I think it is better to have spaces in the above assignment statement
> (near '=' and near '>').
>

corrected.

Here I attached updated patches,
1) without prefetch logic.
2) with combined vm and prefetch logic.

I marked the patch as ready for committer.


Regards,
Hari Babu
Fujitsu Australia

Attachment

Re: optimizing vacuum truncation scans

From
Amit Kapila
Date:
On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > One case where this patch can behave differently than the current
> > code is when, before taking the Exclusive lock on the relation for
> > truncation, some backend clears the vm bit and then inserts, deletes,
> > and prunes that page.  I think with the patch it will not consider
> > truncating such pages, whereas the current code will truncate them,
> > since it re-verifies each page.  Such a case would be rare and we
> > might not want to bother about it, but it is still worth thinking
> > about once.
>
> Thanks for your review.
>
> Corrected the code: instead of returning the blkno after a visibility
> map check failure, the code now just continues to verify the page
> contents, as in the earlier approach.
>

Okay, I think this should work.

>
> Here I attached updated patches,
> 1) without prefetch logic.
> 2) with combined vm and prefetch logic.
>

I think it is better to just get the first patch in, as that in itself
is a clear win, and then we can evaluate the second one (prefetching
during truncation) separately.  After the first patch, the use case for
prefetching seems less beneficial, and I think it could hurt by loading
unneeded pages into the OS buffer cache (say you have prefetched 32
pages and then, after inspecting the first one, the loop quits because
that page is non-empty; or the prefetch was issued because one page had
its visibility map bit cleared while the bit is set for the others).

Having said that, I am not against prefetching in this scenario as that
can help in more cases than it could hurt, but I think it will be better
to evaluate that separately.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Thu, Jul 16, 2015 at 10:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>
>>
>> Here I attached updated patches,
>> 1) without prefetch logic.
>> 2) with combined vm and prefetch logic.
>>
>
> I think it is better to just get the first patch in, as that in itself
> is a clear win, and then we can evaluate the second one (prefetching
> during truncation) separately.  After the first patch, the use case for
> prefetching seems less beneficial, and I think it could hurt by loading
> unneeded pages into the OS buffer cache (say you have prefetched 32
> pages and then, after inspecting the first one, the loop quits because
> that page is non-empty; or the prefetch was issued because one page had
> its visibility map bit cleared while the bit is set for the others).

Yes, in the above cases the prefetch is pure overhead. I am not sure
how frequently such scenarios occur in the real world.

> Having said that, I am not against prefetching in this scenario as that
> can help in more cases than it could hurt, but I think it will be better
> to evaluate that separately.

Yes, the prefetch patch works better mainly in cases where the first
vacuum skips the truncation because of lock waiters. If such cases are
common in the real world, then the prefetch patch is needed.

Regards,
Hari Babu
Fujitsu Australia



Re: optimizing vacuum truncation scans

From
Simon Riggs
Date:
On 16 July 2015 at 06:51, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
 
I marked the patch as ready for committer.

The most recent results seem to indicate that the prefetch isn't worth pursuing, but the vm approach is. Can someone repeat the perf tests on something larger so we can see when/if there is a benefit?

Jeff, can you add detailed comments to explain the theory of operation of these patches? The patches add the code, but don't say why. We will forget, so let's put the detail in there now please. Thanks.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: optimizing vacuum truncation scans

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation.  It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion.  Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum.  Any other process rendering the
> page nonempty is required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime.  So if the bit is still set, the
> page is still empty without needing to inspect it.

Urgh.  So if we do this, that forever precludes having HOT pruning set
the all-visible bit.  At the least we'd better document that carefully
so that nobody breaks it later.  But I wonder if there isn't some
better approach, because I would certainly rather that we didn't
foreclose the possibility of doing something like that in the future.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: optimizing vacuum truncation scans

From
Jeff Janes
Date:
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation.  It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion.  Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum.  Any other process rendering the
> page nonempty is required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime.  So if the bit is still set, the
> page is still empty without needing to inspect it.

Urgh.  So if we do this, that forever precludes having HOT pruning set
the all-visible bit. 

I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it.  I don't think this change would grow tentacles across the code that make it hard to revert; you would just have to take the performance hit (and by that time, maybe HDD will truly be dead anyway and so we won't care anymore). But yes, that is definitely a downside.  HOT pruning is one example, but also one could envision having someone (bgwriter?) set vm bits on unindexed tables.  Or if we invent some efficient way to know that no expiring tids for a certain block range are stored in indexes, other jobs could also set the vm bit on indexed tables.  Or parallel vacuums in the same table, not that I really see a reason to have those.
 
At the least we'd better document that carefully
so that nobody breaks it later.  But I wonder if there isn't some
better approach, because I would certainly rather that we didn't
foreclose the possibility of doing something like that in the future.

But where do we document it (other than in-place)?  README.HOT doesn't seem sufficient, and there is no README.vm.

I guess add an "Assert(InRecovery || running_a_vacuum);" to the visibilitymap_set with a comment there, except that I don't know how to implement running_a_vacuum so that it covers manual vacs as well as autovac.  Perhaps assert that we hold a SHARE UPDATE EXCLUSIVE on rel?

The advantage of the other approach, just forcing kernel read-ahead to work for us, is that it doesn't impose any of these restrictions on future development.  The disadvantage is that I don't know how to auto-tune it, or auto-disable it for SSD, and it will never be quite as efficient.
 
Cheers,

Jeff

Re: optimizing vacuum truncation scans

From
Robert Haas
Date:
On Wed, Jul 22, 2015 at 12:11 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Urgh.  So if we do this, that forever precludes having HOT pruning set
>> the all-visible bit.
>
> I wouldn't say forever, as it would be easy to revert the change if
> something more important came along that conflicted with it.

Well, the people who want this change would likely object to reverting it later.

> I don't think
> this change would grow tentacles across the code that make it hard to
> revert; you would just have to take the performance hit (and by that time,
> maybe HDD will truly be dead anyway and so we won't care anymore). But yes,
> that is definitely a downside.  HOT pruning is one example, but also one
> could envision having someone (bgwriter?) set vm bits on unindexed tables.
> Or if we invent some efficient way to know that no expiring tids for a
> certain block range are stored in indexes, other jobs could also set the vm
> bit on indexed tables.  Or parallel vacuums in the same table, not that I
> really see a reason to have those.

Yes, good points.  (I think we will need parallel vacuum to cope with
TB-sized tables, but that's another story.)

>> At the least we'd better document that carefully
>> so that nobody breaks it later.  But I wonder if there isn't some
>> better approach, because I would certainly rather that we didn't
>> foreclose the possibility of doing something like that in the future.
>
> But where do we document it (other than in-place)?  README.HOT doesn't seem
> sufficient, and there is no README.vm.

visibilitymap_set?

> I guess add an "Assert(InRecovery || running_a_vacuum);" to the
> visibilitymap_set with a comment there, except that I don't know how to
> implement running_a_vacuum so that it covers manual vacs as well as autovac.
> Perhaps assert that we hold a SHARE UPDATE EXCLUSIVE on rel?

I care more about the comment than I do about the Assert().

> The advantage of the other approach, just forcing kernel read-ahead to work
> for us, is that it doesn't impose any of these restrictions on future
> development.  The disadvantage is that I don't know how to auto-tune it, or
> auto-disable it for SSD, and it will never be quite as efficient.

I have to say that I like the prefetch approach.  I don't know what to
do about the fact that it loses to the VM-bit based approach, but I
think it's a bad bet that we will never care about setting visibility
map bits anyplace other than VACUUM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: optimizing vacuum truncation scans

From
Andres Freund
Date:
On 2015-07-23 11:06:25 -0400, Robert Haas wrote:
> I don't know what to do about the fact that it loses to the VM-bit
> based approach, but I think it's a bad bet that we will never care
> about setting visibility map bits anyplace other than VACUUM.

+1

A background process that tries to hint all pages shortly before they're
written out would be truly awesome. And would very likely set hint
bits...

Andres



Re: optimizing vacuum truncation scans

From
Jim Nasby
Date:
On 7/22/15 11:11 AM, Jeff Janes wrote:
> I guess add an "Assert(InRecovery || running_a_vacuum);" to
> the visibilitymap_set with a comment there, except that I don't know how
> to implement running_a_vacuum so that it covers manual vacs as well as
> autovac.

static bool RunningVacuum; ? It only needs to be per-backend, right?

If it is that simple then my $0.02 is we should have the assert, not 
just a comment.
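
Roughly like this (everything here is hypothetical, just to make the shape concrete):

/* Hypothetical: set in vacuum() for both manual and autovacuum paths,
 * reset in the error path. */
static bool RunningVacuum = false;

/* ...and at the top of visibilitymap_set(): */
Assert(InRecovery || RunningVacuum);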
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: optimizing vacuum truncation scans

From
Simon Riggs
Date:
On 22 July 2015 at 14:59, Robert Haas <robertmhaas@gmail.com> wrote:
 
Urgh.  So if we do this, that forever precludes having HOT pruning set
the all-visible bit.

What is the reason why we don't do that already? Surely it's a one-liner?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: optimizing vacuum truncation scans

From
Robert Haas
Date:
On Sat, Jul 25, 2015 at 4:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 July 2015 at 14:59, Robert Haas <robertmhaas@gmail.com> wrote:
>> Urgh.  So if we do this, that forever precludes having HOT pruning set
>> the all-visible bit.
>
> What is the reason why we don't do that already? Surely it's a one-liner?

I wish!

It actually doesn't look simple to me to modify heap_page_prune to
know whether any not-all-visible items remain at the end.  It's
definitely true that, in order for marking the page all-visible to be
a possibility, we need to reach heap_page_prune_execute() with ndead
== 0.  But that's not enough: we also need to know that any tuple that
survived the prune operation (that is, it wasn't redirected or marked
unused) has a new-enough xmin, which isn't tracked anywhere.
vacuumlazy.c solves that problem by making a second pass over all the
tuples on the page to determine whether the page is all-visible, but
that seems pretty inefficient and I think heap_page_prune() is already
a CPU hog on some workloads.  (Didn't Heikki have a patch to improve
that?)

Another sticky wicket is that, as things stand today, we'd have to
insert a second WAL record to mark the page all-visible, which would
add probably-significant overhead.  We could probably modify the
format of the record we're already emitting for pruning to carry a
flag indicating whether to additionally set PD_ALL_VISIBLE.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: optimizing vacuum truncation scans

From
Simon Riggs
Date:
On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation.  It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion.  Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum.  Any other process rendering the
> page nonempty is required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime.  So if the bit is still set, the
> page is still empty without needing to inspect it.

Urgh.  So if we do this, that forever precludes having HOT pruning set
the all-visible bit. 

I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it. 

I think what is being said here is that someone is already using this technique, or if not, then we actively want to encourage them to do so as an extension or as a submission to core.

In that case, I think the rely-on-VM technique sinks again, sorry Jim, Jeff. Probably needs code comments added.

That does still leave the prefetch technique, so all is not lost.

Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks.
 
--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: optimizing vacuum truncation scans

From
Jim Nasby
Date:
On 7/27/15 10:39 AM, Robert Haas wrote:
> But that's not enough: we also need to know that any tuple that
> survived the prune operation (that is, it wasn't redirected or marked
> unused) has a new-enough xmin, which isn't tracked anywhere.

Wouldn't that be old-enough xmin?

heap_prune_chain() is already calling HeapTupleSatisfiesVacuum, so it 
should be able to figure out if the page is all visible without a lot of 
extra work, and pass that info back to heap_page_prune (which would then 
pass it down to _execute()).

Definitely not a one-liner though.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: optimizing vacuum truncation scans

From
Jeff Janes
Date:
On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation.  It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion.  Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum.  Any other process rendering the
> page nonempty is required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime.  So if the bit is still set, the
> page is still empty without needing to inspect it.

Urgh.  So if we do this, that forever precludes having HOT pruning set
the all-visible bit. 

I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it. 

I think what is being said here is that someone is already using this technique, or if not, then we actively want to encourage them to do so as an extension or as a submission to core.

In that case, I think the rely-on-VM technique sinks again, sorry Jim, Jeff. Probably needs code comments added.

Sure, that sounds like the consensus.  The VM method was very efficient, but I agree it is pretty fragile and restrictive.
 

That does still leave the prefetch technique, so all is not lost.

Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks.

I probably won't get back to it this commit fest, so it can be set to returned with feedback.  But if anyone has good ideas for how to set the stride (or detect that it is on SSD and so is pointless to try) I'd love to hear about them anytime.

Cheers,

Jeff

Re: optimizing vacuum truncation scans

From
Simon Riggs
Date:
On 3 August 2015 at 17:18, Jeff Janes <jeff.janes@gmail.com> wrote:
 
That does still leave the prefetch technique, so all is not lost.

Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks.

I probably won't get back to it this commit fest, so it can be set to returned with feedback.  But if anyone has good ideas for how to set the stride (or detect that it is on SSD and so is pointless to try) I'd love to hear about them anytime.

I've Returned With Feedback, as you suggest.

Given your earlier numbers, I'd just pick a constant value of 128, to keep it simple. That balances out the various factors of small/large tables and the uncertainty that we will be able to truncate the whole chunk of blocks. I'd like to see tests on SSD before commit, but I hope and expect the slightly negative results with SSD won't be substantiated on larger tests.

Kept simple, it's a patch with a clear win in a restricted use case, and I would be happy to commit it sometime in the future.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: optimizing vacuum truncation scans

From
Haribabu Kommi
Date:
On Tue, Aug 4, 2015 at 2:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com>
>>> wrote:
>>>>
>>>> On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com>
>>>> wrote:
>>>> > Attached is a patch that implements the vm scan for truncation.  It
>>>> > introduces a variable to hold the last blkno which was skipped during
>>>> > the
>>>> > forward portion.  Any blocks after both this blkno and after the last
>>>> > inspected nonempty page (which the code is already tracking) must have
>>>> > been
>>>> > observed to be empty by the current vacuum.  Any other process
>>>> > rendering the
>>>> > page nonempty is required to clear the vm bit, and no other process
>>>> > can set
>>>> > the bit again during the vacuum's lifetime.  So if the bit is still
>>>> > set, the
>>>> > page is still empty without needing to inspect it.
>>>>
>>>> Urgh.  So if we do this, that forever precludes having HOT pruning set
>>>> the all-visible bit.
>>>
>>>
>>> I wouldn't say forever, as it would be easy to revert the change if
>>> something more important came along that conflicted with it.
>>
>>
>> I think what is being said here is that someone is already using this
>> technique, or if not, then we actively want to encourage them to do so as an
>> extension or as a submission to core.
>>
>> In that case, I think the rely-on-VM technique sinks again, sorry Jim,
>> Jeff. Probably needs code comments added.
>
>
> Sure, that sounds like the consensus.  The VM method was very efficient, but
> I agree it is pretty fragile and restrictive.
>
>>
>>
>> That does still leave the prefetch technique, so all is not lost.
>>
>> Can we see a patch with just prefetch, probably with a simple choice of
>> stride? Thanks.
>
>
> I probably won't get back to it this commit fest, so it can be set to
> returned with feedback.  But if anyone has good ideas for how to set the
> stride (or detect that it is on SSD and so is pointless to try) I'd love to
> hear about them anytime.

I found the following way to determine whether a data file is present
on a disk or an SSD.

1. Get the device that the table's data file is mapped to, using the
following or similar:

df -P "filename" | awk 'NR==2{print $1}'

2. If the device is of type /dev/sd*, treat it as a disk system and
proceed with the prefetch optimization.

3. If we are not able to find the device details directly, we need to
get the information from the device-mapper layer.

Usually the devices map like the following:

/dev/mapper/v** points to ../dm-*

4. Get the name of the "dm-*" from the above details and check
whether it is an SSD or not by reading the following sysfs file:

/sys/block/dm-*/queue/rotational

5. If the value is 0, the device is an SSD; 1 means a rotational disk drive.

The procedure described above works only on Linux; I haven't checked
other operating systems yet.
Is it worth considering?
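
Once the underlying device name (sd* or dm-*) is known, the check itself is a couple of lines of C; a Linux-only sketch (resolving a partition or LVM volume down to its parent device is the hard part, not shown):

#include <stdio.h>

/* Returns 1 for rotational (disk), 0 for non-rotational (SSD), -1 if unknown */
static int
device_is_rotational(const char *devname)       /* e.g. "sda" or "dm-0" */
{
    char        path[128];
    FILE       *f;
    int         rot = -1;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", devname);
    if ((f = fopen(path, "r")) != NULL)
    {
        if (fscanf(f, "%d", &rot) != 1)
            rot = -1;
        fclose(f);
    }
    return rot;
}

int
main(int argc, char **argv)
{
    if (argc > 1)
        printf("%s: rotational=%d\n", argv[1], device_is_rotational(argv[1]));
    return 0;
}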

Regards,
Hari Babu
Fujitsu Australia



Re: optimizing vacuum truncation scans

From
Robert Haas
Date:
On Mon, Sep 28, 2015 at 2:05 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> I found the following way to determine whether a data file is present
> on a disk or an SSD.
>
> 1. Get the device that the table's data file is mapped to, using the
> following or similar:
>
> df -P "filename" | awk 'NR==2{print $1}'
>
> 2. If the device is of type /dev/sd*, treat it as a disk system and
> proceed with the prefetch optimization.
>
> 3. If we are not able to find the device details directly, we need to
> get the information from the device-mapper layer.
>
> Usually the devices map like the following:
>
> /dev/mapper/v** points to ../dm-*
>
> 4. Get the name of the "dm-*" from the above details and check
> whether it is an SSD or not by reading the following sysfs file:
>
> /sys/block/dm-*/queue/rotational
>
> 5. If the value is 0, the device is an SSD; 1 means a rotational disk drive.
>
> The procedure described above works only on Linux; I haven't checked
> other operating systems yet.
> Is it worth considering?

No.  If we need to have the behavior depend on the hardware, it should
be a GUC or tablespace option or reloption, not some kind of crazy
OS-dependent discovery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company