Thread: optimizing vacuum truncation scans
After a large bulk load aborted near the end, I decided to vacuum the main table, so as not to leave a huge chunk of free space in the middle of it before re-running the bulk load. This vacuum took a frustratingly long time, as the backwards scan over the table to truncate the free space did not trigger read-ahead (or read-behind) in the file system (which was ext4).
In hindsight, the best thing to do would have been to open a psql session and take an access-share lock on the main table, so that it would prevent the truncation from occurring, leaving cleaned, reusable space at the end of the table to be refilled on the next (successful) run of the bulk load. But it would be nice if the system could do some optimization for me, as hindsight never comes in time.
I did literally the simplest thing I could think of as a proof-of-concept patch, to see if it would actually fix things: I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the regular backward scan. The patch and driving script are attached.
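To make that concrete, here is roughly what the proof of concept does (a sketch only, not the attached patch itself; the "jjfetch" distance and the helper name below are just placeholders): during the backward truncation scan, every jjfetch blocks it jumps back and issues the prefetch requests in ascending block order, so the file system sees a forward pattern it can read ahead on.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch only: prefetch one chunk of blocks in forward order, so the
 * kernel's read-ahead can kick in even though the overall scan over the
 * relation is moving backwards.
 */
static void
prefetch_chunk_forward(Relation rel, BlockNumber start, BlockNumber end)
{
	BlockNumber blk;

	for (blk = start; blk < end; blk++)
		PrefetchBuffer(rel, MAIN_FORKNUM, blk);
}

/*
 * Roughly how the backward truncation scan would use it:
 *
 *	while (blkno > new_rel_pages)
 *	{
 *		if (jjfetch > 0 && blkno % jjfetch == 0)
 *			prefetch_chunk_forward(rel,
 *								   blkno >= jjfetch ? blkno - jjfetch : 0,
 *								   blkno);
 *		blkno--;
 *		... read and inspect block blkno as before ...
 *	}
 */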
It was very successful in this test. I've reported the fastest time to do the truncation for each setting of the prefetch size, out of about 10 runs each, after removing a handful of cases where buffer pins prevented a full-size truncation. (I used the minimum because the server has other duties that would occasionally kick in and cause a spike in time for a few runs.) A setting of 0 disables the feature.
JJFETCH | trunc_time(s) |
0 | 355.78 |
1 | 350.5 |
2 | 277.61 |
4 | 213.9 |
8 | 172.79 |
16 | 138.87 |
32 | 133.02 |
64 | 76.25 |
128 | 47.63 |
256 | 35.28 |
512 | 20.44 |
1024 | 18.6 |
2048 | 15.63 |
4096 | 15.34 |
8192 | 14.72 |
16384 | 13.45 |
32768 | 13.13 |
For context, the forward-scan part of this vacuum took about 80 sec (IQR 75 to 88).
Do we want something like this at all? If so, what would we have to do to implement it for real? I thought that maybe the back-off amount could start at one block and double each time until it hits a maximum, and that the maximum would be either seg_size or 1/8 of effective_cache_size, whichever is lower. (I don't think we would want to prefetch more than a fraction of effective_cache_size, or else it could already have been displaced by the time the scan gets back to it.)
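In code, that auto-tuning might look roughly like this (a sketch under those assumptions; it presumes the existing effective_cache_size GUC, which is measured in blocks and exposed via optimizer/cost.h, and RELSEG_SIZE as the blocks-per-segment constant standing in for seg_size):

#include "postgres.h"
#include "optimizer/cost.h"		/* effective_cache_size */
#include "storage/block.h"

/*
 * Sketch: grow the back-off distance geometrically, capped at the
 * smaller of one segment and 1/8 of effective_cache_size, so prefetched
 * blocks are unlikely to be evicted before the scan reaches them.
 */
static BlockNumber
next_prefetch_distance(BlockNumber current)
{
	BlockNumber cap = Min((BlockNumber) RELSEG_SIZE,
						  (BlockNumber) (effective_cache_size / 8));

	return Min(current * 2, cap);
}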
Also, what would be the best way to drill a hole through bufmgr.c into md.c so that the prefetch could specify an entire range, rather than looping over each individual block?
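For illustration, the hole might look like the sketch below. PrefetchBufferRange and a range-aware mdprefetch() are hypothetical and do not exist today -- PrefetchBuffer()/smgrprefetch()/mdprefetch() each take a single block -- and the naive body here just loops over the existing per-block call; the point of pushing the whole range down into md.c would be to issue one posix_fadvise(POSIX_FADV_WILLNEED) per contiguous range within a segment instead of one call per block.

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical range-prefetch entry point (does not exist today).
 * A real implementation would pass the whole range through smgr into
 * md.c; this fallback just loops over the existing per-block call.
 */
static void
PrefetchBufferRange(Relation reln, ForkNumber forkNum,
					BlockNumber firstBlock, BlockNumber nblocks)
{
	BlockNumber blk;

	for (blk = firstBlock; blk < firstBlock + nblocks; blk++)
		PrefetchBuffer(reln, forkNum, blk);
}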
What would have to be done to detect people running on SSD and disable the feature, if anything?
I'll add this to the next commitfest as a WIP patch.
Cheers,
Jeff
Attachment
On 4/19/15 9:09 PM, Jeff Janes wrote:
> I did literally the simplest thing I could think of as a proof of
> concept patch, to see if it would actually fix things. I just jumped
> back a certain number of blocks occasionally and prefetched them
> forward, then resumed the regular backward scan. The patch and driving
> script are attached.

Shouldn't completely empty pages be set as all-visible in the VM? If so, can't we just find the largest not-all-visible page and move forward from there, instead of moving backwards like we currently do?

For that matter, why do we scan backwards anyway? The comments don't explain it, and we have nonempty_pages as a starting point, so why don't we just scan forward? I suspect that eons ago we didn't have that and just blindly reverse-scanned until we finally hit a non-empty buffer...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Apr 19, 2015 at 10:38 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/19/15 9:09 PM, Jeff Janes wrote:
I did literally the simplest thing I could think of as a proof of
concept patch, to see if it would actually fix things. I just jumped
back a certain number of blocks occasionally and prefetched them
forward, then resumed the regular backward scan. The patch and driving
script are attached.
Shouldn't completely empty pages be set as all-visible in the VM? If so, can't we just find the largest not-all-visible page and move forward from there, instead of moving backwards like we currently do?
If the entire table is all-visible, we would be starting from the beginning, even though the beginning of the table still has read only tuples present.
For that matter, why do we scan backwards anyway? The comments don't explain it, and we have nonempty_pages as a starting point, so why don't we just scan forward? I suspect that eons ago we didn't have that and just blindly reverse-scanned until we finally hit a non-empty buffer...
nonempty_pages is not concurrency safe, as the pages could become used after vacuum passed them over but before the access exclusive lock was grabbed before the truncation scan. But maybe the combination of the two? If it is above nonempty_pages, then anyone who wrote into the page after vacuum passed it must have cleared the VM bit. And currently I think no one but vacuum ever sets VM bit back on, so once cleared it would stay cleared.
In any event nonempty_pages could be used to set the guess as to how many pages (if any) might be worth prefetching, as that is not needed for correctness.
Cheers,
Jeff
On 4/20/15 1:50 AM, Jeff Janes wrote:
> Shouldn't completely empty pages be set as all-visible in the VM? If
> so, can't we just find the largest not-all-visible page and move
> forward from there, instead of moving backwards like we currently do?
>
>
> If the entire table is all-visible, we would be starting from the
> beginning, even though the beginning of the table still has read only
> tuples present.

Except we'd use it in conjunction with nonempty_pages. IIRC, that's set to the last page that vacuum saw data on. If any page after that got written to after vacuum visited it, the VM bit would be cleared. So after we acquire the exclusive lock, AFAICT it's safe to just scan the VM starting with nonempty_pages.

> For that matter, why do we scan backwards anyway? The comments don't
> explain it, and we have nonempty_pages as a starting point, so why
> don't we just scan forward? I suspect that eons ago we didn't have
> that and just blindly reverse-scanned until we finally hit a
> non-empty buffer...
>
>
> nonempty_pages is not concurrency safe, as the pages could become used
> after vacuum passed them over but before the access exclusive lock was
> grabbed before the truncation scan. But maybe the combination of the
> two? If it is above nonempty_pages, then anyone who wrote into the page
> after vacuum passed it must have cleared the VM bit. And currently I
> think no one but vacuum ever sets VM bit back on, so once cleared it
> would stay cleared.

Right.

> In any event nonempty_pages could be used to set the guess as to how
> many pages (if any) might be worth prefetching, as that is not needed
> for correctness.

Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Apr 19, 2015 at 7:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I did literally the simplest thing I could think of as a proof of concept
> patch, to see if it would actually fix things. I just jumped back a certain
> number of blocks occasionally and prefetched them forward, then resumed the
> regular backward scan.

IIRC, this patch introduces optional prefetch for the backward scan, where file system prefetch may not work well. The result is promising!

> Also, what would be the best way to drill a hole through bufmgr.c into md.c
> so that the prefetch could specify an entire range, rather than looping over
> each individual block?

Different from mdread(), which requires a buffer argument, extending mdprefetch() looks natural and safe to me. This shall also lay a foundation for other scan-style optimizations.

> What would have to be done to detect people running on SSD and disable the
> feature, if anything?

There are other call sites of prefetch - so this shall be a topic not just for this optimization.

Regards,
Qingqing
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...
nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the
two? If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.
Right.
In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.
Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.
Right now all the interlocks to make this work seem to be in place (only vacuum and startup can set visibility map bits, and only one vacuum can be in a table at a time). But as far as I can tell, those assumptions are not "baked in", and we have pondered loosening them before.
For example, letting HOT cleanup mark a page as all-visible if it finds it to be such. Now in that specific case it would be OK, as HOT cleanup would not cause a page to become empty (or could it? If an insert on a table with no indexes was rolled back, and HOT cleanup found it and cleaned it up, the page could conceptually become empty--unless we add special code to prevent it), and so the page would have to be below nonempty_pages. But there may be other cases.
And I know other people have mentioned making VACUUM concurrent (although I don't see the value in that myself).
So doing it this way would be hard to beat (scanning a bitmap vs the table itself), but it would also introduce a modularity violation that I am not sure is worth it.
Of course this could always be reverted if its requirements became a problem for a more important change (assuming, of course, that we detected the problem).
Cheers,
Jeff
On Tue, May 26, 2015 at 12:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...
nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the
two? If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.
Right.
In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.
Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.

Right now all the interlocks to make this work seem to be in place (only vacuum and startup can set visibility map bits, and only one vacuum can be in a table at a time). But as far as I can tell, those assumption are not "baked in" and we have pondered loosening them before.

For example, letting HOT clean up mark a page as all-visible if it finds it be such. Now in that specific case it would be OK, as HOT cleanup would not cause a page to become empty (or could it? If an insert on a table with no indexes was rolled back, and hot clean up found it and cleaned it up, it could conceptually become empty--unless we make special code to prevent it), and so the page would have to be below nonempty_pages. But there may be other cases.

And I know other people have mentioned making VACUUM concurrent (although I don't see the value in that myself).

So doing it this way would be hard to beat (scanning a bitmap vs the table itself), but it would also introduce a modularity violation that I am not sure is worth it.

Of course this could always be reverted if its requirements became a problem for a more important change (assuming of course that we detected the problem)
The fatal problem here is that nonempty_pages is unreliable. If vacuum skips all-visible pages, it doesn't necessarily increment nonempty_pages beyond those skipped pages. So if you just rely on nonempty_pages, you will truncate away pages that were already all-visible but are not empty. If we changed it so that it did increment nonempty_pages past the skipped ones, then pages which were all empty and got marked as all-visible without being truncated (say, because a lock could not be acquired, or because there was a non-empty page after them which later became empty) would never get truncated away.
As it is currently, it is not clear what purpose nonempty_pages serves. It is a guardian value which doesn't seem to actually guard anything. At best it prevents you from needing to inspect one page (the page of the guardian value itself) to see if that page is actually empty, only to find that it is not. That hardly seems worthwhile.
We could adopt two nonempty_pages counters, one that fails low on skipped all-visible pages and one that fails high on them, and then fast-truncate down to the high one and do the current page-by-page scan between the low and the high. That seems rather grotesque.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> As it is currently, it is not clear what purpose nonempty_pages serves.

It's to allow an early exit at vacuumlazy.c:271 in case there is no chance of deleting a useful number of pages.

regards, tom lane
On Tue, May 26, 2015 at 12:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...
nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the
two? If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.
Right.
In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.
Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything.

Right now all the interlocks to make this work seem to be in place (only vacuum and startup can set visibility map bits, and only one vacuum can be in a table at a time). But as far as I can tell, those assumption are not "baked in" and we have pondered loosening them before.

For example, letting HOT clean up mark a page as all-visible if it finds it be such. Now in that specific case it would be OK, as HOT cleanup would not cause a page to become empty (or could it? If an insert on a table with no indexes was rolled back, and hot clean up found it and cleaned it up, it could conceptually become empty--unless we make special code to prevent it), and so the page would have to be below nonempty_pages. But there may be other cases.

And I know other people have mentioned making VACUUM concurrent (although I don't see the value in that myself).

So doing it this way would be hard to beat (scanning a bitmap vs the table itself), but it would also introduce a modularity violation that I am not sure is worth it.

Of course this could always be reverted if its requirements became a problem for a more important change (assuming of course that we detected the problem)
Attached is a patch that implements the vm scan for truncation. It introduces a variable to hold the last blkno which was skipped during the forward portion. Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum. Any other process rendering a page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime. So if the bit is still set, the page is still empty without needing to inspect it.
There is still the case of pages which had their visibility bit set by a prior vacuum and then were not inspected by the current one. Once the truncation scan runs into these pages, it falls back to the previous behavior of reading block by block backwards. So there could still be reason to optimize that fallback using forward-reading prefetch.
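As a rough illustration of the logic (this is not the patch text itself; nonempty_pages and skipped_pages are the counters the patch tracks during the forward scan, as quoted later in the thread, and page_is_truncatable() is a hypothetical stand-in for the existing block-by-block inspection):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/* Hypothetical stand-in for the existing page inspection in vacuumlazy.c. */
static bool page_is_truncatable(Relation onerel, BlockNumber blkno);

/*
 * Sketch of the fast path in count_nondeletable_pages(): rel_pages is the
 * current relation size; returns the number of pages to keep.
 */
static BlockNumber
sketch_count_nondeletable(Relation onerel, BlockNumber rel_pages,
						  BlockNumber nonempty_pages,
						  BlockNumber skipped_pages)
{
	BlockNumber trust_vm = Max(nonempty_pages, skipped_pages);
	Buffer		vmbuffer = InvalidBuffer;
	BlockNumber blkno = rel_pages;

	while (blkno > nonempty_pages)
	{
		blkno--;

		/*
		 * Pages at or beyond trust_vm were seen to be empty by this
		 * vacuum.  Only vacuum sets VM bits, and only one vacuum can run
		 * in a table at a time, so any writer since then must have
		 * cleared the bit; if it is still set, the page is still empty.
		 */
		if (blkno >= trust_vm &&
			visibilitymap_test(onerel, blkno, &vmbuffer))
			continue;

		/* Otherwise fall back to reading and inspecting the block. */
		if (!page_is_truncatable(onerel, blkno))
		{
			blkno++;			/* first block that must be kept */
			break;
		}
	}

	if (BufferIsValid(vmbuffer))
		ReleaseBuffer(vmbuffer);

	return blkno;				/* new relation size after truncation */
}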
Using the previously shown test case, this patch reduces the truncation part of the vacuum to 2 seconds.
Cheers,
Jeff
Attachment
On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> Attached is a patch that implements the vm scan for truncation. It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion. Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum. Any other process rendering the
> page nonempty are required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime. So if the bit is still set, the
> page is still empty without needing to inspect it.

The patch looks good and it improves the vacuum truncation speed significantly.

> There is still the case of pages which had their visibility bit set by a
> prior vacuum and then were not inspected by the current one. Once the
> truncation scan runs into these pages, it falls back to the previous
> behavior of reading block by block backwards. So there could still be
> reason to optimize that fallback using forward-reading prefetch.

The case I didn't understand is: how does the current vacuum miss a page which was set by the prior vacuum? The page should be counted either in skipped_pages or in nonempty_pages. Is it possible that a page doesn't come under these two categories and is also not empty?

If the above doesn't exist, then we can directly truncate the relation from the highest block number of either nonempty_pages or skipped_pages to the end of the relation.

Regards,
Hari Babu
Fujitsu Australia
On Wed, Jul 8, 2015 at 9:46 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> Attached is a patch that implements the vm scan for truncation. It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion. Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum. Any other process rendering the
> page nonempty are required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime. So if the bit is still set, the
> page is still empty without needing to inspect it.
The patch looks good and it improves the vacuum truncation speed significantly.
> There is still the case of pages which had their visibility bit set by a
> prior vacuum and then were not inspected by the current one. Once the
> truncation scan runs into these pages, it falls back to the previous
> behavior of reading block by block backwards. So there could still be
> reason to optimize that fallback using forward-reading prefetch.
The case, I didn't understand is that, how the current vacuum misses
the page which
was set by the prior vacuum?
The prior vacuum set them all-visible, but then didn't delete them, either because it got interrupted or because there were still some pages after them (at the time) that were not all empty.
The current vacuum skips them entirely on the forward scan because it thinks it is a waste of time to vacuum all-visible pages. Since it skips them, it doesn't know whether they are empty as well as being all-visible. There is no permanent indicator of the pages being all-empty; there is only the inference based on the current vacuum's counters, protected by the lock held on the table.
The page should be counted either in skipped_pages or in
nonempty_pages. Is it possible
that a page doesn't comes under these two categories and it is not empty also?
If the above doesn't exist, then we can directly truncate the relation
from the highest block
number of either nonempty_pages or skipped_pages to till the end of
the relation.
Right, and that is what this does (provided the vm bit is still set, so it does still have to loop over the vm to verify that it is still set, while it holds the access exclusive lock).
The problem is that the pages between the two counters are not known to be empty, and also not known to be nonempty. Someone has to be willing to go check on those pages at some point, or else they will never be eligible for truncation.
Cheers,
Jeff
On Thu, Jul 9, 2015 at 4:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jul 8, 2015 at 9:46 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> >
>> > There is still the case of pages which had their visibility bit set by a
>> > prior vacuum and then were not inspected by the current one. Once the
>> > truncation scan runs into these pages, it falls back to the previous
>> > behavior of reading block by block backwards. So there could still be
>> > reason to optimize that fallback using forward-reading prefetch.
>>
>> The case, I didn't understand is that, how the current vacuum misses
>> the page which
>> was set by the prior vacuum?
>
> The prior vacuum set them to all visible, but then doesn't delete them.
> Either because it got interrupted or because there were still some pages
> after them (at the time) that were not all empty.
>
> The current vacuum skips them entirely on the forward scan because it thinks
> it is waste of time to vacuum all visible pages. Since it skips them, it
> doesn't know if they are empty as well as being all-visible. There is no
> permanent indicator of the pages being all-empty, there is only the
> inference based on the current vacuum's counters and protected by the lock
> held on the table.
>
>> The page should be counted either in skipped_pages or in
>> nonempty_pages. Is it possible
>> that a page doesn't comes under these two categories and it is not empty
>> also?
>>
>> If the above doesn't exist, then we can directly truncate the relation
>> from the highest block
>> number of either nonempty_pages or skipped_pages to till the end of
>> the relation.
>
> Right, and that is what this does (provided the vm bit is still set, so it
> does still have to loop over the vm to verify that it is still set, while it
> holds the access exclusive lock).

Thanks, I got it. To re-verify the vm bit of a page after getting the access exclusive lock, we have to do the backward scan. I also feel that your prefetch patch is needed to improve performance, since there is still a need for a backward scan to re-verify the vm bits. I will do some performance tests and send you the results.

> The problem is that the pages between the two counters are not known to be
> empty, and also not known to be nonempty. Someone has to be willing to go
> check on those pages at some point, or else they will never be eligible for
> truncation.

Yes, there is no way to identify whether a page is empty or not without reading it.

Regards,
Hari Babu
Fujitsu Australia
On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> I will do some performance tests and send you the results.

Here are the performance results tested on my machine.

                  Head       vm patch    vm+prefetch patch
First vacuum      120 sec    <1 sec      <1 sec
Second vacuum     180 sec    180 sec     30 sec

I did some modifications in the code to skip the vacuum truncation in the first vacuum command. This way I collected the time taken by the second vacuum.

I just combined your vm and prefetch patches into a single vm+prefetch patch, without a GUC. I kept the prefetch at 32 and did the performance test. I chose that prefetch size based on the current buffer access strategy, which is 32 for vacuum presently, instead of a user option. Here I attached the modified patch with both the vm and prefetch logic.

I will do some tests on a machine with SSD and let you know the results. Based on those results, and the impact of prefetch on SSD machines, we can decide whether we need a GUC or not.

Regards,
Hari Babu
Fujitsu Australia
Attachment
On Mon, Jul 13, 2015 at 12:06 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
>> I will do some performance tests and send you the results.
>
> Here are the performance results tested on my machine.
>
>                   Head       vm patch    vm+prefetch patch
> First vacuum      120 sec    <1 sec      <1 sec
> Second vacuum     180 sec    180 sec     30 sec
>
> I did some modifications in the code to skip the vacuum truncation by
> the first vacuum command. This way I collected the second vacuum time
> taken performance.
>
> I just combined your vm and prefetch patch into a single patch
> vm+prefetch patch without a GUC. I kept the prefetch as 32 and did the
> performance test. I chosen prefetch based on the current buffer access
> strategy, which is 32 for vacuum presently instead of an user option.
> Here I attached the modified patch with both vm+prefetch logic.
>
> I will do some tests on a machine with SSD and let you know the
> results. Based on these results, we can decide whether we need a GUC or
> not? based on the impact of prefetch on ssd machines.

Following are the performance readings on a machine with SSD. I increased the pgbench scale factor to 1000 in the test, instead of 500, to show better performance numbers.

                  Head        vm patch    vm+prefetch patch
First vacuum      6.24 sec    2.91 sec    2.91 sec
Second vacuum     6.66 sec    6.66 sec    7.19 sec

There is a small performance impact on SSD with prefetch.

Regards,
Hari Babu
Fujitsu Australia
On Mon, Jul 13, 2015 at 5:16 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 12:06 PM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>>
>>> I will do some performance tests and send you the results.
>>
>> Here are the performance results tested on my machine.
>>
>>                   Head       vm patch    vm+prefetch patch
>> First vacuum      120 sec    <1 sec      <1 sec
>> Second vacuum     180 sec    180 sec     30 sec
>>
>> I did some modifications in the code to skip the vacuum truncation by
>> the first vacuum command. This way I collected the second vacuum time
>> taken performance.
>>
>> I just combined your vm and prefetch patch into a single patch
>> vm+prefetch patch without a GUC. I kept the prefetch as 32 and did the
>> performance test. I chosen prefetch based on the current buffer access
>> strategy, which is 32 for vacuum presently instead of an user option.
>> Here I attached the modified patch with both vm+prefetch logic.
>>
>> I will do some tests on a machine with SSD and let you know the
>> results. Based on these results, we can decide whether we need a GUC or
>> not? based on the impact of prefetch on ssd machines.
>
> Following are the performance readings on a machine with SSD.
> I increased the pgbench scale factor to 1000 in the test instead of 500
> to show a better performance numbers.
>
>                   Head        vm patch    vm+prefetch patch
> First vacuum      6.24 sec    2.91 sec    2.91 sec
> Second vacuum     6.66 sec    6.66 sec    7.19 sec
>
> There is a small performance impact on SSD with prefetch.

The above prefetch overhead is observed with prefetching of 1639345 pages. I feel this overhead is small.

Hi Jeff,

If you are fine with the earlier attached patch, then I will mark this patch as ready for committer, to get a committer's view on the patch.

Regards,
Hari Babu
Fujitsu Australia
On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> Attached is a patch that implements the vm scan for truncation. It introduces a variable to hold the last blkno which was skipped during the forward portion. Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum. Any other process rendering the page nonempty are required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime. So if the bit is still set, the page is still empty without needing to inspect it.
>
One case where this patch can behave differently than the current
code is when, before taking the Exclusive lock on the relation for truncation,
some backend clears the vm bit and then inserts, deletes, and prunes that
page. I think with the patch it will not consider truncating such pages, whereas
the current code will allow truncating them, as it re-verifies each page. I think
such a case would be rare and we might not want to bother about it,
but it is still worth thinking about once.
Some minor points about patch:
count_nondeletable_pages()
{
..
if (PageIsNew(page) || PageIsEmpty(page))
{
/* PageIsNew probably shouldn't happen... */
UnlockReleaseBuffer(buf);
continue;
}
..
}
Why is vmbuffer not freed in the above loop?
count_nondeletable_pages()
{
..
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set. Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
..
}
I think it is better to have spaces in above assignment statement
(near '=' and near '>')
On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> Attached is a patch that implements the vm scan for truncation. It
>> introduces a variable to hold the last blkno which was skipped during the
>> forward portion. Any blocks after both this blkno and after the last
>> inspected nonempty page (which the code is already tracking) must have been
>> observed to be empty by the current vacuum. Any other process rendering the
>> page nonempty are required to clear the vm bit, and no other process can set
>> the bit again during the vacuum's lifetime. So if the bit is still set, the
>> page is still empty without needing to inspect it.
>>
>
> One case where this patch can behave differently then current
> code is, when before taking Exclusive lock on relation for truncation,
> if some backend clears the vm bit and then inserts-deletes-prune that
> page. I think with patch it will not consider to truncate pages whereas
> current code will allow to truncate it as it re-verifies each page. I think
> such a case would be rare and we might not want to bother about it,
> but still worth to think about it once.

Thanks for your review.

Corrected the code so that, instead of returning the blkno after a visibility map check failure, it just continues to verify the page contents as in the earlier approach.

> Some minor points about patch:
>
> count_nondeletable_pages()
> {
> ..
> if (PageIsNew(page) || PageIsEmpty(page))
> {
> /* PageIsNew probably shouldn't happen... */
> UnlockReleaseBuffer(buf);
> continue;
> }
> ..
> }
>
> Why vmbuffer is not freed in above loop?

In the above check, we are just continuing to the next blkno; at the end of the loop the vmbuffer is freed.

> count_nondeletable_pages()
> {
> ..
> + /*
> + * Pages that were inspected and found to be empty by this vacuum run
> + * must still be empty if the vm bit is still set. Only vacuum sets
> + * vm bits, and only one vacuum can exist in a table at one time.
> + */
> + trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
> + vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
> ..
> }
>
> I think it is better to have spaces in above assignment statement
> (near '=' and near '>')

Corrected.

Here I attached updated patches:
1) without prefetch logic.
2) with combined vm and prefetch logic.

I marked the patch as ready for committer.

Regards,
Hari Babu
Fujitsu Australia
Attachment
On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > One case where this patch can behave differently then current
> > code is, when before taking Exclusive lock on relation for truncation,
> > if some backend clears the vm bit and then inserts-deletes-prune that
> > page. I think with patch it will not consider to truncate pages whereas
> > current code will allow to truncate it as it re-verifies each page. I think
> > such a case would be rare and we might not want to bother about it,
> > but still worth to think about it once.
>
> Thanks for your review.
>
> corrected the code as instead of returning the blkno after visibility map
> check failure, the code just continue to verify the contents as
> earlier approach.
>
Okay, I think this should work.
>
> Here I attached updated patches,
> 1) without prefetch logic.
> 2) with combined vm and prefetch logic.
>
I think it is better to just get the first patch in, as that in itself is a
clear win, and then we can evaluate the second one (prefetching during
truncation) separately. I think after the first patch the use case for doing
prefetch seems less beneficial, and I think it could hurt by loading
unneeded pages into the OS buffer cache (say you have prefetched 32 pages
and, after inspecting the first page, the scan quits the loop because that
page is non-empty; or the page it prefetched for had its visibility map bit
cleared, but the others still have theirs set).
Having said that, I am not against prefetching in this scenario, as it
can help in more cases than it could hurt, but I think it will be better
to evaluate that separately.
On Thu, Jul 16, 2015 at 10:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> Here I attached updated patches,
>> 1) without prefetch logic.
>> 2) with combined vm and prefetch logic.
>
> I think it is better to just get the first patch in as that in itself is a
> clear win and then we can evaluate the second one (prefetching during
> truncation) separately. I think after first patch, the use case for doing
> prefetch seems to be less beneficial and I think it could hurt by loading
> not required pages (assume you have prefetched 32 pages and after
> inspecting first page, it decides to quit the loop as that is non-empty page
> and other is when it has prefetched just because for page the visibility map
> bit was cleared, but for others it is set) in OS buffer cache.

Yes, in the above cases the prefetch is an overhead. I am not sure how frequently such scenarios occur in real-world workloads.

> Having said that, I am not against prefetching in this scenario as that
> can help in more cases then it could hurt, but I think it will be better
> to evaluate that separately.

Yes, the prefetch patch works better in cases where the first vacuum skips the truncation because of lock waiters. If such cases are common in real-world workloads, then the prefetch patch is needed.

Regards,
Hari Babu
Fujitsu Australia
On 16 July 2015 at 06:51, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
--
I marked the patch as ready for committer.
The most recent results seem to indicate that the prefetch isn't worth pursuing, but the vm patch is. Can someone repeat the perf tests on something larger so we can see when/if there is a benefit?
Jeff, can you add detailed comments to explain the theory of operation of these patches? The patches add the code, but don't say why. We will forget, so lets put the detail in there now please. Thanks.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation. It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion. Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum. Any other process rendering the
> page nonempty are required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime. So if the bit is still set, the
> page is still empty without needing to inspect it.

Urgh. So if we do this, that forever precludes having HOT pruning set the all-visible bit. At the least we'd better document that carefully so that nobody breaks it later. But I wonder if there isn't some better approach, because I would certainly rather that we didn't foreclose the possibility of doing something like that in the future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation. It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion. Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum. Any other process rendering the
> page nonempty are required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime. So if the bit is still set, the
> page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it. I don't think this change would grow tentacles across the code that make it hard to revert; you would just have to take the performance hit (and by that time, maybe HDD will truly be dead anyway and we won't care anymore). But yes, that is definitely a downside. HOT pruning is one example, but one could also envision having someone (bgwriter?) set vm bits on unindexed tables. Or if we invent some efficient way to know that no expiring tids for a certain block range are stored in indexes, other jobs could also set the vm bit on indexed tables. Or parallel vacuums in the same table, not that I really see a reason to have those.
At the least we'd better document that carefully
so that nobody breaks it later. But I wonder if there isn't some
better approach, because I would certainly rather that we didn't
foreclose the possibility of doing something like that in the future.
But where do we document it (other than in-place)? README.HOT doesn't seem sufficient, and there is no README.vm.
I guess add an "Assert(InRecovery || running_a_vacuum);" to the visibilitymap_set with a comment there, except that I don't know how to implement running_a_vacuum so that it covers manual vacs as well as autovac. Perhaps assert that we hold a SHARE UPDATE EXCLUSIVE on rel?
Cheers,
Jeff
On Wed, Jul 22, 2015 at 12:11 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Urgh. So if we do this, that forever precludes having HOT pruning set
>> the all-visible bit.
>
> I wouldn't say forever, as it would be easy to revert the change if
> something more important came along that conflicted with it.

Well, the people who want this change would likely object to reverting it later.

> I don't think
> this change would grow tentacles across the code that make it hard to
> revert, you would just have to take the performance hit (and by that time,
> maybe HDD will truly be dead anyway and so we don't care anymore). But yes,
> that is definitely a downside. HOT pruning is one example, but also one
> could envision having someone (bgwriter?) set vm bits on unindexed tables.
> Or if we invent some efficient way to know that no expiring tids for a
> certain block range are stored in indexes, other jobs could also set the vm
> bit on indexed tables. Or parallel vacuums in the same table, not that I
> really see a reason to have those.

Yes, good points. (I think we will need parallel vacuum to cope with TB-sized tables, but that's another story.)

>> At the least we'd better document that carefully
>> so that nobody breaks it later. But I wonder if there isn't some
>> better approach, because I would certainly rather that we didn't
>> foreclose the possibility of doing something like that in the future.
>
> But where do we document it (other than in-place)? README.HOT doesn't seem
> sufficient, and there is no README.vm.

visibilitymap_set?

> I guess add an "Assert(InRecovery || running_a_vacuum);" to the
> visibilitymap_set with a comment there, except that I don't know how to
> implement running_a_vacuum so that it covers manual vacs as well as autovac.
> Perhaps assert that we hold a SHARE UPDATE EXCLUSIVE on rel?

I care more about the comment than I do about the Assert().

> The advantage of the other approach, just force kernel read-ahead to work
> for us, is that it doesn't impose any of these restrictions on future
> development. The disadvantage is that I don't know how to auto-tune it, or
> auto-disable it for SSD, and it will never be as quite as efficient.

I have to say that I like the prefetch approach. I don't know what to do about the fact that it loses to the VM-bit based approach, but I think it's a bad bet that we will never care about setting visibility map bits anyplace other than VACUUM.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-07-23 11:06:25 -0400, Robert Haas wrote:
> I don't know what to do about the fact that it loses to the VM-bit
> based approach, but I think it's a bad bet that we will never care
> about setting visibility map bits anyplace other than VACUUM.

+1

A background process that tries to hint all pages shortly before they're written out would be truly awesome. And would very likely set hint bits...

Andres
On 7/22/15 11:11 AM, Jeff Janes wrote:
> I guess add an "Assert(InRecovery || running_a_vacuum);" to
> the visibilitymap_set with a comment there, except that I don't know how
> to implement running_a_vacuum so that it covers manual vacs as well as
> autovac.

static bool RunningVacuum;

? It only needs to be per-backend, right?

If it is that simple then my $0.02 is we should have the assert, not just a comment.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 22 July 2015 at 14:59, Robert Haas <robertmhaas@gmail.com> wrote:
--
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
What is the reason why we don't do that already? Surely it's a one-liner?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jul 25, 2015 at 4:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 July 2015 at 14:59, Robert Haas <robertmhaas@gmail.com> wrote:
>> Urgh. So if we do this, that forever precludes having HOT pruning set
>> the all-visible bit.
>
> What is the reason why we don't do that already? Surely its a one liner?

I wish! It actually doesn't look simple to me to modify heap_page_prune to know whether any not-all-visible items remain at the end.

It's definitely true that, in order for marking the page all-visible to be a possibility, we need to reach heap_page_prune_execute() with ndead == 0. But that's not enough: we also need to know that any tuple that survived the prune operation (that is, it wasn't redirected or marked unused) has a new-enough xmin, which isn't tracked anywhere. vacuumlazy.c solves that problem by making a second pass over all the tuples on the page to determine whether the page is all-visible, but that seems pretty inefficient, and I think heap_page_prune() is already a CPU hog on some workloads. (Didn't Heikki have a patch to improve that?)

Another sticky wicket is that, as things stand today, we'd have to insert a second WAL record to mark the page all-visible, which would add probably-significant overhead. We could probably modify the format of the record we're already emitting for pruning to carry a flag indicating whether to additionally set PD_ALL_VISIBLE.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
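For what it's worth, the shape being discussed might amount to no more than this (hypothetical only: heap_page_prune() has no such output parameter today, and the prune WAL record carries no such flag):

/*
 * Hypothetical additions only -- neither exists today.
 *
 * 1. Let pruning report whether every surviving tuple on the page is
 *    visible to everyone, so the caller could set PD_ALL_VISIBLE and the
 *    VM bit without a second pass over the page.
 */
extern int heap_page_prune(Relation relation, Buffer buffer,
						   TransactionId OldestXmin, bool report_stats,
						   TransactionId *latestRemovedXid,
						   bool *all_visible);	/* hypothetical out parameter */

/*
 * 2. Carry one extra flag in the existing prune WAL record instead of
 *    emitting a second record just to set PD_ALL_VISIBLE.
 */
#define XLH_CLEAN_SET_ALL_VISIBLE	0x01	/* hypothetical flag bit */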
On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
--
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation. It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion. Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum. Any other process rendering the
> page nonempty are required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime. So if the bit is still set, the
> page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it.
I think what is being said here is that someone is already using this technique, or if not, then we actively want to encourage them to do so as an extension or as a submission to core.
In that case, I think the rely-on-VM technique sinks again, sorry Jim, Jeff. Probably needs code comments added.
That does still leave the prefetch technique, so all is not lost.
Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 7/27/15 10:39 AM, Robert Haas wrote:
> But that's not enough: we also need to know that any tuple that
> survived the prune operation (that is, it wasn't redirected or marked
> unused) has a new-enough xmin, which isn't tracked anywhere.

Wouldn't that be old-enough xmin?

heap_prune_chain() is already calling HeapTupleSatisfiesVacuum, so it should be able to figure out if the page is all visible without a lot of extra work, and pass that info back to heap_page_prune (which would then pass it down to _execute()). Definitely not a one-liner though.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Attached is a patch that implements the vm scan for truncation. It
> introduces a variable to hold the last blkno which was skipped during the
> forward portion. Any blocks after both this blkno and after the last
> inspected nonempty page (which the code is already tracking) must have been
> observed to be empty by the current vacuum. Any other process rendering the
> page nonempty are required to clear the vm bit, and no other process can set
> the bit again during the vacuum's lifetime. So if the bit is still set, the
> page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it.
I think what is being said here is that someone is already using this technique, or if not, then we actively want to encourage them to do so as an extension or as a submission to core.
In that case, I think the rely-on-VM technique sinks again, sorry Jim, Jeff. Probably needs code comments added.
Sure, that sounds like the consensus. The VM method was very efficient, but I agree it is pretty fragile and restricting.
That does still leave the prefetch technique, so all is not lost.
Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks.
I probably won't get back to it this commit fest, so it can be set to returned with feedback. But if anyone has good ideas for how to set the stride (or detect that it is on SSD and so is pointless to try) I'd love to hear about them anytime.
Cheers,
Jeff
On 3 August 2015 at 17:18, Jeff Janes <jeff.janes@gmail.com> wrote:
--
That does still leave the prefetch technique, so all is not lost.
Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks.
I probably won't get back to it this commit fest, so it can be set to returned with feedback. But if anyone has good ideas for how to set the stride (or detect that it is on SSD and so is pointless to try) I'd love to hear about them anytime.
I've Returned With Feedback, as you suggest.
Given your earlier numbers, I'd just pick a constant value of 128, to keep it simple. That balances out the various factors of small/large tables and the uncertainty that we will be able to truncate the whole chunk of blocks. I'd like to see tests on SSD before commit, but I hope and expect the slightly negative results with SSD won't be substantiated on larger tests.
Kept simple, it's a patch with a clear win in a restricted use case, and I would be happy to commit that sometime in the future.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 4, 2015 at 2:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com>
>>> wrote:
>>>>
>>>> On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com>
>>>> wrote:
>>>> > Attached is a patch that implements the vm scan for truncation. It
>>>> > introduces a variable to hold the last blkno which was skipped during
>>>> > the forward portion. Any blocks after both this blkno and after the
>>>> > last inspected nonempty page (which the code is already tracking) must
>>>> > have been observed to be empty by the current vacuum. Any other
>>>> > process rendering the page nonempty are required to clear the vm bit,
>>>> > and no other process can set the bit again during the vacuum's
>>>> > lifetime. So if the bit is still set, the page is still empty without
>>>> > needing to inspect it.
>>>>
>>>> Urgh. So if we do this, that forever precludes having HOT pruning set
>>>> the all-visible bit.
>>>
>>> I wouldn't say forever, as it would be easy to revert the change if
>>> something more important came along that conflicted with it.
>>
>> I think what is being said here is that someone is already using this
>> technique, or if not, then we actively want to encourage them to do so as an
>> extension or as a submission to core.
>>
>> In that case, I think the rely-on-VM technique sinks again, sorry Jim,
>> Jeff. Probably needs code comments added.
>
> Sure, that sounds like the consensus. The VM method was very efficient, but
> I agree it is pretty fragile and restricting.
>
>> That does still leave the prefetch technique, so all is not lost.
>>
>> Can we see a patch with just prefetch, probably with a simple choice of
>> stride? Thanks.
>
> I probably won't get back to it this commit fest, so it can be set to
> returned with feedback. But if anyone has good ideas for how to set the
> stride (or detect that it is on SSD and so is pointless to try) I'd love to
> hear about them anytime.

I found the following way to determine whether a data file resides on a disk or an SSD.

1. Get the device that the table's data file is mapped to, using the following or similar:

   df -P "filename" | awk 'NR==2{print $1}'

2. If the device is of type /dev/sd*, treat it as a disk system and proceed with the prefetch optimization.

3. If we are not able to find the device details directly, then we need to get the information from the mapping system. Usually the devices are mapped like the following:

   /dev/mapper/v** points to ../dm-*

4. Get the name of the "dm-*" from the above details and check whether it is an SSD or not with the following file:

   /sys/block/dm-*/queue/rotational

5. If the value is 0 then it is an SSD drive; 1 means a disk drive.

The above procedure works only on Linux. I didn't check other operating systems yet. Is it worth considering?

Regards,
Hari Babu
Fujitsu Australia
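For illustration, steps 4 and 5 might boil down to a check like the one below (a Linux-only sketch; it assumes the caller has already resolved the tablespace path to a block device name such as "sda" or "dm-0", which is exactly the awkward part in steps 1-3):

#include <stdio.h>
#include <stdbool.h>

/*
 * Sketch: return true if the named block device reports itself as
 * rotational (spinning disk), false if it looks like an SSD.  When in
 * doubt, assume rotational so the prefetch optimization stays enabled.
 */
static bool
device_is_rotational(const char *devname)
{
	char	path[128];
	FILE   *f;
	int		c;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", devname);
	f = fopen(path, "r");
	if (f == NULL)
		return true;			/* can't tell; assume spinning disk */

	c = fgetc(f);
	fclose(f);

	return (c != '0');			/* "0" => SSD, "1" => rotational */
}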
On Mon, Sep 28, 2015 at 2:05 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I found the following way to determine whether a data file resides on a
> disk or an SSD.
>
> 1. Get the device that the table's data file is mapped to, using the
> following or similar:
>
>    df -P "filename" | awk 'NR==2{print $1}'
>
> 2. If the device is of type /dev/sd*, treat it as a disk system and
> proceed with the prefetch optimization.
>
> 3. If we are not able to find the device details directly, then we need
> to get the information from the mapping system. Usually the devices are
> mapped like the following:
>
>    /dev/mapper/v** points to ../dm-*
>
> 4. Get the name of the "dm-*" from the above details and check whether
> it is an SSD or not with the following file:
>
>    /sys/block/dm-*/queue/rotational
>
> 5. If the value is 0 then it is an SSD drive; 1 means a disk drive.
>
> The above procedure works only on Linux. I didn't check other operating
> systems yet. Is it worth considering?

No. If we need to have the behavior depend on the hardware, it should be a GUC or tablespace option or reloption, not some kind of crazy OS-dependent discovery.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company