Thread: Removing PD_ALL_VISIBLE

Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
Follow up to discussion:
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php

I worked out a patch that replaces PD_ALL_VISIBLE with calls to
visibilitymap_test. It rips out a lot of complexity, with a net drop of
about 300 lines (not a lot, but some of that code is pretty complex).

The patch is quite rough, and I haven't yet added the code to keep the
VM buffer pins around longer, but I still don't see any major problems.
I'm fairly certain that scans will not be adversely affected, and I
think that inserts/updates/deletes should be fine as well.

The main worry I have with inserts/updates/deletes is that there will be
less spatial locality for allocating new buffers or for modifying
existing buffers. So keeping a pinned VM buffer might not help as much,
because it might need to pin a different one, anyway.
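
To give a rough idea of the shape (hand-waving here, not the patch
verbatim; rs_vmbuffer is just an illustrative name for the cached pin in
the scan descriptor):

/*
 * Where heapgetpage() currently tests the page-level hint (roughly
 * "all_visible = PageIsAllVisible(dp) && ..."), the idea is to ask the
 * visibility map instead.  visibilitymap_test() pins the covering VM
 * page into the passed-in buffer if it isn't pinned already.
 */
static bool
page_is_all_visible(HeapScanDesc scan, BlockNumber blkno)
{
	return visibilitymap_test(scan->rs_rd, blkno, &scan->rs_vmbuffer);
}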

Adding to January commitfest, as it's past 11/15.

Regards,
    Jeff Davis

Attachment

Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Wed, 2012-11-21 at 18:25 -0800, Jeff Davis wrote:
> Follow up to discussion:
> http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php
>
> I worked out a patch that replaces PD_ALL_VISIBLE with calls to
> visibilitymap_test. It rips out a lot of complexity, with a net drop of
> about 300 lines (not a lot, but some of that code is pretty complex).

Updated patch attached.

Now it tries to keep the VM buffer pinned during scans, inserts,
updates, and deletes. This should avoid increased contention pinning the
VM pages, but performance tests are required.

For updates, it currently only tries to hold a pin on the VM buffer for
the page of the original tuple. For HOT updates, that's always the same
as the new buffer anyway. For cold updates, we could also try to keep a
pin on the buffer for the new tuple, but right now I don't see an
obvious need for that complexity. It may plausibly be a problem when
doing a bulk update on a freshly-loaded table.
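
Most of the pin-keeping falls out of the existing visibilitymap_pin()
behavior, which reuses a still-valid pin and otherwise swaps it for the
right one. Roughly (the surrounding variable names are illustrative, not
lifted from the patch):

Buffer		vmbuffer = InvalidBuffer;	/* carried across the whole operation */
BlockNumber	blk;

for (blk = start_blk; blk < end_blk; blk++)
{
	/* reuses the existing pin unless blk crosses into a new ~512MB region */
	visibilitymap_pin(relation, blk, &vmbuffer);

	/* ... read or modify heap block blk, consulting/clearing its VM bit ... */
}

if (BufferIsValid(vmbuffer))
	ReleaseBuffer(vmbuffer);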

It occurred to me that it might be difficult to test this patch without
a fairly large test case. A big assumption of my patch is that there
will be locality of access (and the VM page you already have a pin on is
likely to be needed the next time), which is obvious during a scan but
not so obvious during I/U/D. But a single 8K VM page covers 8192 * 8 =
~64K heap pages, or about 512MB of data. So anything less than that
means that there is only one VM page, and locality is trivial... it
seems like any test on a table less than 5GB (ten VM pages) would not be
fair.

Then again, if a 5GB table is being randomly accessed, an extra pin is
unlikely to matter. Also, without locality, the contention would not be
nearly as bad either. I'm still pretty unclear what the "worst case" for
this patch is supposed to look like.

Regards,
    Jeff Davis


Attachment

Re: Removing PD_ALL_VISIBLE

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Now it tries to keep the VM buffer pinned during scans, inserts,
> updates, and deletes. This should avoid increased contention pinning the
> VM pages, but performance tests are required.
> ...
> Then again, if a 5GB table is being randomly accessed, an extra pin is
> unlikely to matter. Also, without locality, the contention would not be
> nearly as bad either. I'm still pretty unclear what the "worst case" for
> this patch is supposed to look like.

I'd be worried about the case of a lot of sessions touching a lot of
unrelated tables.  This change implies doubling the number of buffers
that are held pinned by any given query, and the distributed overhead
from that (eg, adding cycles to searches for free buffers) is what you
ought to be afraid of.

Another possibly important point is that reducing the number of
pin/unpin cycles for a given VM page might actually hurt the chances of
it being found in shared buffers, because IIRC the usage_count is bumped
once per pin/unpin.  That algorithm is based on the assumption that
buffer pins are not drastically different in lifespan, but I think you
just broke that for pins on VM pages.

I'm not particularly concerned about devising solutions for these
problems, though, because I think this idea is a loser from the get-go;
the increase in contention for VM pages is alone going to destroy any
possible benefit.
        regards, tom lane



Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote:
> I'd be worried about the case of a lot of sessions touching a lot of
> unrelated tables.  This change implies doubling the number of buffers
> that are held pinned by any given query, and the distributed overhead
> from that (eg, adding cycles to searches for free buffers) is what you
> ought to be afraid of.

That's a good point. "Doubling" might be an exaggeration if indexes are
involved, but it's still a concern. The cost of this might be difficult
to measure though.

> Another possibly important point is that reducing the number of
> pin/unpin cycles for a given VM page might actually hurt the chances of
> it being found in shared buffers, because IIRC the usage_count is bumped
> once per pin/unpin.  That algorithm is based on the assumption that
> buffer pins are not drastically different in lifespan, but I think you
> just broke that for pins on VM pages.

If doing a bunch of simple key lookups using an index, then the root of
the index page is only pinned once per query, but we expect that to stay
in shared buffers. I see the VM page as about the same: one pin per
query (or maybe a couple for large tables).

I don't see how the lifetime of the pin matters a whole lot in this
case; it's more about the rate of pins/unpins, right?

> I'm not particularly concerned about devising solutions for these
> problems, though, because I think this idea is a loser from the get-go;
> the increase in contention for VM pages is alone going to destroy any
> possible benefit.

Your intuition here is better than mine, but I am still missing
something here. If we keep the buffer pinned, then there will be very
few pin/unpin cycles here, so I don't see where the contention would
come from (any more than there is contention pinning the root of an
index).

Do you still think I need a shared lock here or something? If so, then
index-only scans have a pretty big problem right now, too.

I'll try to quantify some of these effects you've mentioned, and see how
the numbers turn out. I'm worried that I'll need more than 4 cores to
show anything though, so perhaps someone with a many-core box would be
interested to test this out?

Regards,
    Jeff Davis




Re: Removing PD_ALL_VISIBLE

From
Robert Haas
Date:
On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Your intuition here is better than mine, but I am still missing
> something here. If we keep the buffer pinned, then there will be very
> few pin/unpin cycles here, so I don't see where the contention would
> come from (any more than there is contention pinning the root of an
> index).

Based on previous measurements, I think there *is* contention pinning
the root of an index.  Currently, I believe it's largely overwhelmed
by contention from other sources, such as the buffer manager lwlocks
and the very-evil ProcArrayLock.  However, I believe that as we fix
those problems, this will start to percolate up towards the top of the
heap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Removing PD_ALL_VISIBLE

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote:
>> Another possibly important point is that reducing the number of
>> pin/unpin cycles for a given VM page might actually hurt the chances of
>> it being found in shared buffers, because IIRC the usage_count is bumped
>> once per pin/unpin.

> If doing a bunch of simple key lookups using an index, then the root of
> the index page is only pinned once per query, but we expect that to stay
> in shared buffers. I see the VM page as about the same: one pin per
> query (or maybe a couple for large tables).

Hmmm ... that seems like a valid analogy.  I may be worried about
nothing as far as this point goes.

> Do you still think I need a shared lock here or something? If so, then
> index-only scans have a pretty big problem right now, too.

There's still the issue of whether the IOS code is safe in machines with
weak memory ordering.  I think that it probably is safe, but the
argument for it in the current code comment is wrong; most likely, a
correct argument has to depend on read/write barriers associated with
taking snapshots.  I think what you ought to do is work through that,
fix the existing comment, and then see whether the same argument works
for what you want to do.
        regards, tom lane



Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Mon, 2012-11-26 at 16:10 -0500, Tom Lane wrote:
> There's still the issue of whether the IOS code is safe in machines with
> weak memory ordering.  I think that it probably is safe, but the
> argument for it in the current code comment is wrong; most likely, a
> correct argument has to depend on read/write barriers associated with
> taking snapshots.  I think what you ought to do is work through that,
> fix the existing comment, and then see whether the same argument works
> for what you want to do.

As a part of the patch, I did change the comment, and here's what I came
up with:
 * Note on Memory Ordering Effects: visibilitymap_test does not lock
 * the visibility map buffer, and therefore the result we read here
 * could be slightly stale.  However, it can't be stale enough to
 * matter.
 *
 * We need to detect clearing a VM bit due to an insert right away,
 * because the tuple is present in the index page but not visible. The
 * reading of the TID by this scan (using a shared lock on the index
 * buffer) is serialized with the insert of the TID into the index
 * (using an exclusive lock on the index buffer). Because the VM bit is
 * cleared before updating the index, and locking/unlocking of the
 * index page acts as a full memory barrier, we are sure to see the
 * cleared bit if we see a recently-inserted TID.
 *
 * Deletes do not update the index page (only VACUUM will clear out the
 * TID), so the clearing of the VM bit by a delete is not serialized
 * with this test below, and we may see a value that is significantly
 * stale. However, we don't care about the delete right away, because
 * the tuple is still visible until the deleting transaction commits or
 * the statement ends (if it's our transaction). In either case, the
 * lock on the VM buffer will have been released (acting as a write
 * barrier) after clearing the bit. And for us to have a snapshot that
 * includes the deleting transaction (making the tuple invisible), we
 * must have acquired ProcArrayLock after that time, acting as a read
 * barrier.
 *
 * It's worth going through this complexity to avoid needing to lock
 * the VM buffer, which could cause significant contention.

And I updated the comment in visibilitymap.c as well (reformatted for
this email):

"To test a bit in the visibility map, most callers should have a pin on
the VM buffer, and at least a shared lock on the data buffer. Any
process that clears the VM bit must have an exclusive lock on the data
buffer, so that will serialize access to the appropriate bit. Because
lock acquisition and release are full memory barriers, then there is no
danger of seeing the state of the bit before it was last cleared.
Callers that don't have the data buffer yet, such as an index only scan
or a VACUUM that is skipping pages, must handle the concurrency as
appropriate."

Regards,
    Jeff Davis




Re: Removing PD_ALL_VISIBLE

From
Merlin Moncure
Date:
On Mon, Nov 26, 2012 at 3:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> Your intuition here is better than mine, but I am still missing
>> something here. If we keep the buffer pinned, then there will be very
>> few pin/unpin cycles here, so  I don't see where the contention would
>> come from (any more than there is contention pinning the root of an
>> index).
>
> Based on previous measurements, I think there *is* contention pinning
> the root of an index.  Currently, I believe it's largely overwhelmed
> by contention from other sources, such as the buffer manager lwlocks
> and the very-evil ProcArrayLock.  However, I believe that as we fix
> those problems, this will start to percolate up towards the top of the
> heap.

Yup -- it (buffer pin contention on high-traffic buffers) has been
caught in the wild -- just maintaining the pin count was enough to do it
in at least one documented case.  Pathological workloads demonstrate
contention today and there's no good reason to assume it's limited to
index root nodes -- I'm strongly suspicious that buffer spinlock issues
are behind some other malfeasance we've seen recently.

merlin



Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Mon, 2012-11-26 at 16:55 -0600, Merlin Moncure wrote:
> > Based on previous measurements, I think there *is* contention pinning
> > the root of an index.  Currently, I believe it's largely overwhelmed
> > by contention from other sources, such as the buffer manager lwlocks
> > and the very-evil ProcArrayLock.  However, I believe that as we fix
> > those problems, this will start to percolate up towards the top of the
> > heap.
> 
> Yup -- it (buffer pin contention on high traffic buffers) been caught
> in the wild -- just maintaining the pin count was enough to do it in
> at least one documented case.  Pathological workloads demonstrate
> contention today and there's no good reason to assume it's limited
> index root nodes -- i'm strongly suspicious buffer spinlock issues are
> behind some other malfeasance we've seen recently.

I tried for quite a while to show any kind of performance difference
between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24
if you count HT).

Three patches in question:
  1. Current unpatched master
  2. patch that naively always checks the VM page, pinning and unpinning
each time
  3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in
this patch though -- new version forthcoming)

I tested from 1 to 64 concurrent connections.

One test was just concurrent scans of a ~400MB table.

The other test was a DELETE FROM foo WHERE ctid BETWEEN '(N,0)' AND
'(N,256)' where N is the worker number in the test program. The table
here is only about 2 MB. The idea is that the scan will happen quickly,
leading to many quick deletes, but the deletes won't actually touch the
same pages (aside from the VM page). So, this was designed to be
uncontended except for pinning the VM page.

On the scan test, it was really hard to see any difference in the test
noise, but I may have seen about a 3-4% degradation between patch #1 and
patch #2 at higher concurrencies. It was difficult for me to reproduce
this result -- it usually wouldn't show up. I didn't see any difference
between patch #1 and patch #3.

On the delete test I detected no difference between #1 and #2 at all.

I think someone with access to a larger box may need to test this. Or,
if someone has a more specific suggestion about how I can create a worst
case, then let me know.

Regards,
    Jeff Davis




Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Fri, 2012-11-30 at 13:16 -0800, Jeff Davis wrote:
> I tried for quite a while to show any kind of performance difference
> between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24
> if you count HT).
>
> Three patches in question:
>   1. Current unpatched master
>   2. patch that naively always checks the VM page, pinning and unpinning
> each time
>   3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in
> this patch though -- new version forthcoming)

New patch attached.

Nathan Boley kindly lent me access to a 64-core box, and that shows a
much more interesting result. The previous test (on the 12-core)
basically showed no difference between any of the patches.

Now, I see why on the 64 core box: the interesting region seems to be
around 32 concurrent connections.

The left column is the concurrency, and the right is the runtime. This
test was for concurrent scans of a 350MB table (each process did 4 scans
and quit). Test program attached.

Patch 1 (scan test):

001 004.299533
002 004.434378
004 004.708533
008 004.518470
012 004.487033
016 004.513915
024 004.765459
032 006.425780
048 007.089146
064 007.908850
072 009.461419
096 013.098646
108 015.278592
128 019.797206

Patch 2 (scan test):

001 004.385206
002 004.596340
004 004.616684
008 004.832248
012 004.858336
016 004.689959
024 005.016797
032 006.857642
048 012.049407
064 025.774772
072 032.680710
096 059.147500
108 083.654806
128 120.350200

Patch 3 (scan test):

001 004.464991
002 004.555595
004 004.562364
008 004.649633
012 004.628159
016 004.518748
024 004.768348
032 004.834177
048 007.003305
064 008.242714
072 009.732261
096 013.231056
108 014.996977
128 020.488570

As you can see, patch #2 starts to show a difference at around 32 and
completely falls over by 48 connections. This is expected because it's
the naive approach that pins the VM page every time it needs it.

Patch #1 and #3 are effectively the same, subsequent runs (and with more
measurements around concurrency 32) show that the differences are just
noise (which seems to be greater around the inflection point of 32). All
of the numbers that seem to show any difference can end up with patch #1
better or patch #3 better, depending on the run.

I tried the delete test, too, but I still couldn't see any difference.
(I neglected to mention in my last email: I aborted after each delete so
that it would be repeatable). The inflection point there is
significantly lower, so I assume it must be contending over something
else. I tried making the table unlogged to see if that would change
things, but it didn't change much. This test only scales linearly to
about 8 or so. Or, there could be something wrong with my test.

So, I conclude that contention is certainly a problem for scans for
patch #2, but patch #3 seems to fix that completely by holding the
buffer pins. The deletes are somewhat inconclusive, but I just can't see
a difference.

Holding more pins does have a distributed cost in theory, as Tom points
out, but I don't know where to begin testing that. We'll have to make a
decision between (a) maintaining the extra complexity and doing the
extra page writes involved with PD_ALL_VISIBLE; or (b) holding onto one
extra pin per table being scanned. Right now, if PD_ALL_VISIBLE did not
exist, it would be pretty hard to justify putting it in as far as I can
tell.

Regards,
    Jeff Davis

Attachment

Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
Rebased patch attached. No significant changes.

Regards,
    Jeff Davis

Attachment

Re: Removing PD_ALL_VISIBLE

From
Simon Riggs
Date:
On 17 January 2013 03:02, Jeff Davis <pgsql@j-davis.com> wrote:

> Rebased patch attached. No significant changes.

Jeff, can you summarise/collate why we're doing this, what concerns it
raises and how you've dealt with them? That will help decide whether
to commit.

Thanks

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing PD_ALL_VISIBLE

From
Pavan Deolasee
Date:
On Thu, Jan 17, 2013 at 2:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 17 January 2013 03:02, Jeff Davis <pgsql@j-davis.com> wrote:
>
> > Rebased patch attached. No significant changes.
>
> Jeff, can you summarise/collate why we're doing this, what concerns it
> raises and how you've dealt with them? That will help decide whether
> to commit.
>

+1. On another thread "Set visibility map bit after HOT prune", Robert
mentioned that its not such a good idea to remove the PD_ALL_VISIBLE
flag because it helps us to reduce the contention on the VM page,
especially when we need to reset the VM bit. Here is an excerpt from
Robert's comment that thread:

"Sure, but you're zipping rather blithely past the disadvantages of
such an approach.  Jeff Davis recently proposed getting rid of
PD_ALL_VISIBLE, and Tom and I both expressed considerable skepticism
about that; this proposal has the same problems.  One of the major
benefits of PD_ALL_VISIBLE is that, when it isn't set, inserts,
updates, and deletes to the page can ignore the visibility map.  That
means that a server under heavy concurrency is much less likely to
encounter contention on the visibility map blocks.  Now, maybe that's
not really a problem, but I sure haven't seen enough evidence to make
me believe it.  If it's really true that PD_ALL_VISIBLE needn't fill
this role, then Heikki wasted an awful lot of time implementing it,
and I wasted an awful lot of time keeping it working when I made the
visibility map crash-safe for IOS.  That could be true, but I tend to
think it isn't."

May be you've already addressed that concern with the proven
performance numbers, but I'm not sure.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee



Re: Removing PD_ALL_VISIBLE

From
Abhijit Menon-Sen
Date:
At 2013-01-17 08:41:37 +0000, simon@2ndQuadrant.com wrote:
>
> Jeff, can you summarise/collate why we're doing this, what concerns it
> raises and how you've dealt with them?

Since I was just looking at the original patch and discussion, and since
Pavan has posted an excerpt from one objection to it, here's an excerpt
from Jeff's original post titled "Do we need so many hint bits?"

http://www.postgresql.org/message-id/1353026577.14335.91.camel@sussancws0025

   Also, I am wondering about PD_ALL_VISIBLE. It was originally
   introduced in the visibility map patch, apparently as a way to know
   when to clear the VM bit when doing an update. It was then also used
   for scans, which showed a significant speedup. But I wonder: why not
   just use the visibilitymap directly from those places? It can be
   used for the scan because it is crash safe now (not possible
   before). And since it's only one lookup per scanned page, then I
   don't think it would be a measurable performance loss there.
   Inserts/updates/deletes also do a significant amount of work, so
   again, I doubt it's a big drop in performance there -- maybe under a
   lot of concurrency or something.

   The benefit of removing PD_ALL_VISIBLE would be significantly
   higher. It's quite common to load a lot of data, and then do some
   reads for a while (setting hint bits and flushing them to disk), and
   then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing
   all of the pages again. Also, if I remember correctly, Robert went
   to significant effort when making the VM crash-safe to keep the
   PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed
   before?

There was considerable discussion after this (accessible through the
archives link above), which I won't attempt to summarise.

-- Abhijit



Re: Removing PD_ALL_VISIBLE

From
Pavan Deolasee
Date:
On Thu, Jan 17, 2013 at 2:57 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> There was considerable discussion after this (accessible through the
> archives link above), which I won't attempt to summarise.

I thought Robert made those comments after considerable discussion of
Jeff's approach. So he probably still stands by his objections, or at
least has not been satisfied by (or seen) the numbers.

Now that I look at the patch, I wonder if there is another fundamental
issue with it. Since the patch removes WAL logging for the VM set
operation, this can happen:

1. Vacuum kicks in, clears all dead tuples in a page, and decides that
it's all-visible
2. Vacuum WAL-logs the cleanup activity and marks the page dirty
3. Vacuum sets the visibility bit and marks the VM page dirty
4. Say the VM page gets written to the disk. The heap page is not yet
written, nor is the WAL corresponding to the cleanup operation
5. CRASH

After recovery, the VM bit will remain set because the VM page got
written before the crash. But since the heap page's cleanup WAL did not
make it to the disk, those operations won't be replayed. The heap page
will be left with not-all-visible tuples in that case, and that's not a
good state to be in.

The original code does not have this problem because the VM set WAL gets
written after the heap page cleanup WAL, so it's guaranteed that the VM
bit will be set during recovery only if the cleanup WAL is replayed too
(there is more magic than what meets the eye, and I think it's not fully
documented).

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Removing PD_ALL_VISIBLE

From
Robert Haas
Date:
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> May be you've already addressed that concern with the proven
> performance numbers, but I'm not sure.

It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.

Jeff's approach of holding the VM pins for longer certainly mitigates
some of the damage, in the sense that it reduces buffer lookups and
pin/unpin cycles - and might be worth doing independently of the rest
of the patch if we think it's a win.  Index-only scans already use a
similar optimization, so extending it to inserts, updates, and deletes
is surely worth considering.  The main question in my mind is whether
there are any negative consequences to holding a VM buffer pin for
that long without interruption.  The usual consideration - namely,
blocking vacuum - doesn't apply here because vacuum does not take a
cleanup lock on the visibility map page, only the heap page, but I'm
not sure if there are others.

The other part of the issue is cache pressure. I don't think I can say
it better than what Tom already wrote:

# I'd be worried about the case of a lot of sessions touching a lot of
# unrelated tables.  This change implies doubling the number of buffers
# that are held pinned by any given query, and the distributed overhead
# from that (eg, adding cycles to searches for free buffers) is what you
# ought to be afraid of.

I agree that we ought to be afraid of that.  A pgbench test isn't
going to find a problem in this area because there you have a bunch of
sessions all accessing the same small group of tables.  To find a
problem of the type above, you'd need lots of backends accessing lots
of different, small tables.  That's not the use case we usually
benchmark, but I think there are a fair number of people doing things
like that - for example, imagine a hosting provider or web application
with many databases or schemas on a single instance.  AFAICS, Jeff
hasn't tried to test this scenario.

Now, on the flip side, we should also be thinking about what we would
gain from this patch, and what other ways there might be to achieve
the same goals.  As far as I can see, the main gain is that if you
bulk-load a table, don't vacuum it right away, get all the hint bits
set by some other mechanism, and then vacuum the table, you'll only
read the whole table instead of rewriting the whole table.  So ISTM
that, for example, if we adopted the idea of making HOT prune set
visibility-map bits, most of the benefit of this change evaporates,
but whatever costs it may have will remain.  There are other possible
ways of getting the same benefit as well - for example, I've been
thinking for a while now that we should try to find some judicious way
of vacuuming insert-only tables, perhaps only in small chunks when
there's nothing else going on.  That wouldn't as clearly obsolete this
patch, but it would address a very similar use case and would also
preset hint bits, which would help a lot of people.  Some of the ideas
we've had about a HEAP_XMIN_FREEZE intersect with this idea, too - if
we can do an early freeze without losing forensic information, we can
roll setting the hint bit, setting PD_ALL_VISIBLE, and freezing the
page into a single write.

All of which is to say that I think this patch is premature.  If we
adopt something like this, we're likely never going to revert back to
the way we do it now, and whatever cache-pressure or other costs this
approach carries will be here to stay - so we had better think awfully
carefully before we do that.  And, FWIW, I don't believe that there is
sufficient time in this release cycle to carefully test this patch and
the other alternative designs that aim toward the same ends.  Even if
there were, this is exactly the sort of thing that should be committed
at the beginning of a release cycle, not the end, so as to allow
adequate time for discovery of unforeseen consequences before the code
ends up out in the wild.

Of course, there's another issue here too, which is that as Pavan
points out, the patch throws crash-safety out the window, which breaks
index-only scans (if you have a crash).  HEAP_XLOG_VISIBLE is intended
principally to protect the VM bit, not PD_ALL_VISIBLE, but the patch
rips it out even though its purpose is to remove the latter, not the
former.  Removing PD_ALL_VISIBLE eliminates the need to keep the
visibility-map bit consist with PD_ALL_VISIBLE, but it does not
eliminate the need to keep PD_ALL_VISIBLE consistent with the page
contents.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Removing PD_ALL_VISIBLE

From
Heikki Linnakangas
Date:
On 17.01.2013 16:53, Robert Haas wrote:
> On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com>  wrote:
>> May be you've already addressed that concern with the proven
>> performance numbers, but I'm not sure.
>
> It would be nice to hear what Heikki's reasons were for adding
> PD_ALL_VISIBLE in the first place.

The idea was to avoid clearing the bit in the VM page on every update, 
when the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is 
not set. I assumed the traffic and contention on the VM page would be a 
killer otherwise. I don't remember if I ever actually tested that 
though. Maybe I was worrying about nothing and hitting the VM page on 
every update is ok.
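
For reference, the gate in question looks roughly like this in the
current heap insert/update/delete paths (a simplified paraphrase, not
verbatim source):

if (PageIsAllVisible(page))
{
	all_visible_cleared = true;
	PageClearAllVisible(page);
	/* only now do we touch (and dirty) the visibility map page */
	visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}

So the common not-all-visible case never touches the VM at all; without
PD_ALL_VISIBLE, every insert/update/delete would consult the VM instead.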

- Heikki



Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
> Now that I look at the patch, I wonder if there is another fundamental
> issue with the patch. Since the patch removes WAL logging for the VM
> set operation, this can happen:
> 
Thank you. I think I was confused by this comment here:

"When we *set* a visibility map during VACUUM, we must write WAL.  This
may seem counterintuitive, since the bit is basically a hint: if it is
clear, it may still be the case that every tuple on the page is visible
to all transactions; we just don't know that for certain.  The
difficulty is that there are two bits which are typically set together:
the PD_ALL_VISIBLE bit on the page itself, and the visibility map bit.
If a crash occurs after the visibility map page makes it to disk and
before the updated heap page makes it to disk, redo must set the bit on
the heap page. Otherwise, the next insert, update, or delete on the heap
page will fail to realize that the visibility map bit must be cleared,
possibly causing index-only scans to return wrong answers."

Which led me to believe that I could just rip out the WAL-related code
if PD_ALL_VISIBLE goes away, which is incorrect. But the incorrectness
doesn't have to do with the WAL directly, it's because the VM page's LSN
is not bumped past the LSN of the related heap page cleanup, so it can
be written too early.

Of course, the way to bump the LSN is to write WAL for the
visibilitymap_set operation. But that would be a very simple WAL
routine, rather than the complex one that exists without the patch.

I suppose we could even try to bump the LSN without writing WAL somehow,
but it doesn't seem worth reasoning through that (setting a VM bit is
rare enough).
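
Roughly what I have in mind (a sketch only; map/mapByte/mapBit are as in
the existing visibilitymap_set(), and log_vm_set() is just a made-up
name for the new, trivial record):

START_CRIT_SECTION();

map[mapByte] |= (1 << mapBit);
MarkBufferDirty(vmBuf);

if (RelationNeedsWAL(rel))
{
	/*
	 * Stamp the VM page with the record's LSN.  WAL-before-data then
	 * prevents the VM page from reaching disk ahead of the record, so
	 * recovery can't see the bit set unless the preceding heap cleanup
	 * record made it to disk (and gets replayed) as well.
	 */
	XLogRecPtr	recptr = log_vm_set(rel, heapBlk, vmBuf);

	PageSetLSN(BufferGetPage(vmBuf), recptr);
}

END_CRIT_SECTION();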

Regards,
    Jeff Davis




Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Thu, 2013-01-17 at 10:39 -0800, Jeff Davis wrote:
> On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
> > Now that I look at the patch, I wonder if there is another fundamental
> > issue with the patch. Since the patch removes WAL logging for the VM
> > set operation, this can happen:
> >
> Thank you.

New patch attached with simple WAL logging.

Regards,
    Jeff Davis


Attachment

Re: Removing PD_ALL_VISIBLE

From
Simon Riggs
Date:
On 17 January 2013 15:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 17.01.2013 16:53, Robert Haas wrote:
>>
>> On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
>> <pavan.deolasee@gmail.com>  wrote:
>>>
>>> May be you've already addressed that concern with the proven
>>> performance numbers, but I'm not sure.
>>
>>
>> It would be nice to hear what Heikki's reasons were for adding
>> PD_ALL_VISIBLE in the first place.
>
>
> The idea was to avoid clearing the bit in the VM page on every update, when
> the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is not set.
> I assumed the traffic and contention on the VM page would be a killer
> otherwise. I don't remember if I ever actually tested that though. Maybe I
> was worrying about nothing and hitting the VM page on every update is ok.

Presumably we remember the state of the VM so we can skip the re-visit
after every write?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Thu, 2013-01-17 at 17:14 +0200, Heikki Linnakangas wrote:
> I don't remember if I ever actually tested that 
> though. Maybe I was worrying about nothing and hitting the VM page on 
> every update is ok.

I tried, but was unable to show really anything at all, even without
keeping the VM page pinned. I think the bottleneck is elsewhere;
although I am keeping the page pinned in this patch to prevent it from
becoming a bottleneck.

Regards,
    Jeff Davis




Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Thu, 2013-01-17 at 19:58 +0000, Simon Riggs wrote:
> Presumably we remember the state of the VM so we can skip the re-visit
> after every write?

That was not a part of my patch, although I remember that you mentioned
that previously and I thought it could be a good way to mitigate a
problem if it ever came up.

However, the tests I did didn't show any problem there. The tests were
somewhat noisy, so perhaps I was doing something wrong, but it didn't
appear that looking at the VM page for every update was a bottleneck.

Regards,
    Jeff Davis





Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Thu, 2013-01-17 at 09:53 -0500, Robert Haas wrote:
> The main question in my mind is whether
> there are any negative consequences to holding a VM buffer pin for
> that long without interruption.  The usual consideration - namely,
> blocking vacuum - doesn't apply here because vacuum does not take a
> cleanup lock on the visibility map page, only the heap page, but I'm
> not sure if there are others.

If the "without interruption" part becomes a practical problem, it seems
fairly easy to fix: drop the pin and pick it up again once every K
pages. Unless I'm missing something, this is a minor concern.
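
(Something like the following, as a sketch -- the counter field and the
constant are invented for illustration:)

/* Bound how long any one VM pin is held during a long scan. */
#define VM_PIN_MAX_HEAP_PAGES	1024	/* "K" */

if (BufferIsValid(scan->rs_vmbuffer) &&
	++scan->rs_pages_since_vm_pin >= VM_PIN_MAX_HEAP_PAGES)
{
	ReleaseBuffer(scan->rs_vmbuffer);
	scan->rs_vmbuffer = InvalidBuffer;	/* next visibilitymap_test() re-pins */
	scan->rs_pages_since_vm_pin = 0;
}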

> The other part of the issue is cache pressure. I don't think I can say
> it better than what Tom already wrote:
> 
> # I'd be worried about the case of a lot of sessions touching a lot of
> # unrelated tables.  This change implies doubling the number of buffers
> # that are held pinned by any given query, and the distributed overhead
> # from that (eg, adding cycles to searches for free buffers) is what you
> # ought to be afraid of.
> 
> I agree that we ought to be afraid of that.

It's a legitimate concern, but I think being "afraid" goes too far (more
below).

> A pgbench test isn't
> going to find a problem in this area because there you have a bunch of
> sessions all accessing the same small group of tables.  To find a
> problem of the type above, you'd need lots of backends accessing lots
> of different, small tables.  That's not the use case we usually
> benchmark, but I think there are a fair number of people doing things
> like that - for example, imagine a hosting provider or web application
> with many databases or schemas on a single instance.  AFAICS, Jeff
> hasn't tried to test this scenario.

The concern here is over a lot of different, small tables (e.g.
multi-tenancy or something similar) as you say. If we're talking about
nearly empty tables, that's easy to fix: just don't use the VM on tables
less than N pages.

You could say that "small" tables are really 1-10MB each, and you could
have a zillion of those. I will try to create a worst-case here and see
what numbers come out. Perhaps the extra time to look for a free buffer
does add up.

Test plan:
 1. Take current patch (without "skip VM check for small tables"
optimization mentioned above).
 2. Create 500 tables each about 1MB.
 3. VACUUM them all.
 4. Start 500 connections (one for each table).
 5. Time the running of a loop that executes a COUNT(*) on that
connection's table 100 times.

I think shared_buffers=64MB is probably appropriate. We want some memory
pressure so that it has to find and evict pages to satisfy the queries.
But we don't want it to be totally exhausted and unable to even pin a
new page; that really doesn't tell us much except that max_connections
is too high.

Sound reasonable?

> Now, on the flip side, we should also be thinking about what we would
> gain from this patch, and what other ways there might be to achieve
> the same goals.

One thing to keep in mind is that the current code to maintain a
crash-safe PD_ALL_VISIBLE and VM bit is quite complex and doesn't play
by the normal rules. If you want to talk about distributed costs, that
has some very real distributed costs in terms of development effort. For
instance, my checksums patch took me extra time to write (despite the
patch being the simplest checksums design on the table) and will take
others extra time to review.

So, if the only things keeping that code in place are theoretical fears,
let's take them one by one and see if they are real problems or not.

> All of which is to say that I think this patch is premature.  If we
> adopt something like this, we're likely never going to revert back to
> the way we do it now, and whatever cache-pressure or other costs this
> approach carries will be here to stay - so we had better think awfully
> carefully before we do that.

What about this patch makes it hard to undo/rework in the future?

>  Even if
> there were, this is exactly the sort of thing that should be committed
> at the beginning of a release cycle, not the end, so as to allow
> adequate time for discovery of unforeseen consequences before the code
> ends up out in the wild.

I'm concerned that such a comment at this stage will cut review early,
which could prevent also it from being committed early in 9.4.

> Of course, there's another issue here too, which is that as Pavan
> points out, the patch throws crash-safety out the window

My mistake. I believe that is already fixed, and certainly not a
fundamental issue.

Regards,
    Jeff Davis





Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Thu, 2013-01-17 at 14:53 -0800, Jeff Davis wrote:
> Test plan:
>
>   1. Take current patch (without "skip VM check for small tables"
> optimization mentioned above).
>   2. Create 500 tables each about 1MB.
>   3. VACUUM them all.
>   4. Start 500 connections (one for each table)
>   5. Time the running of a loop that executes a COUNT(*) on that
> connection's table 100 times.

Done, with a few extra variables. Again, thanks to Nathan Boley for
lending me the 64-core box. Test program attached.

I did both 1MB tables and 1 tuple tables, but I ended up throwing out
the 1-tuple table results. First of all, as I said, that's a pretty easy
problem to solve, so not really what I want to test. Second, I had to do
so many iterations that I don't think I was testing anything useful. I
did see what might have been a couple differences, but I would need to
explore in more detail and I don't think it's worth it, so I'm just
reporting on the 1MB tables.

For each test, each of 500 connections runs 10 iterations of a COUNT(*)
on its own 1MB table (which is vacuumed and has the VM bit set). The
query is prepared once. The table has only an int column.

The variable is shared_buffers, going from 32MB (near exhaustion for 500
connections) to 2048MB (everything fits).

The last column is the time range in seconds. I included the range this
time, because there was more variance in the runs but I still think they
are good test results.

master:
    32MB: 16.4 - 18.9
    64MB: 16.9 - 17.3
   128MB: 17.5 - 17.9
   256MB: 14.7 - 15.8
   384MB:  8.1 -  9.3
   448MB:  4.3 -  9.2
   512MB:  1.7 -  2.2
   576MB:  0.6 -  0.6
  1024MB:  0.6 -  0.6
  2048MB:  0.6 -  0.6

patch:
    32MB: 16.8 - 17.6
    64MB: 17.1 - 17.5
   128MB: 17.2 - 18.0
   256MB: 14.8 - 16.2
   384MB:  8.0 - 10.1
   448MB:  4.6 -  7.2
   512MB:  2.0 -  2.6
   576MB:  0.6 -  0.6
  1024MB:  0.6 -  0.6
  2048MB:  0.6 -  0.6

Conclusion:

I see about what I expect: a precipitous drop in runtime after
everything fits in shared_buffers (500 1MB tables means the inflection
point around 512MB makes a lot of sense). There does seem to be a
measurable difference right around that inflection point, but it's not
much. Considering that this is the worst case that I could devise, I am
not too concerned about this.

However, it is interesting to see that there really is a lot of
maintenance work being done when we need to move pages in and out of
shared buffers. I'm not sure that it's related to the freelists though.

For the extra pins to really be a problem, I think a much higher
percentage of the buffers would need to be pinned. Since the case we are
worried about involves scans (if it involved indexes, that would already
be using more than one pin per scan), then that means the only way to
get to a high percentage of pinned buffers is by having very small
tables. But we don't really need to use the VM when scanning very small
tables (the overhead would be elsewhere), so I think we're OK.

So, I attached a new version of the patch that doesn't look at the VM
for tables with fewer than 32 pages. That's the only change.
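
The check itself is about as simple as it sounds (an illustrative
sketch; the variable names aren't necessarily the patch's):

#define VM_TEST_MIN_PAGES	32		/* heap pages */

/* tiny tables skip the VM entirely; the extra pin isn't worth it there */
bool		use_vm = (scan->rs_nblocks >= VM_TEST_MIN_PAGES);
bool		all_visible = use_vm &&
	visibilitymap_test(scan->rs_rd, blkno, &scan->rs_vmbuffer);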

Regards,
    Jeff Davis

Attachment

Re: Removing PD_ALL_VISIBLE

From
Robert Haas
Date:
On Thu, Jan 17, 2013 at 5:50 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> If the "without interruption" part becomes a practical problem, it seems
> fairly easy to fix: drop the pin and pick it up again once every K
> pages. Unless I'm missing something, this is a minor concern.

I think probably so.

> Test plan:
>
>   1. Take current patch (without "skip VM check for small tables"
> optimization mentioned above).
>   2. Create 500 tables each about 1MB.
>   3. VACUUM them all.
>   4. Start 500 connections (one for each table)
>   5. Time the running of a loop that executes a COUNT(*) on that
> connection's table 100 times.
>
> I think shared_buffers=64MB is probably appropriate. We want some memory
> pressure so that it has to find and evict pages to satisfy the queries.
> But we don't want it to be totally exhausted and unable to even pin a
> new page; that really doesn't tell us much except that max_connections
> is too high.
>
> Sound reasonable?

Well, it's certainly a data point, but each table in that case has 128
pages, so the effect is still pretty small.  The place where you're
going to run into trouble is when many of those tables have 4 pages
each, or 2 pages each, or 1 page each.

>> All of which is to say that I think this patch is premature.  If we
>> adopt something like this, we're likely never going to revert back to
>> the way we do it now, and whatever cache-pressure or other costs this
>> approach carries will be here to stay - so we had better think awfully
>> carefully before we do that.
>
> What about this patch makes it hard to undo/rework in the future?

Well, if you have a bunch of code, and it preserves property X at all
times, it is pretty easy to decide that new code need no longer
preserve property X.  It is essentially a no-op.  The reverse, going
through a bunch of existing code that does not consistently preserve
property X and making it do so, is always much harder, because you've
got to audit all the code you've already got.  I don't want to undo
the work that's been done on this over the last four years without a
really good reason, and I'm not seeing one.

> My mistake. I believe that is already fixed, and certainly not a
> fundamental issue.

It at least calls for a repetition of any performance testing that has
already been done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Removing PD_ALL_VISIBLE

From
Robert Haas
Date:
On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> So, I attached a new version of the patch that doesn't look at the VM
> for tables with fewer than 32 pages. That's the only change.

That certainly seems worthwhile, but I still don't want to get rid of
this code.  I'm just not seeing a reason why that's something that
desperately needs to be done.  I don't think this is a barrier to
anything else we want to do, and it might well be that the situations
where this patch would hurt us are currently masked by other
bottlenecks, but won't be always.  Right now, the vast majority of
heap updates don't need to pin the visibility map page; with this
change, all of them do.  Now, I accept that your test results show
that that doesn't matter, but how can that not be an artifact of some
kind?  Can we really credit that accessing two pages costs no more
than accessing one?  To what countervailing factor could we plausibly
attribute that?

Now, even if it costs more in some narrow sense, the difference might
not be enough to matter.  But without some big gain on the other side,
why tinker with it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Removing PD_ALL_VISIBLE

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> So, I attached a new version of the patch that doesn't look at the VM
>> for tables with fewer than 32 pages. That's the only change.

> That certainly seems worthwhile, but I still don't want to get rid of
> this code.  I'm just not seeing a reason why that's something that
> desperately needs to be done.

Yeah, I'm having the same problem.  Despite Jeff's test results, I can't
help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some
workloads, and it's not obvious to me what benefit we get from dropping
it.
        regards, tom lane



Re: Removing PD_ALL_VISIBLE

From
Pavan Deolasee
Date:
On Mon, Jan 21, 2013 at 8:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> So, I attached a new version of the patch that doesn't look at the VM
>>> for tables with fewer than 32 pages. That's the only change.
>
>> That certainly seems worthwhile, but I still don't want to get rid of
>> this code.  I'm just not seeing a reason why that's something that
>> desperately needs to be done.
>
> Yeah, I'm having the same problem.  Despite Jeff's test results, I can't
> help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some
> workloads, and it's not obvious to me what benefit we get from dropping
> it.

I tend to agree. When I looked at the patch, I thought since its
removing a WAL record (and associated redo logic), it has some
additional value. But that was kind of broken (sorry, I haven't looked
at the latest patch if Jeff fixed it without requiring to reinstate
the WAL logging). I also thought that bug invalidates the performance
numbers reported. Of course, there is an argument that this patch will
simplify the code, but I'm not sure if its enough to justify the
additional contention which may or may not show up in the benchmarks
we are running, but we know its there.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee



Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Sun, 2013-01-20 at 22:19 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> >> So, I attached a new version of the patch that doesn't look at the VM
> >> for tables with fewer than 32 pages. That's the only change.
> 
> > That certainly seems worthwhile, but I still don't want to get rid of
> > this code.  I'm just not seeing a reason why that's something that
> > desperately needs to be done.
> 
> Yeah, I'm having the same problem.  Despite Jeff's test results, I can't
> help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some
> workloads, and it's not obvious to me what benefit we get from dropping
> it.

OK. I respectfully disagree with the arguments I've seen so far, but we
can all move on.

I already had some more test results (again, thanks to Nathan Boley), so
I finished them up and attached them to the bottom of this email for the
archives (there's always hope, right?).

Regards,
    Jeff Davis


Test goal: is 32 is an appropriate threshold for using the VM after we
remove PD_ALL_VISIBLE?

Test setup: 500 31-page tables and 500 33-page tables. Test recent build
against patched version, with varying shared_buffers. The first test is
500 connections each doing 20 iterations of COUNT(*) against the 500
31-page tables. The next test is the same, except against the 33-page
tables.

Test results:
 (There were a few outliers I discarded as being too fast. It always
happened in the first run, and I strongly suspect the connections began
unevenly, leading to lower concurrency. They didn't seem to favor one
build over another.)

master, 31-page (first column is shared_buffers, second is range of
seconds):
   32MB:  5.8 -  6.1
   64MB:  3.1 -  3.7
   96MB:  1.7 -  2.2
  112MB:  0.6 -  1.1
  128MB:  0.4 -  0.4
  256MB:  0.4 -  0.4

patch, 31-page (doesn't use VM because 31 is below threshold):
   32MB:  5.1 -  5.9
   64MB:  3.4 -  3.8
   96MB:  1.7 -  2.0
  112MB:  0.7 -  1.1
  128MB:  0.4 -  0.5
  256MB:  0.4 -  0.5

master, 33-page:
   32MB:  5.9 -  7.0
   64MB:  4.2 -  4.7
   96MB:  2.4 -  2.8
  112MB:  1.2 -  1.6
  128MB:  0.5 -  0.5
  256MB:  0.4 -  0.5

patch, 33-page (does use VM because 33 is above threshold):
   32MB:  6.2 -  7.2
   64MB:  4.4 -  4.7
   96MB:  2.8 -  3.0
  112MB:  1.7 -  1.8
  128MB:  0.5 -  1.0
  256MB:  0.4 -  0.5

Conclusion:

32 pages is a low enough threshold that skipping the VM is
insignificant.

When the 500 tables are 33 pages, and it does use the VM, we do see a
measurable cost under cache pressure. The penalty is fairly small (10%
ballpark), and this is a worst-case, so I don't think it's a problem.
From the last test results, we know it gets back to even by the time the
tables are 128 pages (1MB). So it could be that there is a slightly
higher threshold (between 32 and 128) that would be slightly better. But
given how specific this case is and how small the penalty is, I think 32
is a fine threshold.

Also, to reiterate, I focused my tests almost entirely on scans, though
some of the concern was around inserts/updates/deletes. I tried, but was
unable to show any difference on those tests. I suspect that the
bottleneck is elsewhere.





Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Mon, 2013-01-21 at 11:27 +0530, Pavan Deolasee wrote:
> I tend to agree. When I looked at the patch, I thought since its
> removing a WAL record (and associated redo logic), it has some
> additional value. But that was kind of broken (sorry, I haven't looked
> at the latest patch if Jeff fixed it without requiring to reinstate
> the WAL logging). I also thought that bug invalidates the performance
> numbers reported.

I revalidated the performance numbers after reinstating the WAL logging.
No difference (which is unsurprising, since my tests were read-only).

>  Of course, there is an argument that this patch will
> simplify the code, but I'm not sure if its enough to justify the
> additional contention which may or may not show up in the benchmarks
> we are running, but we know its there.

What additional contention? How do you know it's there?

The design is to keep the VM page pinned, and to read it without
requiring locks (like index-only scans already do). So I don't see any
obvious additional contention there, unless you're talking about the
work the CPU does for cache coherency (which did not show up in any of
my tests).

I understand that my patch is probably not going in, but I would like to
be clear about what is a known practical problem, what is a theoretical
problem that has eluded my testing capabilities, and what is no problem
at all.

Regards,
    Jeff Davis





Re: Removing PD_ALL_VISIBLE

From
Pavan Deolasee
Date:


On Mon, Jan 21, 2013 at 12:22 PM, Jeff Davis <pgsql@j-davis.com> wrote:

> On Mon, 2013-01-21 at 11:27 +0530, Pavan Deolasee wrote:
> >  Of course, there is an argument that this patch will
> > simplify the code, but I'm not sure if its enough to justify the
> > additional contention which may or may not show up in the benchmarks
> > we are running, but we know its there.
>
> What additional contention? How do you know it's there?


At the minimum, your patch will need to have one additional buffer
pinned for every K < 8192 * 8 heap pages. For most cases, the value of K
will be large enough to ignore the overhead, but for small values of K,
I'm not sure if we can say that with confidence. Of course, for very
small tables the real contention might be somewhere else, and so this
change may not show up anything on the benchmarks. Don't your own tests
for 33-page tables confirm that there is indeed contention for this
case? I agree it's not huge, but I don't know if there are workloads
where it will matter.


> I understand that my patch is probably not going in,

Sorry about that. I know how that feels. But we need some more reasons
to justify this change, especially because the performance numbers
themselves are not showing any signs either way. One could have argued
that this saves us a valuable page level bit, but with pg_upgrade etc it
has become hard to re-utilize page level bits for other purposes and we
don't yet have a pressing need for more bits.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Mon, 2013-01-21 at 12:49 +0530, Pavan Deolasee wrote:

> At the minimum your patch will need to have one additional buffer
> pinned for every K < 8192 * 8 heap pages.

I assume it's the same K I referred to when responding to Robert: the
max number of heap buffers we read before we unpin and repin the VM
buffer. Right now it's unlimited, because there is currently no need to
have it at all -- I was just offering the solution in case we did want
to do VM page cleanup in the future or something.

>  For most cases, the value of K will be large enough to ignore the
> overhead, but for small values of K, I'm not sure if we can say that
> with confidence.

It's a constant, and we can easily set it high enough that it wouldn't
matter. There's no reason at all to choose a small value for K. Consider
that an index root page pin is held for the entire scan, and we don't
have a problem with that.

> Of course, for very small tables the real contention might be
> somewhere else and so this change may not show up anything on the
> benchmarks. Doesn't your own tests for 33 page tables confirm that
> there is indeed contention for this case ? I agree its not huge, but I
> don't know if there are workloads where it will matter.

That appears to happen because of the fraction of pinned pages being too
high (aside: it was fairly challenging to construct a test where that
happened). I believe it was mostly solved by adding a threshold to use
the VM, and I can probably solve it completely by doing some more
experiments and finding a better threshold value.
> > I understand that my patch is probably not going in,
>
> Sorry about that. I know how that feels. But we need some more reasons
> to justify this change, especially because the performance numbers
> themselves are not showing any signs either way.

That confuses me. The testing was to show it didn't hurt other workloads
(like scans or inserts/updates/deletes); so the best possible result is
that they don't show signs either way.

But yes, I see that others are not interested in the benefits offered by
the patch, which is why I'm giving up on it. If people are concerned
about the costs, then I can fix those; but there's nothing I can do to
increase the benefits.

Regards,
    Jeff Davis





Re: Removing PD_ALL_VISIBLE

From
Heikki Linnakangas
Date:
On 21.01.2013 11:10, Jeff Davis wrote:
> That confuses me. The testing was to show it didn't hurt other workloads
> (like scans or inserts/updates/deletes); so the best possible result is
> that they don't show signs either way.

I went back to look at the initial test results that demonstrated that 
keeping the pin on the VM buffer mitigated the overhead of pinning the 
vm page. The obvious next question is, what is the impact when that's 
inefficient, ie. when you update pages across different 512 MB sections, 
so that the vm pin has to be dropped and reacquired repeatedly.

I tried to construct a worst case scenario for that:

create unlogged table testtable (i int4);
insert into testtable select generate_series(1, 15000000);
insert into testtable select generate_series(1, 15000000);
create index testtable_index on testtable (i);

When you traverse tuples using that index, the tuples will come 
alternating from low-numbered pages and high-numbered pages, which 
defeats keeping the last vm page pinned. To test, I ran this:

set enable_bitmapscan=off; set enable_seqscan=off;
begin;
delete from testtable where i >= 0;
rollback;

I repeated a few times with and without the patch 
(rm-pd-all-visible-0118.patch). According to \timing, the delete takes 
about 12.6 seconds without the patch, and 15.3 seconds with it.

This is a worst-case scenario, and the slowdown isn't huge, so maybe 
it's a worthwhile tradeoff. But it shows that removing PD_ALL_VISIBLE is 
not completely free.

- Heikki



Re: Removing PD_ALL_VISIBLE

From
Robert Haas
Date:
On Mon, Jan 21, 2013 at 1:52 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>>  Of course, there is an argument that this patch will
>> simplify the code, but I'm not sure if its enough to justify the
>> additional contention which may or may not show up in the benchmarks
>> we are running, but we know its there.
>
> What additional contention? How do you know it's there?

We know it's there because every additional page that you access has
to be looked up and locked and the memory that contains it brought
into the CPU cache.  If you look up and lock more pages, you WILL have
more contention - both for memory, and for locks - and more runtime
cost.  Whether or not you can currently detect that contention and/or
runtime cost experimentally is another matter.  Failure to do so could
indicate either that you haven't got the right test case - Heikki
seems to have found one that works - or that it's being masked by
other things, but might not be always.

To raise an example which I believe is similar to this one, consider
Kevin Grittner's work on SSI.  He set himself a goal that SSI should
impose a performance regression of no more than 2% - and he met that
goal, at the time the code was committed.  But then, my read
scalability work during the 9.2 cycle blew the lid off of read
performance for certain workloads, and now SSI is FAR slower on those
workloads.   His code always had a scalability problem, but you
couldn't measure it experimentally before I did that work, because
there were equally bad scalability problems elsewhere.  Now there
aren't, and you can, and I have.  We might well have refused to commit
that patch with the locking scheme it uses today if my scalability
work had been done first - or perhaps we would have decided that the
case where the difference is large is too narrow to be worth worrying
about, but I think it would have at least provoked discussion.

My biggest concern about the visibility map is that it will act as a
focal point for contention.  Map half a gigabyte of heap down to 1
page of visibility map and it's easy to think that you could have some
situation in which that page gets very, very hot.  We know that cache
line stealing is a significant cost of ProcArray manipulation, which
is why the PGXACT patch makes a significant difference in write
throughput.  Now your argument seems to be that we can't measure any
such effect here, but maybe we're just not measuring the right thing.
Heikki's example reminded me that I was concerned about the cost of
repeatedly pinning different VM buffers during an index-only scan, if
the relation were very large.  That's seems easy to justify on the
grounds that you're saving so much I/O and memory traffic that the
pins will seem cheap by comparison ... but they don't, always.  IIRC,
an index-only scan on a fully-cached relation is not necessarily
faster than a regular index-scan, even if the heap is entirely
all-visible.

I realize these examples aren't directly applicable to this case, so
I'm not sure if they help to advance the discussion or not.  I advance
them only as evidence that simple tests don't always manage to capture
the real costs and benefits of a feature or optimization, and that
predicting the system-wide effects of changes in this area is hard.
For a large benefit I think we would perhaps bite the bullet and take
our chances, but in my (human and imperfect) judgement the benefit
here is not large enough to justify it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Removing PD_ALL_VISIBLE

From
Jeff Davis
Date:
On Mon, 2013-01-21 at 12:00 +0200, Heikki Linnakangas wrote:
> On 21.01.2013 11:10, Jeff Davis wrote:
> > That confuses me. The testing was to show it didn't hurt other workloads
> > (like scans or inserts/updates/deletes); so the best possible result is
> > that they don't show signs either way.
>
> I went back to look at the initial test results that demonstrated that
> keeping the pin on the VM buffer mitigated the overhead of pinning the
> vm page. The obvious next question is, what is the impact when that's
> inefficient, ie. when you update pages across different 512 MB sections,
> so that the vm pin has to be dropped and reacquired repeatedly.
>
> I tried to construct a worst case scenario for that:

I confirmed this result in a single connection (no concurrency). I used
a shared_buffers of 2GB so that the whole table would fit.

Also, I fixed a bug that I noticed along the way, which was an
uninitialized variable. New version attached.

FWIW, I'm considering this patch to be rejected; I just didn't want to
leave a patch with a bug in it.

Regards,
    Jeff Davis

Attachment