Thread: visibility maps

visibility maps

From
"Pavan Deolasee"
Date:
<br />    /*<br />     * We don't need to lock the page, as we're only looking at a single bit.<br />     */<br />   
result= (map[mapByte] & (1 << mapBit)) ? true : false;<br /><br /><br /> Isn't this a dangerous assumption to
make? I am not so sure that even a bit can be read atomically on all platforms. I think the only caller of
visibilitymap_test()is VACUUM which can live with a false information. But if this is indeed a problem, should we
eitherfix this or have explicit comments there ?<br clear="all" /><br />BTW, my apologies for random comments. I
haven'tfollowed the discussion well, neither done a full review. So these things might have been discussed before.<br
/><br/> Thanks,<br />Pavan <br /><br />-- <br />Pavan Deolasee<br />EnterpriseDB     <a
href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/> 

Re: visibility maps

From
"Pavan Deolasee"
Date:
<br />/*<br /> * Size of the bitmap on each visibility map page, in bytes. There's no<br /> * extra headers, so the
wholepage minus except for the standard page header<br /> * is used for the bitmap.<br /> */<br />#define MAPSIZE
(BLCKSZ- SizeOfPageHeaderData)<br /><br /><br />ISTM that we should MAXALIGN the SizeOfPageHeaderData to compute
MAPSIZE.PageGetContents() works that way and I believe that's the right thing to do.<br /><br />Thanks,<br />Pavan<br
clear="all"/><br />-- <br />Pavan Deolasee<br /> EnterpriseDB     <a
href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/> 

Re: visibility maps

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
>     /*
>      * We don't need to lock the page, as we're only looking at a single
> bit.
>      */
>     result = (map[mapByte] & (1 << mapBit)) ? true : false;
> 
> 
> Isn't this a dangerous assumption to make ? I am not so sure that even a bit
> can be read atomically on all platforms. 

Umm, what non-atomic state could the bit be in? Half-set, half-cleared? 
Or do you think that if some other bit in proximity is changed, the 
other bit would temporarily flip 0->1->0, or something like that? I 
don't think that should happen.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: visibility maps

From
"Pavan Deolasee"
Date:
<br /><br /><div class="gmail_quote">On Sat, Dec 6, 2008 at 7:57 PM, Heikki Linnakangas <span dir="ltr"><<a
href="mailto:heikki.linnakangas@enterprisedb.com">heikki.linnakangas@enterprisedb.com</a>></span>wrote:<br
/><blockquoteclass="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex;
padding-left:1ex;"><div class="Ih2E3d"><br /></div> Umm, what non-atomic state could the bit be in? Half-set,
half-cleared?Or do you think that if some other bit in proximity is changed, the other bit would temporarily flip
0->1->0,or something like that? I don't think that should happen.<br /><br /></blockquote></div><br clear="all"
/>Sincethe lock is not held, the bit can be flipped while we are reading, isn't it ? IOW, the test is not reliable is
whatI fear.<br /><br />Thanks,<br />Pavan<br /><br />-- <br />Pavan Deolasee<br /> EnterpriseDB     <a
href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/> 

Re: visibility maps

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> On Sat, Dec 6, 2008 at 7:57 PM, Heikki Linnakangas <
> heikki.linnakangas@enterprisedb.com> wrote:
> 
>> Umm, what non-atomic state could the bit be in? Half-set, half-cleared? Or
>> do you think that if some other bit in proximity is changed, the other bit
>> would temporarily flip 0->1->0, or something like that? I don't think that
>> should happen.
>>
> Since the lock is not held, the bit can be flipped while we are reading,
> isn't it ? IOW, the test is not reliable is what I fear.

If someone is changing the bit at the same time, it doesn't matter 
whether we read it as 1 or 0. Locking the page wouldn't change the 
situation: you would still read the old value if you got the lock before 
the concurrent updater, or the new value if you got the lock after.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: visibility maps

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> /*
>  * Size of the bitmap on each visibility map page, in bytes. There's no
>  * extra headers, so the whole page minus except for the standard page
> header
>  * is used for the bitmap.
>  */
> #define MAPSIZE (BLCKSZ - SizeOfPageHeaderData)
> 
> 
> ISTM that we should MAXALIGN the SizeOfPageHeaderData to compute MAPSIZE.
> PageGetContents() works that way and I believe that's the right thing to do.

Yep, you're right. Thanks, fixed.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: visibility maps

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Pavan Deolasee wrote:
>>     /*
>>      * We don't need to lock the page, as we're only looking at a single
>> bit.
>>      */
>>     result = (map[mapByte] & (1 << mapBit)) ? true : false;
>>
>>
>> Isn't this a dangerous assumption to make ? I am not so sure that even 
>> a bit
>> can be read atomically on all platforms. 
> 
> Umm, what non-atomic state could the bit be in? Half-set, half-cleared? 
> Or do you think that if some other bit in proximity is changed, the 
> other bit would temporarily flip 0->1->0, or something like that? I 
> don't think that should happen.
> 

IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread) can 
access to the same memory address(es)* in same time*. The question is how 
compiler compile C code to assembler.  But this code seems to me safe. I think 
we use same principle for TransactionID?

    Zdenek

* Wide is architecture dependent. and of course it depends on alignment. I'm not 
sure how x86 CPUs read nonalignment memory. But if you enable this on SPARC it 
is handled in software thru TRAP handlers and it is not really atomic. But in 
our case we use byte access which is safe everywhere.

** IIRC, some AMD64 processors allows to disable cache coherence check, but it 
not used in standard OS, and we can ignore it.




Re: visibility maps

From
"Pavan Deolasee"
Date:
On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>
>>
>
> IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread)
> can access to the same memory address(es)* in same time*. The question is
> how compiler compile C code to assembler.  But this code seems to me safe.

Yeah, I think the code is safe because we are just reading a bit.

BTW, I wonder if we need to acquire EXCLUSIVE lock while writing the
visibility map bit ? Since almost (8 * 8192) data blocks would map to
the same visibility map page, the lock can certainly become a hot
spot. I know we also update PageLSN during the set operation and that
would require EXLUSIVE lock, but is that required for consistency
given that the entire visibility map is just a hint ?

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: visibility maps

From
Zdenek Kotala
Date:
Pavan Deolasee napsal(a):
> On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>> IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread)
>> can access to the same memory address(es)* in same time*. The question is
>> how compiler compile C code to assembler.  But this code seems to me safe.
> 
> Yeah, I think the code is safe because we are just reading a bit.
> 
> BTW, I wonder if we need to acquire EXCLUSIVE lock while writing the
> visibility map bit ? 

Yes, because it is not simple write operation. You need to read byte from memory 
to register, set bit and write it back. Write memory itself is atomic but 
somebody could change other bits between read and write.
    Zdenek



Re: visibility maps

From
"Pavan Deolasee"
Date:
On Thu, Dec 11, 2008 at 7:03 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:

>
> Yes, because it is not simple write operation. You need to read byte from
> memory to register, set bit and write it back. Write memory itself is atomic
> but somebody could change other bits between read and write.
>

Yeah, but since its just a hint, we can possibly live with some corner
cases. The benefit of avoiding contention on the VM page would easily
out weigh the downside of wrong hints.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: visibility maps

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>> IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread)
>> can access to the same memory address(es)* in same time*. The question is
>> how compiler compile C code to assembler.  But this code seems to me safe.
> 
> Yeah, I think the code is safe because we are just reading a bit.

Right. I wonder if we should declare the char *map variable as volatile, 
though. Shouldn't make a difference in practice, it's only used once in 
the function, but it feels like the right thing to do given that it is 
accessing a piece of memory without a lock.

> BTW, I wonder if we need to acquire EXCLUSIVE lock while writing the
> visibility map bit ? Since almost (8 * 8192) data blocks would map to
> the same visibility map page, the lock can certainly become a hot
> spot. I know we also update PageLSN during the set operation and that
> would require EXLUSIVE lock, but is that required for consistency
> given that the entire visibility map is just a hint ?

Yeah, if we accept that bits can be bogusly set. There is scenarios 
where that can happen already, but they involve crashing, not during 
normal operation and clean shut down. In the future, I'd like to move in 
the direction of making the visibility map *more* reliable, not less, 
ultimately allowing index-only-scans, so I'd rather not start relaxing that.

Only the first update to a page needs to clear the bit in the visibility 
map, so I don't think it'll become a bottleneck in practice. Frequently 
updated pages will never have the bit set in the visibility map to begin 
with.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: visibility maps

From
"Pavan Deolasee"
Date:
On Thu, Dec 11, 2008 at 7:15 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>
>
> Yeah, if we accept that bits can be bogusly set. There is scenarios where
> that can happen already, but they involve crashing, not during normal
> operation and clean shut down. In the future, I'd like to move in the
> direction of making the visibility map *more* reliable, not less, ultimately
> allowing index-only-scans, so I'd rather not start relaxing that.
>

Do we have any tests to prove that the VM page lock does not indeed
become a bottleneck ? I can do some if we don't have already.

> Only the first update to a page needs to clear the bit in the visibility
> map, so I don't think it'll become a bottleneck in practice. Frequently
> updated pages will never have the bit set in the visibility map to begin
> with.
>

Well that's true only if you reject my heap-prune patch :-) Otherwise,
heap-prune will again set the bit (and I believe that's the right
thing to do)

Thanks,
Pavan



-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: visibility maps

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On Thu, Dec 11, 2008 at 5:01 PM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>> IIRC, Memory reading/writing is atomic operation. Only one CPU(hw thread)
>> can access to the same memory address(es)* in same time*. The question is
>> how compiler compile C code to assembler.  But this code seems to me safe.

> Yeah, I think the code is safe because we are just reading a bit.

There's no such thing as "just reading a bit" from shared memory.
Yes, you will get *some* value, but it is not very clear which value.

In particular, on machines with weak memory ordering guarantees
(PPC for instance), we put memory fence instructions into the
lock/unlock sequences to ensure that someone who obtains a lock
guarding a shared-memory data structure will see any changes made
by the previous holder of the lock.  An access that is entirely
free of any locking primitive might get a stale value --- meaning
that it might be logically inconsistent with the apparent contents
of other parts of shared memory examined just before or after this
access.

It's likely that there are other lock/unlock operations somewhere in the
code that would prevent a visible failure; and in any case the usage of
the visibility map is constrained in a way that means getting a slightly
stale value isn't a problem.  But it needs a lot closer analysis than
the existing code comment suggests.

I've been thinking for awhile that maybe we should expose the memory
fence operations as separate primitives, similar to what's done inside
the Linux kernel.  This code would feel a lot safer if it had a read
fence just before the fetch.  IIRC there are some other places where
we could use something similar instead of needing a full lock
acquisition/release.
        regards, tom lane


Re: visibility maps

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> On Thu, Dec 11, 2008 at 7:15 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> Do we have any tests to prove that the VM page lock does not indeed
> become a bottleneck ?

No.

> I can do some if we don't have already.

Oh, yes please!

>> Only the first update to a page needs to clear the bit in the visibility
>> map, so I don't think it'll become a bottleneck in practice. Frequently
>> updated pages will never have the bit set in the visibility map to begin
>> with.
> 
> Well that's true only if you reject my heap-prune patch :-) Otherwise,
> heap-prune will again set the bit (and I believe that's the right
> thing to do)

I'm not sure if we should set the bits in very aggressively. If we're 
more aggressive about setting the bits, it also means that we have to 
clear the bits more often, increasing the likelihood of contention that 
you were worried about. Also, skipping a few pages here and there in 
vacuum doesn't make it any faster in practice, because you're reading 
sequentially. You need long contiguous regions of pages that can be 
skipped until you see a benefit.

Setting the PD_ALL_VISIBLE flag on the heap page itself more 
aggressively might be more interesting, because of the small seqscan 
optimization.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: visibility maps

From
"Pavan Deolasee"
Date:
On Thu, Dec 11, 2008 at 8:09 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>
>
> I'm not sure if we should set the bits in very aggressively. If we're more
> aggressive about setting the bits, it also means that we have to clear the
> bits more often, increasing the likelihood of contention that you were
> worried about.

Well, I would rather set bits aggressively and reduce contention by
changing the lock type. If HOT is working well, VACUUM will have very
few things to do, but visibility map wouldn't help as much as it can
unless we set the bits after pruning.

Another thing I noticed is the since VACUUM tries to set the bit in
the first phase, it's working only because HOT prunes DEAD tuples just
before we do another scan on line pointers (which I had earlier talked
about getting rid of. May be its time I do that). Otherwise, the
visibility bit won't be set even though the DEAD tuples will be
removed in the second scan and the rest are all LIVE tuples. So if we
at all want to take out the another scan of line pointers from the
first pass, we should rather push the work setting bits in the prune
code.

Thanks,
Pavan

-- 

Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: visibility maps

From
"Pavan Deolasee"
Date:
On Thu, Dec 11, 2008 at 8:09 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Pavan Deolasee wrote:
>
>
>> I can do some if we don't have already.
>
> Oh, yes please!
>

I did some tests today with pgbench on a decent SMP machine. The goal
was to see if multiple clients (20 in the test) tries to update tuples
in different data blocks, if the EX lock on the VM page causes any
contention.

I can confirm that I haven't seen any drop in the tps with VM. I guess
since the bit reset is a very small code compared to the entire UPDATE
code path, may be its less likely than I thought previously that
multiple clients attempt to reset the bit at the same time. I'll do
some more tests to see if setting the bit in HOT-prune path leads to
any contention or not.

I can send details of the test I did, if anyone is interested.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: visibility maps

From
Dimitri Fontaine
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Le 12 déc. 08 à 13:11, Pavan Deolasee a écrit :
> I did some tests today with pgbench on a decent SMP machine. The goal
> was to see if multiple clients (20 in the test) tries to update tuples
> in different data blocks, if the EX lock on the VM page causes any
> contention.

If you want to test for a really high number of clients, you could
have a try at tsung, which is designed for doing just this.
http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php http://tsung.erlang-projects.org/ 

HTH, regards,
- --
dim



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iEYEARECAAYFAklCWekACgkQlBXRlnbh1bmQdACgwloRjx9lZyhLpjGCSuY7K/Au
xmUAoJSAlVoqerio175UHFPS1xVzI3iZ
=45KY
-----END PGP SIGNATURE-----


Re: visibility maps

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> Another thing I noticed is the since VACUUM tries to set the bit in
> the first phase, it's working only because HOT prunes DEAD tuples just
> before we do another scan on line pointers (which I had earlier talked
> about getting rid of. May be its time I do that). Otherwise, the
> visibility bit won't be set even though the DEAD tuples will be
> removed in the second scan and the rest are all LIVE tuples. So if we
> at all want to take out the another scan of line pointers from the
> first pass, we should rather push the work setting bits in the prune
> code.

I don't quite understand this paragraph. If there's any DEAD tuples or 
line-pointers, the all-visible flag can't be set. After an UPDATE or 
DELETE, it indeed takes two vacuums until the bits in the visibility map 
are set.

Or did you mean that it only works because the prune phase sets the hint 
bits on the tuples? HeapTupleSatisfiesVacuum sets them too, so we're not 
relying on the prune phase to set them.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: visibility maps

From
"Pavan Deolasee"
Date:
On Wed, Dec 17, 2008 at 3:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>
>
> I don't quite understand this paragraph. If there's any DEAD tuples or
> line-pointers, the all-visible flag can't be set. After an UPDATE or DELETE,
> it indeed takes two vacuums until the bits in the visibility map are set.
>
> Or did you mean that it only works because the prune phase sets the hint
> bits on the tuples? HeapTupleSatisfiesVacuum sets them too, so we're not
> relying on the prune phase to set them.
>

No, I am saying, HOT-prune removes all DEAD tuples from the page (not
the DEAD line pointers though) and that's why you may not need two
vacuums for the visibility bit to set because the first phase of
vacuum would not find any DEAD tuples. Without HOT prune, two vacuums
will be required to set the bit and that would be counter intuitive
because somebody who has just run vacuum on the table would expect the
next vacuum to be no-op if there are no updates/deletes in between.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: visibility maps

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On Wed, Dec 17, 2008 at 3:29 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> I don't quite understand this paragraph. If there's any DEAD tuples or
>> line-pointers, the all-visible flag can't be set.

> No, I am saying, HOT-prune removes all DEAD tuples from the page (not
> the DEAD line pointers though) and that's why you may not need two
> vacuums for the visibility bit to set because the first phase of
> vacuum would not find any DEAD tuples.

I think what you are suggesting is that we should set the visibility map
bit while dead line pointers (tombstones) still remain.  If that's what
you meant it's a bad idea.  It would be correct in the sense of "there
are no invisible rows here", but it's not correct in the sense of "there
is no work for vacuum to do here".  In particular that would be the
wrong definition for the hoped-for future feature of index-only scans.
        regards, tom lane


Re: visibility maps

From
"Pavan Deolasee"
Date:
On Wed, Dec 17, 2008 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>
> I think what you are suggesting is that we should set the visibility map
> bit while dead line pointers (tombstones) still remain.  If that's what
> you meant it's a bad idea.

No, I'm not suggesting that. I understand the problem there. I was
merely explaining how HOT-prune may some time (when there are no DEAD
line pointers created) help us set the visibility bit early.

OTOH I think we can still set PD_ALL_VISIBLE page header flag even
when there are DEAD line pointers.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: visibility maps

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> OTOH I think we can still set PD_ALL_VISIBLE page header flag even
> when there are DEAD line pointers.

That would mean the header flag means something different than the
map bit does, which would mean extra tests need to be made before
propagating the flag bit to the map.  This is doable, no doubt,
but it seems pretty error-prone ...
        regards, tom lane


Re: visibility maps

From
Bruce Momjian
Date:
Is there anything to do with the below issue?

---------------------------------------------------------------------------

Pavan Deolasee wrote:
> On Wed, Dec 17, 2008 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> >
> > I think what you are suggesting is that we should set the visibility map
> > bit while dead line pointers (tombstones) still remain.  If that's what
> > you meant it's a bad idea.
> 
> No, I'm not suggesting that. I understand the problem there. I was
> merely explaining how HOT-prune may some time (when there are no DEAD
> line pointers created) help us set the visibility bit early.
> 
> OTOH I think we can still set PD_ALL_VISIBLE page header flag even
> when there are DEAD line pointers.
> 
> Thanks,
> Pavan
> 
> -- 
> Pavan Deolasee
> EnterpriseDB     http://www.enterprisedb.com
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: visibility maps

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> > OTOH I think we can still set PD_ALL_VISIBLE page header flag even
> > when there are DEAD line pointers.
> 
> That would mean the header flag means something different than the
> map bit does, which would mean extra tests need to be made before
> propagating the flag bit to the map.  This is doable, no doubt,
> but it seems pretty error-prone ...

Are there any open items from this discussion?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: visibility maps

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
>> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
>>> OTOH I think we can still set PD_ALL_VISIBLE page header flag even
>>> when there are DEAD line pointers.
>> That would mean the header flag means something different than the
>> map bit does, which would mean extra tests need to be made before
>> propagating the flag bit to the map.  This is doable, no doubt,
>> but it seems pretty error-prone ...
> 
> Are there any open items from this discussion?

No.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com