Thread: CeBit

CeBit

From
Michael Meskes
Date:
Is anyone on this list in Hannover for CeBit? Maybe we could arrange a
meeting.

Michael
-- 
Michael Meskes
Michael@Fam-Meskes.De
Go SF 49ers! Go Rhein Fire!
Use Debian GNU/Linux! Use PostgreSQL!


WAL & SHM principles

From
Martin Devera
Date:
Hello,

maybe I missed something, but over the last few days I have been thinking
about how I would write my own SQL server. I got several ideas, and because
they are not used in PG they are probably bad, but I can't figure out why.

1) WAL
We have a buffer manager, OK. So why not make WAL a part of it and, instead
of logging INSERT/UPDATE/DELETE xlog records, log the changes to buffer
pages directly? When someone dirties a page, it has to inform the bmgr about
the dirty region, and the bmgr would formulate the xlog record. The record
could be, for example, a fixed bitmap where each bit corresponds to a part of
the page (of size pgsize/no-of-bits) which was changed, followed by the
changed regions themselves.
Multiple writes (by multiple backends) can be coalesced together as long
as their transactions overlap and there is enough memory to keep the changed
buffer pages in memory.

Pros:    upper layers can assume that buffers are always safe/logged, and
         there is no special handling for indices; redo is very simple and fast

Cons:    can't implement undo, but with non-overwriting storage it is not needed (?)
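
The record format proposed in 1) can be sketched roughly as follows. This is only an illustration under stated assumptions: the 8192-byte page, the 64-bit bitmap, and the names mark_dirty, make_record and redo are all hypothetical and are not PostgreSQL code.

```python
PAGE_SIZE = 8192               # assumed page size
NUM_BITS = 64                  # assumed bitmap width
CHUNK = PAGE_SIZE // NUM_BITS  # bytes of page covered by one bit (128 here)


def mark_dirty(bitmap: int, offset: int, length: int) -> int:
    """Set the bit of every chunk touched by [offset, offset + length)."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    for i in range(first, last + 1):
        bitmap |= 1 << i
    return bitmap


def make_record(page: bytes, bitmap: int) -> bytes:
    """Record layout: 8-byte bitmap, then the marked chunks in page order."""
    chunks = [page[i * CHUNK:(i + 1) * CHUNK]
              for i in range(NUM_BITS) if (bitmap >> i) & 1]
    return bitmap.to_bytes(8, "little") + b"".join(chunks)


def redo(page: bytearray, record: bytes) -> None:
    """Replay a record by copying each logged chunk back into the page."""
    bitmap = int.from_bytes(record[:8], "little")
    pos = 8
    for i in range(NUM_BITS):
        if (bitmap >> i) & 1:
            page[i * CHUNK:(i + 1) * CHUNK] = record[pos:pos + CHUNK]
            pos += CHUNK
```

Several writes to the same page simply OR their bits into one bitmap, which is the coalescing described above: two small writes that land in the same chunk cost one chunk in the record.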

2) SHM vs. MMAP
Why not use mmap to share pages (instead of shm)? There would be no
problem with tuning PG's buffer cache size: it is balanced by the OS.
When using SHM there are often two copies of a page: one in the OS page cache
and one in SHM (a waste of memory).
When using mmap the data goes (almost) directly from disk into your memory
page; today you need to copy it from the OS page to PG's page.
There is one problem: how to ensure that a dirtied page is not flushed
before its xlog record. One can use mlock, but you often need root privileges
to use it. Another way is to implement our own COW (copy on write) to create
intermediate buffers used only until the xlog is flushed.

Are these considerations correct?

regards, devik



Re: WAL & SHM principles

From
Bruce Momjian
Date:
> 2) SHM vs. MMAP
> Why don't use mmap to share pages (instead of shm) ? There would be no
> problem with tuning pg's buffer cache size - it is balanced by OS.
> When using SHM there are often two copies of page: one in OS' page cache
> and one in SHM (waste of memory).
> When using mmap the data goes (almost) directly from HDD into your memory
> page - now you need to copy it from OS' page to PG's page.
> There is one problem: how to assure that dirtied page is not flushed
> before its xlog. One can use mlock but you often need root privileges to
> use it. Another way is to implement own COW (copy on write) to create
> intermediate buffers used only until xlog is flushed.

This was brought up a week ago, and I consider it an interesting idea. 
The only problem is that we would no longer have control over which
pages made it to disk.  The OS would perhaps write pages as we modified
them.  Not sure how important that is.

The good news is that most/all OS's are smart enough that if two
processes mmap() the same file, they see each other's changes, so in a
sense it is shared memory, but a much larger, smarter pool of shared
memory than what we have now.  We would still need buffer headers and
stuff because we need to synchronize access to the buffers.
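
A small sketch of the sharing behaviour described above. It is an assumption-laden demo, not PostgreSQL code: both mappings live in one process for simplicity (two backends would each map the file themselves), and the name shared_mapping_demo is made up for illustration.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def shared_mapping_demo() -> bytes:
    """Map one file twice, write through one mapping, read through the other."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, PAGE)      # give the file one page of zeroes
        a = mmap.mmap(fd, PAGE)     # default mapping is shared and writable
        b = mmap.mmap(fd, PAGE)     # second, independent mapping of the same file
        a[0:5] = b"hello"           # modify the page through the first mapping
        seen = bytes(b[0:5])        # the second mapping observes the change
        a.close()
        b.close()
        return seen
    finally:
        os.close(fd)
        os.unlink(path)
```

Note that nothing here says when the modified page reaches the disk; that is decided by the OS unless the program forces it.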


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: WAL & SHM principles

From
Martin Devera
Date:
> This was brought up a week ago, and I consider it an interesting idea. 
> The only problem is that we would no longer have control over which
> pages made it to disk.  The OS would perhaps write pages as we modified
> them.  Not sure how important that is.

Yes. As I work on the Linux kernel I know something about it. When a page is
accessed, the CPU sets a bit in its PTE. The OS writes the page out when it
needs the page frame. It also tries to launder pages periodically, but the
actual algorithm changes too often in recent kernels ;-)
Also, a page write is not atomic: several buffer heads are filled for the
page and asynchronously posted for write. The elevator then sorts and
coalesces these buffer heads and creates the actual SCSI/IDE write requests.
But there is no guarantee that buffer heads from one page will be coalesced
into one write request ...
You can call mlock (VirtualLock on Win32) to lock a page in memory. You can
postpone the write using it. It is OK under Win32 and many unices, but under
Linux only the admin or a process with the CAP_IPC_LOCK capability can mlock.

> The good news is that most/all OS's are smart enough that if two
> processes mmap() the same file, they see each other's changes, so in a

Yes, when the MAP_SHARED flag is passed to mmap, IMHO that is mandatory for an OS.

> sense it is shared memory, but a much larger, smarter pool of shared
> memory than what we have now.  We would still need buffer headers and
> stuff because we need to synchronize access to the buffers.

You would also need some smart algorithm which tries to mmap several pages
as one contiguous block. You can mmap each page on its own, but OSes store
mmap information per page range, so you need to minimize the number of such
ranges.
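
The range-minimizing idea can be illustrated with a short sketch. The function name coalesce_pages is hypothetical; it only groups the page numbers a backend wants into runs, each of which could then be mapped with a single mmap call (offset = first page * page size, length = page count * page size).

```python
def coalesce_pages(page_numbers):
    """Group page numbers into (first_page, page_count) runs of consecutive pages."""
    runs = []
    for page in sorted(set(page_numbers)):
        if runs and page == runs[-1][0] + runs[-1][1]:
            # The page directly follows the last run, so extend that run.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Otherwise there is a gap, so start a new run.
            runs.append((page, 1))
    return runs
```

For example, pages 7, 3, 4, 5, 9, 8 collapse into two ranges instead of six separate mappings.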

devik



Re: WAL & SHM principles

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> The only problem is that we would no longer have control over which
> pages made it to disk.  The OS would perhaps write pages as we modified
> them.  Not sure how important that is.

Unfortunately, this alone is a *fatal* objection.  See nearby
discussions about WAL behavior: we must be able to control the relative
timing of WAL write/flush and data page writes.
        regards, tom lane


Re: WAL & SHM principles

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The only problem is that we would no longer have control over which
> > pages made it to disk.  The OS would perhaps write pages as we modified
> > them.  Not sure how important that is.
> 
> Unfortunately, this alone is a *fatal* objection.  See nearby
> discussions about WAL behavior: we must be able to control the relative
> timing of WAL write/flush and data page writes.

Bummer.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: WAL & SHM principles

From
ncm@zembu.com (Nathan Myers)
Date:
On Wed, Mar 07, 2001 at 11:21:37AM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The only problem is that we would no longer have control over which
> > pages made it to disk.  The OS would perhaps write pages as we modified
> > them.  Not sure how important that is.
> 
> Unfortunately, this alone is a *fatal* objection.  See nearby
> discussions about WAL behavior: we must be able to control the relative
> timing of WAL write/flush and data page writes.

Not so fast!

It is possible to build a logging system so that you mostly don't care
when the data blocks get written; a particular data block on disk is 
considered garbage until the next checkpoint, so that you might as well 
allow the blocks to be written any time, even before the log entry.

Letting the OS manage sharing of disk block images via mmap should be 
an enormous win vs. a fixed shm and manual scheduling by PG.  If that
requires changes in the logging protocol, it's worth it.

(What supported platforms don't have mmap?)

Nathan Myers
ncm@zembu.com


Re: WAL & SHM principles

From
Martin Devera
Date:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > The only problem is that we would no longer have control over which
> > > pages made it to disk.  The OS would perhaps write pages as we modified
> > > them.  Not sure how important that is.
> > 
> > Unfortunately, this alone is a *fatal* objection.  See nearby
> > discussions about WAL behavior: we must be able to control the relative
> > timing of WAL write/flush and data page writes.
> 
> Bummer.
> 
BTW, what means "bummer" ?
But for many OSes you CAN control when to write data - you can mlock
individual pages.



Re: WAL & SHM principles

From
Tim Allen
Date:
On Thu, 8 Mar 2001, Martin Devera wrote:

> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Unfortunately, this alone is a *fatal* objection.  See nearby
> > > discussions about WAL behavior: we must be able to control the relative
> > > timing of WAL write/flush and data page writes.
> > 
> > Bummer.
> > 
> BTW, what means "bummer" ?

It's a Postgres-specific extension to the SQL standard. It means "I am
disappointed". As far as I can tell, you _may_ use it as a column or table
name. :-)

Tim

-- 
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
  http://www4.tpg.com.au/users/rita_tim/



Re: WAL & SHM principles

From
Bruce Momjian
Date:
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > The only problem is that we would no longer have control over which
> > > > pages made it to disk.  The OS would perhaps write pages as we modified
> > > > them.  Not sure how important that is.
> > > 
> > > Unfortunately, this alone is a *fatal* objection.  See nearby
> > > discussions about WAL behavior: we must be able to control the relative
> > > timing of WAL write/flush and data page writes.
> > 
> > Bummer.
> > 
> BTW, what means "bummer" ?

Sorry, it means, "Oh, I am disappointed."

> But for many OSes you CAN control when to write data - you can mlock
> individual pages.

mlock() controls locking in physical memory.  I don't see it controlling
write().

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: WAL & SHM principles

From
Martin Devera
Date:
> > BTW, what means "bummer" ?
> 
> Sorry, it means, "Oh, I am disappointed."

thanks :)

> > But for many OSes you CAN control when to write data - you can mlock
> > individual pages.
> 
> mlock() controls locking in physical memory.  I don't see it controlling
> write().

When you mmap, you don't use write()!
mlock actually locks the page in memory, and as long as the page is locked
the OS doesn't attempt to write out the dirty page. It is also intended
for security applications, to ensure that sensitive data is not written to
insecure storage (HDD). That is the definition of mlock, so you can probably
rely on it.

There is a way to do it without mlock (as a fallback):
You definitely need some kind of page headers. The header should record
whether the page can be mmaped or is in the "dirty pool". Pages in the dirty
pool are pages which are dirty but not yet written, and are waiting for the
appropriate log record to be flushed. When the log is flushed, the data in
the dirty pool can be copied to its regular mmap location and discarded.

If the dirty pool grows too large, simply sync the log and the whole pool can
be copied out and discarded.
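
A rough sketch of this dirty-pool fallback, under loud assumptions: the class DirtyPool and its methods are hypothetical names, a plain bytearray stands in for the mmap'ed data file, and each write carries a log position (called lsn here, a term not used in the message above) telling which log record covers it.

```python
class DirtyPool:
    """Keeps modified page images private until their log record is flushed."""

    def __init__(self, num_pages: int, page_size: int = 8192):
        self.page_size = page_size
        self.mapped = bytearray(num_pages * page_size)  # stand-in for the mmap'ed file
        self.pool = {}        # page number -> (lsn, private page image)
        self.flushed_lsn = 0  # highest log position known to be on disk

    def read(self, page_no: int) -> bytes:
        """Readers see the pooled copy if one exists, else the mapped page."""
        if page_no in self.pool:
            return self.pool[page_no][1]
        lo = page_no * self.page_size
        return bytes(self.mapped[lo:lo + self.page_size])

    def write(self, page_no: int, data: bytes, lsn: int) -> None:
        """Copy-on-write: the new image goes to the pool, never to the mapping."""
        if len(data) != self.page_size:
            raise ValueError("data must be exactly one page")
        self.pool[page_no] = (lsn, bytes(data))

    def log_flushed(self, lsn: int) -> None:
        """After the log is durable up to lsn, release every page it covers."""
        self.flushed_lsn = max(self.flushed_lsn, lsn)
        ready = [p for p, (page_lsn, _) in self.pool.items()
                 if page_lsn <= self.flushed_lsn]
        for page_no in ready:
            _, data = self.pool.pop(page_no)
            lo = page_no * self.page_size
            self.mapped[lo:lo + self.page_size] = data  # now safe for the OS to write
```

The point of the design is that the OS can only ever write out pages whose log records are already on disk, because uncovered changes never touch the mapped region.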

The mmap version could be faster when loading data from disk and would result
in better utilization of memory (because you work directly with data
in the OS page cache instead of keeping duplicates in PG's buffer cache).
Also, page cache expiration is handled by the OS, which allows PG to use as
much memory as is available (no need to specify a buffer cache size).

devik



Re: WAL & SHM principles

From
Giles Lean
Date:
> When you mmap, you don't use write() !  mlock actually locks page in
> memory and as long as the page is locked the OS doesn't attempt to
> store the dirty page.  It is intended also for security app to
> ensure that sensitive data are not written to unsecure storage
> (hdd). It is definition of mlock so that you can be probably sure
> with it.

News to me ... can you please point to such a definition?  I see no
reference to such semantics in the mlock() manual page in UNIX98
(Single Unix Standard, version 2).

mlock() guarantees that the locked address space is in memory.  This
doesn't imply that updates are not written to the backing file.

I would expect an OS that doesn't have a unified buffer cache but
tries to keep a consistent view for mmap() and read()/write() to
update the file.

Regards,

Giles


Re: WAL & SHM principles

From
"Ken Hirsch"
Date:
Giles Lean <giles@nemeton.com.au> wrote:

> > When you mmap, you don't use write() !  mlock actually locks page in
> > memory and as long as the page is locked the OS doesn't attempt to
> > store the dirty page.  It is intended also for security app to
> > ensure that sensitive data are not written to unsecure storage
> > (hdd). It is definition of mlock so that you can be probably sure
> > with it.
>
> News to me ... can you please point to such a definition?  I see no
> reference to such semantics in the mlock() manual page in UNIX98
> (Single Unix Standard, version 2).
>
> mlock() guarantees that the locked address space is in memory.  This
> doesn't imply that updates are not written to the backing file.

I've wondered about this myself.  It _is_ true on Linux that mlock prevents
writes to the backing store, and this is used as a security feature for
cryptography software.   The code for gnupg assumes that if you have mlock()
on any operating system, it does mean this--which doesn't mean it's true,
but perhaps whoever wrote it does have good reason to think so.

But I don't know about other systems.  Does anybody know what the POSIX.1b
standard says?

It was even suggested to me on the linux-fsdev mailing list that mlock() was
a good way to ensure the write-ahead condition.

Ken Hirsch





Re: WAL & SHM principles

From
Matthew Kirkwood
Date:
On Tue, 13 Mar 2001, Ken Hirsch wrote:

> > mlock() guarantees that the locked address space is in memory.  This
> > doesn't imply that updates are not written to the backing file.
>
> I've wondered about this myself.  It _is_ true on Linux that mlock
> prevents writes to the backing store,

I don't believe that this is true.  The manpage offers no
such promises, and the semantics are not useful.

> and this is used as a security feature for cryptography software.

mlock() is used to prevent pages being swapped out.  Its
use for crypto software is essentially restricted to anon
memory (allocated via brk() or mmap() of /dev/zero).

If my understanding is accurate, before 2.4 Linux would
never swap out pages which had a backing store.  It would
simply write them back or drop them (if clean).  (This is
why you need around twice as much swap with 2.4.)

> The code for gnupg assumes that if you have mlock() on any operating
> system, it does mean this--which doesn't mean it's true, but perhaps
> whoever wrote it does have good reason to think so.

strace on gpg startup says:

mmap(0, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000
getuid()                                = 500
mlock(0x40015000)                       = -1 EPERM (Operation not permitted)

so whatever the authors think, it does not require this semantic.

Matthew.



Re: WAL & SHM principles

From
Alfred Perlstein
Date:
* Matthew Kirkwood <matthew@hairy.beasts.org> [010313 13:12] wrote:
> On Tue, 13 Mar 2001, Ken Hirsch wrote:
> 
> > > mlock() guarantees that the locked address space is in memory.  This
> > > doesn't imply that updates are not written to the backing file.
> >
> > I've wondered about this myself.  It _is_ true on Linux that mlock
> > prevents writes to the backing store,
> 
> I don't believe that this is true.  The manpage offers no
> such promises, and the semantics are not useful.

Afaik FreeBSD's Linux emulator:

revision 1.13
date: 2001/02/28 04:30:27;  author: dillon;  state: Exp;  lines: +3 -1
Linux does not filesystem-sync file-backed writable mmap pages on
a regular basis.  Adjust our linux emulation to conform.  This will
cause more dirty pages to be left for the pagedaemon to deal with,
but our new low-memory handling code can deal with it.   The linux
way appears to be a trend, and we may very well make MAP_NOSYNC the
default for FreeBSD as well (once we have reasonable sequential
write-behind heuristics for random faults).
(will be MFC'd prior to 4.3 freeze)

Suggested by: Andrew Gallatin

Basically any mmap'd data doesn't seem to get sync()'d out on
a regular basis.

> > and this is used as a security feature for cryptography software.
> 
> mlock() is used to prevent pages being swapped out.  Its
> use for crypto software is essentially restricted to anon
> memory (allocated via brk() or mmap() of /dev/zero).

What about userland device drivers that want to send parts
of a disk backed file to a driver's dma routine?

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/



Re: WAL & SHM principles

From
Matthew Kirkwood
Date:
On Tue, 13 Mar 2001, Alfred Perlstein wrote:

[..]
> Linux does not filesystem-sync file-backed writable mmap pages on a
> regular basis.

Very interesting.  I'm not sure that is necessarily the case in
2.4, though -- my understanding is that the new all-singing,
all-dancing page cache makes very little distinction between
mapped and unmapped dirty pages.

> Basically any mmap'd data doesn't seem to get sync()'d out on
> a regular basis.

Hmm.. I'd call that a bug, anyway.

> > > and this is used as a security feature for cryptography software.
> >
> > mlock() is used to prevent pages being swapped out.  Its
> > use for crypto software is essentially restricted to anon
> > memory (allocated via brk() or mmap() of /dev/zero).
>
> What about userland device drivers that want to send parts
> of a disk backed file to a driver's dma routine?

And realtime software.  I'm not disputing that mlock is useful,
but what it can do for security software is not that huge.  The
Linux manpage says:
      Memory locking has two main applications: real-time algorithms
      and high-security data processing.

Matthew.



Re: CeBit

From
Jan Wieck
Date:
Michael Meskes wrote:
> Is anyone on this list in Hannover for CeBit? Maybe we could arrange a
> meeting.
    Looks pretty much like I'll still be in Hamburg by then. What
    are the days you planned?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: WAL & SHM principles

From
Martin Devera
Date:
> > When you mmap, you don't use write() !  mlock actually locks page in
> > memory and as long as the page is locked the OS doesn't attempt to
> > store the dirty page.  It is intended also for security app to
> > ensure that sensitive data are not written to unsecure storage
> > (hdd). It is definition of mlock so that you can be probably sure
> > with it.
> 
> News to me ... can you please point to such a definition?  I see no
> reference to such semantics in the mlock() manual page in UNIX98
> (Single Unix Standard, version 2).

Sorry, maybe I'm biased toward Linux. The statement above is from Linux's
man page, and as I looked into the mm code it seems to be right.
I'm not sure about other unices.

> mlock() guarantees that the locked address space is in memory.  This
> doesn't imply that updates are not written to the backing file.

yes, probably it depends on OS in question. In Linux kernel the page is
not written when mlocked (but I'm not sure about msync here).

> I would expect an OS that doesn't have a unified buffer cache but
> tries to keep a consistent view for mmap() and read()/write() to
> update the file.

Hmm, but why mlock a page then? Only to be sure the page is not swapped
out?

regards, devik