Thread: CeBit
Is anyone on this list in Hannover for CeBit? Maybe we could arrange a meeting. Michael -- Michael Meskes Michael@Fam-Meskes.De Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!
Hello, maybe I missed something, but in the last few days I was thinking about how I would write my own SQL server. I got several ideas, and because these are not used in PG they are probably bad - but I can't figure out why. 1) WAL We have a buffer manager, ok. So why not use WAL as part of it and, instead of logging INSERT/UPDATE/DELETE xlog records, log the changes to buffer pages directly? When someone dirties a page, it has to inform bmgr about the dirty region, and bmgr would formulate the xlog record. The record could be, for example, a fixed bitmap where each bit corresponds to a part of the page (of size pgsize/no-of-bits) which was changed. The changed regions follow the bitmap. Multiple writes (by multiple backends) can be coalesced together as long as their transactions overlap and there is enough memory to keep the changed buffer pages in memory. Pros: upper layers can assume that buffers are always safe/logged and there is no special handling for indices; very simple/fast redo. Cons: can't implement undo - but with non-overwriting storage it is not needed (?) 2) SHM vs. MMAP Why not use mmap to share pages (instead of shm)? There would be no problem with tuning pg's buffer cache size - it is balanced by the OS. When using SHM there are often two copies of a page: one in the OS's page cache and one in SHM (a waste of memory). When using mmap the data goes (almost) directly from the HDD into your memory page - now you need to copy it from the OS's page to PG's page. There is one problem: how to ensure that a dirtied page is not flushed before its xlog. One can use mlock, but you often need root privileges to use it. Another way is to implement our own COW (copy on write) to create intermediate buffers used only until the xlog is flushed. Are these considerations correct? regards, devik
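The bitmap record described under 1) can be sketched in a few lines. This is only an illustration under stated assumptions: the 8192-byte page, the 64-bit bitmap, and the names `make_record` and `redo` are hypothetical and are not taken from PostgreSQL's xlog code.

```python
PAGE_SIZE = 8192                  # assumed page size
NUM_BITS = 64                     # assumed bitmap width
REGION = PAGE_SIZE // NUM_BITS    # bytes covered by one bit (pgsize/no-of-bits)


def make_record(page, dirty_ranges):
    """Build a record: fixed bitmap followed by the changed regions.

    dirty_ranges is a list of (offset, length) pairs reported by whoever
    dirtied the page. Overlapping or repeated ranges simply set the same
    bits again, which is how several writes coalesce into one record.
    """
    bitmap = 0
    for offset, length in dirty_ranges:
        if length <= 0:
            continue
        first = offset // REGION
        last = (offset + length - 1) // REGION
        for region in range(first, last + 1):
            bitmap |= 1 << region
    payload = b"".join(
        bytes(page[r * REGION:(r + 1) * REGION])
        for r in range(NUM_BITS)
        if (bitmap >> r) & 1
    )
    return bitmap.to_bytes(NUM_BITS // 8, "little") + payload


def redo(page, record):
    """Replay a record onto a (mutable) old copy of the page."""
    header = NUM_BITS // 8
    bitmap = int.from_bytes(record[:header], "little")
    pos = header
    for r in range(NUM_BITS):
        if (bitmap >> r) & 1:
            page[r * REGION:(r + 1) * REGION] = record[pos:pos + REGION]
            pos += REGION
```

Redo here is a plain copy of regions back into the page, which matches the "very simple/fast redo" point; the sketch keeps no before-image, so it cannot undo, as the Cons line already notes.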
> 2) SHM vs. MMAP > Why don't use mmap to share pages (instead of shm) ? There would be no > problem with tuning pg's buffer cache size - it is balanced by OS. > When using SHM there are often two copies of page: one in OS' page cache > and one in SHM (waste of memory). > When using mmap the data goes (almost) directly from HDD into your memory > page - now you need to copy it from OS' page to PG's page. > There is one problem: how to assure that dirtied page is not flushed > before its xlog. One can use mlock but you often need root privileges to > use it. Another way is to implement own COW (copy on write) to create > intermediate buffers used only until xlog is flushed. This was brought up a week ago, and I consider it an interesting idea. The only problem is that we would no longer have control over which pages made it to disk. The OS would perhaps write pages as we modified them. Not sure how important that is. The good news is that most/all OSes are smart enough that if two processes mmap() the same file, they see each other's changes, so in a sense it is shared memory, but a much larger, smarter pool of shared memory than what we have now. We would still need buffer headers and stuff because we need to synchronize access to the buffers. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> This was brought up a week ago, and I consider it an interesting idea. > The only problem is that we would no longer have control over which > pages made it to disk. The OS would perhaps write pages as we modified > them. Not sure how important that is. Yes. As I work on the Linux kernel I know something about it. When a page is accessed the CPU sets one bit in the PTE. The OS writes the page when it needs the page frame. It also tries to launder pages periodically, but the actual algorithm changes too often in recent kernels ;-) Also, a page write is not atomic - several buffer heads are filled for the page and asynchronously posted for write. The elevator then sorts and coalesces these buffer heads and creates the actual scsi/ide write requests. But there is no guarantee that buffer heads from one page will be coalesced into one write request ... You can call mlock (PageLock on Win32) to lock a page in memory. You can postpone the write using it. It is ok under Win32 and many unices, but under Linux only the admin or someone with CAP_MEMLOCK (not exact name) can mlock. > The good news is that most/all OS's are smart enough that if two > processes mmap() the same file, they see each other's changes, so in a Yes, when passing the SHARED flag to mmap then IMHO it is mandatory for an OS. > sense it is shared memory, but a much larger, smarter pool of shared > memory than what we have now. We would still need buffer headers and > stuff because we need to synchronize access to the buffers. Also some smart algorithm which tries to mmap several pages in one contiguous block. You can mmap each page on its own, but OSes store mmap information per page range. You need to minimize the number of such ranges. devik
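The shared-mapping behaviour discussed here is easy to try out. A minimal sketch, assuming a POSIX-like system with a unified page cache: two mappings of the same file inside one process stand in for two backends, and `shared_view_demo` is a made-up name. `access=mmap.ACCESS_WRITE` gives a shared, write-through mapping in Python's `mmap` module.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def shared_view_demo():
    """Write through one mapping, read through another and via read()."""
    with tempfile.TemporaryFile() as f:
        f.truncate(PAGE)  # a file must be non-empty before it can be mapped
        a = mmap.mmap(f.fileno(), PAGE, access=mmap.ACCESS_WRITE)
        b = mmap.mmap(f.fileno(), PAGE, access=mmap.ACCESS_WRITE)
        try:
            a[0:5] = b"tuple"       # modify the page through mapping a
            seen = bytes(b[0:5])    # mapping b sees the change at once
            a.flush()               # msync(): ask the OS to write it back
            f.seek(0)
            on_file = f.read(5)     # ordinary read() of the same bytes
        finally:
            a.close()
            b.close()
    return seen, on_file
```

The sketch shows only visibility between mappings. It says nothing about when the OS writes the page to disk without the explicit `flush()`, which is exactly the ordering question raised in this thread.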
Bruce Momjian <pgman@candle.pha.pa.us> writes: > The only problem is that we would no longer have control over which > pages made it to disk. The OS would perhaps write pages as we modified > them. Not sure how important that is. Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > The only problem is that we would no longer have control over which > > pages made it to disk. The OS would perhaps write pages as we modified > > them. Not sure how important that is. > > Unfortunately, this alone is a *fatal* objection. See nearby > discussions about WAL behavior: we must be able to control the relative > timing of WAL write/flush and data page writes. Bummer. -- Bruce Momjian
On Wed, Mar 07, 2001 at 11:21:37AM -0500, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > The only problem is that we would no longer have control over which > > pages made it to disk. The OS would perhaps write pages as we modified > > them. Not sure how important that is. > > Unfortunately, this alone is a *fatal* objection. See nearby > discussions about WAL behavior: we must be able to control the relative > timing of WAL write/flush and data page writes. Not so fast! It is possible to build a logging system so that you mostly don't care when the data blocks get written; a particular data block on disk is considered garbage until the next checkpoint, so that you might as well allow the blocks to be written any time, even before the log entry. Letting the OS manage sharing of disk block images via mmap should be an enormous win vs. a fixed shm and manual scheduling by PG. If that requires changes in the logging protocol, it's worth it. (What supported platforms don't have mmap?) Nathan Myers ncm@zembu.com
> > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > The only problem is that we would no longer have control over which > > > pages made it to disk. The OS would perhaps write pages as we modified > > > them. Not sure how important that is. > > > > Unfortunately, this alone is a *fatal* objection. See nearby > > discussions about WAL behavior: we must be able to control the relative > > timing of WAL write/flush and data page writes. > > Bummer. > BTW, what means "bummer" ? But for many OSes you CAN control when to write data - you can mlock individual pages.
On Thu, 8 Mar 2001, Martin Devera wrote: > > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Unfortunately, this alone is a *fatal* objection. See nearby > > > discussions about WAL behavior: we must be able to control the relative > > > timing of WAL write/flush and data page writes. > > > > Bummer. > > > BTW, what means "bummer" ? It's a Postgres-specific extension to the SQL standard. It means "I am disappointed". As far as I can tell, you _may_ use it as a column or table name. :-) Tim -- ----------------------------------------------- Tim Allen tim@proximity.com.au Proximity Pty Ltd http://www.proximity.com.au/ http://www4.tpg.com.au/users/rita_tim/
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > > The only problem is that we would no longer have control over which > > > > pages made it to disk. The OS would perhaps write pages as we modified > > > > them. Not sure how important that is. > > > > > > Unfortunately, this alone is a *fatal* objection. See nearby > > > discussions about WAL behavior: we must be able to control the relative > > > timing of WAL write/flush and data page writes. > > > > Bummer. > > > BTW, what means "bummer" ? Sorry, it means, "Oh, I am disappointed." > But for many OSes you CAN control when to write data - you can mlock > individual pages. mlock() controls locking in physical memory. I don't see it controlling write(). -- Bruce Momjian
> > BTW, what means "bummer" ? > > Sorry, it means, "Oh, I am disappointed." thanks :) > > But for many OSes you CAN control when to write data - you can mlock > > individual pages. > > mlock() controls locking in physical memory. I don't see it controlling > write(). When you mmap, you don't use write()! mlock actually locks the page in memory, and as long as the page is locked the OS doesn't attempt to store the dirty page. It is also intended for security applications, to ensure that sensitive data is not written to insecure storage (hdd). That is the definition of mlock, so you can probably rely on it. There is a way to do it without mlock (fallback): You definitely need some kind of page headers. The header should say whether the page can be mmapped or is in the "dirty pool". Pages in the dirty pool are pages which are dirty but not written yet and are waiting for the appropriate log record to be flushed. When the log is flushed, the data in the dirty pool can be copied to its regular mmap location and discarded. If the dirty pool is too large, simply sync the log and the whole pool can be discarded. The mmap version could be faster when loading data from hdd and will result in better utilization of memory (because you are working directly with data in the OS's page cache instead of having duplicates in pg's buffer cache). Also, page cache expiration is handled by the OS, and it will allow pg to use as much memory as is available (no need to specify the buffer cache size). devik
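The dirty-pool fallback can be sketched as a small bookkeeping class. This is a sketch under assumptions, not anything from the PostgreSQL source: the class name `DirtyPool`, the `sync_log` callback, and the use of a plain `bytearray` as a stand-in for the mmap'ed data file are all invented for illustration. LSN here just means a position in the log.

```python
class DirtyPool:
    """Keeps private copies of dirty pages until their log records are flushed."""

    def __init__(self, mapped, page_size, max_pages, sync_log):
        self.mapped = mapped        # stand-in for the mmap'ed data file
        self.page_size = page_size
        self.max_pages = max_pages  # pool size that triggers a log sync
        self.sync_log = sync_log    # callable: flush the log, return flushed LSN
        self.pool = {}              # page number -> (lsn, private copy)

    def read_page(self, pageno):
        """Readers must see the newest version, even if it is still pooled."""
        if pageno in self.pool:
            return self.pool[pageno][1]
        start = pageno * self.page_size
        return bytes(self.mapped[start:start + self.page_size])

    def write_page(self, pageno, data, lsn):
        """Dirty a page: keep it out of the mapping until lsn is on disk."""
        if len(data) != self.page_size:
            raise ValueError("data must be exactly one page")
        self.pool[pageno] = (lsn, bytes(data))
        if len(self.pool) > self.max_pages:
            # Pool too large: sync the log, then drain what it covers.
            self.log_flushed(self.sync_log())

    def log_flushed(self, flushed_lsn):
        """Copy pages whose log records are durable to their mmap location."""
        ready = [p for p, (lsn, _) in self.pool.items() if lsn <= flushed_lsn]
        for pageno in ready:
            _, data = self.pool.pop(pageno)
            start = pageno * self.page_size
            self.mapped[start:start + self.page_size] = data
```

Only the copy into `mapped` makes a page eligible for write-back by the OS, so the write-ahead ordering holds without mlock; the price is one extra copy per dirtied page.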
> When you mmap, you don't use write() ! mlock actually locks page in > memory and as long as the page is locked the OS doesn't attempt to > store the dirty page. It is intended also for security app to > ensure that sensitive data are not written to unsecure storage > (hdd). It is definition of mlock so that you can be probably sure > with it. News to me ... can you please point to such a definition? I see no reference to such semantics in the mlock() manual page in UNIX98 (Single Unix Standard, version 2). mlock() guarantees that the locked address space is in memory. This doesn't imply that updates are not written to the backing file. I would expect an OS that doesn't have a unified buffer cache but tries to keep a consistent view for mmap() and read()/write() to update the file. Regards, Giles
Giles Lean <giles@nemeton.com.au> wrote: > > When you mmap, you don't use write() ! mlock actually locks page in > > memory and as long as the page is locked the OS doesn't attempt to > > store the dirty page. It is intended also for security app to > > ensure that sensitive data are not written to unsecure storage > > (hdd). It is definition of mlock so that you can be probably sure > > with it. > > News to me ... can you please point to such a definition? I see no > reference to such semantics in the mlock() manual page in UNIX98 > (Single Unix Standard, version 2). > > mlock() guarantees that the locked address space is in memory. This > doesn't imply that updates are not written to the backing file. I've wondered about this myself. It _is_ true on Linux that mlock prevents writes to the backing store, and this is used as a security feature for cryptography software. The code for gnupg assumes that if you have mlock() on any operating system, it does mean this--which doesn't mean it's true, but perhaps whoever wrote it does have good reason to think so. But I don't know about other systems. Does anybody know what the POSIX.1b standard says? It was even suggested to me on the linux-fsdev mailing list that mlock() was a good way to ensure the write-ahead condition. Ken Hirsch
On Tue, 13 Mar 2001, Ken Hirsch wrote: > > mlock() guarantees that the locked address space is in memory. This > > doesn't imply that updates are not written to the backing file. > > I've wondered about this myself. It _is_ true on Linux that mlock > prevents writes to the backing store, I don't believe that this is true. The manpage offers no such promises, and the semantics are not useful. > and this is used as a security feature for cryptography software. mlock() is used to prevent pages being swapped out. Its use for crypto software is essentially restricted to anon memory (allocated via brk() or mmap() of /dev/zero). If my understanding is accurate, before 2.4 Linux would never swap out pages which had a backing store. It would simply write them back or drop them (if clean). (This is why you need around twice as much swap with 2.4.) > The code for gnupg assumes that if you have mlock() on any operating > system, it does mean this--which doesn't mean it's true, but perhaps > whoever wrote it does have good reason to think so. strace on gpg startup says: mmap(0, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000 getuid() = 500 mlock(0x40015000) = -1 EPERM (Operation not permitted) so whatever the authors think, it does not require this semantic. Matthew.
* Matthew Kirkwood <matthew@hairy.beasts.org> [010313 13:12] wrote: > On Tue, 13 Mar 2001, Ken Hirsch wrote: > > > > mlock() guarantees that the locked address space is in memory. This > > > doesn't imply that updates are not written to the backing file. > > > > I've wondered about this myself. It _is_ true on Linux that mlock > > prevents writes to the backing store, > > I don't believe that this is true. The manpage offers no > such promises, and the semantics are not useful. Afaik FreeBSD's Linux emulator: revision 1.13 date: 2001/02/28 04:30:27; author: dillon; state: Exp; lines: +3 -1 Linux does not filesystem-sync file-backed writable mmap pages on a regular basis. Adjust our linux emulation to conform. This will cause more dirty pages to be left for the pagedaemon to deal with, but our new low-memory handling code can deal with it. The linux way appears to be a trend, and we may very well make MAP_NOSYNC the default for FreeBSD as well (once we have reasonable sequential write-behind heuristics for random faults). (will be MFC'd prior to 4.3 freeze) Suggested by: Andrew Gallatin Basically any mmap'd data doesn't seem to get sync()'d out on a regular basis. > > and this is used as a security feature for cryptography software. > > mlock() is used to prevent pages being swapped out. Its > use for crypto software is essentially restricted to anon > memory (allocated via brk() or mmap() of /dev/zero). What about userland device drivers that want to send parts of a disk backed file to a driver's dma routine? -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
On Tue, 13 Mar 2001, Alfred Perlstein wrote: [..] > Linux does not filesystem-sync file-backed writable mmap pages on a > regular basis. Very interesting. I'm not sure that is necessarily the case in 2.4, though -- my understanding is that the new all-singing, all-dancing page cache makes very little distinction between mapped and unmapped dirty pages. > Basically any mmap'd data doesn't seem to get sync()'d out on > a regular basis. Hmm.. I'd call that a bug, anyway. > > > and this is used as a security feature for cryptography software. > > > > mlock() is used to prevent pages being swapped out. Its > > use for crypto software is essentially restricted to anon > > memory (allocated via brk() or mmap() of /dev/zero). > > What about userland device drivers that want to send parts > of a disk backed file to a driver's dma routine? And realtime software. I'm not disputing that mlock is useful, but what it can do for security software is not that huge. The Linux manpage says: Memory locking has two main applications: real-time algorithms and high-security data processing. Matthew.
Michael Meskes wrote: > Is anyone on this list in Hannover for CeBit? Maybe we could arrange a > meeting. Looks pretty much that I'll be still in Hamburg by then. What are the days you planned? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
> > When you mmap, you don't use write() ! mlock actually locks page in > > memory and as long as the page is locked the OS doesn't attempt to > > store the dirty page. It is intended also for security app to > > ensure that sensitive data are not written to unsecure storage > > (hdd). It is definition of mlock so that you can be probably sure > > with it. > > News to me ... can you please point to such a definition? I see no > reference to such semantics in the mlock() manual page in UNIX98 > (Single Unix Standard, version 2). Sorry, maybe I'm biased toward Linux. The statement above is from Linux's man page, and when I looked into the mm code it seemed to be right. I'm not sure about other unices. > mlock() guarantees that the locked address space is in memory. This > doesn't imply that updates are not written to the backing file. Yes, it probably depends on the OS in question. In the Linux kernel the page is not written while mlocked (but I'm not sure about msync here). > I would expect an OS that doesn't have a unified buffer cache but > tries to keep a consistent view for mmap() and read()/write() to > update the file. Hmm, but why mlock a page then? Only to be sure the page is not swapped out? regards, devik