Thread: making use of large TLB pages
Rohit Seth recently added support for the use of large TLB pages on Linux if the processor architecture supports them (I believe SPARC, IA32, and IA64 have hugetlb support; more archs will probably be added). The patch was merged into Linux 2.5.36, so it will more than likely be in Linux 2.6. For more information on large TLB pages and why they are generally viewed to improve database performance, see here:

http://lwn.net/Articles/6535/ (the patch this refers to is an earlier implementation, I believe, but the idea is the same)
http://lwn.net/Articles/10293/ (item #4)

I'd like to enable PostgreSQL to use large TLB pages, if the OS and processor support them. In talking to the author of the TLB patches for Linux (Rohit Seth), he described the current API:

======
1) Only two system calls. These are:

	sys_alloc_hugepages(int key, unsigned long addr, unsigned long len, int prot, int flag)
	sys_free_hugepages(unsigned long addr)

Key will be equal to zero if the user wants these huge pages as private. A positive int value will be used for unrelated apps to share the same physical huge pages.

addr is the user-preferred address. The kernel may decide to allocate a different virtual address (depending on availability and alignment factors).

len is the requested size of memory wanted by the user app.

prot can take the values PROT_READ, PROT_WRITE, PROT_EXEC.

flag: the only allowed value right now is IPC_CREAT, which in the case of shared hugepages (across processes) tells the kernel to create a new segment if none is already created. If this flag is not provided and there is no hugepage segment corresponding to the "key", then ENOENT is returned. More like on the lines of the IPC_CREAT flag for the shmget routine.

On success, sys_alloc_hugepages returns the virtual address allocated by the kernel.
======

So as I understand it, we would basically replace the calls to shmget(), shmdt(), etc. with these system calls. The behavior will be slightly different, however -- I'm not sure if this API supports everything we expect the SysV IPC API to support (e.g. telling the # of clients attached to a given segment). Can anyone comment on exactly what functionality we expect when dealing with the storage mechanism of the shared buffer?

Any comments would be appreciated.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
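[To make the interface concrete: a minimal C sketch of how the two syscalls might be wrapped on a 2.5-era kernel. The wrapper names are invented, and the syscall numbers are placeholders to be checked against <asm/unistd.h> on an actual hugepage-enabled kernel.]

/* Thin wrappers for the two syscalls (sketch only). */
#define _GNU_SOURCE
#include <unistd.h>     /* syscall() */
#include <sys/ipc.h>    /* IPC_CREAT, for callers */
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE, PROT_EXEC, for callers */

#ifndef __NR_alloc_hugepages
#define __NR_alloc_hugepages 250    /* placeholder: verify against the kernel */
#define __NR_free_hugepages  251    /* placeholder: verify against the kernel */
#endif

/* Returns the kernel-chosen virtual address, or NULL on failure. */
void *
pg_alloc_hugepages(int key, unsigned long addr, unsigned long len,
                   int prot, int flag)
{
    long result = syscall(__NR_alloc_hugepages, key, addr, len, prot, flag);

    return (result == -1) ? NULL : (void *) result;
}

int
pg_free_hugepages(unsigned long addr)
{
    return syscall(__NR_free_hugepages, addr);
}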
Neil Conway <neilc@samurai.com> writes:
> I'd like to enable PostgreSQL to use large TLB pages, if the OS and
> processor support them.

Hmm ... it seems interesting, but I'm hesitant to do a lot of work to support something that's only available on one hardware-and-OS combination. (If we were talking about a Windows-specific hack, you'd already have lost the audience, no? But I digress.)

> So as I understand it, we would basically replace the calls to
> shmget(), shmdt(), etc. with these system calls. The behavior will be
> slightly different, however -- I'm not sure if this API supports
> everything we expect the SysV IPC API to support (e.g. telling the #
> of clients attached to a given segment).

I trust it at least supports inheriting the page mapping over a fork()?

> Can anyone comment on exactly what functionality we expect when
> dealing with the storage mechanism of the shared buffer?

The only thing we use beyond the obvious "here's some memory accessible by both parent and child processes" is the #-of-clients functionality you mentioned.

The reason that that is interesting is it provides a safety interlock against the case where a postmaster has crashed but left child backends running. If a new postmaster is started and starts its own collection of children, then we are in very bad hot water, because the old and new backend sets will be modifying the same database files without any mutual awareness or interlocks. This *will* lead to serious, possibly unrecoverable database corruption.

The SysV API provides a reliable interlock to prevent this scenario: we read the old shared memory block ID from the old postmaster's postmaster.pid file, and look to see if that block (a) still exists and (b) still has attached processes (presumably backends). If it's gone or has no attached processes, it's safe for the new postmaster to continue startup.

I have little love for the SysV shmem API, but I haven't thought of an equivalently reliable interlock for this scenario without it. (For example, something along the lines of requiring each backend to write its PID into a file isn't very reliable at all: it leaves a window at each backend start where the backend hasn't yet written its PID, and it increases by a large factor the risk we've already seen wherein stale PID entries in lockfiles might by chance match the PIDs of other, unrelated processes.)

Any ideas for better answers?

			regards, tom lane
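[For reference, the existence-plus-attachment check described here maps onto the SysV API roughly as follows -- a simplified sketch, not the actual postmaster code. shmctl() with IPC_STAT reports the attach count in shm_nattch.]

#include <stdbool.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* shmId is the segment ID read from the old postmaster.pid. */
static bool
old_shmem_in_use(int shmId)
{
    struct shmid_ds shmStat;

    if (shmctl(shmId, IPC_STAT, &shmStat) < 0)
        return false;               /* segment is gone: safe to start */

    return shmStat.shm_nattch > 0;  /* attached processes => orphaned backends */
}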
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Neil Conway <neilc@samurai.com> writes:
> > I'd like to enable PostgreSQL to use large TLB pages, if the OS
> > and processor support them.
>
> Hmm ... it seems interesting, but I'm hesitant to do a lot of work
> to support something that's only available on one hardware-and-OS
> combination.

True; further, I personally find the current API a little cumbersome. For example, we get 4MB pages on Solaris with a few lines of code:

#if defined(solaris) && defined(__sparc__)
	/* use intimate shared memory on SPARC Solaris */
	memAddress = shmat(shmid, 0, SHM_SHARE_MMU);
#endif

But given that

	(a) Linux on x86 is probably our most popular platform
	(b) Every x86 since the Pentium has supported large pages
	(c) Other archs, like IA64 and SPARC, also support large pages

I think it's worthwhile implementing this, if possible.

> I trust it at least supports inheriting the page mapping over a
> fork()?

I'll check on this, but I'm pretty sure that it does.

> The SysV API provides a reliable interlock to prevent this scenario:
> we read the old shared memory block ID from the old postmaster's
> postmaster.pid file, and look to see if that block (a) still exists
> and (b) still has attached processes (presumably backends).

If the postmaster is starting up and the segment still exists, could we assume that's an error condition, and force the admin to manually fix it? It does make the system less robust, but I'm suspicious of any attempts to automagically fix a situation in which we *know* something has gone seriously wrong...

Another possibility might be to still allocate a small SysV shmem area, and use that to provide the interlock, while we allocate the buffer area using sys_alloc_hugepages. That's somewhat of a hack, but I think it would resolve the interlock problem, at least.

> Any ideas for better answers?

Still scratching my head on this one, and I'll let you know if I think of anything better.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
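[A rough sketch of the small-SysV-segment-plus-hugepages idea, assuming the pg_alloc_hugepages() wrapper sketched earlier. The segment size and the reuse of one key for both segments are arbitrary illustrations; note that SHMMIN may impose a floor on the SysV segment size.]

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

extern void *pg_alloc_hugepages(int key, unsigned long addr,
                                unsigned long len, int prot, int flag);

#define INTERLOCK_SIZE 1024

static void *
create_hybrid_shmem(key_t key, unsigned long bufferBytes)
{
    /* Small SysV segment: exists only so shm_nattch can be checked
     * by the next postmaster. */
    int shmId = shmget(key, INTERLOCK_SIZE, IPC_CREAT | IPC_EXCL | 0600);

    if (shmId < 0)
        return NULL;        /* key collision: caller tries another key */
    if (shmat(shmId, NULL, 0) == (void *) -1)
        return NULL;

    /* The real buffer area comes from shared (positive-key) hugepages. */
    return pg_alloc_hugepages((int) key, 0, bufferBytes,
                              PROT_READ | PROT_WRITE, IPC_CREAT);
}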
Neil Conway <neilc@samurai.com> writes:
> I think it's worthwhile implementing this, if possible.

I wasn't objecting (I work for Red Hat, remember ;-)). I was just saying there's a limit to the messiness I think we should accept.

>> The SysV API provides a reliable interlock to prevent this scenario:
>> we read the old shared memory block ID from the old postmaster's
>> postmaster.pid file, and look to see if that block (a) still exists
>> and (b) still has attached processes (presumably backends).

> If the postmaster is starting up and the segment still exists, could
> we assume that's an error condition, and force the admin to manually
> fix it?

It wasn't clear from your description whether large-TLB shmem segments even have IDs that one could use to determine whether "the segment still exists". If the segments are anonymous, then how do you do that?

> It does make the system less robust, but I'm suspicious of any
> attempts to automagically fix a situation in which we *know* something
> has gone seriously wrong...

We've spent a lot of effort on trying to ensure that we (a) start up when it's safe and (b) refuse to start up when it's not safe. While (b) is clearly the more critical point, backsliding on (a) isn't real nice either. People don't like postmasters that randomly fail to start.

> Another possibility might be to still allocate a small SysV shmem
> area, and use that to provide the interlock, while we allocate the
> buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
> I think it would resolve the interlock problem, at least.

Not a bad idea ... I have not got a better one offhand ... but watch out for SHMMIN settings.

			regards, tom lane
Okay, I did some more research into this area. It looks like it will be feasible to use large TLB pages for PostgreSQL.

Tom Lane <tgl@sss.pgh.pa.us> writes:
> It wasn't clear from your description whether large-TLB shmem segments
> even have IDs that one could use to determine whether "the segment
> still exists".

There are two types of hugepages:

	(a) private: not shared on fork(), not accessible to processes
	    other than the one that allocates the pages.

	(b) shared: shared across a fork(), accessible to other processes:
	    different processes can access the same segment if they call
	    sys_alloc_hugepages() with the same key.

So for a standalone backend, we can just use private pages (probably worth using private hugepages rather than malloc, although I doubt it matters much either way).

> > Another possibility might be to still allocate a small SysV shmem
> > area, and use that to provide the interlock, while we allocate the
> > buffer area using sys_alloc_hugepages. That's somewhat of a hack,
> > but I think it would resolve the interlock problem, at least.
>
> Not a bad idea ... I have not got a better one offhand ... but watch
> out for SHMMIN settings.

As it turns out, this will be completely unnecessary. Since hugepages are an in-kernel data structure, the kernel takes care of ensuring that dying processes don't orphan any unused hugepage segments. The logic works like this (for shared hugepages):

	(a) sys_alloc_hugepages() without IPC_EXCL will return a pointer
	    to an existing segment, if there is one that matches the key.
	    If an existing segment is found, the usage counter for that
	    segment is incremented. If no matching segment exists, an
	    error is returned. (I'm pretty sure the usage counter is also
	    incremented after a fork(), but I'll double-check that.)

	(b) sys_free_hugepages() decrements the usage counter

	(c) when a process that has allocated a shared hugepage dies for
	    *any reason* (even kill -9), the usage counter is decremented

	(d) if the usage counter for a given segment ever reaches zero,
	    the segment is deleted and the memory is freed.

If we used a key that would remain the same between runs of the postmaster, this should ensure that there isn't a possibility of two independent sets of backends operating on the same data dir. The most logical way to do this IMHO would be to just hash the data dir, but I suppose the current method of using the port number should work as well.

To elaborate on (a) a bit, we'd want to use this logic when allocating a new set of hugepages on postmaster startup:

	(1) call sys_alloc_hugepages() without IPC_EXCL. If it returns an
	    error, we're in the clear: there's no page matching that key.
	    If it returns a pointer to a previously existing segment,
	    panic: it is very likely that there are some orphaned
	    backends still active.

	(2) If the previous call didn't find anything, call
	    sys_alloc_hugepages() again, specifying IPC_EXCL to create a
	    new segment.
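[A sketch of that two-step probe, assuming shmget()-like flag semantics: flag 0 only looks up an existing segment, while IPC_CREAT | IPC_EXCL creates a fresh one. The magic-number stamp and all names here are illustrative, not actual PostgreSQL code.]

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>    /* IPC_CREAT, IPC_EXCL */
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE */

extern void *pg_alloc_hugepages(int key, unsigned long addr,
                                unsigned long len, int prot, int flag);

#define PG_SHMEM_MAGIC 0x50474253UL    /* arbitrary illustrative value */

static void *
probe_and_create_hugepages(int key, unsigned long len)
{
    /* Step 1: look for a pre-existing segment under this key
     * (no IPC_CREAT, so an error here means "no such segment"). */
    void *addr = pg_alloc_hugepages(key, 0, len,
                                    PROT_READ | PROT_WRITE, 0);

    if (addr != NULL)
    {
        /* A live segment exists: orphaned backends are very likely
         * still running, so refuse to start. */
        fprintf(stderr, "pre-existing hugepage segment found -- "
                        "orphaned backends may still be active\n");
        exit(1);
    }

    /* Step 2: nothing matched, so create a new segment exclusively. */
    addr = pg_alloc_hugepages(key, 0, len, PROT_READ | PROT_WRITE,
                              IPC_CREAT | IPC_EXCL);
    if (addr != NULL)
        *(unsigned long *) addr = PG_SHMEM_MAGIC;   /* stamp it as ours */

    return addr;
}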
Now, the question is: how should this be implemented? You recently did some of the legwork toward supporting different APIs for shared memory / semaphores, which makes this work easier -- unfortunately, some additional stuff is still needed. Specifically, support for hugepages is a configuration option that may or may not be enabled (if it's disabled, the syscall returns a specific error).

So I believe the logic is something like:

	- if compiling on a Linux system, enable support for hugepages
	  (the regular SysV stuff is still needed as a backup)

	- if we're compiling on a Linux system but the kernel headers
	  don't define the syscalls we need, use some reasonable defaults
	  (e.g. the syscall numbers for the current hugepage syscalls in
	  Linux 2.5)

	- at runtime, try to make one of these syscalls. If it fails,
	  fall back to the SysV stuff.

Does that sound reasonable? Any other comments would be appreciated.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
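[The runtime fallback described above might look something like this sketch. ENOSYS is what an unimplemented syscall returns; sysv_shmem_create() is a hypothetical stand-in for the existing SysV code path.]

#include <errno.h>
#include <sys/ipc.h>
#include <sys/mman.h>

extern void *pg_alloc_hugepages(int key, unsigned long addr,
                                unsigned long len, int prot, int flag);
extern void *sysv_shmem_create(int key, unsigned long len);   /* hypothetical */

static void *
shared_memory_create(int key, unsigned long len)
{
    void *addr = pg_alloc_hugepages(key, 0, len,
                                    PROT_READ | PROT_WRITE,
                                    IPC_CREAT | IPC_EXCL);
    if (addr != NULL)
        return addr;

    if (errno == ENOSYS)
        return sysv_shmem_create(key, len);   /* no hugepage support */

    return NULL;    /* hugepages supported, but the allocation failed */
}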
Neil Conway <neilc@samurai.com> writes:
> If we used a key that would remain the same between runs of the
> postmaster, this should ensure that there isn't a possibility of two
> independent sets of backends operating on the same data dir. The most
> logical way to do this IMHO would be to just hash the data dir, but I
> suppose the current method of using the port number should work as
> well.

You should stick as closely as possible to the key logic currently used for SysV shmem keys. That logic is intended to cope with the case where someone else is already using the key# that we initially generate, as well as the case where we discover a collision with a pre-existing backend set. (We tell the difference by looking for a magic number at the start of the shmem segment.)

Note that we do not assume the key is the same on each run; that's why we store it in postmaster.pid.

> (1) call sys_alloc_hugepages() without IPC_EXCL. If it returns an
>     error, we're in the clear: there's no page matching that key.
>     If it returns a pointer to a previously existing segment,
>     panic: it is very likely that there are some orphaned
>     backends still active.

s/panic/and the PG magic number appears in the segment header, panic/

> - if we're compiling on a Linux system but the kernel headers
>   don't define the syscalls we need, use some reasonable defaults
>   (e.g. the syscall numbers for the current hugepage syscalls in
>   Linux 2.5)

I think this is overkill, and quite possibly dangerous. If we don't see the symbols, then don't try to compile the code.

On the whole, it seems that this allows a very nearly one-to-one mapping to the existing SysV functionality. We don't have the "number of connected processes" syscall, perhaps, but we don't need it: if a hugepages segment exists, we can assume the number of connected processes is greater than 0, and that's all we really need to know.

I think it's okay to stuff this support into the existing port/sysv_shmem.c file, rather than make a separate file (particularly given your point that we have to be able to fall back to SysV calls at runtime). I'd suggest reorganizing the code in that file slightly to separate the actual syscalls from the controlling logic in PGSharedMemoryCreate(). Probably also will have to extend the API for PGSharedMemoryIsInUse() and RecordSharedMemoryInLockFile() to allow three fields to be recorded in postmaster.pid, not two --- you'll want a boolean indicating whether the stored key is for a SysV or hugepage segment.

			regards, tom lane
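[The three-field lockfile record suggested here might look like this sketch; the type and function names are illustrative, not the actual PostgreSQL routines.]

#include <stdio.h>

typedef enum
{
    SHMEM_KIND_SYSV,
    SHMEM_KIND_HUGETLB
} ShmemKind;

static void
record_shmem_in_lockfile(FILE *lockFile, unsigned long shmKey,
                         unsigned long shmId, ShmemKind kind)
{
    /* The third field tells the next postmaster which existence check
     * (SysV shm_nattch vs. hugepage probe) it should run. */
    fprintf(lockFile, "%lu %lu %d\n", shmKey, shmId, (int) kind);
}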
I haven't been following this thread. Can someone answer:

	Is TLB Linux-only?
	Why use it and not SysV memory?
	Is it a lot of code?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Is TLB Linux-only?

Well, the "TLB" is a feature of the CPU, so no. Many modern processors support large TLB pages in some fashion.

However, the specific API for using large TLB pages differs between operating systems. The API I'm planning to implement is the one provided by recent versions of Linux (2.5.38+).

I've only looked briefly at enabling the usage of large pages on other operating systems. On Solaris, we already use large pages (due to using Intimate Shared Memory). On HP-UX, you apparently need to call chatr on the executable for it to use large pages. AFAIK the BSDs don't support large pages for user-land apps -- if I'm incorrect, let me know.

> Why use it and not SysV memory?

It's faster, at least in theory. I posted these links at the start of the thread:

http://lwn.net/Articles/6535/
http://lwn.net/Articles/10293/

> Is it a lot of code?

I haven't implemented it yet, so I'm not sure. However, I don't think it will be a lot of code.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway wrote:
> > Is it a lot of code?
>
> I haven't implemented it yet, so I'm not sure. However, I don't think
> it will be a lot of code.

OK, personally, I would like to see an actual speedup of PostgreSQL queries before I would apply such an OS-specific, version-specific patch.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> OK, personally, I would like to see an actual speedup of PostgreSQL
> queries before I would apply such an OS-specific, version-specific
> patch.

Don't be silly. A performance improvement is a performance improvement. According to your logic, using assembly-optimized locking primitives shouldn't be done unless we've exhausted every possible optimization in every other part of the system (a process which will likely never be finished).

If the optimization were for some obscure UNIX variant and/or an obscure processor, I would agree that it wouldn't be worth the bother. But given that

	(a) Linux on IA32 is likely our most popular platform [1]

	(b) In theory, this will help performance where we need it most,
	    IMHO (high-end systems using large shared buffers)

I think it's at least worth implementing -- if it doesn't provide a noticeable performance improvement, then we don't need to merge it.

Cheers,

Neil

[1] It's worth noting that the hugetlb patch currently works on IA64 and SPARC as well, and may well be ported to additional architectures in the future.

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway <neilc@samurai.com> writes:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> OK, personally, I would like to see an actual speedup of PostgreSQL
>> queries before I would apply such an OS-specific, version-specific
>> patch.

> Don't be silly. A performance improvement is a performance
> improvement.

No, Bruce was saying that he wanted to see demonstrable improvement *due to this specific change* before committing to support a platform-specific API. I agree with him, actually. If you do the TLB code and can't measure any meaningful performance improvement when using it vs. when not, I'd not be excited about cluttering the distribution with it.

> I think it's at least worth implementing -- if it doesn't provide a
> noticeable performance improvement, then we don't need to merge it.

You're on the same page, you just don't realize it...

			regards, tom lane
Tom Lane wrote:
> No, Bruce was saying that he wanted to see demonstrable improvement
> *due to this specific change* before committing to support a
> platform-specific API. I agree with him, actually. If you do the
> TLB code and can't measure any meaningful performance improvement
> when using it vs. when not, I'd not be excited about cluttering the
> distribution with it.

I see what he thought I said, I just can't figure out how he read it that way.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Neil,

I agree with Bruce and Tom. AFAIK and in my experience, I don't think it will be a significantly measurable increase. Not only that, but the portability issue itself tends to make it less desirable. I recently ported SAP DB and the coinciding DevTools over to OpenBSD and learned again first-hand what a pain in the ass having platform-specific code is.

I guess it's up to you, Neil. If you want to spend the time trying to implement it, and it does prove to have a significant performance increase, I'd say maybe. IMHO, I just think that time could be better spent improving the current system rather than trying to add to it in a singular way.

Sorry if my comments are out-of-line on this one, but it has been a thread for some time and I'm just kinda tired of reading theory vs. proof. Since you are so set on trying to implement this, I'm just wondering what documentation has tested evidence of measurable increases in similar situations? I just like arguments to be backed by proof... and I'm sure there is documentation on this somewhere.

-Jonah
"Jonah H. Harris" <jharris@nightstarcorporation.com> writes: > I agree with Bruce and Tom. AFAIK Bruce and Tom (and myself) agree that this is a good idea, provided it makes a noticeable performance difference (and if it doesn't, it's not worth applying). > AFAIK and in my experience I don't think it will be a significantly > measurable increase. Can you elaborate on this experience? > Not only that, but the portability issue itself tends to make it > less desireable. Well, that's obvious: code that improves PostgreSQL on *all* platforms is clearly superior to code that only improves it on a couple. That's not to say that the latter code is absolutely without merit, however. > Sorry if my comments are out-of-line on this one but it has been a > thread for some time I'm just kinda tired of reading theory vs > proof. Well, ISTM the easiest way to get some "proof" is to implement it and benchmark the results. IMHO any claims about performance prior to that are mostly hand waving. > Since you are so set on trying to implement this, I'm just wondering > what documentation has tested evidence of measurable increases in > similar situations? (/me wonders if people bother reading the threads they reply to) http://lwn.net/Articles/10293/ According to the HP guys, Oracle saw an 8% performance improvement in TPC-C when they started using large pages. To be perfectly honest, I really have no idea if that will translate into an 8% performance gain for PostgreSQL, or whether the performance gain only applies if you're using a machine with 16GB of RAM, or whether the speedup from large pages is really just a correction of some Oracle deficiency that we don't suffer from, etc. However, I do think it's worth finding out. Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC