Re: making use of large TLB pages - Mailing list pgsql-hackers
From | Neil Conway |
---|---|
Subject | Re: making use of large TLB pages |
Date | |
Msg-id | 87ptuywuii.fsf@mailbox.samurai.com |
In response to | Re: making use of large TLB pages (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: making use of large TLB pages
|
List | pgsql-hackers |
Okay, I did some more research into this area. It looks like it will be feasible to use large TLB pages for PostgreSQL.

Tom Lane <tgl@sss.pgh.pa.us> writes:
> It wasn't clear from your description whether large-TLB shmem segments
> even have IDs that one could use to determine whether "the segment still
> exists".

There are two types of hugepages:
(a) private: not shared on fork(), and not accessible to processes other than the one that allocates the pages.
(b) shared: shared across a fork(), and accessible to other processes: different processes can access the same segment if they call sys_alloc_hugepages() with the same key.

So for a standalone backend, we can just use private pages (probably worth using private hugepages rather than malloc, although I doubt it matters much either way).

> > Another possibility might be to still allocate a small SysV shmem
> > area, and use that to provide the interlock, while we allocate the
> > buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
> > I think it would resolve the interlock problem, at least.
>
> Not a bad idea ... I have not got a better one offhand ... but watch
> out for SHMMIN settings.

As it turns out, this will be completely unnecessary. Since hugepages are an in-kernel data structure, the kernel takes care of ensuring that dying processes don't orphan any unused hugepage segments. The logic works like this (for shared hugepages):
(a) sys_alloc_hugepages() without IPC_EXCL will return a pointer to an existing segment, if there is one that matches the key. If an existing segment is found, the usage counter for that segment is incremented. If no matching segment exists, an error is returned. (I'm pretty sure the usage counter is also incremented after a fork(), but I'll double-check that.)
(b) sys_free_hugepages() decrements the usage counter.
(c) when a process that has allocated a shared hugepage dies for *any reason* (even kill -9), the usage counter is decremented.
(d) if the usage counter for a given segment ever reaches zero, the segment is deleted and the memory is freed.

If we used a key that would remain the same between runs of the postmaster, this should ensure that there isn't a possibility of two independent sets of backends operating on the same data dir. The most logical way to do this IMHO would be to just hash the data dir, but I suppose the current method of using the port number should work as well.

To elaborate on (a) a bit, we'd want to use this logic when allocating a new set of hugepages on postmaster startup:
(1) call sys_alloc_hugepages() without IPC_EXCL. If it returns an error, we're in the clear: there's no page matching that key. If it returns a pointer to a previously existing segment, panic: it is very likely that there are some orphaned backends still active.
(2) if the previous call didn't find anything, call sys_alloc_hugepages() again, specifying IPC_EXCL to create a new segment.

Now, the question is: how should this be implemented? You recently did some of the legwork toward supporting different APIs for shared memory / semaphores, which makes this work easier -- unfortunately, some additional stuff is still needed. Specifically, support for hugepages is a kernel configuration option that may or may not be enabled (if it's disabled, the syscall returns a specific error). So I believe the logic is something like:
- if compiling on a Linux system, enable support for hugepages (the regular SysV stuff is still needed as a backup)
- if we're compiling on a Linux system but the kernel headers don't define the syscalls we need, use some reasonable defaults (e.g. the syscall numbers for the current hugepage syscalls in Linux 2.5)
- at runtime, try to make one of these syscalls. If it fails, fall back to the SysV stuff.
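
To make the startup check and the runtime fallback concrete, here is a rough sketch in C. It is a toy, in-process model of the usage-counter semantics described above, not real kernel code: mock_alloc_hugepages(), mock_free_hugepages(), MOCK_IPC_EXCL, postmaster_startup() and the StartResult values are all hypothetical names invented for illustration, and the errno values (ENOENT for "no such segment", EEXIST for "already exists", ENOSYS for "hugepages not enabled") are assumptions, since the actual error codes are not stated here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model only: nothing here talks to the kernel; all names are hypothetical. */
#define MOCK_IPC_EXCL 1   /* stand-in for IPC_EXCL */
#define MAX_SEGMENTS  8

typedef struct {
    int in_use;   /* slot currently holds a live segment */
    int key;      /* key the segment was created with */
    int usage;    /* usage counter, as in points (a)-(d) */
} MockSegment;

static MockSegment segments[MAX_SEGMENTS];
static int hugepages_enabled = 1;   /* 0 models a kernel without hugepage support */

static MockSegment *find_segment(int key)
{
    for (int i = 0; i < MAX_SEGMENTS; i++)
        if (segments[i].in_use && segments[i].key == key)
            return &segments[i];
    return NULL;
}

/* Without MOCK_IPC_EXCL: attach to an existing segment (usage++) or fail.
 * With MOCK_IPC_EXCL: create a new segment, failing if the key is taken. */
static MockSegment *mock_alloc_hugepages(int key, int flags)
{
    if (!hugepages_enabled) { errno = ENOSYS; return NULL; }   /* assumed error */

    MockSegment *seg = find_segment(key);
    if (!(flags & MOCK_IPC_EXCL)) {
        if (seg == NULL) { errno = ENOENT; return NULL; }      /* assumed error */
        seg->usage++;
        return seg;
    }
    if (seg != NULL) { errno = EEXIST; return NULL; }          /* assumed error */
    for (int i = 0; i < MAX_SEGMENTS; i++) {
        if (!segments[i].in_use) {
            segments[i].in_use = 1;
            segments[i].key = key;
            segments[i].usage = 1;
            return &segments[i];
        }
    }
    errno = ENOMEM;
    return NULL;
}

/* Models both an explicit free and the kernel's cleanup when a process dies:
 * the counter drops, and the segment disappears when it reaches zero. */
static void mock_free_hugepages(MockSegment *seg)
{
    assert(seg != NULL && seg->in_use && seg->usage > 0);
    if (--seg->usage == 0)
        seg->in_use = 0;
}

typedef enum { START_OK, START_CONFLICT, START_FALLBACK_SYSV } StartResult;

/* Steps (1) and (2) from the text, plus the runtime fallback decision. */
static StartResult postmaster_startup(int key, MockSegment **out)
{
    *out = NULL;

    /* Step 1: probe without MOCK_IPC_EXCL. Success means someone still holds it. */
    MockSegment *existing = mock_alloc_hugepages(key, 0);
    if (existing != NULL) {
        mock_free_hugepages(existing);   /* drop the reference the probe took */
        return START_CONFLICT;           /* likely orphaned backends: panic */
    }
    if (errno == ENOSYS)
        return START_FALLBACK_SYSV;      /* hugepages unavailable at runtime */

    /* Step 2: nothing found, so create the segment exclusively. */
    *out = mock_alloc_hugepages(key, MOCK_IPC_EXCL);
    if (*out != NULL)
        return START_OK;
    return (errno == EEXIST) ? START_CONFLICT : START_FALLBACK_SYSV;
}
```

In this model, a first call such as postmaster_startup(5432, &seg) (using the port number as the key) creates the segment, and a second call with the same key reports a conflict until the last holder has released it. The exclusive create in step 2 is also what would catch two postmasters racing between the probe and the create, although the single-threaded mock cannot exercise that case.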

Does that sound reasonable? Any other comments would be appreciated.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC