Re: making use of large TLB pages - Mailing list pgsql-hackers

From: Neil Conway
Subject: Re: making use of large TLB pages
Msg-id: 87ptuywuii.fsf@mailbox.samurai.com
In response to: Re: making use of large TLB pages (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: making use of large TLB pages (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

Okay, I did some more research into this area. It looks like it will
be feasible to use large TLB pages for PostgreSQL.

Tom Lane <tgl@sss.pgh.pa.us> writes:
> It wasn't clear from your description whether large-TLB shmem segments
> even have IDs that one could use to determine whether "the segment still
> exists".

There are two types of hugepages:

       (a) private: Not shared on fork(), not accessible to processes
           other than the one that allocates the pages.

       (b) shared: Shared across a fork(), accessible to other
           processes: different processes can access the same segment
           if they call sys_alloc_hugepages() with the same key.

So for a standalone backend, we can just use private pages (probably
worth using private hugepages rather than malloc, although I doubt it
matters much either way).
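
To make that concrete, here is a minimal sketch of the private case,
assuming the prototype from the current 2.5 hugepage patches --
unsigned long sys_alloc_hugepages(int key, unsigned long addr,
unsigned long len, int prot, int flag), with key 0 meaning private --
and assuming the kernel headers define __NR_alloc_hugepages:

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/mman.h>

    /* Sketch: private hugepages for a standalone backend.  A key of
     * zero asks for pages that are not shared on fork() and are not
     * visible to other processes. */
    static void *
    alloc_private_hugepages(unsigned long len)
    {
        long addr = syscall(__NR_alloc_hugepages, 0 /* private */,
                            0UL /* let the kernel pick the address */,
                            len, PROT_READ | PROT_WRITE, 0);

        if (addr == -1)
            return NULL;        /* errno: ENOSYS, ENOMEM, ... */
        return (void *) addr;
    }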

> > Another possibility might be to still allocate a small SysV shmem
> > area, and use that to provide the interlock, while we allocate the
> > buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
> > I think it would resolve the interlock problem, at least.
> 
> Not a bad idea ... I have not got a better one offhand ... but watch
> out for SHMMIN settings.

As it turns out, this will be completely unnecessary. Since hugepages
are an in-kernel data structure, the kernel takes care of ensuring
that dying processes don't orphan any unused hugepage segments. The
logic works like this, for shared hugepages (a short sketch of the
alloc/free pairing follows the list):

       (a) sys_alloc_hugepages() without IPC_EXCL will return a
           pointer to an existing segment, if there is one that
           matches the key. If an existing segment is found, the
           usage counter for that segment is incremented. If no
           matching segment exists, an error is returned. (I'm pretty
           sure the usage counter is also incremented after a fork(),
           but I'll double-check that.)

       (b) sys_free_hugepages() decrements the usage counter.

       (c) when a process that has allocated a shared hugepage dies
           for *any reason* (even kill -9), the usage counter is
           decremented.

       (d) if the usage counter for a given segment ever reaches
           zero, the segment is deleted and the memory is freed.
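
Here's a hedged fragment showing that pairing, building on the
includes in the earlier sketch (sys_free_hugepages() is assumed to
take the segment's start address and return int, per the 2.5 patches;
the counter values in the comments follow my reading above):

    #include <sys/ipc.h>        /* IPC_EXCL */

    /* Sketch: the usage counter over a shared segment's lifetime. */
    static void
    hugepage_lifetime_demo(int key, unsigned long len)
    {
        long addr = syscall(__NR_alloc_hugepages, key, 0UL, len,
                            PROT_READ | PROT_WRITE, IPC_EXCL);
        /* new segment created; usage counter is 1 */

        if (fork() == 0)
        {
            /* child shares the segment; counter presumably now 2 */
            _exit(0);           /* any exit, even kill -9, decrements it */
        }

        syscall(__NR_free_hugepages, (unsigned long) addr);
        /* once the child is gone too, the counter reaches zero and
         * the kernel deletes the segment and frees the memory */
    }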

If we used a key that would remain the same between runs of the
postmaster, this should ensure that there isn't a possibility of two
independent sets of backends operating on the same data dir. The most
logical way to do this IMHO would be to just hash the data dir, but I
suppose the current method of using the port number should work as
well.
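
For illustration, deriving a key by hashing the data dir could be as
simple as the following (the hash here, djb2, is just an example I
picked, key_for_datadir is a made-up name, and a real version would
need to think about collisions between clusters):

    /* Sketch: turn the data directory path into a hugepage key.
     * Must yield a positive int, since key 0 means "private". */
    static int
    key_for_datadir(const char *datadir)
    {
        unsigned long h = 5381;
        int c;

        while ((c = *datadir++) != 0)
            h = h * 33 + c;     /* djb2 string hash */

        return (int) (h & 0x7fffffff) | 1;  /* force nonzero */
    }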

To elaborate on (a) a bit, we'd want to use this logic when allocating
a new set of hugepages on postmaster startup (a sketch in C follows
the list):

       (1) call sys_alloc_hugepages() without IPC_EXCL. If it returns
           an error, we're in the clear: there's no page matching
           that key. If it returns a pointer to a previously existing
           segment, panic: it is very likely that there are some
           orphaned backends still active.

       (2) If the previous call didn't find anything, call
           sys_alloc_hugepages() again, specifying IPC_EXCL to create
           a new segment.
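
Those two steps, sketched with the same assumed prototype as before
(I've used elog(FATAL) for the panic, and a NULL return to mean "no
hugepages, caller falls back to SysV"):

    /* Sketch: postmaster startup interlock via shared hugepages. */
    static void *
    hugepage_segment_create(int key, unsigned long len)
    {
        long addr;

        /* Step (1): probe without IPC_EXCL.  An error here is the
         * good case: no segment matches the key. */
        addr = syscall(__NR_alloc_hugepages, key, 0UL, len,
                       PROT_READ | PROT_WRITE, 0);
        if (addr != -1)
            elog(FATAL, "pre-existing hugepage segment found: "
                 "orphaned backends are probably still active");

        /* Step (2): nothing matched, so create a fresh segment. */
        addr = syscall(__NR_alloc_hugepages, key, 0UL, len,
                       PROT_READ | PROT_WRITE, IPC_EXCL);
        if (addr == -1)
            return NULL;        /* hugepages unavailable */
        return (void *) addr;
    }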

Now, the question is: how should this be implemented? You recently
did some of the legwork toward supporting different APIs for shared
memory / semaphores, which makes this work easier -- unfortunately,
some additional stuff is still needed. Specifically, support for
hugepages is a kernel configuration option that may or may not be
enabled (if it's disabled, the syscall returns a specific error). So I
believe the logic is something like this (sketched in code after the
list):
       - if compiling on a Linux system, enable support for hugepages
         (the regular SysV stuff is still needed as a backup)

       - if we're compiling on a Linux system but the kernel headers
         don't define the syscalls we need, use some reasonable
         defaults (e.g. the syscall numbers for the current hugepage
         syscalls in Linux 2.5)

       - at runtime, try to make one of these syscalls. If it fails,
         fall back to the SysV stuff.
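
Roughly, in code (the #ifndef dance is the "reasonable defaults" part;
250/251 is my reading of the current 2.5 i386 syscall table and should
be double-checked, and sysv_shmem_create() is a stand-in for whatever
the existing SysV path ends up being called):

    #include <errno.h>

    #ifndef __NR_alloc_hugepages
    #define __NR_alloc_hugepages 250    /* Linux 2.5, i386: verify! */
    #define __NR_free_hugepages  251
    #endif

    void *
    alloc_shared_segment(int key, unsigned long len)
    {
        long addr = syscall(__NR_alloc_hugepages, key, 0UL, len,
                            PROT_READ | PROT_WRITE, IPC_EXCL);

        if (addr != -1)
            return (void *) addr;

        /* ENOSYS: kernel predates the syscall entirely; other errors
         * can mean hugepage support was configured out.  Either way,
         * fall back to the regular SysV shared memory code. */
        return sysv_shmem_create(key, len);
    }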

Does that sound reasonable?

Any other comments would be appreciated.

Cheers,

Neil

-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC


