Re: 2nd Level Buffer Cache - Mailing list pgsql-hackers

From Greg Stark
Subject Re: 2nd Level Buffer Cache
Date
Msg-id AANLkTikM3qwZ7A1OCPEQjYJ20Ex8EX0_uzUnJKGK6nwC@mail.gmail.com
In response to Re: 2nd Level Buffer Cache  (Josh Berkus <josh@agliodbs.com>)
Responses Re: 2nd Level Buffer Cache
Re: 2nd Level Buffer Cache
Re: 2nd Level Buffer Cache
List pgsql-hackers
On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> To take the opposite approach... has anyone looked at having the OS just
>> manage all caching for us? Something like MMAPed shared buffers? Even if
>> we find the issue with large shared buffers, we still can't dedicate
>> serious amounts of memory to them because of work_mem issues. Granted,
>> that's something else on the TODO list, but it really seems like we're
>> re-inventing the wheels that the OS has already created here...

A lot of people have talked about it. You can find references to mmap
going at least as far back as 2001 or so. The problem is that it would
depend on the OS implementing things in a certain way and guaranteeing
things we don't think can be portably assumed. We would need to mlock
large amounts of address space, which most OSes don't allow. At the
very least we would need to mlock and munlock lots of small bits of
memory all over the place, which would create lots and lots of
mappings that the kernel and hardware implementations would generally
not appreciate.

> As far as I know, no OS has a more sophisticated approach to eviction
> than LRU.  And clock-sweep is a significant improvement on performance
> over LRU for frequently accessed database objects ... plus our
> optimizations around not overwriting the whole cache for things like VACUUM.

The clock-sweep algorithm was standard OS design before you or I knew
how to type. I would expect any half-decent OS to have something at
least as good -- perhaps better because it can rely on hardware
features to handle things.
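
For anyone who hasn't seen it spelled out, here is a minimal sketch of
clock-sweep with per-frame usage counts. It is a toy model, not the
Postgres or any kernel implementation; the class name, the cap of 5 on
the usage count, and the dict-based lookup are all assumptions made for
illustration.

```python
class ClockSweep:
    """Toy clock-sweep buffer pool: each frame carries a usage count."""

    def __init__(self, nframes, max_usage=5):
        self.pages = [None] * nframes   # page id held by each frame
        self.usage = [0] * nframes      # usage count per frame
        self.index = {}                 # page id -> frame number
        self.hand = 0                   # position of the clock hand
        self.max_usage = max_usage      # assumed cap on the usage count

    def access(self, page):
        """Return True on a hit, False on a miss (the page is then loaded)."""
        frame = self.index.get(page)
        if frame is not None:
            # Hit: bump the usage count, saturating at the cap.
            self.usage[frame] = min(self.usage[frame] + 1, self.max_usage)
            return True
        frame = self._find_victim()
        old = self.pages[frame]
        if old is not None:
            del self.index[old]
        self.pages[frame] = page
        self.usage[frame] = 1
        self.index[page] = frame
        return False

    def _find_victim(self):
        """Advance the hand, decrementing counts, until a frame reaches zero."""
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.pages[frame] is None or self.usage[frame] == 0:
                return frame
            self.usage[frame] -= 1
```

A frequently touched page survives several passes of the hand because
its count has to be decremented to zero first, while a page touched
once is reclaimed on the next pass or two.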

However, the second point is the crux of the issue and of all similar
issues on where to draw the line between the OS and Postgres. The OS
knows the hardware characteristics better and can better optimize the
overall system behaviour. Postgres understands its own access patterns
better and can better optimize its own behaviour, whereas the OS is
stuck reverse-engineering what Postgres needs, usually from simple
heuristics.

>
> 2-level caches work well for a variety of applications.

I think 2-level caches with simple heuristics like "pin all the
indexes" are unlikely to be helpful. At least they won't optimize the
average case, and I think that's been proven. They might be helpful for
optimizing the worst case, which would reduce the standard deviation.
Perhaps we're at the point now where that matters.

Where it might be helpful is as a more refined version of the
"sequential scans use limited set of buffers" patch. Instead of having
each sequential scan use a hard coded number of buffers, perhaps all
sequential scans should share a fraction of the global buffer pool
managed separately from the main pool. Though in my thought
experiments I don't see any real win here. In the current scheme, if
there's any sign a buffer is useful, it gets thrown out of the
sequential scan's set of buffers to reuse anyway.
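
To make the thought experiment concrete, here is a rough sketch of what
a shared scan pool could look like. Everything in it is hypothetical:
the class name, sizing the pool as a percentage of the global pool, and
treating "touched more than once" as the sign that a buffer is useful.
It is not how the existing patch works.

```python
from collections import OrderedDict


class SharedScanPool:
    """Hypothetical side pool shared by every sequential scan."""

    def __init__(self, total_buffers, percent=5):
        # Assumed sizing rule: a fixed percentage of the global pool.
        self.capacity = max(1, total_buffers * percent // 100)
        self.pages = OrderedDict()   # (relation, block) -> touch count
        self.promoted = []           # pages handed over to the main pool

    def read(self, relation, block):
        """Return True if the page was already in the scan pool."""
        key = (relation, block)
        if key in self.pages:
            self.pages[key] += 1     # a second reader: a sign of usefulness
            return True
        if len(self.pages) >= self.capacity:
            # Recycle the oldest page; keep it only if someone else wanted it.
            old_key, touches = self.pages.popitem(last=False)
            if touches > 1:
                self.promoted.append(old_key)
        self.pages[key] = 1
        return False
```

Pages that were touched again before being recycled end up in
`promoted`, which is the same outcome the current scheme already gives,
and is why I don't see much of a win.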

> Now, what would be *really* useful is some way to avoid all the data
> copying we do between shared_buffers and the FS cache.
>

Well, the two options are mmap/mlock or directio. The former might be a
fun experiment but I expect any OS to fall over pretty quickly when
faced with thousands (or millions) of 8kB mappings. The latter would
need Postgres to do async i/o and hopefully to have a global view of
its i/o access patterns so it could do prefetching in a lot more cases.
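
The "thousands (or millions)" figure for the mmap case is just
arithmetic. A small sketch, assuming the worst case where every 8kB
page ends up as its own mapping (for example because neighbouring pages
alternate between locked and unlocked); the function name and the
sample sizes are only for illustration.

```python
BLOCK_SIZE = 8 * 1024  # 8kB, the page size mentioned above


def worst_case_mappings(shared_buffers_bytes, block_size=BLOCK_SIZE):
    """Upper bound on mappings if every page is mapped or locked separately."""
    return shared_buffers_bytes // block_size


# Print the bound for a few sample buffer pool sizes.
for gib in (1, 8, 64):
    print(f"{gib:>3} GiB -> {worst_case_mappings(gib * 1024**3):>9,} mappings")
```

An 8 GiB buffer pool comes to about a million potential mappings. As
far as I know, Linux caps mappings per process with the
vm.max_map_count sysctl, and the usual default is several orders of
magnitude below that.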

--
greg

