Thread: swapcache-style cache?

swapcache-style cache?

From
james
Date:
Has anyone considered managing a system like the DragonFLY swapcache for 
a DBMS like PostgreSQL?

ie where the admin can assign drives with good random read behaviour 
(but perhaps also-ran random write) such as SSDs to provide a cache for 
blocks that were dirtied, with async write that hopefully writes them 
out before they are forcibly discarded.

And where a cache fail (whether by timeout, hard fail, or CRC fail) just 
means having to go back to the real transactional storage.

I'd been thinking that swapcache would help where the working set won't 
fit in RAM, also L2ARC on Solaris - but it seems to me that there is no 
reason not to allow the DBMS to manage the set-aside area itself where 
it is given either access to the raw device or to a pre-sized file on 
the device it can map in segments.

While L2ARC is obviously very heavyweight and entwined in ZFS, 
Dragonfly's swapcache seems to me remarkably elegant and, it would seem, 
very effective.

James


Re: swapcache-style cache?

From
Greg Smith
Date:
On 02/22/2012 05:31 PM, james wrote:
> Has anyone considered managing a system like the DragonFLY swapcache for
> a DBMS like PostgreSQL?
>
> ie where the admin can assign drives with good random read behaviour
> (but perhaps also-ran random write) such as SSDs to provide a cache for
> blocks that were dirtied, with async write that hopefully writes them
> out before they are forcibly discarded.

We know that battery-backed write caches are extremely effective for 
PostgreSQL writes.  I see most of these tiered storage ideas as acting 
like a big one of those, which seems to hold in things like SAN storage 
that have adopted this sort of technique already.  A SSD is quite large 
relative to a typical BBWC.

There are a few reasons that doesn't always give the win hoped for though:

-Database writes have write durability requirements that require safe 
storage more often than most other applications.  One of the reasons the 
swapcache helps is that it aims to bundle writes into 64K chunks, which
is very SSD friendly.  The database may force them out more often than
that.  The fact that all the Dragonfly documentation uses Intel drives
that don't write reliably for its examples doesn't make me too
optimistic about that being a priority of the design.  The SSDs that
have safe, battery-backed write buffers >=64KB make that win go away.

-Ultimately all this data needs to make it out to real disk.  The funny 
thing about caches is that no matter how big they are, you can easily 
fill them up if doing something faster than the underlying storage can 
handle.

-If you have something like a BBWC in front of traditional storage, as 
well as a few gigabytes of operating system write buffering, that really 
helps traditional storage a lot already.  Those two things do so much 
write reordering that some of the random seek performance gap between spinning
disk and SSD shrinks.  And sequential throughput is usually not sped up 
very much by SSD, except at the high end (using lots of banks).
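
[Illustration of the "caches fill up" point above. This is a minimal sketch, not a measurement: the function name and every number in it (cache sizes, burst rate, drain rate) are hypothetical and chosen only to show the arithmetic. A cache only buys time equal to its size divided by the surplus of incoming writes over what the backing disks can absorb.]

```python
def seconds_until_full(cache_bytes, incoming_bps, drain_bps):
    """Return how long a write cache can absorb a burst before it is full.

    Returns None when the backing storage keeps up, so the cache never fills.
    """
    surplus = incoming_bps - drain_bps
    if surplus <= 0:
        return None
    return cache_bytes / surplus


MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical workload: a 200 MB/s write burst over disks that drain 50 MB/s.
for name, size in (("512 MB BBWC", 512 * MB), ("64 GB SSD", 64 * GB)):
    t = seconds_until_full(size, 200 * MB, 50 * MB)
    print(f"{name}: full after about {t:.0f} seconds")
```

The larger cache stretches the burst it can absorb in proportion to its size, but once it is full the sustained rate is still that of the underlying storage.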

One reaction to all this is to point out that it's sometimes easier to 
add a SSD to a system than a BBWC.  That is true.  The thing that 
benefits most from this is the WAL writes though, and since they're 
both sequential and very high volume they're really smacking into the 
worst case scenario for SSD vs. spinning disk too.

> I'd been thinking that swapcache would help where the working set won't
> fit in RAM, also L2ARC on Solaris - but it seems to me that there is no
> reason not to allow the DBMS to manage the set-aside area itself where
> it is given either access to the raw device or to a pre-sized file on
> the device it can map in segments.

Well, you could argue that if we knew what to do with it, we'd have 
already built that logic into a superior usage of shared_buffers. 
Instead we punt a lot of this work toward the kernel, often usefully. 
Write cache reordering and read-ahead are the two biggest things storage 
does that we'd have to reinvent inside PostgreSQL if more direct disk 
I/O was attempted.

I don't think the idea of a swapcache is without merit; there are surely 
some applications that will benefit from it.  It's got a lot of 
potential as a way to absorb short-term bursts of write activity.  And 
there are some applications that could benefit from having a second tier 
of read cache, not as fast as RAM but larger and faster than real disk 
seeks.  In all of those potential win cases, though, I don't see why the 
OS couldn't just manage the whole thing for us.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: swapcache-style cache?

From
Jan Lentfer
Date:
Am 23.02.2012 21:57, schrieb Greg Smith:
> On 02/22/2012 05:31 PM, james wrote:
>> Has anyone considered managing a system like the DragonFLY swapcache for
>> a DBMS like PostgreSQL?
>>
>> ie where the admin can assign drives with good random read behaviour
>> (but perhaps also-ran random write) such as SSDs to provide a cache for
>> blocks that were dirtied, with async write that hopefully writes them
>> out before they are forcibly discarded.
>
> We know that battery-backed write caches are extremely effective for
> PostgreSQL writes. I see most of these tiered storage ideas as acting
> like a big one of those, which seems to hold in things like SAN storage
> that have adopted this sort of technique already. A SSD is quite large
> relative to a typical BBWC.
[...]

> -Ultimately all this data needs to make it out to real disk. The funny
> thing about caches is that no matter how big they are, you can easily
> fill them up if doing something faster than the underlying storage can
> handle.

[...]

> I don't think the idea of a swapcache is without merit; there's surely
> some applications that will benefit from it. It's got a lot of potential
> as a way to absorb short-term bursts of write activity. And there are
> some applications that could benefit from having a second tier of read
> cache, not as fast as RAM but larger and faster than real disk seeks. In
> all of those potential win cases, though, I don't see why the OS
> couldn't just manage the whole thing for us.

First off, thanks very much for mentioning DragonFly's swapcache on 
this mailing list, which takes the burden off me/us to self-advertise 
this feature :)

But swapcache is clearly not meant or designed to speed up any write 
activity by caching writes and delaying the write to the "target 
storage" to a later point in time. Swapcache does not affect writes in 
any way, actually.
Swapcache does its writing when a clean VM page hits the inactive VM 
page queue. VM pages related to filesystem writes are dirty, the write 
occurs normally, then they become clean.  But they still have to cycle 
into the VM page inactive queue before swapcache will touch them (write 
them out to swap).

So, basically it is designed to speed up metadata reads, and if 
configured to do so, data reads.

So, it can take some read load burden off the disk subsystem and free the 
disk subsystem for more write activity, but that would be just a side 
effect, not a design goal.
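
[Illustration of the page life cycle described above. This is a toy model that encodes only what the message says; it is not DragonFly code, and the class and function names are made up for the example. A page is copied to swap only once it is both clean and on the inactive queue, so the normal write path is untouched.]

```python
class Page:
    """One VM page, reduced to the three facts the description mentions."""

    def __init__(self, dirty):
        self.dirty = dirty
        self.inactive = False
        self.on_swap = False


def flush_to_filesystem(page):
    """The normal filesystem write: swapcache plays no part in it."""
    page.dirty = False


def age_to_inactive(page):
    """The page cycles onto the inactive VM page queue."""
    page.inactive = True


def swapcache_scan(pages):
    """Copy clean, inactive pages to SSD swap; return how many were written."""
    written = 0
    for page in pages:
        if page.inactive and not page.dirty and not page.on_swap:
            page.on_swap = True
            written += 1
    return written


page = Page(dirty=True)           # page produced by a filesystem write
flush_to_filesystem(page)         # written to the real disk as usual
print(swapcache_scan([page]))     # clean but still active: not touched
age_to_inactive(page)
print(swapcache_scan([page]))     # clean and inactive: copied to swap
```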

And, yes.. it does affect pgsql performance on read loads seriously.

See BSD Mag 5/2011
http://bsdmag.org/magazine/1691-embedded-bsd-freebsd-alix

and
http://www.shiningsilence.com/dbsdlog/2011/04/12/7586.html

Jan




Re: swapcache-style cache?

From
Greg Smith
Date:
On 02/27/2012 03:24 PM, Jan Lentfer wrote:
> And, yes.. it does effect pgsql performance on read loads seriously.
>
> See BSD Mag 5/2011
> http://bsdmag.org/magazine/1691-embedded-bsd-freebsd-alix
>
> and
> http://www.shiningsilence.com/dbsdlog/2011/04/12/7586.html

Caching on the read-only pgbench is a well defined workload at this 
point.  If your database fits in RAM, once it's all in there additional 
caching doesn't help.  If the database is much larger than the cache, 
the cache barely helps there either; you'll still be facing mostly cache 
misses.  The case in the middle is the one where an additional layer of 
cache really helps.  Read-heavy systems where the working set of the 
database is larger than RAM, but not significantly larger than the extra 
cache, will benefit the most here.

Your test results are in that zone, with 2GB RAM < 5.6GB database < 16GB 
cache.  Having a database slightly larger than physical RAM is where the 
big win with SSD normally shows up.  Moving the whole database from a 
regular drive to SSD might get as much as a 5X speedup; you're seeing a 
3X to 4X one with the swap cache in the middle.
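
[Illustration of the three sizing cases above. This is a rough sketch under two loud assumptions: reads are uniformly random over the whole database, and the second-tier cache holds only pages that are not already in RAM. The function name is hypothetical, and the 1.5GB and 500GB database sizes are invented for contrast; only the 2GB / 5.6GB / 16GB combination comes from the test discussed here.]

```python
def tier_hit_fractions(db_gb, ram_gb, cache_gb):
    """Split reads among RAM, the second-tier cache, and disk.

    Assumes uniformly random page access and an exclusive second tier.
    Returns (ram_fraction, cache_fraction, disk_fraction).
    """
    ram = min(1.0, ram_gb / db_gb)
    cache = min(1.0 - ram, cache_gb / db_gb)
    disk = max(0.0, 1.0 - ram - cache)
    return ram, cache, disk


cases = (
    ("fits in RAM", 1.5),          # hypothetical size
    ("the middle case", 5.6),      # size from the test above
    ("much larger", 500.0),        # hypothetical size
)
for label, db_gb in cases:
    ram, cache, disk = tier_hit_fractions(db_gb, ram_gb=2.0, cache_gb=16.0)
    print(f"{label:16s} RAM {ram:6.1%}  cache {cache:6.1%}  disk {disk:6.1%}")
```

Under these assumptions the extra tier serves nothing in the first case, nearly two thirds of reads in the middle case, and only a few percent in the last one. Real workloads with a hot subset of data will shift the numbers, but the shape of the argument stays the same.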

Having the OS manage all that, to keep the most relevant data on the 
SSD, is a cool feature.  Some systems won't benefit at all though, and 
your test is showing near the best case possible for this feature.  As 
you should, of course.

Anyway, the question upthread was whether the database should manage 
something like this on its own.  I suggested it could be done perfectly 
fine by the OS, without any database knowledge of what is going on. 
Your results seem to validate that idea.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: swapcache-style cache?

From
Rob Wultsch
Date:
On Wed, Feb 22, 2012 at 2:31 PM, james <james@mansionfamily.plus.com> wrote:
> Has anyone considered managing a system like the DragonFLY swapcache for a
> DBMS like PostgreSQL?
>

https://www.facebook.com/note.php?note_id=388112370932

-- 
Rob Wultsch
wultsch@gmail.com


Re: swapcache-style cache?

From
Andrea Suisani
Date:
On 02/28/2012 04:52 AM, Rob Wultsch wrote:
> On Wed, Feb 22, 2012 at 2:31 PM, james<james@mansionfamily.plus.com>  wrote:
>> Has anyone considered managing a system like the DragonFLY swapcache for a
>> DBMS like PostgreSQL?
>>
>
> https://www.facebook.com/note.php?note_id=388112370932
>

in the same vein:

http://bcache.evilpiepirate.org/

from the main page:

"Bcache is a patch for the Linux kernel to use SSDs to cache other block devices. It's analogous to L2Arc for ZFS,
but Bcache also does writeback caching, and it's filesystem agnostic. It's designed to be switched on with a minimum
of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the
random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high end storage arrays,
and perhaps even embedded."

it was submitted to the Linux kernel mailing list a bunch of times, the last one:

https://lkml.org/lkml/2011/9/10/13


Andrea


Re: swapcache-style cache?

From
karavelov@mail.bg
Date:
----- Quote from Andrea Suisani (sickpig@opinioni.net), on 28.02.2012 at 09:54 -----

> On 02/28/2012 04:52 AM, Rob Wultsch wrote:
>> On Wed, Feb 22, 2012 at 2:31 PM, james wrote:
>>> Has anyone considered managing a system like the DragonFLY swapcache for a
>>> DBMS like PostgreSQL?
>>>
>>
>> https://www.facebook.com/note.php?note_id=388112370932
>>
>
> in the same vein:
>
> http://bcache.evilpiepirate.org/
>
> from the main page:
>
> "Bcache is a patch for the Linux kernel to use SSDs to cache other block devices. It's analogous to L2Arc for ZFS,
> but Bcache also does writeback caching, and it's filesystem agnostic. It's designed to be switched on with a minimum
> of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random
> reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high end storage arrays, and perhaps
> even embedded."
>
> it was submitted to linux kernel mailing list a bunch of time, the last one:
>
> https://lkml.org/lkml/2011/9/10/13
>
>
> Andrea
>

I am successfully using Facebook's flashcache in write-through mode, so it speeds up only reads. I have seen a
threefold increase in TPS for databases that do not fit in RAM. I am using an Intel X25-E over a RAID10 of 4 SAS
disks. I have also tested writeback mode, but the gain is not so huge and there is a considerable risk of losing
all your data if/when the SSD fails.

Best regards

-- 
Luben Karavelov
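
[Illustration of the write-through versus writeback trade-off mentioned above. This is a toy model with dictionaries standing in for the SSD and the disk array; it is not how flashcache is implemented, and the class and method names are invented for the example. It only shows why an SSD failure is harmless in write-through mode and dangerous in writeback mode.]

```python
class TieredStore:
    """Toy SSD cache in front of a disk; dicts stand in for both devices."""

    def __init__(self, writeback):
        self.writeback = writeback
        self.disk = {}
        self.ssd = {}
        self.dirty = set()  # blocks whose newest copy exists only on the SSD

    def write(self, block, data):
        self.ssd[block] = data
        if self.writeback:
            self.dirty.add(block)    # acknowledged before it reaches disk
        else:
            self.disk[block] = data  # write-through: disk is always current

    def read(self, block):
        if block in self.ssd:
            return self.ssd[block]   # cache hit
        data = self.disk[block]      # cache miss: fetch and remember
        self.ssd[block] = data
        return data

    def flush(self):
        """Write every dirty block back to disk."""
        for block in self.dirty:
            self.disk[block] = self.ssd[block]
        self.dirty.clear()

    def ssd_fails(self):
        """Drop the cache; return the blocks whose latest contents are lost."""
        lost = sorted(self.dirty)
        self.ssd.clear()
        self.dirty.clear()
        return lost


for mode in (False, True):
    store = TieredStore(writeback=mode)
    store.write(1, "row data")
    name = "writeback" if mode else "write-through"
    print(name, "blocks lost on SSD failure:", store.ssd_fails())
```

In the writeback case the disk never saw the newest version of the block, which from the database's point of view means committed data can be gone and the files on disk can be inconsistent with each other.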

Re: swapcache-style cache?

From
Andrea Suisani
Date:
On 02/28/2012 08:54 AM, Andrea Suisani wrote:
> On 02/28/2012 04:52 AM, Rob Wultsch wrote:
>> On Wed, Feb 22, 2012 at 2:31 PM, james<james@mansionfamily.plus.com> wrote:
>>> Has anyone considered managing a system like the DragonFLY swapcache for a
>>> DBMS like PostgreSQL?
>>>
>>
>> https://www.facebook.com/note.php?note_id=388112370932
>>
>
> in the same vein:
>
> http://bcache.evilpiepirate.org/

[cut]

> it was submitted to linux kernel mailing list a bunch of time, the last one:
>
> https://lkml.org/lkml/2011/9/10/13

forgot to mention another good write-up

https://lwn.net/Articles/394672/

Andrea


Re: swapcache-style cache?

From
Rob Wultsch
Date:
On Mon, Feb 27, 2012 at 11:54 PM, Andrea Suisani <sickpig@opinioni.net> wrote:
> On 02/28/2012 04:52 AM, Rob Wultsch wrote:
>>
>> On Wed, Feb 22, 2012 at 2:31 PM, james<james@mansionfamily.plus.com>
>>  wrote:
>>>
>>> Has anyone considered managing a system like the DragonFLY swapcache for
>>> a
>>> DBMS like PostgreSQL?
>>>
>>
>> https://www.facebook.com/note.php?note_id=388112370932
>>
>
> in the same vein:
>
> http://bcache.evilpiepirate.org/
>
> from the main page:
>
> "Bcache is a patch for the Linux kernel to use SSDs to cache other block
> devices. It's analogous to L2Arc for ZFS,
> but Bcache also does writeback caching, and it's filesystem agnostic. It's
> designed to be switched on with a minimum
> of effort, and to work well without configuration on any setup. By default
> it won't cache sequential IO, just the random
> reads and writes that SSDs excel at. It's meant to be suitable for desktops,
> servers, high end storage arrays, and perhaps
> even embedded."
>
> it was submitted to linux kernel mailing list a bunch of time, the last one:
>
> https://lkml.org/lkml/2011/9/10/13
>
>
> Andrea


I am pretty sure I won't get fired (or screw up the IPO) by saying
that I have a high opinion of Flashcache (at least within the fb
environment).

Is anyone using bcache at scale?

--
Rob Wultsch
wultsch@gmail.com