Thread: WAL in RAM

WAL in RAM

From
Marcus Engene
Date:
Hi list,

Every now and then I have write peaks which cause annoying delays on my
website. No particular reason, it seems; just that the laws of probability
dictate that there will be peaks every now and then.

Anyway, while thinking of ways to make the peaks more bearable, I saw the new
9.1 feature to bypass WAL. The problem is mainly that some statistics
tables ("x users clicked this link this month") clog the write cache,
not more important writes. I could live with restoring a nightly dump of
these tables and losing a day's worth of logs.
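(The 9.1 feature referred to here is unlogged tables. A minimal sketch, with a
made-up table name, would be:)

```sql
-- Sketch of a WAL-bypassing statistics table (9.1 unlogged tables).
-- Table and column names are made up for illustration.
CREATE UNLOGGED TABLE link_clicks (
    link_id    integer NOT NULL,
    clicked_at timestamptz NOT NULL DEFAULT now()
);
-- Writes to this table skip WAL entirely; the trade-off is that the
-- table is truncated after a crash, i.e. "lose a day's worth of logs".
```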

Though not keen on jumping onto early major versions, an old idea of
putting the WAL in RAM came back. Not RAM as in main memory, but some
device pretending to be a drive, with proper battery backup.

a) There seems to exist odd hardware with RAM modules and, if you're lucky,
a battery.
b) Some drive manufacturers have made hybrid RAM-spindle drives
(compare with the possibly more common SSD-spindle hybrids).

b) sounds slightly more appealing, since it basically means I put
everything on those drives and everything is magically faster. The a)
alternatives also seemed to be non-ECC, which is a no-no and disturbing.

Does anyone here have any recommendations?

Pricing is not very important but reliability is.

Thanks,
Marcus

Re: WAL in RAM

From
"Kevin Grittner"
Date:
Marcus Engene <mengpg2@engene.se> wrote:

> Every now and then I have write peaks which causes annoying delay
> on my website

> Does anyone here have any recommendations here?

For our largest machines we put WAL on a RAID1 drive pair dedicated
to that task, on its own controller with battery-backed cache
configured for write-back.  It does make a big difference, because
when a DBA accidentally got this wrong once, we saw the problem you
describe, and moving WAL to the dedicated drives/controller caused
the problem to go away.

If problems remain, look for posts by Greg Smith on how to tune
this.  You may want to extend your checkpoint completion target,
make the background writer more aggressive, reduce shared buffers,
or tune the OS.  But if you can afford to put WAL on a dedicated
file system something like the above, that would be a better place
to start, IMO.
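For anyone wanting to replicate this, the usual way to put WAL on dedicated
drives is to relocate pg_xlog with a symlink; a sketch, where the paths are
examples for a 9.0 install and the server must be stopped first:

```shell
# Relocate pg_xlog onto a dedicated filesystem (sketch; paths are examples).
PGDATA=/var/lib/postgresql/9.0/main
pg_ctl -D "$PGDATA" stop
mv "$PGDATA/pg_xlog" /mnt/wal/pg_xlog      # /mnt/wal = the dedicated array
ln -s /mnt/wal/pg_xlog "$PGDATA/pg_xlog"   # point the server at it
pg_ctl -D "$PGDATA" start
```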

-Kevin

Re: WAL in RAM

From
Merlin Moncure
Date:
On Fri, Oct 28, 2011 at 10:28 AM, Marcus Engene <mengpg2@engene.se> wrote:
> Hi list,
>
> Every now and then I have write peaks which causes annoying delay on my
> website.
> [snip]
> Does anyone here have any recommendations here?
>
> Pricing is not very important but reliability is.

Have you ruled out SSD?  They are a little new, but I'd be looking at
the Intel 710.  In every case I've seen, SSDs permanently end I/O
issues.  DRAM storage solutions I find to be pricey and complicated
when there are so many workable flash options out now.

merlin

Re: WAL in RAM

From
"Tomas Vondra"
Date:
On 28 October 2011, 18:11, Merlin Moncure wrote:
> On Fri, Oct 28, 2011 at 10:28 AM, Marcus Engene <mengpg2@engene.se> wrote:
>> [snip]
>
> Have you ruled out SSD?  They are a little new, but I'd be looking at
> the Intel 710.  In every case I've seen SSD permanently ends I/O
> issues.  DRAM storage solutions I find to be pricey and complicated
> when there are so many workable flash options out now.

Are you sure SSDs are a reasonable option for WAL? I personally don't
think it's a good option, because WAL is written in a sequential manner,
and that's not an area where SSDs beat spinners really badly.

For example, the Intel 710 SSD has a sequential write speed of 210MB/s,
while a simple 7.2k SATA drive can write about 50-100 MB/s for less than
1/10 of the 710's price.

I'm not saying SSDs are a bad thing, but I think it's a waste of money to
use them for WAL.

Tomas


Re: WAL in RAM

From
Merlin Moncure
Date:
On Fri, Oct 28, 2011 at 1:26 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 28 October 2011, 18:11, Merlin Moncure wrote:
>> On Fri, Oct 28, 2011 at 10:28 AM, Marcus Engene <mengpg2@engene.se> wrote:
>>> [snip]
>>
>> Have you ruled out SSD?  They are a little new, but I'd be looking at
>> the Intel 710.  In every case I've seen SSD permanently ends I/O
>> issues.  DRAM storage solutions I find to be pricey and complicated
>> when there are so many workable flash options out now.
>
> Are you sure SSDs are a reasonable option for WAL? I personally don't
> think it's a good option, because WAL is written in a sequential manner,
> and that's not an area where SSDs beat spinners really badly.
>
> For example the Intel 710 SSD has a sequential write speed of 210MB/s,
> while a simple SATA 7.2k drive can write about 50-100 MB/s for less than
> 1/10 of the 710 price.
>
> I'm not saying SSDs are a bad thing, but I think it's a waste of money to
> use them for WAL.

sure, but then you have a more complicated setup, with drive(s)
designated for WAL, another for storage, etc.  Also, your argument
falls away if the WAL shares a drive with anything else. The era
of the SSD is here. All new systems I plan will have SSD storage
unless cost pressures are extreme -- often with a single drive, unless
you need the extra storage.  If I need availability, instead of RAID,
I'll just build in a hot standby.

merlin

Re: WAL in RAM

From
"Tomas Vondra"
Date:
Hi,

On 28 October 2011, 17:28, Marcus Engene wrote:
> Hi list,
>
> Every now and then I have write peaks which causes annoying delay on my
> website. No particular reason it seems, just that laws of probability
> dictates that there will be peaks every now and then.
>
> Anyway, thinking of ways to make the peaks more bareable, I saw the new
> 9.1 feature to bypass WAL. Problems is mainly that some statistics
> tables ("x users clicked this link this month") clog the write cache,
> not more important writes. I could live with restoring a nightly dump of
> these tables and loose a days worth of logs.

Why do you think the write activity is related to WAL? Does that mean bulk
loading of data by users, or what? Have you measured how many WAL segments
that creates? What triggers the write activity?
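One way to measure WAL segment turnover is simply to count recently modified
segment files in pg_xlog; a sketch, where the data directory path is an
example (each segment is 16 MB):

```shell
# Count WAL segment files written in the last 10 minutes (sketch).
# The path is an example -- point it at your own data directory.
find /var/lib/postgresql/9.0/main/pg_xlog -maxdepth 1 -type f -mmin -10 | wc -l
```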

Write peaks usually mean a checkpoint is in progress, and that has nothing
to do with WAL. More precisely, a checkpoint does not write data to the WAL
but to the data files, so moving the WAL to a separate device won't help at
all.

The common scenario is about this:

(1) The users make some changes (INSERT/UPDATE/DELETE) to the database;
these are written to the WAL (fsynced to the device). This is not a lot of
work, as writing to WAL is sequential access, and the actual modifications
are stored in the shared buffers (not forced to disk yet).

(2) A checkpoint is triggered, i.e. either checkpoint_timeout expires or
all available WAL segments are filled - this means all the dirty buffers
have to be actually written from shared buffers to the data files. This is
a PITA, as it's rather random access.

Anyway there are options to tune the write performance - most notably
checkpoint_segments, checkpoint_completion_target, checkpoint_timeout.

> Though not keen on jumping over to early major versions an old idea of
> putting WAL in RAM came back. Not RAM in main memory but some thingie
> pretending to be a drive with proper battery backup.
>
> a) It seems to exist odd hardware with RAM modules and if lucky also
> battery
> b) Some drive manufactureres have done hybird ram-spindle drives
> (compare with possibly more common ssd-spindle hybrides).
>
> b) sounds slightly more appealing since it basically means I put
> everything on those drives and it magically is faster. The a)
> alternatives also seemed to be non ECC which is a no-no and disturbing.
>
> Does anyone here have any recommendations here?

The thing to look for when talking about WAL is sequential write speed.
The solutions you've mentioned above are great for random access, but when
you need sequential speed they're a waste of money (IMHO).

> Pricing is not very important but reliability is.

My recommendation is to find out what's wrong before buying anything. My
impression is that you don't know the actual cause, so you really don't
know what features the device should have.

If you're willing to spend money without proper analysis, send the money
to me - the result will be about the same. You'll spend money without
actually solving the issue, plus it will make me a bit happier.

Anyway, my recommendations would be these:

(1) Find out what actually happens, i.e. check if it's a checkpoint issue,
or what is going on. Enable log_checkpoints etc.
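A sketch of what to look at for (1): turn on checkpoint logging and compare
timed vs. requested checkpoints in the bgwriter statistics view:

```sql
-- In postgresql.conf (then reload):  log_checkpoints = on
-- Many "requested" checkpoints relative to "timed" ones suggest the
-- write bursts are filling checkpoint_segments too quickly.
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
  FROM pg_stat_bgwriter;
```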

(2) Try to tune the db a bit - I'm not sure what version you are using or
what the important values are, but generally this is a good starting point
for write-heavy databases:

checkpoint_segments = 64
checkpoint_completion_target = 0.9
checkpoint_timeout = 30 minutes

(3) Provide more details - Pg version, important config values, hardware
info etc.

Only once I knew what was wrong, and if the above did not help, would I
consider buying new hardware. I'd probably go with one of these options:

(a) Move the WAL to a separate disk (not SSD), or maybe a RAID1/RAID10 of
such drives. Start with one drive; use more, and a controller with a BBWC,
if needed.

(b) Move the database to an SSD drive, leaving the WAL in the original
location (not SSD). This might be significantly more expensive, especially
if you want to make it reliable (building a RAID1 of SSDs or something like
that).

Tomas


Re: WAL in RAM

From
"Tomas Vondra"
Date:
On 28 October 2011, 20:40, Merlin Moncure wrote:
> sure, but then you have to have a more complicated setup with a
> drive(s) designated for WAL, another for storage, etc.  Also, your
> argument falls away if the WAL is shared with another drive. The era
> of the SSD is here. All new systems I plan will have SSD storage
> unless cost pressures are extreme -- often with a single drive unless
> you need the extra storage.  If I need availability, instead of RAID,
> I'll just build hot standby in.

Well, sure - I'm actually a fan of SSDs. Using an SSD for the data files,
or an SSD for the whole database (including WAL), makes sense, but my
impression was that the OP wants to buy a new drive and use it for WAL
only, and that's not really cost-effective, I guess.

Tomas


Re: WAL in RAM

From
Claudio Freire
Date:
On Fri, Oct 28, 2011 at 12:28 PM, Marcus Engene <mengpg2@engene.se> wrote:
> Hi list,
>
> Every now and then I have write peaks which causes annoying delay on my
> website. No particular reason it seems, just that laws of probability
> dictates that there will be peaks every now and then.
>
> Anyway, thinking of ways to make the peaks more bareable, I saw the new 9.1
> feature to bypass WAL. Problems is mainly that some statistics tables ("x
> users clicked this link this month") clog the write cache, not more
> important writes. I could live with restoring a nightly dump of these tables
> and loose a days worth of logs.
> ...
> Does anyone here have any recommendations here?

You didn't post configuration details.

Just OTOMH, I'd say you have a low shared_buffers setting and that
increasing it could help.

That's assuming the updates you mention on the statistics tables heavily
update the same rows over and over, in which case shared buffers would
help tremendously.
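One quick way to check that hypothesis is the buffer-cache hit ratio from
the standard statistics views; a sketch, where the database name is a
placeholder:

```sql
-- Rough cache hit ratio; a low value under load while the statistics
-- tables are being hammered would support raising shared_buffers.
SELECT blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS hit_ratio
  FROM pg_stat_database
 WHERE datname = 'mydb';  -- placeholder name
```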

Re: WAL in RAM

From
David Boreham
Date:
On 10/28/2011 12:26 PM, Tomas Vondra wrote:
> For example the Intel 710 SSD has a sequential write speed of 210MB/s,
> while a simple SATA 7.2k drive can write about 50-100 MB/s for less than
> 1/10 of the 710 price.
Bulk data transfer rates mean almost nothing in the context of a database
(unless you're, for example, backing it up by copying the files to another
machine...). The key factor typically is small-block writes/s (for WAL) and
random small-block reads/s (for data). The 710 or similar-performance SSDs
will deliver on the order of 20-50x the performance of a traditional hard
drive in these areas.



Re: WAL in RAM

From
Marcus Engene
Date:
On 10/28/11 5:45, Kevin Grittner wrote:
> Marcus Engene <mengpg2@engene.se> wrote:
>
>> Every now and then I have write peaks which causes annoying delay
>> on my website
>
>> Does anyone here have any recommendations here?
>
> For our largest machines we put WAL on a RAID1 drive pair dedicated
> to that task, on its own controller with battery-backed cache
> configured for write-back.  It does make a big difference, because
> when a DBA accidentally got this wrong once, we saw the problem you
> describe, and moving WAL to the dedicated drives/controller caused
> the problem to go away.
>
> If problems remain, look for posts by Greg Smith on how to tune
> this.  You may want to extend your checkpoint completion target,
> make the background writer more aggressive, reduce shared buffers,
> or tune the OS.  But if you can afford to put WAL on a dedicated
> file system something like the above, that would be a better place
> to start, IMO.
>
> -Kevin

The problem I have with battery-backed RAID controllers is the battery
part. They're simply not reliable and require testing etc., which I, as a
rather insignificant customer at a generic datacenter, cannot have done
properly. I have however found this thing, which I find promising:
http://news.cnet.com/8301-21546_3-10273658-10253464.html
An Adaptec 5Z-series controller which has a supercap and flushes to an SSD
on mishap. Perhaps that's the answer to everything?

As per others' suggestions, I don't feel encouraged to put WAL on SSD,
having found several texts by Greg Smith and others warning about this. I
do have 2x OCI Sandforce 1500 drives (with supercap) for some burst-load
tables.

The reason I started to think about putting WAL on a RAM drive to begin
with was that the performance figures for unlogged tables looked very
promising indeed. And the tests were of the sort that's occupying my
bandwidth: accumulating statistical writes.

The present machine runs Pg 9.0.4 on Debian Squeeze, 2x Xeon, 72GB RAM,
software 4-drive RAID6 (sorry) + 2x SSD. It's an OLTP website with 10M
products and SOLR for FTS. During peaks it's using ~3-4% CPU, and it's
99.9% reads or thereabouts. It's the peaks we want to take down. RAID6 or
not, with a spindle as the bottleneck there is just a certain max number
of writes/s.
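To put a number on that writes/s ceiling during a peak, something like this
Linux sketch works (the device name is an example; `iostat -x 1` from the
sysstat package shows the same and more):

```shell
# Sample write throughput for one block device over one second (Linux).
# In /proc/diskstats, field 3 is the device name and field 10 is
# sectors written (512 bytes each).
dev=sda   # example device name -- use your WAL/data device
s1=$(awk -v d="$dev" '$3 == d { print $10 }' /proc/diskstats)
sleep 1
s2=$(awk -v d="$dev" '$3 == d { print $10 }' /proc/diskstats)
echo "$(( (s2 - s1) * 512 / 1024 )) KiB/s written to $dev"
```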

Thanks for your answers so far!

Best regards,
Marcus


Re: WAL in RAM

From
Scott Marlowe
Date:
On Sat, Oct 29, 2011 at 11:54 AM, Marcus Engene <mengpg2@engene.se> wrote:

> The problem I have with battery backed raid controllers is the battery part.
> They're simply not reliable and requires testing etc which I as a rather
> insignificant customer at a generic datacenter cannot have done properly. I
> have however found this thing which I find primising:
> http://news.cnet.com/8301-21546_3-10273658-10253464.html
> An Adaptec 5z-controller which has a supercap and flushes to a SSD drive on
> mishap. Perhaps that's the answer to everything?

In over 10 years of using hardware RAID controllers with battery
backup on many many machines, I have had exactly zero data loss due to
a failed battery backup.  Of course proper monitoring is important, to
make sure the batteries aren't old and dead, but every single BBU RAID
controller I have used automatically switched from write back to write
through when they detected a bad battery pack.

Proper testing is essential whether it's BBU caching or an SSD, and
failing to test is inexcusable if your data is at all important.  Given
the current high failure rate of SSDs due to firmware issues (and it's not
just the Intel drives experiencing such failures), I'm much more confident
in Areca, 3ware, and LSI BBU RAID controllers right now than I am in SSDs.

> As per others suggestions I don't feel encouraged to put WAL on SSD from
> finding several texts by Greg Smith and others warning about this. I do have
> 2x OCI Sandforce 1500 drives (with supercap) for some burst load tables.
>
> The reason I started to think about putting WAL on a RAM drive to begin with
> was that performance figures for unlogged tables looked very promising
> indeed. And the test were of the sort that's occupying my bandwidth;
> accumulating statistical writes.
>
> The present pg9 computer is a Pg 9.0.4, Debian Squeeze, 2xXeon, 72GB,
> software 4xRAID6(sorry) + 2xSSD. It's OLTP website with 10M products and
> SOLR for FTS. During peak it's using ~3-4% CPU, and it's 99.9% reads or
> thereabouts. It's the peaks we want to take down. RAID6 or not, with a
> spindle as bottleneck there is just a certain max# of writes/s.

First things first, get off RAID-6.  A 4 drive RAID-6 gives no more
storage than a 4 drive RAID-10, and is painfully slow by comparison.
Looking at SSDs for WAL is putting the cart about 1,000 miles ahead of
the horse at this point.  You'd be much better off migrating to a
single SSD for everything than running on a 4 disk RAID-6.

Re: WAL in RAM

From
Marcus Engene
Date:
On 10/29/11 10:11, Scott Marlowe wrote:
> In over 10 years of using hardware RAID controllers with battery
> backup on many many machines, I have had exactly zero data loss due to
> a failed battery backup.  Of course proper monitoring is important, to
> make sure the batteries aren't old and dead, but every single BBU RAID
> controller I have used automatically switched from write back to write
> through when they detected a bad battery pack.
>
> Proper testing is essential whether it's BBU Caching or using an SSD,
> and failure to do so is inconceivable if your data is at all
> important.  Given the current high failure rate of SSDs due to
> firmware issues (and it's not just the intel drives experiencing such
> failures) I'm much more confident in Areca, 3Ware, and LSI BBU RAID
> controllers right now than I am in SSDs.
>

Rimu got me a setup with 2x 5805 BBU controllers configured as two RAID10
arrays with 15k rpm SAS drives, and on top of that 2x Xeon E5645 (the
hex-core). Since I heard warnings that with non-software RAID the machine
could be unresponsive during boot while doing a rebuild, I took small 300G
drives. Not that 15k SAS comes in much bigger sizes anyway, but still.

I chickened out of pg 9.1 due to the low minor number.

I also set...
wal_buffers = 16MB
...which used to default to 64kB, which could possibly explain some of
the choke problems during write bursts.
>
>> [snip]
> First things first, get off RAID-6.  A 4 drive RAID-6 gives no more
> storage than a 4 drive RAID-10, and is painfully slow by comparison.
> Looking at SSDs for WAL is putting the cart about 1,000 miles ahead of
> the horse at this point.  You'd be much better off migrating to a
> single SSD for everything than running on a 4 disk RAID-6.
>
>

Message received and understood :)

Having read up too much on drive-reliability paranoia, in combination
with going from 7k2 -> 15k, I feel a bit uneasy, but this mama is fast.
I suppose a little bit could be credited to the newly restored dump rather
than the little over a year of entropy in the other machine. But I also
did some update/write torture and it was hard to provoke any I/O wait.

I put the OS & WAL on one array and the general data files on the other.
The data directory that used to be on the SSD drive was also put on the
WAL raid.

Thanks for your advice!
Marcus
Marcus