Re: amazon ec2 - Mailing list pgsql-performance

From Mark Rostron
Subject Re: amazon ec2
Msg-id BANLkTinKbgzSGwg1GWTpwUCkwSpaCaebsA@mail.gmail.com
In response to Re: amazon ec2  (Alan Hodgson <ahodgson@simkin.ca>)
List pgsql-performance
iowait is a problem on any platform that relies on spinning media rather than RAM.
no matter how fast the disk is, and no matter how intelligent the controller, you are still dealing with an access-speed differential on the order of 10^6 between disk access and memory access.
i have had good results by avoiding the disk altogether.
if you can do this, ec2 is not too shabby - but beware, it doesn't come free.
it is achievable under the following circumstances (and there may be other ways to do it).
i use a technique of pro-actively querying enough of my anticipated result set with a daemon procedure.
as long as the daemon runs more frequently than the competing processes (e.g. ETL and other update activity), AND a substantial part of the result set fits in available RAM, the result set will be served from the file system cache at the time you want it.
i have found that it doesn't take much to make this happen once you have identified your critical result set - you can get away with running it once an hour, and i'm still reducing the frequency and getting good results.
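a minimal sketch of the daemon idea - the table names, queries, and connection string here are all made-up assumptions, not from anything above; the point is just that each query drags the critical pages through the file system cache before the real workload needs them:

```python
# sketch of a cache-warming daemon: run the queries that cover the
# anticipated result set, discard the rows -- only the read I/O matters.
# all names below are hypothetical.
import subprocess

WARM_QUERIES = [
    "SELECT count(*) FROM orders WHERE created_at > now() - interval '7 days'",
    "SELECT count(*) FROM order_items oi JOIN orders o ON o.id = oi.order_id",
]

def warm_cache(dsn="dbname=prod", dry_run=False):
    """Build (and optionally run) one psql invocation per warming query."""
    cmds = [["psql", dsn, "-Aqt", "-c", q] for q in WARM_QUERIES]
    if not dry_run:
        for cmd in cmds:
            # output is thrown away; the side effect is a warm fs cache
            subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
    return cmds

if __name__ == "__main__":
    warm_cache()
    # schedule with cron, e.g.:  0 * * * *  /usr/bin/python warm_cache.py
```
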
this approach basically assumes a 90/10 rule: at any point in time you only want to access 10% of your data. if you can work out what that 10% is, and it fits into RAM, you can keep it cached.
it also imposes no additional cost on ec2, because Amazon doesn't bill you for CPU activity, although the large-RAM machines do cost more. it depends on how big your critical result set is and how much speed you need.
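as a back-of-envelope check on whether the 90/10 assumption pays off - the numbers here are entirely made up for illustration:

```python
# made-up numbers: check whether the "hot" 10% of the data fits in RAM
db_size_gb = 500        # total on-disk database size (assumption)
hot_fraction = 0.10     # the 90/10 rule: share of data actually queried
ram_gb = 68             # e.g. a large-RAM EC2 instance (assumption)

hot_set_gb = db_size_gb * hot_fraction
fits = hot_set_gb <= ram_gb
print(hot_set_gb, fits)  # 50.0 True -> the hot set can live in the fs cache
```
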

don't know if this helps - the success or failure of it depends on your typical query activity, the size of your critical result set, and whether you can get enough RAM to make this work.

as i said, it doesn't come for free, but you can make it work.

as a further point, try also checking out greenplum - it is an excellent postgres derivative with a very powerful free version. i bring it up because it offers block-level compression (with caveats - it also doesn't come for free, so do due diligence and rtfm carefully). the compression enabled me to improve the cache hit rate, which further reduces the iowait problem.
greenplum is also a better parallel machine than postgres, so by combining the cache technique above with greenplum's compression and parallel query, i have been able to get a 20:1 reduction in response times for some of our queries.
obviously introducing new database technology is a big deal, but we needed the speed, and it kinda worked.
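for reference, greenplum enables block compression per table at create time - a sketch, assuming a hypothetical append-only fact table (check the greenplum manual for the current syntax and its caveats; appendonly storage is a prerequisite for block compression):

```sql
-- hypothetical table; names and compression settings are illustrative only
CREATE TABLE sales_fact (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (sale_id);
```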

mr


On Tue, May 3, 2011 at 1:09 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
On May 3, 2011 12:43:13 pm you wrote:
> On May 3, 2011, at 8:41 PM, Alan Hodgson wrote:
> > I am also interested in tips for this. EBS seems to suck pretty bad.
>
> Alan, can you elaborate? Are you using PG on top of EBS?
>

Trying to, yes.

Let's see ...

EBS volumes seem to vary in speed. Some are relatively fast. Some are really
slow. Some fast ones become slow randomly. Some are fast attached to one
instance, but really slow attached to another.

Fast being a relative term, though. The fast ones seem to be able to do maybe
400 random IOPS. And of course you can only get about 80MB/sec sequential
access to them on a good day.

Which is why I'm interested in how other people are doing it. So far EC2
doesn't seem well suited to running databases at all.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
