Re: Need help with 8.4 Performance Testing - Mailing list pgsql-performance

From: Scott Carey
Subject: Re: Need help with 8.4 Performance Testing
Msg-id: BDFBB77C9E07BE4A984DAAE981D19F961ACA17D9FC@EXVMBX018-1.exch018.msoutlookonline.net
In response to: Re: Need help with 8.4 Performance Testing (Jean-David Beyer <jeandavid8@verizon.net>)
Responses:
  Re: Need help with 8.4 Performance Testing ("Scott Marlowe" <scott.marlowe@gmail.com>)
  Re: Need help with 8.4 Performance Testing (Gregory Stark <stark@enterprisedb.com>)
  Re: Need help with 8.4 Performance Testing (Scott Carey <scott@richrelevance.com>)
  Re: Need help with 8.4 Performance Testing (Greg Smith <gsmith@gregsmith.com>)
List: pgsql-performance
> ________________________________________
> From: pgsql-performance-owner@postgresql.org [pgsql-performance-owner@postgresql.org] On Behalf Of Jean-David Beyer [jeandavid8@verizon.net]
> Sent: Tuesday, December 09, 2008 5:08 AM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Need help with 8.4 Performance Testing

> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.

Yes, the OS is in a better position to arbitrate between multiple things.  Of course, we aren't talking about the highest performance databases if we are talking about mixed-use systems though.
Additionally, the OS can never really get it right, with a DB or other apps.  Any app can behave badly, grab too much RAM, and access it regularly enough for it not to be 'idle' much, giving the OS VM fits trying to figure out whether it is important or not versus other processes.

> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.

The problem is when none of them are really 'idle': when the OS has to decide which pages, all of which have been accessed recently, to evict.  Most OSes will make bad choices if the load mixes random and sequential access, as they treat all pages equally with respect to freshness versus eviction.
Another problem is that there IS a difference between being in postgres' cache and the OS cache.  One is more expensive to retrieve from than the other.  Significantly.

Accessing buffers in shared_buffers, in-process, uses a good chunk less CPU (and avoids the extra data copy and shared buffer eviction overhead) compared to going through a syscall to the OS.
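To put a rough number on that, here is a microbenchmark sketch (mine, not anything from PostgreSQL itself) comparing the two paths: copying an 8kB block already sitting in process memory, which is roughly the shared_buffers hit case, versus pread()ing the same 8kB out of a warm OS page cache.  The file name "testfile" and the iteration count are arbitrary; compile with -O0 or the memcpy loop may be optimized away.

/* compare in-process buffer access vs. a syscall into the OS page cache */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLKSZ 8192            /* PostgreSQL's default block size */
#define ITERS 1000000

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *src = malloc(BLKSZ), *dst = malloc(BLKSZ);
    memset(src, 'x', BLKSZ);

    int fd = open("testfile", O_RDONLY);        /* any file >= 8kB */
    if (fd < 0) { perror("open"); return 1; }
    if (pread(fd, dst, BLKSZ, 0) != BLKSZ)      /* warm the page cache */
        { perror("pread"); return 1; }

    double t0 = now_sec();
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BLKSZ);                /* "shared_buffers hit" */
    double t1 = now_sec();
    for (int i = 0; i < ITERS; i++)
        pread(fd, dst, BLKSZ, 0);               /* syscall + copy from OS cache */
    double t2 = now_sec();

    printf("memcpy: %.2fs  pread: %.2fs\n", t1 - t0, t2 - t1);
    close(fd);
    return 0;
}

The pread() side pays the user/kernel transition plus the copy out of the page cache on every block; the in-process side is just the copy (this sketch doesn't model Postgres' buffer pin/lookup overhead, so treat it as a lower bound on the difference).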

And as far as I can tell, even after the 8.4 fadvise patch, all I/O is done in block_size chunks.  (Hopefully I am wrong.)
My system is now CPU bound: the I/O can do sequential reads at more than 1.2GB/sec, but Postgres can't do a seqscan 30% as fast because it eats up CPU like crazy just reading and identifying tuples.  It does seqscans ~25% faster from shared_buffers than from the OS's page cache, though.  Seqscans peak between 250MB/sec and 400MB/sec, from memory or disk, typically showing no more than 35% iostat utilization of the array when reading off disk -- so we run a few concurrently where we can.
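For reference, my understanding is that the fadvise work issues hints like the one below ahead of the actual reads (a sketch under that assumption; these helper functions are illustrative, not PostgreSQL's actual code).  The kernel can start readahead early, but the data still comes back one block_size chunk per syscall:

#include <fcntl.h>
#include <unistd.h>

#define BLKSZ 8192

/* hint to the kernel that nblocks blocks starting at blkno are needed soon */
void prefetch_blocks(int fd, off_t blkno, int nblocks)
{
    posix_fadvise(fd, blkno * BLKSZ, (off_t)nblocks * BLKSZ,
                  POSIX_FADV_WILLNEED);
}

/* ...but each actual read is still one 8kB block, one syscall: */
ssize_t read_block(int fd, off_t blkno, char *buf)
{
    return pread(fd, buf, BLKSZ, blkno * BLKSZ);
}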

In addition to the fadvise patch, postgres needs to merge adjacent I/Os into larger ones to reduce the overhead.  It only really needs to merge up to sizes of about 128k or 256k to gain an 8x to 16x drop in syscall overhead, and additionally it could potentially save trips down the shared buffer management code paths.  At least, that's my guess -- I haven't looked at any code and could be wrong.
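A sketch of what I mean by merging (again just my guess, not actual PostgreSQL code): sixteen adjacent 8kB blocks fetched with one 128kB pread() instead of sixteen small ones, paying the syscall overhead once:

#include <string.h>
#include <unistd.h>

#define BLKSZ   8192
#define MERGE_N 16                    /* 16 x 8kB = one 128kB syscall */

/* read MERGE_N adjacent blocks in one syscall, then hand each 8kB
 * slice back to the caller's per-block buffers */
int read_merged(int fd, off_t first_blkno, char *blocks[MERGE_N])
{
    char big[MERGE_N * BLKSZ];

    if (pread(fd, big, sizeof(big), first_blkno * BLKSZ) != (ssize_t)sizeof(big))
        return -1;
    for (int i = 0; i < MERGE_N; i++)
        memcpy(blocks[i], big + i * BLKSZ, BLKSZ);
    return 0;
}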


Additionally, regarding the "If your operating system has any reasonable caching itself" comment earlier in this conversation -- Linux (2.6.18, CentOS 5.2) does NOT.  I can easily make it spend 100% CPU in system time for an hour trying to figure out what to do with the system cache.  Just do large seqscans with memory pressure from work_mem or other allocations that the OS will not deem 'idle'.  Once the requested memory is ~75% of the system total, it will freak out.  Linux simply will not give up that last 25% or so of the RAM for anything but page cache, even though the disk subsystem is very fast and most of the access is sequential, marginalizing the benefit of the cache.  Depending on how you tune it, it will either spin system CPU or swap storm, but the system CPU spin times for the same workload are a lot shorter than an equivalent swap storm.
Mount the data drive with O_DIRECT and the problem vanishes.  I've been told that this problem may be gone in some of the latest kernels.  I have seriously considered bumping shared_buffers up a lot and mounting the thing direct -- but then we lose the useful scheduler and readahead algorithms.  The other way around (small shared_buffers, let the OS do it) hurts performance overall quite a bit -- randomly accessed pages get pushed out to the OS cache more often, and the OS tosses those out when a big seqscan occurs, resulting in a lot more random access from disk and more disk-bound periods of time.  Great wonder, this operating system caching, eh?
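Strictly speaking, O_DIRECT is an open(2) flag rather than a mount option, so per-file the bypass looks like this sketch (Linux-specific; O_DIRECT requires the buffer and offset to be aligned, hence posix_memalign -- the path and block numbering are illustrative):

#define _GNU_SOURCE                   /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLKSZ 8192

/* read one 8kB block straight from disk, bypassing the OS page cache */
int read_block_direct(const char *path, off_t blkno, char **out)
{
    void *buf;
    if (posix_memalign(&buf, 4096, BLKSZ) != 0)     /* aligned buffer */
        return -1;

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { free(buf); return -1; }

    ssize_t n = pread(fd, buf, BLKSZ, (off_t)blkno * BLKSZ);
    close(fd);
    if (n != BLKSZ) { free(buf); return -1; }

    *out = buf;                                     /* caller frees */
    return 0;
}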

In any event, don't hold up these OS page cache things as if they're the best thing in the world for a database; they have serious flaws themselves and typically are difficult or impossible to tune to be ideal for a database.

It's one thing to propose that a database build its own file system (hard, and why bother?), versus having a database manage its own page cache intelligently and access the OS file system as optimally as it can.  In both of the latter, the DB knows much more about what data is really important than the OS does (it could, for example, prioritize cache versus work_mem intelligently, which the OS can get horribly wrong in my experience, and it knows when a huge seqscan occurs and can make caching those results low priority).  No matter how you do it using the OS cache, you cache twice and copy twice.  O_DIRECT isn't usually an option for other reasons; the OS disk scheduler, readahead, and other benefits of a file system are real and substantial.  If you are caching twice, you might as well have the "closer" copy of that data be the larger, more efficient pool.

As for tipping points and pgbench -- it doesn't seem to reflect the kind of workload we use postgres for at all, though my workload does a lot of big hashes and seqscans, and I'm curious how much improved those may be due to the hash improvements.  32GB RAM and 3TB data (about 250GB scanned regularly) here.  And yes, we are almost completely CPU bound now except for a few tasks.  Iostat only reports above 65% disk utilization for about 5% of the workload duty cycle, and is regularly < 20%.  COPY doesn't get anywhere near platter speeds, even on indexless bulk transfer.  The highest disk usage spikes occur when some of our random-access data/indexes get shoved out of cache.  These aren't too large, but high enough seqscan load will cause postgres and the OS to dump them from cache.  If we put these on some SSDs, the disk utilization % would drop a lot further.

I feel confident in saying that in about a year, I could spec out a medium-sized hardware budget ($25k) for almost any postgres setup and make it almost purely CPU bound.
SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it will take no more than 12 SATA drives in RAID 10 next year (plus a good controller or software RAID) to get 1GB/sec sequential reads -- twelve spindles at the ~85MB/sec sequential rate of a current SATA disk is roughly 1GB/sec, since RAID 10 reads can be spread across both sides of each mirror.
