Thread: Hardware upgrade for a high-traffic database
Hi All,

We're currently running Postgres 7.4.1 on FreeBSD 5.2, a Dual Xeon 2.4, 2GB
ECC, 3Ware Serial ATA RAID 5 w/ 4 disks (SLOW!!).

Our database is about 20GB on disk, we have some quite large tables - 2M
rows with TEXT fields in a sample table, accessed constantly. We average
about 4,000 - 5,000 queries per second - all from web traffic. As you can
imagine, we're quite disk limited and checkpoints can be killer.
Additionally, we see queries and connections getting serialized due to
queries that take a long time (5 sec or so) while waiting on disk access.
No fun at all.

We've tweaked everything long and hard, and at the end of the day, the disk
is killing us.

We're looking to upgrade our server - or rather, replace it, as it has no
upgrade path to SCSI. I'm considering going Opteron (though right now we
don't need more CPU time), and am looking for suggestions on what an
optimal RAID configuration may look like (disks, controller, cache
setting). We're in the market to buy right now - any good vendor
suggestions?

I'd appreciate any input, thanks!

Jason
Hi Rod,

Actually, we're already using a substantial caching system in code for
nearly all pages delivered - we've exhausted that option. Our system uses a
login/session table for about 1/8 of our page views (those visitors who are
logged in), and has tracking features. While I'd love to scrap them and
give the database server a vacation, it's a requirement for us.

You're correct about the query caching (stored in memory) being used - most
of our queries are run once and then come from memory (or, based on the
speed of consecutive executions, that seems to be the case). Once a user
hits a page for the first time in an hour or so, it seems to cache their
session query.

The issue that I think we're seeing is that the performance on the 3Ware
RAID is quite bad. Watching FreeBSD systat will show it at "100% busy" at
around "3.5 MB/s". When it needs to seek across a table (for, say, an
aggregate function - typically a COUNT()), it slows the entire server down
while working on the disk. Additionally, VACUUMs make the server
practically useless. We have indexes on everything that's used in queries,
and the planner is using them.

The server has 2GB of physical memory, however it only uses between 130MB
and 200MB of it. Postgres is the only application running on the server.

Our pertinent settings look like this:

max_connections = 512
shared_buffers = 20000
sort_mem = 2000
vacuum_mem = 20000
effective_cache_size = 300000
fsync = false
wal_sync_method = fsync
wal_buffers = 32
checkpoint_segments = 2
checkpoint_timeout = 30
commit_delay = 10000

Typically, we don't use anywhere near the 512 connections - however there
are peak hours where we come close, and other times that we eclipse it and
run out (should some connections become serialized due to a slowdown).
It's not something that we can comfortably lower. The non-standard
checkpoint settings have helped make it less likely that a large (in disk
time) query will conflict with a checkpoint write.

I'm a programmer - definitely not a DBA by any stretch - though I am forced
into the role. From reading this list, it seems to me that our settings
are reasonable given our usage, and that a disk upgrade is likely in order.

I'd love to hear any suggestions.

Thanks,

Jason

-----Original Message-----
From: Rod Taylor [mailto:pg@rbt.ca]
Sent: Tuesday, August 10, 2004 7:07 PM
To: Jason Coene
Cc: Postgresql Performance
Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database

> Our database is about 20GB on disk, we have some quite large tables - 2M
> rows with TEXT fields in a sample table, accessed constantly. We average
> about 4,000 - 5,000 queries per second - all from web traffic. As you can

99% is reads? and probably the same data over and over again? You might
want to think about a small code change to cache sections of page output in
memory for the most commonly generated pages (there are usually 3 or 4 that
account for 25% to 50% of web traffic -- starting pages).

The fact you're getting 5k queries/second off IDE drives tells me most of
the active data is in memory -- so your actual working data set is probably
quite small (less than 10% of the 20GB).

If the above is all true (mostly reads, smallish dataset, etc.) and the
database is not growing very quickly, you might want to look into RAM and
RAM bandwidth over disk. An Opteron with 8GB ram using the same old IDE
drives. Get a mobo with a SCSI raid controller in it, so the disk component
can be upgraded in the future (when necessary).
On Tue, 2004-08-10 at 13:17, Jason Coene wrote:
> Hi All,
>
> We're currently running Postgres 7.4.1 on FreeBSD 5.2, a Dual Xeon 2.4, 2GB
> ECC, 3Ware Serial ATA RAID 5 w/ 4 disks (SLOW!!).
>
> Our database is about 20GB on disk, we have some quite large tables - 2M
> rows with TEXT fields in a sample table, accessed constantly. We average
> about 4,000 - 5,000 queries per second - all from web traffic. As you can
> imagine, we're quite disk limited and checkpoints can be killer.
> Additionally, we see queries and connections getting serialized due to
> queries that take a long time (5 sec or so) while waiting on disk access.
> No fun at all.
>
> We've tweaked everything long and hard, and at the end of the day, the disk
> is killing us.
>
> We're looking to upgrade our server - or rather, replace it as it has no
> upgrade path to SCSI. I'm considering going Opteron (though right now we
> don't need more CPU time), and am looking for suggestions on what an optimal
> RAID configuration may look like (disks, controller, cache setting). We're
> in the market to buy right now - any good vendor suggestions?

I've had very good luck with LSI MegaRAID controllers with battery backed
cache. The amount of cache doesn't seem as important as having it, and
having it set for write back.

After that, 2 gigs or more of memory is the next improvement. After that,
the speed of the memory.
> Our database is about 20GB on disk, we have some quite large tables - 2M
> rows with TEXT fields in a sample table, accessed constantly. We average
> about 4,000 - 5,000 queries per second - all from web traffic. As you can

99% is reads? and probably the same data over and over again? You might
want to think about a small code change to cache sections of page output in
memory for the most commonly generated pages (there are usually 3 or 4 that
account for 25% to 50% of web traffic -- starting pages).

The fact you're getting 5k queries/second off IDE drives tells me most of
the active data is in memory -- so your actual working data set is probably
quite small (less than 10% of the 20GB).

If the above is all true (mostly reads, smallish dataset, etc.) and the
database is not growing very quickly, you might want to look into RAM and
RAM bandwidth over disk. An Opteron with 8GB ram using the same old IDE
drives. Get a mobo with a SCSI raid controller in it, so the disk component
can be upgraded in the future (when necessary).
> We're currently running Postgres 7.4.1 on FreeBSD 5.2, a Dual Xeon 2.4,
> 2GB ECC, 3Ware Serial ATA RAID 5 w/ 4 disks (SLOW!!).

Cheap solution while you look for another server:

Try to use something other than RAID 5. You have 4 disks, so you could use
a striping+mirroring RAID which would boost performance. You can switch
with minimal downtime (copy files to another HDD, change RAID parameters,
copy again...) - maybe 1 hour? If your hardware supports it, of course.

And tell us how it works!
> The issue that I think we're seeing is that the performance on the 3Ware
> RAID is quite bad, watching FreeBSD systat will show it at "100% busy" at
> around "3.5 MB/s". When it needs to seek across a table (for, say, an
> aggregate function - typically a COUNT()), it slows the entire server down
> while working on the disk. Additionally, VACUUM's make the server
> practically useless. We have indexes on everything that's used in queries,
> and the planner is using them.

It sounds to me like your application is CPU bound, except when
vacuuming...then your server is just overloaded. A higher performance i/o
system will help when vacuuming and checkpointing but will not solve the
overall problem.

With a (good & well supported) battery backed raid controller you can turn
fsync back on which will help you with your i/o storm issues (plus the
safety issue). This will be particularly important if you follow my next
suggestion.

One thing you might consider is materialized views. Your aggregate
functions are killing you...try to avoid using them (except min/max on an
index). Just watch out for mutable functions like now().

http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html

An application specific approach is to use triggers to keep the data you
need in as close to query form as possible...you can reap enormous savings
particularly if your queries involve 3 or more tables or have large
aggregate scans.

Merlin
On Wed, 2004-08-11 at 17:51, Merlin Moncure wrote:
> One thing you might consider is materialized views. Your aggregate
> functions are killing you...try to avoid using them (except min/max on
> an index). Just watch out for mutable functions like now().
>
> http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html
>
> An application specific approach is to use triggers to keep the data you
> need in as close to query form as possible...you can reap enormous
> savings particularly if your queries involve 3 or more tables or have
> large aggregate scans.

I thought materialized views support in pgsql was experimental as yet. Is
the pg mat-view code up to production servers? Also, do you have to delete
mat-views before you dump the db, or does dump automatically not dump the
mat-view data?

Sanjay.
> On Wed, 2004-08-11 at 17:51, Merlin Moncure wrote:
>
> > One thing you might consider is materialized views. Your aggregate
> > functions are killing you...try to avoid using them (except min/max on
> > an index). Just watch out for mutable functions like now().
> >
> > http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html
> >
> > An application specific approach is to use triggers to keep the data you
> > need in as close to query form as possible...you can reap enormous
> > savings particularly if your queries involve 3 or more tables or have
> > large aggregate scans.
>
> I thought materialized views support in pgsql was experimental as yet. Is
> the pg mat-view code up to production servers? Also, do you have to delete
> mat-views before you dump the db, or does dump automatically not dump the
> mat-view data?

I think you are thinking about 100% 'true' materialized views. In that
case the answer is no, the server does not have them. The GeneralBits
article describes how to emulate them through pl/pgsql triggers. I just
bumped into the article yesterday and was very impressed by it, I have to
admit. Note: I have never tried the method, but it should work. I cc'd
the author, who perhaps might chime in and tell you more about them.

Materialized views can give performance savings so good that the tpc people
had to ban them from benchmarks because they skewed results...:) In
postgres, they can help a lot with aggregates...there are many gotchas tho,
for example keeping a count() up to date can get kind of tricky. If you
can get them to work, the filesystem cache efficiency will rocket
upwards...YMMV.

Getting back on topic, I missed the original post where the author stated
his problems were i/o related, not cpu (contrary to my speculation). I
wonder what his insert/update load is?

Merlin
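[To make the trigger-emulation idea above concrete, here is a minimal
sketch; the table, column, and function names are hypothetical rather than
from Jason's schema, it assumes plpgsql is installed and that a summary row
already exists for each topic, and it uses 7.4-era quoting since
dollar-quoting did not exist yet.]

    -- summary table kept up to date by triggers, so hot pages read one row
    -- instead of running COUNT(*) over the comments table
    create table topic_comment_counts (
        topicid   int4 primary key,
        ncomments int4 not null default 0
    );

    create function maintain_comment_count() returns trigger as '
    begin
        if TG_OP = ''INSERT'' then
            update topic_comment_counts set ncomments = ncomments + 1
             where topicid = NEW.topicid;
        elsif TG_OP = ''DELETE'' then
            update topic_comment_counts set ncomments = ncomments - 1
             where topicid = OLD.topicid;
        end if;
        return null;  -- after trigger, return value is ignored
    end;
    ' language plpgsql;

    create trigger comments_count_trg after insert or delete on comments
        for each row execute procedure maintain_comment_count();

The cost is a little extra write I/O per insert/delete, plus the
count-maintenance gotchas Merlin mentions (for example, concurrent inserts
on the same topic serialize on the summary row).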
Thanks for all the feedback. To clear it up, we are definitely not CPU
bound at the moment. Any slowdown seems to be disk dependent, or from
serialization due to a long query (due to disk).

We do have a lot of INSERT/UPDATE calls, specifically on tables that track
user sessions, then of course things like comments, etc (where we'll see
10-30 INSERTs per second, with a TEXT field, and hundreds of reads per
second). Additionally, our system does use a lot of aggregate functions.
I'll look into materialized views, it sounds like it may be worth
implementing.

One question I do have though - you specifically mentioned NOW() as
something to watch out for, in that it's mutable. We typically use COUNT()
as a subselect to retrieve the number of associated rows to the current
query. Additionally, we use NOW a lot, primarily to detect the status of a
date, i.e.:

SELECT id FROM subscriptions WHERE userid = 11111 AND timeend > NOW();

Is there a better way to do this? I was under the impression that NOW()
was pretty harmless, just to return a current timestamp.

Based on feedback, I'm looking at a minor upgrade of our RAID controller to
a 3ware 9000 series (SATA with cache, battery backup optional), and
re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
and an hour of downtime, so I figure that it's worth it for us to give it a
shot.

Thanks,

Jason

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Merlin Moncure
Sent: Wednesday, August 11, 2004 1:04 PM
To: skpobox@gawab.com
Cc: Postgresql Performance; jgardner@jonathangardner.net
Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database

> On Wed, 2004-08-11 at 17:51, Merlin Moncure wrote:
>
> > One thing you might consider is materialized views. Your aggregate
> > functions are killing you...try to avoid using them (except min/max on
> > an index). Just watch out for mutable functions like now().
> >
> > http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html
> >
> > An application specific approach is to use triggers to keep the data you
> > need in as close to query form as possible...you can reap enormous
> > savings particularly if your queries involve 3 or more tables or have
> > large aggregate scans.
>
> I thought materialized views support in pgsql was experimental as yet. Is
> the pg mat-view code up to production servers? Also, do you have to delete
> mat-views before you dump the db, or does dump automatically not dump the
> mat-view data?

I think you are thinking about 100% 'true' materialized views. In that
case the answer is no, the server does not have them. The GeneralBits
article describes how to emulate them through pl/pgsql triggers. I just
bumped into the article yesterday and was very impressed by it, I have to
admit. Note: I have never tried the method, but it should work. I cc'd
the author, who perhaps might chime in and tell you more about them.

Materialized views can give performance savings so good that the tpc people
had to ban them from benchmarks because they skewed results...:) In
postgres, they can help a lot with aggregates...there are many gotchas tho,
for example keeping a count() up to date can get kind of tricky. If you
can get them to work, the filesystem cache efficiency will rocket
upwards...YMMV.

Getting back on topic, I missed the original post where the author stated
his problems were i/o related, not cpu (contrary to my speculation). I
wonder what his insert/update load is?
Merlin
Jason,

One suggestion I have: stay away from Adaptec ZCR RAID products. We've
been doing testing on them, and they don't perform well at all.

--brian

On Aug 11, 2004, at 1:08 PM, Jason Coene wrote:
> Thanks for all the feedback. To clear it up, we are definitely not CPU
> bound at the moment. Any slowdown seems to be disk dependent, or from
> serialization due to a long query (due to disk).
> We do have a lot of INSERT/UPDATE calls, specifically on tables that track
> user sessions, then of course things like comments, etc (where we'll see
> 10-30 INSERTs per second, with a TEXT field, and hundreds of reads per
> second). Additionally, our system does use a lot of aggregate functions.
> I'll look into materialized views, it sounds like it may be worth
> implementing.

Right. The point is: is your i/o bottleneck on the read side or the write
side? With 10-30 inserts/sec and fsync off, it's definitely on the read
side. What's interesting is that such a low insert load is causing i/o
storm problems. How does your app run with fsync on?

With read-bound i/o problems, you might want to consider upgrading memory
first to get better cache efficiency. You may want to consider Opteron for
>4GB allocations (yummy!).

The good news is that read problems are usually solvable by being clever,
whereas write problems require hardware.

> One question I do have though - you specifically mentioned NOW() as
> something to watch out for, in that it's mutable. We typically use

This is specifically with regards to materialized views. Mutable functions
cause problems because when they are pushed onto the view, they are
refreshed...something to watch out for.

The trick with MVs is to increase your filesystem cache efficiency. The
big picture is to keep frequently read data in a single place to make
better benefit of cache. Aggregates naturally read multiple rows to return
a single row's worth of data, so you want to target them first. This all
comes at a cost of update I/O time and some application complexity.

> as a subselect to retrieve the number of associated rows to the current
> query. Additionally, we use NOW a lot, primarily to detect the status of
> a date, i.e.:

Might want to check if your application middleware (php?) exposes
PQntuples()...this is a zero cost way to get the same information.

> Based on feedback, I'm looking at a minor upgrade of our RAID controller
> to a 3ware 9000 series (SATA with cache, battery backup optional), and
> re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
> and an hour of downtime, so I figure that it's worth it for us to give it
> a shot.

p.s. you can also increase cache efficiency by reducing database size, for
example use int2/int4 vs. numerics.

Good luck!
>
> Right. The point is: is your i/o bottleneck on the read side or the write
> side? With 10-30 inserts/sec and fsync off, it's definitely on the read
> side. What's interesting is that such a low insert load is causing i/o
> storm problems. How does your app run with fsync on?
>
> With read-bound i/o problems, you might want to consider upgrading memory
> first to get better cache efficiency. You may want to consider Opteron
> for >4GB allocations (yummy!).
>
> The good news is that read problems are usually solvable by being clever,
> whereas write problems require hardware.

The difference fsync being off makes seems to be that it allows the server
to write in groups instead of scattering our INSERT/UPDATE calls all over -
it helps keep things going. When a checkpoint occurs, reads slow down
there. Normal reads are usually quite fast, aside from some reads.

A good example: a comments table where users submit TEXT data. A common
query is to find the last 5 comments a user has submitted. The scan, while
using an index, takes a considerable amount of time (> 0.5 sec is about as
good as it gets). Again, it's using an index on the single WHERE clause
(userid = int). The field that's used to ORDER BY (timestamp) is also
indexed.

I'm wondering why our PG server is using so little memory... The system has
2GB of memory, though only around 200MB of it are used. Is there a PG
setting to force more memory usage towards the cache? Additionally, we use
FreeBSD. I've heard that Linux may manage that memory better, any truth
there? Sorry if I'm grabbing at straws here :)

> > One question I do have though - you specifically mentioned NOW() as
> > something to watch out for, in that it's mutable. We typically use
>
> This is specifically with regards to materialized views. Mutable
> functions cause problems because when they are pushed onto the view,
> they are refreshed...something to watch out for.
>
> The trick with MVs is to increase your filesystem cache efficiency. The
> big picture is to keep frequently read data in a single place to make
> better benefit of cache. Aggregates naturally read multiple rows to
> return a single row's worth of data, so you want to target them first.
> This all comes at a cost of update I/O time and some application
> complexity.
>
> > as a subselect to retrieve the number of associated rows to the current
> > query. Additionally, we use NOW a lot, primarily to detect the status
> > of a date, i.e.:
>
> Might want to check if your application middleware (php?) exposes
> PQntuples()...this is a zero cost way to get the same information.

Thanks, I'll look into it. We use C and PHP.

> > Based on feedback, I'm looking at a minor upgrade of our RAID controller
> > to a 3ware 9000 series (SATA with cache, battery backup optional), and
> > re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
> > and an hour of downtime, so I figure that it's worth it for us to give
> > it a shot.
>
> p.s. you can also increase cache efficiency by reducing database size,
> for example use int2/int4 vs. numerics.

I've gone through and optimized data types as much as possible. I'll see
what else we can do w/o causing downtime once PG 8 is ready to go and we
can change data types on the fly.

Thanks,

Jason
On Aug 11, 2004, at 3:18 PM, Jason Coene wrote:
>
> I'm wondering why our PG server is using so little memory... The system
> has 2GB of memory, though only around 200MB of it are used. Is there a
> PG setting to force more memory usage towards the cache? Additionally,
> we use FreeBSD. I've heard that Linux may manage that memory better,
> any truth there? Sorry if I'm grabbing at straws here :)
>

i don't know about freebsd, but linux is very aggressive about using unused
memory for disk cache. we have a dedicated linux box running pg with 2gb
of memory; about 250mb of memory is being used by processes
(system+postgres) and shared memory (postgres only), and there is 1.75gb of
disk buffers in use in the kernel. this particular database is only about
4gb, so a good portion of the db resides in memory, certainly most of the
active set. the disk subsystem is a 6 disk scsi u160 raid array which
performs pretty well when there is io.
> I'm wondering why our PG server is using so little memory... The system has
> 2GB of memory, though only around 200MB of it are used. Is there a PG

This is the second time you've said this. Surely you're not implying there
is 1.8GB Free Memory -- rather than 1.8GB in Buffers or Cache.

Send output of the below:

sysctl vm
sysctl -a | grep buffers
top | grep -E "(Mem|Swap):"
On Wed, 2004-08-11 at 17:31, Brian Hirt wrote:
> On Aug 11, 2004, at 3:18 PM, Jason Coene wrote:
> >
> > I'm wondering why our PG server is using so little memory... The system
> > has 2GB of memory, though only around 200MB of it are used. Is there a
> > PG setting to force more memory usage towards the cache? Additionally,
> > we use FreeBSD. I've heard that Linux may manage that memory better,
> > any truth there? Sorry if I'm grabbing at straws here :)
> >
> i don't know about freebsd, but linux is very aggressive about using
> unused memory for disk cache. we have a dedicated linux box running pg

Aggressive indeed.. I'm stuck with the version that has a tendency to swap
out active processes rather than abandon disk cache -- it gets very
annoying!
> -----Original Message-----
> From: Rod Taylor [mailto:pg@rbt.ca]
> Sent: Wednesday, August 11, 2004 5:46 PM
> To: Jason Coene
> Cc: 'Merlin Moncure'; Postgresql Performance
> Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database
>
> > I'm wondering why our PG server is using so little memory... The system
> > has 2GB of memory, though only around 200MB of it are used. Is there a PG
>
> This is the second time you've said this. Surely you're not implying
> there is 1.8GB Free Memory -- rather than 1.8GB in Buffers or Cache.

Hi Rod,

I was looking at top and vmstat - which always show under 300MB "Active".
We may hit 400MB at peak. Everything I see (though this isn't my area of
expertise) points to most of the memory simply being unused. Results
below, am I missing something?

Jason

> Send output of the below:
>
> sysctl vm

d01> sysctl vm
vm.vmtotal:
System wide totals computed every five seconds: (values in kilobytes)
===============================================
Processes: (RUNQ: 1 Disk Wait: 0 Page Wait: 0 Sleep: 149)
Virtual Memory: (Total: 2101614K, Active 440212K)
Real Memory: (Total: 2023532K Active 327032K)
Shared Virtual Memory: (Total: 14356K Active: 3788K)
Shared Real Memory: (Total: 4236K Active: 2456K)
Free Memory Pages: 88824K

vm.loadavg: { 0.46 0.41 0.42 }
vm.v_free_min: 3312
vm.v_free_target: 13997
vm.v_free_reserved: 749
vm.v_inactive_target: 20995
vm.v_cache_min: 13997
vm.v_cache_max: 27994
vm.v_pageout_free_min: 34
vm.pageout_algorithm: 0
vm.swap_enabled: 1
vm.swap_async_max: 4
vm.dmmax: 32
vm.nswapdev: 1
vm.swap_idle_threshold1: 2
vm.swap_idle_threshold2: 10
vm.v_free_severe: 2030
vm.stats.sys.v_swtch: 627853362
vm.stats.sys.v_trap: 3622664114
vm.stats.sys.v_syscall: 1638589210
vm.stats.sys.v_intr: 3250875036
vm.stats.sys.v_soft: 1930666043
vm.stats.vm.v_vm_faults: 3197534554
vm.stats.vm.v_cow_faults: 2999625102
vm.stats.vm.v_cow_optim: 10093309
vm.stats.vm.v_zfod: 3603956919
vm.stats.vm.v_ozfod: 3104475907
vm.stats.vm.v_swapin: 3353
vm.stats.vm.v_swapout: 3382
vm.stats.vm.v_swappgsin: 3792
vm.stats.vm.v_swappgsout: 7213
vm.stats.vm.v_vnodein: 14675
vm.stats.vm.v_vnodeout: 140671
vm.stats.vm.v_vnodepgsin: 24330
vm.stats.vm.v_vnodepgsout: 245840
vm.stats.vm.v_intrans: 3643
vm.stats.vm.v_reactivated: 35038
vm.stats.vm.v_pdwakeups: 26984
vm.stats.vm.v_pdpages: 335769007
vm.stats.vm.v_dfree: 8
vm.stats.vm.v_pfree: 1507856856
vm.stats.vm.v_tfree: 430723755
vm.stats.vm.v_page_size: 4096
vm.stats.vm.v_page_count: 512831
vm.stats.vm.v_free_reserved: 749
vm.stats.vm.v_free_target: 13997
vm.stats.vm.v_free_min: 3312
vm.stats.vm.v_free_count: 968
vm.stats.vm.v_wire_count: 62039
vm.stats.vm.v_active_count: 44233
vm.stats.vm.v_inactive_target: 20995
vm.stats.vm.v_inactive_count: 343621
vm.stats.vm.v_cache_count: 21237
vm.stats.vm.v_cache_min: 13997
vm.stats.vm.v_cache_max: 27994
vm.stats.vm.v_pageout_free_min: 34
vm.stats.vm.v_interrupt_free_min: 2
vm.stats.vm.v_forks: 45205536
vm.stats.vm.v_vforks: 74315
vm.stats.vm.v_rforks: 0
vm.stats.vm.v_kthreads: 2416
vm.stats.vm.v_forkpages: 1464383994
vm.stats.vm.v_vforkpages: 4259727
vm.stats.vm.v_rforkpages: 0
vm.stats.vm.v_kthreadpages: 0
vm.stats.misc.zero_page_count: 709
vm.stats.misc.cnt_prezero: -972664922
vm.max_proc_mmap: 34952
vm.msync_flush_flags: 3
vm.idlezero_enable: 1
vm.idlezero_maxrun: 16
vm.max_launder: 32
vm.pageout_stats_max: 13997
vm.pageout_full_stats_interval: 20
vm.pageout_stats_interval: 5
vm.pageout_stats_free_max: 5
vm.swap_idle_enabled: 0
vm.defer_swapspace_pageouts: 0
vm.disable_swapspace_pageouts: 0
vm.pageout_lock_miss: 0
vm.zone:
ITEM SIZE LIMIT USED FREE REQUESTS
FFS2 dinode: 256, 0, 30156, 4389, 20093512
FFS1 dinode: 128, 0, 0, 0, 0
FFS inode: 140, 0, 30156, 4340, 20093512
SWAPMETA: 276, 121576, 16, 264, 44599
ripcb: 180, 32780, 0, 132, 289
hostcache: 88, 15390, 6, 309, 741
syncache: 104, 15390, 0, 418, 44592418
tcptw: 56, 6603, 3, 1204, 224900
tcpcb: 368, 32769, 136, 4264, 44594153
inpcb: 180, 32780, 139, 4437, 44594153
udpcb: 180, 32780, 10, 144, 85953
unpcb: 140, 32788, 6, 246, 143982
socket: 240, 32768, 152, 4248, 44824378
KNOTE: 64, 0, 0, 434, 7561
PIPE: 172, 0, 8, 222, 352848
NFSNODE: 460, 0, 1596, 92, 2419
NFSMOUNT: 424, 0, 1, 17, 1
DIRHASH: 1024, 0, 238, 86, 287
L VFS Cache: 291, 0, 165, 160, 11956
S VFS Cache: 68, 0, 38283, 3430, 3795133
NAMEI: 1024, 0, 0, 240, 907013101
VNODEPOLL: 60, 0, 1, 131, 2
VNODE: 260, 0, 34104, 36, 34104
g_bio: 136, 0, 0, 5887, 551700514
VMSPACE: 236, 0, 152, 987, 45279840
UPCALL: 44, 0, 0, 0, 0
KSE: 64, 0, 1224, 202, 1224
KSEGRP: 120, 0, 1224, 109, 1224
THREAD: 312, 0, 1224, 84, 1224
PROC: 452, 0, 261, 963, 45282231
Files: 68, 0, 782, 5413, 719968279
4096: 4096, 0, 441, 1935, 90066743
2048: 2048, 0, 237, 423, 25077
1024: 1024, 0, 23, 157, 448114
512: 512, 0, 108, 140, 770519
256: 256, 0, 458, 1102, 70685682
128: 128, 0, 1904, 1041, 186085712
64: 64, 0, 5124, 13042, 1404464781
32: 32, 0, 1281, 1302, 839881182
16: 16, 0, 842, 1548, 1712031683
DP fakepg: 72, 0, 0, 0, 0
PV ENTRY: 28, 2166780, 157829, 769251, 56650653911
MAP ENTRY: 60, 0, 6716, 33280, 2270740046
KMAP ENTRY: 60, 65538, 24, 702, 152938
MAP: 160, 0, 9, 41, 2
VM OBJECT: 132, 0, 21596, 10654, 1136467083
128 Bucket: 524, 0, 3115, 0, 0
64 Bucket: 268, 0, 200, 10, 0
32 Bucket: 140, 0, 191, 5, 0
16 Bucket: 76, 0, 49, 3, 0
UMA Hash: 128, 0, 0, 31, 0
UMA Slabs: 34, 0, 3095, 95, 0
UMA Zones: 432, 0, 52, 2, 0

vm.kvm_size: 1069543424
vm.kvm_free: 364900352

> sysctl -a | grep buffers

d01> sysctl -a | grep buffers
vfs.numdirtybuffers: 52
vfs.lodirtybuffers: 909
vfs.hidirtybuffers: 1819
vfs.numfreebuffers: 7146
vfs.lofreebuffers: 404
vfs.hifreebuffers: 808

> top | grep -E "(Mem|Swap):"

d01> top | grep -E "(Mem|Swap):"
Mem: 173M Active, 1346M Inact, 242M Wired, 77M Cache, 112M Buf, 5784K Free
Swap: 4096M Total, 124K Used, 4096M Free
On Wed, 2004-08-11 at 18:03, Jason Coene wrote:
> > -----Original Message-----
> > From: Rod Taylor [mailto:pg@rbt.ca]
> > Sent: Wednesday, August 11, 2004 5:46 PM
> > To: Jason Coene
> > Cc: 'Merlin Moncure'; Postgresql Performance
> > Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database
> >
> > > I'm wondering why our PG server is using so little memory... The system
> > > has 2GB of memory, though only around 200MB of it are used. Is there a PG
> >
> > This is the second time you've said this. Surely you're not implying
> > there is 1.8GB Free Memory -- rather than 1.8GB in Buffers or Cache.
>
> Hi Rod,
>
> I was looking at top and vmstat - which always show under 300MB "Active".
> We may hit 400MB at peak. Everything I see (though this isn't my area of
> expertise) points to most of the memory simply being unused. Results
> below, am I missing something?

This looks fine. The memory is not unused (only 5MB is actually empty) but
is being used for disk cache.

Active is memory used by programs and would need to be swapped if this
space was needed.

Inactive is memory that is generally dirty. Disk cache is often here. In
your case, you likely write to the same pages you're reading from -- which
is why this number is so big. It also explains why a checkpoint is a
killer; a large chunk of this memory set needs to be pushed to disk.

Cache is memory used generally for disk cache that is not dirty. It's been
read from the disk and could be cleared immediately if necessary.

Wired is memory that cannot be swapped. In your case, Shared Memory is
probably Wired (this is good). There is another sysctl to check and set
whether it is wired or swappable.

Interesting (if dry) read:
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/index.html
> > Hi Rod,
> >
> > I was looking at top and vmstat - which always show under 300MB "Active".
> > We may hit 400MB at peak. Everything I see (though this isn't my area of
> > expertise) points to most of the memory simply being unused. Results
> > below, am I missing something?
>
> This looks fine. The memory is not unused (only 5MB is actually empty)
> but is being used for disk cache.
>
> Active is memory used by programs and would need to be swapped if this
> space was needed.
>
> Inactive is memory that is generally dirty. Disk cache is often here. In
> your case, you likely write to the same pages you're reading from --
> which is why this number is so big. It also explains why a checkpoint is
> a killer; a large chunk of this memory set needs to be pushed to disk.
>
> Cache is memory used generally for disk cache that is not dirty. It's
> been read from the disk and could be cleared immediately if necessary.
>
> Wired is memory that cannot be swapped. In your case, Shared Memory is
> probably Wired (this is good). There is another sysctl to check and set
> whether it is wired or swappable.
>
> Interesting (if dry) read:
> http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/index.html

Ah, thanks - I didn't know that Inactive was still being used. I'm glad to
know that at least the OS is using up the free memory for disk cache.
Shared memory is Wired, set via sysctl. Thanks for the info! It sounds
like adding more memory would help cache more data - I'll look into the
upgrade.

Jason
"Jason Coene" <jcoene@gotfrag.com> writes: > A good example, a comments table where users submit TEXT data. A common > query is to find the last 5 comments a user has submitted. The scan, while > using an index, takes a considerable amount of time (> 0.5 sec is about as > good as it gets). Again, it's using an index on the single WHERE clause > (userid = int). The field that's used to ORDER BY (timestamp) is also > indexed. You mean you are doing SELECT ... WHERE userid = 42 ORDER BY timestamp DESC LIMIT 5; and hoping that separate indexes on userid and timestamp will get the job done? They won't. There are only two possible plans for this, neither very good: select all of user 42's posts and sort them, or scan timewise backwards through *all* posts looking for the last 5 from user 42. If you do this enough to justify a specialized index, I would suggest a two-column index on (userid, timestamp). You will also need to tweak the query, because the planner is not quite smart enough to deduce that such an index is applicable to the given sort order: SELECT ... WHERE userid = 42 ORDER BY userid DESC, timestamp DESC LIMIT 5; This should generate an index-scan-backwards plan that will execute nigh instantaneously, because it will only fetch the rows you really want. You might or might not be able to drop the separate indexes on userid and timestamp, depending on what other queries you might have that need them. But you should be paying attention to what plans you are really getting (see EXPLAIN) rather than just assuming that some indexes chosen at random will do what you need. regards, tom lane
[snip]
> One question I do have though - you specifically mentioned NOW() as
> something to watch out for, in that it's mutable. We typically use COUNT()
> as a subselect to retrieve the number of associated rows to the current
> query. Additionally, we use NOW a lot, primarily to detect the status of a
> date, i.e.:
>
> SELECT id FROM subscriptions WHERE userid = 11111 AND timeend > NOW();
>
> Is there a better way to do this? I was under the impression that NOW()
> was pretty harmless, just to return a current timestamp.

NOW() will trigger unnecessary sequence scans. Since it is unknown, when a
prepared query or function is planned, what the value will be when the
statement is run, the planner plans the query with now() as a variable.
This can push the planner to a seq scan over an index scan. I have seen
this time and time again.

You can create your own immutable now(), but don't use it in functions or
prepared queries or you will get wrong results.

> Based on feedback, I'm looking at a minor upgrade of our RAID controller
> to a 3ware 9000 series (SATA with cache, battery backup optional), and
> re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
> and an hour of downtime, so I figure that it's worth it for us to give it
> a shot.
>
> Thanks,
>
> Jason

Russell Smith
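[Purely as an illustration of the workaround Russell mentions - not
something he posted, and subject to his warning - the usual shape is a
wrapper function falsely declared IMMUTABLE so the planner folds it to a
constant at plan time; the function name here is made up.]

    -- "immutable now" wrapper: the IMMUTABLE marking is a lie to the planner,
    -- so never use it inside functions or prepared statements, where a cached
    -- plan would keep returning a stale timestamp
    CREATE FUNCTION now_immutable() RETURNS timestamptz AS
        'SELECT now()' LANGUAGE sql IMMUTABLE;

    SELECT id FROM subscriptions
    WHERE userid = 11111 AND timeend > now_immutable();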
> You mean you are doing
> SELECT ... WHERE userid = 42 ORDER BY timestamp DESC LIMIT 5;
> and hoping that separate indexes on userid and timestamp will get the
> job done? They won't. There are only two possible plans for this,
> neither very good: select all of user 42's posts and sort them, or
> scan timewise backwards through *all* posts looking for the last 5 from
> user 42.

Wow! I did try the method you state below (including the WHERE-restricted
column in the sort by, and creating a two-column index), and it did execute
much faster (even on odd userids, to avoid cached results as much as
possible).

We have a lot of:

SELECT whatever
FROM ourtable
WHERE field1 = X
AND field2 = Y
AND field3 = Z
ORDER BY id DESC
LIMIT 5

With indexes:

ourtable(id)
ourtable(field1, field2, field3)

Is it standard procedure with postgres to include any fields listed in
WHERE in the ORDER BY, and create a single index for only the ORDER BY
fields (in order of appearance, of course)?

> If you do this enough to justify a specialized index, I would suggest a
> two-column index on (userid, timestamp). You will also need to tweak
> the query, because the planner is not quite smart enough to deduce that
> such an index is applicable to the given sort order:
> SELECT ... WHERE userid = 42 ORDER BY userid DESC, timestamp DESC LIMIT 5;
> This should generate an index-scan-backwards plan that will execute nigh
> instantaneously, because it will only fetch the rows you really want.
>
> You might or might not be able to drop the separate indexes on userid
> and timestamp, depending on what other queries you might have that need
> them. But you should be paying attention to what plans you are really
> getting (see EXPLAIN) rather than just assuming that some indexes chosen
> at random will do what you need.
>
> regards, tom lane

We do many varied queries on nearly every table - our data is highly
relational, and we have a lot of indexes. I thought the planner would pick
up the right index via constraints and not require them in ORDER BY...
EXPLAIN ANALYZE says that the indexes are being used, ala:

gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
timestamp DESC LIMIT 5;
                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1608.43..1608.45 rows=5 width=8) (actual time=0.292..0.317 rows=5 loops=1)
   ->  Sort  (cost=1608.43..1609.45 rows=407 width=8) (actual time=0.287..0.295 rows=5 loops=1)
         Sort Key: "timestamp"
         ->  Index Scan using comments_ix_userid on comments  (cost=0.00..1590.79 rows=407 width=8) (actual time=0.031..0.190 rows=35 loops=1)
               Index Cond: (userid = 51)
 Total runtime: 0.375 ms
(6 rows)

Is this the wrong procedure? Your suggested syntax seems much more
efficient, but I don't quite understand exactly why, as PG is using our
existing indexes...
gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
userid DESC, timestamp DESC LIMIT 5;
                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..19.90 rows=5 width=12) (actual time=0.040..0.076 rows=5 loops=1)
   ->  Index Scan Backward using comments_ix_userid_timestamp on comments  (cost=0.00..1620.25 rows=407 width=12) (actual time=0.035..0.054 rows=5 loops=1)
         Index Cond: (userid = 51)
 Total runtime: 0.134 ms
(4 rows)

Note: This was done after adding an index on comments (userid, timestamp).

Regards,

Jason
"Jason Coene" <jcoene@gotfrag.com> writes: > We have a lot of: > SELECT whatever > FROM ourtable > WHERE field1 = X > AND field2 = Y > AND field3 = Z > ORDER BY id DESC > LIMIT 5 > With indexes: > ourtable(id) > ourtable(field1, field2, field3) > Is it standard procedure with postgres to include any fields listed in WHERE > in the ORDER BY, and create a single index for only the ORDER BY fields (in > order of appearance, of course)? It depends. If the X/Y/Z constraint is already pretty selective, then it seems sufficient to me to pick up the matching rows (using the 3-field index), sort them by id, and take the first 5. The case where the extra-index-column trick is useful is where the WHERE clause *isn't* real selective and so a lot of rows would have to be sorted. In your previous example, I imagine you have a lot of prolific posters and so "all posts by userid 42" can be a nontrivial set. The double-column index lets you skip the sort and just pull out the required rows by scanning from the end of the range of userid = 42 entries. > gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY > timestamp DESC LIMIT 5; > QUERY PLAN > ---------------------------------------------------------------------------- > ------------------------------------------------------------------- > Limit (cost=1608.43..1608.45 rows=5 width=8) (actual time=0.292..0.317 > rows=5 loops=1) > -> Sort (cost=1608.43..1609.45 rows=407 width=8) (actual > time=0.287..0.295 rows=5 loops=1) > Sort Key: "timestamp" > -> Index Scan using comments_ix_userid on comments > (cost=0.00..1590.79 rows=407 width=8) (actual time=0.031..0.190 rows=35 > loops=1) > Index Cond: (userid = 51) > Total runtime: 0.375 ms > (6 rows) This example looks fine, but since userid 51 evidently only has 35 posts, there's not much time needed to read 'em all and sort 'em. The place where the double-column index will win big is on userids with hundreds of posts. You have to keep in mind that each index costs time to maintain during inserts/updates. So adding an index just because it makes a few queries a little faster probably isn't a win. You need to make tradeoffs. regards, tom lane
On Wed, 11 Aug 2004 20:29:04 -0400, Jason Coene <jcoene@gotfrag.com> wrote:

> gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
> timestamp DESC LIMIT 5;
>                               QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=1608.43..1608.45 rows=5 width=8) (actual time=0.292..0.317 rows=5 loops=1)
>    ->  Sort  (cost=1608.43..1609.45 rows=407 width=8) (actual time=0.287..0.295 rows=5 loops=1)
>          Sort Key: "timestamp"
>          ->  Index Scan using comments_ix_userid on comments  (cost=0.00..1590.79 rows=407 width=8) (actual time=0.031..0.190 rows=35 loops=1)
>                Index Cond: (userid = 51)
>  Total runtime: 0.375 ms
> (6 rows)

Well, you have to read it from the bottom.

- Index Scan using comments_ix_userid: it selects all records for your
user. rows=407: there are 407 rows.

- Sort (cost=1608.43..1609.45 rows=407 width=8): it sorts them to find the
5 most recent.

So basically you grab 407 rows to return only 5, so you do 80x more disk
I/O than necessary. It is likely that posts from all users are interleaved
in the table, so this probably translates directly into 407 page fetches.

Note: EXPLAIN ANALYZE will only give good results the first time you run
it. The second time, all the data is in the cache, so it looks faster than
it really is.

> gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
> userid DESC, timestamp DESC LIMIT 5;
>                               QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..19.90 rows=5 width=12) (actual time=0.040..0.076 rows=5 loops=1)
>    ->  Index Scan Backward using comments_ix_userid_timestamp on comments  (cost=0.00..1620.25 rows=407 width=12) (actual time=0.035..0.054 rows=5 loops=1)
>          Index Cond: (userid = 51)
>  Total runtime: 0.134 ms
> (4 rows)
>
> Note: This was done after adding an index on comments (userid, timestamp)

Well, this one correctly uses the index, fetches 5 rows, and returns them.

So, excluding index page hits, your unoptimized query has >400 page
fetches, and your optimized one has 5 page fetches. Still wonder why it's
faster? Seq scan is fast when locality of reference is good. In your
case, it's very bad.
> This example looks fine, but since userid 51 evidently only has 35
> posts, there's not much time needed to read 'em all and sort 'em. The
> place where the double-column index will win big is on userids with
> hundreds of posts.
>
> You have to keep in mind that each index costs time to maintain during
> inserts/updates. So adding an index just because it makes a few queries
> a little faster probably isn't a win. You need to make tradeoffs.

IMNSHO, in Jason's case he needs to do everything possible to get his
frequently run queries going as quick as possible. ISTM he can give up a
little on the update side, especially since he is running fsync=false. A
.3-.5 sec query multiplied over 50-100 users running concurrently adds up
quick.

Ideally, you are looking up records based on a key that takes you directly
to the first record you want and is pointing to the next number of records
in ascending order. I can't stress enough how important this is so long as
you can deal with the index/update overhead.

I don't have a huge amount of experience with this in pg, but one of the
tricks we do in the ISAM world is a 'reverse date' system, so that you can
scan forwards on the key to pick up datetimes in descending order. This is
often a win because the o/s cache may assume read/forwards, giving you more
cache hits. There are a few different ways to do this, but imagine:

create table t
(
    id int,
    ts timestamp default now(),
    iv interval default ('01/01/2050'::timestamp - now())
);

create index t_idx on t(id, iv);

select * from t where id = k order by id, iv limit 5;

The above query should do a much better job pulling up data and should be
easier on your cache. A further win might be to cluster the table on this
key if the table is really big.

note: interval is a poor type to do this with, because it's a 12 byte type
(just used it here for demonstration purposes because it's easy). With a
little trickery you can stuff it into a time type or an int4 type (even
better!). If you want to be really clever you can do it without adding any
data to your table at all through functional indexes.

Since the planner can use the same index in the extraction and ordering,
you get some savings...not much, but worthwhile when applied over a lot of
users. Knowing when and how to apply multiple key/functional indexes will
make you feel like you have 10 times the database you are using right now.

Merlin
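[For what it's worth, one minimal sketch of the int4 variant Merlin alludes
to - my own illustration, not his exact method, with hypothetical table and
column names; the reverse key is stored as a plain integer and populated at
insert time by the application or a trigger.]

    -- "reverse time" as an int4: a larger epoch yields a smaller key, so a
    -- forward scan on (user_id, rev_time) returns the newest posts first,
    -- e.g. rev_time := 2147483647 - extract(epoch from now())::int4
    alter table messages add column rev_time int4;
    create index messages_user_revtime_idx on messages (user_id, rev_time);

    select * from messages where user_id = 42
    order by user_id, rev_time limit 5;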
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > The following suggestion works in two principles: one is that instead of > using timestamps for ordering, integers are quicker, The difference would be pretty marginal --- especially if you choose to use bigints instead of ints. (A timestamp is just a float8 or bigint under the hood, and is no more expensive to compare than those datatypes. Timestamps *are* expensive to convert for I/O, but comparison does not have to do that.) I wouldn't recommend kluging up your data schema just for that. regards, tom lane
> I don't have a huge amount of experience with this in pg, but one of the
> tricks we do in the ISAM world is a 'reverse date' system, so that you
> can scan forwards on the key to pick up datetimes in descending order.
> This is often a win because the o/s cache may assume read/forwards,
> giving you more cache hits. There are a few different ways to do this,
> but imagine:

I've been thinking more about this, and there is an even better way of
doing it if you are willing to beat on your data a bit. It involves the
use of sequences.

Lets revisit your id/timestamp query combination for a message board. The
assumption is you are using integer keys for all tables. You probably have
something like:

create table messages
(
    user_id int4 references users,
    topic_id int4 references topics,
    message_id serial,
    message_time timestamp default now(),
    [...]
);

The following suggestion works on two principles: one is that instead of
using timestamps for ordering, integers are quicker, and sequences have a
built in ability for reverse-ordering.

Lets define:

create sequence message_seq increment -1 start 2147483647
    minvalue 0 maxvalue 2147483647;

now we define our table:

create table messages
(
    user_id int4 references users,
    topic_id int4 references topics,
    message_id int4 default nextval('message_seq') primary key,
    message_time timestamp default now(),
    [...]
);

create index user_message_idx on messages(user_id, message_id);

-- optional
cluster user_message_idx messages;

Since the sequence is in descending order, we don't have to do any tricks
to logically reverse order the table.

-- return last k posts made by user u in descending order
select * from messages where user_id = u
order by user_id, message_id limit k;

-- return last k posts on a topic
create index topic_message_idx on messages(topic_id, message_id);
select * from messages where topic_id = t
order by topic_id, message_id limit k;

A side benefit of clustering is that there is little penalty for increasing
k because of read ahead optimization, whereas in normal scenarios your read
time scales with k (forcing small values for k). If we tended to pull up
messages by topic more frequently than user, we would cluster on
topic_message_idx instead. (If we couldn't decide, we might cluster on
message_id or not at all.)

The crucial point is that we are making this one index run really fast at
the expense of other operations. The other major point is we can use a
sequence in place of a timestamp for ordering. Using int4 vs. timestamp is
a minor efficiency win; if you are worried about > 4B rows, then stick with
timestamp.

This all boils down to a central unifying principle: organize your indices
around your expected access patterns to the data. Sorry if I'm bleating on
and on about this...I just think there is plenty of optimization room left
in there :)

Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> wrote: -- optional cluster user_message_idx messages; would one not have to repeat this operation regularly, to keep any advantage of this ? my impression was that this is a relatively heavy operation on a large table. gnari
Tom Lane wrote:
> The difference would be pretty marginal --- especially if you choose to
> use bigints instead of ints. (A timestamp is just a float8 or bigint
> under the hood, and is no more expensive to compare than those datatypes.
> Timestamps *are* expensive to convert for I/O, but comparison does not
> have to do that.) I wouldn't recommend kluging up your data schema just
> for that.

Right (int4 use was assumed). I agree, but it's kind of a 'two birds with
one stone' kind of thing, because it's easier to work with reverse ordering
integers than time values. So I claim a measurable win (the real gainer of
course being able to select and sort on the same key, which works on any
type), based on the int4-int8 difference, which is a 33% reduction in key
size.

One claim I don't have the data for is that read-forward is better than
read-back, but my gut tells me he'll get a better cache hit ratio that way.
This will be very difficult to measure.

As for kludging, using a decrementing sequence is not a bad idea if the
general tendency is to read the table backwards, even if just for
conceptual reasons. The main kludge is the int4 assumption, which (IMO)
isn't so bad. He would just have to rebuild the existing p-key in reverse
order (10$ says his keys are all already int4s), and hopefully not mess
with the application code too much. At least, it's what I would try if I
was in his shoes :)

YMMV

Merlin
> Would one not have to repeat this operation regularly to keep any
> advantage of this? My impression was that this is a relatively heavy
> operation on a large table.

Yeah, it requires an exclusive lock and a table rebuild. It might be
useful to a message board type database since (one presumes) the reads
would be concentrated over recently created data, entered after the cluster
and losing any benefit. As far as table size, bigger tables are a larger
operation but take longer to get all out of whack. Question is: what
percentage of the data turns over between maintenance periods?

Plus, there has to be a maintenance period...nobody does anything while the
table is clustering. Also, a particular method of reading the table has to
really dominate as far as user usage pattern.

So, it's pretty rare to use cluster, but it can help in some cases.

Merlin