Thread: Performance considerations for very heavy INSERT traffic
I'm in the process of developing an application which uses PostgreSQL for data storage. Our database traffic is very atypical, and as a result it has been rather challenging to figure out how to best tune PostgreSQL on what development hardware we have, as well as to figure out exactly what we should be evaluating and eventually buying for production hardware.
The vast, overwhelming majority of our database traffic is pretty much a non-stop stream of INSERTs filling up tables. It is akin to data acquisition. Several thousand clients are sending once-per-minute updates full of timestamped numerical data at our central server, which in turn performs INSERTs into several distinct tables as part of the transaction for that client. We're talking on the order of ~100 transactions per second, each containing INSERTs to multiple tables (which contain only integer and floating point columns and a timestamp column - the primary key (and only index) is on a unique integer ID for the client and the timestamp). The transaction load is spread evenly over time by having the clients send their per-minute updates at random times rather than on the exact minute mark.
There will of course be users using a web-based GUI to extract data from these tables and display them in graphs and whatnot, but the SELECT query traffic will always be considerably less frequent and intensive than the incessant INSERTs, and it's not that big a deal if the large queries take a little while to run.
This data also expires - rows with timestamps older than X days will be DELETEd periodically (once an hour or faster), such that the tables will reach a relatively stable size (pg_autovacuum is handling vacuuming for now, but considering our case, we're thinking of killing pg_autovacuum in favor of having the periodic DELETE process also do a vacuum of affected tables right after the DELETE, and then have it vacuum the other low traffic tables once a day while it's at it).
There is an aggregation layer in place which proxies the inbound data from the clients into a small(er) number of persistent postgresql backend processes. Right now we're doing one aggregator per 128 clients (so instead of 128 seperate database connections over the course of a minute for a small transaction each, there is a single database backend that is constantly committing transactions at a rate of ~ 2/second). At a test load of ~1,000 clients, we would have 8 aggregators running and 8 postgresql backends. Testing has seemed to indicate we should aggregate even harder - the planned production load is ~5,000 clients initially, but will grow to almost double that in the not-too-distant future, and that would mean ~40 backends at 128 clients each initially. Even on 8 cpus, I'm betting 40 concurrent backends doing 2 tps is much worse off than 10 backends doing 8 tps.
Test hardware right now is a dual Opteron with 4G of ram, which we've barely gotten 1,000 clients running against. Current disk hardware in testing is whatever we could scrape together (4x 3-ware PCI hardware RAID controllers, with 8 SATA drives in a RAID10 array off of each - aggregated up in a 4-way stripe with linux md driver and then formatted as ext3 with an appropriate stride parameter and data=writeback). Production will hopefully be a 4-8-way Opteron, 16 or more G of RAM, and a fiberchannel hardware raid array or two (~ 1TB available RAID10 storage) with 15krpm disks and battery-backed write cache.
I know I haven't provided a whole lot of application-level detail here, but does anyone have any general advice on tweaking postgresql to deal with a very heavy load of concurrent and almost exclusively write-only transactions? Increasing shared_buffers seems to always help, even out to half of the dev box's ram (2G). A 100ms commit_delay seemed to help, but tuning it (and _siblings) has been difficult. We're using 8.0 with the default 8k blocksize, but are strongly considering both developing against 8.1 (seems it might handle the heavy concurrency better), and re-compiling with 32k blocksize since our storage arrays will inevitably be using fairly wide stripes. Any advice on any of this (other than drop the project while you're still a little bit sane)?
--Brandon
> I know I haven't provided a whole lot of application-level detail here, You did ! What about : - using COPY instead of INSERT ? (should be easy to do from the aggregators) - using Bizgres ? (which was designed for your use case) - splitting the xlog and the data on distinct physical drives or arrays - benchmarking something else than ext3 (xfs ? reiser3 ?)
"Brandon Black" <blblack@gmail.com> wrote ...
Increasing shared_buffers seems to always help, even out to half of the dev box's ram (2G).Though officially PG does not prefer huge shared_buffers size, I did see several times that performance was boosted in case IO is the bottleneck. Also, if you want to use big bufferpool setting, make sure your version has Tom's split BufMgrLock patch(http://archives.postgresql.org/pgsql-committers/2005-03/msg00025.php), which might already in 8.0.x somewhere. And if you want to use bufferpool bigger than 2G on 64-bit machine, you may need 8.1 (http://archives.postgresql.org/pgsql-committers/2005-08/msg00221.php).Regards,Qingqing
Split your system into multiple partitions of RAID 10s.For max performance, ten drive RAID 10 for pg_xlog (This will max out a PCI-X bus) on Bus A, multiple 4/6Drive RAID 10s for tablespaces on Bus B. For max performance I would recommend using one RAID 10 for raw data tables, one for aggregate tables and one for indexes. More RAM will only help you with queries against your data, if you are pre-aggregating, then you may not need all that much RAM.
You can easily get 100 tansacts per second with even less hardware with a little data partitioning.
Choose your controller carefully as many don't co-operate with linux well.
Alex Turner
NetEconomist
You can easily get 100 tansacts per second with even less hardware with a little data partitioning.
Choose your controller carefully as many don't co-operate with linux well.
Alex Turner
NetEconomist
On 9/12/05, Brandon Black <blblack@gmail.com> wrote:
I'm in the process of developing an application which uses PostgreSQL for data storage. Our database traffic is very atypical, and as a result it has been rather challenging to figure out how to best tune PostgreSQL on what development hardware we have, as well as to figure out exactly what we should be evaluating and eventually buying for production hardware.
The vast, overwhelming majority of our database traffic is pretty much a non-stop stream of INSERTs filling up tables. It is akin to data acquisition. Several thousand clients are sending once-per-minute updates full of timestamped numerical data at our central server, which in turn performs INSERTs into several distinct tables as part of the transaction for that client. We're talking on the order of ~100 transactions per second, each containing INSERTs to multiple tables (which contain only integer and floating point columns and a timestamp column - the primary key (and only index) is on a unique integer ID for the client and the timestamp). The transaction load is spread evenly over time by having the clients send their per-minute updates at random times rather than on the exact minute mark.
There will of course be users using a web-based GUI to extract data from these tables and display them in graphs and whatnot, but the SELECT query traffic will always be considerably less frequent and intensive than the incessant INSERTs, and it's not that big a deal if the large queries take a little while to run.
This data also expires - rows with timestamps older than X days will be DELETEd periodically (once an hour or faster), such that the tables will reach a relatively stable size (pg_autovacuum is handling vacuuming for now, but considering our case, we're thinking of killing pg_autovacuum in favor of having the periodic DELETE process also do a vacuum of affected tables right after the DELETE, and then have it vacuum the other low traffic tables once a day while it's at it).
There is an aggregation layer in place which proxies the inbound data from the clients into a small(er) number of persistent postgresql backend processes. Right now we're doing one aggregator per 128 clients (so instead of 128 seperate database connections over the course of a minute for a small transaction each, there is a single database backend that is constantly committing transactions at a rate of ~ 2/second). At a test load of ~1,000 clients, we would have 8 aggregators running and 8 postgresql backends. Testing has seemed to indicate we should aggregate even harder - the planned production load is ~5,000 clients initially, but will grow to almost double that in the not-too-distant future, and that would mean ~40 backends at 128 clients each initially. Even on 8 cpus, I'm betting 40 concurrent backends doing 2 tps is much worse off than 10 backends doing 8 tps.
Test hardware right now is a dual Opteron with 4G of ram, which we've barely gotten 1,000 clients running against. Current disk hardware in testing is whatever we could scrape together (4x 3-ware PCI hardware RAID controllers, with 8 SATA drives in a RAID10 array off of each - aggregated up in a 4-way stripe with linux md driver and then formatted as ext3 with an appropriate stride parameter and data=writeback). Production will hopefully be a 4-8-way Opteron, 16 or more G of RAM, and a fiberchannel hardware raid array or two (~ 1TB available RAID10 storage) with 15krpm disks and battery-backed write cache.
I know I haven't provided a whole lot of application-level detail here, but does anyone have any general advice on tweaking postgresql to deal with a very heavy load of concurrent and almost exclusively write-only transactions? Increasing shared_buffers seems to always help, even out to half of the dev box's ram (2G). A 100ms commit_delay seemed to help, but tuning it (and _siblings) has been difficult. We're using 8.0 with the default 8k blocksize, but are strongly considering both developing against 8.1 (seems it might handle the heavy concurrency better), and re-compiling with 32k blocksize since our storage arrays will inevitably be using fairly wide stripes. Any advice on any of this (other than drop the project while you're still a little bit sane)?
--Brandon
On 9/12/05, PFC <lists@boutiquenumerique.com> wrote:
Possibly, although it would kill the current design of returning the database transaction status for a single client packet back to the client on transaction success/failure. The aggregator could put several clients' data into a series of delayed multi-row copy statements.
I only briefly scanned their "About" page, but they didn't sound particularly suited to my case at the time (it sounded kinds buzzwordy actually, which I suppose is great for business apps people :) ). We're more of a sciency shop. I'll go look at the project in more detail tonight in light of your recommendation.
That would almost definitely help, I haven't tried it yet. Speaking of the xlog, anyone know anything specific about the WAL tuning parameters for heavy concurrent write traffic? What little I could dig up on WAL tuning was contradictory, and testing some random changes to the parameters hasn't been very conclusive yet. I would imagine the WAL buffers stuff could potentially have a large effect for us.
> I know I haven't provided a whole lot of application-level detail here,
You did !
What about :
- using COPY instead of INSERT ?
(should be easy to do from the aggregators)
Possibly, although it would kill the current design of returning the database transaction status for a single client packet back to the client on transaction success/failure. The aggregator could put several clients' data into a series of delayed multi-row copy statements.
- using Bizgres ?
(which was designed for your use case)
I only briefly scanned their "About" page, but they didn't sound particularly suited to my case at the time (it sounded kinds buzzwordy actually, which I suppose is great for business apps people :) ). We're more of a sciency shop. I'll go look at the project in more detail tonight in light of your recommendation.
- splitting the xlog and the data on distinct physical drives or arrays
That would almost definitely help, I haven't tried it yet. Speaking of the xlog, anyone know anything specific about the WAL tuning parameters for heavy concurrent write traffic? What little I could dig up on WAL tuning was contradictory, and testing some random changes to the parameters hasn't been very conclusive yet. I would imagine the WAL buffers stuff could potentially have a large effect for us.
- benchmarking something else than ext3
(xfs ? reiser3 ?)
We've had bad experiences under extreme and/or strange workloads with XFS here in general, although this is the first major postgresql project - the rest were with other applications writing to XFS. Bad experiences like XFS filesystems "detecting internal inconsistencies" at runtime and unmounting themselves from within the kernel module (much to the dismay of applications with open files on the filesystem), on machines with validated good hardware. It has made me leary of using anything other than ext3 for fear of stability problems. Reiser3 might be worth taking a look at though.
Thanks for the ideas,
-- Brandon
Brandon Black wrote: > > > On 9/12/05, *PFC* <lists@boutiquenumerique.com > <mailto:lists@boutiquenumerique.com>> wrote: > > > > - benchmarking something else than ext3 > (xfs ? reiser3 ?) > > > We've had bad experiences under extreme and/or strange workloads with > XFS here in general, although this is the first major postgresql > project - the rest were with other applications writing to XFS. Bad > experiences like XFS filesystems "detecting internal inconsistencies" > at runtime and unmounting themselves from within the kernel module > (much to the dismay of applications with open files on the > filesystem), on machines with validated good hardware. It has made me > leary of using anything other than ext3 for fear of stability > problems. Reiser3 might be worth taking a look at though. Just one tidbit. We tried XFS on a very active system similar to what you describe. Dual opterons, 8GB memory, fiber channel drives, 2.6 kernel, etc. And the reliability was awful. We spent a lot of time making changes one at a time to try and isolate the cause; when we switched out from XFS to ReiserFS our stability problems went away. It may be the case that the XFS problems have all been corrected in newer kernels, but I'm not going to put too much effort into trying that again. I recently built a postgres with 32KB block sizes and have been doing some testing. For our particular workloads it has been a win. -- Alan
Brandon Black <blblack@gmail.com> writes: > The vast, overwhelming majority of our database traffic is pretty much a > non-stop stream of INSERTs filling up tables. That part Postgres should handle pretty well. It should be pretty much limited by your I/O bandwidth so big raid 1+0 arrays are ideal. Putting the WAL on a dedicated array would also be critical. The WAL parameters like commit_delay and commit_siblings are a bit of a mystery. Nobody has done any extensive testing of them. It would be quite helpful if you find anything conclusive and post it. It would also be surprising if they had a very large effect. They almost got chopped recently because they weren't believed to be useful. You might also ponder whether you need to by issuing a commit for every datum. If you only need to commit periodically you can get much better throughput. I suppose that's the same as commit_siblings. It would be interesting to know if you can get those parameters to perform as well as batching up records yourself. > There will of course be users using a web-based GUI to extract data from > these tables and display them in graphs and whatnot, but the SELECT query > traffic will always be considerably less frequent and intensive than the > incessant INSERTs, and it's not that big a deal if the large queries take a > little while to run. I do fear these queries. Even if they aren't mostly terribly intensive if you're pushing the edges of your write I/O bandwidth then a single seek to satisfy one of these selects could really hurt your throughput. That said, as long as your WAL is on a dedicated drive Postgres's architecture should in theory be ideal and allow you do run these things with impunity. The WAL is purely write-only and it's the only thing your inserts will be blocking on. > This data also expires - rows with timestamps older than X days will be > DELETEd periodically (once an hour or faster), such that the tables will > reach a relatively stable size (pg_autovacuum is handling vacuuming for now, > but considering our case, we're thinking of killing pg_autovacuum in favor > of having the periodic DELETE process also do a vacuum of affected tables > right after the DELETE, and then have it vacuum the other low traffic tables > once a day while it's at it). Ay, there's the rub. Taking this approach means you have vacuums running which have to scan your entire table and your inserts are being sprayed all over the disk. An alternative you may consider is using partitioned tables. Then when you want to expire old records you simply drop the oldest partition. Or in your case you might rotate through the partitions, so you just truncate the oldest one and start inserting into it. Unfortunately there's no built-in support for partitioned tables in Postgres. You get to roll your own using UNION ALL or using inherited tables. Various people have described success with each option though they both have downsides too. Using partitioned tables you would never need to delete any records except for when you delete all of them. So you would never need to run vacuum except on newly empty partitions. That avoids having to scan through all those pages that you know will never have holes. If you use TRUNCATE (or VACUUM ALL or CLUSTER) that would mean your inserts are always sequential too (though it also means lots of block allocations along the way, possibly not an advantage). This may be a lot more work to set up and maintain but I think it would be a big performance win. It would directly speed up the WAL writes by avoiding those big full page dumps. And it would also cut out all the I/O traffic vacuum generates. > Increasing shared_buffers seems to always help, even out to half of the dev > box's ram (2G). Half should be a pessimal setting. It means virtually everything is buffered twice. Once in the kernel and once in Postgres. Effectively halving your memory. If that's really helping try raising it even further, to something like 90% of your memory. But the conventional dogma is that shared buffers should never be more than about 10k no matter how much memory you have. Let the kernel do the bulk of the buffering. That said it's even more mysterious in your situation. Why would a large shared buffers help insert-only performance much at all? My guess is that it's letting the vacuums complete quicker. Perhaps try raising work_mem? -- greg
On 9/12/05, Brandon Black <blblack@gmail.com> wrote: > > I'm in the process of developing an application which uses PostgreSQL for > data storage. Our database traffic is very atypical, and as a result it has > been rather challenging to figure out how to best tune PostgreSQL on what > development hardware we have, as well as to figure out exactly what we > should be evaluating and eventually buying for production hardware. A few suggestions... 1) Switch to COPY if you can, it's anywhere from 10-100x faster than INSERT, but it does not necessarily fit your idea of updating multiple tables. In that case, try and enlarge the transaction's scope and do multiple INSERTs in the same transaction. Perhaps batching once per second, or 5 seconds, and returning the aggregate result ot the clients. 2) Tune ext3. The default configuration wrecks high-write situations. Look into data=writeback for mounting, turning off atime (I hope you've done this already) updates, and also modifying the scheduler to the elevator model. This is poorly documented in Linux (like just about everything), but it's crtical. 3) Use 8.1 and strongly look at Bizgres. The data partitioning is critical. 4) Make sure you are not touching more data than you need, and don't have any extraneous indexes. Use the planner to make sure every index is used, as it substantially increases the write load. I've worked on a few similar applications, and this is a hard thing in any database, even Oracle. Chris -- | Christopher Petrilli | petrilli@gmail.com
On 12 Sep 2005 23:07:49 -0400, Greg Stark <gsstark@mit.edu> wrote:
Ideally I'd like to commit the data seperately, as the data could contain errors which abort the transaction, but it may come down to batching it and coding things such that I can catch and discard the offending row and retry the transaction if it fails (which should be fairly rare I would hope). I was hoping that the commit_delay/commit_siblings stuff would allow me to maintain simplistic transaction failure isolation while giving some of the benefits of batching things up, as you've said. I have seen performance gains with it set at 100ms and a 3-6 siblings with 8 backends running, but I haven't been able to extensively tune these values, they were mostly random guesses that seemed to work. My cycles of performance testing take a while, at least a day or two per change being tested, and the differences can even then be hard to see due to variability in the testing load (as it's not really a static test load, but a window on reality). On top of that, with the time it takes, I've succumbed more than once to the temptation of tweaking more than one thing per performance run, which really muddies the results.
I find it odd as well. After reading the standard advice on shared_buffers, I had only intended on raising it slightly. But seeing ever-increasing performance gains, I just kept tuning it upwards all the way to the 2G limit, and saw noticeable gains every time. During at least some of the test cycles, there was no deleting or vacuuming going on, just insert traffic. I guessed that shared_buffers' caching strategy must have been much more effective than the OS cache at something or other, but I don't know what exactly. The only important read traffic that comes to mind is the index which is both being constantly updated and constantly checked for primary key uniqueness violations.
All of these replies here on the list (and a private one or two) have given me a long list of things to try, and I feel confident that at least some of them will gain me enough performance to comfortably deploy this application in the end on somewhat reasonable hardware. Thanks to everyone here on the list for all the suggestions, it has been very helpful in giving me directions to go with this that I hadn't thought of before.
When I finally get all of this sorted out and working reasonably optimally, I'll be sure to come back and report what techniques/settings did and didn't work for this workload.
-- Brandon
The WAL parameters like commit_delay and commit_siblings are a bit of a
mystery. Nobody has done any extensive testing of them. It would be quite
helpful if you find anything conclusive and post it. It would also be
surprising if they had a very large effect. They almost got chopped recently
because they weren't believed to be useful.
You might also ponder whether you need to by issuing a commit for every datum.
If you only need to commit periodically you can get much better throughput. I
suppose that's the same as commit_siblings. It would be interesting to know if
you can get those parameters to perform as well as batching up records
yourself.
Ideally I'd like to commit the data seperately, as the data could contain errors which abort the transaction, but it may come down to batching it and coding things such that I can catch and discard the offending row and retry the transaction if it fails (which should be fairly rare I would hope). I was hoping that the commit_delay/commit_siblings stuff would allow me to maintain simplistic transaction failure isolation while giving some of the benefits of batching things up, as you've said. I have seen performance gains with it set at 100ms and a 3-6 siblings with 8 backends running, but I haven't been able to extensively tune these values, they were mostly random guesses that seemed to work. My cycles of performance testing take a while, at least a day or two per change being tested, and the differences can even then be hard to see due to variability in the testing load (as it's not really a static test load, but a window on reality). On top of that, with the time it takes, I've succumbed more than once to the temptation of tweaking more than one thing per performance run, which really muddies the results.
> Increasing shared_buffers seems to always help, even out to half of the dev
> box's ram (2G).
Half should be a pessimal setting. It means virtually everything is buffered
twice. Once in the kernel and once in Postgres. Effectively halving your
memory. If that's really helping try raising it even further, to something
like 90% of your memory. But the conventional dogma is that shared buffers
should never be more than about 10k no matter how much memory you have. Let
the kernel do the bulk of the buffering.
That said it's even more mysterious in your situation. Why would a large
shared buffers help insert-only performance much at all? My guess is that it's
letting the vacuums complete quicker. Perhaps try raising work_mem?
I find it odd as well. After reading the standard advice on shared_buffers, I had only intended on raising it slightly. But seeing ever-increasing performance gains, I just kept tuning it upwards all the way to the 2G limit, and saw noticeable gains every time. During at least some of the test cycles, there was no deleting or vacuuming going on, just insert traffic. I guessed that shared_buffers' caching strategy must have been much more effective than the OS cache at something or other, but I don't know what exactly. The only important read traffic that comes to mind is the index which is both being constantly updated and constantly checked for primary key uniqueness violations.
All of these replies here on the list (and a private one or two) have given me a long list of things to try, and I feel confident that at least some of them will gain me enough performance to comfortably deploy this application in the end on somewhat reasonable hardware. Thanks to everyone here on the list for all the suggestions, it has been very helpful in giving me directions to go with this that I hadn't thought of before.
When I finally get all of this sorted out and working reasonably optimally, I'll be sure to come back and report what techniques/settings did and didn't work for this workload.
-- Brandon
On 9/12/05, Christopher Petrilli <petrilli@gmail.com> wrote:
I'm using noatime and data=writeback already. I changed my scheduler from the default anticipatory to deadline and saw an improvement, but I haven't yet delved into playing with specific elevator tunable values per-device.
I've just started down the path of getting 8.1 running with a larger block size, and tommorow I'm going to look at Bizgres's partitioning as opposed to some manual schemes. Will the current Bizgres have a lot of the performance enhancements of 8.1 already (or will 8.1 or 8.2 eventually get Bizgres's partitioning?)?
-- Brandon
2) Tune ext3. The default configuration wrecks high-write situations.
Look into data=writeback for mounting, turning off atime (I hope
you've done this already) updates, and also modifying the scheduler to
the elevator model. This is poorly documented in Linux (like just
about everything), but it's crtical.
I'm using noatime and data=writeback already. I changed my scheduler from the default anticipatory to deadline and saw an improvement, but I haven't yet delved into playing with specific elevator tunable values per-device.
3) Use 8.1 and strongly look at Bizgres. The data partitioning is critical.
I've just started down the path of getting 8.1 running with a larger block size, and tommorow I'm going to look at Bizgres's partitioning as opposed to some manual schemes. Will the current Bizgres have a lot of the performance enhancements of 8.1 already (or will 8.1 or 8.2 eventually get Bizgres's partitioning?)?
-- Brandon
* Brandon Black (blblack@gmail.com) wrote: > Ideally I'd like to commit the data seperately, as the data could contain > errors which abort the transaction, but it may come down to batching it and > coding things such that I can catch and discard the offending row and retry > the transaction if it fails (which should be fairly rare I would hope). I Don't really know if it'd be better, or worse, or what, but another thought to throw out there is to perhaps use savepoints instead of full transactions? Not sure if that's more expensive or cheaper than doing full commits but it might be something to consider. > When I finally get all of this sorted out and working reasonably optimally, > I'll be sure to come back and report what techniques/settings did and didn't > work for this workload. That'd be great, many thanks, Stephen
Attachment
On 9/12/05, Christopher Petrilli <petrilli@gmail.com> wrote:
I started looking closer at my options for partitioning (inheritance, union all), and at Bizgres today. Bizgres partitioning appears to be basically the same kind of inheritance partitioning one can do in mainline PostgreSQL. Am I correct in thinking that the main difference is that they've coded support for "enable_constraint_exclusion=true" so that the query planner can be more effective at taking advantage of the partitioning when you've specified CHECK constraints on the child tables? I may go for 8.1 instead in that case, as the main win I'm looking for is that with inheritance I'll be doing inserts into smaller tables instead of ones that grow to unmanageable sizes (and that I can drop old child tables instead of delete/vacuum).
3) Use 8.1 and strongly look at Bizgres. The data partitioning is critical.
I started looking closer at my options for partitioning (inheritance, union all), and at Bizgres today. Bizgres partitioning appears to be basically the same kind of inheritance partitioning one can do in mainline PostgreSQL. Am I correct in thinking that the main difference is that they've coded support for "enable_constraint_exclusion=true" so that the query planner can be more effective at taking advantage of the partitioning when you've specified CHECK constraints on the child tables? I may go for 8.1 instead in that case, as the main win I'm looking for is that with inheritance I'll be doing inserts into smaller tables instead of ones that grow to unmanageable sizes (and that I can drop old child tables instead of delete/vacuum).
On Mon, 12 Sep 2005 16:04:06 -0500 Brandon Black <blblack@gmail.com> wrote: I've seen serveral tests PostgreSQL on JFS file system, it runs faster than using ext3. Our production server works using JFS and RAID10, we have 250K+ transactions per day and everything is OK. Try switching to separate RAID10 arrays and use one for xlog, and others for data. If you are using indexes, try to put them on separate RAID10 array. > Test hardware right now is a dual Opteron with 4G of ram, which we've barely > gotten 1,000 clients running against. Current disk hardware in testing is > whatever we could scrape together (4x 3-ware PCI hardware RAID controllers, > with 8 SATA drives in a RAID10 array off of each - aggregated up in a 4-way > stripe with linux md driver and then formatted as ext3 with an appropriate > stride parameter and data=writeback). Production will hopefully be a 4-8-way > Opteron, 16 or more G of RAM, and a fiberchannel hardware raid array or two > (~ 1TB available RAID10 storage) with 15krpm disks and battery-backed write > cache. -- Evgeny Gridasov Software Developer I-Free, Russia
On Tue, 2005-09-13 at 11:30, Brandon Black wrote: > I started looking closer at my options for partitioning (inheritance, > union all), and at Bizgres today. Bizgres partitioning appears to be > basically the same kind of inheritance partitioning one can do in > mainline PostgreSQL. Am I correct in thinking that the main > difference is that they've coded support for > "enable_constraint_exclusion=true" so that the query planner can be > more effective at taking advantage of the partitioning when you've > specified CHECK constraints on the child tables? I may go for 8.1 > instead in that case, as the main win I'm looking for is that with > inheritance I'll be doing inserts into smaller tables instead of ones > that grow to unmanageable sizes (and that I can drop old child tables > instead of delete/vacuum). Perhaps I missed something in this thread, but don't forget you still need vacuum to reclaim XIDs. --Ian
On Tue, Sep 13, 2005 at 12:16:55PM -0400, Ian Westmacott wrote: > On Tue, 2005-09-13 at 11:30, Brandon Black wrote: > > I started looking closer at my options for partitioning (inheritance, > > union all), and at Bizgres today. Bizgres partitioning appears to be > > basically the same kind of inheritance partitioning one can do in > > mainline PostgreSQL. Am I correct in thinking that the main > > difference is that they've coded support for > > "enable_constraint_exclusion=true" so that the query planner can be > > more effective at taking advantage of the partitioning when you've > > specified CHECK constraints on the child tables? I may go for 8.1 > > instead in that case, as the main win I'm looking for is that with > > inheritance I'll be doing inserts into smaller tables instead of ones > > that grow to unmanageable sizes (and that I can drop old child tables > > instead of delete/vacuum). > > Perhaps I missed something in this thread, but don't forget > you still need vacuum to reclaim XIDs. Yes, but if you are going to drop the partition before 1 billion transactions, you can skip vacuuming it completely. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Es filósofo el que disfruta con los enigmas" (G. Coli)
On Sep 12, 2005, at 6:02 PM, Brandon Black wrote:
- using COPY instead of INSERT ?
(should be easy to do from the aggregators)
Possibly, although it would kill the current design of returning the database transaction status for a single client packet back to the client on transaction success/failure. The aggregator could put several clients' data into a series of delayed multi-row copy statements.
buffer through the file system on your aggregator. once you "commit" to local disk file, return back to your client that you got the data. then insert into the actual postgres DB in large batches of inserts inside a single Postgres transaction.
we have our web server log certain tracking requests to a local file. with file locks and append mode, it is extremely quick and has little contention delays. then every so often, we lock the file, rename it, release the lock, then process it at our leisure to do the inserts to Pg in one big transaction.
Vivek Khera, Ph.D.
+1-301-869-4449 x806
On Sep 12, 2005, at 6:02 PM, Brandon Black wrote:
- splitting the xlog and the data on distinct physical drives or arrays
That would almost definitely help, I haven't tried it yet. Speaking of the xlog, anyone know anything specific about the WAL tuning parameters for heavy concurrent write traffic? What little I could dig up on WAL tuning was contradictory, and testing some random changes to the parameters hasn't been very conclusive yet. I would imagine the WAL buffers stuff could potentially have a large effect for us.
you will want to make your pg_xlog RAID volume BIG, and then tell postgres to use that space: bump up checkpoint_segments (and suitably the checkpoint timeouts). I run with 256 segments and a timeout of 5 minutes. The timeout refletcs your expected crash recovery time, so adjust it wisely....
Also, you should consider how you split your drives across your RAID data channels on your test machine: I put each pair of the RAID10 mirrors on opposite channels, so both channels of my RAID controller are pretty evenly loaded during write.
Vivek Khera, Ph.D.
+1-301-869-4449 x806