Thread: sustained update load of 1-2k/sec
Hi all,

I bet you get tired of the same ole questions over and over.

I'm currently working on an application that will poll thousands of cable modems per minute and I would like to use PostgreSQL to maintain state between polls of each device. This requires a very heavy amount of updates in place on a reasonably large table (100k-500k rows, ~7 columns, mostly integers/bigint). Each row will be refreshed every 15 minutes, or at least that's how fast I can poll via SNMP. I hope I can tune the DB to keep up.

The app is threaded and will likely have well over 100 concurrent db connections. Temp tables for storage aren't a preferred option since this is designed to be a shared nothing approach and I will likely have several polling processes.

Here are some of my assumptions so far . . .

- HUGE WAL
- Vacuum hourly if not more often

I'm getting 1700 tx/sec from MySQL and I would REALLY prefer to use PG. I don't need to match the number, just get close.

Is there a global temp table option? In memory tables would be very beneficial in this case. I could just flush it to disk occasionally with an insert into blah select from memory table.

Any help or creative alternatives would be greatly appreciated. :)

'njoy,
Mark

--
Writing software requires an intelligent person, creating functional art requires an artist.
-- Unknown
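[Editor's note: for readers skimming the thread, here is a minimal sketch of the table shape and per-poll update being described. The table and column names, and the use of Python with psycopg2, are illustrative assumptions (the actual app is written in Ruby, as comes up later in the thread); the point is simply one in-place UPDATE per polled modem, committed individually, which is the pattern the replies below pick apart.]

import psycopg2

conn = psycopg2.connect("dbname=modem_state")  # hypothetical database name
cur = conn.cursor()

# Roughly the shape of the table under discussion: ~7 columns, mostly
# integer/bigint, one row per modem, refreshed every 15 minutes.
cur.execute("""
    CREATE TABLE modem_state (
        modem_id   bigint PRIMARY KEY,
        last_poll  integer,   -- epoch seconds of last successful poll
        in_octets  bigint,
        out_octets bigint,
        snr        integer,
        tx_power   integer,
        status     integer
    )
""")
conn.commit()

def record_poll(r):
    """Update one modem's row in place after a poll: the 1-2k/sec workload."""
    cur.execute(
        """UPDATE modem_state
              SET last_poll = %s, in_octets = %s, out_octets = %s,
                  snr = %s, tx_power = %s, status = %s
            WHERE modem_id = %s""",
        (r["ts"], r["in_octets"], r["out_octets"],
         r["snr"], r["tx_power"], r["status"], r["modem_id"]))
    conn.commit()  # one commit per row; later replies suggest batching instead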
On Aug 18, 2005, at 10:24 PM, Mark Cotner wrote:

> I'm currently working on an application that will poll thousands of cable modems per minute and I would like to use PostgreSQL to maintain state between polls of each device. This requires a very heavy amount of updates in place on a reasonably large table(100k-500k rows, ~7 columns mostly integers/bigint). Each row will be refreshed every 15 minutes, or at least that's how fast I can poll via SNMP. I hope I can tune the DB to keep up.
>
> The app is threaded and will likely have well over 100 concurrent db connections. Temp tables for storage aren't a preferred option since this is designed to be a shared nothing approach and I will likely have several polling processes.

Somewhat OT, but..

The easiest way to speed that up is to use fewer threads. You're adding a whole TON of overhead with that many threads that you just don't want or need. You should probably be using something event-driven to solve this problem, with just a few database threads to store all that state. Less is definitely more in this case. See <http://www.kegel.com/c10k.html> (and there's plenty of other literature out there saying that event-driven is an extremely good way to do this sort of thing).

Here are some frameworks to look at for this kind of network code:
(Python) Twisted - <http://twistedmatrix.com/>
(Perl) POE - <http://poe.perl.org/>
(Java) java.nio (not familiar enough with the Java thing to know whether or not there's a high-level wrapper)
(C++) ACE - <http://www.cs.wustl.edu/~schmidt/ACE.html>
(Ruby) IO::Reactor - <http://www.deveiate.org/code/IO-Reactor.html>
(C) libevent - <http://monkey.org/~provos/libevent/>

.. and of course, you have select/poll/kqueue/WaitNextEvent/whatever that you could use directly, if you wanted to roll your own solution, but don't do that.

If you don't want to optimize the whole application, I'd at least just push the DB operations down to a very small number of connections (*one* might even be optimal!), waiting on some kind of thread-safe queue for updates from the rest of the system. This way you can easily batch those updates into transactions and you won't be putting so much unnecessary synchronization overhead into your application and the database.

Generally, once you have more worker threads (or processes) than CPUs, you're going to get diminishing returns in a bad way, assuming those threads are making good use of their time.

-bob
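[Editor's note: a minimal sketch of the queue-plus-single-writer pattern Bob is suggesting, under the same illustrative assumptions as the earlier snippet (Python, psycopg2, and the hypothetical modem_state table); the batch and queue sizes are made up. Poller threads only ever call updates.put(...); one connection drains the queue and commits many rows per transaction.]

import queue
import threading
import psycopg2

# Pollers put 7-tuples: (ts, in_octets, out_octets, snr, tx_power, status, modem_id)
updates = queue.Queue(maxsize=100000)

def db_writer(batch_size=1000):
    conn = psycopg2.connect("dbname=modem_state")
    cur = conn.cursor()
    while True:
        batch = [updates.get()]            # block until at least one row arrives
        while len(batch) < batch_size:     # then grab whatever else is waiting
            try:
                batch.append(updates.get_nowait())
            except queue.Empty:
                break
        cur.executemany(
            """UPDATE modem_state
                  SET last_poll = %s, in_octets = %s, out_octets = %s,
                      snr = %s, tx_power = %s, status = %s
                WHERE modem_id = %s""",
            batch)
        conn.commit()                      # one fsync per batch instead of per row

threading.Thread(target=db_writer, daemon=True).start()

Tom Lane's reply further down argues for a small pool of such writer connections rather than exactly one; the same queue-fed pattern extends to that case.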
Excellent feedback. Thank you. Please do keep in mind I'm storing the results of SNMP queries. The majority of the time each thread is in a wait state, listening on a UDP port for return packet. The number of threads is high because in order to sustain poll speed I need to minimize the impact of timeouts and all this waiting for return packets. I had intended to have a fallback plan which would build a thread safe queue for db stuffs, but the application isn't currently architected that way. It's not completely built yet so now is the time for change. I hadn't thought of building up a batch of queries and creating a transaction from them. I've been looking into memcached as a persistent object store as well and hadn't seen the reactor pattern yet. Still trying to get my puny brain around that one. Again, thanks for the help. 'njoy, Mark On 8/19/05 5:09 AM, "Bob Ippolito" <bob@redivi.com> wrote: > > On Aug 18, 2005, at 10:24 PM, Mark Cotner wrote: > >> I'm currently working on an application that will poll >> thousands of cable modems per minute and I would like >> to use PostgreSQL to maintain state between polls of >> each device. This requires a very heavy amount of >> updates in place on a reasonably large table(100k-500k >> rows, ~7 columns mostly integers/bigint). Each row >> will be refreshed every 15 minutes, or at least that's >> how fast I can poll via SNMP. I hope I can tune the >> DB to keep up. >> >> The app is threaded and will likely have well over 100 >> concurrent db connections. Temp tables for storage >> aren't a preferred option since this is designed to be >> a shared nothing approach and I will likely have >> several polling processes. > > Somewhat OT, but.. > > The easiest way to speed that up is to use less threads. You're > adding a whole TON of overhead with that many threads that you just > don't want or need. You should probably be using something event- > driven to solve this problem, with just a few database threads to > store all that state. Less is definitely more in this case. See > <http://www.kegel.com/c10k.html> (and there's plenty of other > literature out there saying that event driven is an extremely good > way to do this sort of thing). > > Here are some frameworks to look at for this kind of network code: > (Python) Twisted - <http://twistedmatrix.com/> > (Perl) POE - <http://poe.perl.org/> > (Java) java.nio (not familiar enough with the Java thing to know > whether or not there's a high-level wrapper) > (C++) ACE - <http://www.cs.wustl.edu/~schmidt/ACE.html> > (Ruby) IO::Reactor - <http://www.deveiate.org/code/IO-Reactor.html> > (C) libevent - <http://monkey.org/~provos/libevent/> > > .. and of course, you have select/poll/kqueue/WaitNextEvent/whatever > that you could use directly, if you wanted to roll your own solution, > but don't do that. > > If you don't want to optimize the whole application, I'd at least > just push the DB operations down to a very small number of > connections (*one* might even be optimal!), waiting on some kind of > thread-safe queue for updates from the rest of the system. This way > you can easily batch those updates into transactions and you won't be > putting so much unnecessary synchronization overhead into your > application and the database. > > Generally, once you have more worker threads (or processes) than > CPUs, you're going to get diminishing returns in a bad way, assuming > those threads are making good use of their time. > > -bob >
On Aug 19, 2005, at 12:14 AM, Mark Cotner wrote:

> Excellent feedback. Thank you. Please do keep in mind I'm storing the results of SNMP queries. The majority of the time each thread is in a wait state, listening on a UDP port for return packet. The number of threads is high because in order to sustain poll speed I need to minimize the impact of timeouts and all this waiting for return packets.

Asynchronous IO via select/poll/etc. basically says: "given these 100 sockets, wake me up when any of them has something to tell me, or wake me up anyway in N milliseconds". From one thread, you can usually deal with thousands of connections without breaking a sweat, where with thread-per-connection you have so much overhead just for the threads that you probably run out of RAM before your network is throttled.

The reactor pattern basically just abstracts this a bit so that you worry about what to do when the sockets have something to say, and also allows you to schedule timed events, rather than having to worry about how to implement that correctly *and* write your application.

With 100 threads you are basically invoking a special case of the same mechanism that only looks at one socket, but this makes for 100 different data structures that end up in both userspace and kernel space, plus the thread stacks (which can easily be a few megs each) and context switching when any of them wakes up. You're throwing a lot of RAM and CPU cycles out the window by using this design. Also, preemptive threads are hard.

> I had intended to have a fallback plan which would build a thread safe queue for db stuffs, but the application isn't currently architected that way. It's not completely built yet so now is the time for change. I hadn't thought of building up a batch of queries and creating a transaction from them.

It should be *really* easy to just swap out the implementation of your "change this record" function with one that simply puts its arguments on a queue, with another thread that gets them from the queue and actually does the work.

> I've been looking into memcached as a persistent object store as well and hadn't seen the reactor pattern yet. Still trying to get my puny brain around that one.

memcached is RAM based, it's not persistent at all... unless you are sure all of your nodes will be up at all times and will never go down. IIRC, it also just starts throwing away data once you hit its size limit. Of course, this isn't really any different than MySQL's MyISAM tables if you hit the row limit, but I think that memcached might not even give you an error when this happens. Also, memcached is just key/value pairs over a network, not much of a database going on there.

If you can fit all this data in RAM and you don't care so much about the integrity, you might not benefit much from an RDBMS at all. However, I don't really know what you're doing with the data once you have it so I might be very wrong here...

-bob

> > Again, thanks for the help. > > 'njoy, > Mark > > > On 8/19/05 5:09 AM, "Bob Ippolito" <bob@redivi.com> wrote: > > >> >> On Aug 18, 2005, at 10:24 PM, Mark Cotner wrote: >> >> >>> I'm currently working on an application that will poll >>> thousands of cable modems per minute and I would like >>> to use PostgreSQL to maintain state between polls of >>> each device. This requires a very heavy amount of >>> updates in place on a reasonably large table(100k-500k >>> rows, ~7 columns mostly integers/bigint).
Each row >>> will be refreshed every 15 minutes, or at least that's >>> how fast I can poll via SNMP. I hope I can tune the >>> DB to keep up. >>> >>> The app is threaded and will likely have well over 100 >>> concurrent db connections. Temp tables for storage >>> aren't a preferred option since this is designed to be >>> a shared nothing approach and I will likely have >>> several polling processes. >>> >> >> Somewhat OT, but.. >> >> The easiest way to speed that up is to use less threads. You're >> adding a whole TON of overhead with that many threads that you just >> don't want or need. You should probably be using something event- >> driven to solve this problem, with just a few database threads to >> store all that state. Less is definitely more in this case. See >> <http://www.kegel.com/c10k.html> (and there's plenty of other >> literature out there saying that event driven is an extremely good >> way to do this sort of thing). >> >> Here are some frameworks to look at for this kind of network code: >> (Python) Twisted - <http://twistedmatrix.com/> >> (Perl) POE - <http://poe.perl.org/> >> (Java) java.nio (not familiar enough with the Java thing to know >> whether or not there's a high-level wrapper) >> (C++) ACE - <http://www.cs.wustl.edu/~schmidt/ACE.html> >> (Ruby) IO::Reactor - <http://www.deveiate.org/code/IO-Reactor.html> >> (C) libevent - <http://monkey.org/~provos/libevent/> >> >> .. and of course, you have select/poll/kqueue/WaitNextEvent/whatever >> that you could use directly, if you wanted to roll your own solution, >> but don't do that. >> >> If you don't want to optimize the whole application, I'd at least >> just push the DB operations down to a very small number of >> connections (*one* might even be optimal!), waiting on some kind of >> thread-safe queue for updates from the rest of the system. This way >> you can easily batch those updates into transactions and you won't be >> putting so much unnecessary synchronization overhead into your >> application and the database. >> >> Generally, once you have more worker threads (or processes) than >> CPUs, you're going to get diminishing returns in a bad way, assuming >> those threads are making good use of their time. >> >> -bob >> >> > > > > ---------------------------(end of > broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
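[Editor's note: to make the select/poll description above concrete, here is a rough single-threaded event-loop sketch, in Python purely for illustration (the poster's app is Ruby). Real SNMP PDU encoding and decoding are omitted, and port 161 plus the timeout values are assumptions; the only point is "wake me when any of these sockets has something to say, or after a short timeout".]

import select
import socket
import time

sockets = {}           # socket -> (modem_ip, deadline)
replies = []           # (modem_ip, raw_response) pairs, decoded elsewhere

def send_poll(modem_ip, request_bytes, timeout=5.0):
    """Fire one request and register the socket with the event loop."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setblocking(False)
    s.sendto(request_bytes, (modem_ip, 161))
    sockets[s] = (modem_ip, time.time() + timeout)

def event_loop():
    """One thread watches every outstanding request at once."""
    while sockets:
        readable, _, _ = select.select(list(sockets), [], [], 0.1)
        for s in readable:
            data, _addr = s.recvfrom(65535)
            replies.append((sockets.pop(s)[0], data))
            s.close()
        now = time.time()
        for s in [s for s, (_, deadline) in sockets.items() if deadline < now]:
            sockets.pop(s)    # timed out; a real poller would reschedule or log
            s.close()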
I have managed tx speeds that high from PostgreSQL, going even as high as 2500/sec for small tables, but it does require a good RAID controller card (yes, I'm even running with fsync on). I'm using a 3ware 9500S-8MI with Raptor drives in multiple RAID 10s. The box wasn't too $$$ at just around $7k. I have two independent controllers on two independent PCI buses to give max throughput, one with a 6 drive RAID 10 and the other with two 4 drive RAID 10s.

Alex Turner
NetEconomist

On 8/19/05, Mark Cotner <mcotner@yahoo.com> wrote: > Hi all, > I bet you get tired of the same ole questions over and > over. > > I'm currently working on an application that will poll > thousands of cable modems per minute and I would like > to use PostgreSQL to maintain state between polls of > each device. This requires a very heavy amount of > updates in place on a reasonably large table(100k-500k > rows, ~7 columns mostly integers/bigint). Each row > will be refreshed every 15 minutes, or at least that's > how fast I can poll via SNMP. I hope I can tune the > DB to keep up. > > The app is threaded and will likely have well over 100 > concurrent db connections. Temp tables for storage > aren't a preferred option since this is designed to be > a shared nothing approach and I will likely have > several polling processes. > > Here are some of my assumptions so far . . . > > HUGE WAL > Vacuum hourly if not more often > > I'm getting 1700tx/sec from MySQL and I would REALLY > prefer to use PG. I don't need to match the number, > just get close. > > Is there a global temp table option? In memory tables > would be very beneficial in this case. I could just > flush it to disk occasionally with an insert into blah > select from memory table. > > Any help or creative alternatives would be greatly > appreciated. :) > > 'njoy, > Mark > > > -- > Writing software requires an intelligent person, > creating functional art requires an artist. > -- Unknown
Bob Ippolito <bob@redivi.com> writes:
> If you don't want to optimize the whole application, I'd at least just push the DB operations down to a very small number of connections (*one* might even be optimal!), waiting on some kind of thread-safe queue for updates from the rest of the system.

While I agree that hundreds of threads seems like overkill, I think the above advice might be going too far in the other direction. The problem with single-threaded operation is that any delay affects the whole system --- eg, if you're blocked waiting for disk I/O, the CPU doesn't get anything done either. You want enough DB connections doing things in parallel to make sure that there's always something else useful to do for each major component. This is particularly important for Postgres, which doesn't do any internal query parallelization (not that it would help much anyway for the sorts of trivial queries you are worried about). If you have, say, a 4-way CPU you want at least 4 active connections to make good use of the CPUs.

I'd suggest trying to build the system so that it uses a dozen or two active database connections. If that doesn't match up to the number of polling activities you want to have in flight at any instant, then you can do something like what Bob suggested on the client side to bridge the gap.

As far as the question "can PG do 1-2k xact/sec", the answer is "yes if you throw enough hardware at it". Spending enough money on the disk subsystem is the key ...

regards, tom lane
Tom Lane wrote: >Bob Ippolito <bob@redivi.com> writes: > > >>If you don't want to optimize the whole application, I'd at least >>just push the DB operations down to a very small number of >>connections (*one* might even be optimal!), waiting on some kind of >>thread-safe queue for updates from the rest of the system. >> >> > >While I agree that hundreds of threads seems like overkill, I think the >above advice might be going too far in the other direction. The problem >with single-threaded operation is that any delay affects the whole >system --- eg, if you're blocked waiting for disk I/O, the CPU doesn't >get anything done either. You want enough DB connections doing things >in parallel to make sure that there's always something else useful to do >for each major component. This is particularly important for Postgres, >which doesn't do any internal query parallelization (not that it would >help much anyway for the sorts of trivial queries you are worried about). >If you have, say, a 4-way CPU you want at least 4 active connections to >make good use of the CPUs. > >I'd suggest trying to build the system so that it uses a dozen or two >active database connections. If that doesn't match up to the number of >polling activities you want to have in flight at any instant, then you >can do something like what Bob suggested on the client side to bridge >the gap. > >As far as the question "can PG do 1-2k xact/sec", the answer is "yes >if you throw enough hardware at it". Spending enough money on the >disk subsystem is the key ... > > The 1-2k xact/sec for MySQL seems suspicious, sounds very much like write-back cached, not write-through, esp. considering that heavy concurrent write access isn't said to be MySQLs strength... I wonder if preserving the database after a fatal crash is really necessary, since the data stored sounds quite volatile; in this case, fsync=false might be sufficient. Regards, Andreas
Andreas Pflug <pgadmin@pse-consulting.de> writes: > Tom Lane wrote: >> As far as the question "can PG do 1-2k xact/sec", the answer is "yes >> if you throw enough hardware at it". Spending enough money on the >> disk subsystem is the key ... >> > The 1-2k xact/sec for MySQL seems suspicious, sounds very much like > write-back cached, not write-through, esp. considering that heavy > concurrent write access isn't said to be MySQLs strength... > I wonder if preserving the database after a fatal crash is really > necessary, since the data stored sounds quite volatile; in this case, > fsync=false might be sufficient. Yeah, that's something to think about. If you do need full transaction safety, then you *must* have a decent battery-backed-write-cache setup, else your transaction commit rate will be limited by disk rotation speed --- for instance, a single connection can commit at most 250 xacts per second if the WAL log is on a 15000RPM drive. (You can improve this to the extent that you can spread activity across multiple connections, but I'm not sure you can expect to reliably have 8 or more connections ready to commit each time the disk goes 'round.) regards, tom lane
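[Editor's note: a back-of-the-envelope sketch of the arithmetic behind Tom's 250-commits-per-second figure. With fsync on and no battery-backed write cache, each commit from a single connection waits for the WAL platter to come around again, so the ceiling is roughly one commit per rotation; packing more rows into each commit raises row throughput without raising that ceiling. The batch size below is hypothetical.]

# Rough commit-rate ceiling for a single connection with fsync on and no
# write-back cache: at best one commit per disk rotation.
rpm = 15000
rotations_per_sec = rpm / 60.0                 # 250 for a 15k RPM WAL drive
max_commits_per_sec = rotations_per_sec        # ~250/sec, matching Tom's figure

# Packing more updates into each transaction raises row throughput even
# though the commit rate stays the same (batch size is hypothetical):
rows_per_commit = 100
rows_per_sec = max_commits_per_sec * rows_per_commit   # ~25,000 rows/sec ceiling
print(max_commits_per_sec, rows_per_sec)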
Alex mentions a nice setup, but I'm pretty sure I know how to beat that IO subsystem's HW performance by at least 1.5x or 2x. Possibly more. (No, I do NOT work for any vendor I'm about to discuss.)

Start by replacing the WD Raptors with Maxtor Atlas 15K II's. At 5.5ms average access, 97.4MB/s outer track throughput, 85.9MB/s average, and 74.4MB/s inner track throughput, they have the best performance characteristics of any tested shipping HDs I know of. (Supposedly the new SAS versions will _sustain_ ~98MB/s, but I'll believe that only if I see it under independent testing). In comparison, the numbers on the WD740GD are 8.1ms average access, and 71.8, 62.9, and 53.9 MB/s outer, average and inner track throughputs respectively.

Be prepared to use as many of them as possible (read: as many as you can afford) if you want to maximize transaction rates, particularly for small transactions like this application seems to be mentioning.

Next, use a better RAID card. The TOL enterprise stuff (Xyratex, Engino, Dot-hill) is probably too expensive, but in the commodity market benchmarks indicate that Areca's 1GB buffer RAID cards currently outperform all the other commodity RAID stuff.

9 Atlas II's per card in a RAID 5 set, or 16 per card in a RAID 10 set, should max the RAID card's throughput and come very close to, if not attaining, the real world peak bandwidth of the 64b 133MHz PCI-X bus they are plugged into. Say somewhere in the 700-800MB/s range.

Repeat the above for as many independent PCI-X buses as you have for a very fast commodity RAID IO subsystem.

Two such configured cards used in the same manner as mentioned by Alex should easily attain 1.5x - 2x the transaction numbers mentioned by Alex, unless there's a bottleneck somewhere else in the system design.

Hope this helps,
Ron Peacetree

At 08:40 AM 8/19/2005, Alex Turner wrote: >I have managed tx speeds that high from postgresql going even as high >as 2500/sec for small tables, but it does require a good RAID >controler card (yes I'm even running with fsync on). I'm using 3ware >9500S-8MI with Raptor drives in multiple RAID 10s. The box wasn't too >$$$ at just around $7k. I have two independant controlers on two >independant PCI buses to give max throughput. on with a 6 drive RAID >10 and the other with two 4 drive RAID 10s. > >Alex Turner >NetEconomist > >On 8/19/05, Mark Cotner <mcotner@yahoo.com> wrote: > > Hi all, > > I bet you get tired of the same ole questions over and > > over. > > > > I'm currently working on an application that will poll > > thousands of cable modems per minute and I would like > > to use PostgreSQL to maintain state between polls of > > each device. This requires a very heavy amount of > > updates in place on a reasonably large table(100k-500k > > rows, ~7 columns mostly integers/bigint). Each row > > will be refreshed every 15 minutes, or at least that's > > how fast I can poll via SNMP. I hope I can tune the > > DB to keep up. > > > > The app is threaded and will likely have well over 100 > > concurrent db connections. Temp tables for storage > > aren't a preferred option since this is designed to be > > a shared nothing approach and I will likely have > > several polling processes. > > > > Here are some of my assumptions so far . . . > > > > HUGE WAL > > Vacuum hourly if not more often > > > > I'm getting 1700tx/sec from MySQL and I would REALLY > > prefer to use PG. I don't need to match the number, > > just get close. > > > > Is there a global temp table option?
In memory tables > > would be very beneficial in this case. I could just > > flush it to disk occasionally with an insert into blah > > select from memory table. > > > > Any help or creative alternatives would be greatly > > appreciated. :) > > > > 'njoy, > > Mark > > > > > > -- > > Writing software requires an intelligent person, > > creating functional art requires an artist. > > -- Unknown > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 3: Have you checked our extensive FAQ? > > > > http://www.postgresql.org/docs/faq > > > >---------------------------(end of broadcast)--------------------------- >TIP 2: Don't 'kill -9' the postmaster
At 09:58 AM 8/19/2005, Andreas Pflug wrote: >The 1-2k xact/sec for MySQL seems suspicious, sounds very much like >write-back cached, not write-through, esp. considering that heavy >concurrent write access isn't said to be MySQLs strength... Don't be suspicious. I haven't seen the code under discussion, but I have seen mySQL easily achieve these kinds of numbers using the myISAM storage engine in write-through cache mode. myISAM can be =FAST=. Particularly when decent HW is thrown at it. Ron
On Fri, 2005-08-19 at 10:54 -0400, Ron wrote: > Maxtor Atlas 15K II's. > Areca's 1GB buffer RAID cards The former are SCSI disks and the latter is an SATA controller. The combination would have a transaction rate of approximately 0. I can vouch for the Areca controllers, however. You can certainly achieve pgbench transaction rates in the hundreds per second even with only 5 7200RPM disks and 128MB cache. Don't forget to buy the battery. -jwb
On 8/19/05 1:24 AM, "Mark Cotner" <mcotner@yahoo.com> wrote:
> I'm currently working on an application that will poll thousands of cable modems per minute and I would like to use PostgreSQL to maintain state between polls of each device. This requires a very heavy amount of updates in place on a reasonably large table(100k-500k rows, ~7 columns mostly integers/bigint). Each row will be refreshed every 15 minutes, or at least that's how fast I can poll via SNMP. I hope I can tune the DB to keep up.
>
> The app is threaded and will likely have well over 100 concurrent db connections. Temp tables for storage aren't a preferred option since this is designed to be a shared nothing approach and I will likely have several polling processes.

Mark,

We have PostgreSQL databases on modest hardware doing exactly what you are attempting to do (a massive, scalable SNMP monitoring system). The monitoring volume for a single database server appears to exceed what you are trying to do by a few orders of magnitude with no scaling or performance issues, so I can state without reservation that PostgreSQL can easily handle your application in theory. However, that is predicated on having a well-architected system that minimizes resource contention and unnecessary blocking, and based on your description you may be going about it a bit wrong.

The biggest obvious bottleneck is the use of threads and massive process-level parallelization. As others have pointed out, async queues are your friends, as is partitioning the workload horizontally rather than vertically through the app stack. A very scalable high-throughput engine for SNMP polling only requires two or three threads handling different parts of the workload to saturate the network, and by choosing what each thread does carefully you can all but eliminate blocking when there is work to be done.

We only use a single database connection to insert all the data into PostgreSQL, and that process/thread receives its data from a work queue. Depending on how you design your system, you can batch many records in your queue as a single transaction. In our case, we also use very few updates, mostly just inserts, which is probably advantageous in terms of throughput if you have the disk for it. The insert I/O load is easily handled, and our disk array is a modest 10k SCSI rig. The only thing that really hammers the server is when multiple reporting processes are running, which frequently touch several million rows each (the database is much larger than the system memory), and even this is manageable with clever database design.

In short, what you are trying to do is easily doable on PostgreSQL in theory. However, restrictions on design choices may pose significant hurdles. We did not start out with an ideal system either; it took a fair amount of re-engineering to solve all the bottlenecks and problems that pop up.

Good luck,

J. Andrew Rogers
jrogers@neopolitan.com
At 12:34 PM 8/19/2005, Jeffrey W. Baker wrote: >On Fri, 2005-08-19 at 10:54 -0400, Ron wrote: > > Maxtor Atlas 15K II's. > > > Areca's 1GB buffer RAID cards > >The former are SCSI disks and the latter is an SATA controller. The >combination would have a transaction rate of approximately 0. You are evidently thinking of the Areca ARC-11xx controllers (and you are certainly right for that HW combination ;-) ). Those are not the only product Areca makes that can be upgraded to a 1GB cache. Until SAS infrastructure is good enough, U320 SCSI and FC HD's remain the top performing HD's realistically available. At the most fundamental, your DBMS is only as good as your HD IO subsystem, and your HD IO subsystem is only as good as your HDs. As others have said here, skimping on your HDs is _not_ a good design choice where DBMSs are concerned. As an aside, the Atlas 15K II's are now available in SAS: http://www.maxtor.com/portal/site/Maxtor/menuitem.ba88f6d7cf664718376049b291346068/?channelpath=/en_us/Products/SCSI%20Hard%20Drives/Atlas%2015K%20Family/Atlas%2015K%20II%20SAS I haven't seen independent benches on them, so I explicitly referenced the U320 Atlas 15K II's known performance numbers instead. As I said, Maxtor is claiming even better for the SAS version of the Atlas 15K II. None of the SAS <-> PCI-X or PCI-E RAID cards I know of are ready for mass market yet, although a few are in beta.. >I can vouch for the Areca controllers, however. You can certainly >achieve pgbench transaction rates in the hundreds per second even with >only 5 7200RPM disks and 128MB cache. > >Don't forget to buy the battery. Agreed. Hope this is helpful, Ron Peacetree
> While I agree that hundreds of threads seems like overkill, I think the above advice might be going too far in the other direction. The problem with single-threaded operation is that any delay affects the whole system --- eg, if you're blocked waiting for disk I/O, the CPU doesn't

You use UDP, which is a connectionless protocol... then why use threads?

I'd advise this:

Use asynchronous network code (one thread) to do your network stuff. This will lower the CPU used by this code immensely.

Every minute, dump a file containing everything to insert into the table. Use another thread to COPY it into the DB, in a temporary table if you wish, and then INSERT INTO ... SELECT.

This should be well adapted to your requirements.
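[Editor's note: a sketch of that COPY-then-merge idea, under the same illustrative assumptions as the earlier snippets (Python/psycopg2 and the hypothetical modem_state table); the scratch-table name and column list are made up. A batch of results is COPYed into a temporary table, then folded into the main table with a single UPDATE ... FROM.]

import io
import psycopg2

def flush_batch(conn, rows):
    """rows: iterable of (modem_id, last_poll, in_octets, out_octets, snr, tx_power, status)."""
    buf = io.StringIO()
    for r in rows:
        buf.write("\t".join(str(c) for c in r) + "\n")
    buf.seek(0)

    cur = conn.cursor()
    # Scratch table with the same shape as the target, dropped at commit.
    cur.execute("CREATE TEMP TABLE poll_batch (LIKE modem_state) ON COMMIT DROP")
    cur.copy_from(buf, "poll_batch",
                  columns=("modem_id", "last_poll", "in_octets", "out_octets",
                           "snr", "tx_power", "status"))
    # Fold the whole batch into the main table in one statement.
    cur.execute("""
        UPDATE modem_state m
           SET last_poll = b.last_poll, in_octets = b.in_octets,
               out_octets = b.out_octets, snr = b.snr,
               tx_power = b.tx_power, status = b.status
          FROM poll_batch b
         WHERE m.modem_id = b.modem_id
    """)
    conn.commit()

# Usage sketch: flush_batch(psycopg2.connect("dbname=modem_state"), accumulated_rows)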
Don't forget that Ultra 320 is the speed of the bus, not each drive. No matter how many honking 15k disks you put on a 320MB bus, you can only get 320MB/sec, and only so many outstanding IO/s on the bus.

Not so with SATA! Each drive is on its own bus, and you are only limited by the speed of your PCI-X bus, which can be as high as 800MB/sec at 133MHz/64bit. It's cheap and it's fast - all you have to do is pay for the enclosure, which can be a bit pricey, but there are some nice 24-bay and even 40-bay enclosures out there for SATA.

Yes, a 15k RPM drive will give you better seek time and better peak throughput, but put them all on a single U320 bus and you won't see much return past a stripe size of 3 or 4. If it's raw transactions per second, data warehouse style, it's all about the xlog, baby, which is sequential writes, and all about large block reads, which is sequential reads.

Alex Turner
NetEconomist

P.S. Sorry if I'm a bit punchy, I've been up since yesterday with server upgrade nightmares that continue ;)

On 8/19/05, Ron <rjpeace@earthlink.net> wrote: > Alex mentions a nice setup, but I'm pretty sure I know how to beat > that IO subsystems HW's performance by at least 1.5x or 2x. Possibly > more. (No, I do NOT work for any vendor I'm about to discuss.) > > Start by replacing the WD Raptors with Maxtor Atlas 15K II's. > At 5.5ms average access, 97.4MB/s outer track throughput, 85.9MB/s > average, and 74.4 MB/s inner track throughput, they have the best > performance characteristics of any tested shipping HDs I know > of. (Supposedly the new SAS versions will _sustain_ ~98MB/s, but > I'll believe that only if I see it under independent testing). > In comparison, the numbers on the WD740GD are 8.1ms average access, > 71.8, 62.9, and 53.9 MB/s outer, average and inner track throughputs > respectively. > > Be prepared to use as many of them as possible (read: as many you can > afford) if you want to maximize transaction rates, particularly for > small transactions like this application seems to be mentioning. > > Next, use a better RAID card. The TOL enterprise stuff (Xyratex, > Engino, Dot-hill) is probably too expensive, but in the commodity > market benchmarks indicate that that Areca's 1GB buffer RAID cards > currently outperform all the other commodity RAID stuff. > > 9 Atlas II's per card in a RAID 5 set, or 16 per card in a RAID 10 > set, should max the RAID card's throughput and come very close to, if > not attaining, the real world peak bandwidth of the 64b 133MHz PCI-X > bus they are plugged into. Say somewhere in the 700-800MB/s range. > > Repeat the above for as many independent PCI-X buses as you have for > a very fast commodity RAID IO subsystem. > > Two such configured cards used in the dame manner as mentioned by > Alex should easily attain 1.5x - 2x the transaction numbers mentioned > by Alex unless there's a bottleneck somewhere else in the system design. > > Hope this helps, > Ron Peacetree > > At 08:40 AM 8/19/2005, Alex Turner wrote: > >I have managed tx speeds that high from postgresql going even as high > >as 2500/sec for small tables, but it does require a good RAID > >controler card (yes I'm even running with fsync on). I'm using 3ware > >9500S-8MI with Raptor drives in multiple RAID 10s. The box wasn't too > >$$$ at just around $7k. I have two independant controlers on two > >independant PCI buses to give max throughput. on with a 6 drive RAID > >10 and the other with two 4 drive RAID 10s.
> > > >Alex Turner > >NetEconomist > > > >On 8/19/05, Mark Cotner <mcotner@yahoo.com> wrote: > > > Hi all, > > > I bet you get tired of the same ole questions over and > > > over. > > > > > > I'm currently working on an application that will poll > > > thousands of cable modems per minute and I would like > > > to use PostgreSQL to maintain state between polls of > > > each device. This requires a very heavy amount of > > > updates in place on a reasonably large table(100k-500k > > > rows, ~7 columns mostly integers/bigint). Each row > > > will be refreshed every 15 minutes, or at least that's > > > how fast I can poll via SNMP. I hope I can tune the > > > DB to keep up. > > > > > > The app is threaded and will likely have well over 100 > > > concurrent db connections. Temp tables for storage > > > aren't a preferred option since this is designed to be > > > a shared nothing approach and I will likely have > > > several polling processes. > > > > > > Here are some of my assumptions so far . . . > > > > > > HUGE WAL > > > Vacuum hourly if not more often > > > > > > I'm getting 1700tx/sec from MySQL and I would REALLY > > > prefer to use PG. I don't need to match the number, > > > just get close. > > > > > > Is there a global temp table option? In memory tables > > > would be very beneficial in this case. I could just > > > flush it to disk occasionally with an insert into blah > > > select from memory table. > > > > > > Any help or creative alternatives would be greatly > > > appreciated. :) > > > > > > 'njoy, > > > Mark > > > > > > > > > -- > > > Writing software requires an intelligent person, > > > creating functional art requires an artist. > > > -- Unknown > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > > > TIP 3: Have you checked our extensive FAQ? > > > > > > http://www.postgresql.org/docs/faq > > > > > > >---------------------------(end of broadcast)--------------------------- > >TIP 2: Don't 'kill -9' the postmaster > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match >
At 03:31 PM 8/19/2005, Alex Turner wrote: >Don't forget that Ultra 320 is the speed of the bus, not each drive. >No matter how many honking 15k disks you put on a 320MB bus, you can >only get 320MB/sec! and have so many outstanding IO/s on the bus. Of course. This is exactly why multi-channel SCSI and multichannel Fibre Channel cards exist; and why external RAID enclosures usually have multiple such cards in them... Even moderately acceptable U320 SCSI cards are dual channel at this point (think Adaptec dual channel AHAxxxx's), and Quad channel ones are just as common. The Quads will, of course, saturate a 64b 133MHz PCI-X bus. _IF_ the chipset on them can keep up. The current kings of RAID card performance are all Fibre Channel based, and all the ones I know of are theoretically capable of saturating a 64b 133MHz PCI-X bus. Again, _IF_ the chipset on them can keep up. Most commodity RAID card have neither adequate CPU nor enough buffer. Regardless of the peripheral IO technology they use. >Not so with SATA! Each drive is on it's own bus, and you are only >limited by the speed of your PCI-X Bus, which can be as high as >800MB/sec at 133Mhz/64bit. That's the Theory anyway, and latency should be lower as well. OTOH, as my wife likes to say "In theory, Theory and Practice are the same. In practice, they almost never are." You are only getting the performance you mention as long as your card can keep up with multiplexing N IO streams, crunching RAID 5 XORs (assuming you are using RAID 5), etc, etc. As I'm sure you know, "The chain is only as strong as its weakest link.". Most commodity SATA RAID cards brag about being able to pump 300MB/s (they were all over LW SF bragging about this!?), which in this context is woefully unimpressive. Sigh. I'm impressed with the Areca cards because they usually have CPUs that actually can come close to pushing the theoretical IO limit of the bus they are plugged into; and they can be upgraded to (barely) acceptable buffer amounts <rant>(come on, manufacturers! 4GB of DDR PC3200 is only -2- DIMMs, and shortly that will be enough to hold 8GB of DDR PC3200. Give us more buffer!)</rant>. >It's cheap and it's fast - all you have to do is pay for the >enclosure, which can be a bit pricey, but there are some nice 24bay >and even 40bay enclosures out there for SATA. I've even seen 48 bay ones. However, good enclosures, particularly for larger numbers of HDs, are examples of non-trivial engineering and priced accordingly. Too many times I see people buy "bargain" enclosures and set themselves and their organizations up for some _very_ unpleasant times that could easily have been avoided by being careful to buy quality products. "Pay when you buy or pay much more later." >Yes a 15k RPM drive will give you better seek time and better peak >through put, but put them all on a single U320 bus and you won't see >much return past a stripe size of 3 or 4 Agreed. Same holds for 2Gbps FC. Haven't tested 4Gbps FC personally yet, but I'm told the limit is higher in the manner you'd expect. >If it's raw transactions per second data warehouse style, it's all >about the xlog baby which is sequential writes, and all about large >block reads, which is sequential reads. > >Alex Turner >NetEconomist >P.S. Sorry if i'm a bit punchy, I've been up since yestarday with >server upgrade nightmares that continue ;) My condolences and sympathies. I've definitely been there and done that. Ron Peacetree
:) Most of the ppl on this list are systems programmers; however, I am not. The tool of choice for this app is Ruby and the libraries don't support async SNMP at the moment. I've done a good deal of async SNMP and the libraries that actually pull it off generally aren't that good (Net-SNMP and Perl's Net::SNMP).

Granted, UDP is connectionless to an extent, but you still have to send the PDU, and bind to the return socket and wait. If you batch the outgoing PDUs then you can get away with sending them out synchronously and listening on the returning socket synchronously, but it would require that your libraries support this.

I understand the concepts well enough, maybe I'll put together a patch. It would be much lower overhead than managing all those threads. Looks like it's gonna be a fun weekend.

Thanks again for all the great feedback.

'njoy,
Mark

On 8/19/05 2:11 PM, "PFC" <lists@boutiquenumerique.com> wrote: > >> While I agree that hundreds of threads seems like overkill, I think the >> above advice might be going too far in the other direction. The problem >> with single-threaded operation is that any delay affects the whole >> system --- eg, if you're blocked waiting for disk I/O, the CPU doesn't > > You use UDP which is a connectionless protocol... then why use threads ? > > I'd advise this : > > Use asynchronous network code (one thread) to do your network stuff. This > will lower the CPU used by this code immensely. > Every minute, dump a file contianing everything to insert into the table. > Use another thread to COPY it into the DB, in a temporary table if you > wish, and then INSERT INTO ... SELECT. > This should be well adapted to your requirements.
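[Editor's note: for what it's worth, here is a rough sketch of that batch-send-then-listen pattern, in Python rather than Ruby purely for illustration. The SNMP encoding is hand-waved behind a caller-supplied build_pdu function, and port 161 and the timeout values are assumptions; the point is one socket, a burst of sends, then a single collect loop instead of one waiting thread per modem.]

import socket
import time

def poll_batch(targets, build_pdu, timeout=5.0):
    """targets: list of modem IPs; build_pdu(ip) returns the request PDU bytes."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(0.5)

    for ip in targets:                      # send the whole burst up front
        s.sendto(build_pdu(ip), (ip, 161))

    replies = {}
    deadline = time.time() + timeout
    while len(replies) < len(targets) and time.time() < deadline:
        try:
            data, (ip, _port) = s.recvfrom(65535)
        except socket.timeout:
            continue                        # nothing yet; keep waiting until the deadline
        replies[ip] = data                  # a real poller would decode the PDU here
    return replies                          # any IP missing from the dict timed out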
Thanks again everyone for the excellent suggestions.

I looked into IO::Reactor, but after a few hours of fiddling decided I was getting the kind of performance I wanted from a slightly more than modest number of threads, and (due to dev timelines) I'll come back to patching the SNMP libraries for Ruby to do async using Reactor later.

I am unfortunately stuck with updates, but I think (with your suggestions) I've made it work for me.

MySQL = 1500 updates/sec
PostgreSQL w/10k updates per commit, single thread = 1400 updates/sec

Given the update heavy nature of this table I felt it was necessary to test during a vacuum. Turns out the hit wasn't that bad . . .

PostgreSQL w/10k updates per commit, single thread, during a vacuum = 1300 updates/sec

100-200 updates/sec is a small price to pay for mature stored procedures, more stored procedure language options, ACID compliance, MVCC, very few if any corrupt tables (I get about 2 a week from MySQL on the 40 DBs I manage), a more crash resistant db (they crash about once a month on one of my 40 MySQL dbs), and replication that actually works for more than a day before quitting for no apparent reason ;) [/flame off]

For those of you with Cox Communications cable modems, look forward to better customer service and cable plant management. :)

And if anyone's curious, here's the app I'm rebuilding/updating: http://www.mysql.com/customers/customer.php?id=16

We won runner-up behind the Sabre airline reservation system for MySQL app of the year. Needless to say they weren't too happy when they heard we might be switching DBs.

'njoy,
Mark

On 8/19/05 1:12 PM, "J. Andrew Rogers" <jrogers@neopolitan.com> wrote: > On 8/19/05 1:24 AM, "Mark Cotner" <mcotner@yahoo.com> wrote: >> I'm currently working on an application that will poll >> thousands of cable modems per minute and I would like >> to use PostgreSQL to maintain state between polls of >> each device. This requires a very heavy amount of >> updates in place on a reasonably large table(100k-500k >> rows, ~7 columns mostly integers/bigint). Each row >> will be refreshed every 15 minutes, or at least that's >> how fast I can poll via SNMP. I hope I can tune the >> DB to keep up. >> >> The app is threaded and will likely have well over 100 >> concurrent db connections. Temp tables for storage >> aren't a preferred option since this is designed to be >> a shared nothing approach and I will likely have >> several polling processes. > > > Mark, > > We have PostgreSQL databases on modest hardware doing exactly what you are > attempting to (massive scalable SNMP monitoring system). The monitoring > volume for a single database server appears to exceed what you are trying to > do by a few orders of magnitude with no scaling or performance issues, so I > can state without reservation that PostgreSQL can easily handle your > application in theory. > > However, that is predicated on having a well-architected system that > minimizes resource contention and unnecessary blocking, and based on your > description you may be going about it a bit wrong. > > The biggest obvious bottleneck is the use of threads and massive > process-level parallelization. As others have pointed out, async queues are > your friends, as is partitioning the workload horizontally rather than > vertically through the app stack.
A very scalable high-throughput engine > for SNMP polling only requires two or three threads handling different parts > of the workload to saturate the network, and by choosing what each thread > does carefully you can all but eliminate blocking when there is work to be > done. > > We only use a single database connection to insert all the data into > PostgreSQL, and that process/thread receives its data from a work queue. > Depending on how you design your system, you can batch many records in your > queue as a single transaction. In our case, we also use very few updates, > mostly just inserts, which is probably advantageous in terms of throughput > if you have the disk for it. The insert I/O load is easily handled, and our > disk array is a modest 10k SCSI rig. The only thing that really hammers the > server is when multiple reporting processes are running, which frequently > touch several million rows each (the database is much larger than the system > memory), and even this is manageable with clever database design. > > > In short, what you are trying to do is easily doable on PostgreSQL in > theory. However, restrictions on design choices may pose significant > hurdles. We did not start out with an ideal system either; it took a fair > amount of re-engineering to solve all the bottlenecks and problems that pop > up. > > Good luck, > > J. Andrew Rogers > jrogers@neopolitan.com > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq