Thread: Why would I want to use connection pooling middleware?
I have a PostgreSQL 8.3.5 server with max_connections = 400. At this moment, I have 223 open connections, including 64 from a bunch of webserver processes and about 100 from desktop machines running a particular application. The rest are from various scheduled processes and other assorted things. Now, I know there are projects like pgpool-II that can serve to pool connections to the server. Why would I want to do that, though? I understand why pooling within a process itself is a good thing. However, say I have two users running the same program on different desktop machines. At present, those applications connect with the same username/password that's tied to the program and not the actual user. It seems like if Abby and Barb end up sharing the same connection from the pool, and Abby runs some giant report query, then Barb would get held back while she waits for it to finish. Is that true? Even if not, what would be the advantage in the two of them sharing a connection? I'm just trying to wrap my head around this. Thanks! -- Kirk Strauser
Kirk Strauser <kirk@strauser.com> writes: > I have a PostgreSQL 8.3.5 server with max_connections = 400. At this > moment, I have 223 open connections, including 64 from a bunch of > webserver processes and about 100 from desktop machines running a > particular application. The rest are from various scheduled processes > and other assorted things. Now, I know there are projects like pgpool-II > that can serve to pool connections to the server. Why would I want > to do that, though? Idle backends eat resources that would be better spent elsewhere. (I'm assuming that the majority of those 223 backends aren't actually doing anything at any one instant.) As an example, any system catalog update has to be broadcast to all live backends, and they all have to dutifully search their catalog caches to flush stale entries. That costs the same whether the backend is being put to use or has been sitting idle for minutes. There's no percentage in trying to pool connections from applications that are constantly doing something; but webserver sessions tend to have lots of user "think time", as do similar DB environments, so often they can be pooled profitably. regards, tom lane
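For what it's worth, a quick way to check Tom's assumption about how many of those 223 backends are actually idle at any given instant; this sketch assumes 8.3, where an idle backend reports '<IDLE>' in pg_stat_activity.current_query (newer releases use a separate state column instead):

SELECT count(*) AS idle_backends
FROM pg_stat_activity
WHERE current_query = '<IDLE>';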
Tom Lane wrote: > Kirk Strauser <kirk@strauser.com> writes: > > I have a PostgreSQL 8.3.5 server with max_connections = 400. At this > > moment, I have 223 open connections, including 64 from a bunch of > > webserver processes and about 100 from desktop machines running a > > particular application. The rest are from various scheduled processes > > and other assorted things. Now, I know there are projects like pgpool-II > > that can serve to pool connections to the server. Why would I want > > to do that, though? > > Idle backends eat resources that would be better spent elsewhere. > (I'm assuming that the majority of those 223 backends aren't actually > doing anything at any one instant.) As an example, any system catalog > update has to be broadcast to all live backends, and they all have to > dutifully search their catalog caches to flush stale entries. That costs > the same whether the backend is being put to use or has been sitting > idle for minutes. Also, memory wasted in per-process memory is memory not used for caches. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Thu, Jan 15, 2009 at 10:54 AM, Kirk Strauser <kirk@strauser.com> wrote: > I have a PostgreSQL 8.3.5 server with max_connections = 400. At this > moment, I have 223 open connections, including 64 from a bunch of webserver > processes and about 100 from desktop machines running a particular > application. The rest are from various scheduled processes and other > assorted things. Now, I know there are projects like pgpool-II that can > serve to pool connections to the server. Why would I want to do that, > though? > > I understand why pooling within a process itself is a good thing. However, > say I have two users running the same program on different desktop machines. > At present, those applications connect with the same username/password > that's tied to the program and not the actual user. It seems like if Abby > and Barb end up sharing the same connection from the pool, and Abby runs > some giant report query, then Barb would get held back while she waits for > it to finish. Is that true? Even if not, what would be the advantage in > the two of them sharing a connection? There tend to be three effects that provide benefits: 1. Fewer connections tend to consume fewer resources on the DBMS server. Each connection consumes some resources (memory, lock table entries, and so on), and having fewer connections means that the aggregate size of the postmaster processes is likely to be smaller. 2. Pooling connections should mean that you can use and re-use connections, which should reduce the amount of work done building up and tearing down connections. Each PostgreSQL connection is handled by a separate OS process; if the connection pool is passing the same connection from user to user, your system is doing less work spawning backend processes, doing authentication, and otherwise getting from fork() to "ready to handle queries." 3. There is only so much *genuine* concurrency that you can actually get out of your DB server, and there is only limited value to having more backend processes than this "emergent quantity." For instance, if you only have a single CPU and a single disk drive, then your computer is only ever *truly* doing one thing at a time. Trying to make such a server service 200 connections, each trying to do work, means that this server will be doing a great deal of work just switching from process to process (context switches). That's an extreme point, of course, but it should be reasonably intuitive to consider that... - If you have 10 CPUs and a RAID array of 10 disk drives, then that host can likely cope comfortably with doing ~10 things at once; - Change those numbers to 20/20 and the intuition continues. If some of your 200 connections are only intermittently used, and you had a connection pool with 20 "real" connections, the 200 users would seldom notice delays due to sharing. And the connection pool usage would mean that the DB server would have way fewer processes kicking around consuming memory. You might well be better off putting the memory those extra 180 backends would have consumed toward shared cache :-). -- http://linuxfinances.info/info/linuxdistributions.html Joe E. Lewis - "There's only one thing money won't buy, and that is poverty."
Kirk Strauser <kirk@strauser.com> wrote: [snip] > I understand why pooling within a process itself is a good thing. > However, say I have two users running the same program on different > desktop machines. At present, those applications connect with the > same username/password that's tied to the program and not the actual > user. It seems like if Abby and Barb end up sharing the same > connection from the pool, and Abby runs some giant report query, then > Barb would get held back while she waits for it to finish. Is that > true? I don't believe that's true. My understanding of pgpool is that it will reuse an existing connection if it's free, or open a new one if required. -- Bill Moran http://www.potentialtech.com
On Jan 15, 2009, at 10:08 AM, Tom Lane wrote: > As an example, any system catalog update has to be broadcast to all > live backends, and they all have to dutifully search their catalog > caches to flush stale entries. That costs the same whether the > backend is being put to use or has been sitting idle for minutes. I didn't realize that. I wasn't sure what types of overheads were involved and didn't think about those sorts of things. > There's no percentage in trying to pool connections from applications > that are constantly doing something; but webserver sessions tend to have > lots of user "think time", as do similar DB environments, so often > they can be pooled profitably. That makes sense. Along those lines, how do you actually enable connection pooling in pgpool-II? I've RTFM a few times but it doesn't seem to have a flag for "enable_pooling". Is "num_init_children" effectively the same as a hypothetical "max_children"? If I set it to "1" and leave "max_pool" at "4", then clients queue up while one at a time gets to connect. Sorry, I know this isn't the pgpool-II mailing list. :-) -- Kirk Strauser
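For what it's worth, pooling in pgpool-II is not a separate switch you flip (if your version exposes a connection_cache setting, that is the closest thing, and it defaults to on). num_init_children is the number of preforked pgpool processes, i.e. how many clients can be connected at once, which is why setting it to 1 makes clients queue up one at a time; max_pool is how many backend connections each of those children caches. A minimal pgpool.conf sketch along those lines, with illustrative values rather than recommendations:

# preforked pgpool child processes; this is also the maximum number of
# concurrent client connections pgpool will accept
num_init_children = 32

# cached backend connections per child, reused only when an incoming session
# matches the cached user/database pair; worst case the backends can see
# num_init_children * max_pool connections
max_pool = 4

# recycle idle pgpool children and cached backend connections (seconds)
child_life_time = 300
connection_life_time = 600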
Kirk Strauser wrote: > I understand why pooling within a process itself is a good thing. > However, say I have two users running the same program on different > desktop machines. At present, those applications connect with the same > username/password that's tied to the program and not the actual user. > It seems like if Abby and Barb end up sharing the same connection from > the pool, and Abby runs some giant report query, then Barb would get > held back while she waits for it to finish. Is that true? Even if not, > what would be the advantage in the two of them sharing a connection? Connections are pooled on the client end, not on the server end. So, you'd be able to pool connections on your web server, and should, for reasons documented by others. However, since Abby and Barb are using different computers, you won't achieve anything by introducing pooling into your desktop application. -- Guy Rouillier
On Thu, Jan 15, 2009 at 11:57:13AM -0500, Guy Rouillier wrote: > Connections are pooled on the client end, not on the server end. So, > you'd be able to pool connections on your web server, and should, for > reasons documented by others. However, since Abby and Barb are using > different computers, you won't achieve anything by introducing pooling > into your desktop application. Connection pooling can function anywhere; you could use it client side (like the connection pooling built into, say, PHP) where what you say is true. You can also use something like pgpool on another box (maybe the database server itself) and point all clients at it. -- Sam http://samason.me.uk/
On Jan 15, 2009, at 10:20 AM, Bill Moran wrote:

> I don't believe that's true. My understanding of pgpool is that it will
> reuse an existing connection if it's free, or open a new one if required.

Gah! It just made it worse!

$ ps auxwww | grep pgpool | grep dbuser | wc -l
30
$ ps auxwww | grep postgres: | grep dbuser | wc -l
38

So not only is it not sharing connections among clients, but it's keeping old ones open too. This isn't really what I had in mind. Also, many of the applications are launched in the morning and open a connection that gets used all day and closed at the end of the afternoon. I'm starting to wonder if perhaps pgpool-II isn't what I was looking for. -- Kirk Strauser
Kirk Strauser wrote: > ... > > I understand why pooling within a process itself is a good thing. > However, say I have two users running the same program on different > desktop machines. At present, those applications connect with the > same username/password that's tied to the program and not the actual > user. It seems like if Abby and Barb end up sharing the same > connection from the pool, and Abby runs some giant report query, then > Barb would get held back while she waits for it to finish. Is that > true? Even if not, what would be the advantage in the two of them > sharing a connection?

Short answer: performance.

First you need the correct mental model. The "pool" in pooling is a pre-established pool of connections to the database. When a client connects to the pooler, the pooler decides, based on its configuration, which database connection will receive the statement sent to the pool by the client. The pooler can also monitor the connection pool and decide when to increase the pool size or release database connections.

Pgbouncer (part of the Skype tools) has worked very well for me. It can be configured to handle connections in a variety of ways. At one extreme, a database connection is used only for the duration of a single statement; then that connection is available and waiting for the next statement. At the other extreme, a connection is assigned and tied up for the entire time the client is connected to the pooler. In any case, Barb and Abby will not be sharing a database connection *simultaneously*: if they are both connected to the db through the pooler and both running queries at the same time, each of them will have her own connection to the database.

Whether or not you will benefit from pooling depends on the nature of your application. If, for example, the application runs queries by connecting, running the query, and disconnecting, and those queries are long-running and/or infrequent, then pooling will probably be of little if any benefit. At the other extreme, say you have a very high volume of short single-statement transactions (think web). You start seeing a significant impact from the connection setup/teardown time. Even if the app connects for each web hit, connecting to a pooler like pgbouncer is much faster than connecting directly to the database (I've done some tests showing nearly a factor of 10 improvement in page delivery rates). And if the pooler is set to statement-level pooling, you can have hundreds of web processes each maintaining a persistent connection to the pooler while the pooler only needs enough database connections for the simultaneous statements - generally a tiny fraction of the number of web processes.

The pooler typically has a number of other configuration parameters that govern things like the maximum amount of time a server connection will be used before it is dropped and reconnected, as well as how long idle database connections hang around before they are dropped.

Beware: you cannot just drop a pooler in place without understanding the applications that will connect through it. Setting GUC variables, for example, is a red flag. You could easily create a situation where you increase work_mem for a specific statement but that statement ends up assigned to a different database backend, or where a client connection is unaware of GUC changes made by a previously connected client that was assigned to that backend.
But if your application is designed to work well with pooling, it can provide dramatic performance benefits. Cheers, Steve
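A rough pgbouncer.ini sketch of the kind of setup Steve describes; the database name, path, and sizes are made-up placeholders rather than recommendations, and the GUC caveat above applies to transaction/statement mode:

[databases]
; clients connect to pgbouncer (default port 6432) using this name
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
; placeholder path; users/passwords pgbouncer may accept
auth_file = /etc/pgbouncer/userlist.txt
; statement, transaction, or session, per the trade-offs above
pool_mode = transaction
; how many clients may sit connected to the pooler
max_client_conn = 400
; real server connections opened per user/database pair
default_pool_size = 20
; drop idle server connections after this many seconds
server_idle_timeout = 600
; caveat from above: in transaction/statement mode, SET/GUC changes can leak
; between clients, since successive statements may land on different backends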
On Jan 15, 2009, at 12:30 PM, Steve Crawford wrote: > But if your application is designed to work well with pooling, it > can provide dramatic performance benefits. I think that's the problem. As I mentioned at one point, a lot of our applications have connections open for hours at a time and fire off queries when the user does something. I'm coming to think that pooling wouldn't give much benefit to long-lived processes like that. On a related note, is max_connections=400 reasonably sized for a server with 8GB of RAM? Again, most of these are dormant at any given time. The database itself is currently hosted on a dual Xeon server with 3GB of RAM and other applications, so I'm sure the new 8-core/8GB hardware is bound to do better at any rate. -- Kirk Strauser
In response to Kirk Strauser <kirk@strauser.com>: > On Jan 15, 2009, at 12:30 PM, Steve Crawford wrote: > > > But if your application is designed to work well with pooling, it > > can provide dramatic performance benefits. > > I think that's the problem. As I mentioned at one point, a lot of our > applications have connections open for hours at a time and fire off > queries when the user does something. I'm coming to think that > pooling wouldn't give much benefit to long-lived processes like that. Actually, that's exactly the kind of app that should benefit from pooling. However, pgpool can't pool connections if each connection has its own username. Not sure what exactly is causing it not to work for you, but that was the first thing that came to mind. > On a related note, is max_connections=400 reasonably sized for a > server with 8GB of RAM? Again, most of these are dormant at any given > time. The database itself is currently hosted on a dual Xeon server > with 3GB of RAM and other applications so I'm sure the new 8-core/8GB > hardware is bound to do better at any rate. It all depends. We had servers with 300 - 400 connections that did just fine. We've looked at pgpool, but deemed it not worth the added complexity. Quite honestly, why attempt to solve problems that don't even exist? Are you having a problem? If so, what is the problem? -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/
Kirk Strauser wrote: > On Jan 15, 2009, at 12:30 PM, Steve Crawford wrote: > >> But if your application is designed to work well with pooling, it can >> provide dramatic performance benefits. > > I think that's the problem. As I mentioned at one point, a lot of our > applications have connections open for hours at a time and fire off > queries when the user does something. I'm coming to think that > pooling wouldn't give much benefit to long-lived processes like that. > If you know that the application does not change GUC variables, then you will probably benefit greatly from using pgbouncer. If all the queries are single statements then set pool_mode=statement. If you have multiple-statement transactions then configure pgbouncer to use pool_mode=transaction. Either way, your app won't tie up a back-end connection when it is sitting idle. You will probably find that you can handle your hundreds of clients with a pretty small pool of backend connections. Pgbouncer will give you some nice statistics to help you adjust the pool sizing and such. > On a related note, is max_connections=400 reasonably sized for a > server with 8GB of RAM? Again, most of these are dormant at any given > time. The database itself is currently hosted on a dual Xeon server > with 3GB of RAM and other applications so I'm sure the new 8-core/8GB > hardware is bound to do better at any rate. Too little info (and others here can answer better anyway). But I think you should test pooling and find out how many connections you really need before jumping into tuning. I haven't tried Pgpool* but have found pgbouncer to be easy to use, reliable, and effective. Cheers, Steve
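Once pgbouncer is up, the statistics Steve mentions come from its built-in admin console. Roughly, assuming it listens on 6432 and the user you connect as is listed in admin_users (or stats_users):

$ psql -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
$ psql -p 6432 -U pgbouncer pgbouncer -c "SHOW STATS;"

SHOW POOLS reports active and waiting clients plus in-use and idle server connections per pool, which is what you want when deciding how big default_pool_size really needs to be.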
On Jan 15, 2009, at 2:39 PM, Bill Moran wrote: > However, pgpool can't pool connections if each connection has its > own username. Not sure what exactly is causing it not to work for you, > but that was the first thing that came to mind. The usernames are per-app. Zope connects with username "zope", for example. However, any given application might have 30 instances running at any time. > Are you having a problem? If so, what is the problem? Honestly? That so many people are singing the praises of connection pooling, and I thought I'd better at least see what the excitement is about. -- Kirk Strauser
On Thu, Jan 15, 2009 at 1:26 PM, Kirk Strauser <kirk@strauser.com> wrote: > On Jan 15, 2009, at 12:30 PM, Steve Crawford wrote: > >> But if your application is designed to work well with pooling, it can >> provide dramatic performance benefits. > > I think that's the problem. As I mentioned at one point, a lot of our > applications have connections open for hours at a time and fire off queries > when the user does something. I'm coming to think that pooling wouldn't > give much benefit to long-lived processes like that. As mentioned, this is exactly what pooling is good for. Imagine 200 users each accessing the database once every hour. If on average you have two or three users actually hitting the database, then you could handle that many users with only 10 or 20 database connections. The apps each have a persistent connection up to the pool service, which then routes the active sessions through a connection pool so the db only thinks it has 20 or fewer users. > On a related note, is max_connections=400 reasonably sized for a server with > 8GB of RAM? Again, most of these are dormant at any given time. The > database itself is currently hosted on a dual Xeon server with 3GB of RAM > and other applications so I'm sure the new 8-core/8GB hardware is bound to > do better at any rate. That depends very much on how many of those are ever active, or could become active. Things get ugly when everyone at the company needs to log in at once. Suddenly, the 390 connections out of the 400 that were idle go active, and the server grinds to a crawl under the load. A connection pool would simply hold the connection in a wait state until one of the 20 or so live connections to the db became available, preventing server meltdown.
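Scott's arithmetic, written out against the pgbouncer knobs from earlier in the thread; all numbers here are illustrative assumptions, not measurements:

; 200 users, each doing maybe a minute's worth of database work per hour:
;   200 * (1 minute / 60 minutes) = ~3 connections busy on average
; so a pool of 10-20 server connections covers the average with room for bursts,
; while max_client_conn leaves space for every client to stay connected to the pooler
max_client_conn = 400
default_pool_size = 20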
On Jan 15, 2009, at 2:54 PM, Steve Crawford wrote: > If you know that the application does not change GUC variables, then > you will probably benefit greatly from using pgbouncer. Thanks, Steve! That's just the kind of pointer I can use. I've been using PostgreSQL for years but I've never really gone far into low-level optimizations like this. -- Kirk Strauser
In response to Kirk Strauser <kirk@strauser.com>: > On Jan 15, 2009, at 2:39 PM, Bill Moran wrote: > > > However, pgpool can't pool connections if each connection has its > > own username. Not sure what exactly is causing it not to work for you, > > but that was the first thing that came to mind. > > The usernames are per-app. Zope connects with username "zope", for > example. However, any given application might have 30 instances > running at any time. You might be hitting up against pgpool being pre-emptive on startup; i.e., it's establishing a bunch of connections right off the bat so they're available right away. If your application actually uses fewer connections than pgpool maintains, then it's not going to be a benefit. > > Are you having a problem? If so, what is the problem? > > Honestly? That so many people are singing the praises of connection > pooling, and I thought I'd better at least see what the excitement is > about. Well, it's a good instinct to look into stuff like that. Especially now that you've discovered that it's not cut and dried. Try exercising your application under load to see if pgpool helps. If it keeps extra connections open during idle time, that won't really hurt much, but if it reduces server load under stress, that's worthwhile. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/
On Thursday 15 January 2009 09:54:50 Kirk Strauser wrote: > I have a PostgreSQL 8.3.5 server with max_connections = 400. At this > moment, I have 223 open connections, including 64 from a bunch of > webserver processes and about 100 from desktop machines running a > particular application. The rest are from various scheduled processes > and other assorted things. Now, I know there are projects like pgpool-II > that can serve to pool connections to the server. Why would I want > to do that, though? After installing and configuring PgBouncer and then pointing all of our clients at it, our average number of database connections has dropped from 250+ to 17. Query times are also much better, and more RAM is going to caching than to holding processes. Count me as a new fan. -- Kirk Strauser