Thread: changing MyDatabaseId
I've spent a few hours poring over the source code with a fine-toothed comb, trying to figure out just exactly what might break if we changed MyDatabaseId after backend startup time, or in other words, allowed a backend to unbind from the database to which it was originally bound and rebind to a new one. This is related to the periodic conversations we've had about a built-in connection pooler, and/or maintaining a pool of worker threads that could be used to service parallel query, replication sets, etc. What follows is not meant to be a concrete design proposal; it's basic research that may lead to a proposal at some time in the future. Still, comments are welcome.

For the use cases that seem interesting, it is sufficient to think about changing MyDatabaseId when there is no transaction in progress, or perhaps when there is a transaction in progress that does that and nothing else. This means that code that simply uses MyDatabaseId in the course of going about its duties is not going to blow up. However, anything that stores the current database ID someplace is a hazard, as is any code that stores information which was obtained using the current database ID.

Looking through the code, it appears to me that we'd need to do the following (not necessarily in this order):

1. Blow away all of our catalog caches as if we've received a sinval reset. Since most of our caches are hooked into the catalog cache via invalidation hooks, this should trigger cascading invalidations of other caches that need it, at least in most cases. (Further looking is probably needed to identify any cases where a sinval-reset-equivalent is insufficient.)

2. Blow away absolutely everything from the relcache. It appears to me that we'd need to repeat the phase 2 and 3 initialization steps. We can't even keep the entries for shared catalogs, because the descriptors that formrdesc() coughs up may not exactly match what we read from the pg_class entry. It's tempting to propose adding enough additional shared catalogs to make the relcache entries for shared catalogs independent of any particular database's contents, but right now they are not.

3. Reinitialize the relmapper.

4. Release all locks on objects in the old database, and our shared lock on the database itself. Take a new shared lock on the new database.

5. Flush out any backend-local statistics that have been gathered but not yet sent to the statistics collector.

6. Update our entry in the PgBackendStatus array so pg_stat_activity sees the new database assignment.

7. Change MyProc->databaseId while holding ProcArrayLock in LW_EXCLUSIVE mode.

8. Update MyDatabaseTableSpace.

I haven't benchmarked this (so perhaps I should STFU), but #2 and maybe #1 and #3 sound like the most painful parts of this activity. Still, there's not much help for it that I can see. The best you can hope for is to keep around the entries for the shared catalogs, and that's not going to get you terribly far; and at least ATM even that is unsafe.

I think maybe the thing to do is try to quantify how much time is being spent in each part of the backend startup process - e.g. connect to the database lots and lots of times with individual connections, run a trivial SELECT against a table, and then disconnect and repeat. The problem is that (unless someone knows how to do magic tricks with oprofile or dtrace) you're only going to find out the amount of time actually spent executing each portion of the code.
It won't tell you how much overhead you have from fork() itself, let alone how much faster the initialization steps would have run if they had been running on a warmed-up process address space rather than one that might well need to fault in a lot of page-table entries in its first few moments of life. But it might still provide some useful data.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
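For concreteness, here is one way steps 1 through 8 might be strung together, assuming no transaction is in progress. Most of the function names below do exist in the backend sources, but treat the whole thing as pseudocode: the ordering is only a guess, error recovery and recomputation of the database path are omitted entirely, and the DROP DATABASE race discussed below is not actually solved just by taking ProcArrayLock at one point.

    #include "postgres.h"

    #include "catalog/pg_database.h"
    #include "miscadmin.h"
    #include "pgstat.h"
    #include "storage/lmgr.h"
    #include "storage/lwlock.h"
    #include "storage/proc.h"
    #include "utils/inval.h"
    #include "utils/relcache.h"
    #include "utils/relmapper.h"

    /*
     * RebindToDatabase -- hypothetical sketch, not a proposal.
     */
    static void
    RebindToDatabase(Oid newDbId, Oid newTablespaceId)
    {
        /* (5) Push out statistics not yet sent to the collector. */
        pgstat_report_stat(true);

        /* (4) Release all locks on objects in the old database,
         * including our session-level shared lock on the database
         * itself. */
        LockReleaseAll(DEFAULT_LOCKMETHOD, true);

        /* (7) Advertise the new binding while holding ProcArrayLock,
         * so a concurrent DROP DATABASE scanning the ProcArray sees
         * either the old value or the new one, never garbage. */
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        MyProc->databaseId = newDbId;
        LWLockRelease(ProcArrayLock);

        MyDatabaseId = newDbId;
        MyDatabaseTableSpace = newTablespaceId;     /* (8) */

        /* (4, continued) Take the shared lock on the new database. */
        LockSharedObject(DatabaseRelationId, MyDatabaseId, 0,
                         RowExclusiveLock);

        /* (1) Behave as if we'd received a sinval reset: flush the
         * catalog caches, which cascades to most dependent caches via
         * their registered invalidation callbacks. */
        InvalidateSystemCaches();

        /* (3) Reload the relation map for the new database. */
        RelationMapInitializePhase3();

        /* (2) Rebuild the relcache from scratch, repeating the phase 2
         * and 3 initialization steps. */
        RelationCacheInitializePhase2();
        RelationCacheInitializePhase3();

        /* (6) Refresh our PgBackendStatus entry so pg_stat_activity
         * reports the new database. */
        pgstat_bestart();
    }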
Robert Haas <robertmhaas@gmail.com> writes:
> Looking through the code, it appears to me that we'd need to do the
> following (not necessarily in this order):

Don't forget

9. Unload loadable modules that do not exist according to the new database's catalogs; e.g. we don't want postgis trying to run when its supporting tables don't exist in the selected database.

10. Somehow persuade remaining loadable modules to discard whatever state they have that might be database-local.

We don't have workable APIs for either of those operations ATM.

I believe also that there are probably race conditions in several of the steps you listed; in particular, there is certainly a risk involved in changing the database-we-advertise-being-connected-to versus a concurrent DROP DATABASE. Maybe that's easily soluble, but I'm not sure about it.

			regards, tom lane
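Neither of those module-facing APIs exists today; purely for illustration, a minimal sketch of what a rebind callback might look like (every name here is hypothetical):

    /* Hypothetical API -- nothing like this exists today. */
    typedef void (*rebind_callback_type) (Oid oldDbId, Oid newDbId,
                                          void *arg);

    extern void RegisterRebindCallback(rebind_callback_type callback,
                                       void *arg);

    /* A module could then register from _PG_init() and discard its
     * database-local state whenever the backend rebinds: */
    static void
    my_rebind_callback(Oid oldDbId, Oid newDbId, void *arg)
    {
        discard_database_local_state();  /* hypothetical module cleanup */
    }

    void
    _PG_init(void)
    {
        RegisterRebindCallback(my_rebind_callback, NULL);
    }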
Excerpts from Tom Lane's message of lun nov 15 02:41:40 -0300 2010:

> I believe also that there are probably race conditions in several of
> the steps you listed; in particular, there is certainly a risk involved
> in changing the database-we-advertise-being-connected-to versus a
> concurrent DROP DATABASE. Maybe that's easily soluble, but I'm not sure
> about it.

Perhaps it would help to first disconnect from the original database and not reconnect to any one in particular, i.e. stay in a state like the one the autovac launcher is in. Then clean up as per Robert's list, then switch to a valid database ID.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, Nov 15, 2010 at 12:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Looking through the code, it appears to me that we'd need to do the
>> following (not necessarily in this order):
>
> Don't forget
>
> 9. Unload loadable modules that do not exist according to the new
> database's catalogs; e.g. we don't want postgis trying to run when
> its supporting tables don't exist in the selected database.

I might be missing something here, but I wasn't aware that the catalogs had any say in which loadable modules exist. Obviously their contents will determine whether such loadable modules ever get invoked as a result of actions at the SQL level, but if they don't, it may not matter much that they're loaded but idle.

> 10. Somehow persuade remaining loadable modules to discard whatever
> state they have that might be database-local.

Ouch.

> We don't have workable APIs for either of those operations ATM.

Somewhat independently of this problem, it would be nice to be able to unload a loadable module, and I think that we currently don't support that at all. One thing (the only thing?) that makes that a show-stopper is that the way we manage hook function chaining makes it impossible for a module to reliably unload itself. If you load both auto_explain and pg_stat_statements, for example, you end up with a tangle of pointers that can't be easily unwound. I don't have a real clear idea how to fix it, but the status quo certainly doesn't seem optimal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
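To make the unload problem concrete, here is the usual chaining idiom, roughly as auto_explain and pg_stat_statements use it, with a hypothetical _PG_fini() showing why a module in the middle of the chain cannot safely remove itself:

    #include "postgres.h"
    #include "executor/executor.h"

    static ExecutorStart_hook_type prev_ExecutorStart = NULL;

    static void
    my_ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /* ... module-specific work would go here ... */
        if (prev_ExecutorStart)
            prev_ExecutorStart(queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }

    void
    _PG_init(void)
    {
        /* Save whatever hook was installed before us, then splice
         * ourselves in at the head of the chain. */
        prev_ExecutorStart = ExecutorStart_hook;
        ExecutorStart_hook = my_ExecutorStart;
    }

    void
    _PG_fini(void)
    {
        /*
         * We can only unhook if we're still at the head of the chain.
         * If another module loaded after us, its saved "prev" pointer
         * still points at my_ExecutorStart, and we have no way to fix
         * it up -- hence the tangle.
         */
        if (ExecutorStart_hook == my_ExecutorStart)
            ExecutorStart_hook = prev_ExecutorStart;
    }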
Robert,

On 11/15/2010 05:39 AM, Robert Haas wrote:
> I've spent a few hours poring over the source code with a fine-toothed
> comb, trying to figure out just exactly what might break if we changed
> MyDatabaseId after backend startup time, or in other words, allowed a
> backend to unbind from the database to which it was originally bound
> and rebind to a new one. This is related to the periodic conversations
> we've had about a built-in connection pooler, and/or maintaining a pool
> of worker threads that could be used to service parallel query,
> replication sets, etc. What follows is not meant to be a concrete
> design proposal; it's basic research that may lead to a proposal at
> some time in the future. Still, comments are welcome.

Thanks a lot for doing that, saved me a couple of hours (presumably more than it cost you :-)

> Thoughts?

The question obviously is whether or not this is faster than just terminating one backend and starting a new one, which basically costs an additional termination and re-creation of a process (i.e. fork()), AFAICS. Or what other savings do you envision?

If that's it, it certainly seems like a huge amount of work for very little benefit. Or does this feature enable something that's impossible to do otherwise?

Regards

Markus Wanner
On Wednesday 17 November 2010 11:04:04 Markus Wanner wrote:
> Robert,
>
> On 11/15/2010 05:39 AM, Robert Haas wrote:
>> I've spent a few hours poring over the source code with a fine-toothed
>> comb, trying to figure out just exactly what might break if we changed
>> MyDatabaseId after backend startup time, or in other words, allowed a
>> backend to unbind from the database to which it was originally bound
>> and rebind to a new one. This is related to the periodic conversations
>> we've had about a built-in connection pooler, and/or maintaining a
>> pool of worker threads that could be used to service parallel query,
>> replication sets, etc. What follows is not meant to be a concrete
>> design proposal; it's basic research that may lead to a proposal at
>> some time in the future. Still, comments are welcome.
>
> Thanks a lot for doing that, saved me a couple of hours (presumably
> more than it cost you :-)
>
>> Thoughts?
>
> The question obviously is whether or not this is faster than just
> terminating one backend and starting a new one, which basically costs
> an additional termination and re-creation of a process (i.e. fork()),
> AFAICS. Or what other savings do you envision?

Well, one could optimize most of the resetting away if the old MyDatabaseId and the new one are the same - an optimization which is hardly possible with forking new backends.

Also I think it could lower the impact due to locking the procarray and related areas.

Andres
Andres,

On 11/17/2010 11:38 AM, Andres Freund wrote:
> Well, one could optimize most of the resetting away if the old
> MyDatabaseId and the new one are the same - an optimization which is
> hardly possible with forking new backends.

Uh? Why not simply re-use the same backend, then? Or do you think of re-connecting to the same database as a way to reset your connection?

> Also I think it could lower the impact due to locking the procarray
> and related areas.

That may be, yes. But as pointed out by Tom and Alvaro, you'd have to adhere to a compatible sequence of changes to shared memory to avoid race conditions. That possibly requires using a very similar locking sequence as the combination of a detaching and a newly starting backend would use.

Regards

Markus Wanner
On Wednesday 17 November 2010 11:58:33 Markus Wanner wrote:
> Andres,
>
> On 11/17/2010 11:38 AM, Andres Freund wrote:
>> Well, one could optimize most of the resetting away if the old
>> MyDatabaseId and the new one are the same - an optimization which is
>> hardly possible with forking new backends.
>
> Uh? Why not simply re-use the same backend, then? Or do you think of
> re-connecting to the same database as a way to reset your connection?

I am thinking of a connection-pooler like setup. Quite often your main load goes towards a single database - in that situation you don't have to reset the database id most of the time.

Andres
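In code form, this is just a fast path at the top of whatever rebind routine we'd end up with; a sketch only, assuming such a routine exists at all:

    /* Hypothetical fast path in RebindToDatabase(): the pooler hands
     * us a new client for the same database. */
    if (newDbId == MyDatabaseId)
    {
        /* Caches, relmapper state, and the database lock all remain
         * valid; only session-level state (user identity, GUCs, temp
         * namespace) would still need resetting. */
        return;
    }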
On 11/17/2010 12:09 PM, Andres Freund wrote:
> I am thinking of a connection-pooler like setup. Quite often your main
> load goes towards a single database - in that situation you don't have
> to reset the database id most of the time.

Okay, so that's what I'd call a connection reset or something. Or probably a re-authentication feature which allows switching users (and thus permissions on that same database). Could make sense for a connection pooler, yes. Not having to flush caches in that case could even turn out to be a good win. And I think it's a lot simpler to just switch the user than to switch the database. Such a feature looks more promising to me. (Do we have other connection-related and unchangeable state?)

As side notes: for the bgworkers in Postgres-R, I'm currently re-using existing backends. As they only do change set application, the amount of connection reset required is minimal (well, there isn't a client connection for these kinds of backends in the first place, but that's another story). Plus they are always acting as superuser; no authentication or user switching is required in that case. For parallel querying as well as async transactions, it might make sense to be able to switch users but remain connected to the same database.

Regards

Markus Wanner
Excerpts from Markus Wanner's message of mié nov 17 07:04:04 -0300 2010:

>> Thoughts?
>
> The question obviously is whether or not this is faster than just
> terminating one backend and starting a new one, which basically costs
> an additional termination and re-creation of a process (i.e. fork()),
> AFAICS. Or what other savings do you envision?

I don't think it's a speed thing only. It would be a great thing to have in autovacuum, for example, where we have constant problem reports because the system failed to fork a new backend. If we could simply reuse an already existing one, it would be a lot more robust.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 11/17/2010 01:27 PM, Alvaro Herrera wrote:
> I don't think it's a speed thing only. It would be a great thing to
> have in autovacuum, for example, where we have constant problem reports
> because the system failed to fork a new backend. If we could simply
> reuse an already existing one, it would be a lot more robust.

Hm, that's an interesting point.

To actually increase robustness, it would have to be a failure scenario that (temporarily) prevents forking, but allows an existing backend to continue to do work (the ability to allocate memory or open files comes to mind).

Any idea about what's usually causing these fork() failures? I'm asking because I'm afraid that, for example, in case of an out-of-memory condition, we'd just hit an OOM error later on, without being able to perform the VACUUM job either.

Regards

Markus Wanner
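For what it's worth, POSIX specifies only two errno values for fork(), and they fail in usefully different ways for this discussion; a trivial standalone demonstration:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int
    main(void)
    {
        pid_t   pid = fork();

        if (pid < 0)
        {
            /*
             * EAGAIN: per-user process limit (RLIMIT_NPROC) hit or the
             * kernel process table is full -- an overloaded box,
             * usually transient, and an already-running backend could
             * keep working through it.
             * ENOMEM: the kernel couldn't allocate bookkeeping for the
             * child -- in which case an existing backend may well hit
             * OOM later too, as Markus suspects.
             */
            fprintf(stderr, "fork failed: %s\n", strerror(errno));
            return 1;
        }
        if (pid == 0)
            _exit(0);           /* child does nothing in this demo */
        return 0;
    }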
Excerpts from Markus Wanner's message of mié nov 17 09:57:18 -0300 2010:
> On 11/17/2010 01:27 PM, Alvaro Herrera wrote:
>> I don't think it's a speed thing only. It would be a great thing to
>> have in autovacuum, for example, where we have constant problem
>> reports because the system failed to fork a new backend. If we could
>> simply reuse an already existing one, it would be a lot more robust.
>
> Hm, that's an interesting point.
>
> To actually increase robustness, it would have to be a failure scenario
> that (temporarily) prevents forking, but allows an existing backend to
> continue to do work (the ability to allocate memory or open files
> comes to mind).

Well, the autovacuum mechanism involves a lot of back-and-forth between launcher and postmaster, which includes some signals, a fork() and backend initialization. The failure possibilities are endless.

Fork failure communication is similarly brittle.

> Any idea about what's usually causing these fork() failures?

To be honest, I have no idea. Sometimes the server is just too loaded. Right now we have this "delay": if the process is not up and running in 60 seconds, then we have to assume that "something" happened, and we no longer wait for it. If we knew the process was already there, we could leave it alone; we'd know it would get to its duty eventually.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 11/17/2010 02:19 PM, Alvaro Herrera wrote:
> Well, the autovacuum mechanism involves a lot of back-and-forth between
> launcher and postmaster, which includes some signals, a fork() and
> backend initialization. The failure possibilities are endless.
>
> Fork failure communication is similarly brittle.

I certainly agree with that. However, a re-connecting mechanism wouldn't allow us to get rid of the existing avworker startup infrastructure entirely. And for increased robustness, we'd require a less brittle re-connecting mechanism. Given Robert's list, that doesn't seem trivial, either. (But still doable, yes.)

> Right now we have this "delay": if the process is not up and running in
> 60 seconds, then we have to assume that "something" happened, and we no
> longer wait for it. If we knew the process was already there, we could
> leave it alone; we'd know it would get to its duty eventually.

You are assuming the presence of a pool here. Which is fine; it's just not something that a re-connecting feature would solve per se. (Says he who coded the bgworkers pool thingie.)

Regards

Markus Wanner
Markus Wanner <markus@bluegap.ch> writes:
> On 11/17/2010 02:19 PM, Alvaro Herrera wrote:
>> Well, the autovacuum mechanism involves a lot of back-and-forth between
>> launcher and postmaster, which includes some signals, a fork() and
>> backend initialization. The failure possibilities are endless.
>>
>> Fork failure communication is similarly brittle.

> I certainly agree with that. However, a re-connecting mechanism
> wouldn't allow us to get rid of the existing avworker startup
> infrastructure entirely.

I'm afraid that any such change would trade a visible, safe failure mechanism (no avworker) for invisible, impossible-to-debug data corruption scenarios (due to failure to reset some bit of cached state). It certainly won't give me any warm fuzzy feeling that I can trust autovacuum.

			regards, tom lane
On 11/17/2010 04:25 PM, Tom Lane wrote:
> I'm afraid that any such change would trade a visible, safe failure
> mechanism (no avworker) for invisible, impossible-to-debug data
> corruption scenarios (due to failure to reset some bit of cached state).
> It certainly won't give me any warm fuzzy feeling that I can trust
> autovacuum.

Well, Alvaro doesn't quite seem to have a warm fuzzy feeling with the status quo, either. And I can certainly understand his concerns.

But yes, the OS-level process separation and cache state reset guarantee that an exit() / fork() pair provides is hard to match up against in user space. So Alvaro's argument for robustness only stands under the assumption that we can achieve a perfect cache state reset mechanism.

Now, how feasible is that? Are there any kinds of tools that could help us check?

Regards

Markus Wanner
On Wed, Nov 17, 2010 at 5:04 AM, Markus Wanner <markus@bluegap.ch> wrote:
> The question obviously is whether or not this is faster than just
> terminating one backend and starting a new one.

I agree.

> Which basically costs an additional termination and re-creation of a
> process (i.e. fork()), AFAICS. Or what other savings do you envision?

I think the thing I'd like to do (or see somebody else do) is write a test harness that connects to the database, runs a single SELECT * FROM dual (a one-row, one-column table), and then disconnects; and then oprofile the daylights out of the backend processes. In other words, I'd like to measure as exactly as we can the overhead of each part of the startup process. I think that would give us a clearer picture of where the overhead is, and then we could look more directly at which line items might be avoidable for rebinding to a new database.

However, that test doesn't capture everything. For example, imagine a connection pooler sitting in front of PG. Rebinding to a new database means disconnecting a TCP connection and establishing a new one. Switching databases might save some latency there even if we don't actually save much in terms of CPU instructions. Maybe that's not important, though. I don't know. I don't want to let my theorizing get too far ahead of the data.

It also occurs to me to wonder whether there's some way that we can speed up backend startup, period. One of the frequent complaints levied against PostgreSQL is that our connections are too expensive. AFAICT, this is partly because backend startup costs are high, and partly because of internal contention, especially around ProcArrayLock. Squeezing on the startup overhead will be valuable no matter what we decide to do about database switching. A variant on this theme would be to try to adjust the startup sequence in some way so that less of it needs to be redone if we switch databases, which might be possible even if a more general solution isn't. Unfortunately, I'm not sure how feasible this is, but maybe there's a way...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
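A minimal version of that harness using libpq might look like the following. It assumes a one-row, one-column table named dual exists in the target database, and as noted above it measures only end-to-end wall-clock time per cycle; oprofile or similar is still needed to attribute the cost to individual startup phases.

    #include <stdio.h>
    #include <time.h>

    #include <libpq-fe.h>

    /* Build with: cc harness.c -lpq (older glibc also needs -lrt). */
    int
    main(void)
    {
        const int       iterations = 1000;
        struct timespec start,
                        end;
        int             i;

        clock_gettime(CLOCK_MONOTONIC, &start);

        for (i = 0; i < iterations; i++)
        {
            PGconn     *conn = PQconnectdb("dbname=postgres");
            PGresult   *res;

            if (PQstatus(conn) != CONNECTION_OK)
            {
                fprintf(stderr, "connect failed: %s",
                        PQerrorMessage(conn));
                PQfinish(conn);
                return 1;
            }
            res = PQexec(conn, "SELECT * FROM dual");
            if (PQresultStatus(res) != PGRES_TUPLES_OK)
                fprintf(stderr, "query failed: %s",
                        PQerrorMessage(conn));
            PQclear(res);
            PQfinish(conn);     /* disconnect, then repeat */
        }

        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("%d connect/select/disconnect cycles in %.3f seconds\n",
               iterations,
               (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9);
        return 0;
    }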
On Wed, Nov 17, 2010 at 4:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> However, that test doesn't capture everything. For example, imagine a
> connection pooler sitting in front of PG. Rebinding to a new database
> means disconnecting a TCP connection and establishing a new one.
> Switching databases might save some latency there even if we don't
> actually save much in terms of CPU instructions. Maybe that's not
> important, though. I don't know. I don't want to let my theorizing
> get too far ahead of the data.

Everything you said is true, but there's more. A freshly created backend needs to build relcache entries for every relation in your query. A reused connection eventually warms up the relcache and syscaches and can plan new queries using them without doing any syscalls. And of course a query that's already been planned might be able to reuse the entire plan structure without being replanned.

-- 
greg
On Wed, Nov 17, 2010 at 12:42 PM, Greg Stark <gsstark@mit.edu> wrote:
> Everything you said is true, but there's more. A freshly created
> backend needs to build relcache entries for every relation in your
> query. A reused connection eventually warms up the relcache and
> syscaches and can plan new queries using them without doing any
> syscalls. And of course a query that's already been planned might be
> able to reuse the entire plan structure without being replanned.

I think you're missing the point. If we switch databases, all cached relations and plans have to be flushed anyway. We're talking about what might NOT need to be flushed on switching databases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 17, 2010 at 6:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think you're missing the point. If we switch databases, all cached
> relations and plans have to be flushed anyway. We're talking about
> what might NOT need to be flushed on switching databases.

Oh sorry, yes, I missed that point.

I will mention that your point about TCP connection establishment latency is real. TCP connection establishment adds several milliseconds of latency (for low-latency connections -- obviously it's much, much worse for long-haul connections), and then the congestion-control slow start adds more if there's a significant amount of data to transfer.

-- 
greg
Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of lun nov 15 02:41:40 -0300 2010:
>> I believe also that there are probably race conditions in several of
>> the steps you listed; in particular, there is certainly a risk involved
>> in changing the database-we-advertise-being-connected-to versus a
>> concurrent DROP DATABASE. Maybe that's easily soluble, but I'm not
>> sure about it.
>
> Perhaps it would help to first disconnect from the original database
> and not reconnect to any one in particular, i.e. stay in a state like
> the one the autovac launcher is in. Then clean up as per Robert's list,
> then switch to a valid database ID.

One question is whether we keep our caches in the hope that we will continue with the same database, or clear our caches so we don't have to incur the overhead of clearing them when we start a new session.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +