Thread: Re: O(1) DSM handle operations
On Mon, Mar 27, 2017 at 5:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> This is just a thought for discussion, no patch attached...
>
> DSM operations dsm_create(), dsm_attach(), dsm_unpin_segment() perform
> linear searches of the dsm_control->item array for either a free slot
> or a slot matching a given handle. Maybe no one thinks this is a
> problem, because in practice the number of DSM slots you need to scan
> should be something like number of backends * some small factor at
> peak.

One thing I thought about when designing the format of the DSM control
segment was that we need to (attempt to) reread the old segment after
recovering from a crash, even if it's borked.  With the current design,
I think that nothing too bad can happen even if some or all of the old
control segment has been overwritten with gibberish.  I mean, if we get
particularly unlucky, we might manage to remove a DSM segment that some
other cluster is using, but we'd have to be very unlucky for things to
even get that bad, and we shouldn't crash outright.

If we replace the array with some more complicated data structure, we'd
have to be sure that reading it is robust against it having been
scrambled by a previous crash.  Otherwise, it won't be possible to
restart the cluster without manual intervention.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 28, 2017 at 3:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 27, 2017 at 5:13 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> This is just a thought for discussion, no patch attached...
>>
>> DSM operations dsm_create(), dsm_attach(), dsm_unpin_segment() perform
>> linear searches of the dsm_control->item array for either a free slot
>> or a slot matching a given handle. Maybe no one thinks this is a
>> problem, because in practice the number of DSM slots you need to scan
>> should be something like number of backends * some small factor at
>> peak.
>
> One thing I thought about when designing the format of the DSM control
> segment was that we need to (attempt to) reread the old segment after
> recovering from a crash, even if it's borked.  With the current
> design, I think that nothing too bad can happen even if some or all of
> the old control segment has been overwritten with gibberish.  I mean,
> if we get particularly unlucky, we might manage to remove a DSM
> segment that some other cluster is using, but we'd have to be very
> unlucky for things to even get that bad, and we shouldn't crash
> outright.
>
> If we replace the array with some more complicated data structure,
> we'd have to be sure that reading it is robust against it having been
> scrambled by a previous crash.  Otherwise, it won't be possible to
> restart the cluster without manual intervention.

Couldn't cleanup code continue to work just the same way though?  The
only extra structure is an intrusive freelist, but that could be
completely ignored by code that wants to scan the whole array after a
crash.  It would only be used to find a free slot after a successful
restart, once the freelist is rebuilt and known to be sane, and could
be sanity-checked when accessed by dsm_create.  So idea 2 doesn't seem
to make that code any less robust, does it?

Deterministic key_t values for SysV IPC do seem problematic though,
for multiple PostgreSQL clusters.  Maybe that is a serious problem for
idea 1.

--
Thomas Munro
http://www.enterprisedb.com
On Mon, Mar 27, 2017 at 11:47 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Couldn't cleanup code continue to work just the same way though?  The
> only extra structure is an intrusive freelist, but that could be
> completely ignored by code that wants to scan the whole array after a
> crash.  It would only be used to find a free slot after a successful
> restart, once the freelist is rebuilt and known to be sane, and could
> be sanity-checked when accessed by dsm_create.  So idea 2 doesn't seem
> to make that code any less robust, does it?

Agreed.

> Deterministic key_t values for SysV IPC do seem problematic though,
> for multiple PostgreSQL clusters.  Maybe that is a serious problem
> for idea 1.

It might be.  Mostly, even if we had no PostgreSQL-imposed limits here
at all, I don't share your confidence that we can create tons of DSM
segments and everything will just work.  I think we'll run into
operating system limits pretty quickly -- either hard limits, like
limits on the number of segments supported, or soft limits, like
difficulties handling large numbers of memory mappings with good
performance.

I personally think it would be more worthwhile to work on
http://postgr.es/m/CA+TgmoZVqXATOGEKFdnG_sMugx_iT8_B4L0OKJZeowHburMkiQ@mail.gmail.com
instead, but I will be a little too busy to write that patch this week.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company