Thread: Avoiding Tablespace path collision for primary and standby
Currently, if the primary and standby are set up on the same machine (which is always the case for development), then after CREATE TABLESPACE xyz LOCATION '/abc' both the primary and the standby write to the same "/abc/TABLESPACE_VERSION_DIRECTORY" directory. The collision is not an issue in production deployments, where the two run on different machines, but solving it for development would still be extremely helpful.
Proposing to create directory with timestamp at time of creating tablespace and create symbolic link to it instead. The path would then be something like "/abc/PG_<timestamp>/TABLESPACE_VERSION_DIRECTORY". This avoids the collision between primary and standby, because the timestamp at which the primary creates the tablespace differs from the timestamp at which the standby replays the corresponding record.
A further advantage of this scheme is that the additional TABLESPACE_VERSION_DIRECTORY level inside could be eliminated, since the paths would not collide even during pg_upgrade. That would avoid constructing this extra path component in multiple places in the tablespace access code.
Since this is an on-disk change, it may affect existing tools.
Attaching a patch to showcase the proposal. Tested by creating a tablespace with primary and standby on the same machine; the tablespace test also passes.
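To make the collision and the proposed layout concrete, here is a minimal sketch. It is not the attached patch: the function names, the example value of TABLESPACE_VERSION_DIRECTORY, and the timestamps are all assumptions chosen for illustration.

```python
import posixpath

# Hypothetical stand-in for TABLESPACE_VERSION_DIRECTORY. The real value is
# fixed at build time from the server's major version and catalog version.
TABLESPACE_VERSION_DIRECTORY = "PG_11_201804061"


def current_layout(location: str) -> str:
    """Directory a cluster writes to today for a given LOCATION."""
    return posixpath.join(location, TABLESPACE_VERSION_DIRECTORY)


def proposed_layout(location: str, created_at_usec: int) -> str:
    """Proposed layout: an extra PG_<timestamp> level chosen locally."""
    return posixpath.join(
        location, f"PG_{created_at_usec}", TABLESPACE_VERSION_DIRECTORY
    )


# Primary and standby on one host both resolve the same LOCATION.
primary_today = current_layout("/abc")
standby_today = current_layout("/abc")
collides_today = primary_today == standby_today

# Illustrative microsecond timestamps: one taken when the primary creates the
# tablespace, one taken when the standby replays the record.
primary_new = proposed_layout("/abc", 1527258780000123)
standby_new = proposed_layout("/abc", 1527258780004567)
collides_new = primary_new == standby_new
```

The sketch only shows the path arithmetic. Whether two locally taken timestamps can really be relied on to differ is the question the rest of the thread argues about.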
Ashwin Agrawal <aagrawal@pivotal.io> writes:
> Proposing to create directory with timestamp at time of creating tablespace
> and create symbolic link to it instead.
I'm skeptical that this solves your problem. What happens when the CREATE TABLESPACE command is replicated to the standby with sub-second delay?
Clock skew is another reason to doubt that timestamp == unique identifier, which is essentially what you're assuming here.
Even if we fixed that, the general idea of including a quasi-random component in the directory name seems like it would have a lot of unpleasant side effects in terms of reproducibility, testability, etc.
regards, tom lane
On Fri, May 25, 2018 at 7:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashwin Agrawal <aagrawal@pivotal.io> writes:
> > Proposing to create directory with timestamp at time of creating tablespace
> > and create symbolic link to it instead.
> I'm skeptical that this solves your problem. What happens when the CREATE
> TABLESPACE command is replicated to the standby with sub-second delay?
I thought timestamps have micro-second precision. Are we expecting tablespace to be created, wal logged, streamed, and replayed on mirror in micro-second ?
> Clock skew is another reason to doubt that timestamp == unique identifier,
> which is essentially what you're assuming here.
Uniqueness only needs to hold on the same machine. On different machines the problem doesn't exist anyway, so it doesn't matter whether the clocks are skewed.
> Even if we fixed that, the general idea of including a quasi-random
> component in the directory name seems like it would have a lot of
> unpleasant side effects in terms of reproducibility, testability, etc.
Hmm, don't we already, to some degree, create directories and files with quasi-random numbers such as tablespace OIDs, database OIDs, relfilenodes, etc.?
To generate uniqueness for the path between primary and standby, we need to use something which is not represented within the database, so it will be random to some degree. For example, one could use the port number of the postmaster, since the unique path only needs to be generated while creating the link during CREATE TABLESPACE.
On Sat, May 26, 2018 at 9:17 AM, Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> To generate uniqueness for the path between primary and standby need to use
> something which is not represented within database. So will be random to
> some degree. Like one can use PORT number of postmaster. As only need to
> generate unique path while creating link during CREATE TABLESPACE.
I also wondered about this when trying to figure out how to write a TAP test for recovery testing with tablespaces, for my undo proposal. I was starting to wonder about either allowing relative paths or supporting some kind of variable in the tablespace path that could then be set differently in each cluster's .conf.
--
Thomas Munro
http://www.enterprisedb.com
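A rough sketch of how the "variable in the tablespace path" idea might behave follows. Nothing like this exists in PostgreSQL: the `${tablespace_root}` syntax, the setting name, and the function are hypothetical and only illustrate that one stored template can resolve to a different directory in each cluster.

```python
from string import Template


def resolve_tablespace_path(location_template: str, settings: dict) -> str:
    """Expand a hypothetical per-cluster variable in a tablespace LOCATION.

    Raises KeyError if the cluster's settings do not define the variable, so
    a misconfigured node fails loudly instead of silently sharing a path.
    """
    return Template(location_template).substitute(settings)


# The same template would be stored once and replayed on every node.
location_template = "${tablespace_root}/abc"

# Hypothetical per-cluster settings, as they might appear in each .conf.
primary_settings = {"tablespace_root": "/mnt/primary"}
standby_settings = {"tablespace_root": "/mnt/standby"}

primary_path = resolve_tablespace_path(location_template, primary_settings)
standby_path = resolve_tablespace_path(location_template, standby_settings)
```

With these assumed settings the two clusters end up in separate directories even though they replay the same CREATE TABLESPACE record.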
On Sat, May 26, 2018 at 02:10:52PM +1200, Thomas Munro wrote:
> I also wondered about this when trying to figure out how to write a
> TAP test for recovery testing with tablespaces, for my undo proposal.
> I was starting to wonder about either allowing relative paths or
> supporting some kind of variable in the tablespace path that could
> then be set differently in each cluster's .conf.
For now, for tablespace creation with multiple nodes on the same host, it really comes down to just using the tablespace map within pg_basebackup.
I think that this is a difficult problem, as one may not want to use the same partition space for both primary and standby. Hence you would need to associate a tablespace path with one node, using for example a node name set in postgresql.conf, while extending CREATE TABLESPACE to support this grammar and registering the paths for each node in WAL records. Using a path that varies depending on the time is not a good idea in my opinion.
--
Michael
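For reference, the tablespace map mentioned above is the `-T OLDDIR=NEWDIR` (`--tablespace-mapping`) option of pg_basebackup. The sketch below only builds an argument list for a same-host standby. The helper name, directories, and port are assumptions for illustration, and nothing is executed.

```python
def basebackup_argv(standby_datadir: str, primary_port: int,
                    tablespace_map: dict) -> list:
    """Build a pg_basebackup command line that relocates each tablespace.

    tablespace_map maps the primary's tablespace directory (absolute path)
    to the directory the standby on the same host should use instead.
    """
    argv = [
        "pg_basebackup",
        "-D", standby_datadir,    # target data directory for the standby
        "-p", str(primary_port),  # port of the primary being copied
        "-X", "stream",           # stream WAL while the backup runs
    ]
    for old_dir, new_dir in tablespace_map.items():
        argv += ["-T", f"{old_dir}={new_dir}"]
    return argv


# Hypothetical layout: the primary uses /abc, the standby gets /abc_standby.
argv = basebackup_argv("/data/standby", 5432, {"/abc": "/abc_standby"})
```

The resulting list could be handed to `subprocess.run`. Note that this only relocates tablespaces that exist when the base backup is taken, which is why a CREATE TABLESPACE replayed later still collides.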
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> I also wondered about this when trying to figure out how to write a
> TAP test for recovery testing with tablespaces, for my undo proposal.
> I was starting to wonder about either allowing relative paths or
> supporting some kind of variable in the tablespace path that could
> then be set differently in each cluster's .conf.
Yeah, the configuration-variable solution had occurred to me too. I'm not sure how convenient it'd be in practice, but perhaps it would be workable.
Not sure about the relative-path idea. Seems like that would create a huge temptation to put tablespaces inside the data directory, which would force us to deal with that can of worms. Also, to the extent that people use tablespaces for what they're actually meant to be used for (ie, putting some stuff into a different filesystem), I can't see a relative path being helpful. Admins don't go mounting disks at random places in the filesystem tree.
regards, tom lane
On Sat, May 26, 2018 at 7:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > I also wondered about this when trying to figure out how to write a
> > TAP test for recovery testing with tablespaces, for my undo proposal.
> > I was starting to wonder about either allowing relative paths or
> > supporting some kind of variable in the tablespace path that could
> > then be set differently in each cluster's .conf.
> Yeah, the configuration-variable solution had occurred to me too.
> I'm not sure how convenient it'd be in practice, but perhaps it
> would be workable.
A configuration variable is tricky to use for this purpose, especially since configuration files get copied by pg_basebackup.
Would the configuration variable be set by some option to pg_basebackup? pg_basebackup itself would need to use the same variable. (I know pg_basebackup provides a way to specify a different path for existing tablespaces, but it seems we would still need to use the same static string for ALL the tablespace paths, given how the linking and directory creation happen today.)
Also, I'm not sure how a configuration variable would solve the problem; for example, changing its value shouldn't block me from accessing previously created tablespaces.
Since the conflict arises naturally by design, resolving it automatically in some way would be better than a solution based on a config option.
On Fri, May 25, 2018 at 02:17:23PM -0700, Ashwin Agrawal wrote:
> On Fri, May 25, 2018 at 7:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Ashwin Agrawal <aagrawal@pivotal.io> writes:
> > > Proposing to create directory with timestamp at time of creating tablespace
> > > and create symbolic link to it instead.
> > I'm skeptical that this solves your problem. What happens when the CREATE
> > TABLESPACE command is replicated to the standby with sub-second delay?
> I thought timestamps have micro-second precision. Are we expecting tablespace
> to be created, wal logged, streamed, and replayed on mirror in micro-second ?
I didn't see anyone answer your question above. We don't expect micro-second replay, but clock skew, which Tom Lane mentioned, could make it appear to be a micro-second replay.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
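The clock-skew objection can be shown with simple arithmetic. In this sketch the function name, the timestamps, the replay delay, and the size of the clock adjustment are all assumed values, picked so that the adjustment exactly cancels the delay.

```python
def local_timestamp_usec(true_time_usec: int, clock_offset_usec: int) -> int:
    """Wall-clock reading of a process whose clock is off by the given offset."""
    return true_time_usec + clock_offset_usec


create_true_usec = 1_527_258_780_000_000  # when the primary creates the tablespace
replay_delay_usec = 250_000               # the standby replays 0.25 s later

# The primary reads an accurate clock at create time.
primary_stamp = local_timestamp_usec(create_true_usec, 0)

# By replay time the clock the standby reads has been set back by 0.25 s
# (an adjustment, or skew between sockets or virtual CPUs).
standby_stamp = local_timestamp_usec(
    create_true_usec + replay_delay_usec, -replay_delay_usec
)

# Identical readings mean identical PG_<timestamp> directory names.
same_directory_name = primary_stamp == standby_stamp
```

An exact cancellation like this is unlikely on any single run, but nothing rules it out, which is the sense in which a timestamp is not a unique identifier.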
Hi,
On 2018-05-26 10:08:57 -0400, Tom Lane wrote:
> Not sure about the relative-path idea. Seems like that would create
> a huge temptation to put tablespaces inside the data directory, which
> would force us to deal with that can of worms.
It doesn't seem impossible to normalize the path, and then check for that.
> Also, to the extent that people use tablespaces for what they're
> actually meant to be used for (ie, putting some stuff into a different
> filesystem), I can't see a relative path being helpful. Admins don't
> go mounting disks at random places in the filesystem tree.
I'm not convinced by that argument. It can certainly make sense to mount several filesystems relative to a subdirectory. And then there's the case we're talking about, where you have primary/standby on a single system. It's not like we'd *force* relative tablespaces...
Greetings,
Andres Freund
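A minimal sketch of the "normalize the path, then check" idea, under stated assumptions: relative locations are taken as relative to the data directory, the data directory is an absolute path, and the function name is hypothetical. The normalization here is purely lexical and does not follow symlinks, which a real check would also have to consider.

```python
import posixpath


def is_inside_data_directory(location: str, data_directory: str) -> bool:
    """Return True if a tablespace location resolves inside the data directory.

    A relative location is interpreted relative to data_directory (an
    assumption of this sketch). An absolute location is used as given.
    """
    data = posixpath.normpath(data_directory)
    resolved = posixpath.normpath(posixpath.join(data, location))
    return posixpath.commonpath([resolved, data]) == data


data_dir = "/var/lib/pg/data"

outside_relative = is_inside_data_directory("../ts/abc", data_dir)
inside_relative = is_inside_data_directory("pg_tblspc/../base/x", data_dir)
outside_absolute = is_inside_data_directory("/abc", data_dir)
```

Under these assumptions a relative tablespace next to the data directory is accepted, while one that normalizes to a location inside it can be rejected.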
On Wed, Jun 20, 2018 at 9:39 AM Bruce Momjian <bruce@momjian.us> wrote:
> I didn't see anyone answer your question above. We don't expect
> micro-second replay, but clock skew, which Tom Lane mention, could make
> it appear to be a micro-second replay.
Thanks, Bruce, for answering. I still don't see why clock skew is a problem here, though. As I understand it, clock skew only happens across machines, so why would it be an issue on the same machine? The problem exists only on the same machine; on different machines the paths don't collide anyway, so clock skew there doesn't matter. (I understand there may be reservations about putting a timestamp in the directory path, but the clock skew argument is not clear to me.)
On June 20, 2018 10:31:05 AM PDT, Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> Thanks Bruce for answering. Though I still don't see why clock skew is a
> problem here. As I think clock skew only happens across machines. On same
> machine why would it be an issue. Problem is only with same machine,
> different machines anyways paths don't collide so even if clock skew
> happens is not a problem. (I understand there may be reservations for
> putting timestamp in directory path, but clock skew argument is not
> clear.)
Clock skew happens within machines too. Both because of multi socket systems and virtualization systems. Also clock adjustments.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Jun 20, 2018 at 10:50 AM Andres Freund <andres@anarazel.de> wrote:
> Clock skew happens within machines too. Both because of multi socket systems and virtualization systems. Also clock adjustments.
Okay just bouncing another approach, how about generating UUID for a postgres instance during initdb and pg_basebackup ? (Unlike the `system_identifier` shown by pg_controldata, it would be stored in a separate, independent file that is excluded from the pg_basebackup copy and created afresh by pg_basebackup instead.) It would be read only once during startup and used in the tablespace path. (I understand that generating a UUID may be a bit heavyweight just to avoid same-node tablespace path collisions, but having a unique identifier for each postgres instance, primary or standby, may be useful in the long term for other purposes as well.)
Ashwin Agrawal <aagrawal@pivotal.io> writes:
> Okay just bouncing another approach, how about generating UUID for a
> postgres instance during initdb and pg_basebackup ?
There's no uuid generation code in core postgres, for excellent reasons (lack of portability and lack of failure modes are the main objections). This is not different in any meaningful way from the proposal to use timestamps, except for being more complicated.
regards, tom lane
On Thu., 21 Jun. 2018, 04:30 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
> Ashwin Agrawal <aagrawal@pivotal.io> writes:
> > Okay just bouncing another approach, how about generating UUID for a
> > postgres instance during initdb and pg_basebackup ?
> There's no uuid generation code in core postgres, for excellent reasons
> (lack of portability and lack of failure modes are the main objections).
> This is not different in any meaningful way from the proposal to use
> timestamps, except for being more complicated.
A v4 UUID is just 128 random bits and some simple formatting. So I really don't understand your concerns about UUID generation.
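As an illustration of the "128 random bits and some simple formatting" point, here is a small sketch that builds a version 4 UUID string from 16 random bytes by setting the version and variant bits by hand. The function name is hypothetical and this is not code from PostgreSQL. The standard `uuid` module is used only to cross-check the result.

```python
import os
import uuid


def uuid4_from_random_bytes(random_bytes: bytes) -> str:
    """Format 16 random bytes as an RFC 4122 version 4 UUID string."""
    if len(random_bytes) != 16:
        raise ValueError("a UUID needs exactly 16 bytes")
    raw = bytearray(random_bytes)
    raw[6] = (raw[6] & 0x0F) | 0x40  # high nibble of byte 6 = version 4
    raw[8] = (raw[8] & 0x3F) | 0x80  # top two bits of byte 8 = RFC 4122 variant
    digits = raw.hex()
    return "-".join(
        [digits[0:8], digits[8:12], digits[12:16], digits[16:20], digits[20:32]]
    )


# os.urandom supplies the random bits. A per-instance identifier generated
# this way is only as unique as that randomness source.
instance_id = uuid4_from_random_bytes(os.urandom(16))
parsed = uuid.UUID(instance_id)
```

The formatting itself is trivial. The portability concern raised upthread is about obtaining good random bits everywhere, which this sketch delegates to `os.urandom`.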
That said, it can already be handled with tablespace maps in pg_basebackup. And any new scheme would need to happen in pg_basebackup too, because it must happen before the tablespaces are copied and the replica is first started.
I don't see a big concern with some pg_basebackup --gen-unique-tablespaces option or the like.
UUID would be better than timestamp due to the skew issues discussed upthread. But personally I'd just take a label argument. pg_basebackup --tablespace-prefix or the like.
For non-pg_basebackup uses you have to solve it yourself anyway. Pg doesn't know if it's just been started as a copy, after all, and it's too late to move tablespaces then even if we'd do such a thing.