From: Andres Freund
Subject: Re: Changeset Extraction v7.1
Msg-id: 20140125222526.GZ7182@awork2.anarazel.de
In response to: Re: Changeset Extraction v7.1 (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Hi Robert, all,

On 2014-01-24 20:38:11 -0500, Robert Haas wrote:
> On Fri, Jan 24, 2014 at 12:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> But this code is riddled with places where you track a catalog xmin
> >> and a data xmin separately.  The only point of doing it that way is to
> >> support a division that hasn't been made yet.
> >
> > If you think it will make stuff more manageable I can try separating all
> > lines dealing with catalog_xmin into another patch as long as data_xmin
> > doesn't have to be renamed.
> > That said, I don't really think it's a big problem that the division
> > hasn't been made, essentially the meaning is different, even if we don't
> > take advantage of it yet. data_xmin is there for streaming replication
> > where we need to prevent all removal, catalog_xmin is there for
> > changeset extraction.
> 
> I spent some more time studying the 0001 and 0002 patches this
> afternoon, with a side dish of 0004.  I'm leaning toward thinking we
> should go ahead and make that division.

Ok.

> I'm also wondering about
> whether we've got the right naming here.  AFAICT, it's not the case
> that we're going to use the "catalog xmin" for catalogs and the "data
> xmin" for non-catalogs.  Rather, the "catalog xmin" is going to always
> be included in globalxmin calculations, so IOW it should just be
> called "xmin".

Well, not really. That's true for GetSnapshotData(), but not for
GetOldestXmin(), where we calculate an xmin that does *not* include the
catalog xmin. And the data_xmin is always used, regardless of
catalog/non-catalog; we just peg the xmin further back for catalog
tables, based on that value.
The reason for doing things this way is that it makes all current usages
of RecentGlobalXmin safe, since that is the more conservative value.
Only in individually inspected locations do we use RecentGlobalDataXmin,
which *does* include data_xmin but *not* catalog_xmin.
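
To illustrate, a rough sketch of the invariant (names as in the patch,
but this is not the actual patch code; the real computation lives in
GetSnapshotData()):

/*
 * globalxmin: oldest xmin across all backends, already pegged by the
 * slots' data_xmin.  Non-catalog tables may be pruned up to here.
 */
RecentGlobalDataXmin = globalxmin;

/*
 * RecentGlobalXmin is additionally pegged by the slots' catalog_xmin,
 * so RecentGlobalXmin <= RecentGlobalDataXmin always holds and all
 * existing users of RecentGlobalXmin stay safe.
 */
if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
    TransactionIdPrecedes(replication_slot_catalog_xmin, globalxmin))
    RecentGlobalXmin = replication_slot_catalog_xmin;
else
    RecentGlobalXmin = globalxmin;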

> It's interesting (though not immediately relevant) to speculate about
> how we might extend this to fine-grained xmin tracking more generally.
> [musings for another time]

Yea, I have wondered about that as well. I think the easiest thing would
be to compute RecentGlobalDataXmin in a database-specific manner, since
by definition it will *not* include shared tables. We already do that
for GetOldestXmin(), but that's not used for heap pruning. I quickly
tested that some months back and it gave significant speedups for two
pgbenches in two databases.
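
Roughly, a hypothetical sketch mirroring the database filter that
GetOldestXmin() already applies via its allDbs parameter:

/* inside the proc array loop computing the pruning horizon */
if (proc->databaseId != MyDatabaseId &&
    proc->databaseId != InvalidOid)    /* InvalidOid: shared rels */
    continue;                          /* other DBs can't matter here */

if (TransactionIdIsNormal(pgxact->xmin) &&
    TransactionIdPrecedes(pgxact->xmin, globalxmin))
    globalxmin = pgxact->xmin;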

> >> I have zero confidence that it's OK to treat fsync() as an operation
> >> that won't fail.  Linux documents EIO as a plausible error return, for
> >> example.  (And really, how could it not?)
> >
> But quite fundamentally having the most basic persistency building
> > block fail is something you can't really handle safely. Note that
> > issue_xlog_fsync() has always (and I wager, will always) treated that as
> > a PANIC.
> > I don't recall many complaints about that for WAL. All of the ones I
> > found in a quick search were like "oh, the fs invalidated my fd because
> > of corruption". And few.
> 
> Well, you have a point.  And certainly this version looks much better
> than the previous version in terms of the likelihood of PANIC
> occurring in practice.  But I wonder if we couldn't cut it down even
> further without too much effort.  Suppose we define a slot to exist
> if, and only if, the state file exists.  A directory without a state
> file is subject to arbitrary removal.  Then we can proceed as follows:
> 
> - mkdir() the directory.
> - open() state.tmp
> - write() state.tmp
> - close() state.tmp
> - fsync() parent directory, directory and state.tmp
> - rename() state.tmp to state
> - fsync() state
> 
> If any step except the last one fails, no problem.  The next startup
> can nuke the leftovers; any future attempt to create a slot with the
> name can ignore an EEXIST failure from mkdir() and proceed just as
> above.  Only a failure of the very last fsync is a PANIC.   In some
> ways I think this'd be simpler than what you've got now, because we'd
> eliminate the dance with renaming the directory as well as the state
> file; only the state file matters.

Hm. I think this is pretty much exactly what happens in the current patch,
right? There's an additional fsync() of the parent directory at the end,
but that's it.
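
For the archives, here's the sequence you describe in (heavily
abbreviated) code. A sketch only: error handling elided, plain POSIX
calls instead of the patch's helpers, and create_slot()/fsync_dir() are
made-up names:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* hypothetical helper: fsync a directory via an O_RDONLY fd */
static void
fsync_dir(const char *path)
{
    int     fd = open(path, O_RDONLY);

    fsync(fd);
    close(fd);
}

static void
create_slot(const char *dir, const char *buf, size_t len)
{
    char    tmppath[1024], path[1024];
    int     fd;

    snprintf(tmppath, sizeof(tmppath), "%s/state.tmp", dir);
    snprintf(path, sizeof(path), "%s/state", dir);

    mkdir(dir, 0700);           /* EEXIST is fine, leftovers get nuked */

    fd = open(tmppath, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    write(fd, buf, len);
    fsync(fd);                  /* state.tmp itself */
    close(fd);

    fsync_dir("pg_replslot");   /* the parent directory */
    fsync_dir(dir);             /* the slot directory */

    rename(tmppath, path);      /* the slot exists iff "state" exists */

    fd = open(path, O_WRONLY);
    if (fsync(fd) != 0)         /* the only step that must PANIC */
        abort();                /* elog(PANIC, ...) in the real thing */
    close(fd);
}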

> To drop a slot, just unlink the state file and fsync() the directory.
> If the unlink fails, it's just an error.  If the fsync() fails, it's a
> PANIC.  Once the state file is gone, removing everything else is only
> an ERROR, and you don't even need to fsync() it again.

Well, the patch as it stands renames the directory first and fsyncs
that. Only a failure in fsyncing is punishable by PANIC; if rmtree() on
the temp directory fails it generates WARNINGs, that's it.
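
I.e. roughly this, reusing the hypothetical fsync_dir() from the sketch
above (slot name borrowed from your example further down):

rename("pg_replslot/zippy", "pg_replslot/zippy.tmp"); /* failure: ERROR */
fsync_dir("pg_replslot");                /* failure: PANIC */
rmtree("pg_replslot/zippy.tmp", true);   /* failures: only WARNINGs */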

> To update a slot, open, write, close, and fsync state.tmp, then rename
> it to state and fsync() again.  None of these steps need PANIC; hold
> off on updating the values in memory until they're all done.  If any
> step fails, the attempt to update the slot fails, but either memory
> and disk are still consistent, or the disk has an xmin newer than
> memory, but still legal.  On restart, when restoring slots, fsync()
> each state file, dying horribly if we can't, and remove any
> directories that don't contain one.

That's again pretty similar to what happens, only that we panic if the
fsync()ing fails. And I think that's correct.
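
In the same sketch form, with hypothetical helper and field names:

/* open/write/fsync/close state.tmp as in the creation sketch above */
write_state_tmp(slot, "pg_replslot/zippy/state.tmp");
rename("pg_replslot/zippy/state.tmp", "pg_replslot/zippy/state");
fsync_file("pg_replslot/zippy/state");  /* we PANIC here on failure */

/* only now, with the state durably on disk, update the in-memory copy */
slot->data_xmin = slot->effective_data_xmin;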

I still think worrying over this to this degree is a waste of
effort. There are much hotter places that could be inspected in that
level of detail than this.

> Looking over patch 0002, I see that there's code to allow a walsender
> to create or drop a physical replication slot.  Also, if we've
> acquired a replication slot, there's code to update it, and code to
> make sure we disconnect from it, but there's no code to acquire it. I
> think maybe the hunk in StartReplication() is supposed to be calling
> ReplicationSlotAcquire() instead of ReplicationSlotRelease(),

Uh, you had me worried here for a minute or two. The
ReplicationSlotAcquire() is there, a hunk or two earlier than the
ReplicationSlotRelease() you mention. What probably confused you is
that StartReplication() only returns once all streaming is
finished. Not my idea...

static void
StartReplication(StartReplicationCmd *cmd)
{
    ...
    if (cmd->slotname)
        ReplicationSlotAcquire(cmd->slotname);
    ...
    /* this is where we'll actually loop busily */
    WalSndLoop(XLogSendPhysical);
    ...
    if (cmd->slotname)
        ReplicationSlotRelease();
    ...
}

> which
> (ahem) makes one wonder how thoroughly this code has been tested.

It's actually tested as of a week ago or so. Both with pg_receivexlog
and a hacked up walreceiver. That's how I noticed
a472ae1e4e2bf5fb71ac655d38d1e35df4c1c966 ;). Because it did *not* work
properly in the beginning... But it didn't end up being my code. Hah!

> There's also no provision for walsender (or pg_receivexlog?) to send
> the new SLOT option to walreceiver, which seems somewhat necessary.

There's a hacked-up pg_receivexlog in the last commit in the series. I
haven't included the hack for walreceiver as it was too embarrassing
for the public eye.

I really, really don't want to focus on polishing up the receiver side
for this before the basics of changeset extraction are done. I've very,
very reluctantly agreed to generalize the slot concept for streaming rep
now, but I said all along that I won't do the client work till the
changeset extraction stuff is done. There is a good deal of UI design
work to be done, and I don't think I have the capacity to tackle that
right now.

> I'm tempted to suggest also adding something to src/bin/scripts to
> create and drop slots, though I suppose we could just recommend psql
> -c 'CREATE_REPLICATION_SLOT SLOT zippy PHYSICAL' 'replication=true'.

There's an SQL function for doing so. In, err, the wrong patch:
postgres=# \df create_physical_replication_slot
                                                      List of functions
   Schema   |               Name               | Result data type |                  Argument data types                      |  Type
------------+----------------------------------+------------------+-----------------------------------------------------------+--------
 pg_catalog | create_physical_replication_slot | record           | slotname name, OUT slotname text, OUT xlog_position text  | normal
(1 row)
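
Invoked like this (hypothetical slot name, same as in your example):

postgres=# SELECT * FROM create_physical_replication_slot('zippy');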

Will move it. Not sure if we need the slotname as an OUT value as well;
it's helpful as part of the additional return values for creating a
decoding slot, but here...

Not sure if there's still a reason for a separate command-line utility?

> (BTW, isn't that kind of a strange syntax, with the word SLOT
> appearing twice?  I think we could drop the second one.)

Well, that's what I'd suggested on the mailing list, so I didn't change
it. It will definitely be a separate SLOT slot_name for
START_REPLICATION; that's pretty much the only reason for keeping it
separate for CREATE/DROP. Don't care which way we go in the end.
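
So, concretely, the variants under discussion (a sketch, not final
grammar; the LSN is made up):

CREATE_REPLICATION_SLOT SLOT zippy PHYSICAL   -- as in the patch
CREATE_REPLICATION_SLOT zippy PHYSICAL        -- your suggestion
START_REPLICATION SLOT zippy 0/1000000        -- SLOT keyword stays here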

> It also occurred to me that we need documentation for all of this; I
> see that's in patch 0010, but I haven't reviewed it in detail yet.

The streaming rep part is scantily documented since that's pending the
client-side work, but the changeset extraction part should be documented
to some degree... Craig worked on my initial docs and seemed able to
make enough sense of them, so I hope they're not in a totally bad state.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


