Re: logical changeset generation v5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical changeset generation v5
Date
Msg-id 20130903231046.GG7018@awork2.anarazel.de
Whole thread Raw
In response to Re: logical changeset generation v5  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: logical changeset generation v5
List pgsql-hackers
On 2013-09-03 15:56:15 -0400, Robert Haas wrote:
> On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> And why is every 15 seconds good enough?
> >
> > Waiting 15s to become consistent instead of checkpoint_timeout seems to
> > be ok to me and to be a good tradeoff between overhead and waiting. We
> > can certainly discuss other values or making it configurable. The latter
> > seemed to be unnecessary to me, but I have don't have a problem
> > implementing it. I just don't want to document it :P
> 
> I don't think it particularly needs to be configurable, but I wonder
> if we can't be a bit smarter about when we do it.  For example,
> suppose we logged it every 15 s but only until we log a non-overflowed
> snapshot.

There's actually more benefits than just overflowed snapshots (pruning
of the known xids machinery, exclusive lock cleanup).

> I realize that the overhead of a WAL record every 15
> seconds is fairly small, but the load on some systems is all but
> nonexistent.  It would be nice not to wake up the HD unnecessarily.

The patch as-is only writes if there has been WAL written since the last
time it logged a running_xacts. I think it's not worth building more
smarts than that?

> >> The WAL writer is supposed to call XLogBackgroundFlush() every time
> >> WalWriterDelay expires.  Yeah, it can hibernate, but if it's
> >> hibernating, then we should respect that decision for this WAL record
> >> type also.
> >
> > Why should we respect it?
> 
> Because I don't see any reason to believe that this WAL record is any
> more important or urgent than any other WAL record that might get
> logged.

I can't follow the logic behind that statement. Just about all WAL
records are either pretty immediately flushed afterwards or are done in
the context of a transaction where we flush (or do a
XLogSetAsyncXactLSN) at transaction commit.

XLogBackgroundFlush() won't necessarily flush the running_xacts
record. Unless you've set the async xact lsn it will only flush complete
blocks. So what can happen (I've seen that more than once in testing,
took me a while to debug) that a checkpoint is started in a busy period
but nothing happens after it finished. Since the checkpoint triggered
running_xact is triggered *before* we do the smgr flush it still is
overflowed. Then, after activity has died down, the bgwriter issues the
running xact record, but it's filling a block and thus it never get's
flushed.

To me the alternatives are to do a XLogSetAsyncXactLSN() or an
XLogFlush(). The latter is more aggressive and can block for a
measurable amount of time, which is why I don't want to do it in the
bgwriter.

> >> >> I understand why logical replication needs to connect to a database,
> >> >> but I don't understand why any other walsender would need to connect
> >> >> to a database.
> >> >
> >> > Well, logical replication actually streams out data using the walsender,
> >> > so that's the reason why I want to add it there. But there have been
> >> > cases in the past where we wanted to do stuff in the walsender that need
> >> > database access, but we couldn't do so because you cannot connect to
> >> > one.
> >
> >> Could you be more specific?
> >
> > I only remember 3959.1349384333@sss.pgh.pa.us but I think it has come up
> > before.
> 
> It seems we need some more design there.  Perhaps entering replication
> mode could be triggered by writing either dbname=replication or
> replication=yes.  But then, do the replication commands simply become
> SQL commands?  I've certainly seen hackers use them that way.  And I
> can imagine that being a sensible approach, but this patch seems like
> it's only covering a fairly small fraction of what really ought to be
> a single commit.

Yes. I think you're right that we need more input/design here. I've
previously started threads about it, but nobody replied :(.

The problem with using dbname=replication as a trigger for anything is
that we actually allow databases to be created with that name. Perhaps
that was a design mistake.

I wondered about turning replication from a boolean into something like
off|0, on|1, database. dbname= gets only used in the latter
variant. That would be compatible with previous versions and would even
support using old tools (since all of them seem to do replication=1).

> But then, do the replication commands simply become
> SQL commands?  I've certainly seen hackers use them that way.

I don't think that it's a good way at this point to make them to plain
SQL. There is more infrastructure (signal handlers, permissions,
different timeouts) & memory required for walsenders, so using plain SQL
there seems beyond the scope of this.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: WAL CPU overhead/optimization (was Master-slave visibility order)
Next
From: Merlin Moncure
Date:
Subject: operator precedence issues