Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?) - Mailing list pgsql-general

From Ed L.
Subject Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)
Date
Msg-id 200304101605.51688.pgsql@bluepolka.net
In response to Re: 32/64-bit transaction IDs?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Batch replication ordering (was Re: [GENERAL] 32/64-bit transaction IDs?)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Saturday March 22 2003 12:00, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Saturday March 22 2003 11:29, Tom Lane wrote:
> >> I think this last part is wrong.  It shouldn't be using the xid as
> >> part of the ordering, only the sequence value.
> >
> > Why not?  How would you replay them on the slave in the same
> > transaction groupings and order that occurred on the master?
>
> Why not is easy:
>
>     begin xact 1;
>     update tuple X;
>     ...
>
>                     begin xact 2;
>                     update tuple Y;
>                     commit;
>
>     ...
>     update tuple Y;
>     commit;
>
> (Note that this is only possible in read-committed mode, else xact 1
> wouldn't be allowed to update tuple Y like this.)  Here, you must
> replay xact 1's update of Y after xact 2's update of Y, else you'll
> get the wrong final state on the slave.  On the other hand, it really
> does not matter whether you replay the tuple X update before or after
> you replay xact 2, because xact 2 didn't touch tuple X.
>
> If the existing DBmirror code is sorting as you say, then it will fail
> in this scenario --- unless it always manages to execute a propagation
> step in between the commits of xacts 2 and 1, which doesn't seem very
> trustworthy.
>
> What I'm envisioning is that you should just send updates in the order
> of their insertion sequence numbers and *not* try to force them into
> transactional grouping.  In the above scenario, depending on when
> propagation runs it might send X update, Y update, second Y update
> all in one batch; or it might send Y update in one batch and then X
> update and second Y update in a later batch.  Both of these are legal
> and consistent orderings.  The only ordering constraint is that the
> second Y update must be applied second --- but that is certain as
> long as we sort the contents of a batch by insertion order.  (The
> pending-update report for the second Y update cannot have been made
> until after xact 2 commits, because the MVCC semantics will force
> xact 1 to wait for 2 to commit before it can update Y.)
>
> Note that all of a transaction's updates will become visible in the
> pending-update table simultaneously when it commits, so (as long as
> you grab batches in single SELECTs, or with a serializable transaction)
> there's no problems with partial transactions being applied by a batch.
> A batch will contain all the updates of some specific set of
> transactions, namely all those that committed since the prior batch
> for that slave.  But you can't order the updates done by a batch in xid
> order, they have to be applied in insertion order.

If you grab everything in the queue with a single SELECT, this works.
Depending on the queue length, though, that's not always practical.  And if
you instead grab the queue in size-limited chunks, a chunk boundary can fall
mid-transaction, leaving partial transactions committed on the slave.  So the
riddle is how to get a consistent but batchable replication order.

Suppose you have a replication queue, with dbmirror-like trigger-based
insertions, that looks something like this:

    create table replication_queue (
        xid    integer,
        seqid    serial primary key,
        data...
    );

Here, 'xid' is the transaction ID, 'seqid' is the queue insertion order, and
'data' carries all the info needed to replicate the change.  Every row
insert/update/delete results in a row in this queue.
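As a toy illustration of how triggers might populate such a queue (SQLite
standing in for Postgres here; dbmirror's actual triggers are C functions that
record the real backend xid, while this sketch fakes the xid with a helper
table purely to show the shape of the thing):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);

-- Fake stand-in for the backend's transaction ID (an assumption of
-- this sketch; the real trigger gets the genuine xid from the backend).
CREATE TABLE current_xid (xid INTEGER);
INSERT INTO current_xid VALUES (1);

CREATE TABLE replication_queue (
    xid   INTEGER,
    seqid INTEGER PRIMARY KEY AUTOINCREMENT,
    data  TEXT
);

-- Each update on the mirrored table queues one replication row.
CREATE TRIGGER accounts_mirror AFTER UPDATE ON accounts
BEGIN
    INSERT INTO replication_queue (xid, data)
    VALUES ((SELECT xid FROM current_xid),
            'accounts:' || NEW.id || '=' || NEW.balance);
END;
""")

db.execute("INSERT INTO accounts VALUES (1, 100)")
db.execute("UPDATE accounts SET balance = 90 WHERE id = 1")
queued = db.execute("SELECT xid, seqid, data FROM replication_queue").fetchall()
print(queued)  # [(1, 1, 'accounts:1=90')]
```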

Suppose further you wish to limit the number of updates replicated in a
particular cycle.  For example, suppose you have a million changes queued
for replication on the master, but for obvious reasons you don't want to
select a million rows to replicate all at once.  Suppose you also don't
want to grab them one tuple or one transaction at a time, to avoid
hammering the master.  Rather, you want to grab them in batches of no
more than N transactions, replicate each batch to a slave and commit it
there, take a breather, and repeat until the slave is caught up.  During
each breather, you want only complete transactions committed on the
slave, so that any slave clients see consistent data.
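To make the hazard concrete, here is a small illustration (SQLite standing in
for the Postgres queue) of why a naive "ORDER BY seqid LIMIT N" batch is
unsafe.  The queue contents mimic Tom's scenario above: xact 1 updates X
(seqid 1), xact 2 updates Y (seqid 2) and commits, then xact 1 updates Y
(seqid 3) and commits.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE replication_queue"
           " (xid INTEGER, seqid INTEGER PRIMARY KEY, data TEXT)")
db.executemany("INSERT INTO replication_queue VALUES (?, ?, ?)",
               [(1, 1, "update X"),   # xact 1's first change
                (2, 2, "update Y"),   # xact 2, committed first
                (1, 3, "update Y")])  # xact 1's second change

# Naive batch of 2 rows in pure insertion order:
naive = db.execute("SELECT xid, seqid FROM replication_queue"
                   " ORDER BY seqid LIMIT 2").fetchall()
print(naive)  # [(1, 1), (2, 2)]
# Xact 1 is split: its seqid-3 row is left behind, so committing this
# batch on the slave would expose a half-applied xact 1.
```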

My question:  Is there an ordering algorithm that makes consistent
replication possible with a limited batch size?

The algorithm I'm considering right now is the following:

    select xid, max(seqid) as "max_seqid"
    into temp replication_order
    from replication_queue
    group by xid
    order by max(seqid)
    limit N;

Then, to get the actual queue replication order,

    select q.xid, q.seqid, q.data
    from replication_queue q, replication_order o
    where q.xid = o.xid
    order by o.max_seqid, q.seqid;

[This is a batched variation of dbmirror's original algorithm.]
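Run against the same interleaved scenario Tom described (xact 1 at seqids 1
and 3, xact 2 at seqid 2, committed first), the two queries behave as hoped.
A sketch in SQLite, with N hardcoded to 2 for brevity:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE replication_queue"
           " (xid INTEGER, seqid INTEGER PRIMARY KEY, data TEXT)")
db.executemany("INSERT INTO replication_queue VALUES (?, ?, ?)",
               [(1, 1, "update X"), (2, 2, "update Y"), (1, 3, "update Y")])

# Step 1: pick the next N transactions, ordered by their last insertion.
db.execute("""CREATE TEMP TABLE replication_order AS
                  SELECT xid, MAX(seqid) AS max_seqid
                  FROM replication_queue
                  GROUP BY xid
                  ORDER BY max_seqid
                  LIMIT 2""")

# Step 2: the actual replication order for the batch.
batch = db.execute("""SELECT q.xid, q.seqid, q.data
                      FROM replication_queue q, replication_order o
                      WHERE q.xid = o.xid
                      ORDER BY o.max_seqid, q.seqid""").fetchall()
print(batch)
# [(2, 2, 'update Y'), (1, 1, 'update X'), (1, 3, 'update Y')]
```

Each xid's rows come out contiguous, so a smaller N would cut only at
transaction boundaries; and the two Y updates are applied in seqid order
(2 then 3), giving the slave the same final state as the master.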

Ed

