Thread: Tracking row updates

From: Alex Adriaanse
[I've tried to send this message to pgsql-general several times now,
but even though I'm subscribed to it I never saw the message show up
in the mailing list, so I'm trying to send it from a different account
now.  If you get several copies of this message, I apologize.]

I'm working on an application where we have a central database server
and a bunch of clients that are disconnected most of the time but need
to maintain a local copy of the central database.  The client databases
are based on One$DB since they have to be lightweight.  The clients do
not access the server database directly, but go through a server
application.
Periodically (about once a week) the clients will connect to the central
database and retrieve updates made to the central database so that their
local database will be up-to-date, as well as send local updates back to
the server.  A lot of these clients will be connecting through a dial-up
connection, and some of the tables can get quite large, so just
retrieving all rows in a table is not an option when synchronizing.

This is how I currently have things designed on the server side:

Each table has a revision column, which is set to a sequence value every
time a row is inserted or updated (deletes are translated into updates
that "hide" the row by setting an "enabled" column to false), as you can
see in the following statements:

CREATE TABLE codes (
    id SERIAL PRIMARY KEY,
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
    updated TIMESTAMP,
    revision SERIAL,
    enabled BOOLEAN DEFAULT TRUE NOT NULL,
    name VARCHAR(8) NOT NULL,
    description VARCHAR(50)
) WITHOUT OIDS;

CREATE INDEX codes_revision_idx ON codes (revision);

CREATE FUNCTION row_update_codes() RETURNS trigger AS '
    BEGIN
        NEW.revision = nextval(''codes_revision_seq'');
        NEW.updated = current_timestamp;
        RETURN NEW;
    END;
' LANGUAGE plpgsql;

CREATE TRIGGER update_row BEFORE INSERT OR UPDATE ON codes
    FOR EACH ROW EXECUTE PROCEDURE row_update_codes();

CREATE RULE delete_row AS
    ON DELETE TO codes DO INSTEAD
    UPDATE codes SET enabled = FALSE WHERE id = OLD.id;

To fetch all updates since the last synchronization, the client would
calculate a value for $lastrevision by running this query on its local
database:
SELECT max(revision) AS lastrevision FROM codes;

It would then fetch all updated rows by running this query against the
server:
SELECT * FROM codes WHERE revision > $lastrevision;

This seems to work, except there exists a race condition.  Consider the
following series of events (in chronological order):

   1. Initially, in the codes table there's a row with id=1, revision=1,
      and a row with id=2, revision=2
   2. Client A begins a transaction, and updates the row with id=1,
      resulting in revision=3 for that row
   3. Client B begins a transaction, and updates the row with id=2,
      resulting in revision=4 for that row
   4. Client B commits the transaction
   5. Client C (which has $lastrevision=2 in its local database)
      synchronizes with the database by doing SELECT * FROM codes WHERE
      revision > 2; and retrieves client B's update to the row with
      id=2, revision=4 (it doesn't yet see the update from client A)
   6. Client A commits the transaction
   7. Some time later, Client C synchronizes with the database again.
      $lastrevision for its database is now 4, so doing SELECT * FROM
      codes WHERE revision > 4; does not retrieve any rows.  So client C
      never sees client A's update to the row with id=1

Essentially, the race condition occurs when the order of clients
committing transactions (i.e. the updates becoming visible to other
clients) differs from the order of clients generating sequence values.
Do you guys have any suggestions on how to avoid this race condition, or
maybe a more elegant way to synchronize the clients with the server?

I was thinking about doing the following to solve this problem.  I'd
create a new table, e.g. codes_active_txns, with a sequence_value
column.  Before the server application starts any transaction involving
an insert/update/delete to codes, it would retrieve
nextval('codes_revision_seq') as $seq_val, insert it into
codes_active_txns, and commit.  It would then begin another transaction
and do its inserts/updates, setting revision = $seq_val for each row
that it touches.  Once it's done with the table updates, it would commit
that transaction, delete the row containing $seq_val from
codes_active_txns, and commit the deletion.
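For concreteness, the three-transaction dance would look roughly like
this (a sketch; $seq_val stands for the value the application remembers
between statements, and note that the BEFORE INSERT OR UPDATE trigger
shown earlier would need to skip rows that already carry an explicit
revision, since otherwise it would overwrite $seq_val with a fresh
sequence value):

```sql
CREATE TABLE codes_active_txns (
    sequence_value INTEGER PRIMARY KEY
);

-- Transaction 1: reserve a revision and announce the in-flight write
BEGIN;
INSERT INTO codes_active_txns (sequence_value)
    VALUES (nextval('codes_revision_seq'));  -- remember this as $seq_val
COMMIT;

-- Transaction 2: the actual changes, all stamped with $seq_val
BEGIN;
UPDATE codes SET description = 'new text', revision = $seq_val
    WHERE id = 1;
COMMIT;

-- Transaction 3: retire the bookkeeping row
BEGIN;
DELETE FROM codes_active_txns WHERE sequence_value = $seq_val;
COMMIT;
```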

The server would calculate $lastrevision with:
    SELECT max(revision) FROM codes WHERE revision < (SELECT
        COALESCE(min(sequence_value), 2147483647) FROM
        codes_active_txns);
and send this to the client.  $lastrevision is then the highest
revision such that every revision at or below it belongs to a committed
transaction.  The client would save this value, and pass it back to the
server during the next synchronization session so that the server can
figure out which rows it needs to send to the client with SELECT * FROM
codes WHERE revision > $lastrevision;.
Occasionally the same row might be sent to the client twice during two
synchronization sessions, but that's no big deal because that number
should be low and the client can handle these duplicate rows.
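Worked through with concrete (illustrative) numbers:

```sql
-- Suppose codes contains committed rows with revision 3, 4, and 6,
-- and codes_active_txns still contains sequence_value = 5 (a writer
-- that reserved 5 but has not committed yet):

SELECT max(revision) FROM codes WHERE revision <
    (SELECT COALESCE(min(sequence_value), 2147483647)
       FROM codes_active_txns);

-- The subselect yields 5, so $lastrevision = 4.  Revision 6 will
-- simply be re-sent during the next synchronization session, which is
-- the harmless duplicate described above.
```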

Now, this design isn't very elegant for several reasons.  One, it
requires the server application to manually keep track of active
transactions, which is really something that belongs in the database.
Also, if the server application crashes midway and never removes its
row from codes_active_txns, $lastrevision stops advancing and the set
of rows re-sent as duplicates will grow larger and larger, so we'll
need some way to periodically (e.g. every 5 minutes) clean stale values
from codes_active_txns.

So, my questions are:

   1. Can you guys think of a simpler and/or more elegant way to send
      only recently updated rows to the client?
   2. If not:
         1. Is there any way to move this logic that maintains
            codes_active_txns from the server application to the server
            database?  E.g. something like a trigger on BEGIN and
            COMMIT, which as far as I know does not exist.
         2. How would I go about identifying sessions that have died off
            but did not remove their row from codes_active_txns?  I can
            add extra columns to store connection/backend information.
            I think pg_locks comes close to providing this information
            (we'd just store the backend pid in codes_active_txns and
            have the cleanup script check pg_locks to make sure that pid
            is still connected), except I don't think this would work if
            I were to use something like pgpool to connect to the
            database.  The other option is adding a timestamp column and
            removing old rows, but it is possible that occasionally a
            transaction runs for a long time, so we don't want to remove
            rows for such long-running active transactions.
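For what it's worth, the pid-based cleanup could be sketched like this
(the backend_pid column name is made up here, and pg_stat_activity is
used rather than pg_locks since it also lists backends holding no
locks; as noted above, none of this helps behind a connection pooler
like pgpool):

```sql
ALTER TABLE codes_active_txns ADD COLUMN backend_pid INTEGER;

-- Run periodically: drop bookkeeping rows whose backend no longer
-- exists.  The backend-pid column in pg_stat_activity is procpid in
-- current releases (pid in later ones).
DELETE FROM codes_active_txns
WHERE NOT EXISTS (
    SELECT 1 FROM pg_stat_activity
    WHERE pg_stat_activity.procpid = codes_active_txns.backend_pid
);
```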

I've also thought of the following alternatives, but I don't think they
are feasible:

   1. Lock the entire codes table when making an update.  Our
      application will have a high level of concurrent inserts and
      updates during peak times, and this would just about kill its
      throughput.
   2. Having the client send a list of all ids and revisions for each
      row in its local database to the server.  This would scale
      poorly.  However, it would be OK to send a small (bounded) list
      of ids/revisions.

I really appreciate any insights you guys might be able to give me on
this problem.

Thanks,

Alex

Re: Tracking row updates

From: "Qingqing Zhou"
"Alex Adriaanse" <alex.adriaanse@gmail.com> writes
> This seems to work, except there exists a race condition.  Consider the
> following series of events (in chronological order):
>
>    1. Initially, in the codes table there's a row with id=1, revision=1,
>       and a row with id=2, revision=2
>    2. Client A begins a transaction, and updates the row with id=1,
>       resulting in revision=3 for that row
>    3. Client B begins a transaction, and updates the row with id=2,
>       resulting in revision=4 for that row
>    4. Client B commits the transaction
>    5. Client C (which has $lastrevision=2 in its local database)
>       synchronizes with the database by doing SELECT * FROM codes WHERE
>       revision > 2; and retrieves client B's update to the row with
>       id=2, revision=4 (it doesn't yet see the update from client A)
>    6. Client A commits the transaction
>    7. Some time later, Client C synchronizes with the database again.
>       $lastrevision for its database is now 4, so doing SELECT * FROM
>       codes WHERE revision > 4; does not retrieve any rows.  So client C
>       never sees client A's update to the row with id=1
>
> Essentially, the race condition occurs when the order of clients
> committing transactions (i.e. the updates becoming visible to other
> clients) differs from the order of clients generating sequence values.
> Do you guys have any suggestions on how to avoid this race condition, or
> maybe a more elegant way to synchronize the clients with the server?

In my understanding, you are doing something like what CVS does.  Say
you don't "check out" a file and you make a revision on the version you
currently see (say version 1); then when you want to commit, you will
probably receive a "merge required" notice, since in the interval the
file may already have been updated by another user (to version 2) - he
is free to do so since nobody knows that you might commit an update.
To avoid this, you have to "check out" the file, i.e., lock it to
prevent other changes; then you are free of any merge requirement.  The
cost is that you have locked the file and nobody else can change it.
So the only options are "merge" or "lock".

Regards,
Qingqing

Re: Tracking row updates

From: Alex Adriaanse
Qingqing Zhou wrote:

>"Alex Adriaanse" <alex.adriaanse@gmail.com> writes
>
>
>>This seems to work, except there exists a race condition.  Consider the
>>following series of events (in chronological order):
>>
>>   1. Initially, in the codes table there's a row with id=1, revision=1,
>>      and a row with id=2, revision=2
>>   2. Client A begins a transaction, and updates the row with id=1,
>>      resulting in revision=3 for that row
>>   3. Client B begins a transaction, and updates the row with id=2,
>>      resulting in revision=4 for that row
>>   4. Client B commits the transaction
>>   5. Client C (which has $lastrevision=2 in its local database)
>>      synchronizes with the database by doing SELECT * FROM codes WHERE
>>      revision > 2; and retrieves client B's update to the row with
>>      id=2, revision=4 (it doesn't yet see the update from client A)
>>   6. Client A commits the transaction
>>   7. Some time later, Client C synchronizes with the database again.
>>      $lastrevision for its database is now 4, so doing SELECT * FROM
>>      codes WHERE revision > 4; does not retrieve any rows.  So client C
>>      never sees client A's update to the row with id=1
>>
>>Essentially, the race condition occurs when the order of clients
>>committing transactions (i.e. the updates becoming visible to other
>>clients) differs from the order of clients generating sequence values.
>>Do you guys have any suggestions on how to avoid this race condition, or
>>maybe a more elegant way to synchronize the clients with the server?
>>
>>
>
>In my understanding, you are doing something like what CVS does.  Say
>you don't "check out" a file and you make a revision on the version you
>currently see (say version 1); then when you want to commit, you will
>probably receive a "merge required" notice, since in the interval the
>file may already have been updated by another user (to version 2) - he
>is free to do so since nobody knows that you might commit an update.
>To avoid this, you have to "check out" the file, i.e., lock it to
>prevent other changes; then you are free of any merge requirement.  The
>cost is that you have locked the file and nobody else can change it.
>So the only options are "merge" or "lock".
>
>Regards,
>Qingqing
>
Applying this analogy to our database, wouldn't that require a
table-level lock during a CVS-like commit (which would mean locking the
table, getting the revision number, updating the row(s), and committing
the transaction)?

Alex


Re: Tracking row updates

From: "Qingqing Zhou"
"Alex Adriaanse" <alex@alexandcarmen.com> writes
> Applying this analogy to our database, wouldn't that require a
> table-level lock during a CVS-like commit (which would mean locking the
> table, getting the revision number, updating the row(s), and committing
> the transaction)?
>

You may have a look at how CVS works and decide for yourself.  Note the
difference: CVS manages both the files and the versions, while it seems
you only manage the versions?

Another way is to change the logic - try to submit the updates
together.  Think of the bank-transfer example we always use when
describing a transaction's atomicity property, which has something in
common with your case.  We fold the logic of deducting some money and
adding some money into one transaction, so no matter how many
concurrent transfers target the account, it is guaranteed to be free of
race conditions, since the "serializable" property assures that.
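That bank-transfer example, folded into a single atomic transaction
(table and column names are purely illustrative):

```sql
BEGIN;
-- Both legs of the transfer commit or roll back together; no other
-- transaction can observe the money "in flight".
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT;
```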

Regards,
Qingqing