Thread: table as log (multiple writers and readers)
I want to create a table with the semantics of a log. There may be
multiple concurrent writers, because log entries will be generated by
triggers. And there will be multiple concurrent readers -- mostly
remote processes watching the log over time. I'd like to guarantee that
each of those readers will see the same, complete sequence of changes.
For a first attempt, I created a table with a serial column (actually I
created the sequence separately rather than using SERIAL, but that's a
detail). The readers
connect remotely, scan through the table in sequence order, and remember
the ID of the last row they read. When they read again, they start from
after that ID -- the query is roughly
SELECT * FROM logtable WHERE id > ? ORDER BY id
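For concreteness, a minimal sketch of this first attempt -- the ts and
payload columns and the PREPARE wrapper around the poll are illustrative
assumptions, not part of the scheme itself:

CREATE SEQUENCE logtable_id_seq;

CREATE TABLE logtable (
    id      bigint PRIMARY KEY DEFAULT nextval('logtable_id_seq'),
    ts      timestamptz NOT NULL DEFAULT current_timestamp,
    payload text                 -- whatever the triggers want to record
);

-- Reader poll: the parameter is the highest id this reader has processed.
PREPARE poll(bigint) AS
    SELECT * FROM logtable WHERE id > $1 ORDER BY id;
EXECUTE poll(0);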
But there's a problem with this -- the rows may not become visible in the
log in ID order. For example, if two concurrent sessions get the
sequence IDs 3 and 4, the one with ID 4 might commit first. If a reader
queries the table at that moment, it will see the sequence (1, 2, 4).
Later, if the other session commits, a log entry with ID 3 will be
added. The IDs will now be (1, 2, 4, 3) -- but if the reader comes back
to poll the log again, it will miss entry 3. In order not to miss it,
the reader would have to remember much more information than just a
high-water mark. (I think it would have to remember the complete set of
IDs it had processed already, and then scan the whole table to find
entries not in the set.)
Is there an approach that will give the semantics I need? In
particular, I'd like a reader that has already read up to a certain
point in the log to be able to restart at or near that same point, so it
wouldn't have to reread everything.
(I recognize that these log semantics are closer to those provided by
files, not database tables. A file is an option I'll consider, but
obviously it lacks transactional semantics.)
Thanks,
Vance
Vance Maverick wrote:
> I want to create a table with the semantics of a log. There may be
> multiple concurrent writers, because log entries will be generated by
> triggers. And there will be multiple concurrent readers -- mostly
> remote processes watching the log over time. I'd like to guarantee that
> each of those readers will see the same, complete sequence of changes.
[...]
> Is there an approach that will give the semantics I need?

Use a timestamp column also.
brian wrote:
> Use a timestamp column also.

That's subject to the same issues, because a transaction's
current_timestamp() is determined at transaction start. So, in a
situation like this:

  WRITER 1      WRITER 2      READER 1
  --------------------------------------------
  BEGIN
                BEGIN
  INSERT
                INSERT
                COMMIT
                              BEGIN
                              SELECT
                              COMMIT

then READER 1 will see the most recent timestamp as that inserted by
WRITER 2, but it won't see the row inserted by WRITER 1 with an earlier
timestamp.

I don't think it's even OK in the case of a single-statement INSERT
(where the transaction is implicit) and/or with the use of
clock_timestamp() ... though I'm less sure about that.

--
Craig Ringer
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> wrote:
> brian wrote:
>
> > Use a timestamp column also.
>
> That's subject to the same issues.
[...]
> I don't think it's even OK in the case of a single-statement INSERT
> (where the transaction is implicit) and/or with the use of
> clock_timestamp() ... though I'm less sure about that.
No, you're right. The problem is that the timestamp is chosen some time
before the commit succeeds. So if there are concurrent writers
committing at the same time, the order of commit is determined by a
race, one that takes place after the timestamp values are set.
Another approach would be to queue the log entries in a "staging" table,
so that a single process could move them into the log. This is fairly
heavyweight, but it would guarantee the consistent sequencing of the log
as seen by a reader (even if the order of entries in the log didn't
always reflect the true commit sequence in the staging table). I'm
hoping someone knows a cleverer trick.
Vance
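A minimal sketch of that staging idea, assuming the hypothetical names
log_staging and logtable (with its sequence-assigned id, as in the
earlier sketch); writers only ever insert into the staging table, and
the move itself is sketched further down the thread:

-- Writers (e.g. the triggers) append here; no ordering is promised yet.
-- Ordering is imposed later, when the single publisher moves rows into
-- logtable and they pick up their ids there.
CREATE TABLE log_staging (
    entered timestamptz NOT NULL DEFAULT clock_timestamp(),
    payload text
);

INSERT INTO log_staging (payload) VALUES ('something worth logging');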
Vance Maverick wrote:
> Another approach would be to queue the log entries in a "staging" table,
> so that a single process could move them into the log. This is fairly
> heavyweight, but it would guarantee the consistent sequencing of the log
> as seen by a reader.

The way I see it, one way or another you are going to have to serialize
writers, otherwise you'll always be faced with commit-order races.

I do have one other idea, but it's not pretty. Write a C/Python/whatever
procedural function, say send_log_record(....), that uses an appropriate
inter-process communication mechanism to send the log message to another
process. Your writers use this function to generate log records. The
outside process receiving log records has a single connection to the DB
open, and it is the only writer to the log table, thus avoiding the
problems with commit races between multiple writers.

Your C function is bypassing transactional isolation by communicating
with another process that modifies the DB, and in the process
eliminating the need to hold the whole transaction up to ensure
predictable log write ordering. However, it *does* mean that you'll get
a log entry even if the transaction then aborts. You might be able to
get around that by doing your logging with a deferred trigger, but
there's always a risk that a later deferred trigger will fail and abort
the transaction.

Doing it via a staging table is a *lot* nicer, and a lot lighter weight,
than having your logging code force serialization of all operations in
transactions that could otherwise -- other than the log ordering
requirement -- run concurrently, say by using an id generation table
that each transaction locks to ensure ordered ID generation and commits.
It's also properly transactional, so you won't have any log records for
aborted transactions.

It's a pity PostgreSQL's RETURNING extension appears not to support
INSERT INTO ... DELETE FROM ... RETURNING, because that'd make your log
record mover a rather efficient one-liner.

The only other alternative I can think of is to have multiple writers
inserting records into the same table the readers are reading from, but
have the writers insert records with a timestamp field (say `visible')
set to null. A single helper (there must only ever be one) can then
repeatedly run a command sequence like:

  BEGIN;
  UPDATE logtable SET visible = current_timestamp WHERE visible IS NULL;
  COMMIT;

to ensure that records become visible in timestamp order. No race here;
as in the staging table approach, you're using a single write
transaction on the table being used by the readers.

Readers would filter for records that have `visible >
last_seen_visible', where last_seen_visible would be a literal: the
greatest value of `visible' seen in the last query for log records. They
could trust that no records could ever be missed.

Unfortunately, with this approach you're incurring the cost of a dead
row for every UPDATE. You can avoid that with 8.3 (using HOT) only if
you have no index on `visible' -- but having no index means a sequential
scan of the log table for every UPDATE making rows visible and for every
SELECT looking for the latest rows. Ouch.

That's basically just another way to write your log staging approach,
anyway. It stays within a single table, but it's not otherwise much
different. I haven't any idea whether it'd perform better or worse than
using a separate log staging table.
If you really want to make somebody cry, I guess you could do it with
dblink - connect back to your own database from dblink and use a short
transaction to commit a log record, using table-based (rather than
sequence) ID generation to ensure that records were inserted in ID
order. That'd restrict the "critical section" in which your various
transactions were unable to run concurrently to a much shorter period,
but would result in a log message being saved even if the transaction
later aborted. It'd also be eye-bleedingly horrible, to the point where
even the "send a message from a C function" approach would be nicer.

--
Craig Ringer
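A sketch of the `visible' column variant described above; the helper's
UPDATE is taken from the message, while the ALTER TABLE and the reader's
query (with '-infinity' standing in for the remembered cutoff) are
assumptions:

-- One-time setup: writers leave visible NULL (the default).
ALTER TABLE logtable ADD COLUMN visible timestamptz;

-- The single helper, run repeatedly: stamp everything not yet visible.
BEGIN;
UPDATE logtable SET visible = current_timestamp WHERE visible IS NULL;
COMMIT;

-- Readers remember the largest visible value they have seen and ask
-- only for newer batches.
SELECT *
FROM logtable
WHERE visible > '-infinity'    -- the last seen visible value goes here
ORDER BY visible, id;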
Craig Ringer wrote:
> That's subject to the same issues, because a transaction's
> current_timestamp() is determined at transaction start.
[...]
> I don't think it's even OK in the case of a single-statement INSERT
> (where the transaction is implicit) and/or with the use of
> clock_timestamp() ... though I'm less sure about that.

I don't mean to rely on *only* the timestamp, but for the reader to
remember both the last ID and the timestamp for that particular
transaction. When the next read occurs it should check to see if there's
an earlier timestamp with a higher ID than that remembered. The database
"knows" that WRITER 1 was there first.

If it's important to the application then the reader will need to take
some action to re-order things based on what it has already read, which
it could do if it compared timestamps and ID order. It needn't keep the
complete set of IDs in memory.

Wait--would WRITER 1 have the higher ID?
(I originally missed replying to all here; sorry about the duplicate,
Vance, but figured others might be interested.)

On Wed, Apr 16, 2008 at 1:55 PM, Vance Maverick <vmaverick@pgp.com> wrote:
> Another approach would be to queue the log entries in a "staging" table,
> so that a single process could move them into the log. [...] I'm
> hoping someone knows a cleverer trick.

Consider a loop like the following:

  advisory lock staging table
  if (entries in table)
      copy entries to main log table as a single transaction
  release advisory lock on staging table
  read out and handle most recent log entries from main table

The advisory lock is automatically released on client disconnect, and
doing the whole thing within one transaction should prevent any partial
copies on failures.

It doesn't matter that there are concurrent inserts to the staging
table, because the staging table is always wiped all at once and
transferred in a synchronous fashion to the main table. You also can't
lose data, because it's always in one of the two tables.

--
- David T. Wilson
david.t.wilson@gmail.com
brian wrote:
> I don't mean to rely on *only* the timestamp, but for the reader to
> remember both the last ID and the timestamp for that particular
> transaction. When the next read occurs it should check to see if there's
> an earlier timestamp with a higher ID than that remembered.
[snip]
> Wait--would WRITER 1 have the higher ID?

No, it'll have a lower id in this case because it calls
nextval('sequence_name') first. Writer 1 would have a lower id, and a
lower timestamp (because its transaction began first) even if it
committed later. Using clock_timestamp() in the insert will not help,
because the first transaction to insert (as in this case) is not
necessarily the first to commit.

If a reader sees a given id and timestamp, that doesn't mean there
aren't transactions with lower ids, and lower timestamps, still
uncommitted or in the process of committing.

What you want is a timestamp that's generated at commit time, with a
guarantee that no later commits will have equal or lower timestamps. As
far as I know (I'm *VERY* far from the authority here) there's no way to
achieve that, so you have to serialize your commits to the table
somehow.

--
Craig Ringer
On Thu, Apr 17, 2008 at 12:35:33AM +0800, Craig Ringer wrote:
> That's subject to the same issues, because a transaction's
> current_timestamp() is determined at transaction start.

But clock_timestamp() (and its ancestors in Postgres) don't have that
restriction. I dunno that it's enough for you, though, since you have
visibility issues as well. You seem to want both the benefits of files
and relational database transactions, and I don't think you can really
have both at once without paying in reader complication.

One way I can think of doing it is to write a seen_log that notes what
the client has already seen with a timestamp of (say) 1 minute. Then you
can say "go forward from this time excluding ids (ids here)".

A
Andrew Sullivan wrote:
> On Thu, Apr 17, 2008 at 12:35:33AM +0800, Craig Ringer wrote:
>> That's subject to the same issues, because a transaction's
>> current_timestamp() is determined at transaction start.
>
> But clock_timestamp() (and its ancestors in Postgres) don't have that
> restriction.

True, as I noted later, but it doesn't help. AFAIK you can't guarantee
that multiple concurrent INSERTs will be committed in the same order
that their clock_timestamp() calls were evaluated. Consequently, you can
still have a situation where a record with a lower timestamp becomes
visible to readers after a record with a higher timestamp has, and after
the reader has already recorded the higher timestamp as their cutoff.

> I dunno that it's enough for you, though, since you have
> visibility issues as well. You seem to want both the benefits of files
> and relational database transactions, and I don't think you can really
> have both at once without paying in reader complication.

Or writer complication.

In the end, the idea that using a file-based log wouldn't have this
problem is based on the implicit assumption that the file-based logging
mechanism would provide some sort of serialization of writes. As POSIX
requires the write() call to be thread safe, write() would be doing its
own internal locking (or doing clever lock-free queueing etc) to ensure
writes are serialized. However, at least in Linux fairly recently,
writes aren't serialized, so you have to do it yourself. See:

http://lwn.net/Articles/180387/

In any case, logging to a file with some sort of writer serialization
isn't significantly different to logging to a database table outside
your transaction using some sort of writer serialization. Both
mechanisms must serialize writers to work. Both mechanisms must operate
outside the transactional rules of the transaction invoking the logging
operation in order to avoid serializing, on the log, all operations in
transactions that write to the log.

> One way I can think of doing it is to write a seen_log that notes what
> the client has already seen with a timestamp of (say) 1 minute. Then you
> can say "go forward from this time excluding ids (ids here)".

It won't work with multiple concurrent writers. There is no guarantee
that an INSERT with a timestamp older than the one you just saw isn't
waiting to commit.

--
Craig Ringer
On Thu, Apr 17, 2008 at 12:44:51PM +0800, Craig Ringer wrote:
> It won't work with multiple concurrent writers. There is no guarantee
> that an INSERT with a timestamp older than the one you just saw isn't
> waiting to commit.

This is pretty unlikely -- I won't say impossible, because I'm sure
there's some kernel-level race condition -- if you use the clock time
approach and SERIALIZABLE mode. You could add a trigger that checks for
other timestamps < yours, I suppose. Of course, that's pretty
heavyweight, too. How much is the absolute serialization worth to you in
performance?

The only other thing I can suggest is what someone else did: commit them
with wallclock timestamps, and then have a different thread wake up
every _n_ seconds and put the records into the proper table in timestamp
order.

A
Oh, one other thing.

On Thu, Apr 17, 2008 at 12:44:51PM +0800, Craig Ringer wrote:
> > One way I can think of doing it is to write a seen_log that notes what
> > the client has already seen with a timestamp of (say) 1 minute. Then
> > you can say "go forward from this time excluding ids (ids here)".
>
> It won't work with multiple concurrent writers. There is no guarantee
> that an INSERT with a timestamp older than the one you just saw isn't
> waiting to commit.

Yeah, I spoke imprecisely. The idea is, "Start at timestamp _t_, but
don't re-process these ones, which I've seen." The trick is to set your
start _t_ far enough back in time that it's incompatible with your
business logic that anything could still be pending from then. This is
nasty and prone to bugs, but it can be coded up.

A
ajs@crankycanuck.ca (Andrew Sullivan) writes:
> Yeah, I spoke imprecisely. The idea is, "Start at timestamp _t_, but
> don't re-process these ones, which I've seen." The trick is to set your
> start _t_ far enough back in time that it's incompatible with your
> business logic that anything could still be pending from then. This is
> nasty and prone to bugs, but it can be coded up.

The alternative pattern is to put a trigger on the table that collects
the primary key ID in a "queue" table. If some updates were "still in
flight" when the queue processor runs, then it can catch them next time.

Curiously enough, I have had *two* meetings today where this was
germane ;-).

--
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://cbbrowne.com/info/wp.html
Do not worry about the bullet that has got your name on it. It will hit
you and it will kill you, no questions asked. The rounds to worry about
are the ones marked: TO WHOM IT MAY CONCERN.
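A minimal sketch of that trigger-plus-queue pattern; all of the names
here (watched_table, change_queue, queue_change) are hypothetical:

-- Hypothetical table whose changes are to be logged.
CREATE TABLE watched_table (
    id   integer PRIMARY KEY,
    data text
);

-- The queue records only keys; the processor fetches current data later.
CREATE TABLE change_queue (
    watched_id integer     NOT NULL,
    queued_at  timestamptz NOT NULL DEFAULT clock_timestamp()
);

CREATE OR REPLACE FUNCTION queue_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO change_queue (watched_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER watched_table_queue
    AFTER INSERT OR UPDATE ON watched_table
    FOR EACH ROW EXECUTE PROCEDURE queue_change();

Rows whose transactions were still in flight when the queue processor
ran stay invisible to it and are simply picked up on its next pass,
which is the point made above.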
Andrew Sullivan wrote:
> On Thu, Apr 17, 2008 at 12:44:51PM +0800, Craig Ringer wrote:
>> It won't work with multiple concurrent writers. There is no guarantee
>> that an INSERT with a timestamp older than the one you just saw isn't
>> waiting to commit.
>
> This is pretty unlikely -- I won't say impossible, because I'm sure
> there's some kernel-level race condition -- if you use the clock time
> approach and SERIALIZABLE mode.

Don't. Understand what SERIALIZABLE mode means (in this context): it
places restrictions on what you CAN and MAY read (i.e. it controls
visibility). If you want to serialize your accesses (which is something
completely different) you are probably better off using an exclusive
table lock on the log table. Otherwise PostgreSQL cannot make the
guarantees you want. Even worse, you MUST NOT use SERIALIZABLE mode if
you are going to check your log table. Similarly, referential integrity
is complex to handle, and PostgreSQL has special mechanisms for it.

Going into detail about concurrently running code: without
synchronization mechanisms you CANNOT control the order in which code is
executed. Also, you are not dealing with a race condition here; it's the
semantics, or requirements, of the ordering. Race conditions are
failures caused by a particular ordering (in time) of events, in such a
way that the state is corrupted. They can be prevented by proper
locking, but are easily overlooked, and the timing constraints are
usually so stringent that they are very hard to detect. Therefore they
are also, usually, very hard to encounter in common situations.

> You could add a trigger that checks for other timestamps
> < yours, I suppose. Of course, that's pretty heavyweight, too. How much
> is the absolute serialization worth to you in performance?

You cannot do this in a reliable way; the PostgreSQL MVCC semantics
don't allow it. The only way is to take at least a write lock on the log
table and ensure you are NOT in serializable mode. Remember that you
CANNOT view tuples that are not yet committed. In fact, with
serializable mode you will only see changes from transactions that
completed BEFORE you started yours.

> The only other thing I can suggest is what someone else did: commit them
> with wallclock timestamps, and then have a different thread wake up
> every _n_ seconds and put the records into the proper table in timestamp
> order.

You can guarantee:
* the proper ordering of log events for a single transaction (by the
  properties of transactions);
* the ordering of modifications/deletes to a single tuple (due to the
  synchronization point caused by the write lock).

This seems enough in the requested case. You can even do so with a
simple sequence counter if you are only interested in the ordering of
events in time.

- Joris
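A sketch of the table-lock alternative mentioned above, assuming the log
table's id is filled in by a sequence default at INSERT time. Note that
this serializes the writing transactions against one another (the cost
pointed out earlier in the thread), though plain readers are not
blocked, since SELECT only takes ACCESS SHARE:

BEGIN;
LOCK TABLE logtable IN EXCLUSIVE MODE;   -- writers queue up here
INSERT INTO logtable (payload) VALUES ('some log message');
COMMIT;                                  -- lock released; ids and commits
                                         -- now come out in the same order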
Craig Ringer wrote:
[snip]
> If you really want to make somebody cry, I guess you could do it with
> dblink - connect back to your own database from dblink and use a short
> transaction to commit a log record, using table-based (rather than
> sequence) ID generation to ensure that records were inserted in ID
> order. That'd restrict the "critical section" in which your various
> transactions were unable to run concurrently to a much shorter period,
> but would result in a log message being saved even if the transaction
> later aborted. It'd also be eye-bleedingly horrible, to the point where
> even the "send a message from a C function" approach would be nicer.

This will not work for the problem the TS has. Let a single transaction
hang for a long enough time before commit, while others succeed. It will
keep ordering of changes, but commits might come unordered.

The issue is, you don't really have the critical section as you
describe; there is no SINGLE lock you are 'fighting' for.

It will work with an added table write lock (or up), which will be the
lock for your critical section.

In my opinion I would just forget about this one rather quickly, as you
more or less proposed...

- Joris
Thanks to all for your help. I've adopted the scheme involving a
"staging" table -- the writer processes insert into that, then a single
"publisher" process pulls from that and writes to the log, giving a
clean serial order for any reader of the log.

Vance

On Mon, 2008-04-21 at 23:59 +0200, Joris Dobbelsteen wrote:
> This will not work for the problem the TS has. Let a single transaction
> hang for a long enough time before commit, while others succeed. It will
> keep ordering of changes, but commits might come unordered.
[...]
> In my opinion I would just forget about this one rather quickly, as you
> more or less proposed...
David Wilson wrote:
> Consider a loop like the following:
>
>   advisory lock staging table
>   if (entries in table)
>       copy entries to main log table as a single transaction
>   release advisory lock on staging table
>   read out and handle most recent log entries from main table
[...]

If you want to clean up the staging table I have some concerns about the
advisory lock. I think you mean an exclusive table lock.

There are two other options as well:

* Track which data is copied, and remove from the staging table those
  rows that are in the new table.
* Use serializable mode for the staging-to-log-copying transactions. In
  this way you can just copy the table and throw away everything
  (without checking). This seems rather cheap and allows for concurrent
  processing.

- Joris
On Mon, Apr 21, 2008 at 7:55 PM, Joris Dobbelsteen
<joris@familiedobbelsteen.nl> wrote:
> If you want to clean up the staging table I have some concerns about
> the advisory lock. I think you mean an exclusive table lock.

Either works, really. An advisory lock is really just a lock over which
you have control of the meaning, as long as you're using it in the
appropriate places. Also, an advisory lock held only by the processes
doing staging-to-log moves would allow writes into the staging table to
continue concurrently with the staging-to-log transaction (whereas an
exclusive lock would unnecessarily prevent them).

Also, while Vance appears to have chosen to have a dedicated
staging-to-log process, even that isn't necessary -- each reader can
simply do the lock/clear staging/unlock before any attempt to read.
Unless you're polling that log table at truly crazy rates, the overhead
should be negligible, and it will ensure that the staging table is
simply cleared out "whenever necessary" while removing the complexity
of a separate process.

--
- David T. Wilson
david.t.wilson@gmail.com
David Wilson wrote:
> Either works, really. An advisory lock is really just a lock over which
> you have control of the meaning, as long as you're using it in the
> appropriate places. Also, an advisory lock held only by the processes
> doing staging-to-log moves would allow writes into the staging table to
> continue concurrently with the staging-to-log transaction (whereas an
> exclusive lock would unnecessarily prevent them).

Describe the mechanism, because I don't really believe it yet. I think
you need to take an advisory lock around every commit of every
transaction that writes to the log table. If you are only using the
advisory lock in the staging-to-log transaction, how does it prevent
newly committed tuples from showing up partway through that process?
(You can't both delete and insert in a single statement, I believe, in
which case you won't have a problem anyway.)

> Also, while Vance appears to have chosen to have a dedicated
> staging-to-log process, even that isn't necessary -- each reader can
> simply do the lock/clear staging/unlock before any attempt to read.
[...]

Using serializable mode for the staging-to-log process seems to be the
most efficient method, as it won't even block writers.

- Joris
On Tue, Apr 22, 2008 at 9:52 AM, Joris Dobbelsteen
<joris@familiedobbelsteen.nl> wrote:
> Describe the mechanism, because I don't really believe it yet. I think
> you need to take an advisory lock around every commit of every
> transaction that writes to the log table.

Consider some number of reader processes and some number of writer
processes.

Writer processes touch only the staging table, and solely do inserts
into it. As a result, writer processes cannot interfere with each other
in any way and do not require any synchronization beyond that provided
by MVCC.

Reader processes are interested in polling the logging table at
intervals. In the process, they also act as staging-to-log movers. This
act (because it is destructive and because we require serialized inserts
for id generation in the log table) must take a lock that prevents other
readers from attempting the same work at the same time.

Each reader process therefore has a loop that appears as follows:

1) Obtain advisory lock.
2) Begin transaction.
3) For each row in staging table, insert copy into log table.
4) Delete all visible rows from staging table.
5) Commit transaction.
6) Release advisory lock.
7) Handle not-yet-seen rows in the logging table (this is the primary
   work of the readers).
8) Sleep for the desired interval and return to 1).

We require two types of synchronization, and the above takes care of
both:

1) The advisory lock prevents multiple readers from doing simultaneous
   staging-to-log moves.
2) The transaction block ensures that the reader will see a consistent
   state on the staging table while writers may write at the same time;
   writes that occur during the reader's transaction block will simply
   be ignored during this round of reading.

You need both types of synchronization to avoid problems -- taking an
exclusive lock would simply be the sledgehammer method of doing the
synchronization, since it would take the place of both the advisory lock
and the transaction at the same time, but would also block writers.

--
- David T. Wilson
david.t.wilson@gmail.com
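A sketch of one pass of that loop in SQL, reusing the hypothetical
log_staging/logtable names from the earlier sketches; the advisory lock
key (12345) is arbitrary, and the transaction is opened at REPEATABLE
READ so that steps 3 and 4 work from a single snapshot (the visibility
question taken up in the next few messages):

SELECT pg_advisory_lock(12345);          -- 1) only one mover at a time

BEGIN ISOLATION LEVEL REPEATABLE READ;   -- 2) one snapshot for steps 3 and 4
INSERT INTO logtable (payload)           -- 3) copy the staged rows; the log's
    SELECT payload FROM log_staging      --    id is assigned here, serially
    ORDER BY entered;
DELETE FROM log_staging;                 -- 4) same snapshot, so same rows
COMMIT;                                  -- 5)

SELECT pg_advisory_unlock(12345);        -- 6)

SELECT * FROM logtable                   -- 7) handle rows not yet seen; 0
WHERE id > 0                             --    stands in for this reader's
ORDER BY id;                             --    remembered high-water mark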
David Wilson wrote:
> Each reader process therefore has a loop that appears as follows:
>
> 1) Obtain advisory lock.
> 2) Begin transaction.
> 3) For each row in staging table, insert copy into log table.
> 4) Delete all visible rows from staging table.

Ah, yes, all visible rows...

My point is that, unless you use a transaction with serializable
isolation, the set of visible rows for the second statement might be
different from the set you copied into the log table. With the normal
Read Committed isolation level you suffer from a possible nonrepeatable
read that might change tuple visibility between the two statements.

[...]
> 2) The transaction block ensures that the reader will see a consistent
>    state on the staging table while writers may write at the same time;
>    writes that occur during the reader's transaction block will simply
>    be ignored during this round of reading.

See above: you are demanding that a nonrepeatable read cannot occur.

> You need both types of synchronization to avoid problems -- taking an
> exclusive lock would simply be the sledgehammer method of doing the
> synchronization, since it would take the place of both the advisory lock
> and the transaction at the same time, but would also block writers.

I agree with you on this, but it does guarantee the impossibility of a
nonrepeatable read, at the cost of concurrency. There seems to be a
better solution indeed.

- Joris
On Tue, Apr 22, 2008 at 2:48 PM, Joris Dobbelsteen
<joris@familiedobbelsteen.nl> wrote:
> Ah, yes, all visible rows...
>
> My point is that, unless you use a transaction with serializable
> isolation, the set of visible rows for the second statement might be
> different from the set you copied into the log table.

That depends on implementation. A select into ... to do the initial
copy, followed by a delete where ... with the where clause referencing
the log table itself to ensure that we delete only things that now exist
in the log table, or a row-by-row insert/delete pair -- either would
provide the appropriate level of protection from accidental deletion of
more things than you intended, without harming concurrency.

The delete referencing the log table might require that the log table be
indexed for performance, but it's likely that such indexing would be
done anyway for general log use.

--
- David T. Wilson
david.t.wilson@gmail.com
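A sketch of the delete-referencing-the-log variant under plain Read
Committed; the staging_id and source_id columns, and the index, are
assumptions added so the DELETE can identify exactly the rows that made
it across:

-- One-time setup (hypothetical columns): staging rows carry their own
-- key, the log keeps a copy of it, and an index keeps the join cheap.
CREATE SEQUENCE log_staging_id_seq;
ALTER TABLE log_staging
    ADD COLUMN staging_id bigint NOT NULL
    DEFAULT nextval('log_staging_id_seq');
ALTER TABLE logtable ADD COLUMN source_id bigint;
CREATE INDEX logtable_source_id_idx ON logtable (source_id);

BEGIN;
INSERT INTO logtable (source_id, payload)
    SELECT staging_id, payload FROM log_staging ORDER BY staging_id;

-- Remove only rows that now provably exist in the log; anything
-- committed to staging after the INSERT's snapshot simply waits for
-- the next pass.
DELETE FROM log_staging s
    USING logtable l
    WHERE l.source_id = s.staging_id;
COMMIT;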
On Wed, Apr 23, 2008 at 12:29 AM, David Wilson
<david.t.wilson@gmail.com> wrote:
> That depends on implementation. A select into ... to do the initial
> copy, followed by a delete where ... with the where clause referencing
> the log table itself to ensure that we delete only things that now exist
> in the log table, or a row-by-row insert/delete pair -- either would
> provide the appropriate level of protection from accidental deletion of
> more things than you intended, without harming concurrency.
[...]
I think this plpgsql function would solve the problem of an atomic read-and-delete operation...
create or replace function log_rotate() returns void as $$
declare
    rec record;
begin
    -- Move every row from the staging table t1 into the log table t2,
    -- deleting and inserting row by row within a single function call.
    for rec in delete from t1 returning * loop
        insert into t2 values( rec.a, rec.b );
    end loop;
end;
$$ language 'plpgsql';
select log_rotate();
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
Mail sent from my BlackLaptop device
Gurjeet Singh wrote:
> On Wed, Apr 23, 2008 at 12:29 AM, David Wilson
> <david.t.wilson@gmail.com> wrote:
>> That depends on implementation. A select into ... to do the initial
>> copy, followed by a delete where ... with the where clause referencing
>> the log table itself to ensure that we delete only things that now
>> exist in the log table, or a row-by-row insert/delete pair.
[...]

Of course. The point is that this is another way to define "visibility"
in this context: present in the log table. A suitable definition is
needed.

> I think this plpgsql function would solve the problem of an atomic
> read-and-delete operation...
>
> create or replace function log_rotate() returns void as $$
> declare
>     rec record;
> begin
>     for rec in delete from t1 returning * loop
>         insert into t2 values( rec.a, rec.b );
>     end loop;
> end;
> $$ language 'plpgsql';
>
> select log_rotate();

Don't forget ordering; this was important before...

START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT ... INTO log FROM staging ORDER BY ...;
DELETE FROM staging;
COMMIT;

Don't know if that ORDER BY works. It should in this case.

- Joris
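For what it's worth, a hedged cleanup of that closing sketch: SELECT ...
INTO would create a brand-new table, so INSERT ... SELECT is presumably
what is meant, and it is the ORDER BY on the feeding SELECT that makes
the ids in the log come out in the chosen order. The payload and entered
columns are stand-ins for the real ones:

START TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Rows are inserted, and their id defaults evaluated, in the order the
-- SELECT produces them, so the ORDER BY controls the final id order.
INSERT INTO log (payload)
    SELECT payload FROM staging ORDER BY entered;

-- Same snapshot as the statement above, so this deletes exactly those rows.
DELETE FROM staging;

COMMIT;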