Thread: table as log (multiple writers and readers)
I want to create a table with the semantics of a log. There may be
multiple concurrent writers, because log entries will be generated by
triggers. And there will be multiple concurrent readers -- mostly
remote processes watching the log over time. I'd like to guarantee that
each of those readers will see the same, complete sequence of changes.
For a first attempt, I created a table with a serial column (actually I
created the sequence separately rather than using SERIAL, but that's a
detail). The readers
connect remotely, scan through the table in sequence order, and remember
the ID of the last row they read. When they read again, they start from
after that ID -- the query is roughly
SELECT * FROM logtable WHERE id > ? ORDER BY id
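For concreteness, a minimal sketch of this first attempt -- the ts and
payload columns and the PREPARE wrapper around the poll are illustrative
assumptions, not part of the scheme itself:

CREATE SEQUENCE logtable_id_seq;

CREATE TABLE logtable (
    id      bigint PRIMARY KEY DEFAULT nextval('logtable_id_seq'),
    ts      timestamptz NOT NULL DEFAULT current_timestamp,
    payload text                 -- whatever the triggers want to record
);

-- Reader poll: the parameter is the highest id this reader has processed.
PREPARE poll(bigint) AS
    SELECT * FROM logtable WHERE id > $1 ORDER BY id;
EXECUTE poll(0);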
But there's a problem with this -- the rows may not become visible in the
log in ID order. For example, if two concurrent sessions get the
sequence IDs 3 and 4, the one with ID 4 might commit first. If a reader
queries the table at that moment, it will see the sequence (1, 2, 4).
Later, if the other session commits, a log entry with ID 3 will be
added. The IDs will now be (1, 2, 4, 3) -- but if the reader comes back
to poll the log again, it will miss entry 3. In order not to miss it,
the reader would have to remember much more information than just a
high-water mark. (I think it would have to remember the complete set of
IDs it had processed already, and then scan the whole table to find
entries not in the set.)
Is there an approach that will give the semantics I need? In
particular, I'd like a reader that has already read up to a certain
point in the log to be able to restart at or near that same point, so it
wouldn't have to reread everything.
(I recognize that these log semantics are closer to those provided by
files, not database tables. A file is an option I'll consider, but
obviously it lacks transactional semantics.)
Thanks,
Vance
Vance Maverick wrote:
> I want to create a table with the semantics of a log. There may be
> multiple concurrent writers, because log entries will be generated by
> triggers. And there will be multiple concurrent readers -- mostly
> remote processes watching the log over time. I'd like to guarantee that
> each of those readers will see the same, complete sequence of changes.
[...]
> Is there an approach that will give the semantics I need?

Use a timestamp column also.
brian wrote:
> Use a timestamp column also.

That's subject to the same issues, because a transaction's
current_timestamp() is determined at transaction start. So, in a
situation like this:

  WRITER 1      WRITER 2      READER 1
  --------------------------------------------
  BEGIN
                BEGIN
  INSERT
                INSERT
                COMMIT
                              BEGIN
                              SELECT
                              COMMIT

then READER 1 will see the most recent timestamp as that inserted by
WRITER 2, but it won't see the row inserted by WRITER 1 with an earlier
timestamp.

I don't think it's even OK in the case of a single-statement INSERT
(where the transaction is implicit) and/or with the use of
clock_timestamp() ... though I'm less sure about that.

--
Craig Ringer
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> wrote:
> brian wrote:
>
> > Use a timestamp column also.
>
> That's subject to the same issues.
[...]
> I don't think it's even OK in the case of a single-statement INSERT
> (where the transaction is implicit) and/or with the use of
> clock_timestamp() ... though I'm less sure about that.
No, you're right. The problem is that the timestamp is chosen some time
before the commit succeeds. So if there are concurrent writers
committing at the same time, the order of commit is determined by a
race, one that takes place after the timestamp values are set.
Another approach would be to queue the log entries in a "staging" table,
so that a single process could move them into the log. This is fairly
heavyweight, but it would guarantee the consistent sequencing of the log
as seen by a reader (even if the order of entries in the log didn't
always reflect the true commit sequence in the staging table). I'm
hoping someone knows a cleverer trick.
Vance
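A minimal sketch of that staging idea, assuming the hypothetical names
log_staging and logtable (with its sequence-assigned id, as in the
earlier sketch); writers only ever insert into the staging table, and
the move itself is sketched further down the thread:

-- Writers (e.g. the triggers) append here; no ordering is promised yet.
-- Ordering is imposed later, when the single publisher moves rows into
-- logtable and they pick up their ids there.
CREATE TABLE log_staging (
    entered timestamptz NOT NULL DEFAULT clock_timestamp(),
    payload text
);

INSERT INTO log_staging (payload) VALUES ('something worth logging');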
Vance Maverick wrote:
> Another approach would be to queue the log entries in a "staging" table,
> so that a single process could move them into the log. This is fairly
> heavyweight, but it would guarantee the consistent sequencing of the log
> as seen by a reader.

The way I see it, one way or another you are going to have to serialize
writers, otherwise you'll always be faced with commit-order races.

I do have one other idea, but it's not pretty. Write a C/Python/whatever
procedural function, say send_log_record(....), that uses an appropriate
inter-process communication mechanism to send the log message to another
process. Your writers use this function to generate log records. The
outside process receiving log records has a single connection to the DB
open, and it is the only writer to the log table, thus avoiding the
problems with commit races between multiple writers.

Your C function is bypassing transactional isolation by communicating
with another process that modifies the DB, and in the process
eliminating the need to hold the whole transaction up to ensure
predictable log write ordering. However, it *does* mean that you'll get
a log entry even if the transaction then aborts. You might be able to
get around that by doing your logging with a deferred trigger, but
there's always a risk that a later deferred trigger will fail and abort
the transaction.

Doing it via a staging table is a *lot* nicer, and a lot lighter weight,
than having your logging code force serialization of all operations in
transactions that could otherwise -- other than the log ordering
requirement -- run concurrently, say by using an id generation table
that each transaction locks to ensure ordered ID generation and commits.
It's also properly transactional, so you won't have any log records for
aborted transactions.

It's a pity PostgreSQL's RETURNING extension appears not to support
INSERT INTO ... DELETE FROM ... RETURNING, because that'd make your log
record mover a rather efficient one-liner.

The only other alternative I can think of is to have multiple writers
inserting records into the same table the readers are reading from, but
have the writers insert records with a timestamp field (say `visible')
set to null. A single helper (there must only ever be one) can then
repeatedly run a command sequence like:

  BEGIN;
  UPDATE logtable SET visible = current_timestamp WHERE visible IS NULL;
  COMMIT;

to ensure that records become visible in timestamp order. No race here;
as in the staging table approach, you're using a single write
transaction on the table being used by the readers.

Readers would filter for records that have `visible >
last_seen_visible', where last_seen_visible would be a literal: the
greatest value of `visible' seen in the last query for log records. They
could trust that no records could ever be missed.

Unfortunately, with this approach you're incurring the cost of a dead
row for every UPDATE. You can avoid that with 8.3 (using HOT) only if
you have no index on `visible' -- but having no index means a sequential
scan of the log table for every UPDATE making rows visible and for every
SELECT looking for the latest rows. Ouch.

That's basically just another way to write your log staging approach,
anyway. It stays within a single table, but it's not otherwise much
different. I haven't any idea whether it'd perform better or worse than
using a separate log staging table.
If you really want to make somebody cry, I guess you could do it with
dblink - connect back to your own database from dblink and use a short
transaction to commit a log record, using table-based (rather than
sequence) ID generation to ensure that records were inserted in ID
order. That'd restrict the "critical section" in which your various
transactions were unable to run concurrently to a much shorter period,
but would result in a log message being saved even if the transaction
later aborted. It'd also be eye-bleedingly horrible, to the point where
even the "send a message from a C function" approach would be nicer.

--
Craig Ringer
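A sketch of the `visible' column variant described above; the helper's
UPDATE is taken from the message, while the ALTER TABLE and the reader's
query (with '-infinity' standing in for the remembered cutoff) are
assumptions:

-- One-time setup: writers leave visible NULL (the default).
ALTER TABLE logtable ADD COLUMN visible timestamptz;

-- The single helper, run repeatedly: stamp everything not yet visible.
BEGIN;
UPDATE logtable SET visible = current_timestamp WHERE visible IS NULL;
COMMIT;

-- Readers remember the largest visible value they have seen and ask
-- only for newer batches.
SELECT *
FROM logtable
WHERE visible > '-infinity'    -- the last seen visible value goes here
ORDER BY visible, id;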
Craig Ringer wrote:
> That's subject to the same issues, because a transaction's
> current_timestamp() is determined at transaction start.
[...]
> I don't think it's even OK in the case of a single-statement INSERT
> (where the transaction is implicit) and/or with the use of
> clock_timestamp() ... though I'm less sure about that.

I don't mean to rely on *only* the timestamp, but for the reader to
remember both the last ID and the timestamp for that particular
transaction. When the next read occurs it should check to see if there's
an earlier timestamp with a higher ID than that remembered. The database
"knows" that WRITER 1 was there first.

If it's important to the application then the reader will need to take
some action to re-order things based on what it has already read, which
it could do if it compared timestamps and ID order. It needn't keep the
complete set of IDs in memory.

Wait--would WRITER 1 have the higher ID?
(I originally missed replying to all here; sorry about the duplicate,
Vance, but figured others might be interested.)

On Wed, Apr 16, 2008 at 1:55 PM, Vance Maverick <vmaverick@pgp.com> wrote:
> Another approach would be to queue the log entries in a "staging" table,
> so that a single process could move them into the log. [...] I'm
> hoping someone knows a cleverer trick.

Consider a loop like the following:

  advisory lock staging table
  if (entries in table)
      copy entries to main log table as a single transaction
  release advisory lock on staging table
  read out and handle most recent log entries from main table

The advisory lock is automatically released on client disconnect, and
doing the whole thing within one transaction should prevent any partial
copies on failures.

It doesn't matter that there are concurrent inserts to the staging
table, because the staging table is always wiped all at once and
transferred in a synchronous fashion to the main table. You also can't
lose data, because it's always in one of the two tables.

--
- David T. Wilson
david.t.wilson@gmail.com
brian wrote:
> I don't mean to rely on *only* the timestamp, but for the reader to
> remember both the last ID and the timestamp for that particular
> transaction. When the next read occurs it should check to see if there's
> an earlier timestamp with a higher ID than that remembered.
[snip]
> Wait--would WRITER 1 have the higher ID?

No, it'll have a lower id in this case because it calls
nextval('sequence_name') first. Writer 1 would have a lower id, and a
lower timestamp (because its transaction began first) even if it
committed later. Using clock_timestamp() in the insert will not help,
because the first transaction to insert (as in this case) is not
necessarily the first to commit.

If a reader sees a given id and timestamp, that doesn't mean there
aren't transactions with lower ids, and lower timestamps, still
uncommitted or in the process of committing.

What you want is a timestamp that's generated at commit time, with a
guarantee that no later commits will have equal or lower timestamps. As
far as I know (I'm *VERY* far from the authority here) there's no way to
achieve that, so you have to serialize your commits to the table
somehow.

--
Craig Ringer
On Thu, Apr 17, 2008 at 12:35:33AM +0800, Craig Ringer wrote:
> That's subject to the same issues, because a transaction's
> current_timestamp() is determined at transaction start.

But clock_timestamp() (and its ancestors in Postgres) don't have that
restriction. I dunno that it's enough for you, though, since you have
visibility issues as well. You seem to want both the benefits of files
and relational database transactions, and I don't think you can really
have both at once without paying in reader complication.

One way I can think of doing it is to write a seen_log that notes what
the client has already seen with a timestamp of (say) 1 minute. Then you
can say "go forward from this time excluding ids (ids here)".

A
Andrew Sullivan wrote:
> On Thu, Apr 17, 2008 at 12:35:33AM +0800, Craig Ringer wrote:
>> That's subject to the same issues, because a transaction's
>> current_timestamp() is determined at transaction start.
>
> But clock_timestamp() (and its ancestors in Postgres) don't have that
> restriction.

True, as I noted later, but it doesn't help. AFAIK you can't guarantee
that multiple concurrent INSERTs will be committed in the same order
that their clock_timestamp() calls were evaluated. Consequently, you can
still have a situation where a record with a lower timestamp becomes
visible to readers after a record with a higher timestamp has, and after
the reader has already recorded the higher timestamp as their cutoff.

> I dunno that it's enough for you, though, since you have
> visibility issues as well. You seem to want both the benefits of files
> and relational database transactions, and I don't think you can really
> have both at once without paying in reader complication.

Or writer complication.

In the end, the idea that using a file-based log wouldn't have this
problem is based on the implicit assumption that the file-based logging
mechanism would provide some sort of serialization of writes. As POSIX
requires the write() call to be thread safe, write() would be doing its
own internal locking (or doing clever lock-free queueing etc) to ensure
writes are serialized. However, at least in Linux fairly recently,
writes aren't serialized, so you have to do it yourself. See:

http://lwn.net/Articles/180387/

In any case, logging to a file with some sort of writer serialization
isn't significantly different to logging to a database table outside
your transaction using some sort of writer serialization. Both
mechanisms must serialize writers to work. Both mechanisms must operate
outside the transactional rules of the transaction invoking the logging
operation in order to avoid serializing, on the log, all operations in
transactions that write to the log.

> One way I can think of doing it is to write a seen_log that notes what
> the client has already seen with a timestamp of (say) 1 minute. Then you
> can say "go forward from this time excluding ids (ids here)".

It won't work with multiple concurrent writers. There is no guarantee
that an INSERT with a timestamp older than the one you just saw isn't
waiting to commit.

--
Craig Ringer
On Thu, Apr 17, 2008 at 12:44:51PM +0800, Craig Ringer wrote:
> It won't work with multiple concurrent writers. There is no guarantee
> that an INSERT with a timestamp older than the one you just saw isn't
> waiting to commit.

This is pretty unlikely -- I won't say impossible, because I'm sure
there's some kernel-level race condition -- if you use the clock time
approach and SERIALIZABLE mode. You could add a trigger that checks for
other timestamps < yours, I suppose. Of course, that's pretty
heavyweight, too. How much is the absolute serialization worth to you in
performance?

The only other thing I can suggest is what someone else did: commit them
with wallclock timestamps, and then have a different thread wake up
every _n_ seconds and put the records into the proper table in timestamp
order.

A
Oh, one other thing.

On Thu, Apr 17, 2008 at 12:44:51PM +0800, Craig Ringer wrote:
> > One way I can think of doing it is to write a seen_log that notes what
> > the client has already seen with a timestamp of (say) 1 minute. Then
> > you can say "go forward from this time excluding ids (ids here)".
>
> It won't work with multiple concurrent writers. There is no guarantee
> that an INSERT with a timestamp older than the one you just saw isn't
> waiting to commit.

Yeah, I spoke imprecisely. The idea is, "Start at timestamp _t_, but
don't re-process these ones, which I've seen." The trick is to set your
start _t_ far enough back in time that it's incompatible with your
business logic that anything could still be pending from then. This is
nasty and prone to bugs, but it can be coded up.

A
ajs@crankycanuck.ca (Andrew Sullivan) writes:
> Yeah, I spoke imprecisely. The idea is, "Start at timestamp _t_, but
> don't re-process these ones, which I've seen." The trick is to set your
> start _t_ far enough back in time that it's incompatible with your
> business logic that anything could still be pending from then. This is
> nasty and prone to bugs, but it can be coded up.

The alternative pattern is to put a trigger on the table that collects
the primary key ID in a "queue" table. If some updates were "still in
flight" when the queue processor runs, then it can catch them next time.

Curiously enough, I have had *two* meetings today where this was
germane ;-).

--
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://cbbrowne.com/info/wp.html
Do not worry about the bullet that has got your name on it. It will hit
you and it will kill you, no questions asked. The rounds to worry about
are the ones marked: TO WHOM IT MAY CONCERN.
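A minimal sketch of that trigger-plus-queue pattern; all of the names
here (watched_table, change_queue, queue_change) are hypothetical:

-- Hypothetical table whose changes are to be logged.
CREATE TABLE watched_table (
    id   integer PRIMARY KEY,
    data text
);

-- The queue records only keys; the processor fetches current data later.
CREATE TABLE change_queue (
    watched_id integer     NOT NULL,
    queued_at  timestamptz NOT NULL DEFAULT clock_timestamp()
);

CREATE OR REPLACE FUNCTION queue_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO change_queue (watched_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER watched_table_queue
    AFTER INSERT OR UPDATE ON watched_table
    FOR EACH ROW EXECUTE PROCEDURE queue_change();

Rows whose transactions were still in flight when the queue processor
ran stay invisible to it and are simply picked up on its next pass,
which is the point made above.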
Andrew Sullivan wrote:
> On Thu, Apr 17, 2008 at 12:44:51PM +0800, Craig Ringer wrote:
>> It won't work with multiple concurrent writers. There is no guarantee
>> that an INSERT with a timestamp older than the one you just saw isn't
>> waiting to commit.
>
> This is pretty unlikely -- I won't say impossible, because I'm sure
> there's some kernel-level race condition -- if you use the clock time
> approach and SERIALIZABLE mode.

Don't. Understand what SERIALIZABLE mode means (in this context): it
places restrictions on what you CAN and MAY read (i.e. it controls
visibility). If you want to serialize your accesses (which is something
completely different) you are probably better off using an exclusive
table lock on the log table. Otherwise PostgreSQL cannot make the
guarantees you want. Even worse, you MUST NOT use SERIALIZABLE mode if
you are going to check your log table. Similarly, referential integrity
is complex to handle, and PostgreSQL has special mechanisms for it.

Going into detail about concurrently running code: without
synchronization mechanisms you CANNOT control the order in which code is
executed. Also, you are not dealing with a race condition here; it's the
semantics, or requirements, of the ordering. Race conditions are
failures caused by a particular ordering (in time) of events, in such a
way that the state is corrupted. They can be prevented by proper
locking, but are easily overlooked, and the timing constraints are
usually so stringent that they are very hard to detect. Therefore they
are also, usually, very hard to encounter in common situations.

> You could add a trigger that checks for other timestamps
> < yours, I suppose. Of course, that's pretty heavyweight, too. How much
> is the absolute serialization worth to you in performance?

You cannot do this in a reliable way; the PostgreSQL MVCC semantics
don't allow it. The only way is to take at least a write lock on the log
table and ensure you are NOT in serializable mode. Remember that you
CANNOT view tuples that are not yet committed. In fact, with
serializable mode you will only see changes from transactions that
completed BEFORE you started yours.

> The only other thing I can suggest is what someone else did: commit them
> with wallclock timestamps, and then have a different thread wake up
> every _n_ seconds and put the records into the proper table in timestamp
> order.

You can guarantee:
* the proper ordering of log events for a single transaction (by the
  properties of transactions);
* the ordering of modifications/deletes to a single tuple (due to the
  synchronization point caused by the write lock).

This seems enough in the requested case. You can even do so with a
simple sequence counter if you are only interested in the ordering of
events in time.

- Joris
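A sketch of the table-lock alternative mentioned above, assuming the log
table's id is filled in by a sequence default at INSERT time. Note that
this serializes the writing transactions against one another (the cost
pointed out earlier in the thread), though plain readers are not
blocked, since SELECT only takes ACCESS SHARE:

BEGIN;
LOCK TABLE logtable IN EXCLUSIVE MODE;   -- writers queue up here
INSERT INTO logtable (payload) VALUES ('some log message');
COMMIT;                                  -- lock released; ids and commits
                                         -- now come out in the same order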
Craig Ringer wrote:
[snip]
> If you really want to make somebody cry, I guess you could do it with
> dblink - connect back to your own database from dblink and use a short
> transaction to commit a log record, using table-based (rather than
> sequence) ID generation to ensure that records were inserted in ID
> order. That'd restrict the "critical section" in which your various
> transactions were unable to run concurrently to a much shorter period,
> but would result in a log message being saved even if the transaction
> later aborted. It'd also be eye-bleedingly horrible, to the point where
> even the "send a message from a C function" approach would be nicer.

This will not work for the problem the TS has. Let a single transaction
hang for a long enough time before commit, while others succeed. It will
keep ordering of changes, but commits might come unordered.

The issue is, you don't really have the critical section as you
describe; there is no SINGLE lock you are 'fighting' for.

It will work with an added table write lock (or up), which will be the
lock for your critical section.

In my opinion I would just forget about this one rather quickly, as you
more or less proposed...

- Joris
Thanks to all for your help. I've adopted the scheme involving a
"staging" table -- the writer processes insert into that, then a single
"publisher" process pulls from that and writes to the log, giving a
clean serial order for any reader of the log.

Vance

On Mon, 2008-04-21 at 23:59 +0200, Joris Dobbelsteen wrote:
> This will not work for the problem the TS has. Let a single transaction
> hang for a long enough time before commit, while others succeed. It will
> keep ordering of changes, but commits might come unordered.
[...]
> In my opinion I would just forget about this one rather quickly, as you
> more or less proposed...
David Wilson wrote:
> Consider a loop like the following:
>
>   advisory lock staging table
>   if (entries in table)
>       copy entries to main log table as a single transaction
>   release advisory lock on staging table
>   read out and handle most recent log entries from main table
[...]

If you want to clean up the staging table I have some concerns about the
advisory lock. I think you mean an exclusive table lock.

There are two other options as well:

* Track which data is copied, and remove from the staging table those
  rows that are in the new table.
* Use serializable mode for the staging-to-log-copying transactions. In
  this way you can just copy the table and throw away everything
  (without checking). This seems rather cheap and allows for concurrent
  processing.

- Joris
On Mon, Apr 21, 2008 at 7:55 PM, Joris Dobbelsteen
<joris@familiedobbelsteen.nl> wrote:
> If you want to clean up the staging table I have some concerns about
> the advisory lock. I think you mean an exclusive table lock.

Either works, really. An advisory lock is really just a lock over which
you have control of the meaning, as long as you're using it in the
appropriate places. Also, an advisory lock held only by the processes
doing staging-to-log moves would allow writes into the staging table to
continue concurrently with the staging-to-log transaction (whereas an
exclusive lock would unnecessarily prevent them).

Also, while Vance appears to have chosen to have a dedicated
staging-to-log process, even that isn't necessary -- each reader can
simply do the lock/clear staging/unlock before any attempt to read.
Unless you're polling that log table at truly crazy rates, the overhead
should be negligible, and it will ensure that the staging table is
simply cleared out "whenever necessary" while removing the complexity
of a separate process.

--
- David T. Wilson
david.t.wilson@gmail.com
David Wilson wrote:
> Either works, really. An advisory lock is really just a lock over which
> you have control of the meaning, as long as you're using it in the
> appropriate places. Also, an advisory lock held only by the processes
> doing staging-to-log moves would allow writes into the staging table to
> continue concurrently with the staging-to-log transaction (whereas an
> exclusive lock would unnecessarily prevent them).

Describe the mechanism, because I don't really believe it yet. I think
you need to take an advisory lock around every commit of every
transaction that writes to the log table. If you are only using the
advisory lock in the staging-to-log transaction, how does it prevent
newly committed tuples from showing up partway through that process?
(You can't both delete and insert in a single statement, I believe, in
which case you won't have a problem anyway.)

> Also, while Vance appears to have chosen to have a dedicated
> staging-to-log process, even that isn't necessary -- each reader can
> simply do the lock/clear staging/unlock before any attempt to read.
[...]

Using serializable mode for the staging-to-log process seems to be the
most efficient method, as it won't even block writers.

- Joris
On Tue, Apr 22, 2008 at 9:52 AM, Joris Dobbelsteen
<joris@familiedobbelsteen.nl> wrote:
> Describe the mechanism, because I don't really believe it yet. I think
> you need to take an advisory lock around every commit of every
> transaction that writes to the log table.

Consider some number of reader processes and some number of writer
processes.

Writer processes touch only the staging table, and solely do inserts
into it. As a result, writer processes cannot interfere with each other
in any way and do not require any synchronization beyond that provided
by MVCC.

Reader processes are interested in polling the logging table at
intervals. In the process, they also act as staging-to-log movers. This
act (because it is destructive and because we require serialized inserts
for id generation in the log table) must take a lock that prevents other
readers from attempting the same work at the same time.

Each reader process therefore has a loop that appears as follows:

1) Obtain advisory lock.
2) Begin transaction.
3) For each row in staging table, insert copy into log table.
4) Delete all visible rows from staging table.
5) Commit transaction.
6) Release advisory lock.
7) Handle not-yet-seen rows in the logging table (this is the primary
   work of the readers).
8) Sleep for the desired interval and return to 1).

We require two types of synchronization, and the above takes care of
both:

1) The advisory lock prevents multiple readers from doing simultaneous
   staging-to-log moves.
2) The transaction block ensures that the reader will see a consistent
   state on the staging table while writers may write at the same time;
   writes that occur during the reader's transaction block will simply
   be ignored during this round of reading.

You need both types of synchronization to avoid problems -- taking an
exclusive lock would simply be the sledgehammer method of doing the
synchronization, since it would take the place of both the advisory lock
and the transaction at the same time, but would also block writers.

--
- David T. Wilson
david.t.wilson@gmail.com
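A sketch of one pass of that loop in SQL, reusing the hypothetical
log_staging/logtable names from the earlier sketches; the advisory lock
key (12345) is arbitrary, and the transaction is opened at REPEATABLE
READ so that steps 3 and 4 work from a single snapshot (the visibility
question taken up in the next few messages):

SELECT pg_advisory_lock(12345);          -- 1) only one mover at a time

BEGIN ISOLATION LEVEL REPEATABLE READ;   -- 2) one snapshot for steps 3 and 4
INSERT INTO logtable (payload)           -- 3) copy the staged rows; the log's
    SELECT payload FROM log_staging      --    id is assigned here, serially
    ORDER BY entered;
DELETE FROM log_staging;                 -- 4) same snapshot, so same rows
COMMIT;                                  -- 5)

SELECT pg_advisory_unlock(12345);        -- 6)

SELECT * FROM logtable                   -- 7) handle rows not yet seen; 0
WHERE id > 0                             --    stands in for this reader's
ORDER BY id;                             --    remembered high-water mark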
David Wilson wrote:
> Each reader process therefore has a loop that appears as follows:
>
> 1) Obtain advisory lock.
> 2) Begin transaction.
> 3) For each row in staging table, insert copy into log table.
> 4) Delete all visible rows from staging table.

Ah, yes, all visible rows...

My point is that, unless you use a transaction with serializable
isolation, the set of visible rows for the second statement might be
different from the set you copied into the log table. With the normal
Read Committed isolation level you suffer from a possible nonrepeatable
read that might change tuple visibility between the two statements.

[...]
> 2) The transaction block ensures that the reader will see a consistent
>    state on the staging table while writers may write at the same time;
>    writes that occur during the reader's transaction block will simply
>    be ignored during this round of reading.

See above: you are demanding that a nonrepeatable read cannot occur.

> You need both types of synchronization to avoid problems -- taking an
> exclusive lock would simply be the sledgehammer method of doing the
> synchronization, since it would take the place of both the advisory lock
> and the transaction at the same time, but would also block writers.

I agree with you on this, but it does guarantee the impossibility of a
nonrepeatable read, at the cost of concurrency. There seems to be a
better solution indeed.

- Joris
On Tue, Apr 22, 2008 at 2:48 PM, Joris Dobbelsteen
<joris@familiedobbelsteen.nl> wrote:
> Ah, yes, all visible rows...
>
> My point is that, unless you use a transaction with serializable
> isolation, the set of visible rows for the second statement might be
> different from the set you copied into the log table.

That depends on implementation. A select into ... to do the initial
copy, followed by a delete where ... with the where clause referencing
the log table itself to ensure that we delete only things that now exist
in the log table, or a row-by-row insert/delete pair -- either would
provide the appropriate level of protection from accidental deletion of
more things than you intended, without harming concurrency.

The delete referencing the log table might require that the log table be
indexed for performance, but it's likely that such indexing would be
done anyway for general log use.

--
- David T. Wilson
david.t.wilson@gmail.com
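A sketch of the delete-referencing-the-log variant under plain Read
Committed; the staging_id and source_id columns, and the index, are
assumptions added so the DELETE can identify exactly the rows that made
it across:

-- One-time setup (hypothetical columns): staging rows carry their own
-- key, the log keeps a copy of it, and an index keeps the join cheap.
CREATE SEQUENCE log_staging_id_seq;
ALTER TABLE log_staging
    ADD COLUMN staging_id bigint NOT NULL
    DEFAULT nextval('log_staging_id_seq');
ALTER TABLE logtable ADD COLUMN source_id bigint;
CREATE INDEX logtable_source_id_idx ON logtable (source_id);

BEGIN;
INSERT INTO logtable (source_id, payload)
    SELECT staging_id, payload FROM log_staging ORDER BY staging_id;

-- Remove only rows that now provably exist in the log; anything
-- committed to staging after the INSERT's snapshot simply waits for
-- the next pass.
DELETE FROM log_staging s
    USING logtable l
    WHERE l.source_id = s.staging_id;
COMMIT;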
On Wed, Apr 23, 2008 at 12:29 AM, David Wilson
<david.t.wilson@gmail.com> wrote:
> That depends on implementation. A select into ... to do the initial
> copy, followed by a delete where ... with the where clause referencing
> the log table itself to ensure that we delete only things that now exist
> in the log table, or a row-by-row insert/delete pair -- either would
> provide the appropriate level of protection from accidental deletion of
> more things than you intended, without harming concurrency.
[...]
I think this plpgsql function would solve the problem of an atomic read-and-delete operation...
create or replace function log_rotate() returns void as $$
declare
    rec record;
begin
    -- Move every row from the staging table t1 into the log table t2,
    -- deleting and inserting row by row within a single function call.
    for rec in delete from t1 returning * loop
        insert into t2 values( rec.a, rec.b );
    end loop;
end;
$$ language 'plpgsql';
select log_rotate();
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
Mail sent from my BlackLaptop device
Gurjeet Singh wrote:
> On Wed, Apr 23, 2008 at 12:29 AM, David Wilson
> <david.t.wilson@gmail.com> wrote:
>> That depends on implementation. A select into ... to do the initial
>> copy, followed by a delete where ... with the where clause referencing
>> the log table itself to ensure that we delete only things that now
>> exist in the log table, or a row-by-row insert/delete pair.
[...]

Of course. The point is that this is another way to define "visibility"
in this context: present in the log table. A suitable definition is
needed.

> I think this plpgsql function would solve the problem of an atomic
> read-and-delete operation...
>
> create or replace function log_rotate() returns void as $$
> declare
>     rec record;
> begin
>     for rec in delete from t1 returning * loop
>         insert into t2 values( rec.a, rec.b );
>     end loop;
> end;
> $$ language 'plpgsql';
>
> select log_rotate();

Don't forget ordering; this was important before...

START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT ... INTO log FROM staging ORDER BY ...;
DELETE FROM staging;
COMMIT;

Don't know if that ORDER BY works. It should in this case.

- Joris
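For what it's worth, a hedged cleanup of that closing sketch: SELECT ...
INTO would create a brand-new table, so INSERT ... SELECT is presumably
what is meant, and it is the ORDER BY on the feeding SELECT that makes
the ids in the log come out in the chosen order. The payload and entered
columns are stand-ins for the real ones:

START TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Rows are inserted, and their id defaults evaluated, in the order the
-- SELECT produces them, so the ORDER BY controls the final id order.
INSERT INTO log (payload)
    SELECT payload FROM staging ORDER BY entered;

-- Same snapshot as the statement above, so this deletes exactly those rows.
DELETE FROM staging;

COMMIT;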