Thread: Clustering features for upcoming developer meeting -- please claim yours!
CLH,

The pgCon developer meeting is coming up next week. We have a tentative agenda item to discuss clustering features, but we need to know specifically *what* we are going to discuss.

As a reminder, the list of features is here:

http://wiki.postgresql.org/wiki/ClusterFeatures

Of these, the following seem well-defined enough to be worth talking about. The questions to answer are:
(a) which features actually have someone on THIS list who plans to work on them?
(b) will that person be at pgCon?

Please claim features which you are ready to talk about, ASAP. Thanks!

* Export snapshots to other sessions - 11
* Global deadlock detection - 9
* API into the Parser / Parser as an independent module - 9
* Start/stop archiving at runtime - 8
* XID feed - 4 (included because XC seems to have written this)

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
Hi,

I'm attending the meeting and working on snapshot exporting and the XID feed. I can talk about these features.

Regards,
----------
Koichi Suzuki

2010/5/7 Josh Berkus <josh@agliodbs.com>:
> CLH,
>
> The pgCon developer meeting is coming up next week. We have a tentative
> agenda item to discuss clustering features, but we need to know
> specifically *what* we are going to discuss.
>
> As a reminder, the list of features is here:
>
> http://wiki.postgresql.org/wiki/ClusterFeatures
>
> Of these, the following seem well-defined enough to be worth talking
> about. The questions to answer are:
> (a) which features actually have someone on THIS list who plans to work
> on them?
> (b) will that person be at pgCon?
>
> Please claim features which you are ready to talk about, ASAP. Thanks!
>
> * Export snapshots to other sessions - 11
> * Global deadlock detection - 9
> * API into the Parser / Parser as an independent module - 9
> * Start/stop archiving at runtime - 8
> * XID feed - 4 (included because XC seems to have written this)
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 05/06/2010 10:36 PM, Koichi Suzuki wrote:
> Hi,
>
> I'm attending the meeting and working on snapshot exporting and the
> XID feed. I can talk about these features.

Can you expand the description of the XID feed on the wiki page?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
Okay, I will. Are there any particular issues you'd like it to cover?
----------
Koichi Suzuki

2010/5/8 Josh Berkus <josh@agliodbs.com>:
> On 05/06/2010 10:36 PM, Koichi Suzuki wrote:
>>
>> Hi,
>>
>> I'm attending the meeting and working on snapshot exporting and the
>> XID feed. I can talk about these features.
>
> Can you expand the description of the XID feed on the wiki page?
>
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
Josh Berkus <josh@agliodbs.com> wrote:
> The pgCon developer meeting is coming up next week. We have a tentative
> agenda item to discuss clustering features, but we need to know
> specifically *what* we are going to discuss.

I'd like to discuss "Function scan push-down" and "Modification trigger into core" from the list. I wrote additional descriptions for them on the wiki and will use them at the meeting. Comments and adjustments to the topics are welcome.

* SQL/MED for WHERE-clause push-down
  http://wiki.postgresql.org/wiki/SQL/MED#FDW_routines
* General Modification Trigger and Generalized Data Queue Design
  http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Design

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
On 5/6/2010 7:42 PM, Josh Berkus wrote: > CLH, > > The pgCon developer meeting is coming up next week. We have a tentative > agenda item to discuss clustering features, but we need to know > specifically *what* we are going to discuss. > > As a reminder, the list of features is here: > > http://wiki.postgresql.org/wiki/ClusterFeatures > > Of these, the following seem well-defined enough to be worth talkign > about. The question to answer is, > (a) which features actually have someone on THIS list who plans to work > on them? > (b) will that person be at pgCon? > > Please claim features which you are ready to talk about, ASAP. Thanks! > > * Export snapshots to other sessions - 11 > * Global deadlock detection - 9 > * API into the Parser / Parser as an independent module - 9 > * Start/stop archiving at runtime - 8 > * XID feed - 4 (included because XC seems to have written this) > Aside from that list, I'd like to get into a little more detail on DDL triggers. This seems to be something I could actually work on in the future. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: Clustering features for upcoming developer meeting -- please claim yours!
Jan, > Aside from that list, I'd like to get into a little more detail on DDL > triggers. This seems to be something I could actually work on in the > future. Is this the same thing as the general modification trigger? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 5/10/2010 1:39 PM, Josh Berkus wrote:
> Jan,
>
>> Aside from that list, I'd like to get into a little more detail on DDL
>> triggers. This seems to be something I could actually work on in the
>> future.
>
> Is this the same thing as the general modification trigger?

To my understanding, the general modification triggers are meant to unify the "data" queue mechanisms that both Londiste and Slony are based on under one new, built-in mechanism, with the intention of cutting down the overhead associated with them.

There is certainly a big need to coordinate this project with any attempts made in the direction of DDL triggers. I think it is obvious that I would later on like to make use of them within Slony to replicate schema changes. This of course requires that such schema changes get applied on the replicas at the correct place inside the data stream. For example, if you "ALTER TABLE ADD COLUMN", you want to replicate all DML changes that happened before that ALTER TABLE grabbed its exclusive lock before the ALTER TABLE itself. And it would be quite disastrous to attempt to apply any INSERT that happened on the master with that new column before the ALTER TABLE happened on the replica.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
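To make the ordering requirement above concrete, here is a minimal SQL sketch of the scenario being described; the table and column names are made up:

    -- Statement order on the master (one session, one transaction):
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (1, 100);       -- DML using the old schema
    UPDATE accounts SET balance = 150 WHERE id = 1;
    ALTER TABLE accounts ADD COLUMN currency text;             -- grabs its exclusive lock here
    INSERT INTO accounts (id, balance, currency)                -- DML that already needs the new column
        VALUES (2, 200, 'EUR');
    COMMIT;

    -- On the replica the same order must be preserved: replaying the second
    -- INSERT before the ALTER TABLE would fail outright, because column
    -- "currency" does not exist there yet.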
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 5/10/10, Jan Wieck <JanWieck@yahoo.com> wrote: > On 5/10/2010 1:39 PM, Josh Berkus wrote: > > > Aside from that list, I'd like to get into a little more detail on DDL > > > triggers. This seems to be something I could actually work on in the > > > future. > > > > > > > Is this the same thing as the general modification trigger? > > > > To my understanding, the general modification triggers are meant to unify > the "data" queue mechanisms, both Londiste and Slony are based on, under one > new, built in mechanism with the intention to cut down the overhead > associated with them. > > There is certainly a big need to coordinate this project with any attempts > made in the direction of DDL triggers. I think it is obvious that I would > later on like to make use of them within Slony to replicate schema changes. > This of course requires that such schema changes get applied on the > replica's at the correct place inside the data stream. For example, if you > "ALTER TABLE ADD COLUMN", you want to replicate all DML changes, that > happened before that ALTER TABLE grabbed its exclusive lock, before that > ALTER TABLE itself. And it would be quite disastrous to attempt to apply any > INSERT that happened on the master with that new column before the ALTER > TABLE happened on the replica. AFAICS the "agreeable order" should take care of positioning: http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation This combined with DML triggers that react to invalidate events (like PgQ ones) should already work fine? Are there situations where such setup fails? -- marko
On 5/10/2010 4:25 PM, Marko Kreen wrote:
> AFAICS the "agreeable order" should take care of positioning:
>
> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>
> This combined with DML triggers that react to invalidate events (like
> PgQ ones) should already work fine?
>
> Are there situations where such setup fails?

That explanation of an agreeable order only solves the problem of placing the DDL into the replication stream between transactions, possibly done by multiple clients.

It in no way addresses the problem of one single client that executes a couple of updates, modifies the object, then continues with updates. In that case, there isn't even a transaction boundary at which the DDL happened on the master. And this one transaction could indeed alter the object several times.

This means that a generalized data queue needs to have hooks, so that DDL triggers can inject their payload into it.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 5/10/2010 4:25 PM, Marko Kreen wrote:
>> AFAICS the "agreeable order" should take care of positioning:
>>
>> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>>
>> This combined with DML triggers that react to invalidate events (like
>> PgQ ones) should already work fine?
>>
>> Are there situations where such setup fails?
>
> That explanation of an agreeable order only solves the problem of placing
> the DDL into the replication stream between transactions, possibly done
> by multiple clients.
>
> It in no way addresses the problem of one single client that executes a
> couple of updates, modifies the object, then continues with updates. In
> that case, there isn't even a transaction boundary at which the DDL
> happened on the master. And this one transaction could indeed alter the
> object several times.

But the event order would be strictly defined by the sequence id? And the local, invalidation-aware triggers would always see up-to-date state, no? And it would be applied as a single TX on the subscriber too. Where's the problem?

> This means that a generalized data queue needs to have hooks, so that DDL
> triggers can inject their payload into it.

If you mean "hooks" as the pgq.insert_event() function, then yes... I hope you are designing a generally usable queue with the GDQ.

Btw, speaking of DDL triggers, as long as we don't have them I'm assuming all replicated DDL would be applied as:

1) Execute DDL statement
2) Insert statement into queue in same tx.

So I'm assuming the DDL trigger would simply perform step 2) automatically. Perhaps you are thinking of some other sort of DDL triggers?

-- 
marko
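As a rough illustration of the two-step approach described above (execute the DDL, then enqueue it in the same transaction), assuming a PgQ installation is present; the queue name and the event-type tag below are made up:

    BEGIN;

    -- 1) Execute the DDL statement on the master.
    ALTER TABLE accounts ADD COLUMN currency text;

    -- 2) Record the same statement in the queue, inside the same transaction,
    --    so it is ordered correctly relative to the surrounding DML events.
    SELECT pgq.insert_event(
        'replika',                                            -- queue name (made up)
        'EXECUTE',                                            -- application-chosen event type
        'ALTER TABLE accounts ADD COLUMN currency text');     -- payload to replay on the subscriber

    COMMIT;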
Re: Clustering features for upcoming developer meeting -- please claim yours!
On Mon, 2010-05-10 at 17:04 -0400, Jan Wieck wrote:
> On 5/10/2010 4:25 PM, Marko Kreen wrote:
>> AFAICS the "agreeable order" should take care of positioning:
>>
>> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>>
>> This combined with DML triggers that react to invalidate events (like
>> PgQ ones) should already work fine?
>>
>> Are there situations where such setup fails?
>
> That explanation of an agreeable order only solves the problem of
> placing the DDL into the replication stream between transactions,
> possibly done by multiple clients.

Why only "between transactions" (whatever that means)?

If all transactions get their event ids from the same non-cached sequence, then the event id _is_ a reliable ordering within a set of concurrent transactions.

Event ids get serialized (where it matters) by the very locks taken by the DDL/DML statements on the objects they manipulate.

Once more, for this to work over more than one backend, the sequence providing the event ids needs to be non-cached.

> It in no way addresses the problem of one single client that executes a
> couple of updates, modifies the object, then continues with updates. In
> that case, there isn't even a transaction boundary at which the DDL
> happened on the master. And this one transaction could indeed alter the
> object several times.

How is DDL here different from DML? You need to replay DML in the right order too, no?

> This means that a generalized data queue needs to have hooks, so that
> DDL triggers can inject their payload into it.

Anything that needs to be replicated needs "to have hooks" in the generalized data queue, so that

a) they get replicated in the right order for each affected object
   a.1) this can be relaxed for related objects in case FKs are
        disabled or deferred until transaction end
b) they get committed on the subscriber side at transaction (set)
   boundaries of the provider.

If you implement the data queue as something non-transactional (non-pgQ-like), then you need to replicate (i.e. copy over and replay)

c) events from both committed and rolled-back transactions
d) commits/rollbacks themselves
e) and apply and/or roll back each individual transaction separately

IOW you mostly re-implement WAL, except at the logical level. Which may or may not be a good thing, depending on other requirements of the system.

If you do it using pgQ you save on not copying rolled-back data, but you do slightly more work on the provider side. You also end up with not having dead tuples from aborted transactions on the subscriber.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
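A minimal sketch of the capture side described above — a non-cached sequence plus a row trigger. This is not Slony's or PgQ's actual schema; all names and the payload format are made up:

    -- Strictly monotonic event ids: CACHE 1 prevents backends from handing
    -- out pre-fetched blocks of ids in an order unrelated to lock order.
    CREATE SEQUENCE gdq_ev_id_seq CACHE 1;

    CREATE TABLE gdq_log (
        ev_id    bigint NOT NULL DEFAULT nextval('gdq_ev_id_seq'),
        ev_txid  bigint NOT NULL DEFAULT txid_current(),
        ev_table text   NOT NULL,
        ev_op    text   NOT NULL,          -- 'INSERT', 'UPDATE' or 'DELETE'
        ev_data  text
    );

    CREATE OR REPLACE FUNCTION gdq_capture() RETURNS trigger AS $$
    BEGIN
        -- nextval() fires here, after the DML statement has already taken
        -- its locks, so conflicting changes get their ids in lock order.
        IF TG_OP = 'DELETE' THEN
            INSERT INTO gdq_log (ev_table, ev_op, ev_data)
                VALUES (TG_TABLE_NAME, TG_OP, OLD::text);
            RETURN OLD;
        ELSE
            INSERT INTO gdq_log (ev_table, ev_op, ev_data)
                VALUES (TG_TABLE_NAME, TG_OP, NEW::text);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_gdq AFTER INSERT OR UPDATE OR DELETE ON accounts
        FOR EACH ROW EXECUTE PROCEDURE gdq_capture();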
Re: Clustering features for upcoming developer meeting -- please claim yours!
On Tue, 2010-05-11 at 01:08 +0300, Marko Kreen wrote:
> On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
>> This means that a generalized data queue needs to have hooks, so that DDL
>> triggers can inject their payload into it.
>
> If you mean "hooks" as the pgq.insert_event() function, then yes...
> I hope you are designing a generally usable queue with the GDQ.

The only way to have a generally usable queue different from pgQ is having something that copies all events off the server (to either the final subscriber or some forwarding/processing station) and leaves the commit/abort resolution to "the other server". The event id should still provide a usable order for applying these events.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
On 5/10/2010 6:40 PM, Hannu Krosing wrote: > On Mon, 2010-05-10 at 17:04 -0400, Jan Wieck wrote: >> On 5/10/2010 4:25 PM, Marko Kreen wrote: >> > AFAICS the "agreeable order" should take care of positioning: >> > >> > http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation >> > >> > This combined with DML triggers that react to invalidate events (like >> > PgQ ones) should already work fine? >> > >> > Are there situations where such setup fails? >> > >> >> That explanation of an agreeable order only solves the problems of >> placing the DDL into the replication stream between transactions, >> possibly done by multiple clients. > > Why only "between transactions" (whatever that means) ? > > If all transactions get their event ids from the same non-cached > sequence, then the event id _is_ a reliable ordering within a set of > concurrent transactions. > > Event id's get serialized (wher it matters) by the very locks taken by > DDL/DML statments on the objects they manipulate. > > Once more, for this to work over more than one backend, the sequence > providing the event id's needs to be non-cached. > >> It does in no way address the problem of one single client executing a >> couple of updates, modifies the object, then continues with updates. In >> this case, there isn't even a transaction boundary at which the DDL >> happened on the master. And this one transaction could indeed alter the >> object several times. > > How is DDL here different from DML herev? > > You need to replay DML in the right order too, no ? > >> This means that a generalized data queue needs to have hooks, so that >> DDL triggers can inject their payload into it. > > Anything that needs to be replicated, needs "to have hooks" in the > generalized data queue, so that > > a) they get replicated in the right order for each affected object > a.1) this can be relaxed for related objects in case FK-s are > disabled of deferred until transaction end > b) they get committed on the subscriber side at transaction (set) > boundaries of provider. > > if you implement the data queue as something non-transactional (non > pgQ-like), then you need to replicate (i,e. copy over and replay > > c) events from both committed and rollbacked transaction > d) commits/rollbacks themselves > e) and apply and/or rollback each individual transaction separately > > IOW you mostly re-implement WAL, except at logical level. Which may or > may not be a good thing, depending on other requirements of the system. > > If you do it using pgQ you save on not copying rollbacked data, but you > do slightly more work on provider side. You also end up with not having > dead tuples from aborted transactions on subscriber. > So the idea is to have one queue that captures row level DML events as well as statement level DDL. That is certainly possible and in that case the event id will indeed provide a usable order for applying these actions, if it is taken from a non-cached sequence after all locks have been taken, as Marko explained. That event id resembles Slony's action_seq. The thing this event id alone does not provide is any point where inside that sequence of event id's the replica can issue commits. On a busy server, there may never be any such moment unless the replica applies things the Slony way instead of in monotonically increasing event id's. 
If your idea is to simply record things WAL-style and shove them off to the replicas, you just move some of the current overhead from the master by duplicating it onto every replica.

There are more things to consider about such a generalized queue, especially if we think about adding it to core.

One, for example, is version independence. Slony, and I think Londiste too, can replicate across PostgreSQL server versions. And experience shows us that no communications protocol, on-disk format or the like is ever set in stone. So we need to think about how this queue can become backwards compatible without introducing more overhead than we are trying to save right now.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 5/10/2010 6:40 PM, Hannu Krosing wrote:
>> On Mon, 2010-05-10 at 17:04 -0400, Jan Wieck wrote:
>>> On 5/10/2010 4:25 PM, Marko Kreen wrote:
>>>> AFAICS the "agreeable order" should take care of positioning:
>>>> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>>>> This combined with DML triggers that react to invalidate events (like
>>>> PgQ ones) should already work fine?
>>>> Are there situations where such setup fails?
>
> So the idea is to have one queue that captures row level DML events as
> well as statement level DDL. That is certainly possible and in that case
> the event id will indeed provide a usable order for applying these
> actions, if it is taken from a non-cached sequence after all locks have
> been taken, as Marko explained.
>
> That event id resembles Slony's action_seq.
>
> The thing this event id alone does not provide is any point where inside
> that sequence of event ids the replica can issue commits. On a busy
> server, there may never be any such moment unless the replica applies
> things the Slony way instead of in monotonically increasing event ids.
> If your idea is to simply record things WAL-style and shove them off to
> the replicas, you just move some of the current overhead from the master
> by duplicating it onto every replica.

I'm not sure what overhead you are talking about.

Are you trying to get rid of the current snapshot-based grouping of events? Why?

> There are more things to consider about such a generalized queue,
> especially if we think about adding it to core.
>
> One, for example, is version independence. Slony, and I think Londiste
> too, can replicate across PostgreSQL server versions. And experience
> shows us that no communications protocol, on-disk format or the like is
> ever set in stone. So we need to think about how this queue can become
> backwards compatible without introducing more overhead than we are
> trying to save right now.

I'm guessing you are trying to do 2 more things:

1) Add queue operations to SQL syntax
2) Non-table custom storage.

I'm indifferent to 1) and dubious about how big a win 2) can bring, but glad to be proven wrong.

But there's another issue - our experience with PgQ has shown that a generic queue also means generic code operating on it, which means bugs. And transactional queue readers are not allowed to drop events on problems. Which means that on problems, admins need to examine the queue and delete/modify the events.

Of course, the bug causing the problem also needs to be fixed, but bugfixing does not repair the queue; that must be done manually.

If 1) and/or 2) means such a possibility is removed, it will be quite a big hit to the generic-ness of the GDQ.

In that aspect I would prefer to fix any remaining problems (what are they?) with plain queue tables, even if the "NoSQL" queueing could perform significantly better.

-- 
marko
GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
I changed the subject line because we are diving deep into implementation details.

On 5/11/2010 5:24 AM, Marko Kreen wrote:
> On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
>> The thing this event id alone does not provide is any point where inside
>> that sequence of event ids the replica can issue commits. On a busy
>> server, there may never be any such moment unless the replica applies
>> things the Slony way instead of in monotonically increasing event ids.
>> If your idea is to simply record things WAL-style and shove them off to
>> the replicas, you just move some of the current overhead from the master
>> by duplicating it onto every replica.
>
> I'm not sure what overhead you are talking about.
>
> Are you trying to get rid of the current snapshot-based grouping
> of events? Why?

The problem statement on the wiki page and Itagaki's comments about non-table storage of the queue made it look to me as if some WAL-style flat-file approach was being aimed for.

I am glad that we agree that we cannot get rid of the snapshot-based grouping. That and the IMHO required table storage are the overhead I was talking about. We should be clear that we cannot get rid of that grouping and that, however many log segments are used (Slony currently 2, Londiste default 3), the oldest running transaction on the master determines which log segments can get truncated. The more log segments there are in use, the more UNION keywords may appear in the query selecting from the log.

>> There are more things to consider about such a generalized queue,
>> especially if we think about adding it to core.
>>
>> One, for example, is version independence. Slony, and I think Londiste
>> too, can replicate across PostgreSQL server versions. And experience
>> shows us that no communications protocol, on-disk format or the like is
>> ever set in stone. So we need to think about how this queue can become
>> backwards compatible without introducing more overhead than we are
>> trying to save right now.
>
> I'm guessing you are trying to do 2 more things:
>
> 1) Add queue operations to SQL syntax
> 2) Non-table custom storage.

No. I don't know how you read 1) into the above, and 2) was my misunderstanding reading the wiki. I don't want either.

> But there's another issue - our experience with PgQ has shown
> that a generic queue also means generic code operating on it,
> which means bugs. And transactional queue readers are not
> allowed to drop events on problems. Which means that on problems,
> admins need to examine the queue and delete/modify the events.
>
> Of course, the bug causing the problem also needs to be fixed,
> but bugfixing does not repair the queue; that must be done
> manually.
>
> If 1) and/or 2) means such a possibility is removed,
> it will be quite a big hit to the generic-ness of the GDQ.
>
> In that aspect I would prefer to fix any remaining problems
> (what are they?) with plain queue tables, even if
> the "NoSQL" queueing could perform significantly better.

A generic queue implementation needs to come with some advantage over what we have now. Otherwise there is no incentive for any of the existing systems to even consider switching to it.

What are the advantages of anything proposed over the current implementations used by Londiste and Slony?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
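Roughly what the snapshot-based grouping and the UNION over two log segments look like in SQL — a sketch against the hypothetical gdq_log layout shown earlier, not the actual Slony or Londiste queries:

    -- A batch is everything that became visible between two ticks; each tick
    -- stores a txid_snapshot.  $1 = snapshot of the previous tick,
    -- $2 = snapshot of the current tick.
    PREPARE get_batch (txid_snapshot, txid_snapshot) AS
    SELECT ev_id, ev_txid, ev_table, ev_op, ev_data
      FROM (SELECT * FROM gdq_log_1
            UNION ALL
            SELECT * FROM gdq_log_2) AS log
     WHERE NOT txid_visible_in_snapshot(ev_txid, $1)   -- not yet visible at the previous tick
       AND txid_visible_in_snapshot(ev_txid, $2)       -- but visible at the current tick
     ORDER BY ev_id;

The batch boundaries are what give the replica safe points at which to COMMIT, which a bare stream of event ids alone does not provide.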
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
> What are the advantages of anything proposed over the current
> implementations used by Londiste and Slony?

It would be good to have a core technology that provided a generic transport to other remote databases.

We already have WALSender and WALReceiver, which use the COPY protocol as a transport mechanism. It would be easy to extend that so we could send other forms of data.

We can do that in two ways:

* Alter triggers so that Slony/Londiste write directly to WAL rather than to log tables, using a new WAL record for custom data blobs.

* Alter WALSender so it can read Slony/Londiste log tables for consumption by an architecture similar to WALReceiver/Startup. Probably easier.

We can also alter the WAL format itself to include the information in WAL that is required to do what Slony/Londiste already do, so we don't need to specifically write anything at all, just read WAL at the other end. Even more efficient.

The advantages of these options would be

* integration of core technologies
* greater efficiency for trigger-based logging via WAL

In other RDBMSs "replication" has long meant "data transport, either for HA or application use". We should be looking beyond the pure HA aspects, as pgq does.

I would certainly like to see a system that wrote data on the master and then constructed the SQL on the receiver side (i.e. on the slave), so the integration was less tight. That would allow data to be sent and consumed for a variety of purposes, not just HA replay.

--
Simon Riggs
www.2ndQuadrant.com
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> I changed the subject line because we are diving deep into implementation
> details.
>
> On 5/11/2010 5:24 AM, Marko Kreen wrote:
>> On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
>>> The thing this event id alone does not provide is any point where
>>> inside that sequence of event ids the replica can issue commits. On a
>>> busy server, there may never be any such moment unless the replica
>>> applies things the Slony way instead of in monotonically increasing
>>> event ids. If your idea is to simply record things WAL-style and shove
>>> them off to the replicas, you just move some of the current overhead
>>> from the master by duplicating it onto every replica.
>>
>> I'm not sure what overhead you are talking about.
>>
>> Are you trying to get rid of the current snapshot-based grouping
>> of events? Why?
>
> The problem statement on the wiki page and Itagaki's comments about
> non-table storage of the queue made it look to me as if some WAL-style
> flat-file approach was being aimed for.
>
> I am glad that we agree that we cannot get rid of the snapshot-based
> grouping. That and the IMHO required table storage are the overhead I was
> talking about. We should be clear that we cannot get rid of that grouping
> and that, however many log segments are used (Slony currently 2, Londiste
> default 3), the oldest running transaction on the master determines which
> log segments can get truncated. The more log segments there are in use,
> the more UNION keywords may appear in the query selecting from the log.

Seems we are in agreement.

And although PgQ can operate with any N >= 2 segments, it queries 2 at a time, same as Slony. The rest are just there to give admins some safety room for "OH F*CK" moments. With short rotation times, that starts to seem useful.

There does not seem to be any advantage to querying more than 2 segments.

>>> There are more things to consider about such a generalized queue,
>>> especially if we think about adding it to core.
>>>
>>> One, for example, is version independence. Slony, and I think Londiste
>>> too, can replicate across PostgreSQL server versions. And experience
>>> shows us that no communications protocol, on-disk format or the like is
>>> ever set in stone. So we need to think about how this queue can become
>>> backwards compatible without introducing more overhead than we are
>>> trying to save right now.
>>
>> I'm guessing you are trying to do 2 more things:
>>
>> 1) Add queue operations to SQL syntax
>> 2) Non-table custom storage.
>
> No. I don't know how you read 1) into the above, and 2) was my
> misunderstanding reading the wiki. I don't want either.

Oh sorry, I got that impression from the wiki, not from you.

As there are some ideas from you on the wiki, I assumed you were involved, so I used "you" very liberally.

-- 
marko
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On 5/11/10, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
>> What are the advantages of anything proposed over the current
>> implementations used by Londiste and Slony?
>
> It would be good to have a core technology that provided a generic
> transport to other remote databases.

I suspect there still should be some sort of middleware code that reads the data from Postgres and writes it to the other db.

So the task of the GDQ should be to make data available to that reader, not to be "transport to remote databases", no?

And if we are talking about a "Generalized Data Queue", one important aspect is that it should be easy to write both queue readers and writers in whatever language the user wants. Which means it should be possible to do both reading and writing with ordinary SQL queries. Even requiring COPY is out, as it is not available in many adapters.

Of course it's OK to have such extensions available optionally. E.g. Londiste does event bulk insert with COPY. But it's not required for ordinary clients. And you can always turn SELECT output into COPY format.

> We already have WALSender and WALReceiver, which use the COPY protocol
> as a transport mechanism. It would be easy to extend that so we could
> send other forms of data.
>
> We can do that in two ways:
>
> * Alter triggers so that Slony/Londiste write directly to WAL rather
> than to log tables, using a new WAL record for custom data blobs.
>
> * Alter WALSender so it can read Slony/Londiste log tables for
> consumption by an architecture similar to WALReceiver/Startup. Probably
> easier.
>
> We can also alter the WAL format itself to include the information in
> WAL that is required to do what Slony/Londiste already do, so we don't
> need to specifically write anything at all, just read WAL at the other
> end. Even more efficient.

Hm. You'd need to tie WAL rotation to reader positions. And to read a largish batch from WAL, you also need to process unrelated data?

Reading from WAL is OK for full replication, but what about a smallish queue that gets only a small percentage of the overall traffic?

> The advantages of these options would be
>
> * integration of core technologies
> * greater efficiency for trigger-based logging via WAL
>
> In other RDBMSs "replication" has long meant "data transport, either for
> HA or application use". We should be looking beyond the pure HA aspects,
> as pgq does.
>
> I would certainly like to see a system that wrote data on the master and
> then constructed the SQL on the receiver side (i.e. on the slave), so the
> integration was less tight. That would allow data to be sent and consumed
> for a variety of purposes, not just HA replay.

pgq.logutriga()? It writes all columns, also NULL-ed ones, into the queue in urlencoded format. Londiste actually even knows how to generate SQL from those.

We use it for most non-replication queues, where we want to process the event more intelligently than simply executing it on some other connection.

-- 
marko
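For reference, attaching PgQ's generic logging trigger and consuming a batch over plain SQL look roughly like this. Queue, consumer and table names are made up, and the function names follow the stock PgQ API as commonly documented, so treat exact signatures as an approximation:

    -- Writer side: log every row change on "accounts" into queue "replika",
    -- with all columns urlencoded into the event payload.
    CREATE TRIGGER accounts_logutriga
        AFTER INSERT OR UPDATE OR DELETE ON accounts
        FOR EACH ROW EXECUTE PROCEDURE pgq.logutriga('replika');

    -- Reader side, plain SQL only (<batch_id> stands for the id returned
    -- by next_batch):
    SELECT pgq.register_consumer('replika', 'my_consumer');   -- once, at setup
    SELECT pgq.next_batch('replika', 'my_consumer');          -- NULL means "nothing new yet"
    SELECT * FROM pgq.get_batch_events(<batch_id>);           -- process the events ...
    SELECT pgq.finish_batch(<batch_id>);                      -- ... then acknowledge the batch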
On 5/11/2010 9:36 AM, Marko Kreen wrote:
> Seems we are in agreement.

That's always a good point to start from.

> And although PgQ can operate with any N >= 2 segments, it queries
> 2 at a time, same as Slony. The rest are just there to give admins
> some safety room for "OH F*CK" moments. With short rotation times,
> that starts to seem useful.

Agreed. The rotation time should actually reflect the longest-running transactions frequently experienced by the application. And there needs to be a safeguard against rotating over even longer running transactions.

The problem with a long running transaction is that it could have written into log segment 1 before we switched to segment 2. We can only TRUNCATE segment 1 after that transaction committed AND the log has been consumed by everyone interested in it.

I am not familiar with how PgQ/Londiste do this. Slony specifically remembers the highest XID in progress at the time of switching, waits until the lowest XID in progress is higher than that (so all log that ever went into that segment is now visible or aborted), then waits for all log in that segment to be confirmed, and finally truncates the log. All this time, it needs to do the UNION query over both log segments.

> There does not seem to be any advantage to querying more than 2 segments.

I didn't experiment with such an implementation yet. I'll theorize about that in a separate thread later.

>> No. I don't know how you read 1) into the above, and 2) was my
>> misunderstanding reading the wiki. I don't want either.
>
> Oh sorry, I got that impression from the wiki, not from you.
>
> As there are some ideas from you on the wiki, I assumed
> you were involved, so I used "you" very liberally.

No problem. I misinterpreted stuff there as "the currently favored idea" too.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On Tue, 2010-05-11 at 17:03 +0300, Marko Kreen wrote:
> On 5/11/10, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
>>> What are the advantages of anything proposed over the current
>>> implementations used by Londiste and Slony?
>>
>> It would be good to have a core technology that provided a generic
>> transport to other remote databases.
>
> I suspect there still should be some sort of middleware code
> that reads the data from Postgres and writes it to the other db.
>
> So the task of the GDQ should be to make data available to that
> reader, not to be "transport to remote databases", no?

Yes; for maximum flexibility, user code at both ends would be good.

--
Simon Riggs
www.2ndQuadrant.com
On 5/11/2010 9:19 AM, Simon Riggs wrote:
> On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
>> What are the advantages of anything proposed over the current
>> implementations used by Londiste and Slony?
>
> It would be good to have a core technology that provided a generic
> transport to other remote databases.
>
> We already have WALSender and WALReceiver, which use the COPY protocol
> as a transport mechanism. It would be easy to extend that so we could
> send other forms of data.
>
> We can do that in two ways:
>
> * Alter triggers so that Slony/Londiste write directly to WAL rather
> than to log tables, using a new WAL record for custom data blobs.

Londiste and Slony "consume" the log data in a different order than it appears in the WAL. Using WAL would mean moving a lot of complexity, which is currently handled by an MVCC-style grouping, from the log origin to the log consumers.

> * Alter WALSender so it can read Slony/Londiste log tables for
> consumption by an architecture similar to WALReceiver/Startup. Probably
> easier.

Only if that alteration also means being able to

1) hand WALSender the from and to snapshots
2) have WALSender send the UNION of multiple log tables ordered by the
   event/action ID

Because that is how both Londiste and Slony consume the log.

> We can also alter the WAL format itself to include the information in
> WAL that is required to do what Slony/Londiste already do, so we don't
> need to specifically write anything at all, just read WAL at the other
> end. Even more efficient.
>
> The advantages of these options would be
>
> * integration of core technologies
> * greater efficiency for trigger-based logging via WAL

I'm still unclear how we can ensure cross-version functionality when using such core technology. Are you implying that a 9.3 WALReceiver will always be able to consume the data format sent by a 9.1 WALSender?

> In other RDBMSs "replication" has long meant "data transport, either for
> HA or application use". We should be looking beyond the pure HA aspects,
> as pgq does.

Slony replication has meant both too from the beginning.

> I would certainly like to see a system that wrote data on the master and
> then constructed the SQL on the receiver side (i.e. on the slave), so the
> integration was less tight. That would allow data to be sent and consumed
> for a variety of purposes, not just HA replay.

Slony does exactly that constructing of SQL on the receiver side, and it is a big drawback, because every single row update needs to go through a separate SQL query that is parsed, planned and optimized.

I can envision a generic function that takes the data format recorded by the capture trigger on the master and turns it into a simple plan. All these single-row updates/deletes are PK based, so there is no need to even think about parsing and planning them over and over. Just replace the target list to reflect whatever this log row updates and execute it. The values will always be a literal value from the log or the OLD value for untouched fields. Simple enough.

The big advantage of such generic support would be that systems like Londiste/Slony could use the existing COPY-SELECT mechanism to transport the log in a streaming protocol, while a BEFORE INSERT trigger on the receiver's log segments turns it into highly efficient single-row operations.

This generic single-row change capture and single-row update support would allow Londiste/Slony-type replication systems to eliminate most round-trip-based latency, a lot of CPU usage on the replicas, plus all the libpq and SQL query assembly in the replication engine itself.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
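A very rough sketch of the receiver-side idea above, specialized to one hypothetical table; a real implementation would be generic, and the log-row layout here is invented. The streamed log is COPYed into a replica-side table whose BEFORE INSERT trigger applies each row as a PK-based operation, using static statements whose plans PL/pgSQL caches:

    CREATE OR REPLACE FUNCTION apply_accounts_log() RETURNS trigger AS $$
    BEGIN
        IF NEW.ev_op = 'INSERT' THEN
            INSERT INTO accounts (id, balance) VALUES (NEW.id, NEW.balance);
        ELSIF NEW.ev_op = 'UPDATE' THEN
            UPDATE accounts SET balance = NEW.balance WHERE id = NEW.id;
        ELSE
            DELETE FROM accounts WHERE id = NEW.id;
        END IF;
        RETURN NULL;   -- swallow the log row; nothing is stored on the replica
    END;
    $$ LANGUAGE plpgsql;

    -- accounts_log_in is the (made-up) table the streamed log is COPYed into.
    CREATE TRIGGER accounts_log_apply BEFORE INSERT ON accounts_log_in
        FOR EACH ROW EXECUTE PROCEDURE apply_accounts_log();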
On Tue, 2010-05-11 at 10:38 -0400, Jan Wieck wrote: > Slony replication has meant both too from the beginning. You've done a brilliant job and I have huge respect for that. MHO: The world changes and new solutions emerge. Assimilation of technology into lower layers of the stack has been happening for years. The core parts of Slony should be assimilated, just as TCP/IP now exists as part of the OS, to the benefit of all. Various parts of Slony have already moved to core. Slony continues to have huge potential, though as part of an evolution, not in all cases fulfilling the same role it did at the beginning. Log shipping cannot easily exist outside of core, though SQL shipping can: but should it? How much more could we do? -- Simon Riggs www.2ndQuadrant.com
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 5/11/2010 9:36 AM, Marko Kreen wrote:
>> And although PgQ can operate with any N >= 2 segments, it queries
>> 2 at a time, same as Slony. The rest are just there to give admins
>> some safety room for "OH F*CK" moments. With short rotation times,
>> that starts to seem useful.
>
> Agreed. The rotation time should actually reflect the longest-running
> transactions frequently experienced by the application. And there
> needs to be a safeguard against rotating over even longer running
> transactions.

Nightly pg_dump.. ;)

> The problem with a long running transaction is that it could have written
> into log segment 1 before we switched to segment 2. We can only TRUNCATE
> segment 1 after that transaction committed AND the log has been consumed
> by everyone interested in it.
>
> I am not familiar with how PgQ/Londiste do this. Slony specifically
> remembers the highest XID in progress at the time of switching, waits
> until the lowest XID in progress is higher than that (so all log that
> ever went into that segment is now visible or aborted), then waits for
> all log in that segment to be confirmed, and finally truncates the log.
> All this time, it needs to do the UNION query over both log segments.

The "highest XID" here actually means "own transaction"? And it's not committed yet? That seems to leave transactions that happen before its own commit in a dubious state?

Although you may be fine if you don't try to minimize reading both tables.

PgQ does this:

Rotate:
1) If some consumer reads the older table, don't rotate.
2) Set table_nr++, switch_step1 = txid_current(), switch_step2 = NULL
3) Commit
4) Set switch_step2 = txid_current() where switch_step2 IS NULL
5) Commit

Reader:
1) xmin1 = xmin of lower snapshot of batch
2) xmax2 = xmax of higher snapshot of batch
3) if xmax2 < switch_step1, read older table
4) if xmin1 > switch_step2, read newer table
5) otherwise read both

-- 
marko
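Spelled out as SQL, the rotation and reader steps above would look roughly like this. This is not PgQ's actual code; the bookkeeping table and its columns are made up:

    -- Rotation, two separate transactions (steps 2-3 and 4-5):
    BEGIN;
    UPDATE queue_state
       SET cur_table    = cur_table + 1,
           switch_step1 = txid_current(),
           switch_step2 = NULL;
    COMMIT;

    BEGIN;
    UPDATE queue_state
       SET switch_step2 = txid_current()
     WHERE switch_step2 IS NULL;
    COMMIT;

    -- Reader, deciding which table(s) to scan for a batch bounded by the
    -- snapshots lo_snap and hi_snap (PL/pgSQL-ish pseudocode):
    --   IF txid_snapshot_xmax(hi_snap) < switch_step1 THEN
    --       scan the older table only;
    --   ELSIF txid_snapshot_xmin(lo_snap) > switch_step2 THEN
    --       scan the newer table only;
    --   ELSE
    --       scan both, e.g. with UNION ALL;
    --   END IF;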
On 5/11/2010 11:20 AM, Marko Kreen wrote: > On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: >> On 5/11/2010 9:36 AM, Marko Kreen wrote: >> > And although PgQ can operate with any N >= 2 segments, it queries >> > on 2 at a time, same as Slony. Rest are just there to give admins >> > some safety room for "OH F*CK" moments. With short rotation times, >> > it starts to seem useful.. >> > >> >> Agreed. The rotation time should actually reflect the longest running >> transactions experienced on a frequent base from the application. And there >> needs to be a safeguard against rotating over even longer running >> transactions. > > Nightly pg_dump.. ;) > >> The problem with a long running transaction is that it could have written >> into log segment 1 before we switched to segment 2. We can only TRUNCATE >> segment 1 after that transaction committed AND the log has been consumed by >> everyone interested in it. >> >> I am not familiar with how PgQ/Londiste do this. Slony specifically >> remembers the highest XID in progress at the time of switching, waits until >> the lowest XID in progress is higher than that (so all log that ever went >> into that segment is now visible or aborted), then waits for all log in that >> segment to be confirmed and finally truncates the log. All this time, it >> needs to do the UNION query over both log segments. > > The "highest XID" means actually "own transaction" here? > And it's not committed yet? That's seems to leave transactions > that happen before it's own commit into dubious state? One needs to tell transactions to switch log, commit, then look at the highest running XID after that. Any XID lower/equal to that one could possibly have written into the old segment. > > Although you may be fine, if you don't try to minimize > reading both tables. > > PgQ does this: > > Rotate: > 1) If some consumer reads older table, don't rotate. > 2) Set table_nr++, switch_step1 = txid_current(), switch_step2 = NULL > 3) Commit > 4) Set switch_step2 = txid_current() where switch_step2 IS NULL > 5) Commit Right, exactly like that :) > Reader: > 1) xmin1 = xmin of lower snapshot of batch > 2) xmax2 = xmax of higher snapshot of batch > 3) if xmax2 < switch_step1, read older table > 4) if xmin1 > switch_step2, read newer table > 5) otherwise read both Sounds familiar. I still don't know exactly what role the 3rd log segment plays in that, but it sure cannot hurt. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/11/2010 11:11 AM, Simon Riggs wrote:
> On Tue, 2010-05-11 at 10:38 -0400, Jan Wieck wrote:
>> Slony replication has meant both too from the beginning.
>
> You've done a brilliant job and I have huge respect for that.
>
> MHO: The world changes and new solutions emerge. Assimilation of
> technology into lower layers of the stack has been happening for years.
> The core parts of Slony should be assimilated, just as TCP/IP now exists
> as part of the OS, to the benefit of all. Various parts of Slony have
> already moved to core. Slony continues to have huge potential, though as
> part of an evolution, not in all cases fulfilling the same role it did
> at the beginning. Log shipping cannot easily exist outside of core,
> though SQL shipping can: but should it? How much more could we do?

I don't have any problem with assimilation of technology or moving things into core if appropriate. What I have a problem with is stuffing things into core for minor advantages, then later discovering that we lost flexibility essential for important features.

Right now one can use Slony 2.0 to do PostgreSQL major version upgrades via switchover. Using pgbouncer, these can even be done transparently to the application, without the need to reconnect to the new master. I think Londiste has, or is at least working on, similar features. This is because Slony 2.0 is a separate product relying only on very stable core functionality, like txids and snapshots.

Are you ready to "guarantee" that the queue and transport mechanism you want to put into core is THAT stable and major-version independent? I would not, but that may be just me.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: > On 5/11/2010 11:20 AM, Marko Kreen wrote: > > On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: > > > On 5/11/2010 9:36 AM, Marko Kreen wrote: > > > > And although PgQ can operate with any N >= 2 segments, it queries > > > > on 2 at a time, same as Slony. Rest are just there to give admins > > > > some safety room for "OH F*CK" moments. With short rotation times, > > > > it starts to seem useful.. > > > > > > > > > > Agreed. The rotation time should actually reflect the longest running > > > transactions experienced on a frequent base from the application. And > there > > > needs to be a safeguard against rotating over even longer running > > > transactions. > > > > > > > Nightly pg_dump.. ;) > > > > > > > The problem with a long running transaction is that it could have > written > > > into log segment 1 before we switched to segment 2. We can only TRUNCATE > > > segment 1 after that transaction committed AND the log has been consumed > by > > > everyone interested in it. > > > > > > I am not familiar with how PgQ/Londiste do this. Slony specifically > > > remembers the highest XID in progress at the time of switching, waits > until > > > the lowest XID in progress is higher than that (so all log that ever > went > > > into that segment is now visible or aborted), then waits for all log in > that > > > segment to be confirmed and finally truncates the log. All this time, it > > > needs to do the UNION query over both log segments. > > > > > > > The "highest XID" means actually "own transaction" here? > > And it's not committed yet? That's seems to leave transactions > > that happen before it's own commit into dubious state? > > > > One needs to tell transactions to switch log, commit, then look at the > highest running XID after that. Any XID lower/equal to that one could > possibly have written into the old segment. Yeah, sounds fine. Except you cannot ignore the newer table with that. But that makes difference only for consumers that are lagging. > > Although you may be fine, if you don't try to minimize > > reading both tables. > > > > PgQ does this: > > > > Rotate: > > 1) If some consumer reads older table, don't rotate. > > 2) Set table_nr++, switch_step1 = txid_current(), switch_step2 = NULL > > 3) Commit > > 4) Set switch_step2 = txid_current() where switch_step2 IS NULL > > 5) Commit > > > > Right, exactly like that :) > > > > Reader: > > 1) xmin1 = xmin of lower snapshot of batch > > 2) xmax2 = xmax of higher snapshot of batch > > 3) if xmax2 < switch_step1, read older table > > 4) if xmin1 > switch_step2, read newer table > > 5) otherwise read both > > > > Sounds familiar. I still don't know exactly what role the 3rd log segment > plays in that, but it sure cannot hurt. It makes sure you have one rotation_period of events always available. In case you want to do some recovery on them. But that's it. -- marko
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: > I changed the subject line because we are diving deep into implementation > details. Here is my take on various issued related to queueing: http://wiki.postgresql.org/wiki/GDQIssues Feel free to add / re-prioritize the list. -- marko
Jan, Marko, Simon, I'm concerned that doing anything about the write overhead issue was discarded almost immediately in this discussion. This is not a trivial issue for performance; it means that each row which is being tracked by the GDQ needs to be written to disk a minimum of 4 times (once to WAL, once to table, once to WAL for queue, once to queue). That's at least one time too many, and effectively doubles the load on the master server. This is particularly unacceptable overhead for systems where users are not that interested in retaining the queue after an unexpected shutdown. Surely there's some way around this? Some kind of special fsync-on-write table, for example? The access pattern to a queue is quite specialized. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
----- Original message -----
> Jan, Marko, Simon,
>
> I'm concerned that doing anything about the write overhead issue was
> discarded almost immediately in this discussion. This is not a trivial
> issue for performance; it means that each row which is being tracked by
> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
> once to table, once to WAL for queue, once to queue). That's at least
> one time too many, and effectively doubles the load on the master server.
>
> This is particularly unacceptable overhead for systems where users are
> not that interested in retaining the queue after an unexpected shutdown.
>
> Surely there's some way around this? Some kind of special
> fsync-on-write table, for example? The access pattern to a queue is
> quite specialized.
Uh, this seems like purely theoretical speculation, which
also ignores the "generic queue" aspect.
In practice, with databases where there are more reads than
writes, the additional queue write seems insignificant.
So I guess it's up to you to bring hard proof that the
additional writes are a problem.
If we are already speculating, I'd guess that writing to
WAL and an INSERT-only queue table involves a lot less
seeking than writing to the actual table.
But feel free to edit the "Goals" section, unless you are
talking about non-transactional queueing, which seems
off-topic here.
--
marko
On Mon, 2010-05-17 at 14:46 -0700, Josh Berkus wrote:
> Jan, Marko, Simon,
>
> I'm concerned that doing anything about the write overhead issue was
> discarded almost immediately in this discussion.

The only thing we can do about write overhead _on_master_ is to trade it for transaction boundary reconstruction on the slave (or a special intermediate node), effectively implementing a "logical WAL" in addition to (or as an extension of) the current WAL.

> This is not a trivial
> issue for performance; it means that each row which is being tracked by
> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
> once to table, once to WAL for queue, once to queue).

In reality the WAL record for the main table is most times forced to disk in the same WAL write as the WAL record for the queue. And the actual queue page does not reach disk at all if queue rotation is fast.

> That's at least
> one time too many, and effectively doubles the load on the master server.

It doubles the "throughput/sequential load" on the fs cache but does much less to the "number of fsyncs", as all those writes are done within the same transaction and only the WAL writes need to get to disk.

In my unscientific tests with pgbench, adding FKs between the pgbench tables plus adding a PK to the log table had a bigger performance impact than setting up replication using Londiste.

> This is particularly unacceptable overhead for systems where users are
> not that interested in retaining the queue after an unexpected shutdown.

Users not needing data after an unexpected shutdown should use temp tables. If several users need the same data, then global temp tables should be implemented / used.

> Surely there's some way around this? Some kind of special
> fsync-on-write table, for example?

This is sure to have a large negative performance impact. WAL was added to PostgreSQL for just this - to get rid of fsync-on-commit (fsync-on-write is as bad as or worse than fsync-on-commit).

> The access pattern to a queue is
> quite specialized.

A generic solution for such users would be implementing Global Temporary Tables (which need no WAL), and then using these for a non-persistent GDQ.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
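For the non-persistent case mentioned above, a plain temp table already gives a non-WAL-logged queue whose contents are simply lost on crash or disconnect — but it is visible to one session only, which is exactly why a shared queue would need global temporary tables. A sketch, with made-up names:

    -- Not WAL-logged, gone after crash or disconnect, and visible only to
    -- the session that created it.
    CREATE TEMP TABLE session_queue (
        ev_id   bigserial PRIMARY KEY,
        ev_data text NOT NULL
    ) ON COMMIT PRESERVE ROWS;

    INSERT INTO session_queue (ev_data) VALUES ('payload');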
Josh, could you give more details on this item:

* Something which will work with other databases and caching systems?

What exactly are you thinking of here? Compatibility with some APIs? Frameworks? Having some low-level guarantees?

-- 
marko
On 5/17/2010 5:46 PM, Josh Berkus wrote:
> Jan, Marko, Simon,
>
> I'm concerned that doing anything about the write overhead issue was
> discarded almost immediately in this discussion. This is not a trivial
> issue for performance; it means that each row which is being tracked by
> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
> once to table, once to WAL for queue, once to queue). That's at least
> one time too many, and effectively doubles the load on the master server.
>
> This is particularly unacceptable overhead for systems where users are
> not that interested in retaining the queue after an unexpected shutdown.
>
> Surely there's some way around this? Some kind of special
> fsync-on-write table, for example? The access pattern to a queue is
> quite specialized.

I recall this slightly differently. The idea of a PostgreSQL-managed queue that does NOT guarantee consistency with the final commit status of the message-generating transactions was discarded. That is not the same as ignoring the write overhead.

In all our existing use cases (Londiste/Slony/Bucardo), the information in the queue cannot be entirely found in the WAL of the original underlying row operation. There are old row key values and sequence numbers or other meta information that isn't even known at the time the original row's WAL entry is written.

It may seem possible to implement the data-capturing part of the queue within the heap access methods, add the extra information to the WAL record and thus get rid of one of the images. But that isn't as simple as it sounds. Since queue tables have toast tables too, they don't consist of simply one "log entry"; they actually consist of a bunch of tuples: one in the queue table, 0-n in the queue's toast table, and then the index tuples. In the case of compression, the binary data in the toasted queue attribute will be entirely different from what you may find in the WAL pieces that were written for the original data row's toast segments.

It is going to be a heck of a forensics job to reconstruct all that.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On Tue, 2010-05-18 at 01:53 +0200, Hannu Krosing wrote: > On Mon, 2010-05-17 at 14:46 -0700, Josh Berkus wrote: > > Jan, Marko, Simon, > > > > I'm concerned that doing anything about the write overhead issue was > > discarded almost immediately in this discussion. > > Only thing we can do to write overhead _on_master_ is to trade it for > transaction boundary reconstruction on slave (or special intermediate > node), effectively implementing a "logical WAL" in addition to (or as an > extension of) the current WAL. That does sound pretty good to me. Fairly easy to make the existing triggers write XLOG_NOOP WAL records directly rather than writing to a queue table, which also gets logged to WAL. We could just skip the queue table altogether. Even better would be extending WAL format to include all the information you need, so it gets written to WAL just once. > > This is not a trivial > > issue for performance; it means that each row which is being tracked by > > the GDQ needs to be written to disk a minimum of 4 times (once to WAL, > > once to table, once to WAL for queue, once to queue). > > In reality the WAL record for main table is forced to disk mosttimes in > the same WAL write as the WAL record for queue. And the actual queue > page does not reach disk at all if queue rotation is fast. Josh, you really should do some measurements to show the overheads. Not sure you'll get people just to accept that assertion otherwise. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-05-20 at 20:51 +0100, Simon Riggs wrote:
> On Tue, 2010-05-18 at 01:53 +0200, Hannu Krosing wrote:
>> On Mon, 2010-05-17 at 14:46 -0700, Josh Berkus wrote:
>>> Jan, Marko, Simon,
>>>
>>> I'm concerned that doing anything about the write overhead issue was
>>> discarded almost immediately in this discussion.
>>
>> The only thing we can do about write overhead _on_master_ is to trade it
>> for transaction boundary reconstruction on the slave (or a special
>> intermediate node), effectively implementing a "logical WAL" in addition
>> to (or as an extension of) the current WAL.
>
> That does sound pretty good to me.
>
> Fairly easy to make the existing triggers write XLOG_NOOP WAL records
> directly rather than writing to a queue table, which also gets logged to
> WAL. We could just skip the queue table altogether.
>
> Even better would be extending WAL format to include all the information
> you need, so it gets written to WAL just once.

Maybe it is also possible (less intrusive / easier to implement) to add some things to WAL that have met resistance as general trigger-based features, like a "logical representation" of DDL.

We already have the equivalent of minimal ON COMMIT / ON ROLLBACK triggers in the form of commit/rollback records in WAL.

Also, if we use extended WAL as the GDQ, then there should be a possibility to write WAL in a form that supports only the "logical" (plus, of course, Durability) features but not full backup and WAL-based replication.

And a possibility to have "user-defined" WAL records for specific tasks would also be a nice, PostgreSQL-style extensibility feature.

>>> This is not a trivial
>>> issue for performance; it means that each row which is being tracked by
>>> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
>>> once to table, once to WAL for queue, once to queue).
>>
>> In reality the WAL record for the main table is most times forced to disk
>> in the same WAL write as the WAL record for the queue. And the actual
>> queue page does not reach disk at all if queue rotation is fast.
>
> Josh, you really should do some measurements to show the overheads. Not
> sure you'll get people just to accept that assertion otherwise.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training