Thread: Clustering features for upcoming developer meeting -- please claim yours!
CLH,

The pgCon developer meeting is coming up next week. We have a tentative agenda item to discuss clustering features, but we need to know specifically *what* we are going to discuss.

As a reminder, the list of features is here:

http://wiki.postgresql.org/wiki/ClusterFeatures

Of these, the following seem well-defined enough to be worth talking about. The questions to answer are:
(a) which features actually have someone on THIS list who plans to work on them?
(b) will that person be at pgCon?

Please claim features which you are ready to talk about, ASAP. Thanks!

* Export snapshots to other sessions - 11
* Global deadlock detection - 9
* API into the Parser / Parser as an independent module - 9
* Start/stop archiving at runtime - 8
* XID feed - 4 (included because XC seems to have written this)

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
Hi,

I'm attending the meeting and working on snapshot exporting and the XID feed. I can talk about these features.

Regards,
----------
Koichi Suzuki

2010/5/7 Josh Berkus <josh@agliodbs.com>:
> CLH,
>
> The pgCon developer meeting is coming up next week. We have a tentative
> agenda item to discuss clustering features, but we need to know
> specifically *what* we are going to discuss.
>
> As a reminder, the list of features is here:
>
> http://wiki.postgresql.org/wiki/ClusterFeatures
>
> Of these, the following seem well-defined enough to be worth talking
> about. The questions to answer are:
> (a) which features actually have someone on THIS list who plans to work
> on them?
> (b) will that person be at pgCon?
>
> Please claim features which you are ready to talk about, ASAP. Thanks!
>
> * Export snapshots to other sessions - 11
> * Global deadlock detection - 9
> * API into the Parser / Parser as an independent module - 9
> * Start/stop archiving at runtime - 8
> * XID feed - 4 (included because XC seems to have written this)
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 05/06/2010 10:36 PM, Koichi Suzuki wrote:
> Hi,
>
> I'm attending the meeting and working on snapshot exporting and the
> XID feed. I can talk about these features.

Can you expand the description of the XID feed on the wiki page?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
Okay, I will. Are there any particular issues you'd like it to cover?
----------
Koichi Suzuki

2010/5/8 Josh Berkus <josh@agliodbs.com>:
> On 05/06/2010 10:36 PM, Koichi Suzuki wrote:
>>
>> Hi,
>>
>> I'm attending the meeting and working on snapshot exporting and the
>> XID feed. I can talk about these features.
>
> Can you expand the description of the XID feed on the wiki page?
>
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
Re: Clustering features for upcoming developer meeting -- please claim yours!
Josh Berkus <josh@agliodbs.com> wrote:
> The pgCon developer meeting is coming up next week. We have a tentative
> agenda item to discuss clustering features, but we need to know
> specifically *what* we are going to discuss.

I'd like to discuss "Function scan push-down" and "Modification trigger into core" from the list. I wrote additional descriptions for them on the wiki and will use them at the meeting. Comments and adjustments to the topics are welcome.

* SQL/MED for WHERE-clause push-down
  http://wiki.postgresql.org/wiki/SQL/MED#FDW_routines
* General Modification Trigger and Generalized Data Queue Design
  http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Design

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
On 5/6/2010 7:42 PM, Josh Berkus wrote: > CLH, > > The pgCon developer meeting is coming up next week. We have a tentative > agenda item to discuss clustering features, but we need to know > specifically *what* we are going to discuss. > > As a reminder, the list of features is here: > > http://wiki.postgresql.org/wiki/ClusterFeatures > > Of these, the following seem well-defined enough to be worth talkign > about. The question to answer is, > (a) which features actually have someone on THIS list who plans to work > on them? > (b) will that person be at pgCon? > > Please claim features which you are ready to talk about, ASAP. Thanks! > > * Export snapshots to other sessions - 11 > * Global deadlock detection - 9 > * API into the Parser / Parser as an independent module - 9 > * Start/stop archiving at runtime - 8 > * XID feed - 4 (included because XC seems to have written this) > Aside from that list, I'd like to get into a little more detail on DDL triggers. This seems to be something I could actually work on in the future. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: Clustering features for upcoming developer meeting -- please claim yours!
Jan, > Aside from that list, I'd like to get into a little more detail on DDL > triggers. This seems to be something I could actually work on in the > future. Is this the same thing as the general modification trigger? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 5/10/2010 1:39 PM, Josh Berkus wrote:
> Jan,
>
>> Aside from that list, I'd like to get into a little more detail on DDL
>> triggers. This seems to be something I could actually work on in the
>> future.
>
> Is this the same thing as the general modification trigger?

To my understanding, the general modification triggers are meant to unify the "data" queue mechanisms that both Londiste and Slony are based on under one new, built-in mechanism, with the intention of cutting down the overhead associated with them.

There is certainly a big need to coordinate this project with any attempts made in the direction of DDL triggers. I think it is obvious that I would later on like to make use of them within Slony to replicate schema changes. This of course requires that such schema changes get applied on the replicas at the correct place inside the data stream. For example, if you "ALTER TABLE ADD COLUMN", you want to replicate all DML changes that happened before that ALTER TABLE grabbed its exclusive lock before the ALTER TABLE itself. And it would be quite disastrous to attempt to apply any INSERT that happened on the master with that new column before the ALTER TABLE happened on the replica.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
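To make the ordering requirement above concrete, here is a minimal SQL sketch of the scenario being described; the table and column names are made up:

    -- Statement order on the master (one session, one transaction):
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (1, 100);       -- DML using the old schema
    UPDATE accounts SET balance = 150 WHERE id = 1;
    ALTER TABLE accounts ADD COLUMN currency text;             -- grabs its exclusive lock here
    INSERT INTO accounts (id, balance, currency)                -- DML that already needs the new column
        VALUES (2, 200, 'EUR');
    COMMIT;

    -- On the replica the same order must be preserved: replaying the second
    -- INSERT before the ALTER TABLE would fail outright, because column
    -- "currency" does not exist there yet.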
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 5/10/10, Jan Wieck <JanWieck@yahoo.com> wrote: > On 5/10/2010 1:39 PM, Josh Berkus wrote: > > > Aside from that list, I'd like to get into a little more detail on DDL > > > triggers. This seems to be something I could actually work on in the > > > future. > > > > > > > Is this the same thing as the general modification trigger? > > > > To my understanding, the general modification triggers are meant to unify > the "data" queue mechanisms, both Londiste and Slony are based on, under one > new, built in mechanism with the intention to cut down the overhead > associated with them. > > There is certainly a big need to coordinate this project with any attempts > made in the direction of DDL triggers. I think it is obvious that I would > later on like to make use of them within Slony to replicate schema changes. > This of course requires that such schema changes get applied on the > replica's at the correct place inside the data stream. For example, if you > "ALTER TABLE ADD COLUMN", you want to replicate all DML changes, that > happened before that ALTER TABLE grabbed its exclusive lock, before that > ALTER TABLE itself. And it would be quite disastrous to attempt to apply any > INSERT that happened on the master with that new column before the ALTER > TABLE happened on the replica. AFAICS the "agreeable order" should take care of positioning: http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation This combined with DML triggers that react to invalidate events (like PgQ ones) should already work fine? Are there situations where such setup fails? -- marko
On 5/10/2010 4:25 PM, Marko Kreen wrote:
> AFAICS the "agreeable order" should take care of positioning:
>
> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>
> This combined with DML triggers that react to invalidate events (like
> PgQ ones) should already work fine?
>
> Are there situations where such setup fails?

That explanation of an agreeable order only solves the problem of placing the DDL into the replication stream between transactions, possibly done by multiple clients.

It in no way addresses the problem of one single client that executes a couple of updates, modifies the object, then continues with updates. In that case, there isn't even a transaction boundary at which the DDL happened on the master. And this one transaction could indeed alter the object several times.

This means that a generalized data queue needs to have hooks, so that DDL triggers can inject their payload into it.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 5/10/2010 4:25 PM, Marko Kreen wrote:
>> AFAICS the "agreeable order" should take care of positioning:
>>
>> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>>
>> This combined with DML triggers that react to invalidate events (like
>> PgQ ones) should already work fine?
>>
>> Are there situations where such setup fails?
>
> That explanation of an agreeable order only solves the problem of placing
> the DDL into the replication stream between transactions, possibly done
> by multiple clients.
>
> It in no way addresses the problem of one single client that executes a
> couple of updates, modifies the object, then continues with updates. In
> that case, there isn't even a transaction boundary at which the DDL
> happened on the master. And this one transaction could indeed alter the
> object several times.

But the event order would be strictly defined by the sequence id? And the local, invalidation-aware triggers would always see up-to-date state, no? And it would be applied as a single TX on the subscriber too. Where's the problem?

> This means that a generalized data queue needs to have hooks, so that DDL
> triggers can inject their payload into it.

If you mean "hooks" as the pgq.insert_event() function, then yes... I hope you are designing a generally usable queue with the GDQ.

Btw, speaking of DDL triggers, as long as we don't have them I'm assuming all replicated DDL would be applied as:

1) Execute DDL statement
2) Insert statement into queue in same tx.

So I'm assuming the DDL trigger would simply perform step 2) automatically. Perhaps you are thinking of some other sort of DDL triggers?

-- 
marko
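As a rough illustration of the two-step approach described above (execute the DDL, then enqueue it in the same transaction), assuming a PgQ installation is present; the queue name and the event-type tag below are made up:

    BEGIN;

    -- 1) Execute the DDL statement on the master.
    ALTER TABLE accounts ADD COLUMN currency text;

    -- 2) Record the same statement in the queue, inside the same transaction,
    --    so it is ordered correctly relative to the surrounding DML events.
    SELECT pgq.insert_event(
        'replika',                                            -- queue name (made up)
        'EXECUTE',                                            -- application-chosen event type
        'ALTER TABLE accounts ADD COLUMN currency text');     -- payload to replay on the subscriber

    COMMIT;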
Re: Clustering features for upcoming developer meeting -- please claim yours!
On Mon, 2010-05-10 at 17:04 -0400, Jan Wieck wrote:
> On 5/10/2010 4:25 PM, Marko Kreen wrote:
>> AFAICS the "agreeable order" should take care of positioning:
>>
>> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>>
>> This combined with DML triggers that react to invalidate events (like
>> PgQ ones) should already work fine?
>>
>> Are there situations where such setup fails?
>
> That explanation of an agreeable order only solves the problem of
> placing the DDL into the replication stream between transactions,
> possibly done by multiple clients.

Why only "between transactions" (whatever that means)?

If all transactions get their event ids from the same non-cached sequence, then the event id _is_ a reliable ordering within a set of concurrent transactions.

Event ids get serialized (where it matters) by the very locks taken by the DDL/DML statements on the objects they manipulate.

Once more, for this to work over more than one backend, the sequence providing the event ids needs to be non-cached.

> It in no way addresses the problem of one single client that executes a
> couple of updates, modifies the object, then continues with updates. In
> that case, there isn't even a transaction boundary at which the DDL
> happened on the master. And this one transaction could indeed alter the
> object several times.

How is DDL here different from DML? You need to replay DML in the right order too, no?

> This means that a generalized data queue needs to have hooks, so that
> DDL triggers can inject their payload into it.

Anything that needs to be replicated needs "to have hooks" in the generalized data queue, so that

a) they get replicated in the right order for each affected object
   a.1) this can be relaxed for related objects in case FKs are
        disabled or deferred until transaction end
b) they get committed on the subscriber side at transaction (set)
   boundaries of the provider.

If you implement the data queue as something non-transactional (non-pgQ-like), then you need to replicate (i.e. copy over and replay)

c) events from both committed and rolled-back transactions
d) commits/rollbacks themselves
e) and apply and/or roll back each individual transaction separately

IOW you mostly re-implement WAL, except at the logical level. Which may or may not be a good thing, depending on other requirements of the system.

If you do it using pgQ you save on not copying rolled-back data, but you do slightly more work on the provider side. You also end up with not having dead tuples from aborted transactions on the subscriber.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
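A minimal sketch of the capture side described above — a non-cached sequence plus a row trigger. This is not Slony's or PgQ's actual schema; all names and the payload format are made up:

    -- Strictly monotonic event ids: CACHE 1 prevents backends from handing
    -- out pre-fetched blocks of ids in an order unrelated to lock order.
    CREATE SEQUENCE gdq_ev_id_seq CACHE 1;

    CREATE TABLE gdq_log (
        ev_id    bigint NOT NULL DEFAULT nextval('gdq_ev_id_seq'),
        ev_txid  bigint NOT NULL DEFAULT txid_current(),
        ev_table text   NOT NULL,
        ev_op    text   NOT NULL,          -- 'INSERT', 'UPDATE' or 'DELETE'
        ev_data  text
    );

    CREATE OR REPLACE FUNCTION gdq_capture() RETURNS trigger AS $$
    BEGIN
        -- nextval() fires here, after the DML statement has already taken
        -- its locks, so conflicting changes get their ids in lock order.
        IF TG_OP = 'DELETE' THEN
            INSERT INTO gdq_log (ev_table, ev_op, ev_data)
                VALUES (TG_TABLE_NAME, TG_OP, OLD::text);
            RETURN OLD;
        ELSE
            INSERT INTO gdq_log (ev_table, ev_op, ev_data)
                VALUES (TG_TABLE_NAME, TG_OP, NEW::text);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_gdq AFTER INSERT OR UPDATE OR DELETE ON accounts
        FOR EACH ROW EXECUTE PROCEDURE gdq_capture();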
Re: Clustering features for upcoming developer meeting -- please claim yours!
On Tue, 2010-05-11 at 01:08 +0300, Marko Kreen wrote:
> On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
>> This means that a generalized data queue needs to have hooks, so that DDL
>> triggers can inject their payload into it.
>
> If you mean "hooks" as the pgq.insert_event() function, then yes...
> I hope you are designing a generally usable queue with the GDQ.

The only way to have a generally usable queue different from pgQ is having something that copies all events off the server (to either the final subscriber or some forwarding/processing station) and leaves the commit/abort resolution to "the other server". The event id should still provide a usable order for applying these events.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
On 5/10/2010 6:40 PM, Hannu Krosing wrote: > On Mon, 2010-05-10 at 17:04 -0400, Jan Wieck wrote: >> On 5/10/2010 4:25 PM, Marko Kreen wrote: >> > AFAICS the "agreeable order" should take care of positioning: >> > >> > http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation >> > >> > This combined with DML triggers that react to invalidate events (like >> > PgQ ones) should already work fine? >> > >> > Are there situations where such setup fails? >> > >> >> That explanation of an agreeable order only solves the problems of >> placing the DDL into the replication stream between transactions, >> possibly done by multiple clients. > > Why only "between transactions" (whatever that means) ? > > If all transactions get their event ids from the same non-cached > sequence, then the event id _is_ a reliable ordering within a set of > concurrent transactions. > > Event id's get serialized (wher it matters) by the very locks taken by > DDL/DML statments on the objects they manipulate. > > Once more, for this to work over more than one backend, the sequence > providing the event id's needs to be non-cached. > >> It does in no way address the problem of one single client executing a >> couple of updates, modifies the object, then continues with updates. In >> this case, there isn't even a transaction boundary at which the DDL >> happened on the master. And this one transaction could indeed alter the >> object several times. > > How is DDL here different from DML herev? > > You need to replay DML in the right order too, no ? > >> This means that a generalized data queue needs to have hooks, so that >> DDL triggers can inject their payload into it. > > Anything that needs to be replicated, needs "to have hooks" in the > generalized data queue, so that > > a) they get replicated in the right order for each affected object > a.1) this can be relaxed for related objects in case FK-s are > disabled of deferred until transaction end > b) they get committed on the subscriber side at transaction (set) > boundaries of provider. > > if you implement the data queue as something non-transactional (non > pgQ-like), then you need to replicate (i,e. copy over and replay > > c) events from both committed and rollbacked transaction > d) commits/rollbacks themselves > e) and apply and/or rollback each individual transaction separately > > IOW you mostly re-implement WAL, except at logical level. Which may or > may not be a good thing, depending on other requirements of the system. > > If you do it using pgQ you save on not copying rollbacked data, but you > do slightly more work on provider side. You also end up with not having > dead tuples from aborted transactions on subscriber. > So the idea is to have one queue that captures row level DML events as well as statement level DDL. That is certainly possible and in that case the event id will indeed provide a usable order for applying these actions, if it is taken from a non-cached sequence after all locks have been taken, as Marko explained. That event id resembles Slony's action_seq. The thing this event id alone does not provide is any point where inside that sequence of event id's the replica can issue commits. On a busy server, there may never be any such moment unless the replica applies things the Slony way instead of in monotonically increasing event id's. 
If your idea is to simply record things WAL-style and shove them off to the replicas, you just move some of the current overhead from the master by duplicating it onto every replica.

There are more things to consider about such a generalized queue, especially if we think about adding it to core.

One, for example, is version independence. Slony, and I think Londiste too, can replicate across PostgreSQL server versions. And experience shows us that no communications protocol, on-disk format or the like is ever set in stone. So we need to think about how this queue can become backwards compatible without introducing more overhead than we are trying to save right now.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Re: Clustering features for upcoming developer meeting -- please claim yours!
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 5/10/2010 6:40 PM, Hannu Krosing wrote:
>> On Mon, 2010-05-10 at 17:04 -0400, Jan Wieck wrote:
>>> On 5/10/2010 4:25 PM, Marko Kreen wrote:
>>>> AFAICS the "agreeable order" should take care of positioning:
>>>> http://wiki.postgresql.org/wiki/ModificationTriggerGDQ#Suggestions_for_Implementation
>>>> This combined with DML triggers that react to invalidate events (like
>>>> PgQ ones) should already work fine?
>>>> Are there situations where such setup fails?
>
> So the idea is to have one queue that captures row level DML events as
> well as statement level DDL. That is certainly possible and in that case
> the event id will indeed provide a usable order for applying these
> actions, if it is taken from a non-cached sequence after all locks have
> been taken, as Marko explained.
>
> That event id resembles Slony's action_seq.
>
> The thing this event id alone does not provide is any point where inside
> that sequence of event ids the replica can issue commits. On a busy
> server, there may never be any such moment unless the replica applies
> things the Slony way instead of in monotonically increasing event ids.
> If your idea is to simply record things WAL-style and shove them off to
> the replicas, you just move some of the current overhead from the master
> by duplicating it onto every replica.

I'm not sure what overhead you are talking about.

Are you trying to get rid of the current snapshot-based grouping of events? Why?

> There are more things to consider about such a generalized queue,
> especially if we think about adding it to core.
>
> One, for example, is version independence. Slony, and I think Londiste
> too, can replicate across PostgreSQL server versions. And experience
> shows us that no communications protocol, on-disk format or the like is
> ever set in stone. So we need to think about how this queue can become
> backwards compatible without introducing more overhead than we are
> trying to save right now.

I'm guessing you are trying to do 2 more things:

1) Add queue operations to SQL syntax
2) Non-table custom storage.

I'm indifferent to 1) and dubious about how big a win 2) can bring, but glad to be proven wrong.

But there's another issue - our experience with PgQ has shown that a generic queue also means generic code operating on it, which means bugs. And transactional queue readers are not allowed to drop events on problems. Which means that on problems, admins need to examine the queue and delete/modify the events.

Of course, the bug causing the problem also needs to be fixed, but bugfixing does not repair the queue; that must be done manually.

If 1) and/or 2) means such a possibility is removed, it will be quite a big hit to the generic-ness of the GDQ.

In that aspect I would prefer to fix any remaining problems (what are they?) with plain queue tables, even if the "NoSQL" queueing could perform significantly better.

-- 
marko
GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
I changed the subject line because we are diving deep into implementation details.

On 5/11/2010 5:24 AM, Marko Kreen wrote:
> On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
>> The thing this event id alone does not provide is any point where inside
>> that sequence of event ids the replica can issue commits. On a busy
>> server, there may never be any such moment unless the replica applies
>> things the Slony way instead of in monotonically increasing event ids.
>> If your idea is to simply record things WAL-style and shove them off to
>> the replicas, you just move some of the current overhead from the master
>> by duplicating it onto every replica.
>
> I'm not sure what overhead you are talking about.
>
> Are you trying to get rid of the current snapshot-based grouping
> of events? Why?

The problem statement on the wiki page and Itagaki's comments about non-table storage of the queue made it look to me as if some WAL-style flat-file approach was being aimed for.

I am glad that we agree that we cannot get rid of the snapshot-based grouping. That and the IMHO required table storage are the overhead I was talking about. We should be clear that we cannot get rid of that grouping and that, however many log segments are used (Slony currently 2, Londiste default 3), the oldest running transaction on the master determines which log segments can get truncated. The more log segments there are in use, the more UNION keywords may appear in the query selecting from the log.

>> There are more things to consider about such a generalized queue,
>> especially if we think about adding it to core.
>>
>> One, for example, is version independence. Slony, and I think Londiste
>> too, can replicate across PostgreSQL server versions. And experience
>> shows us that no communications protocol, on-disk format or the like is
>> ever set in stone. So we need to think about how this queue can become
>> backwards compatible without introducing more overhead than we are
>> trying to save right now.
>
> I'm guessing you are trying to do 2 more things:
>
> 1) Add queue operations to SQL syntax
> 2) Non-table custom storage.

No. I don't know how you read 1) into the above, and 2) was my misunderstanding reading the wiki. I don't want either.

> But there's another issue - our experience with PgQ has shown
> that a generic queue also means generic code operating on it,
> which means bugs. And transactional queue readers are not
> allowed to drop events on problems. Which means that on problems,
> admins need to examine the queue and delete/modify the events.
>
> Of course, the bug causing the problem also needs to be fixed,
> but bugfixing does not repair the queue; that must be done
> manually.
>
> If 1) and/or 2) means such a possibility is removed,
> it will be quite a big hit to the generic-ness of the GDQ.
>
> In that aspect I would prefer to fix any remaining problems
> (what are they?) with plain queue tables, even if
> the "NoSQL" queueing could perform significantly better.

A generic queue implementation needs to come with some advantage over what we have now. Otherwise there is no incentive for any of the existing systems to even consider switching to it.

What are the advantages of anything proposed over the current implementations used by Londiste and Slony?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
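Roughly what the snapshot-based grouping and the UNION over two log segments look like in SQL — a sketch against the hypothetical gdq_log layout shown earlier, not the actual Slony or Londiste queries:

    -- A batch is everything that became visible between two ticks; each tick
    -- stores a txid_snapshot.  $1 = snapshot of the previous tick,
    -- $2 = snapshot of the current tick.
    PREPARE get_batch (txid_snapshot, txid_snapshot) AS
    SELECT ev_id, ev_txid, ev_table, ev_op, ev_data
      FROM (SELECT * FROM gdq_log_1
            UNION ALL
            SELECT * FROM gdq_log_2) AS log
     WHERE NOT txid_visible_in_snapshot(ev_txid, $1)   -- not yet visible at the previous tick
       AND txid_visible_in_snapshot(ev_txid, $2)       -- but visible at the current tick
     ORDER BY ev_id;

The batch boundaries are what give the replica safe points at which to COMMIT, which a bare stream of event ids alone does not provide.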
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
> What are the advantages of anything proposed over the current
> implementations used by Londiste and Slony?

It would be good to have a core technology that provided a generic transport to other remote databases.

We already have WALSender and WALReceiver, which use the COPY protocol as a transport mechanism. It would be easy to extend that so we could send other forms of data.

We can do that in two ways:

* Alter triggers so that Slony/Londiste write directly to WAL rather than to log tables, using a new WAL record for custom data blobs.

* Alter WALSender so it can read Slony/Londiste log tables for consumption by an architecture similar to WALReceiver/Startup. Probably easier.

We can also alter the WAL format itself to include the information in WAL that is required to do what Slony/Londiste already do, so we don't need to specifically write anything at all, just read WAL at the other end. Even more efficient.

The advantages of these options would be

* integration of core technologies
* greater efficiency for trigger-based logging via WAL

In other RDBMSs "replication" has long meant "data transport, either for HA or application use". We should be looking beyond the pure HA aspects, as pgq does.

I would certainly like to see a system that wrote data on the master and then constructed the SQL on the receiver side (i.e. on the slave), so the integration was less tight. That would allow data to be sent and consumed for a variety of purposes, not just HA replay.

--
Simon Riggs
www.2ndQuadrant.com
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> I changed the subject line because we are diving deep into implementation
> details.
>
> On 5/11/2010 5:24 AM, Marko Kreen wrote:
>> On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
>>> The thing this event id alone does not provide is any point where
>>> inside that sequence of event ids the replica can issue commits. On a
>>> busy server, there may never be any such moment unless the replica
>>> applies things the Slony way instead of in monotonically increasing
>>> event ids. If your idea is to simply record things WAL-style and shove
>>> them off to the replicas, you just move some of the current overhead
>>> from the master by duplicating it onto every replica.
>>
>> I'm not sure what overhead you are talking about.
>>
>> Are you trying to get rid of the current snapshot-based grouping
>> of events? Why?
>
> The problem statement on the wiki page and Itagaki's comments about
> non-table storage of the queue made it look to me as if some WAL-style
> flat-file approach was being aimed for.
>
> I am glad that we agree that we cannot get rid of the snapshot-based
> grouping. That and the IMHO required table storage are the overhead I was
> talking about. We should be clear that we cannot get rid of that grouping
> and that, however many log segments are used (Slony currently 2, Londiste
> default 3), the oldest running transaction on the master determines which
> log segments can get truncated. The more log segments there are in use,
> the more UNION keywords may appear in the query selecting from the log.

Seems we are in agreement.

And although PgQ can operate with any N >= 2 segments, it queries 2 at a time, same as Slony. The rest are just there to give admins some safety room for "OH F*CK" moments. With short rotation times, that starts to seem useful.

There does not seem to be any advantage to querying more than 2 segments.

>>> There are more things to consider about such a generalized queue,
>>> especially if we think about adding it to core.
>>>
>>> One, for example, is version independence. Slony, and I think Londiste
>>> too, can replicate across PostgreSQL server versions. And experience
>>> shows us that no communications protocol, on-disk format or the like is
>>> ever set in stone. So we need to think about how this queue can become
>>> backwards compatible without introducing more overhead than we are
>>> trying to save right now.
>>
>> I'm guessing you are trying to do 2 more things:
>>
>> 1) Add queue operations to SQL syntax
>> 2) Non-table custom storage.
>
> No. I don't know how you read 1) into the above, and 2) was my
> misunderstanding reading the wiki. I don't want either.

Oh sorry, I got that impression from the wiki, not from you.

As there are some ideas from you on the wiki, I assumed you were involved, so I used "you" very liberally.

-- 
marko
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On 5/11/10, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
>> What are the advantages of anything proposed over the current
>> implementations used by Londiste and Slony?
>
> It would be good to have a core technology that provided a generic
> transport to other remote databases.

I suspect there still should be some sort of middleware code that reads the data from Postgres and writes it to the other db.

So the task of the GDQ should be to make data available to that reader, not to be "transport to remote databases", no?

And if we are talking about a "Generalized Data Queue", one important aspect is that it should be easy to write both queue readers and writers in whatever language the user wants. Which means it should be possible to do both reading and writing with ordinary SQL queries. Even requiring COPY is out, as it is not available in many adapters.

Of course it's OK to have such extensions available optionally. E.g. Londiste does event bulk insert with COPY. But it's not required for ordinary clients. And you can always turn SELECT output into COPY format.

> We already have WALSender and WALReceiver, which use the COPY protocol
> as a transport mechanism. It would be easy to extend that so we could
> send other forms of data.
>
> We can do that in two ways:
>
> * Alter triggers so that Slony/Londiste write directly to WAL rather
> than to log tables, using a new WAL record for custom data blobs.
>
> * Alter WALSender so it can read Slony/Londiste log tables for
> consumption by an architecture similar to WALReceiver/Startup. Probably
> easier.
>
> We can also alter the WAL format itself to include the information in
> WAL that is required to do what Slony/Londiste already do, so we don't
> need to specifically write anything at all, just read WAL at the other
> end. Even more efficient.

Hm. You'd need to tie WAL rotation to reader positions. And to read a largish batch from WAL, you also need to process unrelated data?

Reading from WAL is OK for full replication, but what about a smallish queue that gets only a small percentage of the overall traffic?

> The advantages of these options would be
>
> * integration of core technologies
> * greater efficiency for trigger-based logging via WAL
>
> In other RDBMSs "replication" has long meant "data transport, either for
> HA or application use". We should be looking beyond the pure HA aspects,
> as pgq does.
>
> I would certainly like to see a system that wrote data on the master and
> then constructed the SQL on the receiver side (i.e. on the slave), so the
> integration was less tight. That would allow data to be sent and consumed
> for a variety of purposes, not just HA replay.

pgq.logutriga()? It writes all columns, also NULL-ed ones, into the queue in urlencoded format. Londiste actually even knows how to generate SQL from those.

We use it for most non-replication queues, where we want to process the event more intelligently than simply executing it on some other connection.

-- 
marko
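For reference, attaching PgQ's generic logging trigger and consuming a batch over plain SQL look roughly like this. Queue, consumer and table names are made up, and the function names follow the stock PgQ API as commonly documented, so treat exact signatures as an approximation:

    -- Writer side: log every row change on "accounts" into queue "replika",
    -- with all columns urlencoded into the event payload.
    CREATE TRIGGER accounts_logutriga
        AFTER INSERT OR UPDATE OR DELETE ON accounts
        FOR EACH ROW EXECUTE PROCEDURE pgq.logutriga('replika');

    -- Reader side, plain SQL only (<batch_id> stands for the id returned
    -- by next_batch):
    SELECT pgq.register_consumer('replika', 'my_consumer');   -- once, at setup
    SELECT pgq.next_batch('replika', 'my_consumer');          -- NULL means "nothing new yet"
    SELECT * FROM pgq.get_batch_events(<batch_id>);           -- process the events ...
    SELECT pgq.finish_batch(<batch_id>);                      -- ... then acknowledge the batch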
On 5/11/2010 9:36 AM, Marko Kreen wrote:
> Seems we are in agreement.

That's always a good point to start from.

> And although PgQ can operate with any N >= 2 segments, it queries
> 2 at a time, same as Slony. The rest are just there to give admins
> some safety room for "OH F*CK" moments. With short rotation times,
> that starts to seem useful.

Agreed. The rotation time should actually reflect the longest-running transactions frequently experienced by the application. And there needs to be a safeguard against rotating over even longer running transactions.

The problem with a long running transaction is that it could have written into log segment 1 before we switched to segment 2. We can only TRUNCATE segment 1 after that transaction committed AND the log has been consumed by everyone interested in it.

I am not familiar with how PgQ/Londiste do this. Slony specifically remembers the highest XID in progress at the time of switching, waits until the lowest XID in progress is higher than that (so all log that ever went into that segment is now visible or aborted), then waits for all log in that segment to be confirmed, and finally truncates the log. All this time, it needs to do the UNION query over both log segments.

> There does not seem to be any advantage to querying more than 2 segments.

I didn't experiment with such an implementation yet. I'll theorize about that in a separate thread later.

>> No. I don't know how you read 1) into the above, and 2) was my
>> misunderstanding reading the wiki. I don't want either.
>
> Oh sorry, I got that impression from the wiki, not from you.
>
> As there are some ideas from you on the wiki, I assumed
> you were involved, so I used "you" very liberally.

No problem. I misinterpreted stuff there as "the currently favored idea" too.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On Tue, 2010-05-11 at 17:03 +0300, Marko Kreen wrote:
> On 5/11/10, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
>>> What are the advantages of anything proposed over the current
>>> implementations used by Londiste and Slony?
>>
>> It would be good to have a core technology that provided a generic
>> transport to other remote databases.
>
> I suspect there still should be some sort of middleware code
> that reads the data from Postgres and writes it to the other db.
>
> So the task of the GDQ should be to make data available to that
> reader, not to be "transport to remote databases", no?

Yes; for maximum flexibility, user code at both ends would be good.

--
Simon Riggs
www.2ndQuadrant.com
On 5/11/2010 9:19 AM, Simon Riggs wrote:
> On Tue, 2010-05-11 at 08:33 -0400, Jan Wieck wrote:
>> What are the advantages of anything proposed over the current
>> implementations used by Londiste and Slony?
>
> It would be good to have a core technology that provided a generic
> transport to other remote databases.
>
> We already have WALSender and WALReceiver, which use the COPY protocol
> as a transport mechanism. It would be easy to extend that so we could
> send other forms of data.
>
> We can do that in two ways:
>
> * Alter triggers so that Slony/Londiste write directly to WAL rather
> than to log tables, using a new WAL record for custom data blobs.

Londiste and Slony "consume" the log data in a different order than it appears in the WAL. Using WAL would mean moving a lot of complexity, which is currently handled by an MVCC-style grouping, from the log origin to the log consumers.

> * Alter WALSender so it can read Slony/Londiste log tables for
> consumption by an architecture similar to WALReceiver/Startup. Probably
> easier.

Only if that alteration also means being able to

1) hand WALSender the from and to snapshots
2) have WALSender send the UNION of multiple log tables ordered by the
   event/action ID

Because that is how both Londiste and Slony consume the log.

> We can also alter the WAL format itself to include the information in
> WAL that is required to do what Slony/Londiste already do, so we don't
> need to specifically write anything at all, just read WAL at the other
> end. Even more efficient.
>
> The advantages of these options would be
>
> * integration of core technologies
> * greater efficiency for trigger-based logging via WAL

I'm still unclear how we can ensure cross-version functionality when using such core technology. Are you implying that a 9.3 WALReceiver will always be able to consume the data format sent by a 9.1 WALSender?

> In other RDBMSs "replication" has long meant "data transport, either for
> HA or application use". We should be looking beyond the pure HA aspects,
> as pgq does.

Slony replication has meant both too from the beginning.

> I would certainly like to see a system that wrote data on the master and
> then constructed the SQL on the receiver side (i.e. on the slave), so the
> integration was less tight. That would allow data to be sent and consumed
> for a variety of purposes, not just HA replay.

Slony does exactly that constructing of SQL on the receiver side, and it is a big drawback, because every single row update needs to go through a separate SQL query that is parsed, planned and optimized.

I can envision a generic function that takes the data format recorded by the capture trigger on the master and turns it into a simple plan. All these single-row updates/deletes are PK based, so there is no need to even think about parsing and planning them over and over. Just replace the target list to reflect whatever this log row updates and execute it. The values will always be a literal value from the log or the OLD value for untouched fields. Simple enough.

The big advantage of such generic support would be that systems like Londiste/Slony could use the existing COPY-SELECT mechanism to transport the log in a streaming protocol, while a BEFORE INSERT trigger on the receiver's log segments turns it into highly efficient single-row operations.

This generic single-row change capture and single-row update support would allow Londiste/Slony-type replication systems to eliminate most round-trip-based latency, a lot of CPU usage on the replicas, plus all the libpq and SQL query assembly in the replication engine itself.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
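A very rough sketch of the receiver-side idea above, specialized to one hypothetical table; a real implementation would be generic, and the log-row layout here is invented. The streamed log is COPYed into a replica-side table whose BEFORE INSERT trigger applies each row as a PK-based operation, using static statements whose plans PL/pgSQL caches:

    CREATE OR REPLACE FUNCTION apply_accounts_log() RETURNS trigger AS $$
    BEGIN
        IF NEW.ev_op = 'INSERT' THEN
            INSERT INTO accounts (id, balance) VALUES (NEW.id, NEW.balance);
        ELSIF NEW.ev_op = 'UPDATE' THEN
            UPDATE accounts SET balance = NEW.balance WHERE id = NEW.id;
        ELSE
            DELETE FROM accounts WHERE id = NEW.id;
        END IF;
        RETURN NULL;   -- swallow the log row; nothing is stored on the replica
    END;
    $$ LANGUAGE plpgsql;

    -- accounts_log_in is the (made-up) table the streamed log is COPYed into.
    CREATE TRIGGER accounts_log_apply BEFORE INSERT ON accounts_log_in
        FOR EACH ROW EXECUTE PROCEDURE apply_accounts_log();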
On Tue, 2010-05-11 at 10:38 -0400, Jan Wieck wrote: > Slony replication has meant both too from the beginning. You've done a brilliant job and I have huge respect for that. MHO: The world changes and new solutions emerge. Assimilation of technology into lower layers of the stack has been happening for years. The core parts of Slony should be assimilated, just as TCP/IP now exists as part of the OS, to the benefit of all. Various parts of Slony have already moved to core. Slony continues to have huge potential, though as part of an evolution, not in all cases fulfilling the same role it did at the beginning. Log shipping cannot easily exist outside of core, though SQL shipping can: but should it? How much more could we do? -- Simon Riggs www.2ndQuadrant.com
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 5/11/2010 9:36 AM, Marko Kreen wrote:
>> And although PgQ can operate with any N >= 2 segments, it queries
>> 2 at a time, same as Slony. The rest are just there to give admins
>> some safety room for "OH F*CK" moments. With short rotation times,
>> that starts to seem useful.
>
> Agreed. The rotation time should actually reflect the longest-running
> transactions frequently experienced by the application. And there
> needs to be a safeguard against rotating over even longer running
> transactions.

Nightly pg_dump.. ;)

> The problem with a long running transaction is that it could have written
> into log segment 1 before we switched to segment 2. We can only TRUNCATE
> segment 1 after that transaction committed AND the log has been consumed
> by everyone interested in it.
>
> I am not familiar with how PgQ/Londiste do this. Slony specifically
> remembers the highest XID in progress at the time of switching, waits
> until the lowest XID in progress is higher than that (so all log that
> ever went into that segment is now visible or aborted), then waits for
> all log in that segment to be confirmed, and finally truncates the log.
> All this time, it needs to do the UNION query over both log segments.

The "highest XID" here actually means "own transaction"? And it's not committed yet? That seems to leave transactions that happen before its own commit in a dubious state?

Although you may be fine if you don't try to minimize reading both tables.

PgQ does this:

Rotate:
1) If some consumer reads the older table, don't rotate.
2) Set table_nr++, switch_step1 = txid_current(), switch_step2 = NULL
3) Commit
4) Set switch_step2 = txid_current() where switch_step2 IS NULL
5) Commit

Reader:
1) xmin1 = xmin of lower snapshot of batch
2) xmax2 = xmax of higher snapshot of batch
3) if xmax2 < switch_step1, read older table
4) if xmin1 > switch_step2, read newer table
5) otherwise read both

-- 
marko
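Spelled out as SQL, the rotation and reader steps above would look roughly like this. This is not PgQ's actual code; the bookkeeping table and its columns are made up:

    -- Rotation, two separate transactions (steps 2-3 and 4-5):
    BEGIN;
    UPDATE queue_state
       SET cur_table    = cur_table + 1,
           switch_step1 = txid_current(),
           switch_step2 = NULL;
    COMMIT;

    BEGIN;
    UPDATE queue_state
       SET switch_step2 = txid_current()
     WHERE switch_step2 IS NULL;
    COMMIT;

    -- Reader, deciding which table(s) to scan for a batch bounded by the
    -- snapshots lo_snap and hi_snap (PL/pgSQL-ish pseudocode):
    --   IF txid_snapshot_xmax(hi_snap) < switch_step1 THEN
    --       scan the older table only;
    --   ELSIF txid_snapshot_xmin(lo_snap) > switch_step2 THEN
    --       scan the newer table only;
    --   ELSE
    --       scan both, e.g. with UNION ALL;
    --   END IF;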
On 5/11/2010 11:20 AM, Marko Kreen wrote: > On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: >> On 5/11/2010 9:36 AM, Marko Kreen wrote: >> > And although PgQ can operate with any N >= 2 segments, it queries >> > on 2 at a time, same as Slony. Rest are just there to give admins >> > some safety room for "OH F*CK" moments. With short rotation times, >> > it starts to seem useful.. >> > >> >> Agreed. The rotation time should actually reflect the longest running >> transactions experienced on a frequent base from the application. And there >> needs to be a safeguard against rotating over even longer running >> transactions. > > Nightly pg_dump.. ;) > >> The problem with a long running transaction is that it could have written >> into log segment 1 before we switched to segment 2. We can only TRUNCATE >> segment 1 after that transaction committed AND the log has been consumed by >> everyone interested in it. >> >> I am not familiar with how PgQ/Londiste do this. Slony specifically >> remembers the highest XID in progress at the time of switching, waits until >> the lowest XID in progress is higher than that (so all log that ever went >> into that segment is now visible or aborted), then waits for all log in that >> segment to be confirmed and finally truncates the log. All this time, it >> needs to do the UNION query over both log segments. > > The "highest XID" means actually "own transaction" here? > And it's not committed yet? That's seems to leave transactions > that happen before it's own commit into dubious state? One needs to tell transactions to switch log, commit, then look at the highest running XID after that. Any XID lower/equal to that one could possibly have written into the old segment. > > Although you may be fine, if you don't try to minimize > reading both tables. > > PgQ does this: > > Rotate: > 1) If some consumer reads older table, don't rotate. > 2) Set table_nr++, switch_step1 = txid_current(), switch_step2 = NULL > 3) Commit > 4) Set switch_step2 = txid_current() where switch_step2 IS NULL > 5) Commit Right, exactly like that :) > Reader: > 1) xmin1 = xmin of lower snapshot of batch > 2) xmax2 = xmax of higher snapshot of batch > 3) if xmax2 < switch_step1, read older table > 4) if xmin1 > switch_step2, read newer table > 5) otherwise read both Sounds familiar. I still don't know exactly what role the 3rd log segment plays in that, but it sure cannot hurt. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/11/2010 11:11 AM, Simon Riggs wrote:
> On Tue, 2010-05-11 at 10:38 -0400, Jan Wieck wrote:
>> Slony replication has meant both too from the beginning.
>
> You've done a brilliant job and I have huge respect for that.
>
> MHO: The world changes and new solutions emerge. Assimilation of
> technology into lower layers of the stack has been happening for years.
> The core parts of Slony should be assimilated, just as TCP/IP now exists
> as part of the OS, to the benefit of all. Various parts of Slony have
> already moved to core. Slony continues to have huge potential, though as
> part of an evolution, not in all cases fulfilling the same role it did
> at the beginning. Log shipping cannot easily exist outside of core,
> though SQL shipping can: but should it? How much more could we do?

I don't have any problem with assimilation of technology or moving things into core if appropriate. What I have a problem with is stuffing things into core for minor advantages, then later discovering that we lost flexibility essential for important features.

Right now one can use Slony 2.0 to do PostgreSQL major version upgrades via switchover. Using pgbouncer, these can even be done transparently to the application, without the need to reconnect to the new master. I think Londiste has, or is at least working on, similar features. This is because Slony 2.0 is a separate product relying only on very stable core functionality, like txids and snapshots.

Are you ready to "guarantee" that the queue and transport mechanism you want to put into core is THAT stable and major-version independent? I would not, but that may be just me.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: > On 5/11/2010 11:20 AM, Marko Kreen wrote: > > On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: > > > On 5/11/2010 9:36 AM, Marko Kreen wrote: > > > > And although PgQ can operate with any N >= 2 segments, it queries > > > > on 2 at a time, same as Slony. Rest are just there to give admins > > > > some safety room for "OH F*CK" moments. With short rotation times, > > > > it starts to seem useful.. > > > > > > > > > > Agreed. The rotation time should actually reflect the longest running > > > transactions experienced on a frequent base from the application. And > there > > > needs to be a safeguard against rotating over even longer running > > > transactions. > > > > > > > Nightly pg_dump.. ;) > > > > > > > The problem with a long running transaction is that it could have > written > > > into log segment 1 before we switched to segment 2. We can only TRUNCATE > > > segment 1 after that transaction committed AND the log has been consumed > by > > > everyone interested in it. > > > > > > I am not familiar with how PgQ/Londiste do this. Slony specifically > > > remembers the highest XID in progress at the time of switching, waits > until > > > the lowest XID in progress is higher than that (so all log that ever > went > > > into that segment is now visible or aborted), then waits for all log in > that > > > segment to be confirmed and finally truncates the log. All this time, it > > > needs to do the UNION query over both log segments. > > > > > > > The "highest XID" means actually "own transaction" here? > > And it's not committed yet? That's seems to leave transactions > > that happen before it's own commit into dubious state? > > > > One needs to tell transactions to switch log, commit, then look at the > highest running XID after that. Any XID lower/equal to that one could > possibly have written into the old segment. Yeah, sounds fine. Except you cannot ignore the newer table with that. But that makes difference only for consumers that are lagging. > > Although you may be fine, if you don't try to minimize > > reading both tables. > > > > PgQ does this: > > > > Rotate: > > 1) If some consumer reads older table, don't rotate. > > 2) Set table_nr++, switch_step1 = txid_current(), switch_step2 = NULL > > 3) Commit > > 4) Set switch_step2 = txid_current() where switch_step2 IS NULL > > 5) Commit > > > > Right, exactly like that :) > > > > Reader: > > 1) xmin1 = xmin of lower snapshot of batch > > 2) xmax2 = xmax of higher snapshot of batch > > 3) if xmax2 < switch_step1, read older table > > 4) if xmin1 > switch_step2, read newer table > > 5) otherwise read both > > > > Sounds familiar. I still don't know exactly what role the 3rd log segment > plays in that, but it sure cannot hurt. It makes sure you have one rotation_period of events always available. In case you want to do some recovery on them. But that's it. -- marko
Re: GDQ implementation (was: Re: Clustering features for upcoming developer meeting -- please claim yours!)
On 5/11/10, Jan Wieck <JanWieck@yahoo.com> wrote: > I changed the subject line because we are diving deep into implementation > details. Here is my take on various issued related to queueing: http://wiki.postgresql.org/wiki/GDQIssues Feel free to add / re-prioritize the list. -- marko
Jan, Marko, Simon, I'm concerned that doing anything about the write overhead issue was discarded almost immediately in this discussion. This is not a trivial issue for performance; it means that each row which is being tracked by the GDQ needs to be written to disk a minimum of 4 times (once to WAL, once to table, once to WAL for queue, once to queue). That's at least one time too many, and effectively doubles the load on the master server. This is particularly unacceptable overhead for systems where users are not that interested in retaining the queue after an unexpected shutdown. Surely there's some way around this? Some kind of special fsync-on-write table, for example? The access pattern to a queue is quite specialized. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
----- Original message -----
> Jan, Marko, Simon,
>
> I'm concerned that doing anything about the write overhead issue was
> discarded almost immediately in this discussion. This is not a trivial
> issue for performance; it means that each row which is being tracked by
> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
> once to table, once to WAL for queue, once to queue). That's at least
> one time too many, and effectively doubles the load on the master server.
>
> This is particularly unacceptable overhead for systems where users are
> not that interested in retaining the queue after an unexpected shutdown.
>
> Surely there's some way around this? Some kind of special
> fsync-on-write table, for example? The access pattern to a queue is
> quite specialized.
Uh, this seems like purely theoretical speculation, which
also ignores the "generic queue" aspect.
In practice, with databases where there are more reads than
writes, the additional queue write seems insignificant.
So I guess it's up to you to bring hard proof that the
additional writes are a problem.
If we are already speculating, I'd guess that writing to
WAL and an INSERT-only queue table involves a lot less
seeking than writing to the actual table.
But feel free to edit the "Goals" section, unless you are
talking about non-transactional queueing, which seems
off-topic here.
--
marko
On Mon, 2010-05-17 at 14:46 -0700, Josh Berkus wrote:
> Jan, Marko, Simon,
>
> I'm concerned that doing anything about the write overhead issue was
> discarded almost immediately in this discussion.

The only thing we can do about write overhead _on_master_ is to trade it for transaction boundary reconstruction on the slave (or a special intermediate node), effectively implementing a "logical WAL" in addition to (or as an extension of) the current WAL.

> This is not a trivial
> issue for performance; it means that each row which is being tracked by
> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
> once to table, once to WAL for queue, once to queue).

In reality the WAL record for the main table is most times forced to disk in the same WAL write as the WAL record for the queue. And the actual queue page does not reach disk at all if queue rotation is fast.

> That's at least
> one time too many, and effectively doubles the load on the master server.

It doubles the "throughput/sequential load" on the fs cache but does much less to the "number of fsyncs", as all those writes are done within the same transaction and only the WAL writes need to get to disk.

In my unscientific tests with pgbench, adding FKs between the pgbench tables plus adding a PK to the log table had a bigger performance impact than setting up replication using Londiste.

> This is particularly unacceptable overhead for systems where users are
> not that interested in retaining the queue after an unexpected shutdown.

Users not needing data after an unexpected shutdown should use temp tables. If several users need the same data, then global temp tables should be implemented / used.

> Surely there's some way around this? Some kind of special
> fsync-on-write table, for example?

This is sure to have a large negative performance impact. WAL was added to PostgreSQL for just this - to get rid of fsync-on-commit (fsync-on-write is as bad as or worse than fsync-on-commit).

> The access pattern to a queue is
> quite specialized.

A generic solution for such users would be implementing Global Temporary Tables (which need no WAL), and then using these for a non-persistent GDQ.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
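For the non-persistent case mentioned above, a plain temp table already gives a non-WAL-logged queue whose contents are simply lost on crash or disconnect — but it is visible to one session only, which is exactly why a shared queue would need global temporary tables. A sketch, with made-up names:

    -- Not WAL-logged, gone after crash or disconnect, and visible only to
    -- the session that created it.
    CREATE TEMP TABLE session_queue (
        ev_id   bigserial PRIMARY KEY,
        ev_data text NOT NULL
    ) ON COMMIT PRESERVE ROWS;

    INSERT INTO session_queue (ev_data) VALUES ('payload');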
Josh, could you give more details on this item:

* Something which will work with other databases and caching systems?

What exactly are you thinking of here? Compatibility with some APIs? Frameworks? Having some low-level guarantees?

-- 
marko
On 5/17/2010 5:46 PM, Josh Berkus wrote:
> Jan, Marko, Simon,
>
> I'm concerned that doing anything about the write overhead issue was
> discarded almost immediately in this discussion. This is not a trivial
> issue for performance; it means that each row which is being tracked by
> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
> once to table, once to WAL for queue, once to queue). That's at least
> one time too many, and effectively doubles the load on the master server.
>
> This is particularly unacceptable overhead for systems where users are
> not that interested in retaining the queue after an unexpected shutdown.
>
> Surely there's some way around this? Some kind of special
> fsync-on-write table, for example? The access pattern to a queue is
> quite specialized.

I recall this slightly differently. The idea of a PostgreSQL-managed queue that does NOT guarantee consistency with the final commit status of the message-generating transactions was discarded. That is not the same as ignoring the write overhead.

In all our existing use cases (Londiste/Slony/Bucardo), the information in the queue cannot be entirely found in the WAL of the original underlying row operation. There are old row key values and sequence numbers or other meta information that isn't even known at the time the original row's WAL entry is written.

It may seem possible to implement the data-capturing part of the queue within the heap access methods, add the extra information to the WAL record and thus get rid of one of the images. But that isn't as simple as it sounds. Since queue tables have toast tables too, they don't consist of simply one "log entry"; they actually consist of a bunch of tuples: one in the queue table, 0-n in the queue's toast table, and then the index tuples. In the case of compression, the binary data in the toasted queue attribute will be entirely different from what you may find in the WAL pieces that were written for the original data row's toast segments.

It is going to be a heck of a forensics job to reconstruct all that.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On Tue, 2010-05-18 at 01:53 +0200, Hannu Krosing wrote: > On Mon, 2010-05-17 at 14:46 -0700, Josh Berkus wrote: > > Jan, Marko, Simon, > > > > I'm concerned that doing anything about the write overhead issue was > > discarded almost immediately in this discussion. > > Only thing we can do to write overhead _on_master_ is to trade it for > transaction boundary reconstruction on slave (or special intermediate > node), effectively implementing a "logical WAL" in addition to (or as an > extension of) the current WAL. That does sound pretty good to me. Fairly easy to make the existing triggers write XLOG_NOOP WAL records directly rather than writing to a queue table, which also gets logged to WAL. We could just skip the queue table altogether. Even better would be extending WAL format to include all the information you need, so it gets written to WAL just once. > > This is not a trivial > > issue for performance; it means that each row which is being tracked by > > the GDQ needs to be written to disk a minimum of 4 times (once to WAL, > > once to table, once to WAL for queue, once to queue). > > In reality the WAL record for main table is forced to disk mosttimes in > the same WAL write as the WAL record for queue. And the actual queue > page does not reach disk at all if queue rotation is fast. Josh, you really should do some measurements to show the overheads. Not sure you'll get people just to accept that assertion otherwise. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-05-20 at 20:51 +0100, Simon Riggs wrote:
> On Tue, 2010-05-18 at 01:53 +0200, Hannu Krosing wrote:
>> On Mon, 2010-05-17 at 14:46 -0700, Josh Berkus wrote:
>>> Jan, Marko, Simon,
>>>
>>> I'm concerned that doing anything about the write overhead issue was
>>> discarded almost immediately in this discussion.
>>
>> The only thing we can do about write overhead _on_master_ is to trade it
>> for transaction boundary reconstruction on the slave (or a special
>> intermediate node), effectively implementing a "logical WAL" in addition
>> to (or as an extension of) the current WAL.
>
> That does sound pretty good to me.
>
> Fairly easy to make the existing triggers write XLOG_NOOP WAL records
> directly rather than writing to a queue table, which also gets logged to
> WAL. We could just skip the queue table altogether.
>
> Even better would be extending WAL format to include all the information
> you need, so it gets written to WAL just once.

Maybe it is also possible (less intrusive / easier to implement) to add some things to WAL that have met resistance as general trigger-based features, like a "logical representation" of DDL.

We already have the equivalent of minimal ON COMMIT / ON ROLLBACK triggers in the form of commit/rollback records in WAL.

Also, if we use extended WAL as the GDQ, then there should be a possibility to write WAL in a form that supports only the "logical" (plus, of course, Durability) features but not full backup and WAL-based replication.

And a possibility to have "user-defined" WAL records for specific tasks would also be a nice, PostgreSQL-style extensibility feature.

>>> This is not a trivial
>>> issue for performance; it means that each row which is being tracked by
>>> the GDQ needs to be written to disk a minimum of 4 times (once to WAL,
>>> once to table, once to WAL for queue, once to queue).
>>
>> In reality the WAL record for the main table is most times forced to disk
>> in the same WAL write as the WAL record for the queue. And the actual
>> queue page does not reach disk at all if queue rotation is fast.
>
> Josh, you really should do some measurements to show the overheads. Not
> sure you'll get people just to accept that assertion otherwise.

--
Hannu Krosing
http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training