Re: protocol-level wait-for-LSN

From: Ants Aasma
On Mon, 28 Oct 2024 at 17:51, Peter Eisentraut <peter@eisentraut.org> wrote:
> This is something I hacked together on the way back from pgconf.eu.
> It's highly experimental.
>
> The idea is to do the equivalent of pg_wal_replay_wait() on the protocol
> level, so that it is ideally fully transparent to the application code.
> The application just issues queries, and they might be serviced by a
> primary or a standby, but there is always a correct ordering of reads
> after writes.

The idea is great; I have been wanting something like this for a long
time. For future-proofing it might be a good idea not to require the
communicated and waited-on value to be an LSN.

In a sharded database, a Lamport timestamp would allow for sequential
consistency. A Lamport timestamp is just some monotonically increasing
value that is eagerly shared between all communicating participants,
including clients. For a single cluster, LSNs work fine for this
purpose. But with multiple shards, LSNs will not work unless arranged
as a vector clock, which is what I think Matthias proposed.
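
To make the "shared between participants" rule concrete, here is a
minimal single-threaded sketch in C (purely illustrative, nothing from
the patch; a real server implementation would need atomics):

#include <stdint.h>

static uint64_t lamport_clock = 0;   /* node-local logical clock */

/* Before sending: tick, and attach the new value to the message. */
static uint64_t
lamport_send(void)
{
    return ++lamport_clock;
}

/* On receive: set-if-greater merge with the incoming value, then tick. */
static void
lamport_recv(uint64_t received)
{
    if (received > lamport_clock)
        lamport_clock = received;
    lamport_clock++;
}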

Even without sharding, the LSN might not be the final choice. Right now
on the primary the visibility order is not LSN order. So if a connection
commits with synchronous_commit = off, the write location is not even
going to include that commit. Publishing the end of the commit record
would be better. But I assume at some point we would like to have a
consistent visibility order, which quite likely means using something
other than the LSN as the logical clock.

I see the patch names the field LSN, but at the protocol level and for
the client library this is just an opaque 127 byte token. So basically
I'm thinking the naming could be more generic. And for a complete
Lamport timestamp implementation we would need the capability of
extracting the last seen value, plus a set-if-greater update
operation.

-- 
Ants Aasma
www.cybertec-postgresql.com



Re: protocol-level wait-for-LSN

From: Jelte Fennema-Nio
On Wed, 30 Oct 2024 at 18:18, Ants Aasma <ants.aasma@cybertec.at> wrote:
> The idea is great; I have been wanting something like this for a long
> time. For future-proofing it might be a good idea not to require the
> communicated and waited-on value to be an LSN.

Yours and Matthias' feedback makes total sense, I think. From an
implementation perspective I think there are a few things necessary to
enable these wider use cases:
1. The token should be considered opaque to clients (this should be documented)
2. The token should be defined as variable-length in the protocol
3. We should have a hook to allow postgres extensions to override the
default token generation
4. We should have a hook to allow postgres extensions to override
waiting until the token "timestamp" (3 and 4 are sketched below)
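
Roughly, in C, with names and signatures invented here just to sketch
the shape (nothing like this exists in core, and the token struct is
hypothetical):

typedef struct WaitToken
{
    int32   len;    /* variable-length, per point 2 above */
    char    data[FLEXIBLE_ARRAY_MEMBER];
} WaitToken;

/* Hook 3: called when the server is about to report progress to the
 * client; the default implementation would return the current LSN. */
typedef WaitToken *(*generate_wait_token_hook_type) (void);
extern generate_wait_token_hook_type generate_wait_token_hook;

/* Hook 4: called when a query arrives carrying a token from the
 * client; blocks until this node has caught up to the token. */
typedef void (*wait_for_token_hook_type) (const WaitToken *token);
extern wait_for_token_hook_type wait_for_token_hook;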

> Even without sharding, the LSN might not be the final choice. Right now
> on the primary the visibility order is not LSN order. So if a connection
> commits with synchronous_commit = off, the write location is not even
> going to include that commit. Publishing the end of the commit record
> would be better. But I assume at some point we would like to have a
> consistent visibility order, which quite likely means using something
> other than the LSN as the logical clock.

I was going to say that the default could probably still be the LSN,
but this makes me doubt that. Is there some other token that we could
send now and "wait" on instead of the LSN, one that would also work for
synchronous_commit = off? If not, I think the LSN is still probably a
good choice as the default, or maybe only as the default when
synchronous_commit != off.



Re: protocol-level wait-for-LSN

From: Jesper Pedersen
Hi,

On 10/30/24 1:45 PM, Jelte Fennema-Nio wrote:
> On Wed, 30 Oct 2024 at 18:18, Ants Aasma <ants.aasma@cybertec.at> wrote:
>> The idea is great; I have been wanting something like this for a long
>> time. For future-proofing it might be a good idea not to require the
>> communicated and waited-on value to be an LSN.
> 
> Yours and Matthias' feedback makes total sense, I think. From an
> implementation perspective I think there are a few things necessary to
> enable these wider use cases:
> 1. The token should be considered opaque to clients (this should be documented)
> 2. The token should be defined as variable-length in the protocol
> 3. We should have a hook to allow postgres extensions to override the
> default token generation
> 4. We should have a hook to allow postgres extensions to override
> waiting until the token "timestamp"
> 
>> Even without sharding, the LSN might not be the final choice. Right now
>> on the primary the visibility order is not LSN order. So if a connection
>> commits with synchronous_commit = off, the write location is not even
>> going to include that commit. Publishing the end of the commit record
>> would be better. But I assume at some point we would like to have a
>> consistent visibility order, which quite likely means using something
>> other than the LSN as the logical clock.
> 
> I was going to say that the default could probably still be the LSN,
> but this makes me doubt that. Is there some other token that we could
> send now and "wait" on instead of the LSN, one that would also work for
> synchronous_commit = off? If not, I think the LSN is still probably a
> good choice as the default, or maybe only as the default when
> synchronous_commit != off.
> 

There are known wish-lists for a protocol v4, like

  https://github.com/pgjdbc/pgjdbc/blob/master/backend_protocol_v4_wanted_features.md

and there are a lot of clean-room implementations in drivers and
embedded in projects/products.

Having the LSN would be nice, but breaking all existing
implementations? No. Having to specify via startup parameters what a
core message format looks like sounds like a bad idea to me;

  https://www.postgresql.org/docs/devel/protocol-message-formats.html

should be the definitive description.

If we want to start on a protocol v4 thing then that is ok - but there 
are a lot of feature requests for that one.

Best regards,
  Jesper




Re: protocol-level wait-for-LSN

From: Jelte Fennema-Nio
On Wed, 30 Oct 2024 at 19:04, Jesper Pedersen
<jesper.pedersen@comcast.net> wrote:
> Having the LSN would be nice, but breaking all existing
> implementations? No. Having to specify via startup parameters what a
> core message format looks like sounds like a bad idea to me;

It would really help if you would explain why you think it's a bad
idea to use a startup parameter for that, instead of simply stating
that you think it needs a major protocol version bump.

The point of enabling it through a startup parameter (aka protocol
option) is exactly so that it will not break any existing
implementations. If clients request the protocol option (which, as the
name suggests, is optional), then they are expected to be able to parse
the new format. If they don't request it, they will get the old message
format. So no existing implementation will be broken. If some
middleware/proxy gets a request for a startup option it does not
support, it can advertise that to the client using the
NegotiateProtocolVersion message, allowing the client to continue in a
mode where the option is not enabled.

So, not bumping the major protocol version and enabling this feature
through a protocol option actually causes less breakage in practice.
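
To illustrate the flow (the option name here is hypothetical; protocol
options travel in the startup packet as parameters with the _pq_.
prefix):

C: StartupMessage
     user = app
     database = appdb
     _pq_.wait_for_lsn = 1
S: NegotiateProtocolVersion   (sent only if the server or a proxy does
     not recognize the option: it reports the newest minor protocol
     version it supports plus the list of unrecognized _pq_. options,
     and the client simply continues with the option disabled)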

Also, regarding the wishlist: I think it's much more likely for any of
those to happen in a minor version bump and/or protocol option than it
is that we'll bump the major protocol version.

P.S. Like I said in another email on this thread: I think for this
specific case I'd also prefer a separate new message, because that
makes it easier for PgBouncer to filter that message out on receipt.
But I'd still like to understand your viewpoint better on this, because
adding fields to existing message types is definitely one of the kinds
of protocol changes that I personally think would be fine.



Re: protocol-level wait-for-LSN

From: Jesper Pedersen
Hi,

On 10/30/24 3:49 PM, Jelte Fennema-Nio wrote:
> On Wed, 30 Oct 2024 at 19:04, Jesper Pedersen
> <jesper.pedersen@comcast.net> wrote:
>> Having the LSN would be nice, but breaking all existing
>> implementations? No. Having to specify via startup parameters what a
>> core message format looks like sounds like a bad idea to me;
> 
> It would really help if you would explain why you think it's a bad
> idea to use a startup parameter for that, instead of simply stating
> that you think it needs a major protocol version bump.
> 
> The point of enabling it through a startup parameter (aka protocol
> option) is exactly so that it will not break any existing
> implementations. If clients request the protocol option (which, as the
> name suggests, is optional), then they are expected to be able to parse
> the new format. If they don't request it, they will get the old message
> format. So no existing implementation will be broken. If some
> middleware/proxy gets a request for a startup option it does not
> support, it can advertise that to the client using the
> NegotiateProtocolVersion message, allowing the client to continue in a
> mode where the option is not enabled.
> 
> So, not bumping the major protocol version and enabling this feature
> through a protocol option actually causes less breakage in practice.
>

Yes, but it opens the door to everybody changing all message formats
via startup parameters.

And it will be confusing to clean-room implementations: when you have
this startup parameter you get these message formats, when you have
that startup parameter you get those message formats -- and what about
combinations? As Tatsuo-san stated up-thread.

You are also assuming that all PostgreSQL protocol implementations use
the Length (Int32) field very strictly - so if one developer adds the
startup parameter but doesn't change the underlying implementation,
everything will break.

The protocol format must be 100% clear and well-documented in all cases.

> Also regarding the wishlist. I think it's much more likely for any of
> those to happen in a minor version bump and/or protocol option than it
> is that we'll bump the major protocol version.
> 

I agree that protocol v4 is likely far out unless somebody wants to
coordinate the work needed.

> P.S. Like I said in another email on this thread: I think for this
> specific case I'd also prefer a separate new message, because that
> makes it easier to filter that message out when received by PgBouncer.
> But I'd still like to understand your viewpoint better on this,
> because adding fields to existing message types is definitely one of
> the types of changes that I personally think would be fine for some
> protocol changes.
> 

If this door is opened, then it has to be very clear how multiple
startup parameters are handled at the protocol level, and that is a
much bigger fish, because what happens if extensions add startup
parameters as well?

Adding a new message could be the way forward, but that opens the door 
for the wish-lists for v4.

Best regards,
  Jesper




Re: protocol-level wait-for-LSN

From: Jelte Fennema-Nio
On Thu, 31 Oct 2024 at 13:59, Jesper Pedersen
<jesper.pedersen@comcast.net> wrote:
> And it will be confusing to clean-room implementations: when you have
> this startup parameter you get these message formats, when you have
> that startup parameter you get those message formats -- and what about
> combinations? As Tatsuo-san stated up-thread.

I really don't understand why you think that's so difficult. To be
clear, no client is forced to implement any of these protocol options,
and all of them would be documented in the official protocol docs. For
instance, the ReadyForQuery entry on the "Message Formats" page could
easily be made to look like the following, which imho makes the
ordering of these fields very clear to any implementer of the protocol:

ReadyForQuery (B)
Byte1('Z')

Identifies the message type. ReadyForQuery is sent whenever the
backend is ready for a new query cycle.

Int32

Length of message contents in bytes, including self.

Int64: Only present if protocol option wait_for_lsn is set to 1 by the client

The LSN at time of commit

Int64: Only present if protocol option query_time is set to 1 by the client

Time it took to run the query in microseconds

Byte1

Current backend transaction status indicator. Possible values are 'I'
if idle (not in a transaction block); 'T' if in a transaction block;
or 'E' if in a failed transaction block (queries will be rejected
until block is ended).
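
A client that had negotiated both of these (hypothetical) options would
then read the message along these lines -- a sketch, not taken from any
driver; the field layout is known up front because the client itself
chose which options to request:

#include <stdint.h>
#include <stdbool.h>

static uint64_t
read_be_uint64(const uint8_t *p)    /* fields are in network byte order */
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* buf points just past the Int32 length word; returns the transaction
 * status byte ('I', 'T' or 'E'). */
static char
parse_ready_for_query(const uint8_t *buf, bool wait_for_lsn,
                      bool query_time, uint64_t *lsn, uint64_t *elapsed_us)
{
    if (wait_for_lsn)           /* present only if we asked for it */
    {
        *lsn = read_be_uint64(buf);
        buf += 8;
    }
    if (query_time)             /* likewise */
    {
        *elapsed_us = read_be_uint64(buf);
        buf += 8;
    }
    return (char) *buf;
}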

> You are also assuming that all PostgreSQL protocol implementations use
> the Length (Int32) field very strictly - so if one developer adds the
> startup parameter but doesn't change the underlying implementation,
> everything will break.

Yes... But that seems equivalent to saying: if a developer of a
Postgres client advertises that they support protocol v4 but doesn't
actually implement it, then everything will break.

I.e., it's the job of the client author not to send protocol options it
doesn't know anything about, just like it's the client author's job not
to request protocol versions it does not know anything about.

> The protocol format must be 100% clear and well-documented in all cases.

Agreed. See above.

> If this door is opened, then it has to be very clear how multiple
> startup parameters are handled at the protocol level, and that is a
> much bigger fish, because what happens if extensions add startup
> parameters as well?

Postgres extensions **cannot** add such startup parameters. Heikki
already mentioned that the naming was confusing in the docs. At this
point in time we're only discussing protocol changes that are coming
from Postgres core (which is already a contentious enough topic).



Re: protocol-level wait-for-LSN

From: Matthias van de Meent
On Wed, 30 Oct 2024, 18:45 Jelte Fennema-Nio, <postgres@jeltef.nl> wrote:
>
> On Wed, 30 Oct 2024 at 18:18, Ants Aasma <ants.aasma@cybertec.at> wrote:
> > The idea is great; I have been wanting something like this for a long
> > time. For future-proofing it might be a good idea not to require the
> > communicated and waited-on value to be an LSN.
>
> Yours and Matthias' feedback makes total sense, I think. From an
> implementation perspective I think there are a few things necessary to
> enable these wider use cases:
> 1. The token should be considered opaque to clients (this should be documented)

I disagree. It is critical that a consumer knows what to do with the
output; blindly passing it around is not a valid strategy. In my
example of keeping track of replication slots, the client also has to
keep track of every cluster ID to make it work correctly, as every
postgres instance may only know about a subset of the other PG
instances: a client would have to know how to discern and how to merge
the returned set of [cluster_id, LSN] pairs into its own view of global
progress:

Say you connect to cluster A, which receives changes from clusters X
and Y; cluster B, which receives from X and Z; and cluster C, which
receives from all of X, Y, and Z. Cluster B should ignore [Y_ID, LSN],
as keeping the [cluster_id, LSN] pair around would be sensitive to
resource attacks, but the client will have to merge the response from
that cluster to make sure it doesn't accidentally "go back in time"
when it switches from cluster A or B to another cluster with the "wait
for this minimal replication state" token.
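
A sketch of that client-side merge (types and names invented for
illustration): keep, per cluster ID, the highest LSN ever observed, so
switching servers can never move the client's view backwards:

#include <stdint.h>

#define MAX_CLUSTERS 16              /* illustration only */

typedef struct
{
    uint64_t cluster_id;
    uint64_t lsn;
} ClusterLsn;

typedef struct
{
    int        n;
    ClusterLsn entries[MAX_CLUSTERS];
} ProgressView;

/* Merge one [cluster_id, LSN] pair into the client's global view,
 * using set-if-greater per cluster ID. */
static void
merge_pair(ProgressView *view, uint64_t cluster_id, uint64_t lsn)
{
    for (int i = 0; i < view->n; i++)
    {
        if (view->entries[i].cluster_id == cluster_id)
        {
            if (lsn > view->entries[i].lsn)
                view->entries[i].lsn = lsn;
            return;
        }
    }
    if (view->n < MAX_CLUSTERS)      /* first time we see this cluster */
        view->entries[view->n++] = (ClusterLsn) { cluster_id, lsn };
}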

> > Even without sharding, the LSN might not be the final choice. Right now
> > on the primary the visibility order is not LSN order. So if a connection
> > commits with synchronous_commit = off, the write location is not even
> > going to include that commit. Publishing the end of the commit record
> > would be better. But I assume at some point we would like to have a
> > consistent visibility order, which quite likely means using something
> > other than the LSN as the logical clock.

Or have CSN=LSN-based snapshots on the primary too, as that would also
solve the unordered visibility issue on the primary, as well as the
unacknowledged read issue.

> I was going to say that the default could probably still be the LSN,
> but this makes me doubt that. Is there some other token that we could
> send now and "wait" on instead of the LSN, one that would also work for
> synchronous_commit = off? If not, I think the LSN is still probably a
> good choice as the default, or maybe only as the default when
> synchronous_commit != off.

I don't see how we can have anything but LSN as 'wait-for-this'
condition, as everything else could appear out-of-order in the WAL (we
don't allow the record to be modified during
XLogInsert()/ReserveXLogInsertLocation()), and WAL is our one source
of truth for change capture.

PS. I have other complaints about timestamp-based
replication/snapshots, but unless someone thinks otherwise and/or it
is made relevant I'll consider that off-topic.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)