Thread: Proposal: Snapshot cloning
Granted this one has a few open ends so far and I'd like to receive some constructive input on how to actually implement it.

The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:

backend1: start transaction;
backend1: set transaction isolation level serializable;
backend1: select pg_backend_pid();
backend1: select publish_snapshot();     -- will block

backend2: start transaction;
backend2: set transaction isolation level serializable;
backend2: select clone_snapshot(<pid>);  -- will unblock backend1

backend1: select publish_snapshot();

backend3: start transaction;
backend3: set transaction isolation level serializable;
backend3: select clone_snapshot(<pid>);

...

This will allow a number of separate backends to assume the same MVCC visibility, so that they can query independently but the overall result will be according to one consistent snapshot of the database.

What I am trying to accomplish with this is to widen a bottleneck many current Slony users are facing. The initial copy of a database is currently limited to one single reader copying a snapshot of the data provider. With the above functionality, several tables could be copied in parallel by different client threads, feeding separate backends on the receiving side at the same time. The feature could also be used by a parallel version of pg_dump as well as by data mining tools.

The cloning process needs to make sure that the clone_snapshot() call is made by the same DB user in the same database as the corresponding publish_snapshot() call. Since publish_snapshot() only publishes information that the transaction gained legally and that is visible in the PGPROC shared memory (xmin and xmax being the crucial parts here), there is no risk of creating a snapshot for which data might have been removed by vacuum already.

What I am not sure about yet is what IPC method would best suit the transfer of the arbitrarily sized xip vector. Ideas?

Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Thursday, 2007-01-25 at 22:19, Jan Wieck wrote:
> Granted this one has a few open ends so far and I'd like to receive some constructive input on how to actually implement it.
>
> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>
> backend1: start transaction;
> backend1: set transaction isolation level serializable;
> backend1: select pg_backend_pid();
> backend1: select publish_snapshot();     -- will block
>
> backend2: start transaction;
> backend2: set transaction isolation level serializable;
> backend2: select clone_snapshot(<pid>);  -- will unblock backend1
>
> backend1: select publish_snapshot();
>
> backend3: start transaction;
> backend3: set transaction isolation level serializable;
> backend3: select clone_snapshot(<pid>);
>
> ...
>
> This will allow a number of separate backends to assume the same MVCC visibility, so that they can query independently but the overall result will be according to one consistent snapshot of the database.

I see uses for this in implementing query parallelism in user-level code, like querying two child tables in two separate processes.

> What I am trying to accomplish with this is to widen a bottleneck many current Slony users are facing. The initial copy of a database is currently limited to one single reader copying a snapshot of the data provider. With the above functionality, several tables could be copied in parallel by different client threads, feeding separate backends on the receiving side at the same time.

I'm afraid that for most configurations this would make the copy slower, as there will be more random disk i/o. Maybe it would be better to fix Slony so that it allows initial copies in different parallel transactions, or just do the initial copy in several sets and merge the sets later.

> The feature could also be used by a parallel version of pg_dump as well as by data mining tools.
>
> The cloning process needs to make sure that the clone_snapshot() call is made by the same DB user in the same database as the corresponding publish_snapshot() call.

Why? The snapshot is universal and the same for the whole db instance, so why limit it to the same user/database?

> Since publish_snapshot() only publishes information that the transaction gained legally and that is visible in the PGPROC shared memory (xmin and xmax being the crucial parts here), there is no risk of creating a snapshot for which data might have been removed by vacuum already.
>
> What I am not sure about yet is what IPC method would best suit the transfer of the arbitrarily sized xip vector. Ideas?
>
> Jan

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Thu, 2007-01-25 at 22:19 -0500, Jan Wieck wrote:
> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>
> backend1: start transaction;
> backend1: set transaction isolation level serializable;
> backend1: select pg_backend_pid();
> backend1: select publish_snapshot();     -- will block

Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.

Two questions:

- why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.

- why just serializable snapshots?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Jan Wieck" <JanWieck@Yahoo.com> writes: > backend1: select publish_snapshot(); -- will block > > backend2: start transaction; > backend2: set transaction isolation level serializable; > backend2: select clone_snapshot(<pid>); -- will unblock backend1 It seems simpler to have a current_snapshot() function that returns an bytea or a new snapshot data type which set_current_snapshot(bytea) took to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please. set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin. That could be handy for debugging purposes too. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 1/26/2007 8:06 AM, Gregory Stark wrote:
> "Jan Wieck" <JanWieck@Yahoo.com> writes:
>
>> backend1: select publish_snapshot();     -- will block
>>
>> backend2: start transaction;
>> backend2: set transaction isolation level serializable;
>> backend2: select clone_snapshot(<pid>);  -- will unblock backend1
>
> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>
> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.

That would solve the backend to backend IPC problem nicely.

Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Friday, 2007-01-26 at 12:25, Simon Riggs wrote:
> On Thu, 2007-01-25 at 22:19 -0500, Jan Wieck wrote:
>
>> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>>
>> backend1: start transaction;
>> backend1: set transaction isolation level serializable;
>> backend1: select pg_backend_pid();
>> backend1: select publish_snapshot();     -- will block
>
> Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.

Do you mean that vacuum would clean up tuples still visible to pg_dump?

> Two questions:
> - why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.

As I see it, it has to block so that its transaction would not end, so that the system knows that it can't yet remove tuples in that snapshot.

And it should block until all its consumers have ended their use of the published snapshot.

> - why just serializable snapshots?

There's probably no point acquiring it in a read-committed transaction when the next command will revert to its own snapshot anyway.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Fri, 2007-01-26 at 16:09 +0200, Hannu Krosing wrote:
> On Friday, 2007-01-26 at 12:25, Simon Riggs wrote:
>
>> Two questions:
>> - why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.
>
> As I see it, it has to block so that its transaction would not end, so that the system knows that it can't yet remove tuples in that snapshot.
>
> And it should block until all its consumers have ended their use of the published snapshot.

Agreed that the Snapshot must be visible to all, but that's no reason why the original call has to block, just that we must do something to prevent the Snapshot from disappearing from view.

>> - why just serializable snapshots?
>
> There's probably no point acquiring it in a read-committed transaction when the next command will revert to its own snapshot anyway.

But the stated use case was to share snapshots, which seems valid whatever the type of Snapshot. One of the stated cases was parallel query...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Fri, 2007-01-26 at 16:09 +0200, Hannu Krosing wrote:
> On Friday, 2007-01-26 at 12:25, Simon Riggs wrote:
>
>> Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.
>
> Do you mean that vacuum would clean up tuples still visible to pg_dump?

No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin, and we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM. Sure it isn't every backend, but the details may be useful. So this is an additional benefit to this proposal.

(There's a hole in the above idea, so don't jump on my back to explain it - I see it and am trying to work out a way around it...)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
stark@enterprisedb.com (Gregory Stark) writes:
> "Jan Wieck" <JanWieck@Yahoo.com> writes:
>
>> backend1: select publish_snapshot();     -- will block
>>
>> backend2: start transaction;
>> backend2: set transaction isolation level serializable;
>> backend2: select clone_snapshot(<pid>);  -- will unblock backend1
>
> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>
> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.
>
> That could be handy for debugging purposes too.

Here's a wild thought...

Would there be any sense in setting up the ability to declare a transaction's visibility parameters expressly?

Consider that the Slony-I sl_event table records ev_minxid, ev_maxxid, and ev_xip. Grabbing a sample from an instance:

  ev_minxid  | ev_maxxid  | ev_xip
  1377591608 | 1377591612 | '1377591608','1377591610'

Would it be plausible to, in effect, assert these things? To say:

  start transaction;
  set transaction isolation level serializable;
  select set_transaction_visibility(1377591608, 1377591612, [1377591608, 1377591610]);

And thus assert the visibility that was recorded at that point in time?

I may very well have the parameters characterized in a wrong way; please assume an appropriate way instead as needed :-).

This would permit, if I am seeing this right, a way that you could, in effect, get a form of "time travel" via this, where you'd be able to arbitrarily point at different forms of data visibility. The wild part being that you could assert data visibility declarations that a normal connection couldn't naturally obtain...

--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxfinances.info/info/multiplexor.html
Sturgeon's Law: 90% of *EVERYTHING* is crud.
"Simon Riggs" <simon@2ndquadrant.com> writes: > No, that would break MVCC. But we may have done lots of updates/deletes > that are *not* visible to any Snapshot, yet are not yet removable > because they are higher than OldestXmin but we don't know that because > previously the Snapshot details were not available. ISTM that this > proposal is a way of making the Snapshot limits publicly available so > that they can be used by VACUUM. Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put. One pretty serious problem with the proposal as written is the part about the sender blocking until the receiver takes the snap; that means it's not really a "publish" in the sense that you can make it available without worrying about exactly how many readers there might or might not be. That alone is sufficient to kill any thought of VACUUM making use of the info. I'd feel happier with an implementation more like prepared transactions: you stuff the information into shared memory and it sits there, readable by anyone, until such time as you take it down again. Like prepared xacts, GlobalXmin calculations would need to include these snapshots (and hence they'd limit vacuums). A shared-memory area would have to be fixed size, but perhaps backing files, like those used by prepared xacts, could handle the overflow for very large xip lists. Presumably crash safety is not an issue so this wouldn't require any complicated mechanism. regards, tom lane
Hannu Krosing <hannu@skype.net> writes:
> On Thursday, 2007-01-25 at 22:19, Jan Wieck wrote:
>> The cloning process needs to make sure that the clone_snapshot() call is made by the same DB user in the same database as the corresponding publish_snapshot() call.
>
> Why? The snapshot is universal and the same for the whole db instance, so why limit it to the same user/database?

Yeah. Use-case: pg_dumpall could guarantee that it produces consistent snapshots across multiple databases. (Not sure I actually want that, but it's at least arguably useful to someone.)

I think you would want to mark a snapshot with an owner, but that would be for the purpose of restricting who could take it down, not who could copy it.

regards, tom lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 1/26/2007 8:06 AM, Gregory Stark wrote:
>> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>>
>> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.
>
> That would solve the backend to backend IPC problem nicely.

But it fails on the count of making sure that globaloldestxmin doesn't advance past the snap you want to use. And exactly how will you pass a snap through a table? It won't become visible until you commit ... whereupon your own xmin isn't blocking the advance of globaloldestxmin.

regards, tom lane
On Fri, 2007-01-26 at 11:36 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin, and we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM.
>
> Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put.

Agreed, but that's the general case problem.

What I was hoping was that this would provide a mechanism for long running transactions (LRTs) to publish their min/max Xids. Then if all backends publish the minimum Xid of any Snapshot they have generated in the proc array, we'd be able to decide if there are any large holes in the global set of Snapshots. As a general case that's hard to evaluate, but in the common case of a lone LRT and all the rest short duration transactions, you can end up with a gap of 250,000+ transactions opening up between the two. It would be fairly easy to have VACUUM check for large "visibility gaps" between groups of transactions and then use that to improve its effectiveness in the presence of LRTs.

Theoretically we have to keep the chain of intermediate updates around so it can be traversed by the old transaction, but in practical terms traversing a long chain of updates isn't sensible. Serializable LRTs will never traverse the chain anyway (that's a serializability error), but there are some special cases to consider, hence my mentioning an unresolved problem previously.

We'd need to be much more careful about the way Snapshots are managed, so we can be certain that we take them all into account.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> set_current_snapshot() would have to sanity check that the xmin of the new >>> snapshot isn't older than the current globaloldestxmin. > >> That would solve the backend to backend IPC problem nicely. > > But it fails on the count of making sure that globaloldestxmin doesn't > advance past the snap you want to use. And exactly how will you pass > a snap through a table? It won't become visible until you commit ... > whereupon your own xmin isn't blocking the advance of globaloldestxmin. Hm, good point. You could always do it in a separate connection, but that starts to get annoying. I was more envisioning passing it around out-of-band though, something like: $db->execute(SET TRANSACTION ISOLATION LEVEL SERIALIZABLE); $snap = $db->execute(select current_snapshot()); for each db { if (fork()) $slave[i] = $db->connect(); $slave[i]->execute(select set_snapshot($snap)); $slave[i]->execute(copy table[i] to file[i]); } I'm also wondering about something like: $db->execute(SET TRANSACTION ISOLATION LEVEL SERIALIZABLE); $snap = $db->execute(select current_snapshot()); if (fork()) $slave = $db->connect(); $slave->execute(select set_snapshot($snap); $slave->execute(copy tab fromhugefile); signal parent } else { while(no signal yet) { $rows_loaded_so_far = $db->execute(select count(*)from tab); display_progress($rows_loaded_so_far); sleep(60); } } Sorry for the vaguely perlish pseudocode but it's the clearest way I can think to write it. I don't think it would make much sense to try to do anything like this in plpgsql; I think you really do want to be doing it in a language outside the database where it's easier to open multiple connections and handle IPC. I realize the second idea might take more hackery than just setting the snapshot... In particular as written above it wouldn't work because the slave would be writing with a new xid that isn't actually in the snapshot. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 1/26/2007 12:22 PM, Simon Riggs wrote:
> On Fri, 2007-01-26 at 11:36 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin, and we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM.
>>
>> Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put.
>
> Agreed, but that's the general case problem.
>
> What I was hoping was that this would provide a mechanism for long running transactions (LRTs) to publish their min/max Xids. Then if all backends publish the minimum Xid of any Snapshot they have generated in the proc array, we'd be able to decide if there are any large holes in the global set of Snapshots. As a general case that's hard to evaluate, but in the common case of a lone LRT and all the rest short duration transactions, you can end up with a gap of 250,000+ transactions opening up between the two. It would be fairly easy to have VACUUM check for large "visibility gaps" between groups of transactions and then use that to improve its effectiveness in the presence of LRTs.

There is a flaw in that theory. If you have a single LRT, then each subsequent transaction's xmin will be exactly that one, no?

Jan

> Theoretically we have to keep the chain of intermediate updates around so it can be traversed by the old transaction, but in practical terms traversing a long chain of updates isn't sensible. Serializable LRTs will never traverse the chain anyway (that's a serializability error), but there are some special cases to consider, hence my mentioning an unresolved problem previously.
>
> We'd need to be much more careful about the way Snapshots are managed, so we can be certain that we take them all into account.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On 1/26/2007 11:58 AM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 1/26/2007 8:06 AM, Gregory Stark wrote:
>>> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>>>
>>> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.
>
>> That would solve the backend to backend IPC problem nicely.
>
> But it fails on the count of making sure that globaloldestxmin doesn't advance past the snap you want to use. And exactly how will you pass a snap through a table? It won't become visible until you commit ... whereupon your own xmin isn't blocking the advance of globaloldestxmin.

The client receives the snapshot information as a result of the function call to current_snapshot(). The call to set_current_snapshot(snap) errors out if snap's xmin is older than globaloldestxmin. It is the client app that has to make sure that the transaction that created snap is still in progress. I didn't say anything about passing it through a table.

Take a modified pg_dump as an example. It could write multiple files: a pre-load sql with the first part of the schema, then a post-load sql with the finalization of same (creating indexes, adding constraints). It then builds a list of all relations to COPY and starts n threads, each writing a different file. Each thread connects to the DB and adjusts its snapshot to the one of the main transaction (which is still open). Then each thread grabs the next table to dump from the list and writes the COPY data to its output file. The threads exit when the list of tables is empty. The main thread waits until the last thread has joined and commits the main transaction.

Wouldn't be too hard to write a script that restores such a split dump in parallel as well.

Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
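[Editor's sketch] As an illustration of the parallel-dump flow described above, here is a rough multi-session transcript in the same notation as the earlier backend1/backend2 example. It relies on the hypothetical current_snapshot()/set_current_snapshot() functions proposed up-thread, and <snap> stands for the value the dumping client carries between its own connections:

main:    start transaction;
main:    set transaction isolation level serializable;
main:    select current_snapshot();            -- client keeps <snap> and leaves this transaction open
worker1: start transaction;
worker1: set transaction isolation level serializable;
worker1: select set_current_snapshot(<snap>);
worker1: copy table_a to stdout;               -- written to its own output file by the client thread
worker2: start transaction;
worker2: set transaction isolation level serializable;
worker2: select set_current_snapshot(<snap>);
worker2: copy table_b to stdout;               -- written to its own output file by the client thread
main:    commit;                               -- only after the last worker thread has joined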
On Fri, 2007-01-26 at 12:43 -0500, Jan Wieck wrote:
> There is a flaw in that theory. If you have a single LRT, then each subsequent transaction's xmin will be exactly that one, no?

You got me. My description was too loose, but you also got the rough picture. We'll save the detail for another day, but we all know it's a bridge we will have to cross one day, soon. I wasn't meaning to raise this specific discussion now, just to say that publishing snapshots for known LRTs is one way by which we can solve the LRT/VACUUMing issue.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > You got me. My description was too loose, but you also got the rough > picture. We'll save the detail for another day, but we all know its a > bridge we will have to cross one day, soon. I wasn't meaning to raise > this specific discussion now, just to say that publishing snapshots for > known LRTs is one way by which we can solve the LRT/VACUUMing issue. I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see. You also seem to be assuming that a transaction can have only one snapshot, which is not something we can enforce in enough cases to make it a very useful restriction. regards, tom lane
On Jan 26, 2007, at 4:48 PM, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> You got me. My description was too loose, but you also got the rough picture. We'll save the detail for another day, but we all know it's a bridge we will have to cross one day, soon. I wasn't meaning to raise this specific discussion now, just to say that publishing snapshots for known LRTs is one way by which we can solve the LRT/VACUUMing issue.
>
> I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see. You also seem to be assuming that a transaction can have only one snapshot, which is not something we can enforce in enough cases to make it a very useful restriction.

Well, Simon was talking about a serialized LRT, which ISTM shouldn't be hunting down ctid chains past the point it serialized at.

Even if that's not the case, there is also the possibility of an LRT publishing information about what tables it will hit. Any tables not being touched by an LRT could be vacuumed past the global minxid. It would be up to the user to do that in many cases, but that's likely to be well worth it if you have LRTs that are only hitting a few tables yet you have other tables that really, really need to stay vacuumed. Believe me, that is a very common use case in the real world (think queue table, or web session table).

--
Jim Nasby                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Jim Nasby <decibel@decibel.org> writes:
> On Jan 26, 2007, at 4:48 PM, Tom Lane wrote:
>> I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see.
>
> Well, Simon was talking about a serialized LRT, which ISTM shouldn't be hunting down ctid chains past the point it serialized at.

How do you figure that? If the LRT wants to update a tuple, it's got to chase the ctid chain to see whether the head update committed or not. It's not an error for a serializable transaction to update a tuple that was tentatively updated by a transaction that rolled back.

> Even if that's not the case, there is also the possibility of an LRT publishing information about what tables it will hit.

I think we already bought 99% of the possible win there by fixing vacuum. Most ordinary transactions aren't going to be able to predict which other tables the user might try to touch.

regards, tom lane
On Jan 29, 2007, at 11:28 PM, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
>> On Jan 26, 2007, at 4:48 PM, Tom Lane wrote:
>>> I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see.
>>
>> Well, Simon was talking about a serialized LRT, which ISTM shouldn't be hunting down ctid chains past the point it serialized at.
>
> How do you figure that? If the LRT wants to update a tuple, it's got to chase the ctid chain to see whether the head update committed or not. It's not an error for a serializable transaction to update a tuple that was tentatively updated by a transaction that rolled back.

Nuts. :(

>> Even if that's not the case, there is also the possibility of an LRT publishing information about what tables it will hit.
>
> I think we already bought 99% of the possible win there by fixing vacuum. Most ordinary transactions aren't going to be able to predict which other tables the user might try to touch.

Presumably a single-statement transaction could do that in most (if not all) cases. But even if we didn't support automatically detecting what tables a transaction was hitting, we could allow the user to specify it and then bomb out if the transaction tried to hit anything that wasn't in that list. That would allow users who are creating LRTs to limit their impact on vacuum.

The safe way to perform that check would be to check each buffer before accessing it, but I'm unsure how large a performance impact that would entail; I don't know how much code we run through to pull a tuple out of a page and do something with it, compared to the cost of checking whether that buffer belongs to a relation/file that's in the "approved list".

Perhaps a better way would be to allow users to mark vacuum-critical tables for "restricted" access. To access a restricted table the user would need to provide a list of restricted tables that a transaction is going to hit (or maybe just lump all restricted tables into one group), and that transaction would log its XID somewhere that vacuum can look at. If a transaction that hasn't specified it will touch the restricted tables tries to do so, it errors out.

We might want some way to flag buffers as belonging to a restricted table (or one of its indexes) so that transactions that aren't hitting restricted tables wouldn't have to pay a large performance penalty to figure that out. But you'd only have to mark those buffers when they're read in from the OS, and presumably a restricted table will be small enough that its buffers should stay put.

Logging the XID could prove to be a serialization point, but we could possibly avoid that by using per-relation locks.

--
Jim Nasby                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
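[Editor's sketch] To make the restricted-table idea concrete, a possible surface syntax might look like the sketch below. None of this exists; every keyword, function, table, and column name here is made up purely to illustrate the declare-then-enforce flow described above:

  alter table web_session set restricted;   -- hypothetical: mark a vacuum-critical table
  alter table job_queue set restricted;

  begin;
  select declare_restricted_tables('web_session', 'job_queue');
      -- hypothetical: logs this transaction's xid where vacuum can see it
  update web_session set last_seen = now() where session_id = 42;
      -- touching a restricted table that was not declared would raise an error here
  commit;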