Thread: Foreign key quandaries

Foreign key quandaries

From: Stephan Szabo
Going through the issues in doing dirty reads in foreign keys I've come up
with a few cases that I'm fairly uncertain about how to handle with
regards to deadlocks and figured I should ask for advice because I think
I'm missing something painfully obvious, but don't get large enough blocks
of time to think about it to figure out what it is.


I'd thought maybe it'd be enough to record which type of
check is running on which constraint, and use that to say that
we don't need to wait on a transaction that's waiting on us
due to a modification to the other table, but AFAICS
that lets through a bad case:
 T1: insert into fk values (2);
 T2: delete from pk where key=3;
 T2: delete from pk where key=2;
 T1: insert into fk values (3);
 
If T1 doesn't wait in this case, you can get into a case where
a bad row is inserted into fk if you then have:
 T1: delete from fk where key=2;
 T1: commit;
Now there's no row to make the second delete fail, but
transaction 2 still can't commit due to the fk row with key 3.



I'd then thought of doing something based on what row/value
transaction 2 was waiting on, but that has problems.
Given a foreign key with no referential actions and a
sequence like:
 Transaction 1 inserts into the foreign key table a row
  with a referencing key of 2.
 Transaction 1 checks the foreign key.
 Transaction 2 deletes the primary key rows having keys 2 and 3.
 Transaction 1 inserts another row into the foreign key table
  with a referencing key of 2.
 Transactions 1 and 2 start checking the foreign key.

AFAICS, transaction 2 needs to wait since there's already a
row it can see in the foreign key table that's not yet committed
(so it doesn't know if the delete works or not).  We can tell
transaction 1 that it doesn't need to wait on transaction 2
because transaction 1 is inserting a value that transaction 2
will see in its check, thus we're saved from the first case.

However, this has the potential to deadlock if we had, for example,
inserted a foreign key table row of 3 rather than 2 as the second
insert in transaction 1 and the delete check for 2 went first.  If
we knew that it was also going to be checking the 3 rows, we'd be
safe, but then we've got to keep that information in some way that's
visible to other transactions AFAICS.  And, if the checks were
done in the order delete check for 3, delete check for 2 (T2 blocks),
insert check for 3, we'd be back in the state of the first example.
:(





Re: Foreign key quandaries

From: Rod Taylor
I'm not sure I understand the question. The case as described simply has
to deadlock because you're approaching the same values with conflicting
tasks from opposite directions.

Detect it, kill one, and continue on.

Same problem as:

T1 insert into pk values (1);
T2 insert into pk values (2);
T2 insert into pk values (1);
T1 insert into pk values (2);

It's up to the application(s) to minimize risk by approaching values in
a common order for a given set of work.
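The "detect it, kill one" step amounts to finding a cycle in the wait-for graph. A minimal sketch in Python (purely illustrative; the names are invented and this is not PostgreSQL's actual lock manager):

```python
# Minimal wait-for-graph deadlock detector (illustrative sketch).
# Nodes are transactions; an edge T1 -> T2 means "T1 is waiting on a
# lock held by T2".  A cycle in the graph is a deadlock, and one
# participant on the cycle must be killed so the rest can proceed.

def find_deadlock(waits_for):
    """Return a list of transaction ids forming a cycle, or None."""
    def visit(txn, path, seen):
        if txn in path:
            return path[path.index(txn):]      # found a cycle
        if txn in seen:
            return None                        # already fully explored
        seen.add(txn)
        for blocker in waits_for.get(txn, ()):
            cycle = visit(blocker, path + [txn], seen)
            if cycle:
                return cycle
        return None

    seen = set()
    for txn in waits_for:
        cycle = visit(txn, [], seen)
        if cycle:
            return cycle
    return None

# Rod's example: T1 and T2 insert keys 1 and 2 in opposite orders,
# so each ends up waiting on a lock the other holds.
waits = {"T1": ["T2"], "T2": ["T1"]}
print(find_deadlock(waits))   # -> ['T1', 'T2'], a deadlock cycle
```

Killing any one transaction on the reported cycle breaks the deadlock and lets the other participant continue, which is exactly the "kill one, and continue on" behavior.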


On Fri, 2003-02-28 at 16:58, Stephan Szabo wrote:
> Going through the issues in doing dirty reads in foreign keys I've come up
> with a few cases that I'm fairly uncertain about how to handle with
> regards to deadlocks and figured I should ask for advice because I think
> I'm missing something painfully obvious, but don't get large enough blocks
> of time to think about it to figure out what it is.
>
>
> I'd thought maybe it'd be enough to say which type of
> thing on which constraint and use that to basically say that
> we don't need to wait on a transaction that's waiting on us
> due to a modification to the other table, but AFAICS
> that lets through a bad case:
>  T1: insert into fk values (2);
>  T2: delete from pk where key=3;
>  T2: delete from pk where key=2;
>  T1: insert into fk values (3);
> If T1 doesn't wait in this case, you can get into a case where
> a bad row is inserted into fk if you then have:
>  T1: delete from fk where key=2;
>  T1: commit;
> Now there's no row to make the second delete fail but
> transaction 2 still can't commit due to the fk row with key 3.
>
>
>
> I'd then thought of doing something based on what row/value
> transaction 2 was waiting on, but that has problems.
> Given a foreign key with no referential actions and a
> sequence like:
>
>  Transaction 1 inserts into the foreign key table a row
>   with a referencing key of 2.
>  Transaction 1 checks the foreign key
>  Transaction 2 deletes the primary key rows having keys
>   2 and 3
>  Transaction 1 inserts another row into the foreign key
>   table with a referencing key of 2.
>  Transactions 1 and 2 start checking the foreign key.
>
> AFAICS, transaction 2 needs to wait since there's already a
> row it can see in the foreign key table that's not yet committed
> (so it doesn't know if the delete works or not).  We can tell
> transaction 1 that it doesn't need to wait on transaction 2
> because transaction 1 is inserting a value that transaction 2
> will see in its check, thus we're saved from the first case.
>
> However, this has the potential to deadlock if we had for example,
> inserted a foreign key table row of 3 rather than 2 as the second
> insert in transaction 1 and the delete check for 2 went first.  If
> we knew that it was also going to be checking the 3 rows, we'd be
> safe, but then we've got to keep that information in some way that's
> visible to other transactions AFAICS.  And, if the checks were
> done in the order delete check for 3, delete check for 2(t2 blocks),
> insert check for 3, we'd be back in the state of the first example.

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Re: Foreign key quandaries

From: Stephan Szabo
On 1 Mar 2003, Rod Taylor wrote:

> I'm not sure I understand the question. The case as described simply has
> to deadlock because your approaching the same values with conflicting
> tasks from opposite directions.

Well, the problem is that two cases (one of which I think deadlock is
unnecessary in) are very similar.

As I see it:

T1: insert 2
T2: delete 2
T1: insert 2/update 2 (non-key fields)
 shouldn't need to deadlock.

T1: insert 2
T2: delete 2 & 3
   * delete 2's check blocks before
     checking 3
T1: insert 3
 should not need to deadlock, I think.

T1: insert 2
T2: delete 3
T2: delete 2
 (or delete 2 & 3 where 3's check goes then 2's check blocks)
T1: insert 3
 does need to deadlock.

In the second case, both deletes have happened so the row the insert wants
to check against is marked for deletion, but since it's going to be
checking for the 3 row in the future, and will error if T1 commits, I
think it's safe for it to go through.

I'm trying to find a way to differentiate the second and third case given
that I'm running inside a constraint check on insert 3. It'd be easy if
transaction 1 could see that it's going to be checking for the 3 row
later, but I think that'd involve keeping around a lot of information about
the rows that are affected in some shared way which could get really
large.
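One way to picture the bookkeeping this differentiation would take (all names here are invented for illustration): suppose each deleting transaction published the set of PK values whose FK checks it has *not yet* run. A blocked inserter could then tell case 2 (its value will still be checked, so it may proceed) from case 3 (the check for that value already ran, so it must wait and deadlock):

```python
# Illustrative sketch only -- the thread's point is that this shared
# state could get very large.
# txn -> set of deleted PK values whose FK checks have not yet run
pending_checks = {"T2": {3}}   # case 2: 2's check blocked first, 3 still pending

def must_wait(value, deleter):
    """True if an inserter referencing `value` has to wait on `deleter`.
    It may skip waiting only when the deleter will still check `value`
    later, and so will see the new row and error out if it commits."""
    return value not in pending_checks.get(deleter, set())

print(must_wait(3, "T2"))      # False: case 2, safe for the insert to proceed

pending_checks["T2"] = set()   # case 3: T2 already checked 3 (found nothing)
print(must_wait(3, "T2"))      # True: must wait, which produces the deadlock
```

The catch, as noted above, is that `pending_checks` would have to live somewhere visible to other transactions, and keeping it for every affected row could get really large.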




Re: Foreign key quandaries

From: Rod Taylor
On Sat, 2003-03-01 at 00:44, Stephan Szabo wrote:
> On 1 Mar 2003, Rod Taylor wrote:
>
> > I'm not sure I understand the question. The case as described simply has
> > to deadlock because your approaching the same values with conflicting
> > tasks from opposite directions.
>
> Well, the problem is that two cases (one of which I think deadlock is
> unnecessary in) are very similar.

I see.  Now I see what you're asking about.

> As I see it:
>
> T1: insert 2
> T2: delete 2
> T1: insert 2/update 2 (non-key fields)
>  shouldn't need to deadlock.

> T1: insert 2
> T2: delete 2 & 3
>    * delete 2's check blocks before
>       checking 3
> T1: insert 3
>  should not need to deadlock I think

> T1: insert 2
> T2: delete 3
> T2: delete 2
>  (or delete 2 & 3 where 3's check goes then 2's check blocks)
> T1: insert 3
>  does need to deadlock
>
> In the second case, both deletes have happened so the row the insert wants
> to check against is marked for deletion, but since it's going to be
> checking for the 3 row in the future, and will error if T1 commits, I
> think it's safe for it to go through.
>
> I'm trying to find a way to differentiate the second and third case given
> that I'm running inside a constraint check on insert 3. It'd be easy if
> transaction 1 could see that it's going to be checking for the 3 row
> later, but I think that'd involve keeping around alot of information about
> the rows that are affected in some shared way which could get really
> large.

Isn't the differentiation going to happen automatically?

In case 2:

T1: create fk tuple (uncommitted) -> value 2
T2: delete pk tuple value 2
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
T1: create fk tuple (uncommitted) -> value 3
T1: commit
T2:    scan through fk table, find tuple value 2 ... it's committed
T2: <run cascade procedure on tuples found in fk table for value 2>
T2: continue scan through fk table, find tuple value 3 ... it's committed
T2: <run cascade procedure on tuples found in fk table for value 3>
T2: All is well -- return control to user.

In case 3:
T1: create fk tuple (uncommitted) -> value 2
T2: delete pk tuple value 3
T2:    scan through fk table, value 3 not found
T2: delete pk tuple value 2
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
T1: create fk value 3
T1:    scan through pk table, find uncommitted tuple value 3 ... sleep
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
T1:    scan through pk table, find uncommitted tuple value 3 ... sleep
T2:    scan through fk table, find uncommitted tuple value 2 ... sleep

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Re: Foreign key quandaries

From: Rod Taylor
Gah, hit wrong key combination and the email sent early.

Anyway, after that 'sleep' mess at the bottom is:
T1 or T2: Sleeping too long -- lets run deadlock detection code
T1 or T2: Kill off random participant of deadlock.

The other participant is then allowed to continue their work.

On Sat, 2003-03-01 at 02:03, Rod Taylor wrote:
> On Sat, 2003-03-01 at 00:44, Stephan Szabo wrote:
> > On 1 Mar 2003, Rod Taylor wrote:
> >
> > > I'm not sure I understand the question. The case as described simply has
> > > to deadlock because your approaching the same values with conflicting
> > > tasks from opposite directions.
> >
> > Well, the problem is that two cases (one of which I think deadlock is
> > unnecessary in) are very similar.
>
> I see.  Now I see what your asking about.
>
> > As I see it:
> >
> > T1: insert 2
> > T2: delete 2
> > T1: insert 2/update 2 (non-key fields)
> >  shouldn't need to deadlock.
>
> > T1: insert 2
> > T2: delete 2 & 3
> >    * delete 2's check blocks before
> >       checking 3
> > T1: insert 3
> >  should not need to deadlock I think
>
> > T1: insert 2
> > T2: delete 3
> > T2: delete 2
> >  (or delete 2 & 3 where 3's check goes then 2's check blocks)
> > T1: insert 3
> >  does need to deadlock
> >
> > In the second case, both deletes have happened so the row the insert wants
> > to check against is marked for deletion, but since it's going to be
> > checking for the 3 row in the future, and will error if T1 commits, I
> > think it's safe for it to go through.
> >
> > I'm trying to find a way to differentiate the second and third case given
> > that I'm running inside a constraint check on insert 3. It'd be easy if
> > transaction 1 could see that it's going to be checking for the 3 row
> > later, but I think that'd involve keeping around alot of information about
> > the rows that are affected in some shared way which could get really
> > large.
>
> Isn't the differentiation going to happen automatically?
>
> In case 2:
>
> T1: create fk tuple (uncommitted) -> value 2
> T2: delete pk tuple value 2
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> T1: create fk tuple (uncommitted) -> value 3
> T1: commit
> T2:    scan through fk table, find tuple value 2 ... its committed
> T2: <run cascade procedure on tuples found in fk table for value 2>
> T2: continue scan through fk table, find tuple value 3 ... its committed
> T2: <run cascade procedure on tuples found in fk table for value 3>
> T2: All is well -- return control to user.
>
> In case 3:
> T1: create fk tuple (uncommitted) -> value 2
> T2: delete pk tuple value 3
> T2:    scan through fk table, value 3 not found
> T2: delete pk tuple value 2
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> T1: create fk value 3
> T1:    scan through pk table, find uncommitted tuple value 3 ... sleep
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> T1:    scan through pk table, find uncommitted tuple value 3 ... sleep
> T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Re: Foreign key quandaries

From: Stephan Szabo
On 1 Mar 2003, Rod Taylor wrote:

> Gah, hit wrong key combination and the email sent early.
>
> Anyway, after that 'sleep' mess at the bottom is:
> T1 or T2: Sleeping too long -- lets run deadlock detection code
> T1 or T2: Kill off random participant of deadlock.
>
> The other participant is then allowed to continue their work.
>
> > Isn't the differentiation going to happen automatically?

The problem is that in case 2, both tuples 2 and 3 are already removed
before either foreign key check runs, so when T1 adds the value 3
row and checks the pk table it will find that its pk row has been
modified.  If the ordering went, delete 2 - check 2, delete 3 - check
3, this wouldn't be a problem, but then that'd fail in a
spec-non-compliant way if row 2 referred to row 3.

> > In case 2:
> >
> > T1: create fk tuple (uncommitted) -> value 2
*   T1: scan through pk table, no problems
> > T2: delete pk tuple value 2
*   T2: delete pk tuple value 3
> > T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> > T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> > T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> > T1: create fk tuple (uncommitted) -> value 3
*   T1: scan through pk table, find modified tuple value 3 ...



Re: Foreign key quandaries

From: Rod Taylor
On Sat, 2003-03-01 at 02:38, Stephan Szabo wrote:
> On 1 Mar 2003, Rod Taylor wrote:
>
> > Gah, hit wrong key combination and the email sent early.
> >
> > Anyway, after that 'sleep' mess at the bottom is:
> > T1 or T2: Sleeping too long -- lets run deadlock detection code
> > T1 or T2: Kill off random participant of deadlock.
> >
> > The other participant is then allowed to continue their work.
> >
> > > Isn't the differentiation going to happen automatically?
>
> The problem is that in case 2, both tuples 2 and 3 are already removed
> before either foreign key check runs, so when T1 adds the value 3
> row and checks the pk table it will find that its pk row has been
> modified.  If the ordering went, delete 2 - check 2, delete 3 - check
> 3, this wouldn't be a problem, but then that'd fail in a
> spec-non-compliant way if row 2 refered to row 3.

The foreign key cascade is explicitly deferred to the end of the
statement via the trigger queue, but there is no reason that the foreign
key code can't run immediately for each tuple removed.

For the defer-check-to-commit case, have the code add the element to the
trigger queue by making a fake trigger wrapper -- one that isn't really a
trigger (probably the simplest method) -- and it should, and will, deadlock.

> > > In case 2:
> > >
> > > T1: create fk tuple (uncommitted) -> value 2
> *   T1: scan through pk table, no problems
> > > T2: delete pk tuple value 2
> *   T2: delete pk tuple value 3
> > > T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> > > T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> > > T2:    scan through fk table, find uncommitted tuple value 2 ... sleep
> > > T1: create fk tuple (uncommitted) -> value 3
> *   T1: scan through pk table, find modified tuple value 3 ...
--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Re: Foreign key quandaries

From: Stephan Szabo
On 1 Mar 2003, Rod Taylor wrote:

> On Sat, 2003-03-01 at 02:38, Stephan Szabo wrote:
> > On 1 Mar 2003, Rod Taylor wrote:
> >
> > > Gah, hit wrong key combination and the email sent early.
> > >
> > > Anyway, after that 'sleep' mess at the bottom is:
> > > T1 or T2: Sleeping too long -- lets run deadlock detection code
> > > T1 or T2: Kill off random participant of deadlock.
> > >
> > > The other participant is then allowed to continue their work.
> > >
> > > > Isn't the differentiation going to happen automatically?
> >
> > The problem is that in case 2, both tuples 2 and 3 are already removed
> > before either foreign key check runs, so when T1 adds the value 3
> > row and checks the pk table it will find that its pk row has been
> > modified.  If the ordering went, delete 2 - check 2, delete 3 - check
> > 3, this wouldn't be a problem, but then that'd fail in a
> > spec-non-compliant way if row 2 refered to row 3.
>
> The foreign key cascade is explicitly deferred to the end of the
> statement via the trigger queue, but there is no reason that the foreign
> key code can't run immediately for each tuple removed.

The problem happens when you have two rows in a table that refer to each
other with a foreign key to the same table.  If both are deleted, it must
succeed, but it won't if you do the check in between the deletes unless
I'm missing something. It's effectively the same problem as we currently
have with the unique constraint (premature checking of the constraint).
AFAICS, you need to defer to end of statement to get the correct semantics
out of the checks, or you need a state where the rows are sort of
pseudo-deleted/updated, which might be better.




Re: Foreign key quandaries

From: Rod Taylor
On Sat, 2003-03-01 at 11:06, Stephan Szabo wrote:
> On 1 Mar 2003, Rod Taylor wrote:
>
> > On Sat, 2003-03-01 at 02:38, Stephan Szabo wrote:
> > > On 1 Mar 2003, Rod Taylor wrote:
> > >
> > > > Gah, hit wrong key combination and the email sent early.
> > > >
> > > > Anyway, after that 'sleep' mess at the bottom is:
> > > > T1 or T2: Sleeping too long -- lets run deadlock detection code
> > > > T1 or T2: Kill off random participant of deadlock.
> > > >
> > > > The other participant is then allowed to continue their work.
> > > >
> > > > > Isn't the differentiation going to happen automatically?
> > >
> > > The problem is that in case 2, both tuples 2 and 3 are already removed
> > > before either foreign key check runs, so when T1 adds the value 3
> > > row and checks the pk table it will find that its pk row has been
> > > modified.  If the ordering went, delete 2 - check 2, delete 3 - check
> > > 3, this wouldn't be a problem, but then that'd fail in a
> > > spec-non-compliant way if row 2 refered to row 3.
> >
> > The foreign key cascade is explicitly deferred to the end of the
> > statement via the trigger queue, but there is no reason that the foreign
> > key code can't run immediately for each tuple removed.
>
> The problem happens when you have two rows in a table that refer to each
> other with a foreign key to the same table.  If both are deleted, it must
> succeed, but it won't if you do the check in between the deletes unless
> I'm missing something. It's effectively the same problem as we currently

In the general case, you're not going to have interdependent tables.
Store a list of tables touched (changed -- not just read) by keys. The
second fkey to touch a table is deferred until the end of the statement
and will have the semantics required for that (deadlock in case 2).

So consider a cascade delete from pk tab1 to fk tab2, and pk tab2 to fk
tab1.

tab1  -> tab2
tab2  -> tab1


Delete from pk tab1 value in (2, 3)

- Seek tab1, find tuple value 2
- Record tab1 as dirty
- tab2 fkey doesn't find tab2 on dirty list -- starts work
- Seek tab2, find tuple value 2, cascade delete
- Record tab2 as dirty
- tab1 fkey finds tab1 on dirty list -- defers work
- tab2 fkey finishes scan on tab2 (no more tuples of value 2)
- Seek on tab1 continues, find tuple value 3
- Record tab1 as dirty
- tab2 fkey finds tab2 on dirty list but it was done by the tab2 fkey
(myself) -- so we assume this is a different value (keys are unique) and
safe to deal with
- seek tab2, find tuple value 3, cascade delete
- Record tab2 as dirty
- tab1 fkey finds tab1 on dirty list -- defers work
- tab2 fkey finishes scan on tab2 (no more tuples of value 3)
- seek on tab1 continues -- no more tuples of value 2 or 3

- command counter increment for finished 'delete' on tab1
- deferred (tab2 pk -> tab1 fkey) actions begin execution.
- ...

This *will* deadlock in case 2 as described earlier.  But it's no worse
than what we have today -- in fact it's slightly better as one of the
fkeys is fast and clean.
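The dirty-list bookkeeping in the trace above might be sketched like this (names invented; an illustration of the idea, not PostgreSQL code):

```python
# Illustrative sketch of the "dirty list" scheme: a cascading FK action
# runs inline unless its target table was already dirtied in this
# statement by someone other than itself; then it defers to end of
# statement (where case 2 will deadlock as required).

dirty = {}       # table -> who dirtied it in this statement
deferred = []    # cascaded actions postponed to end of statement

def cascade(fkey, table):
    owner = dirty.get(table)
    if owner is not None and owner != fkey:
        deferred.append((fkey, table))   # someone else touched it: defer
        return "deferred"
    # The same fkey re-touching a table is safe: keys are unique, so
    # this must be a different value.
    dirty[table] = fkey
    return "ran"

# The tab1 <-> tab2 cross-reference from the trace:
dirty["tab1"] = "stmt"                   # the DELETE on tab1 itself
print(cascade("tab1->tab2", "tab2"))     # ran
print(cascade("tab2->tab1", "tab1"))     # deferred (tab1 dirtied by stmt)
print(cascade("tab1->tab2", "tab2"))     # ran (tab2 dirtied by itself)
print(deferred)                          # [('tab2->tab1', 'tab1')]
```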

But this is not a common scenario, as it represents a parent <-> child
relationship split between two or more tables. Cross-keyed tuples in two
different tables already require deferred constraints for the inserts to
succeed -- and as such will deadlock by default, since the checks are
deferred until commit.


What's more interesting is a parent <-> child relationship on a single
table.  What you really want to hint at as being dirty is a given pk
value within a table -- but that becomes difficult when multiple foreign
keys on different unique keys are involved.

I think the tree table (parent / child are in same table, making a tree)
also needs to defer its checks and be susceptible to deadlocks when
involved in a 'case 2' scenario. The accounting required to check for
loops won't be worth it if there are a large number of values --
especially when multiple foreign keys on different unique keys are
involved.


Anyway, the 'checking the constraint too early' problem only exists when
looking at the same tuples, and the problem is temporary.  The
solution for the value = value + 1 case in a unique constraint seems to
be to recheck at the end of the statement if there is still a problem.
If there wasn't an issue during the initial index insert, then
everything will be clean later as well.
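The recheck-at-end-of-statement idea can be shown with a toy model (Python, purely illustrative): checking after each row update raises a spurious error on the transient duplicate, while a single check at end of statement does not.

```python
# Toy model of UNIQUE checking for "update t set v = v + 1".
# Eager per-row checking trips over a transient duplicate; deferring
# the check to end of statement sees only the final, valid state.

def update_incr_eager(rows):
    """Check uniqueness after each individual row update."""
    for i in range(len(rows)):
        rows[i] += 1
        if len(set(rows)) != len(rows):
            raise ValueError("transient duplicate: %r" % rows)
    return rows

def update_incr_deferred(rows):
    """Apply all updates, then check once at end of statement."""
    rows = [v + 1 for v in rows]
    if len(set(rows)) != len(rows):
        raise ValueError("real duplicate")
    return rows

try:
    update_incr_eager([1, 2, 3])
except ValueError as e:
    print("eager check failed:", e)   # fails on the transient [2, 2, 3]

print(update_incr_deferred([1, 2, 3]))   # [2, 3, 4] -- no error
```

If the statement leaves a genuine duplicate behind, the deferred check still catches it; only the false alarm during the statement is avoided.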

Fkeys have a slight advantage as they tend to change the table being
scanned, so more work is safe to be done immediately -- but tracking
what has and hasn't been touched in a given CommandCounter will be
important.

> have with the unique constraint (premature checking of the constraint).
> AFAICS, you need to defer to end of statement to get the correct semantics
> out of the checks, or you need to have a state where the rows are sort of
> pseudo-deleted/updated which could be better.
>
--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Re: Foreign key quandaries

From: Tom Lane
Seems like this sort of approach is going to introduce a huge amount of
very fragile mechanism, and probably a wide variety of hard-to-reproduce
bugs :-(.

ISTM the only thing we really need to do to address the complaints about
FKs is to invent some kind of sharable row-level lock.  Then
transactions adding references to an FK table would take out shared
instead of exclusive locks on PK rows; nothing else changes.
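A toy model of the proposed shared vs. exclusive row lock (an invented structure, not an actual PostgreSQL mechanism) shows the win: FK inserters no longer block each other on the same PK row, while a deleter still has to wait for all of them.

```python
# Illustrative sketch of a shareable row-level lock.  Many FK-inserting
# transactions can hold the lock in share mode simultaneously; a deleter
# needs exclusive mode and must wait until all sharers are gone.

class RowLock:
    def __init__(self):
        self.sharers = set()      # txns holding the lock in share mode
        self.owner = None         # txn holding it exclusively, if any

    def acquire_share(self, txn):
        if self.owner is not None and self.owner != txn:
            return False          # caller must wait (and may deadlock-check)
        self.sharers.add(txn)
        return True

    def acquire_exclusive(self, txn):
        if self.sharers - {txn} or self.owner not in (None, txn):
            return False          # other sharers or owner present: wait
        self.owner = txn
        return True

    def release(self, txn):
        self.sharers.discard(txn)
        if self.owner == txn:
            self.owner = None

pk_row = RowLock()
print(pk_row.acquire_share("T1"))       # True: T1 references the PK row
print(pk_row.acquire_share("T3"))       # True: sharers don't block sharers
print(pk_row.acquire_exclusive("T2"))   # False: the deleter must wait
```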

Of course, it's easy to say "sharable row-level lock" and not so easy to
come up with an implementation that has decent performance.  But when it
was done, I think we might have some faith that it works.  I'm going to
have great difficulty putting any faith at all in an FK implementation
that relies on dirty reads.
        regards, tom lane


Re: Foreign key quandaries

From: Stephan Szabo
On Sat, 1 Mar 2003, Tom Lane wrote:

> Seems like this sort of approach is going to introduce a huge amount of
> very fragile mechanism, and probably a wide variety of hard-to-reproduce
> bugs :-(.

Possibly; that's why I brought it up here -- more minds to find
problems are better.

> ISTM the only thing we really need to do to address the complaints about
> FKs is to invent some kind of sharable row-level lock.  Then
> transactions adding references to an FK table would take out shared
> instead of exclusive locks on PK rows; nothing else changes.

That gets rid of most of the problems. There are still cases where read
locks lock more than they need to (an insert into fk followed by another
transaction modifying the referenced row in a way that doesn't affect
the key).

> Of course, it's easy to say "sharable row-level lock" and not so easy to
> come up with an implementation that has decent performance.  But when it
> was done, I think we might have some faith that it works.  I'm going to
> have great difficulty putting any faith at all in an FK implementation
> that relies on dirty reads.

As a thought exercise, what all is involved in making sharable row-level
locks?  No matter what is done for foreign keys, they're still useful
for constraints written by users, and even using them now in foreign keys
doesn't prevent a future change if the other issues are worked out.



Re: Foreign key quandaries

From: Tom Lane
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Sat, 1 Mar 2003, Tom Lane wrote:
>> ISTM the only thing we really need to do to address the complaints about
>> FKs is to invent some kind of sharable row-level lock.

> That gets rid of most of the problems. There are problems with read locks
> locking more than they may need to (insert into fk followed by another
> transaction doing a modification of the referenced row that doesn't affect
> the key).

Agreed, it's not a 100% solution ... but it would be way better than
what we have.  (Hm, I wonder whether a read lock could specify that only
certain columns are locked?  Nah, probably too much trouble.)

>> Of course, it's easy to say "sharable row-level lock" and not so easy to
>> come up with an implementation that has decent performance.

> As a thought exercise, what all is involved in making sharable row-level
> locks...

ISTM the problem is where to keep the state.  You can't keep track of an
indefinite number of shared-lock holders in the on-row header ... but if
you try to keep the state in shared memory, you can't keep track of a
very large number of locked rows, either.  Perhaps some scheme could be
implemented to keep lock state in memory but spill to disk when there
get to be too many locked rows.  I don't see how to make that work
efficiently, though.

We talked about this a week or two back, and someone (was it Rod?) asked
essentially "do we *need* any state --- would a marker on the row that
it's share-locked be enough?".  I suppose we could use an infomask bit
to indicate share-locking and overload xmax as a count of the number of
lockers.  Then each transaction with read locks would have to have local
state remembering every row it's read-locked (this is much less bad than
shared state, since local RAM is more easily expansible), and at either
transaction commit or abort you'd have to run around and decrement those
counts.  (But when you decrement a count to zero, what next?  You still
need to figure out who's blocked on that row and release them.)  This
could perhaps be made to work with reasonable efficiency, but it makes
me nervous.  If someone crashes while holding read locks, how do you
recover?  Seems like you need to scan the *entire database* during
restart to zero out shared-lock counts.  In general we have stayed away
from the notion of requiring transactions to do end-of-transaction
cleanup on disk, and I think that is a good design choice not to be
tossed away lightly.
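A toy model of that counting scheme (names invented, purely illustrative) makes the cleanup obligation concrete: each backend remembers locally which rows it share-locked and decrements the counts at end of transaction. The crash problem is visible too: if a backend dies without running its cleanup, the counts stay stuck until something scans and zeroes them.

```python
# Illustrative sketch of counted share locks (overloading "xmax" as a
# sharer count, with local per-transaction state for cleanup).

share_count = {}     # (table, key) -> number of share-lockers
my_locked = {}       # txn -> rows this txn has share-locked (local state)

def share_lock(txn, row):
    """Take a share lock: bump the on-row count, remember it locally."""
    share_count[row] = share_count.get(row, 0) + 1
    my_locked.setdefault(txn, []).append(row)

def end_transaction(txn):
    """At commit or abort, run around and decrement our counts."""
    for row in my_locked.pop(txn, []):
        share_count[row] -= 1
        # when a count reaches zero, waiters on that row must be woken

share_lock("T1", ("pk", 2))
share_lock("T3", ("pk", 2))
end_transaction("T1")
print(share_count[("pk", 2)])   # 1: T3 still holds its share lock
```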

So I don't see how to do that efficiently.  But still, it seems a more
tractable problem than trying to prove a dirty-read implementation correct.
        regards, tom lane