Thread: SSI patch version 14
Attached.
Dan and I have spent many, many hours desk-checking and testing,
including pounding for many hours in DBT-2 at high concurrency
factors on a 16 processor box. In doing so, we found and fixed a few
issues. Neither of us is aware of any bugs or deficiencies
remaining, even after a solid day of pounding in DBT-2, other than
the failure to extend any new functionality to hot standby.
At Heikki's suggestion I have included a test to throw an error on an
attempt to switch to serializable mode during recovery. Anything
more to address that issue can be a small follow-up patch -- probably
mainly a few notes in the docs.
-Kevin
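An illustrative hot-standby session for the check described above (hypothetical; the error text follows the patch, modulo a typo noted later in this thread):

test=# begin;
BEGIN
test=# set transaction isolation level serializable;
ERROR:  cannot set transaction isolation level to serializable during recovery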
On Mon, 2011-01-24 at 21:30 -0600, Kevin Grittner wrote:

> Attached.
>
> Dan and I have spent many, many hours desk-checking and testing,
> including pounding for many hours in DBT-2 at high concurrency
> factors on a 16 processor box. In doing so, we found and fixed a few
> issues. Neither of us is aware of any bugs or deficiencies
> remaining, even after a solid day of pounding in DBT-2, other than
> the failure to extend any new functionality to hot standby.
>
> At Heikki's suggestion I have included a test to throw an error on an
> attempt to switch to serializable mode during recovery. Anything
> more to address that issue can be a small follow-up patch -- probably
> mainly a few notes in the docs.

Here is my first crack at a review:

First of all, I am very impressed with the patch. I was pleased to see that both READ ONLY DEFERRABLE and 2PC work! Also, I found the patch very usable and did not encounter any bugs or big surprises. Second, I do not understand this patch entirely, so the following statements can be considered questions as much as answers.

At a high level, there is a nice conceptual simplicity. Let me try to summarize it as I understand it:

* RW dependencies are detected using predicate locking.
* RW dependencies are tracked from the reading transaction (as an "out" conflict) and from the writing transaction (as an "in" conflict).
* Before committing a transaction, then by looking only at the RW dependencies (and predicate locks) for current and past transactions, you can determine if committing this transaction will result in a cycle (and therefore a serialization anomaly); and if so, abort it.

That's where the simplicity ends, however ;)

For starters, the above structures require unlimited memory, while we have fixed amounts of memory. The predicate locks are referenced by a global shared hash table as well as per-process SHMQueues. It can adapt memory usage downward in three ways:

* increase lock granularity -- e.g. change N page locks into a table lock
* "summarization" -- fold multiple locks on the same object across many old committed transactions into a single lock
* use the SLRU

Tracking of RW conflicts of current and past transactions is more complex. Obviously, it doesn't keep _all_ past transactions, but only ones that overlap with a currently-running transaction. It does all of this management using SHMQueue. There isn't much of an attempt to gracefully handle OOM here as far as I can tell, it just throws an error if there's not enough room to track a new transaction (which is reasonable, considering that it should be quite rare and can be mitigated by increasing max_connections).

The heavy use of SHMQueue is quite reasonable, but for some reason I find the API very difficult to read. I think it would help (at least me) quite a lot to annotate (i.e. with a comment in the struct) the various lists and links with the types of data being held.

The actual checking of conflicts isn't quite so simple, either, because we want to detect problems before the victim transaction has done too much work. So, checking is done along the way in several places, and it's a little difficult to follow exactly how those smaller checks add up to the overall serialization-anomaly check (the third point in my simple model).

There are also optimizations around transactions declared READ ONLY. Some of these are a little difficult to follow as well, and I haven't followed them all.
There is READ ONLY DEFERRABLE, which is a nice feature that waits for a "safe" snapshot, so that the transaction will never be aborted.

Now, on to my code comments (non-exhaustive):

* I see that you use a union as well as using "outLinks" to also be a free list. I suppose you did this to conserve shared memory space, right?

* In RegisterSerializableTransactionInt(), for a RO xact, it considers any concurrent RW xact a possible conflict. It seems like it would be possible to know whether a RW transaction may have overlapped with any committed RW transaction (in finishedLink queue), and only add those as potential conflicts. Would that work? If so, that would make more snapshots safe.

* When a transaction finishes, then PID should probably be set to zero. You only use it for waking up a deferrable RO xact waiting for a snapshot, right?

* Still some compiler warnings:
  twophase.c: In function ‘FinishPreparedTransaction’:
  twophase.c:1360: warning: implicit declaration of function ‘PredicateLockTwoPhaseFinish’

* You have a function called CreatePredTran. We are already using "Transaction" and "Xact" -- we probably don't need "Tran" as well.

* HS error has a typo: "ERROR: cannot set transaction isolationn level to serializable during recovery"

* I'm a little unclear on summarization and writing to the SLRU. I don't see where it's determining that the predicate locks can be safely released. Couldn't the oldest transaction still have relevant predicate locks?

* In RegisterSerializableTransactionInt, if it doesn't get an sxact, it goes into summarization. But summarization assumes that it has at least one finished xact. Is that always true? If you have enough memory to hold a transaction for each connection, plus max_prepared_xacts, plus one, I think that's true. But maybe that could be made more clear?

I'll keep working on this patch. I hope I can be of some help getting this committed, because I'm looking forward to this feature. And I certainly think that you and Dan have applied the amount of planning, documentation, and careful implementation necessary for a feature like this.

Regards,
Jeff Davis
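To make the three-point summary in the review above concrete, here is a minimal write-skew schedule in the style of the examples later in this thread. This is an illustrative sketch, not from the original thread: the table and data are invented, the failure may be raised at the second update rather than at commit, and the exact error wording may differ in the patch as posted.

-- setup (hypothetical)
create table doctors (name text primary key, on_call boolean not null);
insert into doctors values ('alice', true), ('bob', true);

T1: begin transaction isolation level serializable;
    select count(*) from doctors where on_call;  -- sees 2
T2: begin transaction isolation level serializable;
    select count(*) from doctors where on_call;  -- sees 2
T1: update doctors set on_call = false where name = 'alice';
    commit;  -- succeeds
T2: update doctors set on_call = false where name = 'bob';
    commit;  -- ERROR: could not serialize access due to read/write
             --        dependencies among transactions

Each transaction read rows the other wrote, so each has both an "in" and an "out" rw-conflict; no serial order could have produced both reads, and SSI aborts one of the transactions.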
Thanks for working your way through this patch. I'm certainly well aware that that's not a trivial task! I'm suffering through a bout of insomnia, so I'll respond to some of your high-level comments in hopes that serializability will help put me to sleep (as it often does). I'll leave the more detailed code comments for later when I'm actually looking at the code, or better yet Kevin will take care of them and I won't have to. ;-) On Tue, Jan 25, 2011 at 01:07:39AM -0800, Jeff Davis wrote: > At a high level, there is a nice conceptual simplicity. Let me try to > summarize it as I understand it: > * RW dependencies are detected using predicate locking. > * RW dependencies are tracked from the reading transaction (as an > "out") conflict; and from the writing transaction (as an "in" > conflict). > * Before committing a transaction, then by looking only at the RW > dependencies (and predicate locks) for current and past > transactions, you can determine if committing this transaction will > result in a cycle (and therefore a serialization anomaly); and if > so, abort it. This summary is right on. I would add one additional detail or clarification to the last point, which is that rather than checking for a cycle, we're checking for a transaction with both "in" and "out" conflicts, which every cycle must contain. > That's where the simplicity ends, however ;) Indeed! > Tracking of RW conflicts of current and past transactions is more > complex. Obviously, it doesn't keep _all_ past transactions, but only > ones that overlap with a currently-running transaction. It does all of > this management using SHMQueue. There isn't much of an attempt to > gracefully handle OOM here as far as I can tell, it just throws an error > if there's not enough room to track a new transaction (which is > reasonable, considering that it should be quite rare and can be > mitigated by increasing max_connections). If the OOM condition you're referring to is the same one from the following comment, then it can't happen: (Apologies if I've misunderstood what you're referring to.) > * In RegisterSerializableTransactionInt, if it doesn't get an sxact, it > goes into summarization. But summarization assumes that it has at least > one finished xact. Is that always true? If you have enough memory to > hold a transaction for each connection, plus max_prepared_xacts, plus > one, I think that's true. But maybe that could be made more clear? Yes -- the SerializableXact pool is allocated up front and it definitely has to be bigger than the number of possible active transactions. In fact, it's much larger: 10 * (MaxBackends + max_prepared_xacts) to allow some room for the committed transactions we still have to track. > * In RegisterSerializableTransactionInt(), for a RO xact, it considers > any concurrent RW xact a possible conflict. It seems like it would be > possible to know whether a RW transaction may have overlapped with any > committed RW transaction (in finishedLink queue), and only add those as > potential conflicts. Would that work? If so, that would make more > snapshots safe. Interesting idea. That's worth some careful thought. I think it's related to the condition that the RW xact needs to commit with a conflict out to a transaction earlier than the RO xact. My first guess is that this wouldn't make more transactions safe, but could detect safe snapshots faster. > * When a transaction finishes, then PID should probably be set to zero. > You only use it for waking up a deferrable RO xact waiting for a > snapshot, right? 
Correct. It probably wouldn't hurt to clear that field when releasing the transaction, but we don't use it after. > * I'm a little unclear on summarization and writing to the SLRU. I don't > see where it's determining that the predicate locks can be safely > released. Couldn't the oldest transaction still have relevant predicate > locks? When a SerializableXact gets summarized to the SLRU, its predicate locks aren't released; they're transferred to the dummy OldCommittedSxact. > I'll keep working on this patch. I hope I can be of some help getting > this committed, because I'm looking forward to this feature. And I > certainly think that you and Dan have applied the amount of planning, > documentation, and careful implementation necessary for a feature like > this. Hopefully my comments here will help clarify the patch. It's not lost on me that there's no shortage of complexity in the patch, so if you found anything particularly confusing we should probably add some documentation to README-SSI. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
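To put a number on the pool sizing Dan describes: with hypothetical settings of max_connections = 100 and max_prepared_transactions = 10, the SerializableXact pool would hold 10 * (100 + 10) = 1100 entries, so active transactions can never occupy more than about a tenth of the pool; the rest is available for tracking committed-but-still-interesting transactions.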
Thanks for the review, Jeff! Dan Ports wrote: > On Tue, Jan 25, 2011 at 01:07:39AM -0800, Jeff Davis wrote: >> At a high level, there is a nice conceptual simplicity. Let me >> try to summarize it as I understand it: >> * RW dependencies are detected using predicate locking. >> * RW dependencies are tracked from the reading transaction (as an >> "out") conflict; and from the writing transaction (as an "in" >> conflict). >> * Before committing a transaction, then by looking only at the RW >> dependencies (and predicate locks) for current and past >> transactions, you can determine if committing this transaction >> will result in a cycle (and therefore a serialization anomaly); >> and if so, abort it. > > This summary is right on. I would add one additional detail or > clarification to the last point, which is that rather than > checking for a cycle, we're checking for a transaction with both > "in" and "out" conflicts, which every cycle must contain. Yep. For the visual thinkers out there, the whole concept can be understood by looking at the jpeg file that's in the Wiki page: http://wiki.postgresql.org/images/e/eb/Serialization-Anomalies-in-Snapshot-Isolation.png >> * In RegisterSerializableTransactionInt(), for a RO xact, it >> considers any concurrent RW xact a possible conflict. It seems >> like it would be possible to know whether a RW transaction may >> have overlapped with any committed RW transaction (in >> finishedLink queue), and only add those as potential conflicts. >> Would that work? If so, that would make more snapshots safe. > > Interesting idea. That's worth some careful thought. I think it's > related to the condition that the RW xact needs to commit with a > conflict out to a transaction earlier than the RO xact. My first > guess is that this wouldn't make more transactions safe, but could > detect safe snapshots faster. Yes, that would work. It would lower one type of overhead a little and allow RO transactions to be released from SSI tracking earlier. The question is how to determine it without taking too much time scanning the finished transaction list for every active read write transaction every time you start a RO transaction. I don't think that it's a trivial enough issue to consider for 9.1; it's certainly one to put on the list to look at for 9.2. >> * When a transaction finishes, then PID should probably be set to >> zero. You only use it for waking up a deferrable RO xact waiting >> for a snapshot, right? > > Correct. It probably wouldn't hurt to clear that field when > releasing the transaction, but we don't use it after. I thought they might show up in the pid column of pg_locks, but I see they don't. Should they? If so, should we still see the pid after a commit, for as long as the lock survives? -Kevin
Jeff Davis <pgsql@j-davis.com> wrote: > For starters, the above structures require unlimited memory, while > we have fixed amounts of memory. The predicate locks are > referenced by a global shared hash table as well as per-process > SHMQueues. It can adapt memory usage downward in three ways: > * increase lock granularity -- e.g. change N page locks into a > table lock > * "summarization" -- fold multiple locks on the same object > across many old committed transactions into a single lock > * use the SLRU Those last two are related -- the summarization process takes the oldest committed-but-still-significant transaction and does two things with it: (1) We move predicate locks to the "dummy" transaction, kept just for this purpose. We combine locks against the same lock target, using the more recent commit point to determine when the resulting lock can be removed. (2) We use SLRU to keep track of the top level xid of the old committed transaction, and the earliest commit point of any transactions to which it had an out-conflict. The above reduces the information available to avoid serialization failure in certain corner cases, and is more expensive to access than the other structures, but it keeps us running in pessimal circumstances, even if at a degraded level of performance. > The heavy use of SHMQueue is quite reasonable, but for some reason > I find the API very difficult to read. I think it would help (at > least me) quite a lot to annotate (i.e. with a comment in the > struct) the various lists and links with the types of data being > held. We've tried to comment enough, but when you have your head buried in code you don't always recognize how mysterious something can look. Can you suggest some particular places where more comments would be helpful? > The actual checking of conflicts isn't quite so simple, either, > because we want to detect problems before the victim transaction > has done too much work. So, checking is done along the way in > several places, and it's a little difficult to follow exactly how > those smaller checks add up to the overall serialization-anomaly > check (the third point in my simple model). > > There are also optimizations around transactions declared READ > ONLY. Some of these are a little difficult to follow as well, and > I haven't followed them all. Yeah. After things were basically working correctly, in terms of not allowing any anomalies, we still had a lot of false positives. Checks around the order and timing of commits, as well as read-only status, helped bring these down. The infamous receipting example was my main guide here. There are 210 permutations in the way the statements can be interleaved, of which only six can result in anomalies. At first we only managed to commit a couple dozen permutations. As we added code to cover optimizations for various special cases the false positive rate dropped a little at a time, until that test hit 204 commits, six rollbacks. Although, all the tests in the dcheck target are useful -- if I made a mistake in implementing an optimization there would sometimes be just one or two of the other tests which would fail. Looking at which permutations got it right and which didn't was a really good way to figure out what I did wrong. > There is READ ONLY DEFERRABLE, which is a nice feature that waits > for a "safe" snapshot, so that the transaction will never be > aborted. 
*And* will not contribute to causing any other transaction to be rolled back, *and* dodges the overhead of predicate locking and conflict checking. Glad you like it! ;-) > Now, on to my code comments (non-exhaustive): > > * I see that you use a union as well as using "outLinks" to also > be a free list. I suppose you did this to conserve shared memory > space, right? Yeah, we tried to conserve shared memory space where we could do so without hurting performance. Some of it gets to be a little bit-twiddly, but it seemed like a good idea at the time. Does any of it seem like it creates a confusion factor which isn't worth it compared to the shared memory savings? > * Still some compiler warnings: > twophase.c: In function *FinishPreparedTransaction*: > twophase.c:1360: warning: implicit declaration of function > *PredicateLockTwoPhaseFinish* Ouch! That could cause bugs, since the implicit declaration didn't actually *match* the actual definition. Don't know how we missed that. I strongly recommend that anyone who wants to test 2PC with the patch add this commit to it: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=8e020d97bc7b8c72731688515b6d895f7e243e27 > * You have a function called CreatePredTran. We are already using > "Transaction" and "Xact" -- we probably don't need "Tran" as well. OK. Will rename if you like. Since that creates the PredTran structure, I assume you want that renamed, too? > * HS error has a typo: > "ERROR: cannot set transaction isolationn level to serializable > during recovery" Will fix. > I'll keep working on this patch. I hope I can be of some help > getting this committed, because I'm looking forward to this > feature. And I certainly think that you and Dan have applied the > amount of planning, documentation, and careful implementation > necessary for a feature like this. Thanks much! This effort was driven, for my part, by the needs of my employer; but I have to admit it was kinda fun to do some serious coding on innovative ideas again. It's been a while. I'm ready to kick back and party a bit when this gets done, though. ;-) -Kevin
On Tue, 2011-01-25 at 11:17 -0600, Kevin Grittner wrote: > > The heavy use of SHMQueue is quite reasonable, but for some reason > > I find the API very difficult to read. I think it would help (at > > least me) quite a lot to annotate (i.e. with a comment in the > > struct) the various lists and links with the types of data being > > held. > > We've tried to comment enough, but when you have your head buried in > code you don't always recognize how mysterious something can look. > Can you suggest some particular places where more comments would be > helpful? I think just annotating RWConflict.in/outLink and PredTranList.available/activeList with the types of things they hold would be a help. Also, you say something about RWConflict and "when the structure is not in use". Can you elaborate on that a bit? I'll address the rest of your comments in a later email. Regards, Jeff Davis
On Tue, 2011-01-25 at 09:41 -0600, Kevin Grittner wrote: > Yep. For the visual thinkers out there, the whole concept can be > understood by looking at the jpeg file that's in the Wiki page: > > http://wiki.postgresql.org/images/e/eb/Serialization-Anomalies-in-Snapshot-Isolation.png Yes, that helped a lot throughout the review process. Good job keeping it up-to-date! > Yes, that would work. It would lower one type of overhead a little > and allow RO transactions to be released from SSI tracking earlier. > The question is how to determine it without taking too much time > scanning the finished transaction list for every active read write > transaction every time you start a RO transaction. I don't think > that it's a trivial enough issue to consider for 9.1; it's certainly > one to put on the list to look at for 9.2. It's OK to leave it to 9.2. But if it's a RO deferrable transaction, it's just going to go to sleep in that case anyway; so why not look for an opportunity to get a safe snapshot right away? Regards, Jeff Davis
On 25.01.2011 05:30, Kevin Grittner wrote: > Attached. The readme says this: > + 4. PostgreSQL supports subtransactions -- an issue not mentioned > +in the papers. But I don't see any mention anywhere else on how subtransactions are handled. If a subtransaction aborts, are its predicate locks immediately released? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 25.01.2011 05:30, Kevin Grittner wrote: > The readme says this: >> 4. PostgreSQL supports subtransactions -- an issue not mentioned >> in the papers. > > But I don't see any mention anywhere else on how subtransactions > are handled. If a subtransaction aborts, are its predicate locks > immediately released? No. Here's the reasoning. Within a top level transaction, you might start a subtransaction, read some data, and then decide based on what you read that the subtransaction should be rolled back. If the decision as to what is part of the top level transaction can depend on what is read in the subtransaction, predicate locks taken by the subtransaction must survive rollback of the subtransaction. Does that make sense to you? Is there somewhere you would like to see that argument documented? -Kevin
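An illustrative session for the behavior Kevin describes (the table name is invented; pg_locks reports SSI's predicate locks with mode SIReadLock, as noted elsewhere in this thread):

begin transaction isolation level serializable;
savepoint s;
select * from accounts where id = 1;   -- takes SIRead predicate locks
rollback to savepoint s;               -- the subtransaction aborts ...
select locktype, relation, mode
  from pg_locks where mode = 'SIReadLock';
                                       -- ... but the locks are still shown
commit;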
Jeff Davis <pgsql@j-davis.com> wrote: > I think just annotating RWConflict.in/outLink and > PredTranList.available/activeList with the types of things they > hold would be a help. > > Also, you say something about RWConflict and "when the structure > is not in use". Can you elaborate on that a bit? Please let me know whether this works for you: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=325ec55e8c9e5179e2e16ff303af6afc1d6e732b -Kevin
Jeff Davis <pgsql@j-davis.com> wrote: > It's OK to leave it to 9.2. But if it's a RO deferrable > transaction, it's just going to go to sleep in that case anyway; > so why not look for an opportunity to get a safe snapshot right > away? If you're talking about doing this only for DEFERRABLE transactions it *might* make sense for 9.1. I'd need to look at what's involved. We make similar checks for all read only transactions, so they can withdraw from SSI while running, if their snapshot *becomes* safe. I don't think I'd want to consider messing with that code at this point. -Kevin
On 25.01.2011 22:53, Kevin Grittner wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> On 25.01.2011 05:30, Kevin Grittner wrote: > >> The readme says this: >>> 4. PostgreSQL supports subtransactions -- an issue not mentioned >>> in the papers. >> >> But I don't see any mention anywhere else on how subtransactions >> are handled. If a subtransaction aborts, are its predicate locks >> immediately released? > > No. Here's the reasoning. Within a top level transaction, you > might start a subtransaction, read some data, and then decide based > on what you read that the subtransaction should be rolled back. If > the decision as to what is part of the top level transaction can > depend on what is read in the subtransaction, predicate locks taken > by the subtransaction must survive rollback of the subtransaction. > > Does that make sense to you? Yes, that's what I suspected. And I gather that all the data structures in predicate.c work with top-level xids, not subxids. When looking at an xid that comes from a tuple's xmin or xmax, for example, you always call SubTransGetTopmostTransaction() before doing much else with it. > Is there somewhere you would like to > see that argument documented? README-SSI. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2011-01-24 at 21:30 -0600, Kevin Grittner wrote: > Dan and I have spent many, many hours desk-checking and testing, > including pounding for many hours in DBT-2 at high concurrency > factors on a 16 processor box. In doing so, we found and fixed a few > issues. Neither of us is aware of any bugs or deficiencies > remaining, even after a solid day of pounding in DBT-2, other than > the failure to extend any new functionality to hot standby. A couple of comments on this. I looked at the patch to begin a review and immediately saw "dtest". There are no docs to explain what it is, but a few comments fill me in a little more. Can we document that please? And/or explain why it's an essential part of this commit? Could we keep it out of core, or if not, just commit that part separately? I notice the code is currently copyrighted by someone other than PGDG. Pounding for hours on 16 CPU box sounds good. What diagnostics or instrumentation are included with the patch? How will we know whether pounding for hours is actually touching all relevant parts of code? I've done such things myself only to later realise I wasn't actually testing the right piece of code. When this runs in production, how will we know whether SSI is stuck or is consuming too much memory? e.g. Is there a dynamic view e.g. pg_prepared_xacts, or is there a log mode log_ssi_impact, etc?? -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Wed, 2011-01-26 at 11:36 +0000, Simon Riggs wrote: > Pounding for hours on 16 CPU box sounds good. What diagnostics or > instrumentation are included with the patch? How will we know whether > pounding for hours is actually touching all relevant parts of code? I've > done such things myself only to later realise I wasn't actually testing > the right piece of code. An example of this is the XIDCACHE_DEBUG code used in procarray.c to validate TransactionIdIsInProgress(). -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 25.01.2011 22:53, Kevin Grittner wrote: >> Is there somewhere you would like to >> see that argument documented? > > README-SSI . Done (borrowing some of your language). http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=470abc51c5626cf3c7cbd734b1944342973d0d47 Let me know if you think more is needed. -Kevin
Simon Riggs <simon@2ndQuadrant.com> wrote: > On Wed, 2011-01-26 at 11:36 +0000, Simon Riggs wrote: > >> Pounding for hours on 16 CPU box sounds good. What diagnostics or >> instrumentation are included with the patch? How will we know >> whether pounding for hours is actually touching all relevant >> parts of code? I've done such things myself only to later realise >> I wasn't actually testing the right piece of code. > > An example of this is the XIDCACHE_DEBUG code used in procarray.c > to validate TransactionIdIsInProgress(). It isn't exactly equivalent, but on a conceptually similar note some of the hours of DBT-2 pounding were done with #ifdef statements to force code into code paths which are normally rarely used. We left one of them in the codebase with the #define commented out, although I know that's not strictly necessary. (It does provide a convenient place to put a comment about what it's for, though.) In looking at it just now, I noticed that after trying it in a couple different places what was left in the repository was not the optimal version for code coverage. I've put this back to the version which did a better job, for reasons described in the commit comment: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=8af1bc84318923ba0ec3d4413f374a3beb10bc70 Dan, did you have some others which should maybe be included? I'm not sure I see any counts we could get from SSI which would be useful beyond what we might get from a code coverage tool or profiling, but I'm open to suggestions. -Kevin
Simon Riggs <simon@2ndQuadrant.com> wrote: > I looked at the patch to begin a review and immediately saw > "dtest". There are no docs to explain what it is, but a few comments > fill me in a little more. Can we document that please? And/or > explain why it's an essential part of this commit? Could we keep it > out of core, or if not, just commit that part separately? I notice > the code is currently copyrighted by someone other than PGDG. I'm including Markus on this reply, because he's the only one who can address the copyright issue. I will say that while the dcheck make target is not required for a proper build, and the tests run for too long to consider including this in the check target, I would not recommend that anybody hack on SSI without regularly running these tests or some equivalent. When I suggested breaking this out of the patch, everybody who spoke up said not to do so. How the eventual committer commits it is of course up to that person. > Pounding for hours on 16 CPU box sounds good. What diagnostics or > instrumentation are included with the patch? How will we know > whether pounding for hours is actually touching all relevant parts > of code? I've done such things myself only to later realise I > wasn't actually testing the right piece of code. We've looked at distributions of failed transactions by ereport statement. This confirms that we are indeed exercising the vast majority of the code. See separate post for how we pushed execution into the summarization path to test code related to that. I do have some concern that the 2PC testing hasn't yet covered all code paths. I don't see how the problem found by Jeff during review could have survived thorough testing there. > When this runs in production, how will we know whether SSI is > stuck Stuck? I'm not sure what you mean by that. Other than LW locking (which I believe is always appropriately brief, with rules for order of acquisition which should prevent deadlocks), the only blocking introduced by SSI is when there is an explicit request for DEFERRABLE READ ONLY transactions. Such a transaction may take a while to start. Is that what you're talking about? > or is consuming too much memory? Relevant shared memory is allocated at startup, with strategies for living within that as suggested by Heikki and summarized in a recent post by Jeff. It's theoretically possible to run out of certain objects, in which case there is an ereport, but we haven't seen anything like that since the mitigation and graceful degradation changes were implemented. > e.g. Is there a dynamic view e.g. pg_prepared_xacts, We show predicate locks in the pg_locks view with mode SIReadLock. > is there a log mode log_ssi_impact, etc?? Don't have that. What would you expect that to show? -Kevin
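As a sketch of what production monitoring could look like with just that view, a query over the documented pg_locks columns shows how much predicate-lock state SSI is carrying and at what granularity (promoted locks appear as coarser entries, with page and tuple left NULL):

select locktype, database, relation, page, tuple, virtualtransaction
  from pg_locks
 where mode = 'SIReadLock';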
Kevin, thanks for your heads-up. On 01/26/2011 06:07 PM, Kevin Grittner wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: > >> I looked at the patch to begin a review and immediately saw >> "dtest". There are no docs to explain what it is, but a few comments >> fill me in a little more. Can we document that please? And/or >> explain why it's an essential part of this commit? Could we keep it >> out of core, or if not, just commit that part separately? I notice >> the code is currently copyrighted by someone other than PGDG. > > I'm including Markus on this reply, because he's the only one who > can address the copyright issue. I certainly agree to change the copyright notice for my parts of the code in Kevin's SSI to say "Copyright ... Postgres Global Development Group". However, it's also worth mentioning that 'make dcheck' requires my dtester python package to be installed. See [1]. I put that under the Boost license, which seems very similar to the Postgres license. > When I suggested breaking this out of the patch, everybody who spoke > up said not to do so. How the eventual committer commits it is of > course up to that person. If the committer decides not to commit the dtests, I'm happy to add these dtests to the "official" postgres-dtests repository [2]. There I could let it follow the development of dtester. If Kevin's dtests get committed, either dtester needs to be backwards compatible or the Postgres sources need to follow development of dtester, which I'm not satisfied with, yet. (However, note that I didn't have any time to work on dtester recently, so that is somewhat hypothetical anyway). If you have any more questions, I'm happy to help. Regards Markus Wanner [1]: dtester project site: http://www.bluegap.ch/projects/dtester/ [2]: postgres dtests: http://git.postgres-r.org/?p=postgres-dtests;a=summary
On Wed, 2011-01-26 at 11:07 -0600, Kevin Grittner wrote: > > When this runs in production, how will we know whether SSI is > > stuck > > Stuck? I'm not sure what you mean by that. Other than LW locking > (which I believe is always appropriately brief, with rules for order > of acquisition which should prevent deadlocks), the only blocking > introduced by SSI is when there is an explicit request for > DEFERRABLE READ ONLY transactions. Such a transaction may take a > while to start. Is that what you're talking about? I'm thinking of a general requirement for diagnostics. What has been done so far looks great to me, so talking about this subject is in no way meant to be negative. Everything has bugs and we find them quicker if there are some ways of getting out more information about what is happening in the guts of the system. For HS, I put in a special debug mode and various information functions. For HOT, I produced a set of page inspection functions. I'm keen to have some ways where we can find out various metrics about what is happening, so we can report back to you to check if there are bugs. I foresee that some applications will be more likely to generate serialization errors than others. People will ask questions and they may claim there are bugs. I would like to be able to say "there is no bug - look at XYZ and see that the errors you are getting are because of ABC". > > or is consuming too much memory? > > Relevant shared memory is allocated at startup, with strategies for > living within that as suggested by Heikki and summarized in a recent > post by Jeff. It's theoretically possible to run out of certain > objects, in which case there is an ereport, but we haven't seen > anything like that since the mitigation and graceful degradation > changes were implemented. > > > e.g. Is there a dynamic view e.g. pg_prepared_xacts, > > We show predicate locks in the pg_locks view with mode SIReadLock. OK, that's good. > > is there a log mode log_ssi_impact, etc?? > > Don't have that. What would you expect that to show? > > -Kevin -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Excerpts from Kevin Grittner's message of Wed Jan 26 14:07:18 -0300 2011: > Simon Riggs <simon@2ndQuadrant.com> wrote: > > Pounding for hours on 16 CPU box sounds good. What diagnostics or > > instrumentation are included with the patch? How will we know > > whether pounding for hours is actually touching all relevant parts > > of code? I've done such things myself only to later realise I > > wasn't actually testing the right piece of code. > > We've looked at distributions of failed transactions by ereport > statement. This confirms that we are indeed exercising the vast > majority of the code. See separate post for how we pushed execution > into the summarization path to test code related to that. BTW did you try "make coverage" and friends? See http://www.postgresql.org/docs/9.0/static/regress-coverage.html and http://developer.postgresql.org/~petere/coverage/ -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> wrote: > BTW did you try "make coverage" and friends? See > http://www.postgresql.org/docs/9.0/static/regress-coverage.html > and > http://developer.postgresql.org/~petere/coverage/ I had missed that; thanks for pointing it out! I'm doing a coverage build now, to see what coverage we get from `make check` (probably not much) and `make dcheck`. Dan, do you still have access to that machine you were using for the DBT-2 runs? Could we get a coverage run with and without TEST_OLDSERXID defined? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > Alvaro Herrera <alvherre@commandprompt.com> wrote: > >> BTW did you try "make coverage" and friends? See >> http://www.postgresql.org/docs/9.0/static/regress-coverage.html >> and >> http://developer.postgresql.org/~petere/coverage/ > > I had missed that; thanks for pointing it out! > > I'm doing a coverage build now, to see what coverage we get from > `make check` (probably not much) and `make dcheck`. Well, that was a bit better than I expected. While the overall code coverage for PostgreSQL source code is: Overall coverage rate: lines......: 64.8% (130296 of 201140 lines) functions..: 72.0% (7997 of 11109 functions) The coverage for predicate.c, after running both check and dcheck, was (formatted to match above): lines......: 69.8% (925 of 1325 lines) functions..: 85.7% (48 of 56 functions) Most of what was missed was in the SLRU or 2PC code, which is expected. I'll bet that the DBT-2 runs, between the "normal" and TEST_OLDSERXID flavors, would get us about 2/3 of the way from those numbers toward 100%, with almost all the residual being in 2PC. Does anyone have suggestions for automated 2PC tests? -Kevin
On Wed, Jan 26, 2011 at 10:01:28AM -0600, Kevin Grittner wrote: > In looking at it just now, I noticed that after trying it in a > couple different places what was left in the repository was not the > optimal version for code coverage. I've put this back to the > version which did a better job, for reasons described in the commit > comment: Isn't this placement the same as the version we had before that didn't work? Specifically, aren't we going to have problems running with TEST_OLDSERXID enabled because CreatePredTran succeeds and links a new SerializableXact into the active list, but we don't initialize it before we drop SerializableXactHashLock to call SummarizeOldestCommittedSxact? I seem to recall SummarizeOldestCommittedSxact failing before because of the uninitialized entry, but more generally since we drop the lock something else might scan the list. (This isn't a problem in the non-test case because we'd only do that if CreatePredTran fails.) Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
On Wed, Jan 26, 2011 at 01:42:23PM -0600, Kevin Grittner wrote: > Dan, do you still have access to that machine you were using for the > DBT-2 runs? Could we get a coverage run with and without > TEST_OLDSERXID defined? Sure, I'll give it a shot (once I figure out how to enable coverage...) Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote: > Isn't this placement the same as the version we had before that > didn't work? You're right. How about this?: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=86b839291e2588e59875fb87d05432f8049575f6 Same benefit in terms of exercising more lines of code, but *without* exposing the uninitialized structure to other threads. -Kevin
I wrote: > You're right. How about this?: That's even worse. I'm putting back to where you had it and taking a break before I do anything else that dumb. -Kevin
On Wed, Jan 26, 2011 at 02:36:25PM -0600, Kevin Grittner wrote: > Same benefit in terms of exercising more lines of code, but > *without* exposing the uninitialized structure to other threads. Won't this cause a deadlock because locks are being acquired out of order? Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
I wrote: > While the overall code coverage for PostgreSQL source code is: > > Overall coverage rate: > lines......: 64.8% (130296 of 201140 lines) > functions..: 72.0% (7997 of 11109 functions) By the way, I discovered that the above is lower if you just run the check target. The dcheck target helps with overall PostgreSQL code coverage, even though it was targeted at the SSI code. > The coverage for predicate.c, after running both check and dcheck, > was (formatted to match above): > > lines......: 69.8% (925 of 1325 lines) > functions..: 85.7% (48 of 56 functions) Some minor tweaks to the regression tests boost that to:

  lines......: 73.1% (968 of 1325 lines)
  functions..: 89.3% (50 of 56 functions)

Most of the code not covered in the regular build (above) is in the OldSerXidXxxxx functions, which are covered well in a build with TEST_OLDSERXID defined. The 2PC code is very well covered now, except for the recovery-time function. We don't have a way to test that in the `make check` process, do we? There is some code which is not covered just because it is so hard to hit -- for example, code which is only executed if vacuum cleans up an index page when we are right at the edge of running out of the memory used to track predicate locks. It would be hard to include a test for that in the normal regression tests. The regression test changes are here: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=d4c1005d731c81049cc2622e50b7a2ebb99bbcac -Kevin
On Tue, 2011-01-25 at 05:57 -0500, Dan Ports wrote:

> This summary is right on. I would add one additional detail or
> clarification to the last point, which is that rather than checking for
> a cycle, we're checking for a transaction with both "in" and "out"
> conflicts, which every cycle must contain.

To clarify, this means that it will get some false positives, right? For instance:

T1:
  get snapshot

T2:
  get snapshot
  insert R1
  commit

T1:
  read R1
  write R2

T3:
  get snapshot
  read R2

T3:
  commit

T1:
  commit -- throws error

T1 has a conflict out to T2, and T1 has a conflict in from T3. T2 has a conflict in from T1. T3 has a conflict out to T1.

T1 is canceled because it has both a conflict in and a conflict out. But the results are the same as a serial order of execution: T3, T1, T2.

Is there a reason we can't check for a real cycle, which would let T1 succeed?

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:

> To clarify, this means that it will get some false positives,
> right?

Yes. But the example you're about to get into isn't one of them.

> For instance:
>
> T1:
>   get snapshot
>
> T2:
>   get snapshot
>   insert R1
>   commit
>
> T1:
>   read R1
>   write R2
>
> T3:
>   get snapshot
>   read R2
>
> T3:
>   commit
>
> T1:
>   commit -- throws error
>
> T1 has a conflict out to T2, and T1 has a conflict in from T3.
> T2 has a conflict in from T1.
> T3 has a conflict out to T1.
>
> T1 is canceled because it has both a conflict in and a conflict
> out. But the results are the same as a serial order of execution:
> T3, T1, T2.
>
> Is there a reason we can't check for a real cycle, which would let
> T1 succeed?

Yes. Because T2 committed before T3 started, it's entirely possible that there is knowledge outside the database server that the work of T2 was done and committed before the start of T3, which makes the order of execution: T2, T3, T1, T2. So you can have anomalies.

Let me give you a practical example. Pretend there are receipting terminals in public places for the organization. In most financial systems, those receipts are assigned to batches of some type. Let's say that's done by an insert for the new batch ID, which closes the old batch. Receipts are always added with the maximum batch ID, reflecting the open batch. Your above example could be:

-- setup
test=# create table ctl (batch_id int not null primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "ctl_pkey" for table "ctl"
CREATE TABLE
test=# create table receipt (batch_id int not null, amt numeric(13,2) not null);
CREATE TABLE
test=# insert into ctl values (1),(2),(3);
INSERT 0 3
test=# insert into receipt values ((select max(batch_id) from ctl), 50),((select max(batch_id) from ctl), 100);
INSERT 0 2

-- receipting workstation
-- T1 starts working on receipt insert transaction
test=# begin transaction isolation level repeatable read;
BEGIN
test=# select 1;  -- to grab snapshot, per above
 ?column?
----------
        1
(1 row)

-- accounting workstation
-- T2 closes old receipt batch; opens new
test=# begin transaction isolation level repeatable read;
BEGIN
test=# insert into ctl values (4);
INSERT 0 1
test=# commit;
COMMIT

-- receipting workstation
-- T1 continues work on receipt
test=# select max(batch_id) from ctl;
 max
-----
   3
(1 row)

test=# insert into receipt values (3, 1000);
INSERT 0 1

-- accounting workstation
-- T3 lists receipts from the closed batch
-- (Hey, we inserted a new batch_id and successfully
-- committed, right? The old batch is closed.)
test=# begin transaction isolation level repeatable read;
BEGIN
test=# select * from receipt where batch_id = 3;
 batch_id |  amt
----------+--------
        3 |  50.00
        3 | 100.00
(2 rows)

test=# commit;
COMMIT

-- receipting workstation
-- T1 receipt insert transaction commits
test=# commit;
COMMIT

Now, with serializable transactions, as you saw, T1 will be rolled back. With a decent software framework, it will be automatically retried, without any user interaction. It will select max(batch_id) of 4 this time, and the insert will succeed and be committed. Accounting's list is accurate.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > Now, with serializable transactions, as you saw, T1 will be rolled > back. I should probably have mentioned, that if all the transactions were SERIALIZABLE and the report of transactions from the batch was run as SERIALIZABLE READ ONLY DEFERRABLE, the start of the report would block until it was certain that it had a snapshot which could not lead to an anomaly, so the BEGIN for T3 would wait until the COMMIT of T1, get a new snapshot which it would determine to be safe, and proceed. This would allow that last receipt to land in batch 3 and show up on accounting's receipt list with no rollbacks *or* anomalies. -Kevin
On Tue, 2011-01-25 at 15:22 -0600, Kevin Grittner wrote: > Jeff Davis <pgsql@j-davis.com> wrote: > > > I think just annotating RWConflict.in/outLink and > > PredTranList.available/activeList with the types of things they > > hold would be a help. > > > > Also, you say something about RWConflict and "when the structure > > is not in use". Can you elaborate on that a bit? > > Please let me know whether this works for you: > > http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=325ec55e8c9e5179e2e16ff303af6afc1d6e732b Looks good. Jeff
On Thu, Jan 27, 2011 at 09:18:23AM -0800, Jeff Davis wrote: > On Tue, 2011-01-25 at 05:57 -0500, Dan Ports wrote: > > This summary is right on. I would add one additional detail or > > clarification to the last point, which is that rather than checking for > > a cycle, we're checking for a transaction with both "in" and "out" > > conflicts, which every cycle must contain. > > To clarify, this means that it will get some false positives, right? Yes, this is correct. > Is there a reason we can't check for a real cycle, which would let T1 > succeed? I talked today with someone who experimented with doing exactly that in MySQL as part of a research project (Perfectly Serializable Snapshot Isolation, paper forthcoming in ICDE). It confirmed my intuition that this is possible but not as straightforward as it sounds. Some complications I thought of in adapting that to what we're doing:

1) it requires tracking all edges in the serialization graph; besides the rw-conflicts we track there are also the more mundane wr-dependencies (if T2 sees an update made by T1, then T2 has to come after T1) and ww-dependencies (if T2's write modifies a tuple created by T1, then T2 has to come after T1). We are taking advantage of the fact that any cycle must have two adjacent rw-conflict edges, but the rest of the cycle can be composed of other types. It would certainly be possible to track the others, but it would add a bit more complexity and use more memory.

2) it requires doing a depth-first search to check for a cycle, which is more expensive than just detecting two edges. That seems OK if you only want to check it on commit, but maybe not if you want to detect serialization failures at the time of the conflicting read/write (as we do).

3) it doesn't seem to interact very well with our memory mitigation efforts, wherein we discard lots of data about committed transactions that we know we won't need. If we were checking for cycles, we'd need to keep more of it. I suspect summarization (when we combine two transactions' predicate locks to save memory) would also cause problems.

None of these seem particularly insurmountable, but they suggest it isn't a clear win to try to find an entire cycle. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
> Dan Ports wrote: > On Thu, Jan 27, 2011 at 09:18:23AM -0800, Jeff Davis wrote: > >> Is there a reason we can't check for a real cycle, which would let >> T1 succeed? > > I talked today with someone who experimented with doing exactly > that in MySQL as part of a research project (Perfectly Serializable > Snapshot Isolation, paper forthcoming in ICDE). I'm wondering how this differs from what is discussed in Section 2.7 ("Serialization Graph Testing") of Cahill's doctoral thesis. That discusses a technique for trying to avoid false positives by testing the full graph for cycles, with the apparent conclusion that the costs of doing so are prohibitive. The failure of attempts to implement that technique seems to have been part of the impetus to develop the SSI technique, where a particular structure involving two to three transactions has been proven to exist in all graphs which form such a cycle. I've been able to identify four causes for false positives in the current SSI implementation:

(1) Collateral reads. In particular, data skew caused by inserts past the end of an index between an ANALYZE and subsequent data access was a recurring source of performance complaints. This was fixed by having the planner probe the ends of an index to correct the costs in such situations. This has been effective at correcting the target problem, but we haven't yet excluded such index probes from predicate locking.

(2) Lock granularity. This includes vertical granularity issues, such as granularity promotion to conserve shared memory and page locking versus next-key locking; and it also includes the possibility of column locking. Many improvements are still possible in this area; although each should be treated as a performance enhancement patch and subject to the appropriate performance testing to justify the extra code and complexity. In some cases the work needed to effect the reduction in serialization failures may cost more than retrying the failed transactions.

(3) Dependencies other than read-write. SSI relies on current PostgreSQL MVCC handling of write-write conflicts -- when one of these occurs between two transactions running at this level, one of the transactions must fail with a serialization failure; since only one runs to successful completion, there is no question of which ran *first*. Write-read dependencies (where one transaction *can* see the work of another because they *don't* overlap) are handled in SSI by assuming that if T1 commits before T2 acquires its snapshot, T1 will appear to have run before T2. I won't go into it at length on this thread, but one has to be very careful about assuming anything else; trying to explicitly track these conflicts and consider that T2 may have appeared to run before T1 can fall apart completely in the face of some common and reasonable programming practices.

(4) Length of read-write dependency (a/k/a rw-conflict) chains. Basically, it is a further extension from the Cahill papers in the direction of work which apparently failed because of the overhead of full cycle-checking. The 2008 paper (which was "best paper" at the 2008 ACM SIGMOD) just used inConflict and outConflict flag bits. The 2009 paper extended this to pointers by those names, with NULL meaning "no conflict", and a self-reference meaning "couldn't track the detail; consider it to conflict with *all* concurrent transactions". The latter brought the false positive rate down without adding too much overhead. In the PostgreSQL effort, we replaced those pointers with *lists* of conflicts.
We only get to "conflict with all concurrent transactions" in certain circumstances after summarizing transaction data to avoid blowing out shared memory. The lists allowed avoidance of many false positives, facilitated earlier cleanup of much information from shared memory, and led to a cleaner implementation of the summarization logic. They also, as it happens, provide enough data to fully trace the read-write dependencies and avoid some false positives where the "dangerous structure" SSI is looking for exists, but there is neither a complete rw-conflict cycle, nor any transaction in the graph which committed early enough to make a write-read conflict possible to any transaction in the graph. Whether such rigorous tracing prevents enough false positives to justify the programming effort, code complexity, and run-time cost is anybody's guess. I only raise these to clarify the issue for Jeff (who is reviewing the patch), since he asked. I strongly feel that none of them are issues which need to be addressed for 9.1, nor do I think they can be properly addressed within the time frame of this CF. On the other hand, perhaps an addition to the README-SSI file is warranted? -Kevin
1. In CheckForSerializableConflictIn(), I think the comment above may be out of date. It says: "A tuple insert is in conflict only if there is a predicate lock against the entire relation." That doesn't appear to be true if, for example, there's a predicate lock on the index page that the tuple goes into. I examined it with gdb, and it calls the function, and the function does identify the conflict.

2. Also in the comment above CheckForSerializableConflictIn(), I see: "The call to this function also indicates that we need an entry in the serializable transaction hash table, so that this write's conflicts can be detected for the proper lifetime, which is until this transaction and all overlapping serializable transactions have completed." which doesn't make sense to me. The transaction should already have an entry in the hash table at this point, right?

3. The comment above CheckForSerializableConflictOut() seems to trail off, as though you may have meant to write more. It also seems to be out of date. And why are you reading the infomask directly? Do the existing visibility functions not suffice?

I have made it through predicate.c, and I have a much better understanding of what it's actually doing. I can't claim that I have a clear understanding of everything involved, but the code looks like it's in good shape (given the complexity involved) and well-commented. I am marking the patch Ready For Committer, because any committer will need time to go through the patch; and the authors have clearly applied the thought, care, and testing required for something of this complexity. I will continue to work on it, though. The biggest issue on my mind is what to do about Hot Standby. The authors have a plan, but there is also some resistance to it: http://archives.postgresql.org/message-id/23698.1295566621@sss.pgh.pa.us We don't need a perfect solution for 9.1, but it would be nice if we had a viable plan for 9.2.

Regards,
Jeff Davis
On Sun, Jan 30, 2011 at 04:01:56PM -0600, Kevin Grittner wrote:
> I'm wondering how this differs from what is discussed in Section 2.7
> ("Serialization Graph Testing") of Cahill's doctoral thesis. That
> discusses a technique for trying to avoid false positives by testing
> the full graph for cycles, with the apparent conclusion that the
> costs of doing so are prohibitive. The failure of attempts to
> implement that technique seems to have been part of the impetus to
> develop the SSI technique, where a particular structure involving two
> to three transactions has been proven to exist in all graphs which
> form such a cycle.

I'm not sure. My very limited understanding is that people have tried
to do concurrency control via serialization graph testing but it's
(at least thought to be) too expensive to actually use. This seems to
be saying the opposite of that, so there must be some difference...

> I've been able to identify four causes for false positives in the
> current SSI implementation:
>
> (1) Collateral reads. In particular, data skew caused by inserts
> past the end of an index between an ANALYZE and subsequent data
> access was a recurring source of performance complaints. This was
> fixed by having the planner probe the ends of an index to correct the
> costs in such situations. This has been effective at correcting the
> target problem, but we haven't yet excluded such index probes from
> predicate locking.

I wasn't aware of this one (which probably means you mentioned it at
some point and I dropped that packet). Seems like it would not be too
difficult to exclude these -- for 9.2.

> (3) Dependencies other than read-write.
[...]
> one has to be very careful about assuming anything
> else; trying to explicitly track these conflicts and consider that T2
> may have appeared to run before T1 can fall apart completely in the
> face of some common and reasonable programming practices.

Yes. If you want to do precise cycle testing you'll have to track
these dependencies also, and I believe that would require quite a
different design from what we're doing.

> (4) Length of read-write dependency (a/k/a rw-conflict) chains.
[...]
> They also, as it
> happens, provide enough data to fully trace the read-write
> dependencies and avoid some false positives where the "dangerous
> structure" SSI is looking for exists, but there is neither a complete
> rw-conflict cycle, nor any transaction in the graph which committed
> early enough to make a write-read conflict possible to any
> transaction in the graph. Whether such rigorous tracing prevents
> enough false positives to justify the programming effort, code
> complexity, and run-time cost is anybody's guess.

I think I understand what you're getting at here, and it does sound
quite complicated for a benefit that is not clear.

> I only raise these to clarify the issue for Jeff (who is
> reviewing the patch), since he asked. I strongly feel that none of
> them are issues which need to be addressed for 9.1, nor do I think
> they can be properly addressed within the time frame of this CF.

Absolutely, no question about it!

Dan

-- 
Dan R. K. Ports        MIT CSAIL        http://drkp.net/
Jeff Davis wrote:

> 1. In CheckForSerializableConflictIn(), I think the comment above
> may be out of date. It says:

> 2. Also in the comment above CheckForSerializableConflictIn(), I
> see:

> 3. The comment above CheckForSerializableConflictOut() seems to
> trail off, as though you may have meant to write more. It also
> seems to be out of date.

Will fix and post a patch version 15, along with the other fixes
based on feedback to v14 (mostly to comments and docs) within a few
hours. I'll also do that global change from using "tran" as an
abbreviation for transaction in some places to consistently using
xact whenever it is abbreviated.

> And why are you reading the infomask directly? Do the existing
> visibility functions not suffice?

It's possible we re-invented some code somewhere, but I'm not clear
on what code from this patch might use what existing function. Could
you provide specifics?

> I have made it through predicate.c, and I have a much better
> understanding of what it's actually doing. I can't claim that I
> have a clear understanding of everything involved, but the code
> looks like it's in good shape (given the complexity involved) and
> well-commented.

Thanks! I know that's a lot of work, and I appreciate you pointing
out where comments have not kept up with coding.

> I am marking the patch Ready For Committer, because any committer
> will need time to go through the patch; and the authors have
> clearly applied the thought, care, and testing required for
> something of this complexity. I will continue to work on it,
> though.

Thanks!

> The biggest issue on my mind is what to do about Hot Standby. The
> authors have a plan, but there is also some resistance to it:
>
> http://archives.postgresql.org/message-id/23698.1295566621@sss.pgh.pa.us
>
> We don't need a perfect solution for 9.1, but it would be nice if
> we had a viable plan for 9.2.

I don't recall any real opposition to what I sketched out in this
post, which came after the above-referenced one:

http://archives.postgresql.org/message-id/4D39D5EC0200002500039A19@gw.wicourts.gov

Also, that opposition appears to be based on a misunderstanding of
the first alternative, which was for sending at most one snapshot per
commit or rollback of a serializable read write transaction, with
possible throttling. The alternative needs at most two bits per
commit or rollback of a serializable read write transaction; although
I haven't checked whether that can be scared up without adding a
whole byte. Read only transactions have nothing to do with the
traffic under either alternative.

-Kevin
Jeff Davis <pgsql@j-davis.com> wrote:

> 1. [& 2.] In CheckForSerializableConflictIn()

All of that was obsolete and could just be deleted. I did.

> 3. The comment above CheckForSerializableConflictOut()

I reworked the comment there. Hopefully it is now more clear.

> I am marking the patch Ready For Committer

Patch v15 is attached with fixes for all issues identified in v14.
There was one (two-line) bug fix. No other logic changes. We had an
addition to the README-SSI file, comment changes, doc changes,
changes to the text of a few messages, and some structure/field
renames to avoid using Tran as an abbreviation for transaction in
addition to the use of Xact as an abbreviation.

Pretty minimal differences from V14, but I figured it would save the
committer some work if I rolled them all up here.

-Kevin
Attachment
On Mon, Jan 31, 2011 at 12:31 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Pretty minimal differences from V14, but I figured it would save the > committer some work if I rolled them all up here. Sounds good. I believe Heikki is planning to work on this one. Hopefully that will happen soon, since we are running short on time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, 2011-01-31 at 07:26 -0600, Kevin Grittner wrote:
> > And why are you reading the infomask directly? Do the existing
> > visibility functions not suffice?
>
> It's possible we re-invented some code somewhere, but I'm not clear
> on what code from this patch might use what existing function. Could
> you provide specifics?

In CheckForSerializableConflictOut(), it takes a boolean "valid".
Then within the function, it tries to differentiate:

1. Valid with no indication that it will be deleted.
2. Valid, but delete in progress
3. Invalid

For #1, you are using the hint bit (not the real transaction status),
and manually checking whether it's just a lock or a real delete. For
#2 you are assuming any other xmax means that the transaction is in
progress (which it may not be, because the hint bit might not be set
for some time). I assume that will cause additional false positives.

If you used HeapTupleSatisfiesVacuum(), you could do something like:

    case HEAPTUPLE_LIVE:
        return;
    case HEAPTUPLE_RECENTLY_DEAD:
    case HEAPTUPLE_DELETE_IN_PROGRESS:
        xid = HeapTupleHeaderGetXmax(tuple->t_data);
        break;
    case HEAPTUPLE_INSERT_IN_PROGRESS:
        xid = HeapTupleHeaderGetXmin(tuple->t_data);
        break;
    case HEAPTUPLE_DEAD:
        return;

This is not identical to what's happening currently, and I haven't
thought this through thoroughly yet. For instance, "recently dead and
invalid" would be checking on the xmax instead of the xmin. Perhaps
you could exit early in that case (if you still keep the "valid"
flag), but that will happen soon enough anyway.

I don't think this function really cares about the visibility with
respect to the current snapshot, right? It really cares about what
other transactions are interacting with the tuple and how. And I
think HTSV meets that need a little better.

> > The biggest issue on my mind is what to do about Hot Standby. The
> > authors have a plan, but there is also some resistance to it:
> >
> > http://archives.postgresql.org/message-id/23698.1295566621@sss.pgh.pa.us
> >
> > We don't need a perfect solution for 9.1, but it would be nice if
> > we had a viable plan for 9.2.
>
> I don't recall any real opposition to what I sketched out in this
> post, which came after the above-referenced one:
>
> http://archives.postgresql.org/message-id/4D39D5EC0200002500039A19@gw.wicourts.gov
>
> Also, that opposition appears to be based on a misunderstanding of
> the first alternative, which was for sending at most one snapshot per
> commit or rollback of a serializable read write transaction, with
> possible throttling. The alternative needs at most two bits per
> commit or rollback of a serializable read write transaction; although
> I haven't checked whether that can be scared up without adding a
> whole byte. Read only transactions have nothing to do with the
> traffic under either alternative.

Ok, great. When I read that before I thought that WAL might need to
be sent for implicit RO transactions. I will read it more carefully
again.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2011-01-31 at 07:26 -0600, Kevin Grittner wrote:
>>> And why are you reading the infomask directly? Do the existing
>>> visibility functions not suffice?
>>
>> It's possible we re-invented some code somewhere, but I'm not
>> clear on what code from this patch might use what existing
>> function. Could you provide specifics?
>
> In CheckForSerializableConflictOut(), it takes a boolean "valid".

Ah, now I see what you're talking about. Take a look at where that
"valid" flag comes from -- the CheckForSerializableConflictOut calls
are all placed right after calls to HeapTupleSatisfiesVisibility.
The "valid" value is what HeapTupleSatisfiesVisibility returned. Is
it possible that the hint bits will not be accurate right after that?
With that in mind, do you still see a problem with how things are
currently done?

-Kevin
On Mon, 2011-01-31 at 13:32 -0600, Kevin Grittner wrote:
> Ah, now I see what you're talking about. Take a look at where that
> "valid" flag comes from -- the CheckForSerializableConflictOut calls
> are all placed right after calls to HeapTupleSatisfiesVisibility.
> The "valid" value is what HeapTupleSatisfiesVisibility returned. Is
> it possible that the hint bits will not be accurate right after that?
> With that in mind, do you still see a problem with how things are
> currently done?

Oh, ok. The staleness of the hint bit was a fairly minor point
though. Really, I think this should be using HTSV to separate
concerns better and improve readability. My first reaction was to try
to find out what the function was doing that's special. If it is
doing something special, and HTSV is not what you're really looking
for, a comment to explain would be helpful.

As an example, consider that Robert Haas recently suggested using an
infomask bit to mean frozen, rather than actually removing the xid,
to save the xid as forensic information. If that were to happen, your
code would be reading an xid that may have been re-used.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote: > I don't think this function really cares about the visibility with > respect to the current snapshot, right? What it cares about is whether some other particular top level transaction wrote a tuple which we *would* read except that it is not visible to us because that other top level transaction is concurrent with ours. If so, we want to flag a read-write conflict out from our transaction and in to that other transaction. -Kevin
Jeff Davis <pgsql@j-davis.com> wrote: > Really, I think this should be using HTSV to separate concerns > better and improve readability. My first reaction was to try to > find out what the function was doing that's special. If it is > doing something special, and HTSV is not what you're really > looking for, a comment to explain would be helpful. It does seem that at least a comment would be needed. I'm not at all confident that there isn't some macro or function which would yield what I need. I just sent an email clarifying exactly what I want to check, so if you can see a better way to determine that, I'm all ears. > As an example, consider that Robert Haas recently suggested using > an infomask bit to mean frozen, rather than actually removing the > xid, to save the xid as forensic information. If that were to > happen, your code would be reading an xid that may have been > re-used. Yeah, clearly if the code remains as it is, it would be sensitive to changes in how hint bits or the xid values are used. If we can abstract that, it's clearly a Good Thing to do so. -Kevin
On Mon, 2011-01-31 at 13:55 -0600, Kevin Grittner wrote:
> Jeff Davis <pgsql@j-davis.com> wrote:
>
>> I don't think this function really cares about the visibility with
>> respect to the current snapshot, right?
>
> What it cares about is whether some other particular top level
> transaction wrote a tuple which we *would* read except that it is
> not visible to us because that other top level transaction is
> concurrent with ours.

Or a tuple that you *are* reading, but is being deleted concurrently,
right? Or has been deleted by an overlapping transaction?

> If so, we want to flag a read-write conflict
> out from our transaction and in to that other transaction.

It still seems like HTSV would suffice, unless I'm missing something.

I think "visible" is still needed though: it matters in the cases
HEAPTUPLE_RECENTLY_DEAD and HEAPTUPLE_LIVE. For the former, it only
allows an early exit (if !visible); but for the latter, I think it's
required.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2011-01-31 at 13:55 -0600, Kevin Grittner wrote:
>> What it cares about is whether some other particular top level
>> transaction wrote a tuple which we *would* read except that it is
>> not visible to us because that other top level transaction is
>> concurrent with ours.
>
> Or a tuple that you *are* reading, but is being deleted
> concurrently, right? Or has been deleted by an overlapping
> transaction?

Right. I guess that wasn't as precise a statement as I thought. I
was trying to say that the effects of some write (insert, update,
delete to a permanent table) would not be visible to us because the
writing transaction is concurrent, for some tuple under
consideration.

>> If so, we want to flag a read-write conflict
>> out from our transaction and in to that other transaction.
>
> It still seems like HTSV would suffice, unless I'm missing
> something.

It is at least as likely that I'm missing something. If I'm
following you, we're talking about these 24 lines of code, where
"valid" is what was just returned from HeapTupleSatisfiesVisibility:

    if (valid)
    {
        /*
         * We may bail out if previous xmax aborted, or if it committed but
         * only locked the tuple without updating it.
         */
        if (tuple->t_data->t_infomask & (HEAP_XMAX_INVALID | HEAP_IS_LOCKED))
            return;

        /*
         * If there's a valid xmax, it must be from a concurrent transaction,
         * since it deleted a tuple which is visible to us.
         */
        xid = HeapTupleHeaderGetXmax(tuple->t_data);
        if (!TransactionIdIsValid(xid))
            return;
    }
    else
    {
        /*
         * We would read this row, but it isn't visible to us.
         */
        xid = HeapTupleHeaderGetXmin(tuple->t_data);
    }

We follow this by a check for the top-level xid, and return if that's
early enough to have overlapped our transaction. This seems to work
as intended for all known tests. I guess my questions would be:

(1) Do you see a case where this would do the wrong thing? Can you
describe that or (even better) provide a test case to demonstrate it?

(2) I haven't gotten my head around how HTSV helps or is even the
right thing. If I want to try the switch statement from your recent
post, what should I use as the OldestXmin value on the call to HTSV?

-Kevin
I wrote: > We follow this by a check for the top-level xid, and return if > that's early enough to have overlapped our transaction. s/early enough to have overlapped/early enough not to have overlapped/ -Kevin
On Mon, 2011-01-31 at 14:38 -0600, Kevin Grittner wrote:
> It is at least as likely that I'm missing something. If I'm
> following you, we're talking about these 24 lines of code, where
> "valid" is what was just returned from
> HeapTupleSatisfiesVisibility:

Yes.

> (1) Do you see a case where this would do the wrong thing? Can you
> describe that or (even better) provide a test case to demonstrate
> it?

No, I don't see any incorrect results.

> (2) I haven't gotten my head around how HTSV helps or is even the
> right thing.

It primarily just encapsulates the access to the tuple header fields.
I think that avoiding the messy logic of hint bits, tuple locks,
etc., is a significant win for readability and maintainability.

> If I want to try the switch statement from your recent
> post, what should I use as the OldestXmin value on the call to HTSV?

I believe RecentGlobalXmin should work.

And I don't think the original switch statement I posted did the
right thing for HEAPTUPLE_LIVE. I think that case needs to account
for the visible flag (if it's live but not visible, that's the same
as insert-in-progress for your purposes).

Regards,
Jeff Davis
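Putting Jeff's two corrections together, the revised check would look
something like the sketch below: HTSV called with RecentGlobalXmin as
the OldestXmin, HEAPTUPLE_LIVE consulting the boolean (called "valid"
in the v15 code, "visible" here), and the early exit for a recently
dead tuple that isn't visible. This is an illustration of the idea
under discussion, not the code that was eventually pushed:

    HTSV_Result htsvResult;
    TransactionId xid;

    htsvResult = HeapTupleSatisfiesVacuum(tuple->t_data,
                                          RecentGlobalXmin, buffer);
    switch (htsvResult)
    {
        case HEAPTUPLE_LIVE:
            if (visible)
                return;     /* live and visible to us: no conflict out */
            /* live but not visible: same as insert-in-progress */
            xid = HeapTupleHeaderGetXmin(tuple->t_data);
            break;
        case HEAPTUPLE_RECENTLY_DEAD:
            if (!visible)
                return;     /* the early exit Jeff mentions above */
            xid = HeapTupleHeaderGetXmax(tuple->t_data);
            break;
        case HEAPTUPLE_DELETE_IN_PROGRESS:
            xid = HeapTupleHeaderGetXmax(tuple->t_data);
            break;
        case HEAPTUPLE_INSERT_IN_PROGRESS:
            xid = HeapTupleHeaderGetXmin(tuple->t_data);
            break;
        case HEAPTUPLE_DEAD:
            return;
    }

The function would then go on, as in Kevin's 24-line excerpt
upthread, to find the top-level xid and return if that transaction
did not overlap the current one.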
Jeff Davis <pgsql@j-davis.com> wrote: > Ok, great. When I read that before I thought that WAL might need > to be sent for implicit RO transactions. I will read it more > carefully again. In looking back over recent posts to see what I might have missed or misinterpreted, I now see your point. Either of these alternatives would involve potentially sending something through the WAL on commit or rollback of some serializable transactions which *did not* write anything, if they were not *declared* to be READ ONLY. If that is not currently happening (again, I confess to not having yet delved into the mysteries of writing WAL records), then we would need a new WAL record type for writing these. That said, the logic would not make it at all useful to send something for *every* such transaction, and I've rather assumed that we would want some heuristic for setting a minimum interval between notifications, whether we sent the snapshots themselves or just flags to indicate it was time to build or validate a candidate snapshot. Sorry for misunderstanding the concerns. -Kevin
Jeff Davis <pgsql@j-davis.com> wrote: > On Mon, 2011-01-31 at 14:38 -0600, Kevin Grittner wrote: >> If I want to try the switch statement from your recent >> post, what should I use as the OldestXmin value on the call to >> HTSV? > > I believe RecentGlobalXmin should work. > > And I don't think the original switch statement I posted did the > right thing for HEAPTUPLE_LIVE. I think that case needs to account > for the visible flag (if it's live but not visible, that's the > same as insert-in-progress for your purposes). I'll try to set this up and see if I can get it to pass the check and dcheck make targets. Can we assume that the performance impact would be too small to matter when we know for sure that hint bits have already been set? -Kevin
On 31.01.2011 20:05, Robert Haas wrote: > On Mon, Jan 31, 2011 at 12:31 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Pretty minimal differences from V14, but I figured it would save the >> committer some work if I rolled them all up here. > > Sounds good. I believe Heikki is planning to work on this one. > Hopefully that will happen soon, since we are running short on time. Yeah, I can commit this. Jeff, are you satisfied with this patch now? I'm glad you're reviewing this, more eyeballs helps a lot with a big patch like this. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2011-01-31 at 15:30 -0600, Kevin Grittner wrote:
> I'll try to set this up and see if I can get it to pass the check
> and dcheck make targets. Can we assume that the performance impact
> would be too small to matter when we know for sure that hint bits
> have already been set?

I think that's a safe assumption. If there is some kind of noticeable
difference in conflict rates or runtime, that probably indicates a
bug in the new or old code.

Regards,
Jeff Davis
On Mon, 2011-01-31 at 23:35 +0200, Heikki Linnakangas wrote:
> Yeah, I can commit this. Jeff, are you satisfied with this patch now?
> I'm glad you're reviewing this, more eyeballs helps a lot with a big
> patch like this.

I think the patch is very close. I am doing my best in my free time
to complete a thorough review. If you have other patches to
review/commit then I will still be making progress reviewing SSI.
However, I would recommend leaving yourself some time to think on
this one if you don't already understand the design well.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2011-01-31 at 15:30 -0600, Kevin Grittner wrote:
>> I'll try to set this up and see if I can get it to pass the check
>> and dcheck make targets. Can we assume that the performance
>> impact would be too small to matter when we know for sure that
>> hint bits have already been set?
>
> I think that's a safe assumption. If there is some kind of
> noticeable difference in conflict rates or runtime, that probably
> indicates a bug in the new or old code.

That worked fine; passed check and dcheck targets. Here's the code:

http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=6360b0d4ca88c09cf590a75409cd29831afff58b

With confidence that it works, I looked it over some more and now
like this a lot. It is definitely more readable and should be less
fragile in the face of changes to MVCC bit-twiddling techniques. Of
course, any changes to the HTSV_Result enum will require changes to
this code, but that seems easier to spot and fix than the
alternative. Thanks for the suggestion!

Having gotten my head around it, I embellished here:

http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=f9307a41c198a9aa4203eb529f9c6d1b55c5c6e1

Do those changes look reasonable? None of that is really
*necessary*, but it seemed cleaner and clearer that way once I looked
at the code with the changes you suggested.

-Kevin
On Mon, 2011-01-31 at 17:55 -0600, Kevin Grittner wrote:
> http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=6360b0d4ca88c09cf590a75409cd29831afff58b
>
> With confidence that it works, I looked it over some more and now
> like this a lot. It is definitely more readable and should be less
> fragile in the face of changes to MVCC bit-twiddling techniques. Of
> course, any changes to the HTSV_Result enum will require changes to
> this code, but that seems easier to spot and fix than the
> alternative. Thanks for the suggestion!

One thing that confused me a little about the code is the default
case at the end. The enum is exhaustive, so the default doesn't
really make sense. The compiler warning you are silencing is the
uninitialized variable xid (right?), which is clearly a spurious
warning. Since you have the "Assert(TransactionIdIsValid(xid))" there
anyway, why not just initialize xid to InvalidTransactionId and get
rid of the default case?

I assume the "Assert(false)" is there to detect if someone adds a new
enum value, but the compiler should issue a warning in that case
anyway (and the comment next to Assert(false) is worded in a
confusing way).

This is all really minor stuff, obviously.

Also, from a code standpoint, it might be possible to early return in
the HEAPTUPLE_RECENTLY_DEAD case where visible=false. It looks like
it will be handled quickly afterward (at TransactionIdPrecedes), so
you don't have to change anything, but I thought I would mention it.

> Having gotten my head around it, I embellished here:
>
> http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=f9307a41c198a9aa4203eb529f9c6d1b55c5c6e1
>
> Do those changes look reasonable? None of that is really
> *necessary*, but it seemed cleaner and clearer that way once I
> looked at the code with the changes you suggested.

Yes, I like those changes.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:

> One thing that confused me a little about the code is the default
> case at the end. The enum is exhaustive, so the default doesn't
> really make sense. The compiler warning you are silencing is the
> uninitialized variable xid (right?)

Right.

> Since you have the "Assert(TransactionIdIsValid(xid))" there
> anyway, why not just initialize xid to InvalidTransactionId and
> get rid of the default case?

I feel a little better keeping even that trivial work out of the code
path if possible, and it seems less confusing to me to handle it in
the default case than up front. I'll improve the comment.

> I assume the "Assert(false)" is there to detect if someone adds a
> new enum value, but the compiler should issue a warning in that
> case anyway

My compiler doesn't. Would it make sense to elog here, rather than
Assert? I'm not clear on the rules for that. I'll push something
that way for review and comment. If it's wrong, I'll change it.

> This is all really minor stuff, obviously.

In a million line code base, I hate to call anything which affects
readability minor. ;-)

> Also, from a code standpoint, it might be possible to early return
> in the HEAPTUPLE_RECENTLY_DEAD case where visible=false.

Yeah, good point. It seems worth testing a bool there.

A small push dealing with all the above issues and adding a little to
comments:

http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=538ff57691256de0341e22513f59e9dc4dfd998f

Let me know if any of that still needs work to avoid confusion and
comply with PostgreSQL coding conventions. Like I said, I'm not
totally clear whether elog is right here, but it seems to me a
conceptually similar case to some I found elsewhere where elog was
used.

-Kevin
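For those not following the git link, the shape being discussed is
roughly the following; the exact message text and variable name here
are guesses rather than quotes from the pushed commit:

    default:
        /*
         * The HTSV_Result enum is meant to be exhaustive, so this
         * should be unreachable.  Using elog rather than a bare
         * Assert also catches the condition in non-assert builds,
         * and the default case keeps the compiler from warning that
         * xid might be used uninitialized.
         */
        elog(ERROR, "unrecognized return value from HeapTupleSatisfiesVacuum: %u",
             htsvResult);
        return;     /* not reached; quiets the compiler */

Putting the fallback in the default case, rather than initializing
xid before the switch, preserves the uninitialized-variable warning
as a safety net: if a future change adds a path through the switch
that forgets to set xid, the compiler can still complain.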
On Tue, 2011-02-01 at 11:01 -0600, Kevin Grittner wrote:
> My compiler doesn't.

Strange. Maybe it requires -O2?

> Would it make sense to elog here, rather than
> Assert? I'm not clear on the rules for that.

elog looks fine there to me, assuming we have the default case. I'm
not 100% clear on the rules, either. I think invalid input/corruption
are usually elog (so they can be caught in non-assert builds); but
other switch statements have them as well ("unrecognized node...").

> A small push dealing with all the above issues and adding a little
> to comments:
>
> http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=538ff57691256de0341e22513f59e9dc4dfd998f
>
> Let me know if any of that still needs work to avoid confusion and
> comply with PostgreSQL coding conventions. Like I said, I'm not
> totally clear whether elog is right here, but it seems to me a
> conceptually similar case to some I found elsewhere where elog was
> used.

Looks good. It also looks like it contains a bugfix for
subtransactions, right?

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2011-02-01 at 11:01 -0600, Kevin Grittner wrote:
>> My compiler doesn't.
>
> Strange. Maybe it requires -O2?

That's not it; I see -O2 in my compiles. At any rate, I think the
default clause is the best place to quash the warning because that
leaves us with a warning if changes leave a path through the other
options without setting xid. In other words, unconditionally setting
a value before the switch could prevent warnings on actual bugs, and
I don't see such a risk when it's on the default.

>> A small push dealing with all the above issues and adding a
>> little to comments:

> Looks good. It also looks like it contains a bugfix for
> subtransactions, right?

I think it fixes a bug introduced in the push from late yesterday.
In reviewing what went into the last push yesterday, it looked like I
might have introduced an assertion failure for the case where there
is a write to a row within a subtransaction and then the row is read
again after leaving the subtransaction. I didn't actually confirm
that through testing, because it looked like the safe approach was
better from a performance standpoint, anyway. That last check for
"our own xid" after finding the top level xid comes before acquiring
the LW lock and doing an HTAB lookup which aren't necessary in that
case. Re-reading a row within the same transaction seems likely
enough to make it worth that quick test before doing more expensive
things.

It did throw a scare into me, though. The last thing I want to do is
destabilize things at this juncture. I'll try to be more conservative
with changes from here out.

-Kevin
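The ordering Kevin describes would look something like this sketch
(the helper functions are from core PostgreSQL; the surrounding
context in CheckForSerializableConflictOut() is assumed):

    /* Find the top-level transaction that wrote the tuple. */
    xid = SubTransGetTopmostTransaction(xid);

    /*
     * Re-reading a row our own transaction wrote -- even inside a
     * subtransaction that has since been exited -- is never a
     * rw-conflict, so bail out here, before taking the predicate
     * lock manager's LWLock and doing the hash table lookup.
     */
    if (TransactionIdEquals(xid, GetTopTransactionIdIfAny()))
        return;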
On 26.01.2011 13:36, Simon Riggs wrote: > I looked at the patch to begin a review and immediately saw "dtest". > There's no docs to explain what it is, but a few comments fill me in a > little more. Can we document that please? And/or explain why its an > essential part of this commit? Could we keep it out of core, or if not, > just commit that part separately? I notice the code is currently > copyright someone else than PGDG. We still haven't really resolved this.. Looking at the dtest suite, I'd really like to have those tests in the PostgreSQL repository. However, I'd really like not to require python to run them, and even less dtester (no offense Markus) and the dependencies it has. I couldn't get it to run on my basic Debian system, clearly I'm doing something wrong, but it will be even harder for people on more exotic platforms. The tests don't seem very complicated, and they don't seem to depend much on the dtester features. How hard would it be to rewrite the test engine in C or perl? I'm thinking of defining each test in a simple text file, and write a small test engine to parse that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki, On 02/02/2011 11:20 AM, Heikki Linnakangas wrote: > I couldn't get it to run on my basic Debian system, clearly I'm > doing something wrong, but it will be even harder for people on more > exotic platforms. On Debian you only need 'python-twisted' and the dtester sources [1], IIRC. What issue did you run into? > The tests don't seem very complicated, and they don't seem to depend > much on the dtester features. How hard would it be to rewrite the test > engine in C or perl? I've taken a quick look at a perl framework similar to twisted (POE). Doesn't seem too complicated, but I don't see the benefit of requiring one over the other. Another async, event-based framework that I've been playing with is asio (C++). I don't think it's feasible to write anything similar without basing on such a framework. (And twisted seems more mature and widespread (in terms of supported platforms) than POE, but that's probably just my subjective impression). However, the question here probably is whether or not you need all of the bells and whistles of dtester (as if there were that many) [2]. > I'm thinking of defining each test in a simple text file, and write a > small test engine to parse that. AFAIUI there are lots of possible permutations, so you don't want to have to edit files for each manually (several hundreds, IIRC). So using a scripting language to generate the permutations makes sense, IMO. Pretty much everything else that Kevin currently uses dtester for (i.e. initdb, running the postmaster, connecting with psql) is covered by the existing regression testing infrastructure already, I think. So it might be a matter of integrating the permutation generation and test running code into the existing infrastructure. Kevin? Regards Markus Wanner [1]: dtester project site: http://www.bluegap.ch/projects/dtester/ [2]: side note: Dtester mainly serves me to test Postgres-R, where the current regression testing infrastructure simply doesn't suffice. With dtester I tried to better structure the processes and services, their dependencies and the test environment.
On 02.02.2011 12:20, Heikki Linnakangas wrote: > On 26.01.2011 13:36, Simon Riggs wrote: >> I looked at the patch to begin a review and immediately saw "dtest". >> There's no docs to explain what it is, but a few comments fill me in a >> little more. Can we document that please? And/or explain why its an >> essential part of this commit? Could we keep it out of core, or if not, >> just commit that part separately? I notice the code is currently >> copyright someone else than PGDG. > > We still haven't really resolved this.. > > Looking at the dtest suite, I'd really like to have those tests in the > PostgreSQL repository. However, I'd really like not to require python to > run them, and even less dtester (no offense Markus) and the dependencies > it has. I couldn't get it to run on my basic Debian system, clearly I'm > doing something wrong, but it will be even harder for people on more > exotic platforms. > > The tests don't seem very complicated, and they don't seem to depend > much on the dtester features. How hard would it be to rewrite the test > engine in C or perl? > > I'm thinking of defining each test in a simple text file, and write a > small test engine to parse that. I took a stab at doing that. Attached, untar in the top source tree, and do "cd src/test/isolation; make check". It's like "installcheck", so it needs a postgres server to be running. Also available in my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, branch "serializable". I think this will work. This is a rough first version, but it already works. We can extend the test framework later if we need more features, but I believe this is enough for all the existing tests. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
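For anyone who wants a feel for the format without untarring the
attachment: a spec in this kind of framework is a small text file
along the following lines. This example is a guess at the syntax
based on the discussion in this thread, not a copy of a file from the
tarball:

    # Simple write-skew test: each transaction reads the row the
    # other updates; under SSI one of the two should be rolled back.
    setup
    {
        CREATE TABLE t (id int PRIMARY KEY, val int);
        INSERT INTO t VALUES (1, 10), (2, 10);
    }

    teardown
    {
        DROP TABLE t;
    }

    session "s1"
    setup      { BEGIN ISOLATION LEVEL SERIALIZABLE; }
    step "r1"  { SELECT val FROM t WHERE id = 2; }
    step "w1"  { UPDATE t SET val = val - 1 WHERE id = 1; }
    step "c1"  { COMMIT; }

    session "s2"
    setup      { BEGIN ISOLATION LEVEL SERIALIZABLE; }
    step "r2"  { SELECT val FROM t WHERE id = 1; }
    step "w2"  { UPDATE t SET val = val - 1 WHERE id = 2; }
    step "c2"  { COMMIT; }

The test runner connects one session per connection, runs the steps
in each interleaving it generates (or only the interleavings listed
in explicit "permutation" lines, if any), and diffs the collected
output against an expected file.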
On 02.02.2011 14:48, Markus Wanner wrote:
> Heikki,
>
> On 02/02/2011 11:20 AM, Heikki Linnakangas wrote:
>> I couldn't get it to run on my basic Debian system, clearly I'm
>> doing something wrong, but it will be even harder for people on more
>> exotic platforms.
>
> On Debian you only need 'python-twisted' and the dtester sources [1],
> IIRC. What issue did you run into?

I installed dtester with:

PYTHONPATH=~/pythonpath/lib/python2.6/site-packages/ python ./setup.py install --prefix=/home/heikki/pythonpath/

And that worked. But when I try to run Kevin's test suite, I get this:

~/git-sandbox/postgresql/src/test/regress (serializable)$ PYTHONPATH=~/pythonpath/lib/python2.6/site-packages/ make dcheck
./pg_dtester.py --temp-install --top-builddir=../../.. \
        --multibyte=
Postgres dtester suite Copyright (c) 2004-2010, by Markus Wanner

Traceback (most recent call last):
  File "./pg_dtester.py", line 1376, in <module>
    main(sys.argv[1:])
  File "./pg_dtester.py", line 1370, in main
    runner = Runner(reporter=TapReporter(sys.stdout, sys.stderr, showTimingInfo=True),
TypeError: __init__() got an unexpected keyword argument 'showTimingInfo'
make: *** [dcheck] Virhe 1

At that point I had no idea what's wrong.

PS. I tried "python ./setup.py dtest", mentioned in the README, and
it just said "invalid command 'dtest'". But I presume the
installation worked anyway.

> Pretty much everything else that Kevin currently uses dtester for (i.e.
> initdb, running the postmaster, connecting with psql) is covered by the
> existing regression testing infrastructure already, I think. So it
> might be a matter of integrating the permutation generation and test
> running code into the existing infrastructure. Kevin?

Yeah, that was my impression too. (see my other post with WIP
implementation of that)

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki, On 02/02/2011 02:04 PM, Heikki Linnakangas wrote: > TypeError: __init__() got an unexpected keyword argument 'showTimingInfo' > make: *** [dcheck] Virhe 1 > > At that point I had no idea what's wrong. Hm.. looks like Kevin already uses the latest dtester code from git. You can either do that as well, or try to drop the 'showTimingInfo' argument from the call to Runner() in pg_dtester.py:1370. > PS. I tried "python ./setup.py dtest", mentioned in the README, and it > just said "invalid command 'dtest'". But I presume the installation > worked anyway. Yeah, that's a gotcha in dtester. setup.py dtest requires dtester to be installed - a classic chicken and egg problem. I certainly need to look into this. > Yeah, that was my impression too. (see my other post with WIP > implementation of that) I'd like to get Kevin's opinion on that, but it certainly sounds like a pragmatic approach for now. Regards Markus Wanner
Heikki Linnakangas wrote: > On 02.02.2011 12:20, Heikki Linnakangas wrote: >> The tests don't seem very complicated, and they don't seem to >> depend much on the dtester features. How hard would it be to >> rewrite the test engine in C or perl? >> >> I'm thinking of defining each test in a simple text file, and >> write a small test engine to parse that. > > I took a stab at doing that. Attached, untar in the top source > tree, and do "cd src/test/isolation; make check". It's like > "installcheck", so it needs a postgres server to be running. Also > available in my git repository at > git://git.postgresql.org/git/users/heikki/postgres.git, > branch "serializable". > > I think this will work. This is a rough first version, but it > already works. We can extend the test framework later if we need > more features, but I believe this is enough for all the existing > tests. Thanks for this. It does seem the way to go. I can fill it out with the rest of the tests, unless you'd rather. -Kevin
On 02.02.2011 16:11, Kevin Grittner wrote: > Heikki Linnakangas wrote: >> On 02.02.2011 12:20, Heikki Linnakangas wrote: > >>> The tests don't seem very complicated, and they don't seem to >>> depend much on the dtester features. How hard would it be to >>> rewrite the test engine in C or perl? >>> >>> I'm thinking of defining each test in a simple text file, and >>> write a small test engine to parse that. >> >> I took a stab at doing that. Attached, untar in the top source >> tree, and do "cd src/test/isolation; make check". It's like >> "installcheck", so it needs a postgres server to be running. Also >> available in my git repository at >> git://git.postgresql.org/git/users/heikki/postgres.git, >> branch "serializable". >> >> I think this will work. This is a rough first version, but it >> already works. We can extend the test framework later if we need >> more features, but I believe this is enough for all the existing >> tests. > > Thanks for this. It does seem the way to go. > > I can fill it out with the rest of the tests, unless you'd rather. Thanks, please do. I converted the next test, receipt-report, grab that from my git repository first. It produces over two hundred permutations, so it's going to be a bit tedious to check the output for that, but I think it's still acceptable so that we don't need more complicated rules for what is accepted. IOW, we can just print the output of all permutations and "diff", we won't need "if step X ran before step Y, commit should succeed" rules that the python version had. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
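The permutation count grows quickly because the number of
interleavings of several sessions is the multinomial coefficient of
their step counts. For example, if the three sessions in a spec had
2, 2, and 3 steps respectively (an assumption for illustration, not a
count from the receipt-report spec itself), the tester would generate
7! / (2! * 2! * 3!) = 5040 / 24 = 210 interleavings -- consistent
with "over two hundred".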
Heikki Linnakangas wrote: > I converted the next test, receipt-report, grab that from my git > repository first. Will do. > It produces over two hundred permutations, so it's going to be a > bit tedious to check the output for that, but I think it's still > acceptable so that we don't need more complicated rules for what is > accepted. IOW, we can just print the output of all permutations and > "diff", we won't need "if step X ran before step Y, commit should > succeed" rules that the python version had. During initial development, I was very glad to have the one-line- per-permutation summary; however, lately I've been wishing for more detail, such as is available with this approach. At least for the moment, what this provides is exactly what is most useful. If there is ever a major refactoring (versus incremental enhancements), it might be worth developing a way to filter the detailed input into the sort of summary we were getting from dtester, but we can cross that bridge when (and if) we come to it. -Kevin
On 02.02.2011 16:27, Kevin Grittner wrote: > Heikki Linnakangas wrote: >> It produces over two hundred permutations, so it's going to be a >> bit tedious to check the output for that, but I think it's still >> acceptable so that we don't need more complicated rules for what is >> accepted. IOW, we can just print the output of all permutations and >> "diff", we won't need "if step X ran before step Y, commit should >> succeed" rules that the python version had. > > During initial development, I was very glad to have the one-line- > per-permutation summary; however, lately I've been wishing for more > detail, such as is available with this approach. At least for the > moment, what this provides is exactly what is most useful. If there > is ever a major refactoring (versus incremental enhancements), it > might be worth developing a way to filter the detailed input into the > sort of summary we were getting from dtester, but we can cross that > bridge when (and if) we come to it. Ok, great. Let's just add comments to the test specs to explain which permutations should succeed and which should fail, then. And which ones fail because they're false positives. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
> I converted the next test, receipt-report, grab that from my git
> repository first. It produces over two hundred permutations, so
> it's going to be a bit tedious to check the output for that, but I
> think it's still acceptable so that we don't need more complicated
> rules for what is accepted.

I was a bit disconcerted to see 48 permutations fail when dcheck had
been seeing 6; but it turned out to be simple to fix -- the third
connection was declared READ ONLY in the dcheck version. Measuring
false positives for the READ WRITE version is probably valuable, but
we also want to test the READ ONLY optimizations. The "two ids test"
has similar dynamics, so I'll change one or the other so we cover
both scenarios.

> IOW, we can just print the output of all permutations and "diff",
> we won't need "if step X ran before step Y, commit should succeed"
> rules that the python version had.

I found two problems with this, and I'm not sure how to handle
either. If we can figure out an approach, I'm happy to work on it.

(1) We don't have much in the way of detail yet. I put a different
detail message on each, so that there is more evidence, hopefully at
least somewhat comprehensible to an educated user, about how the
cancelled transaction fit into the dangerous pattern of access among
transactions. Ultimately, I hope we can improve these messages to
include such detail as table names in many circumstances, but that's
not 9.1 material. What I did include, when it was easily available,
was another xid involved in the conflict. These are not matching
from one test to the next.

(2) The NOTICE lines for implicit index creation pop up at odd times
in the output, like in the middle of a SQL statement. It looks like
these are piling up in a buffer somewhere and getting dumped into the
output when the buffer fills. They are actually showing up at
exactly the same point on each run, but I doubt that we can count on
that for all platforms, and even if we could -- it's kinda ugly.
Perhaps we should change the client logging level to suppress these?
They're not really important here.

So, I think (2) is probably easy, but I don't see how we can deal
with (1) as easily. Suppress detail? Filter to change the xid number
to some literal?

-Kevin
On 02.02.2011 17:52, Kevin Grittner wrote: > I found two problems with this, and I'm not sure how to handle > either. If we can figure out an approach, I'm happy to work on it. > > (1) We don't have much in the way of detail yet. I put a different > detail message on each, so that there is more evidence, hopefully at > least somewhat comprehensible to an educated user, about how the > cancelled transaction fit into the dangerous pattern of access among > transactions. Ultimately, I hope we can improve these messages to > include such detail as table names in many circumstances, but that's > not 9.1 material. What I did include, when it was easily available, > was another xid involved in the conflict. These are not matching > from one test to the next. > > (2) The NOTICE lines for implicit index creation pop up at odd times > in the output, like in the middle of a SQL statement. It looks like > these are piling up in a buffer somewhere and getting dumped into the > output when the buffer fills. They are actually showing up at > exactly the same point on each run, but I doubt that we can count on > that for all platforms, and even if we could -- it's kinda ugly. > Perhaps we should change the client logging level to suppress these? > They're not really important here. > > So, I think (2) is probably easy, but I don't see how we can deal > with (1) as easily. Suppress detail? Filter to change the xid > number to some literal? Suppressing detail seems easiest. AFAICS all the variable parts are in errdetail. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Suppressing detail seems easiest. AFAICS all the variable parts are > in errdetail. I pushed that with some work on the tests. If you could review the changes to isolationtester.c to confirm that they look sane, I'd appreciate it. http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=2e0254b82a62ead25311f545b0626c97f9ac35b4#patch6 If that part is right, it shouldn't take me very long to finish the specs and capture the expected results. I really appreciate you putting this testing framework together. This is clearly the right way to do it, although we also clearly don't want this test in the check target at the root directory, because of the run time. -Kevin
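The usual libpq way to print only the primary message -- and
presumably roughly what the isolationtester change does, though the
exact code may differ -- is to pull individual fields out of the
error result instead of using PQerrorMessage(), which appends the
DETAIL and HINT lines (res and conn being the usual PGresult and
PGconn handles):

    const char *severity = PQresultErrorField(res, PG_DIAG_SEVERITY);
    const char *primary = PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY);

    if (primary != NULL)
        printf("%s:  %s\n", severity ? severity : "ERROR", primary);
    else
        printf("%s", PQerrorMessage(conn));  /* fall back to everything */

Since the variable parts of the serialization failure reports are in
errdetail, the diffable output then stays stable across runs.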
On 02.02.2011 19:36, Kevin Grittner wrote: > Heikki Linnakangas wrote: > >> Suppressing detail seems easiest. AFAICS all the variable parts are >> in errdetail. > > I pushed that with some work on the tests. If you could review the > changes to isolationtester.c to confirm that they look sane, I'd > appreciate it. > > http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=2e0254b82a62ead25311f545b0626c97f9ac35b4#patch6 > > If that part is right, it shouldn't take me very long to finish the > specs and capture the expected results. Looks good. A comment would be in order to explain why we only print the primary message. (yeah, I know I didn't comment the tester code very well myself) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > A comment would be in order to explain why we only print the > primary message. Done: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=ddeea22db06ed763b39f716b86db248008a3aa92 -Kevin
Heikki Linnakangas wrote: On 02.02.2011 19:36, Kevin Grittner wrote: >> If that part is right, it shouldn't take me very long to finish >> the specs and capture the expected results. > Looks good. First cut on the rest of it pushed. I'll want to go over it again to confirm, but I think all dcheck testing is now replicated there. I didn't eliminate dtester/dcheck code yet, in case either of us wanted to compare, but it can go any time now. -Kevin
On 02.02.2011 19:36, Kevin Grittner wrote:
> I really appreciate you putting this testing framework together.
> This is clearly the right way to do it, although we also clearly
> don't want this test in the check target at the root directory,
> because of the run time.

It would be nice to get some buildfarm members to run them though.

I'm reading through the patch again now, hoping to commit this
weekend. There's still this one TODO item that you commented on
earlier:

> Then there's one still lurking outside the new predicate* files, in
> heapam.c. It's about 475 lines into the heap_update function (25
> lines from the bottom). In reviewing where we needed to acquire
> predicate locks, this struck me as a place where we might need to
> duplicate the predicate lock from one tuple to another, but I wasn't
> entirely sure. I tried for a few hours one day to construct a
> scenario which would show a serialization anomaly if I didn't do
> this, and failed to create one. If someone who understands both the
> patch and heapam.c wants to look at this and offer an opinion, I'd
> be most grateful. I'll take another look and see if I can get my
> head around it better, but failing that, I'm disinclined to either
> add locking I'm not sure we need or to remove the comment that says
> we *might* need it there.

Have you convinced yourself that there's nothing to do? If so, we
should replace the todo comment with a comment explaining why.

PageIsPredicateLocked() is currently only called by one assertion in
RecordFreeIndexPage(). The comment in PageIsPredicateLocked() says
something about gist vacuuming; I presume this is something that will
be needed to implement more fine-grained predicate locking in GiST.
But we don't have that at the moment.

The assertion in RecordFreeIndexPage() is a reasonable sanity check,
but I'm inclined to move it to the caller instead: I don't think the
FSM should need to access predicate lock manager, even for an
assertion.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
> On 02.02.2011 19:36, Kevin Grittner wrote:
>> I really appreciate you putting this testing framework together.
>> This is clearly the right way to do it, although we also clearly
>> don't want this test in the check target at the root directory,
>> because of the run time.
>
> It would be nice to get some buildfarm members to run them though.

Maybe it could be included in the installcheck or installcheck-world
target?

> There's still this one TODO item that you commented on earlier:
>
>> Then there's one still lurking outside the new predicate* files,
>> in heapam.c. It's about 475 lines into the heap_update function
>> (25 lines from the bottom). In reviewing where we needed to
>> acquire predicate locks, this struck me as a place where we might
>> need to duplicate the predicate lock from one tuple to another,
>> but I wasn't entirely sure. I tried for a few hours one day to
>> construct a scenario which would show a serialization anomaly if I
>> didn't do this, and failed to create one. If someone who
>> understands both the patch and heapam.c wants to look at this and
>> offer an opinion, I'd be most grateful. I'll take another look and
>> see if I can get my head around it better, but failing that, I'm
>> disinclined to either add locking I'm not sure we need or to
>> remove the comment that says we *might* need it there.
>
> Have you convinced yourself that there's nothing to do? If so, we
> should replace the todo comment with a comment explaining why.

I'll look at that today and tomorrow and try to resolve the issue one
way or the other.

> PageIsPredicateLocked() is currently only called by one assertion
> in RecordFreeIndexPage(). The comment in PageIsPredicateLocked()
> says something about gist vacuuming; I presume this is something
> that will be needed to implement more fine-grained predicate
> locking in GiST. But we don't have that at the moment.

Yeah. We had something close at one point which we had to back out
because it didn't cover page splits correctly in all cases. It's a
near-certainty that we can get fine-grained coverage for the GiST AM
with a small patch in 9.2, and I'm pretty sure we'll need that
function for it.

> The assertion in RecordFreeIndexPage() is a reasonable sanity
> check, but I'm inclined to move it to the caller instead: I don't
> think the FSM should need to access predicate lock manager, even
> for an assertion.

OK. So it looks like right now it will move to btvacuumpage(), right
before the call to RecordFreeIndexPage(), and we will need to add it
to one spot each in the GiST and GIN AMs, when we get to those.
Would you like me to do that?

-Kevin
On 04.02.2011 14:30, Kevin Grittner wrote: > Heikki Linnakangas wrote: >> The assertion in RecordFreeIndexPage() is a reasonable sanity >> check, but I'm inclined to move it to the caller instead: I don't >> think the FSM should need to access predicate lock manager, even >> for an assertion. > > OK. So it looks like right now it will move to btvacuumpage(), right > before the call to RecordFreeIndexPage(), and we will need to add it > to one spot each in the GiST and GIN AMs, when we get to those. > Would you like me to do that? Yeah, please do. Thanks! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 04.02.2011 14:30, Kevin Grittner wrote: >> OK. So it looks like right now it will move to btvacuumpage(), >> right before the call to RecordFreeIndexPage(), and we will need >> to add it to one spot each in the GiST and GIN AMs, when we get >> to those. Would you like me to do that? > > Yeah, please do. Thanks! Done: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=69b1c072048fbae42dacb098d70f5e29fb8be63c Any objections to me taking out the dcheck target and scripts now? -Kevin
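The moved check is presumably along these lines in btvacuumpage(),
just before the page is reported free (a sketch, not a quote of the
commit):

    /*
     * A page about to be handed to the free space map should never
     * carry a predicate lock; any fine-grained SIREAD locks are
     * promoted or transferred before a page can be deleted.
     */
    Assert(!PageIsPredicateLocked(rel, blkno));

    RecordFreeIndexPage(rel, blkno);

This keeps the sanity check while leaving the FSM code itself with no
dependency on the predicate lock manager.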
Heikki Linnakangas wrote:
> On 02.02.2011 19:36, Kevin Grittner wrote:
>
>> Then there's one still lurking outside the new predicate* files,
>> in heapam.c. It's about 475 lines into the heap_update function
>> (25 lines from the bottom). In reviewing where we needed to
>> acquire predicate locks, this struck me as a place where we might
>> need to duplicate the predicate lock from one tuple to another,
>> but I wasn't entirely sure. I tried for a few hours one day to
>> construct a scenario which would show a serialization anomaly if I
>> didn't do this, and failed to create one. If someone who
>> understands both the patch and heapam.c wants to look at this and
>> offer an opinion, I'd be most grateful. I'll take another look and
>> see if I can get my head around it better, but failing that, I'm
>> disinclined to either add locking I'm not sure we need or to
>> remove the comment that says we *might* need it there.
>
> Have you convinced yourself that there's nothing to do? If so, we
> should replace the todo comment with a comment explaining why.

It turns out that nagging doubt from my right-brain was on target.
Here's the simplest example I was able to construct of a false
negative due to the lack of some code there:

-- setup
create table t (id int not null, txt text) with (fillfactor=50);
insert into t (id)
  select x from (select * from generate_series(1, 1000000)) a(x);
alter table t add primary key (id);

-- session 1
-- T1
begin transaction isolation level serializable;
select * from t where id = 1000000;

-- session 2
-- T2
begin transaction isolation level serializable;
update t set txt = 'x' where id = 1000000;
commit;
-- T3
begin transaction isolation level serializable;
update t set txt = 'y' where id = 1000000;
select * from t where id = 500000;

-- session 3
-- T4
begin transaction isolation level serializable;
update t set txt = 'z' where id = 500000;
select * from t where id = 1;
commit;

-- session 2
commit;

-- session 1
update t set txt = 'a' where id = 1;
commit;

Based on visibility of work, there's a cycle in the apparent order of
execution: T1 -> T2 -> T3 -> T4 -> T1

So now that I'm sure we actually do need code there, I'll add it.
And I'll add the new test to the isolation suite.

-Kevin
"Kevin Grittner" wrote: > So now that I'm sure we actually do need code there, I'll add it. In working on this I noticed the apparent need to move two calls to PredicateLockTuple a little bit to keep them inside the buffer lock. Without at least a share lock on the buffer, it seems that here is a window where a read could miss the MVCC from a write and the write could fail to see the predicate lock. Please see whether this seems reasonable: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=7841a22648c3f4ae46f674d7cf4a7c2673cf9ed2 > And I'll add the new test to the isolation suite. We don't need all permutations for this test, which is a good thing since it has such a long setup time. Is there an easy way to just run the one schedule of statements on three connections? -Kevin
"Kevin Grittner" wrote: > So now that I'm sure we actually do need code there, I'll add it. Hmmm... First cut at this here: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=3ccedeb451ee74ee78ef70735782f3550b621eeb It passes all the usual regression tests, the new isolation tests, and the example posted earlier in the thread of a test case which was allowing an anomaly. (That is to say, the anomaly is now prevented.) I didn't get timings, but it *seems* noticeably slower; hopefully that's either subjective or fixable. Any feedback on whether this seems a sane approach to the issue is welcome. -Kevin
I wrote: > First cut [of having predicate locks linked to later row versions > for conflict detection purposes] > It passes all the usual regression tests, the new isolation tests, > and the example posted earlier in the thread of a test case which > was allowing an anomaly. (That is to say, the anomaly is now > prevented.) > > I didn't get timings, but it *seems* noticeably slower; hopefully > that's either subjective or fixable. Any feedback on whether this > seems a sane approach to the issue is welcome. After sleeping on it and reviewing the patch this morning, I'm convinced that this is fundamentally right, but the recursion in RemoveTargetIfNoLongerUsed() can't work. I'll have to just break the linkage there and walk the HTAB in the general cleanup to eliminate orphaned targets. I'm working on it now, and hope to have it all settled down today. I fear I've messed up Heikki's goal of a commit this weekend, but I think this is a "must fix". -Kevin
"Kevin Grittner" wrote: > I'm working on it now, and hope to have it all settled down today. Done and pushed to the git repo. Branch serializable on: git://git.postgresql.org/git/users/kgrittn/postgres.git Heikki: I'm back to not having any outstanding work on the patch. Does anyone want me to post a new version of the patch with recent changes? -Kevin
On Sat, 2011-02-05 at 14:43 -0600, Kevin Grittner wrote: > "Kevin Grittner" wrote: > > > So now that I'm sure we actually do need code there, I'll add it. > > In working on this I noticed the apparent need to move two calls to > PredicateLockTuple a little bit to keep them inside the buffer lock. > Without at least a share lock on the buffer, it seems there is a > window where a read could miss the MVCC changes from a write and the > write could fail to see the predicate lock. Please see whether this > seems reasonable: > > http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=7841a22648c3f4ae46f674d7cf4a7c2673cf9ed2 What does PredicateLockTuple do that needs a share lock? Does a pin suffice? Regards, Jeff Davis
Jeff Davis wrote: > On Sat, 2011-02-05 at 14:43 -0600, Kevin Grittner wrote: >> In working on this I noticed the apparent need to move two calls >> to PredicateLockTuple a little bit to keep them inside the buffer >> lock. Without at least a share lock on the buffer, it seems there >> is a window where a read could miss the MVCC changes from a write >> and the write could fail to see the predicate lock. > What does PredicateLockTuple do that needs a share lock? Does a pin > suffice? If one process is reading a tuple and another is writing it (e.g., UPDATE or DELETE), the concern is that we need to be able to guarantee that either the predicate lock appears in time for the writer to see it on the call to CheckForSerializableConflictIn or the reader sees the MVCC changes in CheckForSerializableConflictOut. It's OK if the conflict is detected both ways, but if it's missed both ways then we could have a false negative, allowing anomalies to slip through. It didn't seem to me on review that acquiring the predicate lock after releasing the shared buffer lock (and just having the pin) would be enough to ensure that a write which followed that would see the predicate lock.

reader has pin and shared lock
writer increments pin count
reader releases shared lock
writer acquires exclusive lock
writer checks predicate lock and fails to see one
reader adds predicate lock
we have a problem

-Kevin
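The constraint above boils down to taking the predicate lock while the buffer's share lock is still held. A minimal sketch of the intended reader-side call pattern, using the function names discussed in this thread; the real heapam.c call sites are more involved, so treat this as illustrative only:

    /* Sketch only: simplified reader-side pattern, not the actual
     * heapam.c code.  The point is the ordering -- PredicateLockTuple
     * runs while the buffer's share lock is still held, so a writer
     * taking the exclusive lock afterward must see the predicate lock. */
    static void
    read_tuple_with_predicate_lock(Relation relation, HeapTuple tuple,
                                   Buffer buffer, Snapshot snapshot)
    {
        bool    valid;

        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        valid = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
        CheckForSerializableConflictOut(valid, relation, tuple, buffer);
        if (valid)
            PredicateLockTuple(relation, tuple); /* before releasing the lock */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }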
"Kevin Grittner" wrote: > Jeff Davis wrote: > >> What does PredicateLockTuple do that needs a share lock? Does a >> pin suffice? > > If one process is reading a tuple and another is writing it (e.g., > UPDATE or DELETE) the concern is that we need to be able to > guarantee that either the predicate lock appears in time for the > writer to see it on the call to CheckForSerializableConflictIn or > the reader sees the MVCC changes in > CheckForSerializableConflictOut. It's OK if the conflict is > detected both ways, but if it's missed both ways then we could have > a false negative, allowing anomalies to slip through. It didn't > seem to me on review that acquiring the predicate lock after > releasing the shared buffer lock (and just having the pin) would be > enough to ensure that a write which followed that would see the > predicate lock. > > reader has pin and shared lock > writer increments pin count > reader releases shared lock > writer acquires exclusive lock > writer checks predicate lock and fails to see one > reader adds predicate lock > we have a problem Hmmm... Or do we? If both sides were careful to record what they're doing before checking for a conflict, the pin might be enough. I'll check for that. In at least one of those moves I was moving the predicate lock acquisition from after the conflict check to before, but maybe I didn't need to move it quite so far.... -Kevin
On 05.02.2011 21:43, Kevin Grittner wrote: > "Kevin Grittner" wrote: > >> So now that I'm sure we actually do need code there, I'll add it. > > In working on this I noticed the apparent need to move two calls to > PredicateLockTuple a little bit to keep them inside the buffer lock. > Without at least a share lock on the buffer, it seems there is a > window where a read could miss the MVCC changes from a write and the > write could fail to see the predicate lock. Please see whether this > seems reasonable: > > http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=7841a22648c3f4ae46f674d7cf4a7c2673cf9ed2 > >> And I'll add the new test to the isolation suite. > > We don't need all permutations for this test, which is a good thing > since it has such a long setup time. Is there an easy way to just > run the one schedule of statements on three connections? Not at the moment, but we can add that. I added the capability to specify exact permutations like:

permutation "rwx1" "rwx2" "c1" "c2"

See my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, branch "serializable". I also added a short README to explain what the isolation test suite is all about, as well as separate "make check" and "make installcheck" targets. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
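For anyone who hasn't looked at the isolation suite yet, a skeletal spec file using that syntax might look something like this; the table, step names, and SQL here are illustrative, not the committed test:

    setup       { CREATE TABLE t (id int PRIMARY KEY, txt text); }
    teardown    { DROP TABLE t; }

    session "s1"
    setup       { BEGIN ISOLATION LEVEL SERIALIZABLE; }
    step "rwx1" { SELECT txt FROM t WHERE id = 1; }
    step "c1"   { COMMIT; }

    session "s2"
    setup       { BEGIN ISOLATION LEVEL SERIALIZABLE; }
    step "rwx2" { UPDATE t SET txt = 'x' WHERE id = 1; }
    step "c2"   { COMMIT; }

    permutation "rwx1" "rwx2" "c1" "c2"

With no permutation line, the tester runs all interleavings of the steps; the explicit permutation pins down the single schedule of interest.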
Kevin Grittner <kgrittn@wicourts.gov> wrote: > "Kevin Grittner" wrote: >> Jeff Davis wrote: >> >>> What does PredicateLockTuple do that needs a share lock? Does a >>> pin suffice? >> >> If one process is reading a tuple and another is writing it >> (e.g., UPDATE or DELETE), the concern is that we need to be able >> to guarantee that either the predicate lock appears in time for >> the writer to see it on the call to >> CheckForSerializableConflictIn or the reader sees the MVCC >> changes in CheckForSerializableConflictOut. It's OK if the >> conflict is detected both ways, but if it's missed both ways then >> we could have a false negative, allowing anomalies to slip >> through. It didn't seem to me on review that acquiring the >> predicate lock after releasing the shared buffer lock (and just >> having the pin) would be enough to ensure that a write which >> followed that would see the predicate lock.
>>
>> reader has pin and shared lock
>> writer increments pin count
>> reader releases shared lock
>> writer acquires exclusive lock
>> writer checks predicate lock and fails to see one
>> reader adds predicate lock
>> we have a problem
>
> Hmmm... Or do we? If both sides were careful to record what > they're doing before checking for a conflict, the pin might be > enough. I'll check for that. In at least one of those moves I was > moving the predicate lock acquisition from after the conflict > check to before, but maybe I didn't need to move it quite so > far....

Some of these calls are placed where there is no reasonable choice but to do them while holding a buffer lock. There are some which *might* be able to be moved out, but I'm inclined to say that should go on a list of possible performance improvements for future releases. My concern is that the code paths are complicated enough to make it hard to confidently desk-check such changes, and we don't yet have a good way to test whether such a change introduces a race condition. The src/test/isolation tests are good at proving the fundamental correctness of the logic in predicate.c, and the DBT-2 stress tests Dan ran are good at flushing out race conditions within predicate.c code; but we don't have a test which is good at pointing out race conditions based on *placement of the calls* to predicate.c from other code. Until we have such tests, I think we should be cautious about risking possible hard-to-reproduce correctness violations to shave a few nanoseconds off the time a shared buffer lock is held. Particularly since we're so close to the end of the release cycle. -Kevin
On 06.02.2011 20:30, Kevin Grittner wrote: > "Kevin Grittner" wrote: > >> I'm working on it now, and hope to have it all settled down today. > > Done and pushed to the git repo. Branch serializable on: > > git://git.postgresql.org/git/users/kgrittn/postgres.git > > Heikki: I'm back to not having any outstanding work on the patch. Ok, committed. Thank you for all this! It has been a huge effort from you and Dan, and a great feature as a result. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 7, 2011 at 23:14, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 06.02.2011 20:30, Kevin Grittner wrote: >> >> "Kevin Grittner" wrote: >> >>> I'm working on it now, and hope to have it all settled down today. >> >> Done and pushed to the git repo. Branch serializable on: >> >> git://git.postgresql.org/git/users/kgrittn/postgres.git >> >> Heikki: I'm back to not having any outstanding work on the patch. > > Ok, committed. > > Thank you for all this! It has been a huge effort from you and Dan, and a > great feature as a result. +1! (or more) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Mon, Feb 7, 2011 at 5:23 PM, Magnus Hagander <magnus@hagander.net> wrote: > On Mon, Feb 7, 2011 at 23:14, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 06.02.2011 20:30, Kevin Grittner wrote: >>> >>> "Kevin Grittner" wrote: >>> >>>> I'm working on it now, and hope to have it all settled down today. >>> >>> Done and pushed to the git repo. Branch serializable on: >>> >>> git://git.postgresql.org/git/users/kgrittn/postgres.git >>> >>> Heikki: I'm back to not having any outstanding work on the patch. >> >> Ok, committed. >> >> Thank you for all this! It has been a huge effort from you and Dan, and a >> great feature as a result. > > +1! Indeed, congratulations Kevin! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Ok, committed. Thanks for that, and for all the suggestions along the way! Thanks also to Joe Conway for the review and partial commit in the first CF for this release; to Jeff Davis for the reviews, pointing out areas which needed work; and to Anssi Kääriäinen for testing thoroughly enough to find some issues to fix! And thanks especially to Dan Ports for joining me in this effort and co-authoring this with me! Besides the actual code contributions, the ability to pound on the result with DBT-2 for days at a time was invaluable. WOO-HOO! -Kevin
Heikki Linnakangas wrote: > Ok, committed. I see that at least three BuildFarm critters don't have UINT64_MAX defined. Not sure why coypu is running out of connections. -Kevin
On 08.02.2011 10:43, Kevin Grittner wrote: > Heikki Linnakangas wrote: > >> Ok, committed. > > I see that at least three BuildFarm critters don't have UINT64_MAX > defined. I guess we'll have to just #define it ourselves. Or could we just pick another magic value? Do we actually rely on InvalidSerCommitSeqNo being higher than all other values anywhere? > Not sure why coypu is running out of connections. Hmm, it seems to choose a smaller max_connections value now, 20 instead of 30. Looks like our shared memory usage went up by just enough to pass that threshold. In any case, our shared memory footprint grew by about 2MB with the default configuration, from 37MB to 39MB. That's quite significant. Should we dial down the default of max_predicate_locks_per_transaction? Or tweak the sizing of the hash tables somehow? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Kevin Grittner wrote: > Heikki Linnakangas wrote: > > > Ok, committed. > > I see that at least three BuildFarm critters don't have UINT64_MAX > defined. Not sure why coypu is running out of connections. Yes, I am seeing those failures here too on my BSD machine. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> Heikki Linnakangas wrote: > On 08.02.2011 10:43, Kevin Grittner wrote: > >> I see that at least three BuildFarm critters don't have UINT64_MAX >> defined. > > I guess we'll have to just #define it ourselves. Or could we just > pick another magic value, do we actually rely on > InvalidSerCommitSeqno being higher than all other values anywhere? It seemed more robust than a low-end number, based on how it's used. >> Not sure why coypu is running out of connections. > > Hmm, it seems to choose a smaller max_connections value now, 20 > instead of 30. Looks like our shared memory usage went up by just > enough to pass that threshold. > > Looks like our shared memory footprint grew by about 2MB with > default configuration, from 37MB to 39MB. That's quite significant. > Should we dial down the default of > max_predicate_locks_per_transaction? Or tweak the sizing of the > hash tables somehow? Dialing down max_predicate_locks_per_transaction could cause the user to see "out of shared memory" errors sooner, so I'd prefer to stay away from that. Personally, I feel that max_connections is higher than it should be for machines which would have trouble with the RAM space, but I suspect I'd have trouble selling the notion that the default for that should be reduced. The multiplier of 10 PredXactList structures per connection is kind of arbitrary. It affects the point at which information is pushed to the lossy summary, so any number from 2 up will work correctly; it's a matter of performance and false positive rate. We might want to put that on a GUC and default it to something lower. -Kevin
On Tue, Feb 08, 2011 at 11:25:34AM +0200, Heikki Linnakangas wrote: > On 08.02.2011 10:43, Kevin Grittner wrote: > > I see that at least three BuildFarm critters don't have UINT64_MAX > > defined. > > I guess we'll have to just #define it ourselves. Or could we just pick > another magic value, do we actually rely on InvalidSerCommitSeqno being > higher than all other values anywhere? As far as I know we don't specifically rely on that anywhere, and indeed I did have it #defined to 1 at one point (with the other constants adjusted to match) and I don't recall any problems. But given that we most often use InvalidSerCommitSeqNo to mean "not committed yet", it made more sense to set it to UINT64_MAX so that if a comparison did sneak in it would do the right thing. I did dust off a copy of the ANSI standard at the time, and it was pretty explicit that UINT64_MAX is supposed to be defined in <stdint.h>. But that may just be a C99 requirement (I didn't have an older copy of the standard), and it's obviously no guarantee that it actually is defined. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote: > On Tue, Feb 08, 2011 at 11:25:34AM +0200, Heikki Linnakangas wrote: >> On 08.02.2011 10:43, Kevin Grittner wrote: >>> I see that at least three BuildFarm critters don't have >>> UINT64_MAX defined. >> >> I guess we'll have to just #define it ourselves. Or could we just >> pick another magic value, do we actually rely on >> InvalidSerCommitSeqno being higher than all other values >> anywhere? > > As far as I know we don't specifically rely on that anywhere, and > indeed I did have it #defined to 1 at one point (with the other > constants adjusted to match) and I don't recall any problems. But > given that we most often use InvalidSerCommitSeqNo to mean "not > committed yet", it made more sense to set it to UINT64_MAX so that > if a comparison did sneak in it would do the right thing. > > I did dust off a copy of the ANSI standard at the time, and it was > pretty explicit that UINT64_MAX is supposed to be defined in > <stdint.h>. But that may just be a C99 requirement (I didn't have > an older copy of the standard), and it's obviously no guarantee > that it actually is defined. Attached is something which will work. Whether people prefer this or a definition of UINT64_MAX in some header file (if it's missing) doesn't matter much to me. At any rate, if someone commits this one-liner, three critters should go back to green. -Kevin
Attachment
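For the curious, the usual portable workaround looks roughly like the sketch below; whether the attached one-liner takes this form or simply avoids UINT64_MAX altogether is a detail of the patch itself:

    /* Sketch of a portable fallback; UINT64CONST is PostgreSQL's
     * existing macro for writing 64-bit integer constants. */
    #ifndef UINT64_MAX
    #define UINT64_MAX UINT64CONST(0xFFFFFFFFFFFFFFFF)
    #endif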
I wrote: > The multiplier of 10 PredXactList structures per connection is > kind of arbitrary. It affects the point at which information is > pushed to the lossy summary, so any number from 2 up will work > correctly; it's a matter of performance and false positive rate. > We might want to put that on a GUC and default it to something > lower. If the consensus is that we want to add this knob, I can code it up today. If we default it to something low, we can knock off a large part of the 2MB increase in shared memory used by SSI in the default configuration. For those not using SERIALIZABLE transactions the only impact is that less shared memory will be reserved for something they're not using. For those who try SERIALIZABLE transactions, the smaller the number, the sooner performance will start to drop off under load -- especially in the face of a long-running READ WRITE transaction. Since it determines shared memory allocation, it would have to be a restart-required GUC. I do have some concern that if this defaults to too low a number, those who try SSI without bumping it and restarting the postmaster will not like the performance under load very much. SSI performance would not be affected by a low setting under light load when there isn't a long-running READ WRITE transaction. -Kevin
On 08.02.2011 17:50, Kevin Grittner wrote: > Attached is something which will work. Whether people prefer this > or a definition of UINT64_MAX in some header file (if it's missing) > doesn't matter much to me. > > At any rate, if someone commits this one-liner, three critters > should go back to green. Thanks, committed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Feb 08, 2011 at 10:14:44AM -0600, Kevin Grittner wrote: > I do have some concern that if this defaults to too low a number, > those who try SSI without bumping it and restarting the postmaster > will not like the performance under load very much. SSI performance > would not be affected by a low setting under light load when there > isn't a long-running READ WRITE transaction. If we're worried about this, we could add a log message the first time SummarizeOldestCommittedXact is called, to suggest increasing the GUC for number of SerializableXacts. This also has the potential benefit of alerting the user that there's a long-running transaction, in case that's unexpected (say, if it were caused by a wedged client). I don't have any particular opinion on what the default value of the GUC should be. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
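Dan's suggestion could be as simple as the following sketch inside SummarizeOldestCommittedXact(); the flag name and message wording are assumptions, not proposed text:

    /* Sketch only: emit a one-time notice when summarization first
     * kicks in, since that usually means the tracked-transaction list
     * is full -- often due to a long-running transaction. */
    static bool summarize_notice_sent = false;

    static void
    SummarizeOldestCommittedXact(void)
    {
        if (!summarize_notice_sent)
        {
            ereport(LOG,
                    (errmsg("serializable transaction information is being summarized to the lossy store"),
                     errhint("A long-running transaction may be preventing cleanup; "
                             "consider increasing the number of tracked serializable transactions.")));
            summarize_notice_sent = true;
        }
        /* ... existing summarization work ... */
    }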
On 08.02.2011 18:14, Kevin Grittner wrote: > I wrote: > >> The multiplier of 10 PredXactList structures per connection is >> kind of arbitrary. It affects the point at which information is >> pushed to the lossy summary, so any number from 2 up will work >> correctly; it's a matter of performance and false positive rate. >> We might want to put that on a GUC and default it to something >> lower. > > If the consensus is that we want to add this knob, I can code it up > today. If we default it to something low, we can knock off a large > part of the 2MB increase in shared memory used by SSI in the default > configuration. For those not using SERIALIZABLE transactions the > only impact is that less shared memory will be reserved for > something they're not using. For those who try SERIALIZABLE > transactions, the smaller the number, the sooner performance will > start to drop off under load -- especially in the face of a > long-running READ WRITE transaction. Since it determines shared > memory allocation, it would have to be a restart-required GUC. > > I do have some concern that if this defaults to too low a number, > those who try SSI without bumping it and restarting the postmaster > will not like the performance under load very much. SSI performance > would not be affected by a low setting under light load when there > isn't a long-running READ WRITE transaction. Hmm, comparing InitPredicateLocks() and PredicateLockShmemSize(), it looks like RWConflictPool is missing altogether from the calculations in PredicateLockShmemSize(). I added an elog to InitPredicateLocks() and PredicateLockShmemSize(), to print the actual and estimated size. Here's what I got with max_predicate_locks_per_transaction=10 and max_connections=100:

LOG: shmemsize 635467
LOG: actual 1194392
WARNING: out of shared memory
FATAL: not enough shared memory for data structure "shmInvalBuffer" (67224 bytes requested)

On the other hand, when I bumped max_predicate_locks_per_transaction to 100, I got:

LOG: shmemsize 3153112
LOG: actual 2339864

Which is a pretty big overestimate, percentage-wise. Taking RWConflictPool into account in PredicateLockShmemSize() fixes the underestimate, but makes the overestimate correspondingly larger. I've never compared the actual and estimated shmem sizes of other parts of the backend, so I'm not sure how large discrepancies we usually have, but that seems quite big. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Taking RWConflictPool into account in PredicateLockShmemSize() fixes the > underestimate, but makes the overestimate correspondingly larger. I've > never compared the actual and estimated shmem sizes of other parts of > the backend, so I'm not sure how large discrepancies we usually have, > but that seems quite big. Looking into it... -Kevin
One other nit re. the predicate lock table size GUCs: the out-of-memory case in RegisterPredicateLockingXid (predicate.c:1592 in my tree) gives the hint to increase max_predicate_locks_per_transaction. I don't think that's correct, since that GUC isn't used to size SerializableXidHash. In fact, that error shouldn't arise at all because if there was room in PredXact to register the transaction, then there should be room to register its xid in SerializableXidHash. Except that it's possible for something else to allocate all of our shared memory and thus prevent SerializableXidHash from reaching its intended max capacity. In general, it might be worth considering making a HTAB's max_size a hard limit, but that's a larger issue. Here, it's probably worth just removing the hint. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> LOG: shmemsize 3153112
> LOG: actual 2339864
>
> Which is a pretty big overestimate, percentage-wise. Taking > RWConflictPool into account in PredicateLockShmemSize() fixes the > underestimate, but makes the overestimate correspondingly larger. > I've never compared the actual and estimated shmem sizes of other > parts of the backend, so I'm not sure how large discrepancies we > usually have, but that seems quite big.

I found two things which probably explain that:

(1) When HTABs are created, there is the max_size, which is what the PredicateLockShmemSize function must use in its calculations, and the init_size, which is what will initially be allocated (and so, is probably what you see in the usage at the end of the InitPredicateLocks function). That's normally set to half the maximum.

(2) The predicate lock and lock target initialization code was initially copied and modified from the code for heavyweight locks. The heavyweight lock code adds 10% to the calculated maximum size. So I wound up doing that for PredicateLockTargetHash and PredicateLockHash, but didn't do it for SerializableXidHash. Should I eliminate this from the first two, add it to the third, or leave it alone?

So if the space was all in HTABs, you might expect shmemsize to be 110% of the estimated maximum, and actual (at the end of the init function) to be 50% of the estimated maximum. So the shmemsize would be (2.2 * actual) at that point. The difference isn't that extreme because the list-based pools now used for some structures are allocated at full size without padding. In addition to the omission of the RWConflictPool (which is a biggie), the OldSerXidControlData estimate was only for a *pointer* to it, not the structure itself. The attached patch should correct the shmemsize numbers. -Kevin
Attachment
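Kevin's point (1) corresponds to the two size arguments of ShmemInitHash(). Roughly, simplified from predicate.c (the surrounding variable declarations and HASHCTL setup are omitted):

    /* Sketch (simplified): the shmem estimate must assume max_table_size,
     * while only init_table_size -- normally half of it -- is actually
     * allocated when the hash table is created. */
    max_table_size  = NPREDICATELOCKTARGETENTS();
    init_table_size = max_table_size / 2;

    PredicateLockTargetHash = ShmemInitHash("PREDICATELOCKTARGET hash",
                                            init_table_size,
                                            max_table_size,
                                            &info,
                                            hash_flags);

The gap between the two arguments is exactly why the actual usage right after initialization sits near 50% of what PredicateLockShmemSize() has to report.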
On Tue, Feb 08, 2011 at 04:04:39PM -0600, Kevin Grittner wrote: > (2) The predicate lock and lock target initialization code was > initially copied and modified from the code for heavyweight locks. > The heavyweight lock code adds 10% to the calculated maximum size. > So I wound up doing that for PredicateLockTargetHash and > PredicateLockHash, but didn't do it for SerializableXidHash. > Should I eliminate this from the first two, add it to the third, or > leave it alone? Actually, I think for SerializableXidHash we should probably just initially allocate it at its maximum size. Then it'll match the PredXact list which is allocated in full upfront, and there's no risk of being able to allocate a transaction but not register its xid. In fact, I believe there would be no way for starting a new serializable transaction to fail. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
On Tue, Feb 8, 2011 at 7:23 PM, Dan Ports <drkp@csail.mit.edu> wrote: > On Tue, Feb 08, 2011 at 04:04:39PM -0600, Kevin Grittner wrote: >> (2) The predicate lock and lock target initialization code was >> initially copied and modified from the code for heavyweight locks. >> The heavyweight lock code adds 10% to the calculated maximum size. >> So I wound up doing that for PredicateLockTargetHash and >> PredicateLockHash, but didn't do it for SerializableXidHash. >> Should I eliminate this from the first two, add it to the third, or >> leave it alone? > > Actually, I think for SerializableXidHash we should probably just > initially allocate it at its maximum size. Then it'll match the > PredXact list which is allocated in full upfront, and there's no risk > of being able to allocate a transaction but not register its xid. In > fact, I believe there would be no way for starting a new serializable > transaction to fail. No way to fail is a tall order. If we don't allocate all the memory up front, does that allow memory to be dynamically shared between different hash tables in shared memory? I'm thinking not, but... Frankly, I think this is an example of how our current shared memory model is a piece of garbage. Our insistence on using sysv shm, and only sysv shm, is making it impossible for us to do things that other products can do easily. My first reaction to this whole discussion was "who gives a crap about 2MB of shared memory?" and then I said "oh, right, we do, because it might cause someone who was going to get 24MB of shared buffers to get 16MB instead, and then performance will suck even worse than it does already". But of course the person should really be running with 256MB or more, in all likelihood, and would be happy to have us do that right out of the box if it didn't require them to do a tap-dance around their kernel settings and reboot. We really need to fix this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 08, 2011 at 09:09:48PM -0500, Robert Haas wrote: > No way to fail is a tall order. Well, no way to fail due to running out of shared memory in RegisterPredicateLock/RegisterPredicateLockingXid, but that doesn't have quite the same ring to it... > If we don't allocate all the memory up front, does that allow memory > to be dynamically shared between different hash tables in shared > memory? I'm thinking not, but... Not in a useful way. If we only allocate some of the memory up front, then the rest goes into the global shmem pool (actually, that has nothing to do with the hash table per se, just the ShmemSize calculations), and it's up for grabs for any hash table that wants to expand, even beyond its declared maximum capacity. But once it's claimed by a hash table it can't get returned. This doesn't sound like a feature to me. In particular, I'd worry that something that allocates a lot of locks (either of the heavyweight or predicate variety) would fill up the associated hash table, and then we're out of shared memory for the other hash tables -- and have no way to get it back short of restarting the whole system. > Frankly, I think this is an example of how our current shared memory > model is a piece of garbage. Our insistence on using sysv shm, and > only sysv shm, is making it impossible for us to do things that other > products can do easily. My first reaction to this whole discussion > was "who gives a crap about 2MB of shared memory?" and then I said > "oh, right, we do, because it might cause someone who was going to get > 24MB of shared buffers to get 16MB instead, and then performance will > suck even worse than it does already". But of course the person > should really be running with 256MB or more, in all likelihood, and > would be happy to have us do that right out of the box if it didn't > require them to do tap-dance around their kernel settings and reboot. I'm completely with you on this. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
On 09.02.2011 00:04, Kevin Grittner wrote: > (1) When HTABs are created, there is the max_size, which is what > the PredicateLockShmemSize function must use in its calculations, > and the init_size, which is what will initially be allocated (and > so, is probably what you see in the usage at the end of the > InitPredicateLocks function). That's normally set to half the maximum. Oh, I see. > (2) The predicate lock and lock target initialization code was > initially copied and modified from the code for heavyweight locks. > The heavyweight lock code adds 10% to the calculated maximum size. > So I wound up doing that for PredicateLockTargetHash and > PredicateLockHash, but didn't do it for SerializableXidHash. > Should I eliminate this from the first two, add it to the third, or > leave it alone? I'm inclined to eliminate it from the first two. Even in LockShmemSize(), it seems a bit weird to add a safety margin; the sizes of the lock and proclock hashes are just rough estimates anyway. > So if the space was all in HTABs, you might expect shmemsize to be > 110% of the estimated maximum, and actual (at the end of the init > function) to be 50% of the estimated maximum. So the shmemsize > would be (2.2 * actual) at that point. The difference isn't that > extreme because the list-based pools now used for some structures > are allocated at full size without padding. > > In addition to the omission of the RWConflictPool (which is a > biggie), the OldSerXidControlData estimate was only for a *pointer* > to it, not the structure itself. The attached patch should correct > the shmemsize numbers. The actual and estimated shmem sizes still didn't add up; I still saw actual usage much higher than estimated size, with max_connections=1000 and max_predicate_locks_per_transaction=10. It turned out to be because:

* You missed that RWConflictPool is sized five times as large as SerializableXidHash, and
* The allocation for RWConflictPool elements was wrong, while the estimate was correct.

With these changes, the estimated and actual sizes match closely, so that actual hash table sizes are 50% of the estimated size as expected. I fixed those bugs, but this doesn't help with the buildfarm members with limited shared memory yet. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
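In other words, the estimate needs a term along these lines in PredicateLockShmemSize(); the element struct name follows predicate_internals.h, and the exact expression here is an assumption rather than the committed fix:

    /* Sketch of the missing term: RWConflictPool holds five times as
     * many elements as SerializableXidHash can hold entries. */
    max_table_size *= 5;
    size = add_size(size, mul_size((Size) max_table_size,
                                   sizeof(RWConflictData)));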
On Tue, Feb 08, 2011 at 09:09:48PM -0500, Robert Haas wrote: > If we don't allocate all the memory up front, does that allow memory > to be dynamically shared between different hash tables in shared > memory? I'm thinking not, but... > > Frankly, I think this is an example of how our current shared memory > model is a piece of garbage. What other model(s) might work better? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> (2) The predicate lock and lock target initialization code was >> initially copied and modified from the code for heavyweight >> locks. The heavyweight lock code adds 10% to the calculated >> maximum size. So I wound up doing that for >> PredicateLockTargetHash and PredicateLockHash, but didn't do it >> for SerializableXidHash. Should I eliminate this from the first >> two, add it to the third, or leave it alone? > > I'm inclined to eliminate it from the first two. Even in > LockShmemSize(), it seems a bit weird to add a safety margin; the > sizes of the lock and proclock hashes are just rough estimates > anyway. I'm fine with that. Trivial patch attached.
> * You missed that RWConflictPool is sized five times as large as SerializableXidHash, and
> * The allocation for RWConflictPool elements was wrong, while the estimate was correct.
>
> With these changes, the estimated and actual sizes match closely, > so that actual hash table sizes are 50% of the estimated size as > expected. > > I fixed those bugs Thanks. Sorry for missing them. > but this doesn't help with the buildfarm members with limited > shared memory yet. Well, if dropping the 10% fudge factor on those two HTABs doesn't bring it down far enough (which seems unlikely), what do we do? We could, as I said earlier, bring down the multiplier for the number of transactions we track in SSI based on the maximum allowed connections, but I would really want a GUC on it if we do that. We could bring down the default number of predicate locks per transaction. We could make the default configuration more stingy about max_connections when memory is this tight. Other ideas? I do think that anyone using SSI with a heavy workload will need something like the current values to see decent performance, so it would be good if there was some way to do this which would tend to scale up as they increased something. Wild idea: make the multiplier equivalent to the bytes of shared memory divided by 100MB, clamped to a minimum of 2 and a maximum of 10? -Kevin
Attachment
On 02/09/2011 04:16 PM, David Fetter wrote: > On Tue, Feb 08, 2011 at 09:09:48PM -0500, Robert Haas wrote: >> Frankly, I think this is an example of how our current shared memory >> model is a piece of garbage. > > What other model(s) might work better? Thread-based, dynamically allocatable and resizeable shared memory, as most other projects and developers use, for example. My dynshmem work is a first attempt at addressing the allocation part of that. It would theoretically allow more dynamic use of the overall fixed amount of shared memory available (instead of requiring every subsystem to use a fixed fraction of the overall available shared memory, as is required now). It was dismissed from CF 2010-07 for good reasons (lacking evidence of usable performance, possible patent issues (on the allocator chosen), lots of work for questionable benefit (existing subsystems would have to be reworked to use that allocator)). For anybody interested, please search the archives for 'dynshmem'. Regards Markus Wanner
Dan Ports <drkp@csail.mit.edu> wrote: > I think for SerializableXidHash we should probably just initially > allocate it at its maximum size. Then it'll match the PredXact > list which is allocated in full upfront, and there's no risk of > being able to allocate a transaction but not register its xid. In > fact, I believe there would be no way for starting a new > serializable transaction to fail. To be more precise, it would prevent an out of shared memory error during an attempt to register an xid for an active serializable transaction. That seems like a good thing. Patch to remove the hint and initially allocate that HTAB at full size attached. I didn't attempt to address the larger general issue of one HTAB stealing shared memory from space calculated to belong to another, and then holding on to it until the postmaster is shut down. -Kevin
Attachment
On Wed, Feb 9, 2011 at 10:38 AM, Markus Wanner <markus@bluegap.ch> wrote: > On 02/09/2011 04:16 PM, David Fetter wrote: >> On Tue, Feb 08, 2011 at 09:09:48PM -0500, Robert Haas wrote: >>> Frankly, I think this is an example of how our current shared memory >>> model is a piece of garbage. >> >> What other model(s) might work better? > > Thread based, dynamically allocatable and resizeable shared memory, as > most other projects and developers use, for example. Or less invasively, a small sysv shm to prevent the double-postmaster problem, and allocate the rest using POSIX shm. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
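In outline, that hybrid could look like the sketch below: a vestigial SysV segment retained only for the orphaned-backend interlock, with the real region coming from POSIX shm, where ftruncate() permits later resizing. Purely illustrative, not a proposed patch:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void *
    attach_posix_region(const char *name, size_t region_size)
    {
        /* A tiny SysV segment (not shown) would still provide the
         * "is another postmaster already attached?" interlock, which
         * POSIX shm cannot detect via nattch. */
        int     fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        void   *base;

        if (fd < 0 || ftruncate(fd, (off_t) region_size) != 0)
            return NULL;
        base = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        return (base == MAP_FAILED) ? NULL : base;
    }

The attraction is that POSIX shm is typically not subject to the same kernel-tunable ceilings as SysV shm, and ftruncate() on the descriptor can grow the region after creation.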
On Feb 9, 2011, at 12:25 PM, Robert Haas wrote: > On Wed, Feb 9, 2011 at 10:38 AM, Markus Wanner <markus@bluegap.ch> wrote: >> On 02/09/2011 04:16 PM, David Fetter wrote: >>> On Tue, Feb 08, 2011 at 09:09:48PM -0500, Robert Haas wrote: >>>> Frankly, I think this is an example of how our current shared memory >>>> model is a piece of garbage. >>> >>> What other model(s) might work better? >> >> Thread based, dynamically allocatable and resizeable shared memory, as >> most other projects and developers use, for example. > > Or less invasively, a small sysv shm to prevent the double-postmaster > problem, and allocate the rest using POSIX shm. Such a patch was proposed and rejected: http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/94791 Cheers, M
On Wed, Feb 9, 2011 at 2:16 PM, A.M. <agentm@themactionfaction.com> wrote: > On Feb 9, 2011, at 12:25 PM, Robert Haas wrote: >> On Wed, Feb 9, 2011 at 10:38 AM, Markus Wanner <markus@bluegap.ch> wrote: >>> On 02/09/2011 04:16 PM, David Fetter wrote: >>>> On Tue, Feb 08, 2011 at 09:09:48PM -0500, Robert Haas wrote: >>>>> Frankly, I think this is an example of how our current shared memory >>>>> model is a piece of garbage. >>>> >>>> What other model(s) might work better? >>> >>> Thread based, dynamically allocatable and resizeable shared memory, as >>> most other projects and developers use, for example. >> >> Or less invasively, a small sysv shm to prevent the double-postmaster >> problem, and allocate the rest using POSIX shm. > > Such a patch was proposed and rejected: > http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/94791 I know. We need to revisit that for 9.2 and un-reject it. It's nice that PostgreSQL can run on my thermostat, but it isn't nice that that's the only place where it delivers the expected level of performance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 02/09/2011 06:25 PM, Robert Haas wrote: > On Wed, Feb 9, 2011 at 10:38 AM, Markus Wanner <markus@bluegap.ch> wrote: >> Thread based, dynamically allocatable and resizeable shared memory, as >> most other projects and developers use, for example. I didn't mean to say we should switch to that model. It's just *the* other model that works (whether or not it's better in general or for Postgres is debatable). > Or less invasively, a small sysv shm to prevent the double-postmaster > problem, and allocate the rest using POSIX shm. ..which allows ftruncate() to resize, right? That's the main benefit over sysv shm which we currently use. ISTM that addresses the resizing-of-the-overall-shared-memory question, but doesn't that require dynamic allocation or some other kind of book-keeping? Or do you envision all subsystems to have to re-initialize their new (grown or shrunken) chunk of it? Regards Markus Wanner
On Wed, Feb 9, 2011 at 2:51 PM, Markus Wanner <markus@bluegap.ch> wrote: > On 02/09/2011 06:25 PM, Robert Haas wrote: >> On Wed, Feb 9, 2011 at 10:38 AM, Markus Wanner <markus@bluegap.ch> wrote: >>> Thread based, dynamically allocatable and resizeable shared memory, as >>> most other projects and developers use, for example. > > I didn't mean to say we should switch to that model. It's just *the* > other model that works (whether or not it's better in general or for > Postgres is debatable). > >> Or less invasively, a small sysv shm to prevent the double-postmaster >> problem, and allocate the rest using POSIX shm. > > ..which allows ftruncate() to resize, right? That's the main benefit > over sysv shm which we currently use. > > ISTM that addresses the resizing-of-the-overall-shared-memory question, > but doesn't that require dynamic allocation or some other kind of > book-keeping? Or do you envision all subsystems to have to > re-initialize their new (grown or shrunken) chunk of it? Basically, I'd be happy if all we got out of it was freedom from the oppressive system shared memory limits. On a modern system, it's hard to imagine that the default for shared_buffers should be less than 256MB, but that blows out the default POSIX shared memory allocation limits on every operating system I use, and some of those need a reboot to fix it. That's needlessly reducing performance and raising the barrier of entry for new users. I am waiting for the day when I have to explain to the guy with a terabyte of memory that the reason why his performance sucks so bad is because he's got a 16MB buffer cache. The percentage of memory we're allocating to shared_buffers should not need to be expressed in scientific notation. But once we get out from under that, I think there might well be some advantage to have certain subsystems allocate their own segments, and/or using ftruncate() for resizing. I don't have a concrete proposal in mind, though. It's very much non-trivial to resize shared_buffers, for example, even if you assume that the size of the shm can easily be changed. So I don't expect quick progress on this front; but it would be nice to have those options available. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09.02.2011 17:58, Kevin Grittner wrote: > Dan Ports<drkp@csail.mit.edu> wrote: > >> I think for SerializableXidHash we should probably just initially >> allocate it at its maximum size. Then it'll match the PredXact >> list which is allocated in full upfront, and there's no risk of >> being able to allocate a transaction but not register its xid. In >> fact, I believe there would be no way for starting a new >> serializable transaction to fail. > > To be more precise, it would prevent an out of shared memory error > during an attempt to register an xid for an active serializable > transaction. That seems like a good thing. Patch to remove the > hint and initially allocate that HTAB at full size attached. Committed. Curiously, coypu has gone green again. It's now choosing 40 connections and 8 MB of shared_buffers, while it used to choose 30 connections and 24 MB of shared_buffers before the SSI patch. Looks like fixing the size estimation bugs helped that, but I'm not entirely sure how. Maybe it just failed with higher max_connections settings because of the misestimate. But why does it now choose a *higher* max_connections setting than before? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 02/10/2011 05:09 AM, Heikki Linnakangas wrote: > On 09.02.2011 17:58, Kevin Grittner wrote: >> Dan Ports<drkp@csail.mit.edu> wrote: >> >>> I think for SerializableXidHash we should probably just initially >>> allocate it at its maximum size. Then it'll match the PredXact >>> list which is allocated in full upfront, and there's no risk of >>> being able to allocate a transaction but not register its xid. In >>> fact, I believe there would be no way for starting a new >>> serializable transaction to fail. >> >> To be more precise, it would prevent an out of shared memory error >> during an attempt to register an xid for an active serializable >> transaction. That seems like a good thing. Patch to remove the >> hint and initially allocate that HTAB at full size attached. > > Committed. > > Curiously, coypu has gone green again. It's now choosing 40 > connections and 8 MB of shared_buffers, while it used to choose 30 > connections and 24 MB of shared_buffers before the SSI patch. Looks > like fixing the size estimation bugs helped that, but I'm not entirely > sure how. Maybe it just failed with higher max_connections settings > because of the misestimate. But why does it now choose a *higher* > max_connections setting than before? Rémi might have increased its available resources. cheers andrew