Re: [noob] How to optimize this double pivot query? - Mailing list pgsql-sql

From: Samuel Gendler
Subject: Re: [noob] How to optimize this double pivot query?
Msg-id: CAEV0TzCHcA_oA2KOKFMn3y_sWYoS56a+4sPHRMqHdgpxOdoMCw@mail.gmail.com
In response to: Re: [noob] How to optimize this double pivot query? (Robert Buck <buck.robert.j@gmail.com>)
List: pgsql-sql


On Wed, Oct 3, 2012 at 9:31 AM, Robert Buck <buck.robert.j@gmail.com> wrote:
Samuel,

You asked:

Some questions before I provide crosstab samples - your example query has a hardcoded set of keys that it is looking for, but maybe that was code-generated.

There are multiple forms of the crosstab function, some of which rely on a constant column count and others which can generate the set of columns based on another query. There are complications in that second form relating to race conditions with regard to new keys showing up between the query for the columns and the query for the data, so it is important to understand that and either structure queries accordingly or make sure you execute in a transaction with sufficient transaction isolation to prevent the race condition from appearing.

The query was hand-written from examples I found on the web. The list here was hand-written too. This will run in Ruby, so I can template as much as needed on that side. I can also run this in a defined transaction boundary rather than using auto-commit. Right now I am thumbing through your email and trying it out.

Actually, it isn't sufficient just to run within a defined transaction, since the default isolation level of a transaction in PostgreSQL doesn't guarantee that you'll see the same rows in two queries during the same transaction. If another transaction commits between your queries, you'll see the new rows in the second query. You need to have your transaction isolation set to 'serializable' in order to prevent phantom reads. The default isolation level is 'read committed', which says you'll only see rows that have been committed, but there's no guarantee that two queries will get the same values if another transaction commits a change between them.
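For example, something along these lines (the metrics table and its columns here are just placeholders):

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    -- query 1: discover the set of category keys
    SELECT DISTINCT key_name FROM metrics ORDER BY 1;
    -- query 2: fetch the data; this runs against the same snapshot,
    -- so no new keys can appear between the two queries
    SELECT test_id, key_name, value FROM metrics ORDER BY 1, 2;
    COMMIT;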

For details, see the chapter on transaction isolation in the PostgreSQL documentation: https://www.postgresql.org/docs/current/transaction-iso.html
In general, I find that it is easier to set up my crosstab processing to generate consistent column sets by including a where clause in all but the first query such that the where clause always matches the first query. That won't work if column categories can be deleted, since you'll still get a different column set from one query to the next, but the vast majority of my apps don't allow actual row deletion, or it is at least incredibly rare and only done by administrative users, not end-users of the app. Generally, rows representing domain entities get marked as disabled, so rows going away isn't usually a factor for me. And it means I don't have to worry that someone who comes along and re-uses my code will forget to set their transaction isolation up correctly when running the crosstab query and get wacky results.
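A rough sketch of what I mean, using the two-query form of crosstab from the tablefunc extension (again, the metrics table and the output columns are made up for illustration):

    -- requires the tablefunc extension: CREATE EXTENSION tablefunc;
    SELECT *
    FROM crosstab(
        -- data query
        $$SELECT test_id, key_name, value
            FROM metrics
           WHERE NOT disabled          -- same where clause...
           ORDER BY 1, 2$$,
        -- category query
        $$SELECT DISTINCT key_name
            FROM metrics
           WHERE NOT disabled          -- ...as the data query
           ORDER BY 1$$
    ) AS ct(test_id int, cpu_used numeric, mem_used numeric);

The AS ct(...) column definition list still has to agree with whatever the category query returns, which is exactly why keeping the where clauses identical matters.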

Note that if you do use serializable transactions to do this, you should use a read-only transaction to cut down on locking overhead and the possibility that other transactions will fail due to serialization errors while your crosstab is running.
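Something like this (just a sketch):

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY;
    -- ... category query, then crosstab query ...
    COMMIT;

On 9.1 and later you can also add DEFERRABLE, which makes a read-only serializable transaction wait for a safe snapshot up front rather than risk serialization failures later.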

This all seems terribly complex when describing it, but it really is pretty easy once you've done it once.  The column-count/serializable thing applies to using array_agg to build a crosstab, too, since you still need to ensure that your arrays all have the same number of entries in the same order.
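For instance, the array_agg version would look roughly like this (same hypothetical metrics table as above; the ORDER BY inside the aggregate is what keeps entries lined up across rows):

    SELECT test_id,
           array_agg(key_name ORDER BY key_name) AS keys,
           array_agg(value ORDER BY key_name) AS vals
      FROM metrics
     GROUP BY test_id;

The arrays only line up row-to-row if every test_id has a value for the same set of keys, so the same consistency concerns apply.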

--sam
