Help with optimizing a query over hierarchical data - Mailing list pgsql-performance

From: Damon Snyder
Subject: Help with optimizing a query over hierarchical data
Date:
Msg-id: CACkQbuiU7aaWxHsOuwVZgibg=Kjj5+tyM3h3-10ntdBK7SSYpw@mail.gmail.com
Responses: Re: Help with optimizing a query over hierarchical data
List: pgsql-performance

Hi Everyone,
We have a data set and access pattern that are causing us some performance issues. The data set is hierarchical, with about 2 million rows at the lowest level (object), roughly 500k at the next level (container), and approximately 10 at the highest level (category).
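To make the shape of the data concrete, here is a rough sketch of the hierarchy. The table and column names below are only illustrative; the actual definitions are in the gist linked further down.

    CREATE TABLE category (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    -- ~500k rows; each container belongs to exactly one category
    CREATE TABLE container (
        id          serial PRIMARY KEY,
        category_id integer NOT NULL REFERENCES category (id),
        name        text NOT NULL
    );

    -- ~2M rows; each object has one primary container
    CREATE TABLE object (
        id           serial PRIMARY KEY,
        container_id integer NOT NULL REFERENCES container (id),
        score        integer NOT NULL DEFAULT 0,
        created_at   timestamptz NOT NULL DEFAULT now()
    );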

The way the data is structured, objects live in one primary container but can also reside in one or more secondary containers. The secondary containers have to be filtered by category ids for access control. This doesn't really pose a problem for the primary containers because they are homogeneous by category, but it slows things down on the secondary containers.
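The secondary containment is a many-to-many link, roughly like this (again, the names are only illustrative):

    -- objects can also appear in one or more secondary containers
    CREATE TABLE container_object (
        container_id integer NOT NULL REFERENCES container (id),
        object_id    integer NOT NULL REFERENCES object (id),
        PRIMARY KEY (container_id, object_id)
    );

Because only the primary containers are homogeneous by category, filtering a secondary container for access control means joining each of its objects back out through the object's primary container to get at the category.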

The access pattern certainly complicates things. We need to order the data by a value (chronological, score, etc.) and jump to an arbitrary position and window within that ordering. I have provided examples of the chronological and score-based ordering and indexing in the script linked below. In the sample code I'm proxying the chronological ordering with the sequence-generated id; in the application we use a timestamp.
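Putting that together, the shape of the problem query is roughly the following. This is only a sketch; the real versions, along with the indexes we're using, are in the gist.

    SELECT o.id
    FROM container_object co
    JOIN object o        ON o.id = co.object_id
    JOIN container prim  ON prim.id = o.container_id  -- the object's primary container
    WHERE co.container_id = 1234                      -- the secondary container being viewed
      AND prim.category_id IN (1, 2, 3)               -- categories the caller may see
    ORDER BY o.id DESC                                -- or o.score DESC for the score version
    LIMIT 25 OFFSET 25000;                            -- jump to an arbitrary window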

I have created sample code (https://gist.github.com/drsnyder/9277054) that will build a dataset similar to the one we have in production. The distributions aren't exactly the same, but they should reproduce the behavior. I have also provided examples of the queries that give us problems.

The code is documented inline and points out the queries that are causing problems. You should be able to run the script on a 9.2.2 (we use 9.2.6) database in about 10 minutes on a development laptop (or 1-2 minutes on production-like hardware) to experiment with the data and the SQL. The total database size is about 570MB.

The primary query that I'm trying to optimize executes in about 1600ms on my laptop and about 800ms on production-like hardware (more for the score version). My target is to get the data fetch down below 100ms if possible. 

If you have any suggestions it would be greatly appreciated. Am I missing something obvious? Is there a logically equivalent alternative that would be more efficient?

I'm also open to creative ways to cache the data in PostgreSQL. Should we use a rollup table or arrays? Those options aren't ideal because of the maintenance required, but if it's the only option I'm OK with that.
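For example, the kind of rollup I have in mind would be something along these lines (purely a sketch, not something we've built yet):

    -- denormalized listing per (secondary container, category), maintained out of band
    CREATE TABLE container_object_rollup (
        container_id integer     NOT NULL,
        category_id  integer     NOT NULL,
        object_id    integer     NOT NULL,
        created_at   timestamptz NOT NULL,
        score        integer     NOT NULL,
        PRIMARY KEY (container_id, object_id)
    );

    CREATE INDEX ON container_object_rollup (container_id, category_id, created_at DESC);
    CREATE INDEX ON container_object_rollup (container_id, category_id, score DESC);

That would let the window queries hit a single index per ordering, at the cost of keeping the rollup in sync whenever objects change.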

Thanks,
Damon
