Help with optimizing a query over hierarchical data - Mailing list pgsql-performance

From: Damon Snyder
Subject: Help with optimizing a query over hierarchical data
Date:
Msg-id: CACkQbuiU7aaWxHsOuwVZgibg=Kjj5+tyM3h3-10ntdBK7SSYpw@mail.gmail.com
Responses: Re: Help with optimizing a query over hierarchical data
List: pgsql-performance

Hi Everyone,
We have a data set and access pattern that are causing us some performance issues. The data set is hierarchical, with about 2 million rows at the lowest level (object), roughly 500k at the next level (container), and approximately 10 at the highest level (category).
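To make the shape of the data concrete, here is a rough sketch of the hierarchy. The table and column names below are only illustrative; the actual definitions are in the gist linked further down.

    CREATE TABLE category (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    -- ~500k rows; each container belongs to exactly one category
    CREATE TABLE container (
        id          serial PRIMARY KEY,
        category_id integer NOT NULL REFERENCES category (id),
        name        text NOT NULL
    );

    -- ~2M rows; each object has one primary container
    CREATE TABLE object (
        id           serial PRIMARY KEY,
        container_id integer NOT NULL REFERENCES container (id),
        score        integer NOT NULL DEFAULT 0,
        created_at   timestamptz NOT NULL DEFAULT now()
    );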

The way the data is structured, objects live in one primary container but can also reside in one or more secondary containers. The secondary containers have to be filtered by category ids for access control. This doesn't really pose a problem for the primary containers because they are homogeneous by category, but it slows things down on the secondary containers.
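The secondary containment is a many-to-many link, roughly like this (again, the names are only illustrative):

    -- objects can also appear in one or more secondary containers
    CREATE TABLE container_object (
        container_id integer NOT NULL REFERENCES container (id),
        object_id    integer NOT NULL REFERENCES object (id),
        PRIMARY KEY (container_id, object_id)
    );

Because only the primary containers are homogeneous by category, filtering a secondary container for access control means joining each of its objects back out through the object's primary container to get at the category.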

The access pattern certainly complicates things. We need to order the data by a value (chronological, score, etc.) and jump to an arbitrary position and window within that ordering. I have provided examples of the chronological and score-based ordering and indexing in the script linked below. In the sample code I'm proxying the chronological ordering with the sequence-generated id; in the application we use a timestamp.
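Putting that together, the shape of the problem query is roughly the following. This is only a sketch; the real versions, along with the indexes we're using, are in the gist.

    SELECT o.id
    FROM container_object co
    JOIN object o        ON o.id = co.object_id
    JOIN container prim  ON prim.id = o.container_id  -- the object's primary container
    WHERE co.container_id = 1234                      -- the secondary container being viewed
      AND prim.category_id IN (1, 2, 3)               -- categories the caller may see
    ORDER BY o.id DESC                                -- or o.score DESC for the score version
    LIMIT 25 OFFSET 25000;                            -- jump to an arbitrary window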

I have created sample code (https://gist.github.com/drsnyder/9277054) that will build a dataset similar to the one we have in production. The distributions aren't exactly the same, but they should reproduce the behavior. I have also provided examples of the queries that give us problems.

The code is documented inline and points out the queries that are causing problems. You should be able to run the script on a 9.2.2 (we use 9.2.6) database in about 10 minutes on a development laptop (or 1-2 minutes on production-like hardware) to experiment with the data and the SQL. The total database size is about 570MB.

The primary query that I'm trying to optimize executes in about 1600ms on my laptop and about 800ms on production-like hardware (more for the score version). My target is to get the data fetch down below 100ms if possible. 

If you have any suggestions it would be greatly appreciated. Am I missing something obvious? Is there a logically equivalent alternative that would be more efficient?

I'm also open to creative ways to cache the data in PostgreSQL. Should we use a rollup table or arrays? Those options aren't ideal because of the maintenance required, but if it's the only option I'm OK with that.
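For example, the kind of rollup I have in mind would be something along these lines (purely a sketch, not something we've built yet):

    -- denormalized listing per (secondary container, category), maintained out of band
    CREATE TABLE container_object_rollup (
        container_id integer     NOT NULL,
        category_id  integer     NOT NULL,
        object_id    integer     NOT NULL,
        created_at   timestamptz NOT NULL,
        score        integer     NOT NULL,
        PRIMARY KEY (container_id, object_id)
    );

    CREATE INDEX ON container_object_rollup (container_id, category_id, created_at DESC);
    CREATE INDEX ON container_object_rollup (container_id, category_id, score DESC);

That would let the window queries hit a single index per ordering, at the cost of keeping the rollup in sync whenever objects change.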

Thanks,
Damon
