Re: Will pg_repack improve this query performance? - Mailing list pgsql-general

From Alban Hertroys
Subject Re: Will pg_repack improve this query performance?
Date
Msg-id E3E6B7B4-C00E-411F-BF9D-C34BA058A4FC@gmail.com
In response to Will pg_repack improve this query performance?  (Abelard Hoffman <abelardhoffman@gmail.com>)
Responses Re: Will pg_repack improve this query performance?
Re: Will pg_repack improve this query performance?
List pgsql-general
On 15 Oct 2014, at 4:33, Abelard Hoffman <abelardhoffman@gmail.com> wrote:

> I believe this query is well optimized, but it's slow if all the blocks aren't already in memory.
>
> Here's example explain output. You can see it takes over 7 seconds to run when it needs to hit the disk, and almost all of it is related to checking if the user has "messages."
>
>   http://explain.depesz.com/s/BLT

From that plan it is obvious that the index scan takes the most time. It looks like you have 3315 rows matching to_id = users.id, of which only 10 match your query conditions after applying the filter.

With your current setup, the database first needs to find candidate rows in the index and then has to check the other conditions against the table, which is likely to involve some disk access.

> On a second run, it's extremely fast (< 50ms). So I'm thinking it's a lack of clustering on the "Index Cond: (to_id = users.user_id)" that's the culprit.

That probably means that the relevant parts of the table were still in memory, which means the scan did not need to visit the disk to load the matched rows before filtering on the NULL conditions in your query.

> I'm afraid of using CLUSTER due to the exclusive lock, but I found pg_repack while researching:
> http://reorg.github.io/pg_repack/

A CLUSTER would help put rows with the same to_id together. Disk access would be less random that way, so it would help some.
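As a minimal sketch (table and index names here are illustrative, assuming a "messages" table with an index on to_id):

```sql
-- CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the table,
-- physically ordering rows by the given index.
CLUSTER messages USING messages_to_id_idx;

-- CLUSTER is a one-time rewrite; new rows are not kept in order, so
-- re-run it periodically, and ANALYZE so the planner sees the new
-- physical ordering (correlation statistics).
ANALYZE messages;
```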

According to your query plan, accessing disk (assuming that's what made the difference) was 154 times (7700 ms / 50 ms) slower than accessing memory. I don't have the numbers for your disks or memory, but that doesn't look like an unrealistically large difference. That raises the question: how random was that disk access, and how much can be gained from clustering the data?

Did you try a partial index on to_id with those NULL conditions? That should result in a much smaller index, which in turn makes it faster to scan - all the more so if the full index is too large to keep in memory. More importantly, the scan wouldn't need to visit the table to verify those NULL fields.
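Something along these lines - the table name and the exact NULL conditions are assumptions here, since they aren't shown in your plan; substitute whatever your query actually filters on:

```sql
-- A partial index contains only the rows that can satisfy the query's
-- WHERE clause, so it stays small and the predicate columns don't need
-- to be re-checked against the heap during the index scan.
-- CONCURRENTLY avoids holding an exclusive lock during the build.
CREATE INDEX CONCURRENTLY messages_to_id_pending_idx
    ON messages (to_id)
    WHERE read_at IS NULL AND deleted_at IS NULL;  -- hypothetical columns
```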

> Does it seem likely that doing an --order-by on the to_id column would have a significant impact in this case? pg_repack seems pretty stable and safe at this point?

Not being familiar with pg_repack I can’t advise on that.

> I am going to try and test this in a dev environment first but wanted feedback if this seemed like a good direction?

You can try that CLUSTER or the approach with pg_repack regardless of my suggestion for the partial index. It should speed up disk access to those records regardless of how they are indexed.

Good luck!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.


