Home > mailing lists

Query optimization using order by and limit - Mailing list pgsql-performance

From	Michael Viscuso
Subject	Query optimization using order by and limit
Date	September 21, 2011 21:59:35
Msg-id	CA+Z27QQDRQ1h_tD1Tm_+T+-cBdt00hLWYRnvXygxLf_JWNTxZQ@mail.gmail.com Whole thread
Responses	Re: Query optimization using order by and limit
List	pgsql-performance

Tree view

First of all, thank you for taking the time to review my question. After attending the PostgresOpen conference in Chicago last week, I've been pouring over explain logs for hours on end and although my system is MUCH better, I still can't resolve a few issues. Luckily my data is pretty well structured so solving one issue will likely solve many more so I'll start with this one.

Version: PostgreSQL 9.1rc1, compiled by Visual C++ build 1500, 64-bit

OS: Windows 7 64-bit

ORM: SQLAlchemy

Postgres table structure: I have daily partitioned tables for each of 4 "core tables" (the tables with the majority of my application's data). Each daily table inherits from its parent. I do not explicitly define a REFERENCE between these tables because I cannot guarantee the order in which the events are inserted into the database, but where there are references, the referenced row should exist in the other's daily table. The reason I partitioned the data in this manner is to increase query speed and make it easy to archive old data. (I'm new to high-end Postgres performance so there's likely several fundamental flaws in my assumptions. I won't turn down any recommendation.)

An example of a daily partitioned table follows:

cb=# \d osmoduleloads_2011_09_14;

Table "public.osmoduleloads_2011_09_14"

Column | Type | Modifiers

-----------------------+-----------------------------+------------------------------------------------------------

guid | numeric(20,0) | not null

osprocess_guid | numeric(20,0) | not null

filepath_guid | numeric(20,0) | not null

firstloadtime | numeric(20,0) | not null

md5hash | bytea | not null

host_guid | numeric(20,0) | default NULL::numeric

process_create_time | numeric(20,0) | default NULL::numeric

process_filepath_guid | numeric(20,0) | default NULL::numeric

event_time | timestamp without time zone | default '2011-09-14 00:00:00'::timestamp without time zone

Indexes:

"osmoduleloads_2011_09_14_pkey" PRIMARY KEY, btree (guid)

"idx_osmoduleloads_2011_09_14_filepath_guid" btree (filepath_guid)

"idx_osmoduleloads_2011_09_14_firstload_time" btree (firstloadtime)

"idx_osmoduleloads_2011_09_14_host_guid" btree (host_guid)

"idx_osmoduleloads_2011_09_14_md5hash" btree (md5hash)

"idx_osmoduleloads_2011_09_14_osprocess_guid" btree (osprocess_guid)

Check constraints:

"osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)

"osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 129604464000000000::bigint::numeric AND firstloadtime < 129605328000000000::bigint::numeric)

Inherits: osmoduleloads

Objective: The firstloadtime check constraint ensures that the record is applicable to that daily table. (In case you were wondering, the large numerics correspond to the Windows 100-nanosecond since the Epoch.) I'm inserting millions of records into each daily table so "query slowness" is quite easy to spot. Given that there is so much data per daily table, I was hoping to use the order by and limit clauses to "stop out" a query once it sufficed the limit clause and not be forced to visit each daily table. However, I'm spending way too much time in the older tables than I'd like - which leads me to believe that I;m doing something wrong. For ease of viewing, my explain analyze can be found at http://explain.depesz.com/s/tot

I'm still very new to this so I'm not sure if explain.depesz.com saves the original query. It wasn't readily apparent that it did so here is the original query:

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.* FROM (SELECT * FROM osmoduleloads JOIN hosts ON hosts.guid = osmoduleloads.host_guid WHERE hosts.guid = '2007075705813916178' AND osmoduleloads.firstloadtime >= 129604320000000000 AND osmoduleloads.firstloadtime < 129610367990000000 AND hosts.enabled = true AND hosts.user_id = 111 ORDER BY osmoduleloads.firstloadtime DESC LIMIT 251) AS osm_1 LEFT OUTER JOIN storefiles AS storefiles_1 ON osm_1.md5hash = storefiles_1.md5hash LEFT OUTER JOIN filepaths AS filepaths_1 ON osm_1.process_filepath_guid = filepaths_1.guid AND osm_1.event_time = filepaths_1.event_time LEFT OUTER JOIN filepaths AS filepaths_2 ON osm_1.filepath_guid = filepaths_2.guid AND osm_1.event_time= filepaths_2.event_time ORDER BY osm_1.firstloadtime DESC;

Hopefully my assumptions about order by and limit are correct and this query can be optimized.

Again, appreciate any help you can lend. Thanks in advance.

Mike

pgsql-performance by date:

From: Royce Ausburn
Date: 21 September 2011, 21:59:33
Subject: Re: Prepared statements and suboptimal plans

From: Greg Smith
Date: 21 September 2011, 22:48:50
Subject: Re: Query optimization using order by and limit

Query optimization using order by and limit - Mailing list pgsql-performance

Previous

Next