Thread: Performance of complicated query

Performance of complicated query

From
Jonathan Morra
Date:
I am fairly new to squeezing performance out of Postgres, but I hope this mailing list can help me.  I have read the instructions found at http://wiki.postgresql.org/wiki/Slow_Query_Questions and have tried to abide by them the best that I can.  I am running "PostgreSQL 9.1.7, compiled by Visual C++ build 1500, 64-bit" on an x64 Windows 7 Professional Service Pack 1 machine with 8 GB of RAM.  I installed this using the downloadable installer.  I am testing this using pgAdminIII but ultimately this will be deployed within a Rails application.  Here are the values of some configuration parameters:

shared_buffers = 1GB
temp_buffers = 8MB
work_mem = 10MB
maintenance_work_mem = 256MB
random_page_cost = 1.2
default_statistics_target = 10000

Table schema:

reads -- ~250,000 rows
CREATE TABLE reads
(
  id serial NOT NULL,
  device_id integer NOT NULL,
  value bigint NOT NULL,
  read_datetime timestamp without time zone NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT reads_pkey PRIMARY KEY (id )
)
WITH (
  OIDS=FALSE
);
ALTER TABLE reads
  OWNER TO postgres;

CREATE INDEX index_reads_on_device_id
  ON reads
  USING btree
  (device_id );

CREATE INDEX index_reads_on_device_id_and_read_datetime
  ON reads
  USING btree
  (device_id , read_datetime );

CREATE INDEX index_reads_on_read_datetime
  ON reads
  USING btree
  (read_datetime );

devices -- ~25,000 rows
CREATE TABLE devices
(
  id serial NOT NULL,
  serial_number character varying(20) NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT devices_pkey PRIMARY KEY (id )
)
WITH (
  OIDS=FALSE
);
ALTER TABLE devices
  OWNER TO postgres;

CREATE UNIQUE INDEX index_devices_on_serial_number
  ON devices
  USING btree
  (serial_number COLLATE pg_catalog."default" );

patient_devices -- ~25,000 rows
CREATE TABLE patient_devices
(
  id serial NOT NULL,
  patient_id integer NOT NULL,
  device_id integer NOT NULL,
  issuance_datetime timestamp without time zone NOT NULL,
  unassignment_datetime timestamp without time zone,
  issued_value bigint NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT patient_devices_pkey PRIMARY KEY (id )
)
WITH (
  OIDS=FALSE
);
ALTER TABLE patient_devices
  OWNER TO postgres;

CREATE INDEX index_patient_devices_on_device_id
  ON patient_devices
  USING btree
  (device_id );

CREATE INDEX index_patient_devices_on_issuance_datetime
  ON patient_devices
  USING btree
  (issuance_datetime );

CREATE INDEX index_patient_devices_on_patient_id
  ON patient_devices
  USING btree
  (patient_id );

CREATE INDEX index_patient_devices_on_unassignment_datetime
  ON patient_devices
  USING btree
  (unassignment_datetime );

patients -- ~1000 rows
CREATE TABLE patients
(
  id serial NOT NULL,
  first_name character varying(50) NOT NULL,
  last_name character varying(50) NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT patients_pkey PRIMARY KEY (id )
)
WITH (
  OIDS=FALSE
);
ALTER TABLE patients
  OWNER TO postgres;

Finally, this is the query I am running:

SELECT first_name, last_name, serial_number, latest_read, value, lifetime_value, lifetime.patient_id
FROM (
SELECT DISTINCT patient_id, first_name, last_name, MAX(max_read) OVER(PARTITION BY patient_id) AS latest_read, SUM(value) OVER(PARTITION BY patient_id) AS value, first_value(serial_number) OVER(PARTITION BY patient_id ORDER BY max_read DESC) AS serial_number
FROM (
SELECT patient_id, first_name, last_name, value - issued_value AS value, serial_number, read_datetime, MAX(read_datetime) OVER (PARTITION BY patient_devices.id) AS max_read
FROM reads
INNER JOIN devices ON devices.id = reads.device_id
INNER JOIN patient_devices ON patient_devices.device_id = devices.id
AND read_datetime >= issuance_datetime
AND read_datetime < COALESCE(unassignment_datetime , 'infinity'::timestamp)
INNER JOIN patients ON patients.id = patient_devices.patient_id
WHERE read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
) AS first WHERE read_datetime = max_read
) AS filtered
INNER JOIN (
SELECT DISTINCT patient_id, SUM(value) AS lifetime_value
FROM (
SELECT patient_id, value - issued_value AS value, read_datetime, MAX(read_datetime) OVER (PARTITION BY patient_devices.id) AS max_read
FROM reads
INNER JOIN devices ON devices.id = reads.device_id
INNER JOIN patient_devices ON patient_devices.device_id = devices.id
AND read_datetime >= issuance_datetime
AND read_datetime < COALESCE(unassignment_datetime , 'infinity'::timestamp)
) AS first WHERE read_datetime = max_read GROUP BY patient_id
) AS lifetime ON filtered.patient_id = lifetime.patient_id

The EXPLAIN (ANALYZE, BUFFERS) output can be found at the following link: http://explain.depesz.com/s/7Zr.

Ultimately what I want to do is find a sum of values for each patient.  The scenario is that each patient is assigned a device and they get incremental values on their device.  Since these values are incremental, if a patient never switches devices the reported value should be the last value for that patient.  However, if a patient switches devices then the reported value should be the sum of the last value for each device that the patient was assigned.  This leads to the conditions read_datetime >= issuance_datetime AND read_datetime < COALESCE(unassignment_datetime , 'infinity'::timestamp).  In addition I must report the serial number of the last device that the patient was assigned (or is currently assigned).  The only way I could come up with to do that is first_value(serial_number) OVER(PARTITION BY patient_id ORDER BY max_read DESC) AS serial_number.

Finally, I must report two values, one with respect to a time range and one which is the lifetime value.  In order to satisfy this requirement, I have to run essentially the same query twice (once with the WHERE time clause and once without) and INNER JOIN the results.  My questions are:

1.  Can I make the query as I have constructed it faster by adding indices or changing any postgres configuration parameters?
2.  Can I modify the query to return the same results in a faster way?
3.  Can I modify my tables to make this query (which is the crux of my application) run faster?

Thanks

Re: Performance of complicated query

From
Steve Crawford
Date:
On 05/23/2013 10:19 AM, Jonathan Morra wrote:
> I am fairly new to squeezing performance out of Postgres, but I hope
> this mailing list can help me.  I have read the instructions found at
> http://wiki.postgresql.org/wiki/Slow_Query_Questions and have tried to
> abide by them the best that I can.  I am running "PostgreSQL 9.1.7,
> compiled by Visual C++ build 1500, 64-bit" on an x64 Windows 7
> Professional Service Pack 1 machine with 8 GB of RAM.

I'm not sure under what constraints you are operating but you will find
most people on the list will recommend running live systems on
Linux/Unix for a variety of reasons.

> CREATE TABLE reads
> ...
> ALTER TABLE reads
>   OWNER TO postgres;

To avoid future grief you should set up a user (see CREATE ROLE...) for
your database that is not the cluster superuser (postgres). I assume you
set up a database (see CREATE DATABASE...) for your app. The base
databases (postgres, template*) should be used for administrative
purposes only.
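
For example, something along these lines (the names here are just placeholders, not a recommendation for your app):

CREATE ROLE app_user LOGIN PASSWORD 'change_me';
CREATE DATABASE app_db OWNER app_user;

Then connect as that role when creating and querying your application's tables.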

>
> ...
> Ultimately what I want to do is to find a sum of values for each
> patient.  The scenario is that each patient is assigned a device and
> they get incremental values on their device.  Since these values are
> incremental if a patient never switches devices, the reported value
> should be the last value for a patient.  However, if a patient
> switches devices then the reported value should be the sum of the last
> value for each device that the patient was assigned.

I'm afraid I'm a bit confused about what you are after due to switching
between "sum" and "last".

It sounds like a patient is issued a device which takes a number of
readings. Do you want the sum of those readings for a given patient
across all devices they have been issued, the sum of readings for a
specific device, the most recent reading for a specific patient
regardless of which device was in use for that reading, or the sum of
the most recent readings on each device issued to a specific patient?

Are you looking to generate a report across all patients/devices or
lookup information on a specific patient or device?

Cheers,
Steve





Re: Performance of complicated query

From
Jonathan Morra
Date:
Ultimately I'm going to deploy this to Heroku on a Linux machine (my tests have so far indicated that Heroku is MUCH slower than my machine), but I wanted to get it fast on my local machine first.  I agree with your point about roles; however, this is only a dev machine.

For the sum vs. last, the idea is that each patient is issued a device and reads are recorded.  The nature of the reads is that they are incremental, so if a patient never changes devices there is no need for a sum.  However, patients will be changing devices, and the patient_devices table records when each patient had a given device.  What I want to sum up is the total value for a patient regardless of how many times they changed devices.  In order to do this I have to sum up just the values of the last read for each device a patient was assigned to.  This leads to the WHERE clause, WHERE read_datetime = max_read, and hence I'm only summing the last read for each device for each patient.  Ultimately I want to report the values listed in the outer SELECT for each patient.  I will use these values to run other queries, but those queries are currently very quick (< 50ms), so I'm not worried about them now.
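
In other words, the calculation I'm after is essentially this (an untested sketch over the same tables, kept separate from the real query just to show the shape: one row per device assignment via DISTINCT ON, then a sum per patient):

SELECT patient_id, SUM(value - issued_value) AS lifetime_value
FROM (
  -- last read for each device assignment (one row per patient_devices.id)
  SELECT DISTINCT ON (patient_devices.id)
         patient_devices.patient_id,
         patient_devices.issued_value,
         reads.value
  FROM reads
  INNER JOIN patient_devices ON patient_devices.device_id = reads.device_id
    AND reads.read_datetime >= patient_devices.issuance_datetime
    AND reads.read_datetime < COALESCE(patient_devices.unassignment_datetime , 'infinity'::timestamp)
  ORDER BY patient_devices.id, reads.read_datetime DESC
) AS last_per_assignment
GROUP BY patient_id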


Re: Performance of complicated query

From
Vladimir Sitnikov
Date:
>>This leads to the WHERE clause, WHERE read_datetime = max_read, and hence I'm only summing the last read for each device for each patient. 
Is "reads" table insert-only? Do you have updates/deletes of the "historical" rows?

>>3.  Can I modify my tables to make this query (which is the crux of my application) run faster?
Can you have a second "reads" table that stores only up-to-date values?
That will eliminate max-over completely, enable efficient usage in other queries, and make your queries much easier to understand by humans and computers.
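
Roughly something like this (untested; "current_reads" is just a name I made up), kept in sync whenever a read arrives:

CREATE TABLE current_reads
(
  patient_device_id integer NOT NULL REFERENCES patient_devices (id),
  read_datetime timestamp without time zone NOT NULL,
  value bigint NOT NULL,
  CONSTRAINT current_reads_pkey PRIMARY KEY (patient_device_id)
);

-- the lifetime total per patient then collapses to a plain join + GROUP BY
SELECT pd.patient_id, SUM(cr.value - pd.issued_value) AS lifetime_value
FROM current_reads cr
INNER JOIN patient_devices pd ON pd.id = cr.patient_device_id
GROUP BY pd.patient_id;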

PS. read_datetime = max_read is prone to "what if two measurements have the same date" errors.
PPS. DISTINCT plus MAX(max_read) OVER(PARTITION BY patient_id) AS latest_read looks like a complete mess. Why don't you just use GROUP BY?

Regards,
Vladimir

Re: Performance of complicated query

From
Jonathan Morra
Date:
1.  Reads is constantly inserted upon.  It should never be updated or deleted.
2.  I suppose I can, but that will make my insertion logic very complicated.  I cannot guarantee the order of any of this data, so I might get reads at any time and also get assignments at any time (historical as well).  I suppose I could do that, but I'd like to avoid it if at all possible.
3.  2 measurements can have the same date, and that is fine.  A problem would only arise if the same device produced 2 reads at the same time, and that isn't possible.
4.  I agree that a lot of this is a mess; however, MAX(max_read) OVER(PARTITION BY patient_id) AS latest_read seems necessary, as using a GROUP BY clause would force me to group by all the elements I'm selecting, which I don't want to do.


Re: Performance of complicated query

From
Steve Crawford
Date:
On 05/23/2013 10:57 AM, Jonathan Morra wrote:
> Ultimately I'm going to deploy this to Heroku on a Linux machine (my
> tests have so far indicated that Heroku is MUCH slower than my
> machine), but I wanted to get it fast on my local machine first.  I
> agree with your role partitioning, however, this is only a dev machine.
>
> For the sum vs. last, the idea is that each patient is issued a device
> and reads are recorded.  The nature of the reads are that they are
> incremental, so if a patient never changes devices there is no need
> for a sum.  However, patients will be changing devices, and the
> patient_device table records when each patient had a given device.
>  What I want to sum up is the total value for a patient regardless of
> how many times they changed devices

If the reads are always incremented - that is, the read you want is
always the largest read - then something along these lines might work
well and be more readable (untested code):

-- distill out max value for each device
with device_maxreads as (
select
     device_id,
     max(value) as max_read
from
     reads
group by
     device_id),

-- then sum into a total for each patient
patient_value as (
select
     p.patient_id,
     sum(max_read) patient_value
from
      device_maxreads d
      join patient_devices p on p.device_id = d.device_id
group by
     p.patient_id
)

select
     ...whatever...
from
     ...your tables.
     join patient_value p on p.patient_id = ...
;


If the values increment and decrement, or patients are issued devices at
overlapping times (i.e. using two devices at one time), then the query
gets more complicated, but "with..." is still a likely usable construct.

Cheers,
Steve


Re: Performance of complicated query

From
Jonathan Morra
Date:
I'm not sure I understand your proposed solution.  There is also the case to consider where the same patient can be assigned the same device multiple times.  In this case, the value may be reset at each assignment (hence the line value - issued_value AS value from the original query).


Re: Performance of complicated query

From
james
Date:
On 23/05/2013 22:57, Jonathan Morra wrote:
I'm not sure I understand your proposed solution.  There is also the case to consider where the same patient can be assigned the same device multiple times.  In this case, the value may be reset at each assignment (hence the line value - issued_value AS value from the original query).


Perhaps you could use triggers to help somewhat?  At least for the lifetime part.

For a given assignment of a device to a patient, only the last value is useful, so you can maintain that easily enough (a bit like a materialised view but before 9.3 I guess).
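
Very roughly, and untested (the latest_reads table and the trigger names are invented here, and this ignores concurrency):

-- hypothetical companion table: one row per device assignment
CREATE TABLE latest_reads
(
  patient_device_id integer NOT NULL,
  read_datetime timestamp without time zone NOT NULL,
  value bigint NOT NULL,
  CONSTRAINT latest_reads_pkey PRIMARY KEY (patient_device_id)
);

CREATE OR REPLACE FUNCTION maintain_latest_read() RETURNS trigger AS $$
DECLARE
  pd_id integer;
BEGIN
  -- find the assignment this read belongs to, if any
  SELECT pd.id INTO pd_id
  FROM patient_devices pd
  WHERE pd.device_id = NEW.device_id
    AND NEW.read_datetime >= pd.issuance_datetime
    AND NEW.read_datetime < COALESCE(pd.unassignment_datetime, 'infinity'::timestamp);

  IF pd_id IS NULL THEN
    RETURN NEW;  -- no matching assignment (yet)
  END IF;

  -- keep only the newest read per assignment; older, out-of-order arrivals are ignored
  UPDATE latest_reads
     SET read_datetime = NEW.read_datetime, value = NEW.value
   WHERE patient_device_id = pd_id
     AND read_datetime < NEW.read_datetime;

  IF NOT FOUND AND NOT EXISTS (SELECT 1 FROM latest_reads WHERE patient_device_id = pd_id) THEN
    INSERT INTO latest_reads (patient_device_id, read_datetime, value)
    VALUES (pd_id, NEW.read_datetime, NEW.value);
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reads_maintain_latest
  AFTER INSERT ON reads
  FOR EACH ROW EXECUTE PROCEDURE maintain_latest_read();

The lifetime totals would then come from latest_reads joined to patient_devices, with no max-over-window at all.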

That might fix 'lifetime', but not some arbitrary windowed view.  I can see why an 'as at' end time is useful, but not why a start time is: if a device has readings before the window but not in the window, is that 'no reading', or should the last reading prior to the window apply?

It also seems to me that the solution you have is hard to reason about.  It's like a Haskell program done in one big inline fold rather than a bunch of 'where' clauses, and I find these cause significant brain overload.

Perhaps you could break it out into identifiable chunks that work out the readings that are not superseded (i.e. the last in the date bounds for a device assignment), both for lifetime (if not using triggers) and for your date range, and then work with those.  Consider CTEs ('WITH queries') for doing this?
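
As a rough, untested illustration of the shape I mean (the CTE names are arbitrary), the date-bounded part might look something like:

WITH bounded_last_reads AS (
  -- last read per device assignment within the reporting window
  SELECT DISTINCT ON (pd.id)
         pd.patient_id,
         r.read_datetime,
         r.value - pd.issued_value AS value
  FROM reads r
  INNER JOIN patient_devices pd ON pd.device_id = r.device_id
    AND r.read_datetime >= pd.issuance_datetime
    AND r.read_datetime < COALESCE(pd.unassignment_datetime, 'infinity'::timestamp)
  WHERE r.read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
  ORDER BY pd.id, r.read_datetime DESC
),
patient_totals AS (
  SELECT patient_id,
         SUM(value) AS value,
         MAX(read_datetime) AS latest_read
  FROM bounded_last_reads
  GROUP BY patient_id
)
SELECT p.first_name, p.last_name, t.latest_read, t.value
FROM patient_totals t
INNER JOIN patients p ON p.id = t.patient_id;

The lifetime part would be the same first CTE without the BETWEEN filter, and the serial-number lookup could be layered on in the same way.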

It seems to me that if you can do this, then the problem might be easier to express.

Failing that, I'd be looking at using temporary tables, and forcing a series of reduce steps using them, but then I'm a nasty old Sybase hacker at heart. ;-)

Re: Performance of complicated query

From
Jonathan Morra
Date:
Sorry for the messy query, I'm very new to writing these complex queries.  I'll try and make it easier to read by using WITH clauses.  However, just to clarify, the WITH clauses only increase readability and not performance in any way, right?


Re: Performance of complicated query

From
Jonathan Morra
Date:
I have been working on this query, and I was able to modify it and get its run time cut in half.  Here's where it is right now:

SELECT first_name, last_name, serial_number, latest_read, value, lifetime_value, lifetime.patient_id
FROM (
SELECT DISTINCT patient_id, first_name, last_name, MAX(read_datetime) OVER(PARTITION BY patient_id) AS latest_read
, SUM(value) OVER(PARTITION BY patient_id) AS value, first_value(serial_number) OVER(PARTITION BY patient_id ORDER BY read_datetime DESC) AS serial_number
FROM (
SELECT patient_devices.device_id, patient_id, MAX(value - issued_value) AS value, MAX(read_datetime) AS read_datetime
FROM read_reads
INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
AND read_datetime >= issuance_datetime
AND read_datetime < COALESCE(unassignment_datetime , 'infinity'::timestamp)
WHERE read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
GROUP BY patient_devices.id, patient_devices.device_id, patient_id
) AS first
INNER JOIN devices ON devices.id = device_id
INNER JOIN patients ON patient_id = patients.id
) AS filtered
INNER JOIN (
SELECT patient_id, SUM(value) AS lifetime_value
FROM (
SELECT patient_id, MAX(value - issued_value) AS value FROM read_reads
INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
AND read_datetime >= issuance_datetime
AND read_datetime < COALESCE(unassignment_datetime , 'infinity'::timestamp)
GROUP BY patient_devices.id, patient_id
) AS first GROUP BY patient_id
) AS lifetime ON filtered.patient_id = lifetime.patient_id

I think the key to cutting it down was moving some of the joins up a level.  Even though this is faster, I'd still like to cut it down a bunch more (as this will be run a lot in my application).  Any more insight would be greatly appreciated.  A summary of explain (analyze, buffers) can be found at http://explain.depesz.com/s/qx7f.

Thanks


Re: Performance of complicated query

From
Steve Crawford
Date:
On 05/23/2013 05:21 PM, Jonathan Morra wrote:
> Sorry for the messy query, I'm very new to writing these complex
> queries.  I'll try and make it easier to read by using WITH clauses.
>  However, just to clarify, the WITH clauses only increase readability
> and not performance in any way, right?

It depends. The planner is a tricky beast and sometimes rewriting a
seemingly identical query will result in a much more (or less) efficient
plan. A classic case was the difference between ...where foo in (select
bar from...)... vs. ...where exists (select 1 from bar where...).... In an
ideal world the planner would figure out that both are the same and
optimize accordingly, but there was a point where one was typically more
efficient, then it switched to the other being better for the planner. I
don't recall the current state.
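
For illustration, using the tables from this thread (the serial number is hypothetical), the two forms in question are:

SELECT * FROM reads
WHERE device_id IN (SELECT id FROM devices WHERE serial_number = 'ABC123');

SELECT * FROM reads r
WHERE EXISTS (SELECT 1 FROM devices d
              WHERE d.id = r.device_id AND d.serial_number = 'ABC123');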

Casting can be important - sometimes the planner needs a "nudge" to use
an index on, say, a varchar column being compared to, perhaps, a text
value or column, in which case casting to the exact data type being
indexed can be a big win.
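
The sort of nudge meant here, as a hypothetical prepared statement with the cast matching the column's declared type:

PREPARE find_device(text) AS
  SELECT id FROM devices
  WHERE serial_number = $1::character varying(20);

EXECUTE find_device('ABC123');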

Cheers,
Steve