Thread: INDEX Performance Issue

INDEX Performance Issue

From
Mark Davidson
Date:
Hi All,

Hoping someone can help me out with some performance issues I'm having with an INDEX on my database. I've got a database with a data table containing ~55,000,000 rows of point data and an area table containing ~3,500 rows of polygon data. A user queries the data by selecting which areas they want to view, along with other filters such as datetime and which datasets to query. This all works fine, and previously the intersection of the data rows with the areas was computed on the fly with PostGIS ST_Intersects.

However, as the data table grew we decided it would make sense to offload that processing: rather than calculating the intersection for each row on the fly every time, we pre-calculate it and store the result in a join table. This produced a table data_area containing ~250,000,000 rows, with just two columns recording which data rows intersect which areas. We were expecting this to give a significant performance improvement to query time, but the query seems to spend a very long time on the index scan. I'm thinking there must be something wrong with my setup or the query itself, as I'm sure Postgres can perform better.
I've tried restructuring the query, changing config settings and doing maintenance like VACUUM but nothing has helped.

Hope that introduction is clear enough and makes sense. If anything is unclear, please let me know.

I'm using PostgreSQL 9.1.4 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3, 64-bit on Ubuntu 12.04 which was installed using apt.

Here is the structure of my database tables:

CREATE TABLE data
(
  id bigserial NOT NULL,
  datasetid integer NOT NULL,
  readingdatetime timestamp without time zone NOT NULL,
  depth double precision NOT NULL,
  readingdatetime2 timestamp without time zone,
  depth2 double precision,
  value double precision NOT NULL,
  uploaddatetime timestamp without time zone,
  description character varying(255),
  point geometry,
  point2 geometry,
  CONSTRAINT "DATAPRIMARYKEY" PRIMARY KEY (id ),
  CONSTRAINT enforce_dims_point CHECK (st_ndims(point) = 2),
  CONSTRAINT enforce_dims_point2 CHECK (st_ndims(point2) = 2),
  CONSTRAINT enforce_geotype_point CHECK (geometrytype(point) = 'POINT'::text OR point IS NULL),
  CONSTRAINT enforce_geotype_point2 CHECK (geometrytype(point2) = 'POINT'::text OR point2 IS NULL),
  CONSTRAINT enforce_srid_point CHECK (st_srid(point) = 4326),
  CONSTRAINT enforce_srid_point2 CHECK (st_srid(point2) = 4326)
);

CREATE INDEX data_datasetid_index ON data USING btree (datasetid );
CREATE INDEX data_point_index ON data USING gist (point );
CREATE INDEX "data_readingDatetime_index" ON data USING btree (readingdatetime );
ALTER TABLE data CLUSTER ON "data_readingDatetime_index";

CREATE TABLE area
(
  id serial NOT NULL,
  "areaCode" character varying(10) NOT NULL,
  country character varying(250) NOT NULL,
  "polysetID" integer NOT NULL,
  polygon geometry,
  CONSTRAINT area_primary_key PRIMARY KEY (id ),
  CONSTRAINT polyset_foreign_key FOREIGN KEY ("polysetID")
      REFERENCES polyset (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE,
  CONSTRAINT enforce_dims_area CHECK (st_ndims(polygon) = 2),
  CONSTRAINT enforce_geotype_area CHECK (geometrytype(polygon) = 'POLYGON'::text OR polygon IS NULL),
  CONSTRAINT enforce_srid_area CHECK (st_srid(polygon) = 4326)
);

CREATE INDEX area_polygon_index ON area USING gist (polygon );
CREATE INDEX "area_polysetID_index" ON area USING btree ("polysetID" );
ALTER TABLE area CLUSTER ON "area_polysetID_index";

CREATE TABLE data_area
(
  data_id integer NOT NULL,
  area_id integer NOT NULL,
  CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),
  CONSTRAINT data_area_area_id_fk FOREIGN KEY (area_id)
      REFERENCES area (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT data_area_data_id_fk FOREIGN KEY (data_id)
      REFERENCES data (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
);

Here is the query I'm running; the result of its EXPLAIN can be found here: http://explain.depesz.com/s/1yu

SELECT *
FROM data d
JOIN data_area da ON da.data_id = d.id
LEFT JOIN area a ON da.area_id = a.id
WHERE d.datasetid IN (5634, 5635, 5636, 5637, 5638, 5639, 5640, 5641, 5642)
  AND da.area_id IN (1, 2, 3 .... 9999)
  AND readingdatetime BETWEEN '1990-01-01' AND '2013-01-01'
  AND depth BETWEEN 0 AND 99999;

If you look at the explain, 97% of the time is spent on the index scan for the JOIN of data_area.

Hardware

 - CPU: Intel(R) Xeon(R) CPU E5420 ( 8 Cores )
 - RAM: 16GB

Config Changes

I'm using the base Ubuntu config apart from the following changes:

 - shared_buffers set to 2GB
 - work_mem set to 1GB
 - maintenance_work_mem set to 512MB
 - effective_cache_size set to 8GB
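As a postgresql.conf excerpt, those changes are:

```ini
# postgresql.conf (non-default settings)
shared_buffers = 2GB           # PostgreSQL's own buffer cache
work_mem = 1GB                 # applies per sort/hash operation, per backend
maintenance_work_mem = 512MB   # used by VACUUM, CREATE INDEX, etc.
effective_cache_size = 8GB     # planner's estimate of total cache available
```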

Think that covers everything. Hope this has enough detail for someone to be able to help; if there is anything I've missed, please let me know and I'll add any more info needed. Any input on optimisation of the table structure, the query, or anything else I can do to sort this issue would be most appreciated.

Thanks in advance,

Mark Davidson

Re: INDEX Performance Issue

From
Kevin Grittner
Date:
Mark Davidson <mark@4each.co.uk> wrote:

>   CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),

So the only index on this 250 million row table starts with the ID
of the point, but you are joining to it by the ID of the area.
That requires a sequential scan of all 250 million rows.  Switch
the order of the columns in the primary key, add a unique index
with the columns switched, or add an index on just the area ID.

Perhaps you thought that the foreign key constraints would create
indexes?  (They don't.)
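For example, either of these would do it (a sketch using the table and
column names above):

```sql
-- Option A: rebuild the primary key with the column order switched,
-- so the underlying index leads on area_id
ALTER TABLE data_area DROP CONSTRAINT data_area_pkey;
ALTER TABLE data_area ADD CONSTRAINT data_area_pkey
    PRIMARY KEY (area_id, data_id);

-- Option B: leave the primary key alone and add a separate index
-- leading on area_id
CREATE INDEX data_area_area_id_index ON data_area USING btree (area_id);
```

Either index build will take a while on a 250 million row table; CREATE
INDEX CONCURRENTLY avoids blocking writes, at the cost of a slower build.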

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: INDEX Performance Issue

From
Mark Davidson
Date:
Hi Kevin

Thanks for your response. I tried doing what you suggested, so the table now has a primary key of 'CONSTRAINT data_area_pkey PRIMARY KEY (area_id, data_id);' and I've added the INDEX 'CREATE INDEX data_area_data_id_index ON data_area USING btree (data_id);'. Unfortunately it hasn't resulted in an improvement of the query performance. Here is the explain: http://explain.depesz.com/s/tDL I think there is no performance increase because it's now not using the primary key, just the index on data_id. Have I done what you suggested correctly? Any other suggestions?

Thanks very much for your help,

Mark




Re: INDEX Performance Issue

From
Greg Williamson
Date:
> Thanks for your response. I tried doing what you suggested so that table now has a primary key of
> 'CONSTRAINT data_area_pkey PRIMARY KEY (area_id, data_id);' and I've added the INDEX of
> 'CREATE INDEX data_area_data_id_index ON data_area USING btree (data_id);' unfortunately it hasn't
> resulted in an improvement of the query performance. Here is the explain
> ...

Did you run analyze on the table after creating the index ?
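That is:

```sql
ANALYZE data_area;  -- refresh the planner's statistics for the join table
```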

GW



Re: INDEX Performance Issue

From
Kevin Grittner
Date:
Greg Williamson <gwilliamson39@yahoo.com> wrote:

>> Thanks for your response. I tried doing what you suggested so
>> that table now has a primary key of
>> ' CONSTRAINT data_area_pkey PRIMARY KEY(area_id , data_id ); '
>> and I've added the INDEX of
>> 'CREATE INDEX data_area_data_id_index ON data_area USING btree (data_id );'

Yeah, that is what I was suggesting.

>> unfortunately it hasn't resulted in an improvement of the query
>> performance.

> Did you run analyze on the table after creating the index ?

That probably isn't necessary.  Statistics are normally on relations
and columns; there are only certain special cases where an ANALYZE
is needed after an index build, like if the index is on an
expression rather than a list of columns.

Mark, what happens if you change that left join to a normal (inner)
join?  Since you're doing an inner join to data_area and that has a
foreign key to area, there should always be a match anyway, right?
The optimizer doesn't recognize that, so it can't start from the
area and just match to the appropriate points.
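That is, something like this (the abbreviated IN list is left as it
appears in the original query):

```sql
SELECT *
FROM data d
JOIN data_area da ON da.data_id = d.id
JOIN area a ON da.area_id = a.id  -- inner join instead of LEFT JOIN
WHERE d.datasetid IN (5634, 5635, 5636, 5637, 5638, 5639, 5640, 5641, 5642)
  AND da.area_id IN (1, 2, 3 .... 9999)
  AND readingdatetime BETWEEN '1990-01-01' AND '2013-01-01'
  AND depth BETWEEN 0 AND 99999;
```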

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: INDEX Performance Issue

From
Mark Davidson
Date:
Takes a little longer with the INNER join unfortunately, about 3.5 minutes. Here is the query plan: http://explain.depesz.com/s/EgBl

With the JOIN there might not be a match if the data does not fall within one of the areas selected in the IN clause.

So, to be absolutely clear: data id 10 might fall in areas (1, 5, 8, 167), but the user might be querying areas (200 ... 500), so no matching area would be found.

Is it worth considering adding additional statistics on any of the columns, and/or additional indexes or different types of index? Would it be worth restructuring the query to start with areas and join data to that?
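For example, something along these lines (a sketch; 1000 is an arbitrary illustrative target, the default being 100):

```sql
-- Gather a larger sample for the join columns, then refresh statistics
ALTER TABLE data_area ALTER COLUMN area_id SET STATISTICS 1000;
ALTER TABLE data_area ALTER COLUMN data_id SET STATISTICS 1000;
ANALYZE data_area;
```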



Re: INDEX Performance Issue

From
Mark Davidson
Date:
Been trying to progress with this today. I decided to set up the database on my local machine to try a few things, and I'm getting much more sensible results and a totally different query plan: http://explain.depesz.com/s/KGd In this case the query took about a minute, but does sometimes take around 80 seconds.

The config is exactly the same between the two databases. The databases themselves are identical, with all the same indexes on the tables.

The server has 2 x Intel Xeon E5420 running at 2.5GHz each, 16GB RAM, and the database is just on a SATA HDD, a Western Digital WD5000AAKS.
My desktop has a single i5-3570K running at 3.4GHz, 16GB RAM, and the database is running on a SATA HDD, a Western Digital WD1002FAEX-0.

Could anyone offer any reasoning as to why the plan would be so different across the two machines? I would have thought the server would perform a lot better since it has more cores, or is Postgres more affected by CPU speed? Could anyone suggest a way to benchmark the machines for their Postgres performance?

Thanks again for everyone's input,

Mark




Re: INDEX Performance Issue

From
Vasilis Ventirozos
Date:

Hello Mark,
PostgreSQL currently doesn't support parallel query, so a faster CPU, even with fewer cores, will be faster for a single query. For benchmarking you can try pgbench, which you will find in contrib.
The execution plan may be different because of different statistics; have you analyzed both databases when you compared the execution plans?

Vasilis Ventirozos




Re: INDEX Performance Issue

From
Mark Davidson
Date:
Thanks for your response, Vasilis. I've run pgbench on both machines (`./pgbench -c 10 -t 10000 pgbench`), getting 99.800650 tps on my local machine and 23.825332 tps on the server, so quite a significant difference.
Could this purely be down to the CPU clock speed, or is it likely something else causing the issue?
I have run ANALYZE on both databases and tried the queries a number of times on each to make sure the results are consistent; they are.





Re: INDEX Performance Issue

From
Vasilis Ventirozos
Date:

-c 10 means 10 clients, so that should take advantage of all your cores (see below):

%Cpu0 : 39.3 us, 21.1 sy, 0.0 ni, 38.7 id, 0.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 38.0 us, 25.0 sy, 0.0 ni, 26.0 id, 4.2 wa, 0.0 hi, 6.8 si, 0.0 st
%Cpu2 : 39.3 us, 20.4 sy, 0.0 ni, 39.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 40.0 us, 18.7 sy, 0.0 ni, 40.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 13.9 us, 7.1 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 13.1 us, 8.4 sy, 0.0 ni, 78.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 14.8 us, 6.4 sy, 0.0 ni, 78.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 15.7 us, 6.7 sy, 0.0 ni, 77.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

I am pasting the results of the same test on an i7-2600 with 16GB and a SATA3 SSD, and the results from a VM with 2 cores and a normal 7200 rpm HDD:

-- DESKTOP
vasilis@Disorder ~ $ pgbench -c 10 -t 10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1713.338111 (including connections establishing)
tps = 1713.948478 (excluding connections establishing)

-- VM

postgres@pglab1:~/postgresql-9.2.4/contrib/pgbench$ ./pgbench -c 10 -t 10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1118.976496 (including connections establishing)
tps = 1119.180126 (excluding connections establishing)

I am assuming that you didn't populate your pgbench db with the default values; if you tell me how you did, I will be happy to re-run the test and see the differences.






Re: INDEX Performance Issue

From
Mark Davidson
Date:
Wow, my results are absolutely appalling compared to both of those, which is really interesting. Are you running Postgres 9.2.4 on both instances? Any specific configuration changes?
There must be something up with my setup to be getting such a low tps compared with you.

On 8 April 2013 21:02, Vasilis Ventirozos <v.ventirozos@gmail.com> wrote:

-c 10 means 10 clients so that should take advantage of all your cores (see bellow)

%Cpu0 : 39.3 us, 21.1 sy, 0.0 ni, 38.7 id, 0.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 38.0 us, 25.0 sy, 0.0 ni, 26.0 id, 4.2 wa, 0.0 hi, 6.8 si, 0.0 st
%Cpu2 : 39.3 us, 20.4 sy, 0.0 ni, 39.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 40.0 us, 18.7 sy, 0.0 ni, 40.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 13.9 us, 7.1 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 13.1 us, 8.4 sy, 0.0 ni, 78.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 14.8 us, 6.4 sy, 0.0 ni, 78.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 15.7 us, 6.7 sy, 0.0 ni, 77.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

 i am pasting you the results of the same test on a i7-2600 16gb with a sata3 SSD and the results from a VM with 2 cores and a normal 7200 rpm hdd

-- DESKTOP
vasilis@Disorder ~ $ pgbench -c 10 -t 10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1713.338111 (including connections establishing)
tps = 1713.948478 (excluding connections establishing)

-- VM

postgres@pglab1:~/postgresql-9.2.4/contrib/pgbench$ ./pgbench -c 10 -t 10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1118.976496 (including connections establishing)
tps = 1119.180126 (excluding connections establishing)

i am assuming that you didn't populate your pgbench db with the default values , if you tell me how you did i will be happy to re run the test and see the differences.



On Mon, Apr 8, 2013 at 10:31 PM, Mark Davidson <mark@4each.co.uk> wrote:
Thanks for your response Vasillis. I've run pgbench on both machines `./pgbench -c 10 -t 10000 pgbench` getting 99.800650 tps on my local machine and 23.825332 tps on the server so quite a significant difference.
Could this purely be down to the CPU clock speed or is it likely something else causing the issue?
I have run ANALYZE on both databases and tried the queries a number of times on each to make sure the results are consistent, this is the case.


On 8 April 2013 18:19, Vasilis Ventirozos <v.ventirozos@gmail.com> wrote:

Hello Mark,
PostgreSQL currently doesn't support parallel query so a faster cpu even if it has less cores would be faster for a single query, about benchmarking you can try pgbench that you will find in the contrib,
the execution plan may be different because of different statistics, have you analyzed both databases when you compared the execution plans ?

Vasilis Ventirozos


Been trying to progress with this today. Decided to setup the database on my local machine to try a few things and I'm getting much more sensible results and a totally different query plan http://explain.depesz.com/s/KGd in this case the query took about a minute but does sometimes take around 80 seconds.

The config is exactly the same between the two database. The databases them selves are identical with all indexes the same on the tables.

The server has an 2 x Intel Xeon E5420 running at 2.5Ghz each, 16GB RAM and the database is just on a SATA HDD which is a Western Digital WD5000AAKS.
My desktop has a single i5-3570K running at 3.4Ghz, 16GB RAM and the database is running on a SATA HDD which is a Western Digital WD1002FAEX-0

Could anyone offer any reasoning as to why the plan would be so different across the two machines? I would have thought that the server would perform a lot better since it has more cores or is postgres more affected by the CPU speed? Could anyone suggest a way to bench mark the machines for their postgres performance?

Thanks again for everyones input,

Mark


On 7 April 2013 23:22, Mark Davidson <mark@4each.co.uk> wrote:
Takes a little longer with the INNER join, unfortunately: about 3.5 minutes. Here is the query plan: http://explain.depesz.com/s/EgBl.

With the JOIN there might not be a match if the data does not fall within one of the areas that is selected in the IN query.

So, just to be absolutely clear: data id 10 might fall in areas (1, 5, 8, 167), but the user might be querying areas (200 ... 500), so no matching area would be found.

Is it worth considering adding additional statistics on any of the columns, and/or additional indexes or a different type of index? Would it be worth restructuring the query to start with areas and join data to that?
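Restructuring to start from the areas might look something like this (a sketch only — the filter columns are taken from the schema at the top of the thread, `area.id` is assumed to be that table's primary key, and the filter values are illustrative):

```sql
-- Sketch: start from the user-selected areas and join outward.
-- Assumes data_area(area_id, data_id) and the data columns shown earlier.
SELECT d.*
FROM area a
JOIN data_area da ON da.area_id = a.id
JOIN data d       ON d.id = da.data_id
WHERE a.id IN (200, 201, 202)                              -- selected areas
  AND d.datasetid IN (1, 2)                                -- dataset filter
  AND d.readingdatetime BETWEEN '2013-01-01' AND '2013-02-01';
```

Whether the planner actually starts from `area` still depends on its cost estimates, but writing the query this way makes the intent explicit.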


On 7 April 2013 16:15, Kevin Grittner <kgrittn@ymail.com> wrote:
Greg Williamson <gwilliamson39@yahoo.com> wrote:

>> Thanks for your response. I tried doing what you suggested so
>> that table now has a primary key of
>> ' CONSTRAINT data_area_pkey PRIMARY KEY(area_id , data_id ); '
>> and I've added the INDEX of
>> 'CREATE INDEX data_area_data_id_index ON data_area USING btree (data_id );'

Yeah, that is what I was suggesting.

>> unfortunately it hasn't resulted in an improvement of the query
>> performance.

> Did you run analyze on the table after creating the index ?

That probably isn't necessary.  Statistics are normally on relations
and columns; there are only certain special cases where an ANALYZE
is needed after an index build, like if the index is on an
expression rather than a list of columns.

Mark, what happens if you change that left join to a normal (inner)
join?  Since you're doing an inner join to data_area and that has a
foreign key to area, there should always be a match anyway, right?
The optimizer doesn't recognize that, so it can't start from the
area and just match to the appropriate points.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
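Since the full query isn't quoted in this part of the thread, the change Kevin suggests would look roughly like this (table and column names from earlier messages; the exact join shape is an assumption):

```sql
-- Before (assumed shape): LEFT JOIN out to area
SELECT d.*
FROM data d
JOIN data_area da ON da.data_id = d.id
LEFT JOIN area a  ON a.id = da.area_id
WHERE da.area_id IN (1, 5, 8, 167);

-- After: a plain INNER JOIN, which frees the planner to start from area
SELECT d.*
FROM data d
JOIN data_area da ON da.data_id = d.id
JOIN area a       ON a.id = da.area_id
WHERE a.id IN (1, 5, 8, 167);
```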






Re: INDEX Performance Issue

From
Vasilis Ventirozos
Date:



On Mon, Apr 8, 2013 at 11:18 PM, Mark Davidson <mark@4each.co.uk> wrote:
Wow, my results are absolutely appalling compared to both of those, which is really interesting. Are you running Postgres 9.2.4 on both instances? Any specific configuration changes?
Thinking there must be something up with my setup to be getting such a low tps compared with you.

Both installations are 9.2.4 and both DBs have absolutely default configurations. I can't really explain why there is so much difference between our results; I can only imagine it's the initialization, which is why I asked how you populated your pgbench database (scale factor / fill factor).

Vasilis Ventirozos
 
On 8 April 2013 21:02, Vasilis Ventirozos <v.ventirozos@gmail.com> wrote:

-c 10 means 10 clients, so that should take advantage of all your cores (see below)

%Cpu0 : 39.3 us, 21.1 sy, 0.0 ni, 38.7 id, 0.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 38.0 us, 25.0 sy, 0.0 ni, 26.0 id, 4.2 wa, 0.0 hi, 6.8 si, 0.0 st
%Cpu2 : 39.3 us, 20.4 sy, 0.0 ni, 39.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 40.0 us, 18.7 sy, 0.0 ni, 40.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 13.9 us, 7.1 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 13.1 us, 8.4 sy, 0.0 ni, 78.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 14.8 us, 6.4 sy, 0.0 ni, 78.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 15.7 us, 6.7 sy, 0.0 ni, 77.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

I am pasting the results of the same test on an i7-2600 with 16GB and a SATA3 SSD, and the results from a VM with 2 cores and a normal 7200rpm HDD.

-- DESKTOP
vasilis@Disorder ~ $ pgbench -c 10 -t 10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1713.338111 (including connections establishing)
tps = 1713.948478 (excluding connections establishing)

-- VM

postgres@pglab1:~/postgresql-9.2.4/contrib/pgbench$ ./pgbench -c 10 -t 10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1118.976496 (including connections establishing)
tps = 1119.180126 (excluding connections establishing)

I am assuming that you didn't populate your pgbench DB with the default values; if you tell me how you did, I will be happy to re-run the test and see the differences.










Re: INDEX Performance Issue

From
Jeff Janes
Date:
On Mon, Apr 8, 2013 at 12:31 PM, Mark Davidson <mark@4each.co.uk> wrote:
Thanks for your response Vasillis. I've run pgbench on both machines `./pgbench -c 10 -t 10000 pgbench` getting 99.800650 tps on my local machine and 23.825332 tps on the server so quite a significant difference.

These results are almost certainly being driven by how fast your machines can fsync the WAL data.  The type of query you originally posted does not care about that at all, so these results are not useful to you.  You could run the "pgbench -S", which is getting closer to the nature of the work your original query does (but still not all that close).  

Cheers,

Jeff

Re: INDEX Performance Issue

From
Mark Davidson
Date:
Sorry Vasilis, I missed you asking that. I just did `./pgbench -i pgbench` and didn't specifically set any values. I can try some specific ones if you can suggest any.



Re: INDEX Performance Issue

From
Mark Davidson
Date:
Hi Jeff,

I've tried this test using the -S flag: `./pgbench -c 4 -j 2 -T 600 -S pgbench`

Desktop gives me

./pgbench -c 4 -j 2 -T 600 -S pgbench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 2
duration: 600 s
number of transactions actually processed: 35261835
tps = 58769.715695 (including connections establishing)
tps = 58770.258977 (excluding connections establishing)

Server

./pgbench -c 4 -j 2 -T 600 -S pgbench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 2
duration: 600 s
number of transactions actually processed: 22642303
tps = 37737.157641 (including connections establishing)
tps = 37738.167325 (excluding connections establishing)




Re: INDEX Performance Issue

From
Jeff Janes
Date:
On Fri, Apr 5, 2013 at 8:51 AM, Mark Davidson <mark@4each.co.uk> wrote:
Hi All,

Hoping someone can help me out with some performance issues I'm having with the INDEX on my database. I've got a database that has a data table containing ~55,000,000 rows of point data and an area table containing ~3,500 rows of polygon data. A user queries the data by selecting what areas they want to view, using some other filters such as datetime and which datasets they want to query. This all works fine, and previously the intersect of the data rows with the areas was being done on the fly with PostGIS ST_Intersects. However, as the data table grew we decided it would make sense to offload the data processing and not calculate the intersect for a row on the fly each time, but to pre-calculate it and store the result in a join table. This produced a table data_area which contains ~250,000,000 rows.


I think your old method is likely the better option, especially if the intersect can be offloaded to the client or app server (I don't know enough about ST_Intersects to know how likely that is).

What is the difference in performance between the old method and the new method?

Cheers,

Jeff
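For comparison, the two methods Jeff is contrasting look roughly like this (a sketch — the geometry column names `geom` and `point` are assumptions, since the actual PostGIS columns aren't shown in this part of the thread):

```sql
-- Old method (sketch): intersect computed on the fly with PostGIS
SELECT d.*
FROM data d
JOIN area a ON ST_Intersects(a.geom, d.point)
WHERE a.id IN (1, 5, 8, 167);

-- New method (sketch): intersections pre-computed into data_area
SELECT d.*
FROM data d
JOIN data_area da ON da.data_id = d.id
WHERE da.area_id IN (1, 5, 8, 167);
```

The trade-off is CPU-heavy geometry tests per query versus a very large join table whose index must be traversed instead.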

Re: INDEX Performance Issue

From
Jeff Janes
Date:
On Sun, Apr 7, 2013 at 3:22 PM, Mark Davidson <mark@4each.co.uk> wrote:
Takes a little longer with the INNER join, unfortunately: about 3.5 minutes. Here is the query plan: http://explain.depesz.com/s/EgBl.

With the JOIN there might not be a match if the data does not fall within one of the areas that is selected in the IN query.

So, just to be absolutely clear: data id 10 might fall in areas (1, 5, 8, 167), but the user might be querying areas (200 ... 500), so no matching area would be found.

I'm not clear on what you *want* to happen.  Are you sure it works the way you want it to now?  If you want every specified id to return at least one row even if there is no qualified area matching it, you have to move the LEFT JOIN one join to the left, and have to move the IN list criteria from the WHERE to the JOIN.

Cheers,

Jeff
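A sketch of the rewrite Jeff describes (table names from the thread; the area list and other filters are illustrative):

```sql
-- Keep every qualifying data row even when no selected area matches:
-- the LEFT JOIN starts one join earlier (data -> data_area), and the
-- IN-list restriction moves from WHERE into the join condition.
SELECT d.*
FROM data d
LEFT JOIN data_area da ON da.data_id = d.id
                      AND da.area_id IN (1, 5, 8, 167)
LEFT JOIN area a       ON a.id = da.area_id
WHERE d.datasetid IN (1, 2);   -- non-area filters stay in WHERE
```

With the IN list in the WHERE clause instead, rows with NULL `area_id` would be filtered out and the LEFT JOIN would behave like an inner join.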

Re: INDEX Performance Issue

From
Jeff Janes
Date:
On Mon, Apr 8, 2013 at 10:02 AM, Mark Davidson <mark@4each.co.uk> wrote:
Been trying to progress with this today. I decided to set up the database on my local machine to try a few things, and I'm getting much more sensible results and a totally different query plan: http://explain.depesz.com/s/KGd. In this case the query took about a minute, but it does sometimes take around 80 seconds.


Could anyone offer any reasoning as to why the plan would be so different across the two machines?


The estimated costs of the two plans are very close to each other, so it doesn't take much to cause a switch to happen.

Is the test instance a binary copy of the production one (i.e. created from a base backup) or is it only a logical copy (e.g. pg_dump followed by a restore)?  A logical copy will probably be more compact than the original and so will have slightly different estimates.

You could check pg_class for relpages on all relevant tables and indexes on both servers.
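For example (index names taken from earlier in the thread; run this on both servers and compare the output):

```sql
-- Compare physical size and row estimates between the two machines
SELECT c.relname, c.relkind, c.relpages, c.reltuples
FROM pg_class c
WHERE c.relname IN ('data', 'area', 'data_area',
                    'data_area_pkey', 'data_area_data_id_index')
ORDER BY c.relname;
```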

Also, since ANALYZE uses random sampling for large tables, the estimates can move around just by chance. If you repeat the query several times with an ANALYZE in between, does the plan change, or, if not, how much does the estimated cost change within the plan?  You could check pg_stats for the relevant tables and columns between the two servers to see how similar they are.

The estimated cost of a hash join is very dependent on how frequent the most common value of the hashed column is thought to be.  And the estimate of this number can be very fragile if ANALYZE is based on a small fraction of the table.  Turning up the statistics for those columns might be worthwhile.
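Raising the statistics target for the join columns would look like this (1000 is an illustrative value; the default target in 9.x is 100):

```sql
-- Sample the join columns more heavily, then re-gather statistics
ALTER TABLE data_area ALTER COLUMN data_id SET STATISTICS 1000;
ALTER TABLE data_area ALTER COLUMN area_id SET STATISTICS 1000;
ANALYZE data_area;
```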

Cheers,

Jeff