Slow performance with join on many fields - Mailing list pgsql-performance

From Alex Johnson
Subject Slow performance with join on many fields
Date
Msg-id F8F237F6-4E04-11D7-85B6-000502FCE08D@aretesystems.com
Whole thread Raw
Responses Re: Slow performance with join on many fields
List pgsql-performance
Hello-

I'm working with a MS Access database that a client wants to turn into
a multi-user database. I'm evaluating PostgreSQL for that purpose (I'd
_really_ like to be able to recommend and open-source solution to
them). However, I'm running into a performance-related issue that I was
hoping this list could help me with.

I have three tables: tbl_samples (~2000 rows), tbl_tests (~4000 rows),
and tbl_results (~20,000 rows), with one-to-many relationships between
them (cascading, in the order given - table definitions are attached
below). I'm looking at the following query that joins these three
tables:

SELECT
    tbl_samples.station_id,
    tbl_samples.samp_date,
    tbl_samples.matrix,
    tbl_samples.samp_type_code,
    tbl_samples.samp_no,
    tbl_samples.field_samp_id,
    tbl_tests.method,
    tbl_tests.lab,
    tbl_results.par_code,
    tbl_results.val_qualifier,
    tbl_results.value,
    tbl_results.units,
    tbl_results.mdl,
    tbl_results.date_anal
FROM
      (tbl_samples
           INNER JOIN tbl_tests USING
        (station_id,
        samp_date,
        matrix,
        samp_type_code,
        samp_no,
        samp_bdepth,
        samp_edepth)
      )
      INNER JOIN tbl_results USING
        (station_id,
        samp_date,
        matrix,
        samp_type_code,
        samp_no,
        samp_bdepth,
        samp_edepth,
        method);

In Access, this query runs in about a second. In PostgreSQL on the same
machine, it takes about 12-15 seconds for the initial rows to be
returned, and about 45 seconds to returns all rows. (This is consistent
whether I use psql, use the pgAdminII SQL window, or use Access with
the ODBC driver.)

This is the output from EXPLAIN:

Nested Loop  (cost=437.73..1216.02 rows=1 width=245)
   Join Filter: ("outer".method = "inner".method)
   ->  Merge Join  (cost=437.73..461.38 rows=125 width=131)
         Merge Cond: (("outer".matrix = "inner".matrix) AND
("outer".samp_edepth = "inner".samp_edepth) AND ("outer".samp_bdepth =
"inner".samp_bdepth) AND ("outer".samp_no = "inner".samp_no) AND
("outer".samp_type_code = "inner".samp_type_code) AND
("outer".samp_date = "inner".samp_date) AND ("outer".station_id =
"inner".station_id))
         ->  Sort  (cost=117.51..120.77 rows=1304 width=63)
               Sort Key: tbl_samples.matrix, tbl_samples.samp_edepth,
tbl_samples.samp_bdepth, tbl_samples.samp_no,
tbl_samples.samp_type_code, tbl_samples.samp_date,
tbl_samples.station_id
               ->  Seq Scan on tbl_samples  (cost=0.00..50.04 rows=1304
width=63)
         ->  Sort  (cost=320.22..328.68 rows=3384 width=68)
               Sort Key: tbl_tests.matrix, tbl_tests.samp_edepth,
tbl_tests.samp_bdepth, tbl_tests.samp_no, tbl_tests.samp_type_code,
tbl_tests.samp_date, tbl_tests.station_id
               ->  Seq Scan on tbl_tests  (cost=0.00..121.84 rows=3384
width=68)
   ->  Index Scan using tbl_results_pkey on tbl_results
(cost=0.00..5.99 rows=1 width=114)
         Index Cond: (("outer".station_id = tbl_results.station_id) AND
("outer".samp_date = tbl_results.samp_date) AND ("outer".matrix =
tbl_results.matrix) AND ("outer".samp_type_code =
tbl_results.samp_type_code) AND ("outer".samp_no = tbl_results.samp_no)
AND ("outer".samp_bdepth = tbl_results.samp_bdepth) AND
("outer".samp_edepth = tbl_results.samp_edepth))

I've done the following to try to improve performance:

-postgresql.conf:
    increased shared_buffers to 384
    increased sort_mem to 2048
-clustered all tables on the pkey index
-made sure the joined fields are indexed (they are through the pkeys)

As a note, vm_stat shows no paging while the query is run. Also, I
realize that these keys are large (as in the number of fields). I'll be
condensing these down to sequential IDs (e.g. a SERIAL type) for a
further test, but I'm curious why Access seems to outperform Postgres
in this instance.

My question is, am I missing anything? PostgreSQL will be a hard sell
if they have to take a performance hit.

Thanks for any suggestions you can provide. Sorry for the long e-mail,
but I wanted to provide enough info to diagnose the issue.

Alex Johnson
________________________________
Table defs:

CREATE TABLE tbl_Samples (
    Station_ID VARCHAR (25) NOT NULL,
    Samp_Date TIMESTAMP WITH TIME ZONE NOT NULL,
    Matrix VARCHAR (10) NOT NULL,
    Samp_Type_Code VARCHAR (5) NOT NULL,
    Samp_No INTEGER NOT NULL,
    Samp_BDepth DOUBLE PRECISION NOT NULL,
    Samp_EDepth DOUBLE PRECISION NOT NULL,
    Depth_units VARCHAR (3),
    Samp_start_time TIME,
    Samp_end_time TIME,
    Field_Samp_ID VARCHAR (20),
    Lab_Samp_ID VARCHAR (20),
    Samp_Meth VARCHAR (20),
  ...snip...
  PRIMARY KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth)
);

CREATE TABLE tbl_Tests (
    Station_ID VARCHAR (25) NOT NULL,
    Samp_Date TIMESTAMP WITH TIME ZONE NOT NULL,
    Matrix VARCHAR (10) NOT NULL,
    Samp_Type_Code VARCHAR (5) NOT NULL,
    Samp_No INTEGER NOT NULL,
    Samp_BDepth DOUBLE PRECISION NOT NULL,
    Samp_EDepth DOUBLE PRECISION NOT NULL,
    Method VARCHAR (50) NOT NULL,
    Lab VARCHAR (10) NOT NULL,
    Date_Rec TIMESTAMP WITH TIME ZONE,
...snip...
  PRIMARY KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth,Method)
);

CREATE TABLE tbl_Results (
    Station_ID VARCHAR (25) NOT NULL,
    Samp_Date TIMESTAMP WITH TIME ZONE NOT NULL,
    Matrix VARCHAR (10) NOT NULL,
    Samp_Type_Code VARCHAR (5) NOT NULL,
    Samp_No INTEGER NOT NULL,
    Samp_BDepth DOUBLE PRECISION NOT NULL,
    Samp_EDepth DOUBLE PRECISION NOT NULL,
    Method VARCHAR (50) NOT NULL,
    Par_code VARCHAR (50) NOT NULL,
    Val_Qualifier VARCHAR (50) NOT NULL,
    Value DECIMAL (20,9) NOT NULL,
...snip...
  PRIMARY KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth,Method,Par_code)
);

ALTER TABLE tbl_Tests ADD CONSTRAINT REL_1 FOREIGN KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth)
    REFERENCES tbl_Samples ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE tbl_Results ADD CONSTRAINT REL_2 FOREIGN KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth,Method)
    REFERENCES tbl_Tests ON DELETE CASCADE ON UPDATE CASCADE;

________________________________________________________________________
______
A r e t e   S y s t e m s
Alexander M. Johnson, P.E.


pgsql-performance by date:

Previous
From: Orito
Date:
Subject: Re: ABOUT YOUR CREDIT........... eyr
Next
From: Tom Lane
Date:
Subject: Re: Slow performance with join on many fields