[GENERAL] Weird performance difference - Mailing list pgsql-general

From Israel Brewster
Subject [GENERAL] Weird performance difference
Date
Msg-id AAD8DF38-32CE-437F-B2D2-C7BB0D5B5BA3@ravnalaska.net
Whole thread Raw
Responses Re: [GENERAL] Weird performance difference  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-general
Summary: the following query takes around 12 seconds on my test machine. On my production machine, it's at half an hour and counting. What's going on?

Details:

As a first stab at getting some data I need, I've developed the following SQL query:

SELECT
    legfrom,
    count(*) as totaldeps,
    count(*) filter (where depdly<='00:00:00'::interval) as d0,
    count(*) filter (where depdly<='00:05:00'::interval) as d5,
    count(*) filter (where depdly<='00:14:00'::interval) as d14,
    count(*) filter (where depdly<='00:30:00'::interval) as d30,
    count(*) filter (where legto=destination) as totalarriv,
    count(*) filter (where arrdly<='00:00:00'::interval AND legto=destination) as a0,
    count(*) filter (where arrdly<='00:05:00'::interval AND legto=destination) as a5,
    count(*) filter (where arrdly<='00:14:00'::interval AND legto=destination) as a14,
    count(*) filter (where arrdly<='00:30:00'::interval AND legto=destination) as a30,
    count(*) filter (where legto!=destination) as div
FROM
(SELECT distinct on (legdetail.flightnum, legfrom, outtime)
    legdetail.flightnum,
    legfrom,
    legto,
    outtime,
    intime,
    depart_time,
    arrival_time,
    destination,
    outtime::time-oag_schedules.depart_time as depdly,
    intime::time-oag_Schedules.arrival_time as arrdly
FROM oag_batches
NATURAL JOIN oag_schedules
INNER JOIN legdetail
    ON outtime::date BETWEEN oag_schedules.startdate AND coalesce(oag_schedules.enddate,'infinity')
    AND outtime::date BETWEEN oag_batches.eff_from AND coalesce(oag_batches.eff_to,'infinity')
    AND extract(isodow from outtime::date)=ANY(frequency)
    AND oag_schedules.flightnum::text=legdetail.flightnum
    AND oag_schedules.origin=legdetail.legfrom
WHERE
    outtime::date>='2017-01-01'
    AND legdetail.flightnum~E'^[0-9]*$'
    AND legdetail.flightnum::integer < 8000
    AND csom='S'
ORDER BY legdetail.flightnum, legfrom, outtime, oag_batches.eff_from DESC) s1
GROUP BY legfrom
ORDER BY legfrom;

which, when run on my test server, has this explain analyze output: https://explain.depesz.com/s/4piv. Around 12 second runtime, which isn't too bad (in the grand scheme of things), although there is probably room for improvement.

oag_batches and oag_schedules are postgres_fdw foreign tables in the same database cluster (different DB same host/server instance). oag_schedules has about 80,000 rows, legdetail has about 60,000 rows, and oag_batches has 125 rows, so while we are dealing with a fair amount of data, it's nothing extreme - I have tables with billions of rows that it handles just fine.

With the query comfortably running on my test machine, I went ahead and tried it out on my production machine. And I waited. And waited. And waited some more. As of the writing of this line, pg_stat_activity shows that the query has been running for over 30 minutes. It also shows a status of "active", and nothing under the wait_event column. A second entry in pg_stat_activity shows the foreign data wrapper connection, with a state of "idle in transaction" and again nothing under the wait_event column. Top shows the two processes using around 62% and 45% CPU, with the rest of machine being 96% idle, so they are doing *something*, but not even enough to max out one of the cores, nor is the machine as a whole maxed out. The PostgreSQL log shows plenty of activity (and none of the users of the database are complaining), so apparently the database as a whole is working fine - it's just this one query that's not completing. Further, iotop shows no significant disk access - read is a solid 0.00 B/s, and write is only around 80 K/s, so it's not like it's waiting on disk i/o.

Both test server and production are running PostgreSQL 9.6, although my test server is at 9.6.1 while production is at 9.6.5 (huh, production got ahead of test. I'll have to fix that. Oh well), and I did a dump from production to test using -C just before starting work on the query, so the datasets are pretty closely matched. The production server is configured to take full(er) advantage of the hardware than my test database, so if anything I would expect it to be faster. Hardware differences obviously can have an effect on the query speed, but I wouldn't think a difference of 12 seconds to over half an hour (and counting). So what could explain this huge discrepancy?

One difference I can think of is that my production database is replicating (streaming replication), but since this is just a select, I can't think why that would make a difference... any thoughts?

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------



Attachment

pgsql-general by date:

Previous
From: Joe Conway
Date:
Subject: Re: [GENERAL] How to find out extension directory
Next
From: Justin Pryzby
Date:
Subject: Re: [GENERAL] Weird performance difference