Thread: [GENERAL] Weird performance difference

From: Israel Brewster
Summary: the following query takes around 12 seconds on my test machine. On my production machine, it's at half an hour and counting. What's going on?

Details:

As a first stab at getting some data I need, I've developed the following SQL query:

SELECT
    legfrom,
    count(*) as totaldeps,
    count(*) filter (where depdly<='00:00:00'::interval) as d0,
    count(*) filter (where depdly<='00:05:00'::interval) as d5,
    count(*) filter (where depdly<='00:14:00'::interval) as d14,
    count(*) filter (where depdly<='00:30:00'::interval) as d30,
    count(*) filter (where legto=destination) as totalarriv,
    count(*) filter (where arrdly<='00:00:00'::interval AND legto=destination) as a0,
    count(*) filter (where arrdly<='00:05:00'::interval AND legto=destination) as a5,
    count(*) filter (where arrdly<='00:14:00'::interval AND legto=destination) as a14,
    count(*) filter (where arrdly<='00:30:00'::interval AND legto=destination) as a30,
    count(*) filter (where legto!=destination) as div
FROM
(SELECT distinct on (legdetail.flightnum, legfrom, outtime)
    legdetail.flightnum,
    legfrom,
    legto,
    outtime,
    intime,
    depart_time,
    arrival_time,
    destination,
    outtime::time-oag_schedules.depart_time as depdly,
    intime::time-oag_schedules.arrival_time as arrdly
FROM oag_batches
NATURAL JOIN oag_schedules
INNER JOIN legdetail
    ON outtime::date BETWEEN oag_schedules.startdate AND coalesce(oag_schedules.enddate,'infinity')
    AND outtime::date BETWEEN oag_batches.eff_from AND coalesce(oag_batches.eff_to,'infinity')
    AND extract(isodow from outtime::date)=ANY(frequency)
    AND oag_schedules.flightnum::text=legdetail.flightnum
    AND oag_schedules.origin=legdetail.legfrom
WHERE
    outtime::date>='2017-01-01'
    AND legdetail.flightnum~E'^[0-9]*$'
    AND legdetail.flightnum::integer < 8000
    AND csom='S'
ORDER BY legdetail.flightnum, legfrom, outtime, oag_batches.eff_from DESC) s1
GROUP BY legfrom
ORDER BY legfrom;

which, when run on my test server, has this explain analyze output: https://explain.depesz.com/s/4piv. Around a 12-second runtime, which isn't too bad (in the grand scheme of things), although there is probably room for improvement.

oag_batches and oag_schedules are postgres_fdw foreign tables in the same database cluster (different DB, same host/server instance). oag_schedules has about 80,000 rows, legdetail has about 60,000 rows, and oag_batches has 125 rows, so while we are dealing with a fair amount of data, it's nothing extreme - I have tables with billions of rows that it handles just fine.

With the query comfortably running on my test machine, I went ahead and tried it out on my production machine. And I waited. And waited. And waited some more. As of the writing of this line, pg_stat_activity shows that the query has been running for over 30 minutes. It also shows a status of "active", and nothing under the wait_event column. A second entry in pg_stat_activity shows the foreign data wrapper connection, with a state of "idle in transaction" and again nothing under the wait_event column. Top shows the two processes using around 62% and 45% CPU, with the rest of the machine being 96% idle, so they are doing *something*, but not even enough to max out one of the cores, nor is the machine as a whole maxed out. The PostgreSQL log shows plenty of activity (and none of the users of the database are complaining), so apparently the database as a whole is working fine - it's just this one query that's not completing. Further, iotop shows no significant disk access - read is a solid 0.00 B/s, and write is only around 80 K/s, so it's not like it's waiting on disk i/o.
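
For reference, the pg_stat_activity check I've been watching is something along these lines:

    -- show non-idle backends with their current runtime and wait state
    SELECT pid, state, wait_event_type, wait_event,
           now() - query_start AS runtime,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle';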

Both test server and production are running PostgreSQL 9.6, although my test server is at 9.6.1 while production is at 9.6.5 (huh, production got ahead of test. I'll have to fix that. Oh well), and I did a dump from production to test using -C just before starting work on the query, so the datasets are pretty closely matched. The production server is configured to take fuller advantage of the hardware than my test server, so if anything I would expect it to be faster. Hardware differences can obviously affect query speed, but I wouldn't expect them to account for a jump from 12 seconds to over half an hour (and counting). So what could explain this huge discrepancy?

One difference I can think of is that my production database is replicating (streaming replication), but since this is just a select, I can't think why that would make a difference... any thoughts?

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------




Re: [GENERAL] Weird performance difference

From: Justin Pryzby
On Fri, Oct 20, 2017 at 03:08:26PM -0800, Israel Brewster wrote:
> Summary: the following query takes around 12 seconds on my test machine. On my production machine, it's at half an hour and counting. What's going on?
>
> which, when run on my test server, has this explain analyze output: https://explain.depesz.com/s/4piv. Around a 12-second runtime, which isn't too bad (in the grand scheme of things), although there is probably room for improvement.
 

Are these casts to ::date really needed? (Alternately, do you have an index on column::date?)
|WHERE outtime::date>='2017-01-01'
|ON outtime::date BETWEEN oag_schedules.startdate
|AND outtime::date BETWEEN oag_batches.eff_from
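
If the casts do need to stay, an expression index matching them might help - untested sketch, index name made up, and note that a bare timestamptz::date cast isn't immutable, so this only works as-is if outtime is a plain timestamp:

    -- only valid if outtime is timestamp WITHOUT time zone
    CREATE INDEX legdetail_outtime_date_idx ON legdetail ((outtime::date));

Otherwise, rewriting the WHERE clause to compare the raw column (outtime >= '2017-01-01') would at least let a plain index on outtime be used there.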

The problem is clearly here:
Merge Join (cost=30,604.12..31,301.12 rows=1 width=76) (actual time=1,153.883..9,812.434 rows=3,420,235 loops=1)
   Merge Cond: ((((oag_schedules.flightnum)::text) = (legdetail.flightnum)::text) AND ((oag_schedules.origin)::text = (legdetail.legfrom)::text))
   Join Filter: (((legdetail.outtime)::date >= oag_schedules.startdate) AND ((legdetail.outtime)::date <= COALESCE(oag_schedules.enddate, 'infinity'::date)) AND (date_part('isodow'::text, ((legdetail.outtime)::date)::timestamp without time zone) = ANY ((oag_schedules.frequency)::double precision[])))
   Rows Removed by Join Filter: 6822878

Can you send "explain" (not explain analyze) for the production server?

And \d for those tables.

And/or EXPLAIN ANALYZE for a query with shorter date range on production (to
confirm it has a similar problem in rowcount estimation).
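
For instance (untested), just tightening the date filter in the subquery:

    WHERE outtime::date BETWEEN '2017-01-01' AND '2017-01-07'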

You can try munging the query to move/change the "Join Filter" components of
the query to see which one is contributing most to the rowcount estimate being
off by a factor of 3e6.

Justin



Re: [GENERAL] Weird performance difference

From: Israel Brewster
On Oct 20, 2017, at 3:31 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:

On Fri, Oct 20, 2017 at 03:08:26PM -0800, Israel Brewster wrote:
Summary: the following query takes around 12 seconds on my test machine. On my production machine, it's at half an hour and counting. What's going on?

which, when run on my test server, has this explain analyze output: https://explain.depesz.com/s/4piv. Around a 12-second runtime, which isn't too bad (in the grand scheme of things), although there is probably room for improvement.

Are these casts to ::date really needed? (Alternately, do you have an index on column::date?)
|WHERE outtime::date>='2017-01-01'
|ON outtime::date BETWEEN oag_schedules.startdate
|AND outtime::date BETWEEN oag_batches.eff_from

Possibly not, but outtime is a timestamp while startdate and eff_from are simple date fields. I was being explicit :-) And no, I don't have an index on column::date - that's certainly something I can try in order to optimize things.
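
Since outtime is a timestamptz, I gather a bare index on outtime::date wouldn't even be accepted (the cast isn't immutable), so I'd have to pin a time zone and use the same expression in the query - something like this untested sketch (index name and zone are just my guesses):

    -- hypothetical: pinning a zone makes the expression immutable and indexable
    CREATE INDEX outdate_idx
        ON legdetail (((outtime AT TIME ZONE 'America/Anchorage')::date));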

The problem is clearly here:

Merge Join (cost=30,604.12..31,301.12 rows=1 width=76) (actual time=1,153.883..9,812.434 rows=3,420,235 loops=1)
   Merge Cond: ((((oag_schedules.flightnum)::text) = (legdetail.flightnum)::text) AND ((oag_schedules.origin)::text = (legdetail.legfrom)::text))
   Join Filter: (((legdetail.outtime)::date >= oag_schedules.startdate) AND ((legdetail.outtime)::date <= COALESCE(oag_schedules.enddate, 'infinity'::date)) AND (date_part('isodow'::text, ((legdetail.outtime)::date)::timestamp without time zone) = ANY ((oag_schedules.frequency)::double precision[])))
   Rows Removed by Join Filter: 6822878

Can you send "explain" (not explain analyze) for the production server?

Not for the full query - it only just completed, after 70 minutes or so, and I wasn't running under EXPLAIN ANALYZE. Running with a shorter date range of only 7 days, as you suggest below: https://explain.depesz.com/s/r80j

I notice that this is a COMPLETELY different query plan from what I got on my test machine, but I'm not sure if that's a good thing or not. The end result is obviously not.


And \d for those tables.

Sure:

flightlogs=# \d legdetail
                                    Table "public.legdetail"
   Column   |           Type           |                       Modifiers                        
------------+--------------------------+--------------------------------------------------------
 id         | integer                  | not null default nextval('legdetail_id_seq'::regclass)
 logid      | integer                  | 
 flightnum  | character varying(32)    | 
 legfrom    | character varying(6)     | 
 legto      | character varying(6)     | 
 pax        | integer                  | default 0
 cargo      | integer                  | default 0
 legdate    | date                     | default now()
 outtime    | timestamp with time zone | not null
 ontime     | timestamp with time zone | 
 offtime    | timestamp with time zone | 
 intime     | timestamp with time zone | 
 blocktime  | interval                 | 
 flighttime | interval                 | 
 mail       | integer                  | default 0
 csom       | character(1)             | 
 nonrevpax  | smallint                 | 
 legsic     | character varying(128)   | 
 nonrevfrt  | integer                  | 
 ti         | boolean                  | default false
Indexes:
    "legdetail_pkey" PRIMARY KEY, btree (id)
    "csom_idx" btree (csom)
    "flightnum_idx" btree (flightnum)
    "legdate_idx" btree (legdate)
    "legfrom_idx" btree (legfrom)
    "legto_idx" btree (legto)
    "logid_idx" btree (logid)
    "outtime_idx" btree (outtime)
Foreign-key constraints:
    "logfk" FOREIGN KEY (logid) REFERENCES logs(id)
Triggers:
    legdetail_audit AFTER UPDATE ON legdetail FOR EACH ROW WHEN (old.* <> new.*) EXECUTE PROCEDURE changeaudit()
    update_hours_trigger AFTER INSERT OR DELETE OR UPDATE ON legdetail FOR EACH ROW EXECUTE PROCEDURE update_hours()
    update_logtime_trigger AFTER INSERT OR DELETE OR UPDATE ON legdetail FOR EACH ROW EXECUTE PROCEDURE update_logtime()
    update_status_legs AFTER INSERT OR DELETE OR UPDATE ON legdetail FOR EACH ROW EXECUTE PROCEDURE acstatupdate_loc()

flightlogs=# \d oag_batches
                     Foreign table "public.oag_batches"
  Column  |           Type           | Modifiers |       FDW Options        
----------+--------------------------+-----------+--------------------------
 batchid  | uuid                     | not null  | (column_name 'batchid')
 name     | character varying(256)   | not null  | (column_name 'name')
 created  | timestamp with time zone |           | (column_name 'created')
 eff_from | date                     |           | (column_name 'eff_from')
 eff_to   | date                     |           | (column_name 'eff_to')
Server: localhost
FDW Options: (schema_name 'public', table_name 'batches')

flightlogs=# \d oag_schedules
                       Foreign table "public.oag_schedules"
    Column    |          Type          | Modifiers |         FDW Options          
--------------+------------------------+-----------+------------------------------
 id           | uuid                   | not null  | (column_name 'id')
 batchid      | uuid                   |           | (column_name 'batchid')
 company      | character varying(3)   |           | (column_name 'company')
 flightnum    | integer                | not null  | (column_name 'flightnum')
 startdate    | date                   |           | (column_name 'startdate')
 enddate      | date                   |           | (column_name 'enddate')
 frequency    | integer[]              |           | (column_name 'frequency')
 origin       | character varying(3)   |           | (column_name 'origin')
 depart_time  | time without time zone |           | (column_name 'depart_time')
 destination  | character varying(3)   |           | (column_name 'destination')
 arrival_time | time without time zone |           | (column_name 'arrival_time')
 equipment    | character varying(3)   |           | (column_name 'equipment')
 flight_type  | character(1)           |           | (column_name 'flight_type')
 num_seats    | smallint               |           | (column_name 'num_seats')
Server: localhost
FDW Options: (schema_name 'public', table_name 'schedules')


And/or EXPLAIN ANALYZE for a query with shorter date range on production (to
confirm it has a similar problem in rowcount estimation).

You can try munging the query to move/change the "Join Filter" components of
the query to see which one is contributing most to the rowcount estimate being
off by a factor of 3e6.

Justin

Re: [GENERAL] Weird performance difference

From: Tom Lane
Israel Brewster <israel@ravnalaska.net> writes:
>> Can you send "explain" (not explain analyze) for the production server?

> Not for the full query - it only just completed, after 70 minutes or so, and I wasn't running under EXPLAIN ANALYZE. Running with a shorter date range of only 7 days, as you suggest below: https://explain.depesz.com/s/r80j

First thing that jumps out from that is

> Foreign Scan on oag_schedules (cost=100.00..128.60 rows=620 width=108) (actual time=3.576..477.524 rows=79,853 loops=1)

Being off by a factor of 100 at the scan level is never a good start for a
join plan.  Turn on use_remote_estimate (assuming these are postgres_fdw
tables).  Also try explicitly ANALYZE'ing the foreign tables.  I do not
believe auto-analyze will touch foreign tables ...
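
In SQL terms, roughly (taking the server name from your \d output):

    ALTER SERVER localhost OPTIONS (ADD use_remote_estimate 'true');
    ANALYZE oag_schedules;
    ANALYZE oag_batches;
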
        regards, tom lane



Re: [GENERAL] Weird performance difference

From: israel
On 2017-10-20 16:38, Tom Lane wrote:
> Israel Brewster <israel@ravnalaska.net> writes:
>>> Can you send "explain" (not explain analyze) for the production 
>>> server?
> 
>> Not for the full query - it only just completed, after 70 minutes or 
>> so, and I wasn't running under EXPLAIN ANALYZE. Running with a shorter 
>> date range of only 7 days, as you suggest below: 
>> https://explain.depesz.com/s/r80j
> 
> First thing that jumps out from that is
> 
>> Foreign Scan on oag_schedules (cost=100.00..128.60 rows=620 width=108) 
>> (actual time=3.576..477.524 rows=79,853 loops=1)
> 
> Being off by a factor of 100 at the scan level is never a good start 
> for a
> join plan.  Turn on use_remote_estimate (assuming these are 
> postgres_fdw
> tables).  Also try explicitly ANALYZE'ing the foreign tables.  I do not
> believe auto-analyze will touch foreign tables ...

Thanks - the ANALYZE apparently did it. Running the FULL query (for the 
entire year) now returns in slightly better time than my test machine: 
https://explain.depesz.com/s/GtiM

Also, the query plan now looks similar. So now that it's working, I can 
move on to optimizing. It's already been suggested that I remove the 
cast to date (or index it), so I guess that's the first thing I'll try.

> 
>             regards, tom lane



Re: [GENERAL] Weird performance difference

From: Michael Paquier
On Sat, Oct 21, 2017 at 9:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also try explicitly ANALYZE'ing the foreign tables.  I do not
> believe auto-analyze will touch foreign tables ...

Autovacuum and autoanalyze only process ordinary tables and matviews,
discarding everything else (foreign tables included) when scanning
pg_class. See do_autovacuum().
-- 
Michael

