Thread: Postgresql 10.3 , query never completes if LIMIT clause is specifiedand paralle processing is on

Hi there, we are using postgresql 10.3 and we're facing an issue with a query. The query (full query below)  completes only  when: 

1 - LIMIT 10 is removed
or
2 -  show max_parallel_workers_per_gather  is set to 0, so parallel processing is disabled.

With  max_parallel_workers_per_gather   set to the default value (8) I'm not even able to get the query plan.

Notes: 

  • We're experiencing the issue in any server of ours but I've reproduced the issue in a fresh restored database with full vacuum/reindex of the tables.
  • We didn't touch any parameter concering the parallel processing, we're running the defaults: 

cmdstaging=# show max_parallel_workers_per_gather ;
 max_parallel_workers_per_gather
---------------------------------
 8
(1 row)

cmdstaging=# show max_worker_processes ;
 max_worker_processes
----------------------
 8
(1 row)

cmdstaging=# show max_parallel_workers;
 max_parallel_workers
----------------------
 8
(1 row)




The query completes only omitting the LIMIT clause or when I disable parallel processing: 

 id   | vendor_id | gaa |   pop_name    | pop_status | pop_location | pop_provider_id | pop_provider |     pop_street      | pop_city | pop
_postal_code | pop_state | pop_country | pop_country_id
--------+-----------+-----+---------------+------------+--------------+-----------------+--------------+---------------------+----------+----
-------------+-----------+-------------+----------------
 684807 |     12346 |     | GTT/POP/LON1T | Active     | LON1T        |           12288 | Telehouse UK | 14 Coriander Avenue | London   | E14
 2AA         |           | GB          |            219
(1 row)

Time: 4374.759 ms (00:04.375)
cmdstaging=# show max_parallel_workers_per_gather ;
 max_parallel_workers_per_gather
---------------------------------
 0
(1 row)

Time: 0.097 ms


Otherwise it just keep running for forever. 






This is the full query: 



SELECT * FROM (
SELECT
seg.circuit_id AS id,
vendor_gtt_pop.vendor_id,
CASE WHEN vendor_gtt_pop.push_to_gaa = 1 THEN 'Y' END as gaa,
pop.gii_circuitid AS pop_name,
cst.label AS pop_status,
seg.a_company_name AS pop_location,
seg.vendor_id AS pop_provider_id,
seg.vendor_name AS pop_provider,
cs.address1 AS pop_street,
cs.city AS pop_city,
cs.postal_code AS pop_postal_code,
cs.state AS pop_state,
csc.code AS pop_country,
cs.country_id AS pop_country_id
FROM (
SELECT c.gii_circuitid, max(so.id) AS service_order_id
FROM service_order so
join circuit c on c.product_id=so.product_id
join master_service_order mso on mso.id=so.master_service_order_id
WHERE NOT (so.ordertype_id = 2 AND so.status <> 999) AND
NOT (so.ordertype_id = 3 AND so.status <> 999) AND
c.status >= 20 AND
c.status not in (160,999) AND
mso.client_id=11615 AND
c.service_description=28 AND
c.status!=160
GROUP BY c.gii_circuitid
) pop
JOIN service_order so ON so.id = pop.service_order_id
left JOIN client_site cs on cs.id=so.a_site_id
left JOIN country csc on csc.id=cs.country_id
JOIN circuit c ON so.product_id=c.product_id
JOIN circuit_status cst ON cst.id=c.status
JOIN (
SELECT c.id AS circuit_id, sg.id AS segment_id, c.pop_support_vendor_id AS vendor_id,
v.name AS vendor_name, sg.a_company_name
FROM segment sg
JOIN circuit_layout cl ON cl.segment_id = sg.id AND cl.ordinal = 1
JOIN circuit c ON c.id = cl.circuit_id
JOIN vendor v ON v.id = c.pop_support_vendor_id
) seg ON seg.circuit_id = c.id
JOIN vendor_gtt_pop on vendor_gtt_pop.gtt_pop_id = seg.circuit_id
) foo where vendor_id = 12346 AND (1=1) ORDER BY pop_name ASC,id LIMIT 10


Execution plan with  max_parallel_workers_per_gather =0 ,  max_parallel_workers_per_gather =8 and no LIMIT clause : 



=== QUERY PLAN WITH  max_parallel_workers_per_gather  = 0 === 

 Limit  (cost=433508.77..433508.77 rows=1 width=176) (actual time=2992.875..2992.875 rows=1 loops=1)
   ->  Sort  (cost=433508.77..433508.77 rows=1 width=176) (actual time=2992.874..2992.874 rows=1 loops=1)
         Sort Key: c_2.gii_circuitid, c_1.id
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=168438.04..433508.76 rows=1 width=176) (actual time=1747.038..2992.864 rows=1 loops=1)
               Join Filter: (c_1.id = vendor_gtt_pop.gtt_pop_id)
               Rows Removed by Join Filter: 662
               ->  Hash Join  (cost=168438.04..433486.80 rows=1 width=148) (actual time=1540.484..2952.522 rows=663 loops=1)
                     Hash Cond: (c_1.id = c.id)
                     ->  Hash Join  (cost=125957.20..388048.64 rows=788618 width=63) (actual time=1335.406..2746.835 rows=1582 loops=1)
                           Hash Cond: (c_1.pop_support_vendor_id = v.id)
                           ->  Hash Join  (cost=125518.15..377653.28 rows=788618 width=40) (actual time=1330.727..2711.206 rows=778564 loops=
1)
                                 Hash Cond: (sg.id = cl.segment_id)
                                 ->  Seq Scan on segment sg  (cost=0.00..231189.42 rows=3482542 width=32) (actual time=0.004..472.967 rows=34
83328 loops=1)
                                 ->  Hash  (cost=115660.43..115660.43 rows=788618 width=16) (actual time=1328.852..1328.852 rows=778564 loops
=1)
                                       Buckets: 1048576  Batches: 1  Memory Usage: 41653kB
                                       ->  Merge Join  (cost=4.59..115660.43 rows=788618 width=16) (actual time=0.034..1179.131 rows=778564 l
oops=1)
                                             Merge Cond: (cl.circuit_id = c_1.id)
                                             ->  Index Only Scan using circuit_layout_idx on circuit_layout cl  (cost=0.43..61739.87 rows=788
618 width=8) (actual time=0.016..348.350 rows=778564 loops=1)
                                                   Index Cond: (ordinal = 1)
                                                   Heap Fetches: 778564
                                             ->  Index Scan using circuit_id_key on circuit c_1  (cost=0.42..41790.58 rows=909014 width=8) (a
ctual time=0.011..326.429 rows=909014 loops=1)
                           ->  Hash  (cost=325.69..325.69 rows=9069 width=27) (actual time=2.753..2.753 rows=9069 loops=1)
                                 Buckets: 16384  Batches: 1  Memory Usage: 669kB
                                 ->  Seq Scan on vendor v  (cost=0.00..325.69 rows=9069 width=27) (actual time=0.006..1.489 rows=9069 loops=1
)
                     ->  Hash  (cost=42480.82..42480.82 rows=1 width=85) (actual time=205.073..205.073 rows=882 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 102kB
                           ->  Nested Loop  (cost=42469.41..42480.82 rows=1 width=85) (actual time=184.020..204.776 rows=882 loops=1)
                                 Join Filter: (c.status = cst.id)
Time: 3016.688 ms (00:03.017)



=== QUERY PLAN WITH  max_parallel_workers_per_gather  = 8 and NO LIMIT clause === 


-
-----------------------------------------------------------------------------------
 Sort  (cost=253514.80..253514.81 rows=1 width=176) (actual time=1121.737..1121.737 rows=1 loops=1)
   Sort Key: c_2.gii_circuitid, c_1.id
   Sort Method: quicksort  Memory: 25kB
   ->  Hash Join  (cost=33193.56..253514.79 rows=1 width=176) (actual time=1121.723..1121.726 rows=1 loops=1)
         Hash Cond: (c_1.id = c.id)
         ->  Gather  (cost=1466.02..221787.23 rows=3 width=75) (actual time=1027.012..1027.015 rows=1 loops=1)
               Workers Planned: 5
               Workers Launched: 5
               ->  Hash Join  (cost=466.02..220786.93 rows=1 width=75) (actual time=964.279..1020.204 rows=0 loops=6)
                     Hash Cond: (c_1.id = vendor_gtt_pop.gtt_pop_id)
                     ->  Hash Join  (cost=444.07..219779.19 rows=157724 width=63) (actual time=122.728..1019.934 rows=264 loops=6)
                           Hash Cond: (c_1.pop_support_vendor_id = v.id)
                           ->  Merge Join  (cost=5.02..217348.87 rows=157724 width=40) (actual time=1.089..1007.302 rows=129761 loops=6)
                                 Merge Cond: (cl.circuit_id = c_1.id)
                                 ->  Nested Loop  (cost=0.86..171314.49 rows=157724 width=32) (actual time=0.056..477.965 rows=129761 loops=6
)
                                       ->  Parallel Index Only Scan using circuit_layout_idx on circuit_layout cl  (cost=0.43..55430.93 rows=
157724 width=8) (actual time=0.031..96.328 rows=129761 loops=6)
                                             Index Cond: (ordinal = 1)
                                             Heap Fetches: 163021
                                       ->  Index Scan using uniqid on segment sg  (cost=0.43..0.73 rows=1 width=32) (actual time=0.002..0.002
 rows=1 loops=778564)
                                             Index Cond: (id = cl.segment_id)
                                 ->  Index Scan using circuit_id_key on circuit c_1  (cost=0.42..41790.58 rows=909014 width=8) (actual time=0
.018..377.993 rows=908914 loops=6)
                           ->  Hash  (cost=325.69..325.69 rows=9069 width=27) (actual time=3.441..3.441 rows=9069 loops=6)
                                 Buckets: 16384  Batches: 1  Memory Usage: 669kB
                                 ->  Seq Scan on vendor v  (cost=0.00..325.69 rows=9069 width=27) (actual time=0.013..1.850 rows=9069 loops=6
)
                     ->  Hash  (cost=21.91..21.91 rows=3 width=12) (actual time=0.118..0.118 rows=1 loops=6)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on vendor_gtt_pop  (cost=0.00..21.91 rows=3 width=12) (actual time=0.113..0.113 rows=1 loops=6)
                                 Filter: (vendor_id = 12346)
                                 Rows Removed by Filter: 1192
         ->  Hash  (cost=31727.53..31727.53 rows=1 width=85) (actual time=94.693..94.693 rows=882 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 102kB
               ->  Nested Loop  (cost=31725.15..31727.53 rows=1 width=85) (actual time=83.507..94.344 rows=882 loops=1)
                     ->  Nested Loop  (cost=31725.02..31727.38 rows=1 width=74) (actual time=83.496..93.575 rows=882 loops=1)
                           ->  Nested Loop Left Join  (cost=31724.59..31726.89 rows=1 width=70) (actual time=83.484..90.885 rows=882 loops=1)
                                 ->  Nested Loop Left Join  (cost=31724.45..31726.73 rows=1 width=68) (actual time=83.475..90.067 rows=882 lo
ops=1)
                                       ->  Nested Loop  (cost=31724.02..31726.27 rows=1 width=30) (actual time=83.461..87.442 rows=882 loops=
1)
                                             ->  GroupAggregate  (cost=31723.60..31723.62 rows=1 width=26) (actual time=83.446..83.812 rows=8
82 loops=1)
                                                   Group Key: c_2.gii_circuitid
                                                   ->  Sort  (cost=31723.60..31723.60 rows=1 width=26) (actual time=83.441..83.515 rows=884 l
oops=1)
                                                         Sort Key: c_2.gii_circuitid
                                                         Sort Method: quicksort  Memory: 94kB
                                                         ->  Gather  (cost=1000.85..31723.59 rows=1 width=26) (actual time=2.159..81.342 rows
=884 loops=1)
                                                               Workers Planned: 3
                                                               Workers Launched: 2
                                                               ->  Nested Loop  (cost=0.85..30723.49 rows=1 width=26) (actual time=8.509..75.
845 rows=295 loops=3)
                                                                     ->  Nested Loop  (cost=0.42..30722.56 rows=2 width=30) (actual time=8.48
3..74.283 rows=295 loops=3)
                                                                           ->  Parallel Seq Scan on circuit c_2  (cost=0.00..30714.61 rows=3
width=26) (actual time=5.992..72.563 rows=316 loops=3)
                                                                                 Filter: ((status >= 20) AND (status <> ALL ('{160,999}'::int
eger[])) AND (status <> 160) AND (service_description = 28))
                                                                                 Rows Removed by Filter: 302689
                                                                           ->  Index Scan using so_pid_idx on service_order so_1  (cost=0.42.
.2.65 rows=1 width=12) (actual time=0.005..0.005 rows=1 loops=948)
                                                                                 Index Cond: (product_id = c_2.product_id)
                                                                                 Filter: (((ordertype_id <> 2) OR (status = 999)) AND ((order
type_id <> 3) OR (status = 999)))
                                                                                 Rows Removed by Filter: 0
                                                                     ->  Index Scan using master_service_order_id_key on master_service_order
 mso  (cost=0.42..0.46 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=884)
                                                                           Index Cond: (id = so_1.master_service_order_id)
                                                                           Filter: (client_id = 11615)
                                             ->  Index Scan using service_order_id_key on service_order so  (cost=0.42..2.64 rows=1 width=12)
 (actual time=0.003..0.003 rows=1 loops=882)
                                                   Index Cond: (id = (max(so_1.id)))
                                       ->  Index Scan using client_site_pkey on client_site cs  (cost=0.42..0.46 rows=1 width=46) (actual tim
e=0.003..0.003 rows=1 loops=882)
                                             Index Cond: (id = so.a_site_id)
                                 ->  Index Scan using country_pkey on country csc  (cost=0.14..0.16 rows=1 width=6) (actual time=0.001..0.001
 rows=1 loops=882)
                                       Index Cond: (id = cs.country_id)
                           ->  Index Scan using circuit_product_id_idx on circuit c  (cost=0.42..0.49 rows=1 width=12) (actual time=0.003..0.
003 rows=1 loops=882)
                                 Index Cond: (product_id = so.product_id)
                     ->  Index Scan using cs_id_c on circuit_status cst  (cost=0.14..0.16 rows=1 width=19) (actual time=0.001..0.001 rows=1 l
oops=882)
                           Index Cond: (id = c.status)
 Planning time: 3.889 ms
 Execution time: 1122.357 ms
(68 rows)



Thank you very much for your help,

Kind regards,


Alessandro Aste.

On 03/21/2018 05:09 PM, Alessandro Aste wrote:
> Hi there, we are using postgresql 10.3 and we're facing an issue with a
> query. The query (full query below)  completes only  when: 
> 
> 1 - LIMIT 10 is removed
> or
> 2 -  show max_parallel_workers_per_gather  is set to 0, so parallel
> processing is disabled.
> 
> With  max_parallel_workers_per_gather   set to the default value (8) I'm
> not even able to get the query plan.
> 
> Notes: 
> 
>   * We're experiencing the issue in any server of ours but I've
>     reproduced the issue in a fresh restored database with full
>     vacuum/reindex of the tables.
>   * We didn't touch any parameter concering the parallel processing,
>     we're running the defaults: 
> 
> 
> cmdstaging=# show max_parallel_workers_per_gather ;
>  max_parallel_workers_per_gather
> ---------------------------------
>  8
> (1 row)
> 
> cmdstaging=# show max_worker_processes ;
>  max_worker_processes
> ----------------------
>  8
> (1 row)
> 
> cmdstaging=# show max_parallel_workers;
>  max_parallel_workers
> ----------------------
>  8
> (1 row)
> 
> 
> 
> 
> The query completes only omitting the LIMIT clause or when I disable
> parallel processing: 
> 
>  id   | vendor_id | gaa |   pop_name    | pop_status | pop_location |
> pop_provider_id | pop_provider |     pop_street      | pop_city | pop
> _postal_code | pop_state | pop_country | pop_country_id
>
--------+-----------+-----+---------------+------------+--------------+-----------------+--------------+---------------------+----------+----
> -------------+-----------+-------------+----------------
>  684807 |     12346 |     | GTT/POP/LON1T | Active     | LON1T        | 
>          12288 | Telehouse UK | 14 Coriander Avenue | London   | E14
>  2AA         |           | GB          |            219
> (1 row)
> 
> Time: 4374.759 ms (00:04.375)
> cmdstaging=# show max_parallel_workers_per_gather ;
>  max_parallel_workers_per_gather
> ---------------------------------
>  0
> (1 row)
> 
> Time: 0.097 ms
> 
> 
> Otherwise it just keep running for forever. 
> 

When you say "running forever" is it actually using CPU, or does it get
stuck on something?

> 
> This is the full query: 
> 
> 
> SELECT * FROM (
> SELECT
> seg.circuit_id AS id,
> vendor_gtt_pop.vendor_id,
> CASE WHEN vendor_gtt_pop.push_to_gaa = 1 THEN 'Y' END as gaa,
> pop.gii_circuitid AS pop_name,
> cst.label AS pop_status,
> seg.a_company_name AS pop_location,
> seg.vendor_id AS pop_provider_id,
> seg.vendor_name AS pop_provider,
> cs.address1 AS pop_street,
> cs.city AS pop_city,
> cs.postal_code AS pop_postal_code,
> cs.state AS pop_state,
> csc.code AS pop_country,
> cs.country_id AS pop_country_id
> FROM (
> SELECT c.gii_circuitid, max(so.id <http://so.id>) AS service_order_id
> FROM service_order so
> join circuit c on c.product_id=so.product_id
> join master_service_order mso on mso.id <http://mso.id>=so.master_service_order_id
> WHERE NOT (so.ordertype_id = 2 AND so.status <> 999) AND
> NOT (so.ordertype_id = 3 AND so.status <> 999) AND
> c.status >= 20 AND
> c.status not in (160,999) AND
> mso.client_id=11615 AND
> c.service_description=28 AND
> c.status!=160
> GROUP BY c.gii_circuitid
> ) pop
> JOIN service_order so ON so.id <http://so.id> = pop.service_order_id
> left JOIN client_site cs on cs.id <http://cs.id>=so.a_site_id
> left JOIN country csc on csc.id <http://csc.id>=cs.country_id
> JOIN circuit c ON so.product_id=c.product_id
> JOIN circuit_status cst ON cst.id <http://cst.id>=c.status
> JOIN (
> SELECT c.id <http://c.id> AS circuit_id, sg.id <http://sg.id> AS segment_id, c.pop_support_vendor_id AS vendor_id,
> v.name <http://v.name> AS vendor_name, sg.a_company_name
> FROM segment sg
> JOIN circuit_layout cl ON cl.segment_id = sg.id <http://sg.id> AND cl.ordinal = 1
> JOIN circuit c ON c.id <http://c.id> = cl.circuit_id
> JOIN vendor v ON v.id <http://v.id> = c.pop_support_vendor_id
> ) seg ON seg.circuit_id = c.id <http://c.id>
> JOIN vendor_gtt_pop on vendor_gtt_pop.gtt_pop_id = seg.circuit_id
> ) foo where vendor_id = 12346 AND (1=1) ORDER BY pop_name ASC,id LIMIT 10
> 
> 
> 
> Execution plan with  max_parallel_workers_per_gather =0 , 
> max_parallel_workers_per_gather =8 and no LIMIT clause : 
> 
> 
We really need to see the execution plan that causes issues, i.e.
max_parallel_workers_per_gather=8 with LIMIT clause. Plain explain
(without analyze), at least.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Thanks for your reply Tomas.  The query just got stuck for forever.  I observed no CPU spikes, it is currently running and I see 89 of the CPU idle. I'm using a relative powerfull machine (120 GB ram, 56 CPU).  


Plain analyze as requested. : 


                                                                                                QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------
 Limit  (cost=253523.56..253523.57 rows=1 width=176)
   ->  Sort  (cost=253523.56..253523.57 rows=1 width=176)
         Sort Key: c_2.gii_circuitid, c_1.id
         ->  Nested Loop  (cost=33190.89..253523.55 rows=1 width=176)
               Join Filter: (c_1.id = c.id)
               ->  Nested Loop  (cost=31724.87..31736.29 rows=1 width=85)
                     Join Filter: (c.status = cst.id)
                     ->  Nested Loop  (cost=31724.87..31734.84 rows=1 width=74)
                           ->  Nested Loop Left Join  (cost=31724.45..31734.35 rows=1 width=70)
                                 Join Filter: (csc.id = cs.country_id)
                                 ->  Nested Loop Left Join  (cost=31724.45..31726.73 rows=1 width=68)
                                       ->  Nested Loop  (cost=31724.02..31726.27 rows=1 width=30)
                                             ->  GroupAggregate  (cost=31723.60..31723.62 rows=1 width=26)
                                                   Group Key: c_2.gii_circuitid
                                                   ->  Sort  (cost=31723.60..31723.60 rows=1 width=26)
                                                         Sort Key: c_2.gii_circuitid
                                                         ->  Gather  (cost=1000.85..31723.59 rows=1 width=26)
                                                               Workers Planned: 3
                                                               ->  Nested Loop  (cost=0.85..30723.49 rows=1 width=26)
                                                                     ->  Nested Loop  (cost=0.42..30722.56 rows=2 width=30)
                                                                           ->  Parallel Seq Scan on circuit c_2  (cost=0.00..30714.61 rows=3
width=26)
                                                                                 Filter: ((status >= 20) AND (status <> ALL ('{160,999}'::int
eger[])) AND (status <> 160) AND (service_description = 28))
                                                                           ->  Index Scan using so_pid_idx on service_order so_1  (cost=0.42.
.2.65 rows=1 width=12)
                                                                                 Index Cond: (product_id = c_2.product_id)
                                                                                 Filter: (((ordertype_id <> 2) OR (status = 999)) AND ((order
type_id <> 3) OR (status = 999)))
                                                                     ->  Index Scan using master_service_order_id_key on master_service_order
 mso  (cost=0.42..0.46 rows=1 width=4)
                                                                           Index Cond: (id = so_1.master_service_order_id)
                                                                           Filter: (client_id = 11615)
                                             ->  Index Scan using service_order_id_key on service_order so  (cost=0.42..2.64 rows=1 width=12)
                                                   Index Cond: (id = (max(so_1.id)))
                                       ->  Index Scan using client_site_pkey on client_site cs  (cost=0.42..0.46 rows=1 width=46)
                                             Index Cond: (id = so.a_site_id)
                                 ->  Seq Scan on country csc  (cost=0.00..4.50 rows=250 width=6)
                           ->  Index Scan using circuit_product_id_idx on circuit c  (cost=0.42..0.49 rows=1 width=12)
                                 Index Cond: (product_id = so.product_id)
                     ->  Seq Scan on circuit_status cst  (cost=0.00..1.20 rows=20 width=19)
               ->  Gather  (cost=1466.02..221787.23 rows=3 width=75)
                     Workers Planned: 5
                     ->  Hash Join  (cost=466.02..220786.93 rows=1 width=75)
                           Hash Cond: (c_1.id = vendor_gtt_pop.gtt_pop_id)
                           ->  Hash Join  (cost=444.07..219779.19 rows=157724 width=63)
                                 Hash Cond: (c_1.pop_support_vendor_id = v.id)
                                 ->  Merge Join  (cost=5.02..217348.87 rows=157724 width=40)
                                       Merge Cond: (cl.circuit_id = c_1.id)
                                       ->  Nested Loop  (cost=0.86..171314.49 rows=157724 width=32)
                                             ->  Parallel Index Only Scan using circuit_layout_idx on circuit_layout cl  (cost=0.43..55430.93
 rows=157724 width=8)
                                                   Index Cond: (ordinal = 1)
                                             ->  Index Scan using uniqid on segment sg  (cost=0.43..0.73 rows=1 width=32)
                                                   Index Cond: (id = cl.segment_id)
                                       ->  Index Scan using circuit_id_key on circuit c_1  (cost=0.42..41790.58 rows=909014 width=8)
                                 ->  Hash  (cost=325.69..325.69 rows=9069 width=27)
                                       ->  Seq Scan on vendor v  (cost=0.00..325.69 rows=9069 width=27)
                           ->  Hash  (cost=21.91..21.91 rows=3 width=12)
                                 ->  Seq Scan on vendor_gtt_pop  (cost=0.00..21.91 rows=3 width=12)
                                       Filter: (vendor_id = 12346)
(55 rows)





On Wed, Mar 21, 2018 at 8:01 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:


On 03/21/2018 05:09 PM, Alessandro Aste wrote:
> Hi there, we are using postgresql 10.3 and we're facing an issue with a
> query. The query (full query below)  completes only  when: 
>
> 1 - LIMIT 10 is removed
> or
> 2 -  show max_parallel_workers_per_gather  is set to 0, so parallel
> processing is disabled.
>
> With  max_parallel_workers_per_gather   set to the default value (8) I'm
> not even able to get the query plan.
>
> Notes: 
>
>   * We're experiencing the issue in any server of ours but I've
>     reproduced the issue in a fresh restored database with full
>     vacuum/reindex of the tables.
>   * We didn't touch any parameter concering the parallel processing,
>     we're running the defaults: 
>
>
> cmdstaging=# show max_parallel_workers_per_gather ;
>  max_parallel_workers_per_gather
> ---------------------------------
>  8
> (1 row)
>
> cmdstaging=# show max_worker_processes ;
>  max_worker_processes
> ----------------------
>  8
> (1 row)
>
> cmdstaging=# show max_parallel_workers;
>  max_parallel_workers
> ----------------------
>  8
> (1 row)
>
>
>
>
> The query completes only omitting the LIMIT clause or when I disable
> parallel processing: 
>
>  id   | vendor_id | gaa |   pop_name    | pop_status | pop_location |
> pop_provider_id | pop_provider |     pop_street      | pop_city | pop
> _postal_code | pop_state | pop_country | pop_country_id
> --------+-----------+-----+---------------+------------+--------------+-----------------+--------------+---------------------+----------+----
> -------------+-----------+-------------+----------------
>  684807 |     12346 |     | GTT/POP/LON1T | Active     | LON1T        | 
>          12288 | Telehouse UK | 14 Coriander Avenue | London   | E14
>  2AA         |           | GB          |            219
> (1 row)
>
> Time: 4374.759 ms (00:04.375)
> cmdstaging=# show max_parallel_workers_per_gather ;
>  max_parallel_workers_per_gather
> ---------------------------------
>  0
> (1 row)
>
> Time: 0.097 ms
>
>
> Otherwise it just keep running for forever. 
>

When you say "running forever" is it actually using CPU, or does it get
stuck on something?

>
> This is the full query: 
>
>
> SELECT * FROM (
> SELECT
> seg.circuit_id AS id,
> vendor_gtt_pop.vendor_id,
> CASE WHEN vendor_gtt_pop.push_to_gaa = 1 THEN 'Y' END as gaa,
> pop.gii_circuitid AS pop_name,
> cst.label AS pop_status,
> seg.a_company_name AS pop_location,
> seg.vendor_id AS pop_provider_id,
> seg.vendor_name AS pop_provider,
> cs.address1 AS pop_street,
> cs.city AS pop_city,
> cs.postal_code AS pop_postal_code,
> cs.state AS pop_state,
> csc.code AS pop_country,
> cs.country_id AS pop_country_id
> FROM (
> SELECT c.gii_circuitid, max(so.id <http://so.id>) AS service_order_id
> FROM service_order so
> join circuit c on c.product_id=so.product_id
> join master_service_order mso on mso.id <http://mso.id>=so.master_service_order_id
> WHERE NOT (so.ordertype_id = 2 AND so.status <> 999) AND
> NOT (so.ordertype_id = 3 AND so.status <> 999) AND
> c.status >= 20 AND
> c.status not in (160,999) AND
> mso.client_id=11615 AND
> c.service_description=28 AND
> c.status!=160
> GROUP BY c.gii_circuitid
> ) pop
> JOIN service_order so ON so.id <http://so.id> = pop.service_order_id
> left JOIN client_site cs on cs.id <http://cs.id>=so.a_site_id
> left JOIN country csc on csc.id <http://csc.id>=cs.country_id
> JOIN circuit c ON so.product_id=c.product_id
> JOIN circuit_status cst ON cst.id <http://cst.id>=c.status
> JOIN (
> SELECT c.id <http://c.id> AS circuit_id, sg.id <http://sg.id> AS segment_id, c.pop_support_vendor_id AS vendor_id,
> v.name <http://v.name> AS vendor_name, sg.a_company_name
> FROM segment sg
> JOIN circuit_layout cl ON cl.segment_id = sg.id <http://sg.id> AND cl.ordinal = 1
> JOIN circuit c ON c.id <http://c.id> = cl.circuit_id
> JOIN vendor v ON v.id <http://v.id> = c.pop_support_vendor_id
> ) seg ON seg.circuit_id = c.id <http://c.id>
> JOIN vendor_gtt_pop on vendor_gtt_pop.gtt_pop_id = seg.circuit_id
> ) foo where vendor_id = 12346 AND (1=1) ORDER BY pop_name ASC,id LIMIT 10
>
>
>
> Execution plan with  max_parallel_workers_per_gather =0 , 
> max_parallel_workers_per_gather =8 and no LIMIT clause : 
>
>
We really need to see the execution plan that causes issues, i.e.
max_parallel_workers_per_gather=8 with LIMIT clause. Plain explain
(without analyze), at least.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 03/21/2018 08:44 PM, Alessandro Aste wrote:
> Thanks for your reply Tomas.  The query just got stuck for forever.  I
> observed no CPU spikes, it is currently running and I see 89 of the CPU
> idle. I'm using a relative powerfull machine (120 GB ram, 56 CPU).  
> 

That doesn't really answer the question, I'm afraid. I suppose "89 of
CPU" means that 89% idle in total, but 11% with 56 CPUs still means
about 6 cores 100% busy. But maybe you meant something else?

Is there something else running on the machine? If you look at "top" are
the processes (the one you're connected to and the parallel workers)
doing something on the CPU?

> 
> Plain analyze as requested. : 
> 

I don't see anything obviously broken with the query plan, and it's
difficult to compare with the other plans because they are quite different.

But there's one part of the plan interesting:

 Limit  (cost=253523.56..253523.57 rows=1 width=176)
   ->  Sort  (cost=253523.56..253523.57 rows=1 width=176)
         Sort Key: c_2.gii_circuitid, c_1.id
         ->  Nested Loop  (cost=33190.89..253523.55 rows=1 width=176)
               Join Filter: (c_1.id = c.id)
               ->  Nested Loop  (cost=31724.87..31736.29 rows=1 ...)
                     ...
               ->  Gather  (cost=1466.02..221787.23 rows=3 width=75)
                     Workers Planned: 5
                     ->  Hash Join  (cost=466.02..220786.93 rows=1 ...)
                         ...

That is, there's a Gather on the inner side of a Nested Loop. I wonder
if that might cause issues in case of under-estimate (in which case we'd
be restarting the Gather many times) ...


BTW one of the plans you sent earlier is incomplete, because it ends
like this:

    ->  Nested Loop  (cost=42469.41..42480.82 rows=1 width=85) (...)
        Join Filter: (c.status = cst.id)
    Time: 3016.688 ms (00:03.017)

That is, it's missing the part below the join.


That being said, I'm not sure what's the issue here. Can you prepare a
self-contained test case that we might use to reproduce the issue? For
example by dumping the relevant part of the tables?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Tomas, thank you. This machine  is abare metal server running only a staging postgresql 10.3  instance. Nobody is using it beside me. 

I'm attaching 4 files.  

every_30_seconds_top_stats_during_query.txt    - this is a caputure of the top command every 30 seconds(more or less) for 10+ minutes while I'm running the query. Let me know if this helps to answere your question.
EXPLAIN_WITH_LIMIT_AND_max_parallel_workers_per_gather_SET_TO_0.txt   -  query plan  with full query and  max_parallel_workers_per_gather  force to 0. Full output.
EXPLAIN_WITH_LIMIT_AND_DEFAULT_PARALLEL_PROCESSING_SETTINGS.txt  -    query plan with full query and  default parellel processing settings.  Full output.
EXPLAIN_WITHOUT_LIMIT_AND_DEFAULT_PARALLEL_PROCESSING_SETTINGS.tx  - query plan of the query omitting the LIMIT clause and default parellel processing settings. Full output.


For what concerns the  self-contained test case - I'll do my best to prepare it.

Thank you very much, please let me know if this answer your questions. 






Il 22 mar 2018 3:04 AM, "Tomas Vondra" <tomas.vondra@2ndquadrant.com> ha scritto:

On 03/21/2018 08:44 PM, Alessandro Aste wrote:
> Thanks for your reply Tomas.  The query just got stuck for forever.  I
> observed no CPU spikes, it is currently running and I see 89 of the CPU
> idle. I'm using a relative powerfull machine (120 GB ram, 56 CPU).  
>

That doesn't really answer the question, I'm afraid. I suppose "89 of
CPU" means that 89% idle in total, but 11% with 56 CPUs still means
about 6 cores 100% busy. But maybe you meant something else?

Is there something else running on the machine? If you look at "top" are
the processes (the one you're connected to and the parallel workers)
doing something on the CPU?

>
> Plain analyze as requested. : 
>

I don't see anything obviously broken with the query plan, and it's
difficult to compare with the other plans because they are quite different.

But there's one part of the plan interesting:

 Limit  (cost=253523.56..253523.57 rows=1 width=176)
   ->  Sort  (cost=253523.56..253523.57 rows=1 width=176)
         Sort Key: c_2.gii_circuitid, c_1.id
         ->  Nested Loop  (cost=33190.89..253523.55 rows=1 width=176)
               Join Filter: (c_1.id = c.id)
               ->  Nested Loop  (cost=31724.87..31736.29 rows=1 ...)
                     ...
               ->  Gather  (cost=1466.02..221787.23 rows=3 width=75)
                     Workers Planned: 5
                     ->  Hash Join  (cost=466.02..220786.93 rows=1 ...)
                         ...

That is, there's a Gather on the inner side of a Nested Loop. I wonder
if that might cause issues in case of under-estimate (in which case we'd
be restarting the Gather many times) ...


BTW one of the plans you sent earlier is incomplete, because it ends
like this:

    ->  Nested Loop  (cost=42469.41..42480.82 rows=1 width=85) (...)
        Join Filter: (c.status = cst.id)
    Time: 3016.688 ms (00:03.017)

That is, it's missing the part below the join.


That being said, I'm not sure what's the issue here. Can you prepare a
self-contained test case that we might use to reproduce the issue? For
example by dumping the relevant part of the tables?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment

On 03/22/2018 11:33 AM, Alessandro Aste wrote:
> Tomas, thank you. This machine  is abare metal server running only a
> staging postgresql 10.3  instance. Nobody is using it beside me. 
> 
> I'm attaching 4 files.  
> 
> every_30_seconds_top_stats_during_query.txt    - this is a caputure of
> the top command every 30 seconds(more or less) for 10+ minutes while I'm
> running the query. Let me know if this helps to answere your question.
> EXPLAIN_WITH_LIMIT_AND_max_parallel_workers_per_gather_SET_TO_0.txt   - 
> query plan  with full query and  max_parallel_workers_per_gather  force
> to 0. Full output.
> EXPLAIN_WITH_LIMIT_AND_DEFAULT_PARALLEL_PROCESSING_SETTINGS.txt  - 
>  query plan with full query and  default parellel processing settings. 
> Full output.
> EXPLAIN_WITHOUT_LIMIT_AND_DEFAULT_PARALLEL_PROCESSING_SETTINGS.tx  -
> query plan of the query omitting the LIMIT clause and default parellel
> processing settings. Full output.
> 

OK. Looking at the top output, I see this:

       PID USER       VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    104880 postgres  30.8g 1.9g 1.9g R 92.0  1.5  15:15.60 postmaster
    111732 postgres  30.8g 476m 473m R 88.2  0.4   0:00.47 postmaster
    111730 postgres  30.8g 473m 470m R 86.4  0.4   0:00.46 postmaster
    111731 postgres  30.8g 476m 473m R 86.4  0.4   0:00.46 postmaster
    111733 postgres  30.8g 476m 473m R 86.4  0.4   0:00.46 postmaster
    111734 postgres  30.8g 476m 473m R 86.4  0.4   0:00.46 postmaster
    111728 root      15824 1912  828 R  3.8  0.0   0:00.04 top

That means it certainly is not stuck, it's simply doing a lot of work on
CPU. The question is why and what it's doing ...

Can you collect some CPU profiles using perf? There's a howto here:

    https://wiki.postgresql.org/wiki/Profiling_with_perf

But in short - install perf, install debuginfo packages for postgres,
and then do

    perf record -a -g -s sleep 60

while running the query. Once the profile data is collected, do

    perf report > report.txt

and share the report.txt with us (well, if it's very large you may need
to only cut the first 1000 lines or so).

That should tell us in which functions most of the time is spent. That
will give us some hints, hopefully.

> 
> For what concerns the  self-contained test case - I'll do my best to
> prepare it.
> 

Yeah, that would be helpful.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Thanks Tomas. We're currently building postgres from source. In order to enable symbols, you want me to re-configure postres with  --enable-debug then run perf?

Regards,

On Thu, Mar 22, 2018 at 5:00 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:


On 03/22/2018 11:33 AM, Alessandro Aste wrote:
> Tomas, thank you. This machine  is abare metal server running only a
> staging postgresql 10.3  instance. Nobody is using it beside me. 
>
> I'm attaching 4 files.  
>
> every_30_seconds_top_stats_during_query.txt    - this is a caputure of
> the top command every 30 seconds(more or less) for 10+ minutes while I'm
> running the query. Let me know if this helps to answere your question.
> EXPLAIN_WITH_LIMIT_AND_max_parallel_workers_per_gather_SET_TO_0.txt   - 
> query plan  with full query and  max_parallel_workers_per_gather  force
> to 0. Full output.
> EXPLAIN_WITH_LIMIT_AND_DEFAULT_PARALLEL_PROCESSING_SETTINGS.txt  - 
>  query plan with full query and  default parellel processing settings. 
> Full output.
> EXPLAIN_WITHOUT_LIMIT_AND_DEFAULT_PARALLEL_PROCESSING_SETTINGS.tx  -
> query plan of the query omitting the LIMIT clause and default parellel
> processing settings. Full output.
>

OK. Looking at the top output, I see this:

       PID USER       VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    104880 postgres  30.8g 1.9g 1.9g R 92.0  1.5  15:15.60 postmaster
    111732 postgres  30.8g 476m 473m R 88.2  0.4   0:00.47 postmaster
    111730 postgres  30.8g 473m 470m R 86.4  0.4   0:00.46 postmaster
    111731 postgres  30.8g 476m 473m R 86.4  0.4   0:00.46 postmaster
    111733 postgres  30.8g 476m 473m R 86.4  0.4   0:00.46 postmaster
    111734 postgres  30.8g 476m 473m R 86.4  0.4   0:00.46 postmaster
    111728 root      15824 1912  828 R  3.8  0.0   0:00.04 top

That means it certainly is not stuck, it's simply doing a lot of work on
CPU. The question is why and what it's doing ...

Can you collect some CPU profiles using perf? There's a howto here:

    https://wiki.postgresql.org/wiki/Profiling_with_perf

But in short - install perf, install debuginfo packages for postgres,
and then do

    perf record -a -g -s sleep 60

while running the query. Once the profile data is collected, do

    perf report > report.txt

and share the report.txt with us (well, if it's very large you may need
to only cut the first 1000 lines or so).

That should tell us in which functions most of the time is spent. That
will give us some hints, hopefully.

>
> For what concerns the  self-contained test case - I'll do my best to
> prepare it.
>

Yeah, that would be helpful.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 03/22/2018 11:29 PM, Alessandro Aste wrote:
> Thanks Tomas. We're currently building postgres from source. In order to
> enable symbols, you want me to re-configure postres with  --enable-debug
> then run perf?
> 

Yes.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Tomas, I'm attaching a 4MB file with the perf report. Let me know if it gets blocked, I'll shrink it to the first 1000 lines. 

Thank you,

Alessandro.

On Thu, Mar 22, 2018 at 11:42 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/22/2018 11:29 PM, Alessandro Aste wrote:
> Thanks Tomas. We're currently building postgres from source. In order to
> enable symbols, you want me to re-configure postres with  --enable-debug
> then run perf?
>

Yes.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
PS , in the meanwhile I discovered a 2nd workaround(beside disabling parallel processing) . I added offset  0 to the subquery , and, according to the documentation, “OFFSET 0 is the same as omitting the OFFSET clause” - https://www.postgresql.org/docs/current/static/queries-limit.html

cmd3dev=# show max_parallel_workers_per_gather ;

max_parallel_workers_per_gather

---------------------------------

8

(1 row)

 

cmd3dev=# \timing

Timing is on.

cmd3dev=#  SELECT * FROM (SELECT seg.circuit_id AS id, vendor_gtt_pop.vendor_id, CASE WHEN vendor_gtt_pop.push_to_gaa = 1 THEN 'Y' END as gaa, pop.gii_circuitid AS pop_name, cst.label AS pop_status, seg.a_company_name AS pop_location, seg.vendor_id AS pop_provider_id, seg.vendor_name AS pop_provider, cs.address1 AS pop_street, cs.city AS pop_city, cs.postal_code AS pop_postal_code, cs.state AS pop_state, csc.code AS pop_country, cs.country_id AS pop_country_id FROM ( SELECT c.gii_circuitid, max(so.id) AS service_order_id FROM service_order so join circuit c on c.product_id=so.product_id join master_service_order mso on mso.id=so.master_service_order_id WHERE NOT (so.ordertype_id = 2 AND so.status <> 999) AND NOT (so.ordertype_id = 3 AND so.status <> 999) AND c.status >= 20 AND c.status not in (160,999) AND mso.client_id=11615 AND c.service_description=28 AND c.status!=160 GROUP BY c.gii_circuitid ) pop JOIN service_order so ON so.id = pop.service_order_id left JOIN client_site cs on cs.id=so.a_site_id left JOIN country csc on csc.id=cs.country_id JOIN circuit c ON so.product_id=c.product_id JOIN circuit_status cst ON cst.id=c.status JOIN ( SELECT c.id AS circuit_id, sg.id AS segment_id, c.pop_support_vendor_id AS vendor_id, v.name AS vendor_name, sg.a_company_name FROM segment sg JOIN circuit_layout cl ON cl.segment_id = sg.id AND cl.ordinal = 1 JOIN circuit c ON c.id = cl.circuit_id JOIN vendor v ON v.id = c.pop_support_vendor_id ) seg ON seg.circuit_id = c.id JOIN vendor_gtt_pop on vendor_gtt_pop.gtt_pop_id = seg.circuit_id offset 0) foo where vendor_id = 12346 AND (1=1) ORDER BY pop_name ASC,id desc LIMIT 10;

   id   | vendor_id | gaa |   pop_name    | pop_status | pop_location | pop_provider_id | pop_provider |     pop_street      | pop_city | pop

_postal_code | pop_state | pop_country | pop_country_id

--------+-----------+-----+---------------+------------+--------------+-----------------+--------------+---------------------+----------+----

-------------+-----------+-------------+----------------

684807 |     12346 | Y   | GTT/POP/LON1T | Active     | LON1T        |           12288 | Telehouse UK | 14 Coriander Avenue | London   | E14

2AA         |           | GB          |            219

(1 row)

 

Time: 2245.073 ms (00:02.245)




On Fri, Mar 23, 2018 at 9:31 AM, Alessandro Aste <alessandro.aste@gmail.com> wrote:
Tomas, I'm attaching a 4MB file with the perf report. Let me know if it gets blocked, I'll shrink it to the first 1000 lines. 

Thank you,

Alessandro.

On Thu, Mar 22, 2018 at 11:42 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/22/2018 11:29 PM, Alessandro Aste wrote:
> Thanks Tomas. We're currently building postgres from source. In order to
> enable symbols, you want me to re-configure postres with  --enable-debug
> then run perf?
>

Yes.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Hello,  any news ?

Thank you,

Alessandro.

On Fri, Mar 23, 2018 at 8:22 PM, Alessandro Aste <alessandro.aste@gmail.com> wrote:
PS , in the meanwhile I discovered a 2nd workaround(beside disabling parallel processing) . I added offset  0 to the subquery , and, according to the documentation, “OFFSET 0 is the same as omitting the OFFSET clause” - https://www.postgresql.org/docs/current/static/queries-limit.html

cmd3dev=# show max_parallel_workers_per_gather ;

max_parallel_workers_per_gather

---------------------------------

8

(1 row)

 

cmd3dev=# \timing

Timing is on.

cmd3dev=#  SELECT * FROM (SELECT seg.circuit_id AS id, vendor_gtt_pop.vendor_id, CASE WHEN vendor_gtt_pop.push_to_gaa = 1 THEN 'Y' END as gaa, pop.gii_circuitid AS pop_name, cst.label AS pop_status, seg.a_company_name AS pop_location, seg.vendor_id AS pop_provider_id, seg.vendor_name AS pop_provider, cs.address1 AS pop_street, cs.city AS pop_city, cs.postal_code AS pop_postal_code, cs.state AS pop_state, csc.code AS pop_country, cs.country_id AS pop_country_id FROM ( SELECT c.gii_circuitid, max(so.id) AS service_order_id FROM service_order so join circuit c on c.product_id=so.product_id join master_service_order mso on mso.id=so.master_service_order_id WHERE NOT (so.ordertype_id = 2 AND so.status <> 999) AND NOT (so.ordertype_id = 3 AND so.status <> 999) AND c.status >= 20 AND c.status not in (160,999) AND mso.client_id=11615 AND c.service_description=28 AND c.status!=160 GROUP BY c.gii_circuitid ) pop JOIN service_order so ON so.id = pop.service_order_id left JOIN client_site cs on cs.id=so.a_site_id left JOIN country csc on csc.id=cs.country_id JOIN circuit c ON so.product_id=c.product_id JOIN circuit_status cst ON cst.id=c.status JOIN ( SELECT c.id AS circuit_id, sg.id AS segment_id, c.pop_support_vendor_id AS vendor_id, v.name AS vendor_name, sg.a_company_name FROM segment sg JOIN circuit_layout cl ON cl.segment_id = sg.id AND cl.ordinal = 1 JOIN circuit c ON c.id = cl.circuit_id JOIN vendor v ON v.id = c.pop_support_vendor_id ) seg ON seg.circuit_id = c.id JOIN vendor_gtt_pop on vendor_gtt_pop.gtt_pop_id = seg.circuit_id offset 0) foo where vendor_id = 12346 AND (1=1) ORDER BY pop_name ASC,id desc LIMIT 10;

   id   | vendor_id | gaa |   pop_name    | pop_status | pop_location | pop_provider_id | pop_provider |     pop_street      | pop_city | pop

_postal_code | pop_state | pop_country | pop_country_id

--------+-----------+-----+---------------+------------+--------------+-----------------+--------------+---------------------+----------+----

-------------+-----------+-------------+----------------

684807 |     12346 | Y   | GTT/POP/LON1T | Active     | LON1T        |           12288 | Telehouse UK | 14 Coriander Avenue | London   | E14

2AA         |           | GB          |            219

(1 row)

 

Time: 2245.073 ms (00:02.245)




On Fri, Mar 23, 2018 at 9:31 AM, Alessandro Aste <alessandro.aste@gmail.com> wrote:
Tomas, I'm attaching a 4MB file with the perf report. Let me know if it gets blocked, I'll shrink it to the first 1000 lines. 

Thank you,

Alessandro.

On Thu, Mar 22, 2018 at 11:42 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/22/2018 11:29 PM, Alessandro Aste wrote:
> Thanks Tomas. We're currently building postgres from source. In order to
> enable symbols, you want me to re-configure postres with  --enable-debug
> then run perf?
>

Yes.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi,

the perf profiles look really weird - clearly, there's a lot of lock
contention, because the top item is this

    13.49%    13.40%  postmaster   postgres  [.]   LWLockAcquire
                 |

                 ---LWLockAcquire



That's a sign of lock contention - not sure which one, though. And then
there's heap_hot_search_buffer

    12.65%     7.92%  postmaster   postgres [.] heap_hot_search_buffer
                 |

                 ---heap_hot_search_buffer


So either there's a table with many HOT-updated rows (repeatedly), or we
search the HOT chains very frequently for some reason. Considering it
only affects the non-LIMIT case, I'd guess the latter.

There also seems to be quite a few page faults, for some reason.

Trying to reproduce this without the actual data is rather futile. We
need some sort of reproducer (sample data to test the query on).

regards

On 03/28/2018 10:30 AM, Alessandro Aste wrote:
> Hello,  any news ?
> 
> Thank you,
> 
> Alessandro.
> 
> On Fri, Mar 23, 2018 at 8:22 PM, Alessandro Aste
> <alessandro.aste@gmail.com <mailto:alessandro.aste@gmail.com>> wrote:
> 
>     PS , in the meanwhile I discovered a 2nd workaround(beside disabling
>     parallel processing) . I added offset  0 to the subquery , and,
>     according to the documentation, “OFFSET 0 is the same as omitting
>     the OFFSET clause”
>     - https://www.postgresql.org/docs/current/static/queries-limit.html
>     <https://www.postgresql.org/docs/current/static/queries-limit.html>
> 
>     cmd3dev=# show max_parallel_workers_per_gather ;
> 
>     *max_parallel_workers_per_gather*
> 
>     *---------------------------------*
> 
>     *8*
> 
>     (1 row)
> 
>      
> 
>     cmd3dev=# \timing
> 
>     Timing is on.
> 
>     cmd3dev=#  SELECT * FROM (SELECT seg.circuit_id AS id,
>     vendor_gtt_pop.vendor_id, CASE WHEN vendor_gtt_pop.push_to_gaa = 1
>     THEN 'Y' END as gaa, pop.gii_circuitid AS pop_name, cst.label AS
>     pop_status, seg.a_company_name AS pop_location, seg.vendor_id AS
>     pop_provider_id, seg.vendor_name AS pop_provider, cs.address1 AS
>     pop_street, cs.city AS pop_city, cs.postal_code AS pop_postal_code,
>     cs.state AS pop_state, csc.code AS pop_country, cs.country_id AS
>     pop_country_id FROM ( SELECT c.gii_circuitid, max(so.id
>     <http://so.id>) AS service_order_id FROM service_order so join
>     circuit c on c.product_id=so.product_id join master_service_order
>     mso on mso.id <http://mso.id>=so.master_service_order_id WHERE NOT
>     (so.ordertype_id = 2 AND so.status <> 999) AND NOT (so.ordertype_id
>     = 3 AND so.status <> 999) AND c.status >= 20 AND c.status not in
>     (160,999) AND mso.client_id=11615 AND c.service_description=28 AND
>     c.status!=160 GROUP BY c.gii_circuitid ) pop JOIN service_order so
>     ON so.id <http://so.id> = pop.service_order_id left JOIN client_site
>     cs on cs.id <http://cs.id>=so.a_site_id left JOIN country csc on
>     csc.id <http://csc.id>=cs.country_id JOIN circuit c ON
>     so.product_id=c.product_id JOIN circuit_status cst ON cst.id
>     <http://cst.id>=c.status JOIN ( SELECT c.id <http://c.id> AS
>     circuit_id, sg.id <http://sg.id> AS segment_id,
>     c.pop_support_vendor_id AS vendor_id, v.name <http://v.name> AS
>     vendor_name, sg.a_company_name FROM segment sg JOIN circuit_layout
>     cl ON cl.segment_id = sg.id <http://sg.id> AND cl.ordinal = 1 JOIN
>     circuit c ON c.id <http://c.id> = cl.circuit_id JOIN vendor v ON
>     v.id <http://v.id> = c.pop_support_vendor_id ) seg ON seg.circuit_id
>     = c.id <http://c.id> JOIN vendor_gtt_pop on
>     vendor_gtt_pop.gtt_pop_id = seg.circuit_id offset 0) foo where
>     vendor_id = 12346 AND (1=1) ORDER BY pop_name ASC,id desc LIMIT 10;
> 
>        id   | vendor_id | gaa |   pop_name    | pop_status |
>     pop_location | pop_provider_id | pop_provider |     pop_street     
>     | pop_city | pop
> 
>     _postal_code | pop_state | pop_country | pop_country_id
> 
>
--------+-----------+-----+---------------+------------+--------------+-----------------+--------------+---------------------+----------+----
> 
>     -------------+-----------+-------------+----------------
> 
>     684807 |     12346 | Y   | GTT/POP/LON1T | Active     | LON1T       
>     |           12288 | Telehouse UK | 14 Coriander Avenue | London   | E14
> 
>     2AA         |           | GB          |            219
> 
>     (1 row)
> 
>      
> 
>     *Time: 2245.073 ms (00:02.245)*
> 
> 
> 
> 
>     On Fri, Mar 23, 2018 at 9:31 AM, Alessandro Aste
>     <alessandro.aste@gmail.com <mailto:alessandro.aste@gmail.com>> wrote:
> 
>         Tomas, I'm attaching a 4MB file with the perf report. Let me
>         know if it gets blocked, I'll shrink it to the first 1000 lines. 
> 
>         Thank you,
> 
>         Alessandro.
> 
>         On Thu, Mar 22, 2018 at 11:42 PM, Tomas Vondra
>         <tomas.vondra@2ndquadrant.com
>         <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>             On 03/22/2018 11:29 PM, Alessandro Aste wrote:
>             > Thanks Tomas. We're currently building postgres from source. In order to
>             > enable symbols, you want me to re-configure postres with  --enable-debug
>             > then run perf?
>             >
> 
>             Yes.
> 
>             regards
> 
>             --
>             Tomas Vondra                  http://www.2ndQuadrant.com
>             PostgreSQL Development, 24x7 Support, Remote DBA, Training &
>             Services
> 
> 
> 
> 

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services