BUG #18005: PSQL Process hangs in parallel mode - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #18005: PSQL Process hangs in parallel mode
Date
Msg-id 18005-5cd026ae9ac57c17@postgresql.org
Whole thread Raw
Responses Re: BUG #18005: PSQL Process hangs in parallel mode  (Laurenz Albe <laurenz.albe@cybertec.at>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18005
Logged by:          Serge Bellina
Email address:      ess.bee59@gmail.com
PostgreSQL version: 14.8
Operating system:   Ubuntu-2204-jammy-amd64-base
Description:

Dear postgresql team!

here additional / new informations to BUG #17990 and BUG #17999.

I am using postgresql/postgis to calculate (better "estimate") pseudo-osm
tags to enhance the route-calculation in the "Brouter" for bikers .
Each continent is calculated in a separate database (Europe,
north-america,
etc..)

For the planet the "sequential" calculation runs whithout errors, but it
needs 26 hours. 

Unfortunately the "long" SQL´s (joins) used are running as mono-process
(they are only 1 or 2 % of the time running in 3 parallel processes).
That for I tried to start 2 parallel psql-processes  (Europe in
psql-process
1 and the rest of the planet in psql-process 2)

The problem:
By the first 3 tests psql-process 2 "hanged" every time for hours / never
ends !!!!!

VERY STRANGE: more than 30 SQL#s are used to process each continent, but
psql-process 2 every time hanged on the same SELECT (nearly 10 selects were
processed bevore)
VERY STRANGE: the time stamps of the hanging select for asia (2023-06-26
16:34:26) is identical with the start-time of the select itself (the time
stamp of psql-process 1 for europe is at this time 2023-06-26 17:22:51.41
!!!!)

No message at all in the postgresql or Linux logs!

Just the CPU load indicated that the process was still "working".

Also after the successful termination of the psql-process 1, psql-process 2
hangs longer and longer .... / same state

Kill -9 on the psql-process 2 in that state ( 1+ 2 processses) had no
effect: just a full database restart (administrator) could stop the loop

Remark 1: the second test was done by adding th config parameter
"max_parallel_workers_per_gather = 0" (no parallel processes whithin a
SQL!), but the "hanging" problem also occured Remark 2: after the DB
restart, the previously hanging sql in psql-process 2 was restarted and
ended successfully after 50 minutes. 


Many logs were collected (logs of the SQL´s, top, ps lsof, pg_activity_stat,
memory usage, ..) and could be submitted by need (I do not know the proper
way to do that)

Here some extracts:

  
My configuration:
==> PostgreSQL 14.8 (Ubuntu 14.8-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, 64-bit ==>
postgres@Ubuntu-2204-jammy-amd64-base

-------------------------------
 2023-06-26 16:34:26.163755+02    ==> START TIME of psql-process 2 in the
last test
(1 row)

-- create tags for noise
-- create raw data
--     when several highways-segments are producing noise, aggregate the
noises using the "ST_Union" of the segments!
--     (better as using "sum" or "max" that do not deliver good factors)
SELECT
    m.osm_id losmid, m.highway lhighway, q.highway as qhighway,
q.maxspeed_class,
    case
    when q.highway in ('motorway', 'motorway_link','trunk','trunk_link') and
q.maxspeed_class < 1.1 then
        st_area(st_intersection(m.way, ST_Union( q.way)))      /
st_area(m.way)
    when q.highway in ('motorway', 'motorway_link','trunk','trunk_link')
then
        st_area(st_intersection(m.way, ST_Union( q.way)))        / (1.5 *
st_area(m.way))
    when q.highway in ('primary','primary_link')
and
q.maxspeed_class < 2.1  then
        st_area(st_intersection(m.way, ST_Union( q.way)))       / (2 *
st_area(m.way))
    when q.highway in ('primary','primary_link')      then
        st_area(st_intersection(m.way, ST_Union( q.way)))     / (3 *
st_area(m.way))
    when q.highway in ('secondary')                                    and
q.maxspeed_class < 2.1 then
        st_area(st_intersection(m.way, ST_Union( q.way)))       / (3 *
st_area(m.way))
    when q.highway in ('secondary') then
        st_area(st_intersection(m.way, ST_Union( q.way)))       / (5 *
st_area(m.way))
    end
as noise_factor
into table noise_tmp
FROM osm_line_buf_50 AS m
INNER JOIN osm_line_buf_50 AS q ON ST_Intersects(m.way, q.way) WHERE
m.highway is not null and q.highway in ('motorway',
'motorway_link','trunk','trunk_link','primary','primary_link','secondary')
GROUP BY losmid, lhighway, m.way, q.highway, q.maxspeed_class order by
noise_factor desc;

EXPLAIN
-----------------
 Sort  (cost=337741901772.50..337742103465.40 rows=80677160 width=1020)
   Sort Key: ((st_area(st_intersection(m.way, st_union(q.way),
'-1'::double
precision)) / st_area(m.way))) DESC
   ->  GroupAggregate  (cost=302522177901.86..337695183065.46
rows=80677160
width=1020)
         Group Key: m.osm_id, m.highway, m.way
         ->  Gather Merge  (cost=302522177901.86..317079159018.69
rows=124988590885 width=1675)
               Workers Planned: 2
               ->  Sort  (cost=302522176901.84..302652373350.68
rows=52078579535 width=1675)
                     Sort Key: m.osm_id, m.highway, m.way
                     ->  Nested Loop  (cost=0.42..217422937427.56
rows=52078579535 width=1675)
                           ->  Parallel Seq Scan on osm_poly_no_buf q
(cost=0.00..223957561.03 rows=939141 width=663)
                                 Filter: (((landuse = ANY
('{forest,allotments,flowerbed,orchard,vineyard,recreation_ground,village_green}'::text[]))
OR (leisure = ANY ('{garden,park,nature_reserve}'::text[]))) AND
((st_area((st_transform(way, 4326))::geography, true) / '1000000'::double
precision) < '5000'::double precision))
                           ->  Index Scan using osm_line_buf_50_idx on
osm_line_buf_50 m  (cost=0.42..231193.41 rows=8068 width=1012)
                                 Index Cond: (way && q.way)
                                 Filter: ((highway IS NOT NULL) AND
st_intersects(way, q.way))
 JIT:
   Functions: 12
   Options: Inlining true, Optimization true, Expressions true, Deforming
true
(17 rows)


--SELECT datname, pid, state, query, age(clock_timestamp(), query_start)
AS
age
select *
FROM pg_stat_activity
WHERE state <> 'idle'
    AND query NOT LIKE '% FROM pg_stat_activity %';
   datid   | datname |   pid   | leader_pid | usesysid | usename  |
application_name | client_addr | client_hostname | client_port |        
backend_start         |          xact_start           |
query_start
         |         state_change          | wait_event_type |    wait_event

 | state  | backend_xid | backend_xmin | query_id |

                        query

   |  backend_type


-----------+---------+---------+------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+----------+----------------------------------------------------------------------------------------------------------+-----------------
 105680476 | asia    | 1192433 |            |       10 | postgres | psql

        |             |                 |          -1 | 2023-06-26
16:02:05.768052+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26
16:34:26.164375+02 | 2023-06-26 16:34:26.164375+02 | IO              |
DataFileRead     | active |      289276 |       289208 |          | SELECT



                  +| client backend
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     m.osm_id losmid, m.highway
lhighway, q.highway as qhighway, q.maxspeed_class,                      
+|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     case

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     when q.highway in ('motorway',
'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+|
 ..............
103790736 | europe  | 1183475 |            |       10 | postgres | psql

       |             |                 |          -1 | 2023-06-26
15:21:15.376492+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26
17:22:51.420038+02 | 2023-06-26 17:22:51.420039+02 |                 |

           | active |      289286 |       289276 |          | select xid ,
sum(water_river_see) as river_see

              +| client backend
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | into table river_tmp

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | from (

                                                                +|
 ......
105680476 | asia    | 1197807 |    1192433 |       10 | postgres | psql

       |             |                 |             | 2023-06-26
16:34:26.169606+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26
16:34:26.164375+02 | 2023-06-26 16:34:26.183272+02 | IPC             |
BufferIO         | active |             |       289208 |          | SELECT



                  +| parallel worker
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     m.osm_id losmid, m.highway
lhighway, q.highway as qhighway, q.maxspeed_class,                      
+|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     case

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     when q.highway in ('motorway',
'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+|
           |         |         |            |          |          |

        |             |                 |             |
           |         |         |            |          |          |

        |             |                 |             |
 ..........
105680476 | asia    | 1197808 |    1192433 |       10 | postgres | psql

       |             |                 |             | 2023-06-26
16:34:26.169819+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26
16:34:26.164375+02 | 2023-06-26 16:34:26.183619+02 | IPC             |
BufferIO         | active |             |       289208 |          | SELECT



                  +| parallel worker
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     m.osm_id losmid, m.highway
lhighway, q.highway as qhighway, q.maxspeed_class,                      
+|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     case

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |     when q.highway in ('motorway',
'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+|
           |         |         |            |          |          |

        |             |                 |             |
 ............
103790736 | europe  | 1208383 |    1183475 |       10 | postgres | psql

       |             |                 |             | 2023-06-26
17:22:51.749632+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26
17:22:51.420038+02 | 2023-06-26 17:22:51.76465+02  | IPC             |
MessageQueueSend | active |             |       289276 |          | select
xid , sum(water_river_see) as river_see

                    +| parallel worker
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | into table river_tmp

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | from (

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | SELECT     m.osm_id  as xid,

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          |   st_area(st_intersection(m.way,
ST_Union( q.way))) / st_area(m.way)                                    +|
           |         |         |            |          |          |

        |             |                 |             |
 ..............
103790736 | europe  | 1208384 |    1183475 |       10 | postgres | psql

       |             |                 |             | 2023-06-26
17:22:51.749853+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26
17:22:51.420038+02 | 2023-06-26 17:22:51.764351+02 | IPC             |
MessageQueueSend | active |             |       289276 |          | select
xid , sum(water_river_see) as river_see

                    +| parallel worker
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | into table river_tmp

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | from (

                                                                +|
           |         |         |            |          |          |

        |             |                 |             |
       |                               |                               |

                          |                 |                  |        |

         |              |          | SELECT     m.osm_id  as xid,

                                                                +|
..............

free -h -w
               total        used        free      shared     buffers      
cache   available
Mem:           125Gi        35Gi       785Mi        50Gi       7.0Mi

89Gi        38Gi
Swap:          4.0Gi       4.0Gi       0.0Ki

top -n 1 -u postgres
top - 18:55:27 up 22 days,  5:38,  1 user,  load average: 3.65, 4.03, 4.14
Tasks: 425 total,   2 running, 423 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.8 us,  2.4 sy,  0.0 ni, 87.5 id,  2.2 wa,  0.0 hi,  0.2 si,
0.0
st
MiB Mem : 128504.4 total,    827.2 free,  36121.8 used,  91555.4
buff/cache
MiB Swap:   4091.0 total,      0.0 free,   4091.0 used.  39010.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+
COMMAND
1183475 postgres  20   0   62.9g  60.3g  48.7g R  93.8  48.0 190:34.19
postgres
1192433 postgres  20   0   51.4g  38.3g  38.2g D  18.8  30.5  58:38.73
postgres
1197807 postgres  20   0   51.3g  18.7g  18.6g S  18.8  14.9  34:19.57
postgres
1197808 postgres  20   0   51.3g  18.9g  18.9g S  18.8  15.1  32:08.32
postgres
1169290 postgres  20   0   51.2g 877644 876996 S   0.0   0.7   0:00.64
postgres
1169293 postgres  20   0   51.2g  49.7g  49.7g S   0.0  39.6   3:46.63
postgres
1169294 postgres  20   0   51.2g  24.7g  24.7g S   0.0  19.7   0:15.49
postgres
1169295 postgres  20   0   51.2g  17900  17344 S   0.0   0.0   1:06.14
postgres
1169296 postgres  20   0   51.2g   4036   3200 S   0.0   0.0   0:00.12
postgres
1169297 postgres  20   0   73460   3028   2040 S   0.0   0.0   0:02.27
postgres
1169298 postgres  20   0   51.2g   3704   3036 S   0.0   0.0   0:00.00
postgres
1169387 postgres  20   0    2888    836    836 S   0.0   0.0   0:00.00 sh
1169388 postgres  20   0    2888    856    856 S   0.0   0.0   0:00.00 sh
1173213 postgres  20   0    2888    848    848 S   0.0   0.0   0:00.00 sh
1183474 postgres  20   0   21796   4872   4224 S   0.0   0.0   0:00.01
psql
1183699 postgres  20   0    8784   5248   3472 S   0.0   0.0   0:00.01
bash
1187783 postgres  20   0    2888    912    824 S   0.0   0.0   0:00.00 sh
1192431 postgres  20   0   21796   5476   4244 S   0.0   0.0   0:00.01
psql
1208383 postgres  20   0   61.3g  52.4g  42.3g S   0.0  41.8   2:20.11
postgres
1208384 postgres  20   0   61.3g  53.6g  43.6g S   0.0  42.7   2:21.45
postgres
1229692 postgres  20   0    7500   3884   3488 S   0.0   0.0   0:00.00
moni.sh
1229696 postgres  20   0   10896   4164   3300 R   0.0   0.0   0:00.00 top

ps -ef
postgres 1169290       1  0 14:20 ?        00:00:00
/usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/14/main -c
config_file=/etc/postgresql/14/main/postgresql.conf
postgres 1169293 1169290  0 14:20 ?        00:03:46 postgres: 14/main:
checkpointer
postgres 1169294 1169290  0 14:20 ?        00:00:16 postgres: 14/main:
background writer
postgres 1169295 1169290  0 14:20 ?        00:01:06 postgres: 14/main:
walwriter
postgres 1169296 1169290  0 14:20 ?        00:00:00 postgres: 14/main:
autovacuum launcher
postgres 1169297 1169290  0 14:20 ?        00:00:02 postgres: 14/main:
stats
collector
postgres 1169298 1169290  0 14:20 ?        00:00:00 postgres: 14/main:
logical replication launcher
postgres 1169387       1  0 14:20 ?        00:00:00 /bin/sh
./gen-europe.sh
.
postgres 1169388       1  0 14:20 ?        00:00:00 /bin/sh ./gen-rest.sh
.
postgres 1173213 1169387  0 14:35 ?        00:00:00 /bin/sh
./create-tags.sh
europe
postgres 1183474 1173213  0 15:21 ?        00:00:00
/usr/lib/postgresql/14/bin/psql -v ON_ERROR_STOP=1 -U postgres -d europe
-a
-f all.sql
postgres 1183475 1169290 92 15:21 ?        05:25:22 postgres: 14/main:
postgres europe [local] SELECT
postgres 1187783 1169388  0 15:39 ?        00:00:00 /bin/sh
./create-tags.sh
asia
postgres 1192431 1187783  0 16:02 ?        00:00:00
/usr/lib/postgresql/14/bin/psql -v ON_ERROR_STOP=1 -U postgres -d asia -a
-f
all.sql
postgres 1192433 1169290 29 16:02 ?        01:31:54 postgres: 14/main:
postgres asia [local] SELECT
postgres 1197807 1169290 24 16:34 ?        01:08:08 postgres: 14/main:
parallel worker for PID 1192433
postgres 1197808 1169290 22 16:34 ?        01:02:24 postgres: 14/main:
parallel worker for PID 1192433
postgres 1259514 1169290 34 21:08 ?        00:01:19 postgres: 14/main:
parallel worker for PID 1183475
postgres 1259515 1169290 35 21:08 ?        00:01:20 postgres: 14/main:
parallel worker for PID 1183475

Thank for any help !


pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: BUG #17997: Assert failed in validatePartitionedIndex() when attaching partition index to child of valid index
Next
From: PG Bug reporting form
Date:
Subject: BUG #18006: recovery_target_action=shutdown triggers automatic recovery on next startup (beyond point in time)