Re: ETL - sql orchestrator is stuck when there is not sleep() between queries - Mailing list pgsql-performance

From Allan Barrielle
Subject Re: ETL - sql orchestrator is stuck when there is not sleep() between queries
Date
Msg-id CANrBv8pJ-cAJ1DHXL2Bj7akM-_jqEMDgoaz2dHppjkx_GSxFxg@mail.gmail.com
Whole thread Raw
In response to Re: ETL - sql orchestrator is stuck when there is not sleep() between queries  (Allan Barrielle <allan.barrielle@gmail.com>)
List pgsql-performance
On a different machine, we use 12.7. Still same issue

On Thu, Jul 8, 2021 at 3:49 PM Allan Barrielle <allan.barrielle@gmail.com> wrote:
Hello, 

> Is it true that the SELECTs have no joins in them ?

Yes there is a lot of LEFT JOIN.

When it doesn't work, you could check SELECT * FROM pg_stat_activity, and
>SELECT pg_blocking_pids(pid), * FROM pg_locks, to see what's going on.

I can't see any blocking queries blocking pg_locks, pg_blocking_pids.

> It'd be very useful to get "explain analyze" for a working query and for a
> stuck query.  It sound like the stuck query never finishes, so maybe the second
> part is impossible (?)

We run an explain analysis and we see some very interesting stuff going on.
It seems without explicitly adding a `ANALYZE`, the query has a cost of over billions, so the query is not stuck but took forever.
When I run the same scripts with an ANALYZE right before running the query, the query is exec is 50secondes and the cost is normal

Explain analyze WITHOUT ANALYZE https://explain.depesz.com/s/RaSr
Explain analyze same query WITH ANALYZE BEFORE https://explain.depesz.com/s/tYVl

The configuration is tuned by aws aurora, but this issue happens also with a default config.

allow_system_table_mods,off
application_name,DataGrip 2021.1.3
archive_command,(disabled)
archive_mode,off
archive_timeout,5min
array_nulls,on
authentication_timeout,1min
autovacuum,on
autovacuum_analyze_scale_factor,0.05
autovacuum_analyze_threshold,50
autovacuum_freeze_max_age,200000000
autovacuum_max_workers,12
autovacuum_multixact_freeze_max_age,400000000
autovacuum_naptime,5s
autovacuum_vacuum_cost_delay,1ms
autovacuum_vacuum_cost_limit,1200
autovacuum_vacuum_scale_factor,0.1
autovacuum_vacuum_threshold,50
autovacuum_work_mem,-1
backend_flush_after,0
backslash_quote,safe_encoding
bgwriter_delay,200ms
bgwriter_flush_after,0
bgwriter_lru_maxpages,100
bgwriter_lru_multiplier,2
bonjour,off
bytea_output,hex
check_function_bodies,on
checkpoint_completion_target,0.9
checkpoint_flush_after,0
checkpoint_timeout,15min
checkpoint_warning,30s
client_encoding,UTF8
client_min_messages,notice
commit_delay,0
commit_siblings,5
constraint_exclusion,partition
cpu_index_tuple_cost,0.005
cpu_operator_cost,0.0025
cpu_tuple_cost,0.01
cursor_tuple_fraction,0.1
DateStyle,"ISO, MDY"
db_user_namespace,off
deadlock_timeout,1s
debug_pretty_print,on
debug_print_parse,off
debug_print_plan,off
debug_print_rewritten,off
default_statistics_target,500
default_text_search_config,pg_catalog.simple
default_transaction_deferrable,off
default_transaction_isolation,read committed
default_transaction_read_only,off
dynamic_library_path,$libdir
effective_cache_size,4GB
effective_io_concurrency,600
enable_bitmapscan,on
enable_gathermerge,on
enable_hashagg,on
enable_hashjoin,on
enable_indexonlyscan,on
enable_indexscan,on
enable_material,on
enable_mergejoin,on
enable_nestloop,on
enable_parallel_append,on
enable_parallel_hash,on
enable_partition_pruning,on
enable_partitionwise_aggregate,off
enable_partitionwise_join,off
enable_seqscan,on
enable_sort,on
enable_tidscan,on
escape_string_warning,on
event_source,PostgreSQL
exit_on_error,off
extra_float_digits,3
force_parallel_mode,off
from_collapse_limit,8
fsync,off
full_page_writes,off
geqo,on
geqo_effort,5
geqo_generations,0
geqo_pool_size,0
geqo_seed,0
geqo_selection_bias,2
geqo_threshold,12
gin_fuzzy_search_limit,0
gin_pending_list_limit,4MB
hot_standby,off
hot_standby_feedback,on
huge_pages,try
idle_in_transaction_session_timeout,25min
ignore_checksum_failure,off
ignore_system_indexes,off
IntervalStyle,postgres
jit,off
jit_above_cost,100000
jit_debugging_support,off
jit_dump_bitcode,off
jit_expressions,on
jit_inline_above_cost,500000
jit_optimize_above_cost,500000
jit_profiling_support,off
jit_provider,llvmjit
jit_tuple_deforming,on
join_collapse_limit,8
lc_monetary,C
lc_numeric,C
lc_time,C
listen_addresses,*
lock_timeout,0
lo_compat_privileges,off
maintenance_work_mem,2GB
max_connections,100
max_files_per_process,1000
max_locks_per_transaction,256
max_logical_replication_workers,4
max_parallel_maintenance_workers,12
max_parallel_workers,12
max_parallel_workers_per_gather,6
max_pred_locks_per_page,2
max_pred_locks_per_relation,-2
max_pred_locks_per_transaction,64
max_prepared_transactions,0
max_replication_slots,10
max_stack_depth,6MB
max_standby_archive_delay,30s
max_standby_streaming_delay,14s
max_sync_workers_per_subscription,2
max_wal_senders,0
max_wal_size,8GB
max_worker_processes,12
min_parallel_index_scan_size,512kB
min_parallel_table_scan_size,8MB
min_wal_size,2GB
old_snapshot_threshold,-1
operator_precedence_warning,off
parallel_leader_participation,off
parallel_setup_cost,1000
parallel_tuple_cost,0.1
password_encryption,md5
port,5432
post_auth_delay,0
pre_auth_delay,0
quote_all_identifiers,off
random_page_cost,1.1
restart_after_crash,on
row_security,on
search_path,public
seq_page_cost,1
session_replication_role,origin
shared_buffers,1GB
standard_conforming_strings,on
statement_timeout,0
superuser_reserved_connections,3
synchronize_seqscans,on
synchronous_commit,on
syslog_facility,local0
syslog_ident,postgres
syslog_sequence_numbers,on
syslog_split_messages,on
tcp_keepalives_count,9
tcp_keepalives_idle,7200
tcp_keepalives_interval,75
temp_buffers,8MB
temp_file_limit,-1
TimeZone,UTC
trace_notify,off
trace_recovery_messages,log
trace_sort,off
track_activities,on
track_activity_query_size,4kB
track_commit_timestamp,off
track_counts,on
track_functions,none
track_io_timing,off
transform_null_equals,off
update_process_title,on
vacuum_cleanup_index_scale_factor,0.1
vacuum_cost_delay,0
vacuum_cost_limit,200
vacuum_cost_page_dirty,20
vacuum_cost_page_hit,1
vacuum_cost_page_miss,0
vacuum_defer_cleanup_age,0
vacuum_freeze_min_age,50000000
vacuum_freeze_table_age,150000000
vacuum_multixact_freeze_min_age,5000000
vacuum_multixact_freeze_table_age,150000000
wal_buffers,16MB
wal_compression,off
wal_level,minimal
wal_log_hints,off
wal_receiver_status_interval,10s
wal_receiver_timeout,30s
wal_retrieve_retry_interval,5s
wal_sender_timeout,1min
wal_sync_method,fdatasync
wal_writer_delay,200ms
wal_writer_flush_after,1MB
work_mem,2GB
xmlbinary,base64
xmloption,content
zero_damaged_pages,off

On Thu, Jul 8, 2021 at 2:33 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Jul 08, 2021 at 01:00:28AM +0200, Allan Barrielle wrote:
> All is fine, and can work great.
> But sometimes, some queries that used to take about 20 secs to complete can
> suddenly end in 5mins.
> Important all queries have the same shape -> CREATE TABLE SELECT AS *(a bit
> of transform) FROM TABLE). No update, nothing, it’s dead simple.
>
> Nothing works, it’s very random, some query won’t simply work ( even after
> hours ).

When it doesn't work, you could check SELECT * FROM pg_stat_activity, and
SELECT pg_blocking_pids(pid), * FROM pg_locks, to see what's going on.

> Important all queries have the same shape -> CREATE TABLE SELECT AS *(a bit
> of transform) FROM TABLE). No update, nothing, it’s dead simple.
> We are just trying to copy a table from schema1, to schema2, to schema3 and
> finally schema3. That’s it.

Is it true that the SELECTs have no joins in them ?

Did this ever work better or differently under different versions of postgres ?

> Why does the query never end even after hours ? Why there is no log about
> where the query is stuck.

Please send your nondefault config.
https://wiki.postgresql.org/wiki/Server_Configuration

Also enable logging (I just added this to the wiki).
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Enable_Logging

It'd be very useful to get "explain analyze" for a working query and for a
stuck query.  It sound like the stuck query never finishes, so maybe the second
part is impossible (?)

But it'd be good to get at least "explain" output.  You'd have to edit your sql
script to run an "explain" before each query, and run it, logging the ouput,
until you capture the plan for a stuck query.  Save the output and send here,
along with the query plan for a working query.

--
Justin

pgsql-performance by date:

Previous
From: Allan Barrielle
Date:
Subject: Re: ETL - sql orchestrator is stuck when there is not sleep() between queries
Next
From: Justin Pryzby
Date:
Subject: Re: ETL - sql orchestrator is stuck when there is not sleep() between queries