Thread: ETL - sql orchestrator is stuck when there is no sleep() between queries

ETL - sql orchestrator is stuck when there is no sleep() between queries

From
Allan Barrielle
Date:

Hello guys, I'm facing a problem. Currently I'm working on a data transformation pipeline on Postgres. The strategy is:

We select every table in a given schema (about 50 tables), apply some transformations (CASE WHEN, translations, enums) and load the result into a new schema with a CREATE TABLE SCHEMA_2.table_1 AS SELECT * FROM SCHEMA_1.TABLE_1. Then we do this again about 3 more times, and every time it's a new schema and new tables. We only keep schema1 and never drop it.
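
For illustration, one such statement looks roughly like this (the column names and the CASE expression are simplified placeholders):

CREATE TABLE schema_2.table_1 AS
SELECT
    id,
    CASE WHEN status = 'A' THEN 'active' ELSE 'inactive' END AS status_label,
    created_at
FROM schema_1.table_1;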

To orchestrate the whole thing, we have a bunch of .sql files that we run with psql directly. That's our "strategy".

So we're copying a lot of data, but it allows us to debug and investigate business bugs, because we can plug into schema 2 or 3 and track down where an issue comes from.

Everything is fine and can work great.
But sometimes, queries that used to take about 20 seconds to complete suddenly take 5 minutes or more.
Importantly, all queries have the same shape -> CREATE TABLE ... AS SELECT ... (a bit of transform) FROM table. No UPDATE, nothing, it's dead simple.

We are just trying to copy a table from schema1 to schema2, then schema3, and so on. That's it.
The thing to understand here is that schema2, schema3, etc. are dropped at every pipeline run, so every time we run the script, it drops everything from schema2 through the final stage.

We tuned the config a little and tried pretty much everything (synchronous_commit, WAL settings, vacuum).
Nothing helps; it's very random, and some queries simply won't finish (even after hours).

We use different machines, different configs, and different datasets.

The only thing that makes it work every time, in 100% of cases, is to put a sleep(10 sec) between each schema.
So we select the 50 tables, create a new schema from them, sleep 10 seconds, then run the same queries again against the freshly created schema to build a third schema, sleep 10 seconds, and so on.

And that makes the whole pipeline successful each time.
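
Concretely, the workaround amounts to something like this in the driving psql script (the file names are made up; pg_sleep is one way to express the pause):

\i stage_2.sql
SELECT pg_sleep(10);  -- without this pause the next stage sometimes never finishes
\i stage_3.sql
SELECT pg_sleep(10);
\i stage_4.sql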

So it seems there is a background process inside Postgres that has to ingest a lot of data and needs time to catch up, like the bgwriter or something else?
I disabled autovacuum (autovacuum=off). Same result.
Why does the query never end, even after hours? Why is there no log about where the query is stuck?
To be clear, if I kill the stuck query and run it again, it works.
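
(By "kill" I mean cancelling the backend from another session, roughly like this; the filter is only approximate:)

-- cancel any CREATE TABLE ... AS SELECT that has been running for over an hour
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE 'CREATE TABLE%'
  AND now() - query_start > interval '1 hour';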

I don't really know what's going on inside Postgres that randomly makes things take so long, with the same code and the same data.

PostgreSQL 12.4 (Debian 12.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

Thank you so much for your time.


On Thu, Jul 08, 2021 at 01:00:28AM +0200, Allan Barrielle wrote:
> Everything is fine and can work great.
> But sometimes, queries that used to take about 20 seconds to complete suddenly
> take 5 minutes or more.
> Importantly, all queries have the same shape -> CREATE TABLE ... AS SELECT ...
> (a bit of transform) FROM table. No UPDATE, nothing, it's dead simple.
> 
> Nothing helps; it's very random, and some queries simply won't finish (even
> after hours).

When it doesn't work, you could check SELECT * FROM pg_stat_activity, and
SELECT pg_blocking_pids(pid), * FROM pg_locks, to see what's going on.
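
For example, from a second session while a query looks stuck (the extra columns and filters here are only suggestions):

-- what is every non-idle backend doing, and what is it waiting on?
SELECT pid, state, wait_event_type, wait_event, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;

-- are any lock requests not granted, and which pids hold them up?
SELECT pg_blocking_pids(pid) AS blocked_by, *
FROM pg_locks
WHERE NOT granted;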

> Importantly, all queries have the same shape -> CREATE TABLE ... AS SELECT ...
> (a bit of transform) FROM table. No UPDATE, nothing, it's dead simple.
> We are just trying to copy a table from schema1 to schema2, then schema3, and
> so on. That's it.

Is it true that the SELECTs have no joins in them?

Did this ever work better or differently under different versions of postgres?

> Why does the query never end, even after hours? Why is there no log about
> where the query is stuck?

Please send your nondefault config.
https://wiki.postgresql.org/wiki/Server_Configuration

Also enable logging (I just added this to the wiki).
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Enable_Logging
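
For example (the values here are only illustrative; on a managed service like Aurora/RDS these usually have to be set via the parameter group rather than ALTER SYSTEM):

-- log long statements, lock waits, checkpoints and autovacuum runs
ALTER SYSTEM SET log_min_duration_statement = '5s';
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();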

It'd be very useful to get "explain analyze" for a working query and for a
stuck query.  It sounds like the stuck query never finishes, so maybe the second
part is impossible (?)

But it'd be good to get at least "explain" output.  You'd have to edit your sql
script to run an "explain" before each query, and run it, logging the output,
until you capture the plan for a stuck query.  Save the output and send it here,
along with the query plan for a working query.
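
For instance, each .sql file could be changed to something like this (the statement is only an illustration of the pattern; plain EXPLAIN does not execute the statement, so it is cheap to add):

-- capture the plan the planner would pick right now ...
EXPLAIN
CREATE TABLE schema_2.table_1 AS
SELECT id, amount FROM schema_1.table_1;

-- ... then run the real statement
CREATE TABLE schema_2.table_1 AS
SELECT id, amount FROM schema_1.table_1;

Then run the whole script with psql's output redirected to a file so the plans are kept.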

-- 
Justin



> We use different machines, different configs, and different datasets.
> ...
> PostgreSQL 12.4 (Debian 12.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

Is it possible to upgrade and test with PG 12.7?

IMHO: lots of changes between 12.4 and 12.7.

Just to rule out the possibility that it has already been fixed.

Regards,
 Imre



Re: ETL - sql orchestrator is stuck when there is no sleep() between queries

From
Allan Barrielle
Date:
Hello, 

> Is it true that the SELECTs have no joins in them?

Yes, there are a lot of LEFT JOINs.

> When it doesn't work, you could check SELECT * FROM pg_stat_activity, and
> SELECT pg_blocking_pids(pid), * FROM pg_locks, to see what's going on.

I can't see anything blocking in pg_locks or pg_blocking_pids.

> It'd be very useful to get "explain analyze" for a working query and for a
> stuck query.  It sounds like the stuck query never finishes, so maybe the second
> part is impossible (?)

We ran EXPLAIN ANALYZE and we see some very interesting stuff going on.
It seems that without explicitly running `ANALYZE` first, the query plan has a cost in the billions, so the query is not stuck, it just takes forever.
When I run the same script with an ANALYZE right before the query, it executes in 50 seconds and the cost is normal.

EXPLAIN ANALYZE without a prior ANALYZE: https://explain.depesz.com/s/RaSr
EXPLAIN ANALYZE of the same query with ANALYZE run beforehand: https://explain.depesz.com/s/tYVl

The configuration is tuned by AWS Aurora, but this issue also happens with a default config.

allow_system_table_mods,off
application_name,DataGrip 2021.1.3
archive_command,(disabled)
archive_mode,off
archive_timeout,5min
array_nulls,on
authentication_timeout,1min
autovacuum,on
autovacuum_analyze_scale_factor,0.05
autovacuum_analyze_threshold,50
autovacuum_freeze_max_age,200000000
autovacuum_max_workers,12
autovacuum_multixact_freeze_max_age,400000000
autovacuum_naptime,5s
autovacuum_vacuum_cost_delay,1ms
autovacuum_vacuum_cost_limit,1200
autovacuum_vacuum_scale_factor,0.1
autovacuum_vacuum_threshold,50
autovacuum_work_mem,-1
backend_flush_after,0
backslash_quote,safe_encoding
bgwriter_delay,200ms
bgwriter_flush_after,0
bgwriter_lru_maxpages,100
bgwriter_lru_multiplier,2
bonjour,off
bytea_output,hex
check_function_bodies,on
checkpoint_completion_target,0.9
checkpoint_flush_after,0
checkpoint_timeout,15min
checkpoint_warning,30s
client_encoding,UTF8
client_min_messages,notice
commit_delay,0
commit_siblings,5
constraint_exclusion,partition
cpu_index_tuple_cost,0.005
cpu_operator_cost,0.0025
cpu_tuple_cost,0.01
cursor_tuple_fraction,0.1
DateStyle,"ISO, MDY"
db_user_namespace,off
deadlock_timeout,1s
debug_pretty_print,on
debug_print_parse,off
debug_print_plan,off
debug_print_rewritten,off
default_statistics_target,500
default_text_search_config,pg_catalog.simple
default_transaction_deferrable,off
default_transaction_isolation,read committed
default_transaction_read_only,off
dynamic_library_path,$libdir
effective_cache_size,4GB
effective_io_concurrency,600
enable_bitmapscan,on
enable_gathermerge,on
enable_hashagg,on
enable_hashjoin,on
enable_indexonlyscan,on
enable_indexscan,on
enable_material,on
enable_mergejoin,on
enable_nestloop,on
enable_parallel_append,on
enable_parallel_hash,on
enable_partition_pruning,on
enable_partitionwise_aggregate,off
enable_partitionwise_join,off
enable_seqscan,on
enable_sort,on
enable_tidscan,on
escape_string_warning,on
event_source,PostgreSQL
exit_on_error,off
extra_float_digits,3
force_parallel_mode,off
from_collapse_limit,8
fsync,off
full_page_writes,off
geqo,on
geqo_effort,5
geqo_generations,0
geqo_pool_size,0
geqo_seed,0
geqo_selection_bias,2
geqo_threshold,12
gin_fuzzy_search_limit,0
gin_pending_list_limit,4MB
hot_standby,off
hot_standby_feedback,on
huge_pages,try
idle_in_transaction_session_timeout,25min
ignore_checksum_failure,off
ignore_system_indexes,off
IntervalStyle,postgres
jit,off
jit_above_cost,100000
jit_debugging_support,off
jit_dump_bitcode,off
jit_expressions,on
jit_inline_above_cost,500000
jit_optimize_above_cost,500000
jit_profiling_support,off
jit_provider,llvmjit
jit_tuple_deforming,on
join_collapse_limit,8
lc_monetary,C
lc_numeric,C
lc_time,C
listen_addresses,*
lock_timeout,0
lo_compat_privileges,off
maintenance_work_mem,2GB
max_connections,100
max_files_per_process,1000
max_locks_per_transaction,256
max_logical_replication_workers,4
max_parallel_maintenance_workers,12
max_parallel_workers,12
max_parallel_workers_per_gather,6
max_pred_locks_per_page,2
max_pred_locks_per_relation,-2
max_pred_locks_per_transaction,64
max_prepared_transactions,0
max_replication_slots,10
max_stack_depth,6MB
max_standby_archive_delay,30s
max_standby_streaming_delay,14s
max_sync_workers_per_subscription,2
max_wal_senders,0
max_wal_size,8GB
max_worker_processes,12
min_parallel_index_scan_size,512kB
min_parallel_table_scan_size,8MB
min_wal_size,2GB
old_snapshot_threshold,-1
operator_precedence_warning,off
parallel_leader_participation,off
parallel_setup_cost,1000
parallel_tuple_cost,0.1
password_encryption,md5
port,5432
post_auth_delay,0
pre_auth_delay,0
quote_all_identifiers,off
random_page_cost,1.1
restart_after_crash,on
row_security,on
search_path,public
seq_page_cost,1
session_replication_role,origin
shared_buffers,1GB
standard_conforming_strings,on
statement_timeout,0
superuser_reserved_connections,3
synchronize_seqscans,on
synchronous_commit,on
syslog_facility,local0
syslog_ident,postgres
syslog_sequence_numbers,on
syslog_split_messages,on
tcp_keepalives_count,9
tcp_keepalives_idle,7200
tcp_keepalives_interval,75
temp_buffers,8MB
temp_file_limit,-1
TimeZone,UTC
trace_notify,off
trace_recovery_messages,log
trace_sort,off
track_activities,on
track_activity_query_size,4kB
track_commit_timestamp,off
track_counts,on
track_functions,none
track_io_timing,off
transform_null_equals,off
update_process_title,on
vacuum_cleanup_index_scale_factor,0.1
vacuum_cost_delay,0
vacuum_cost_limit,200
vacuum_cost_page_dirty,20
vacuum_cost_page_hit,1
vacuum_cost_page_miss,0
vacuum_defer_cleanup_age,0
vacuum_freeze_min_age,50000000
vacuum_freeze_table_age,150000000
vacuum_multixact_freeze_min_age,5000000
vacuum_multixact_freeze_table_age,150000000
wal_buffers,16MB
wal_compression,off
wal_level,minimal
wal_log_hints,off
wal_receiver_status_interval,10s
wal_receiver_timeout,30s
wal_retrieve_retry_interval,5s
wal_sender_timeout,1min
wal_sync_method,fdatasync
wal_writer_delay,200ms
wal_writer_flush_after,1MB
work_mem,2GB
xmlbinary,base64
xmloption,content
zero_damaged_pages,off


Re: ETL - sql orchestrator is stuck when there is no sleep() between queries

From
Allan Barrielle
Date:
On a different machine, we use 12.7. Still the same issue.


Re: ETL - sql orchestrator is stuck when there is no sleep() between queries

From
Justin Pryzby
Date:
On Thu, Jul 08, 2021 at 03:49:12PM +0200, Allan Barrielle wrote:
> > Is it true that the SELECTs have no joins in them?
> 
> Yes, there are a lot of LEFT JOINs.
> 
> > It'd be very useful to get "explain analyze" for a working query and for a
> > stuck query.  It sounds like the stuck query never finishes, so maybe the second
> > part is impossible (?)
> 
> We ran EXPLAIN ANALYZE and we see some very interesting stuff going on.
> It seems that without explicitly running ANALYZE first, the query plan has a
> cost in the billions, so the query is not stuck, it just takes forever.
> When I run the same script with an ANALYZE right before the query, it executes
> in 50 seconds and the cost is normal.

It sounds like sometimes autoanalyze processes important tables being queried,
but sometimes it doesn't.

Since there are JOINs involved, you should analyze the tables after populating
them and before querying them.  The same as if it were a temp table, or
anything else.
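
For example, each stage script could end up looking roughly like this (table and column names are invented):

CREATE TABLE schema_2.table_1 AS
SELECT id, amount FROM schema_1.table_1;

CREATE TABLE schema_2.table_2 AS
SELECT id, label FROM schema_1.table_2;

-- collect statistics before the next stage joins against these tables
ANALYZE schema_2.table_1;
ANALYZE schema_2.table_2;

CREATE TABLE schema_3.table_1 AS
SELECT a.id, a.amount, b.label
FROM schema_2.table_1 a
LEFT JOIN schema_2.table_2 b USING (id);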

> The configuration is tuned by aws aurora, [...]

> fsync,off
> full_page_writes,off

really?

> vacuum_cleanup_index_scale_factor,0.1

also interesting



Re: ETL - sql orchestrator is stuck when there is no sleep() between queries

From
Allan Barrielle
Date:
fsync is off and full_page_writes is off because the pipeline runs only once: we create the db, we load the data, then we dump the data and kill the db.
There is no need to handle server crashes or anything like that.

vacuum_cleanup_index_scale_factor = 0.1 is the default value.
