Thread: query locks up when run concurrently

From: azhwkd@gmail.com
Greetings!

I'm using PostgreSQL 9.5.5 on an Ubuntu 16.04.1 server, installed
through apt-get.

I have a query which, if run alone, usually completes in about 300ms.
When run in my application, this query consistently locks up and bogs
down all connections of the connection pool (in the application this
query is run up to 10 times in parallel with different parameters).
What's really weird is that I can re-run one of the hung queries from
the command line while it's hung, and it will complete as expected
while the hung queries continue to use 100% CPU time.

The query in question is this:

insert into group_history ("group", id, sub_category, "date", aa, ab,
                           bb, ba, quantity, "hour")
(select
    a."group",
    a.id,
    b.sub_category,
    to_timestamp($2)::date as "date",
    max(a.aa / a.quantity) as aa,
    min(a.aa / a.quantity) as ab,
    max(a.bb / a.quantity) as bb,
    min(a.bb / a.quantity) as ba,
    sum(a.quantity) as quantity,
    extract('hour' from to_timestamp($2)) as "hour"
from tbla a
join tblb b on a.id = b.id
where a."group" = $1 and b."group" = $1
group by a."group", a.id, b.sub_category
);

When I run perf on the system while the query is running 10 times in
parallel, it looks like this:

Samples: 4M of event 'cpu-clock', Event count (approx.): 18972107951
Overhead  Shared Object  Symbol
  17.95%  postgres       [.] heap_hot_search_buffer
   5.64%  postgres       [.] heap_page_prune_opt
   4.62%  postgres       [.] hash_search_with_hash_value
   3.80%  postgres       [.] LWLockRelease
   3.73%  postgres       [.] 0x00000000002f420d
   2.50%  postgres       [.] _bt_checkkeys
   2.48%  postgres       [.] hash_any
   2.45%  postgres       [.] 0x00000000002f41e7
   2.10%  postgres       [.] slot_getattr
   1.80%  postgres       [.] ResourceOwnerForgetBuffer
   1.58%  postgres       [.] LWLockAcquire
   1.58%  postgres       [.] ReadBufferExtended
   1.54%  postgres       [.] index_fetch_heap
   1.47%  postgres       [.] MemoryContextReset
   1.43%  postgres       [.] btgettuple
   1.38%  postgres       [.] 0x00000000002d710c
   1.36%  postgres       [.] 0x00000000002d70a5
   1.35%  postgres       [.] ExecQual

Explain (Analyze, Verbose) Output

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Insert on public.group_history  (cost=10254.36..10315.16 rows=2432 width=62) (actual time=1833.967..1833.967 rows=0 loops=1)
  ->  Subquery Scan on "*SELECT*"  (cost=10254.36..10315.16 rows=2432 width=62) (actual time=353.880..376.490 rows=6139 loops=1)
        Output: "*SELECT*"."group", "*SELECT*".id, "*SELECT*".sub_category, "*SELECT*"."when", "*SELECT*".aa, "*SELECT*".ab, "*SELECT*".bb, "*SELECT*".ba, "*SELECT*".quantity, "*SELECT*"."hour"
        ->  HashAggregate  (cost=10254.36..10278.68 rows=2432 width=28) (actual time=353.871..367.144 rows=6139 loops=1)
              Output: a."group", a.id, b.sub_category, '2016-11-20'::date, max((a.aa / a.quantity)), min((a.aa / a.quantity)), max((a.bb / a.quantity)), min((a.bb / a.quantity)), sum(a.quantity), '21'::double precision
              Group Key: a."group", a.id, b.sub_category
              ->  Hash Join  (cost=5558.64..10181.40 rows=2432 width=28) (actual time=193.949..294.106 rows=30343 loops=1)
                    Output: a."group", a.id, a.aa, a.quantity, a.bb, b.sub_category
                    Hash Cond: (b.id = a.id)
                    ->  Bitmap Heap Scan on public.auctions_extra b  (cost=685.19..4719.06 rows=30550 width=8) (actual time=56.678..111.038 rows=30343 loops=1)
                          Output: b.sub_category, b.id
                          Recheck Cond: (b."group" = 7)
                          Heap Blocks: exact=289
                          ->  Bitmap Index Scan on auction_extra_pk  (cost=0.00..677.55 rows=30550 width=0) (actual time=55.966..55.966 rows=30343 loops=1)
                                Index Cond: (b."group" = 7)
                    ->  Hash  (cost=4280.62..4280.62 rows=30627 width=28) (actual time=137.160..137.160 rows=30343 loops=1)
                          Output: a."group", a.id, a.aa, a.quantity, a.bb, a.id
                          Buckets: 16384  Batches: 4  Memory Usage: 638kB
                          ->  Bitmap Heap Scan on public.tbla a  (cost=689.78..4280.62 rows=30627 width=28) (actual time=58.530..117.064 rows=30343 loops=1)
                                Output: a."group", a.id, a.aa, a.quantity, a.bb, a.id
                                Recheck Cond: (a."group" = 7)
                                Heap Blocks: exact=254
                                ->  Bitmap Index Scan on tbla_pk  (cost=0.00..682.12 rows=30627 width=0) (actual time=57.801..57.801 rows=30343 loops=1)
                                      Index Cond: (a."group" = 7)
Planning time: 0.475 ms
Trigger group_history_trigger: time=1442.561 calls=6139
Execution time: 1834.119 ms


group_history_trigger (the function it executes):

CREATE OR REPLACE FUNCTION public.group_history_partition_function()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
declare
    _new_date  timestamptz;
    _tablename text;
    _startdate text;
begin
    -- Take the current inbound "when" value and truncate it to midnight
    -- of the given date
    _new_date := date_trunc('day', new."when");
    _startdate := to_char(_new_date, 'YYYY_MM_DD');
    _tablename := 'group_history_' || _startdate;

    -- Insert the current record into the correct partition
    execute 'INSERT INTO public.' || quote_ident(_tablename) || ' VALUES ($1.*)
        on conflict ("group", id, sub_category, "when", "hour") do
        update set aa = excluded.aa,
                   ab = excluded.ab,
                   bb = excluded.bb,
                   ba = excluded.ba,
                   quantity = excluded.quantity' using new;
    return null;
end
$function$;

Has anyone experienced this before, or does anyone have a solution?
I'd be very grateful.

Kind regards
Sebastian


Re: query locks up when run concurrently

From: Adrian Klaver
On 11/22/2016 11:41 PM, azhwkd@gmail.com wrote:
> Greetings!
>
> I'm using PostgreSQL 9.5.5 on an Ubuntu 16.04.1 server, installed
> through apt-get.
>
> I have a query which, if run alone, usually completes in about 300ms.
> When run in my application, this query consistently locks up and bogs
> down all connections of the connection pool (in the application this
> query is run up to 10 times in parallel with different parameters).
> What's really weird is that I can re-run one of the hung queries from
> the command line while it's hung, and it will complete as expected
> while the hung queries continue to use 100% CPU time.

Have you looked in the Postgres logs to see if there are any relevant
messages?

Best guess is that running in parallel is creating a lock problem where
one of the queries is holding a lock that is keeping the others from
completing.

What relationships do the group_history and group_history_* tables have
to other tables?

What is the definition of the trigger that fires the trigger function below?
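
The trigger definitions can be pulled straight from the catalogs, e.g.
(a generic catalog query, nothing about your setup assumed):

SELECT tgname, pg_get_triggerdef(oid)
FROM pg_trigger
WHERE tgrelid = 'public.group_history'::regclass
  AND NOT tgisinternal;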



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: Tom Lane
azhwkd@gmail.com writes:
> I have a query which, if run alone, usually completes in about 300ms.
> When run in my application, this query consistently locks up and bogs
> down all connections of the connection pool (in the application this
> query is run up to 10 times in parallel with different parameters).
> What's really weird is that I can re-run one of the hung queries from
> the command line while it's hung, and it will complete as expected
> while the hung queries continue to use 100% CPU time.

Judging from the EXPLAIN timing, most of the work is in the trigger,
which leads me to wonder if the parallel calls are likely to be fighting
over inserting/updating the same row in the group_history partition
tables.  Or are you certain that they should be hitting different rows?

            regards, tom lane


Re: query locks up when run concurrently

From: azhwkd
Greetings!

Yes, I had a look at the logfiles, but there is not a single logfile
generated when I'm reproducing this issue.
Concerning locks, I used the queries at
https://wiki.postgresql.org/wiki/Lock_Monitoring and they came up empty.
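
For reference, a minimal form of that kind of lock check looks like this
(the wiki queries are more elaborate; pg_locks and pg_stat_activity are
standard catalogs):

-- Ungranted locks and the statements of the sessions waiting on them:
SELECT l.pid, l.locktype, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
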
The group_history table and its sub-tables do not have any foreign keys as they only contain calculated data and the source data is removed after some time.
The trigger which calls the partition function below is defined as follows:

create trigger group_history_trigger
    before insert on item_history
    for each row execute procedure group_history_partition_function();

Kind regards,
Sebastian


Re: query locks up when run concurrently

From: azhwkd
Greetings!

The parallel calls should not be working on the same row. Each query
services a different group ID on its own, and there is no overlap.

Kind regards,
Sebastian



Re: query locks up when run concurrently

From: Adrian Klaver
On 11/23/2016 01:47 PM, azhwkd wrote:
> Greetings!
>
> Yes, I had a look at the logfiles, but there is not a single logfile
> generated when I'm reproducing this issue.

We are talking about the Postgres logfile, correct?

I have to believe a log file exists, but maybe there were no entries at
that time.

What are your logging settings?
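
They can all be listed from psql, e.g. (pg_settings is a standard
catalog view):

SELECT name, setting FROM pg_settings WHERE name LIKE 'log%';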

> Concerning locks, I used the queries
> at https://wiki.postgresql.org/wiki/Lock_Monitoring and they came up empty.
> The group_history table and its sub-tables do not have any foreign keys
> as they only contain calculated data and the source data is removed
> after some time.
> The trigger which calls the partition function below is defined as follows:
>
> create trigger group_history_trigger
>     before insert on item_history
>     for each row execute procedure group_history_partition_function();

This trigger is on the table item_history, not group_history, is that
correct?

If so how does the item_history table fit into this?

>
> Kind regards,
> Sebastian


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: Adrian Klaver
On 11/23/2016 01:52 PM, azhwkd wrote:
> Greetings!
>
> The parallel calls should not be working on the same row. Each query
> services a different group ID on its own, and there is no overlap.

Except the INSERT query in the trigger function is working on dates,
not group ids.



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: azhwkd
I'm sorry, I worded this quite poorly. I meant to say that there were
no log lines added to the postgres logfile at the time.
I hope these are the settings you were referring to (I did not change
anything in the config files vs. the defaults):

log_destination = 'stderr'
logging_collector = off
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0600
log_truncate_on_rotation = off
log_rotation_age = 1d
log_rotation_size = 10MB
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
event_source = 'PostgreSQL'
client_min_messages = notice
log_min_messages = warning
log_min_error_statement = error
log_min_duration_statement = -1
debug_print_parse = off
debug_print_rewritten = off
debug_print_plan = off
debug_pretty_print = on
log_checkpoints = off
log_connections = off
log_disconnections = off
log_duration = off
log_error_verbosity = default
log_hostname = off
log_line_prefix = '%t '
log_lock_waits = off
log_statement = 'none'
log_temp_files = -1
log_timezone = 'UTC'

I'm sorry, it seems I copied the trigger definition from the wrong
table (item_history is a test table I created yesterday while trying
things; it is exactly the same as group_history, just with a different
name).
The trigger on the group_history table is exactly the same, except for
the table name.

Kind regards,
Sebastian



Re: query locks up when run concurrently

From: azhwkd
The group ID is part of the primary key of the group_history table. My
understanding is that two INSERTs with different group IDs should not
collide in this case, or am I wrong in thinking this?

The table definition for group_history is the following:

CREATE TABLE public.group_history (
    "group" int4 NOT NULL,
    id int4 NOT NULL,
    sub_category int4 NOT NULL,
    "date" date NOT NULL,
    aa int8 NOT NULL,
    ab int8 NOT NULL,
    bb int8 NOT NULL,
    ba int8 NOT NULL,
    quantity int8 NOT NULL,
    "hour" int2 NOT NULL,
    CONSTRAINT group_history_pk PRIMARY KEY ("group", id, sub_category, "date", "hour")
)
WITH (
    OIDS=FALSE
);

Kind regards,
Sebastian



Re: query locks up when run concurrently

From: Adrian Klaver
On 11/23/2016 10:26 PM, azhwkd wrote:
> I'm sorry, I worded this quite poorly. I meant to say that there were
> no log lines added to the postgres logfile at the time.
> I hope these are the settings you were referring to (I did not change
> anything in the config files vs. the defaults):
>
> [logging settings snipped]

While troubleshooting this I would enable log_connections and
log_disconnections, and change log_statement to 'mod'. This will help
you see when and what is happening when you run the concurrent queries.
FYI, you need to reload the server for the changes to take effect.
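
One way to do that without editing postgresql.conf by hand (ALTER SYSTEM
is available from 9.4 on, and pg_reload_conf() does the reload):

ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
ALTER SYSTEM SET log_statement = 'mod';
SELECT pg_reload_conf();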



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: Adrian Klaver
On 11/23/2016 10:41 PM, azhwkd wrote:
> The group ID is part of the primary key of the group_history table. My
> understanding is that two INSERTs with different group IDs should not
> collide in this case, or am I wrong in thinking this?

The suspicion is that this:

insert into group_history ("group", id, sub_category, "date", aa, ab,
bb, ba, quantity, "hour") ....

is not the problem. That this (from the trigger function) is:

execute 'INSERT INTO public.' || quote_ident(_tablename) || ' VALUES ($1.*)
on conflict ("group", id, sub_category, "when", "hour") do

....


where _tablename is:

_tablename := 'group_history_'||_startdate;


It would be nice to see the schema definitions for group_history and at
least one of the group_history_startdate tables.

>
> The table definition for group_history is the following:
>
> CREATE TABLE public.group_history (
>     "group" int4 NOT NULL,
>     id int4 NOT NULL,
>     sub_category int4 NOT NULL,
>     "date" date NOT NULL,
>     aa int8 NOT NULL,
>     ab int8 NOT NULL,
>     bb int8 NOT NULL,
>     ba int8 NOT NULL,
>     quantity int8 NOT NULL,
>     "hour" int2 NOT NULL,
>     CONSTRAINT group_history_pk PRIMARY KEY ("group", id, sub_category, "date", "hour")
> )
> WITH (
>     OIDS=FALSE
> );
>
> Kind regards,
> Sebastian
>



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: Adrian Klaver
On 11/23/2016 10:41 PM, azhwkd wrote:
> The group ID is part of the primary key of the group_history table. My
> understanding is that two INSERTs with different group IDs should not
> collide in this case, or am I wrong in thinking this?
>
> The table definition for group_history is the following:
>
> CREATE TABLE public.group_history (
>     "group" int4 NOT NULL,
>     id int4 NOT NULL,
>     sub_category int4 NOT NULL,
>     "date" date NOT NULL,
>     aa int8 NOT NULL,
>     ab int8 NOT NULL,
>     bb int8 NOT NULL,
>     ba int8 NOT NULL,
>     quantity int8 NOT NULL,
>     "hour" int2 NOT NULL,
>     CONSTRAINT group_history_pk PRIMARY KEY ("group", id, sub_category, "date", "hour")
> )
> WITH (
>     OIDS=FALSE
> );

Me slaps head

Windy night last night + lack of sleep + insufficient caffeine = Not
seeing the table definition before hitting Send on previous post.



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: Adrian Klaver
On 11/23/2016 10:41 PM, azhwkd wrote:
> The group ID is part of the primary key of the group_history table. My
> understanding is that two INSERTs with different group IDs should not
> collide in this case, or am I wrong in thinking this?

After fresh coffee:

In your first post you had this:
"(In the application this
query is run up to 10 times in parallel with different parameters)"

where the parameters look to be a timestamp and a group id.

Is it possible that the combination of parameters over the 10 different
queries is not actually selecting rows in each result set that are
unique over (group,id,sub_category,"date","hour") between result sets?

--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: azhwkd
It should not be possible, because a group does not return to the
update pool before the update has finished.
I watched the queries in a postgres client and there was no overlap I
could see.
I don't really know what to make of this behavior; sometimes when I
start the application a few updates go through and eventually it will
lock up completely, and sometimes it locks up immediately - always with
heap_hot_search_buffer using ~20% of all CPU time on the system.



Re: query locks up when run concurrently

From: Adrian Klaver
On 11/24/2016 01:23 PM, azhwkd wrote:
> It should not be possible, because a group does not return to the
> update pool before the update has finished.

So what is this 'update pool' and what is driving/using it?

In other words how is the determination of the parameters done?

To be more specific, the implication is that a group id can be reused so
what determines that?

> I watched the queries in a postgres client and there was no overlap I
> could see.

Was this a visual inspection or did you dump the results of the various
query/parameter combinations into tables and do an SQL comparison?
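
For instance along these lines (untested sketch; "results_dump" and
"run_tag" are made-up names):

-- Capture the key columns each parallel run would insert, tagged by run:
CREATE TABLE results_dump (
    run_tag      int,
    "group"      int4,
    id           int4,
    sub_category int4,
    "date"       date,
    "hour"       int2
);

-- For each of the ~10 parameter sets, insert the key columns of the
-- SELECT part of your INSERT with a distinct run_tag, then look for
-- keys that show up in more than one run's result set:
SELECT "group", id, sub_category, "date", "hour"
FROM results_dump
GROUP BY 1, 2, 3, 4, 5
HAVING count(DISTINCT run_tag) > 1;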

> I don't really know what to make of this behavior; sometimes when I
> start the application a few updates go through and eventually it will
> lock up completely, and sometimes it locks up immediately - always with
> heap_hot_search_buffer using ~20% of all CPU time on the system.

Is there a common thread with regard to the parameters in use when
things lock up?



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: azhwkd

Adrian Klaver <adrian.klaver@aklaver.com> wrote on Thu., 24 Nov 2016 at 22:34:
> On 11/24/2016 01:23 PM, azhwkd wrote:
> > It should not be possible, because a group does not return to the
> > update pool before the update has finished.
>
> So what is this 'update pool' and what is driving/using it?
>
> In other words how is the determination of the parameters done?
>
> To be more specific, the implication is that a group id can be reused so
> what determines that?

The application is written in Go. Every group ID has its own goroutine,
and the routine is blocked until the SQL statement returns.
The update process starts with a check against an external API endpoint;
if there is new data available, the routine downloads it, parses it, and
inserts the data into 2 tables. Once that is done, the routine continues
to execute the statement in question, using the data it inserted before
for the calculation. Only once this finishes will the routine start over
again.
 

> > I watched the queries in a postgres client and there was no overlap I
> > could see.
>
> Was this a visual inspection or did you dump the results of the various
> query/parameter combinations into tables and do an SQL comparison?

I inspected it visually and also dumped all variables into a file directly from the application.
 

> > I don't really know what to make of this behavior; sometimes when I
> > start the application a few updates go through and eventually it will
> > lock up completely, and sometimes it locks up immediately - always with
> > heap_hot_search_buffer using ~20% of all CPU time on the system.
>
> Is there a common thread with regard to the parameters in use when
> things lock up?

Do you mean whether it always locks up on the same parameters? If so,
it does not, sadly.


Re: query locks up when run concurrently

From: rob stone
On Thu, 2016-11-24 at 22:23 +0100, azhwkd wrote:
> It should not be possible, because a group does not return to the
> update pool before the update has finished.
> I watched the queries in a postgres client and there was no overlap I
> could see.
> I don't really know what to make of this behavior; sometimes when I
> start the application a few updates go through and eventually it will
> lock up completely, and sometimes it locks up immediately - always with
> heap_hot_search_buffer using ~20% of all CPU time on the system.
>
Hello Sebastian,

You stated that the application is written using Go (www.golang.org).
Are you able to run the application with the -race flag?

HTH,
Rob

From Wikipedia:

Lack of race condition safety

There are no restrictions on how goroutines access shared data, making
race conditions possible. Specifically, unless a program explicitly
synchronizes via channels or other means, writes from one goroutine
might be partly, entirely, or not at all visible to another, often with
no guarantees about ordering of writes. Furthermore, Go's internal data
structures like interface values, slice headers, hash tables, and
string headers are not immune to race conditions, so type and memory
safety can be violated in multithreaded programs that modify shared
instances of those types without synchronization.

Instead of language support, safe concurrent programming thus relies on
conventions; for example, Chisnall recommends an idiom called "aliases
xor mutable", meaning that passing a mutable value (or pointer) over a
channel signals a transfer of ownership over the value to its receiver.


Re: query locks up when run concurrently

From: Adrian Klaver
On 11/24/2016 02:14 PM, azhwkd wrote:
> > Is there a common thread with regard to the parameters in use when
> > things lock up?
>
> Do you mean whether it always locks up on the same parameters? If so,
> it does not, sadly.

Yes, that would have been too easy. I'm out of ideas for the moment. Rob
Stone's post looks promising though.


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: query locks up when run concurrently

From: azhwkd
Greetings!

I tried running the application using the race detector, but no data
races were detected.
To further narrow this down I removed the call to the function in
question from the pool and instead run the calls one by one from a
queue. This seems to have helped, and the query didn't lock up anymore.
I took this one step further and left the locked-up queries sitting
there for a day, and they eventually finished. Who would have thought a
~2 second query could blow up to ~20 hours.
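
For reference, the stuck statements and their runtimes can be watched
with something like this (a minimal sketch; on 9.5 pg_stat_activity
still has the boolean "waiting" column):

SELECT pid, now() - query_start AS runtime, waiting, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;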

