Thread: query locks up when run concurrently
Greetings!

I'm using Postgres version 9.5.5 on an Ubuntu 16.04.1 server installation
which was installed through apt-get.

I have a query which, if run alone, usually completes in about 300ms. When run
in my application, this query constantly locks up and bogs down all connections
of the connection pool (in the application this query is run up to 10 times in
parallel with different parameters). What's really weird is that I can re-run
one of the hung queries from the command line while it's hung and it will
complete as expected, while the hung queries continue to use 100% CPU time.

The query in question is this:

insert into group_history ("group", id, sub_category, "date", aa, ab, bb, ba, quantity, "hour")
(select
    a."group",
    a.id,
    b.sub_category,
    to_timestamp($2)::date as "date",
    max(a.aa / a.quantity) as aa,
    min(a.aa / a.quantity) as ab,
    max(a.bb / a.quantity) as bb,
    min(a.bb / a.quantity) as ba,
    sum(a.quantity) as quantity,
    extract('hour' from to_timestamp($2)) as "hour"
from tbla a
join tblb b on a.id = b.id
where a."group" = $1 and b."group" = $1
group by a."group", a.id, b.sub_category
);

When I'm running perf on the system it looks like this while running the query
10 times:

Samples: 4M of event 'cpu-clock', Event count (approx.): 18972107951
Overhead  Shared Object  Symbol
 17.95%   postgres       [.] heap_hot_search_buffer
  5.64%   postgres       [.] heap_page_prune_opt
  4.62%   postgres       [.] hash_search_with_hash_value
  3.80%   postgres       [.] LWLockRelease
  3.73%   postgres       [.] 0x00000000002f420d
  2.50%   postgres       [.] _bt_checkkeys
  2.48%   postgres       [.] hash_any
  2.45%   postgres       [.] 0x00000000002f41e7
  2.10%   postgres       [.] slot_getattr
  1.80%   postgres       [.] ResourceOwnerForgetBuffer
  1.58%   postgres       [.] LWLockAcquire
  1.58%   postgres       [.] ReadBufferExtended
  1.54%   postgres       [.] index_fetch_heap
  1.47%   postgres       [.] MemoryContextReset
  1.43%   postgres       [.] btgettuple
  1.38%   postgres       [.] 0x00000000002d710c
  1.36%   postgres       [.] 0x00000000002d70a5
  1.35%   postgres       [.] ExecQual

Explain (Analyze, Verbose) output:

                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Insert on public.group_history  (cost=10254.36..10315.16 rows=2432 width=62) (actual time=1833.967..1833.967 rows=0 loops=1)
   ->  Subquery Scan on "*SELECT*"  (cost=10254.36..10315.16 rows=2432 width=62) (actual time=353.880..376.490 rows=6139 loops=1)
         Output: "*SELECT*"."group", "*SELECT*".id, "*SELECT*".sub_category, "*SELECT*"."when", "*SELECT*".aa, "*SELECT*".ab, "*SELECT*".bb, "*SELECT*".ba, "*SELECT*".quantity, "*SELECT*"."hour"
         ->  HashAggregate  (cost=10254.36..10278.68 rows=2432 width=28) (actual time=353.871..367.144 rows=6139 loops=1)
               Output: a."group", a.id, b.sub_category, '2016-11-20'::date, max((a.aa / a.quantity)), min((a.aa / a.quantity)), max((a.bb / a.quantity)), min((a.bb / a.quantity)), sum(a.quantity), '21'::double precision
               Group Key: a."group", a.id, b.sub_category
               ->  Hash Join  (cost=5558.64..10181.40 rows=2432 width=28) (actual time=193.949..294.106 rows=30343 loops=1)
                     Output: a."group", a.id, a.aa, a.quantity, a.bb, b.sub_category
                     Hash Cond: (b.id = a.id)
                     ->  Bitmap Heap Scan on public.auctions_extra b  (cost=685.19..4719.06 rows=30550 width=8) (actual time=56.678..111.038 rows=30343 loops=1)
                           Output: b.sub_category, b.id
                           Recheck Cond: (b."group" = 7)
                           Heap Blocks: exact=289
                           ->  Bitmap Index Scan on auction_extra_pk  (cost=0.00..677.55 rows=30550 width=0) (actual time=55.966..55.966 rows=30343 loops=1)
                                 Index Cond: (b."group" = 7)
                     ->  Hash  (cost=4280.62..4280.62 rows=30627 width=28) (actual time=137.160..137.160 rows=30343 loops=1)
                           Output: a."group", a.id, a.aa, a.quantity, a.bb, a.id
                           Buckets: 16384  Batches: 4  Memory Usage: 638kB
                           ->  Bitmap Heap Scan on public.tbla a  (cost=689.78..4280.62 rows=30627 width=28) (actual time=58.530..117.064 rows=30343 loops=1)
                                 Output: a."group", a.id, a.aa, a.quantity, a.bb, a.id
                                 Recheck Cond: (a."group" = 7)
                                 Heap Blocks: exact=254
                                 ->  Bitmap Index Scan on tbla_pk  (cost=0.00..682.12 rows=30627 width=0) (actual time=57.801..57.801 rows=30343 loops=1)
                                       Index Cond: (a."group" = 7)
 Planning time: 0.475 ms
 Trigger group_history_trigger: time=1442.561 calls=6139
 Execution time: 1834.119 ms

group_history_trigger:

CREATE OR REPLACE FUNCTION public.group_history_partition_function()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
declare
    _new_date timestamptz;
    _tablename text;
    _startdate text;
begin
    -- Take the current inbound "when" value and determine midnight for the given date
    _new_date := date_trunc('day', new."when");
    _startdate := to_char(_new_date, 'YYYY_MM_DD');
    _tablename := 'group_history_'||_startdate;

    -- Insert the current record into the correct partition
    execute 'INSERT INTO public.' || quote_ident(_tablename) || ' VALUES ($1.*)
        on conflict ("group", id, sub_category, "when", "hour") do
        update set aa = excluded.aa,
                   ab = excluded.ab,
                   bb = excluded.bb,
                   ba = excluded.ba,
                   quantity = excluded.quantity' using new;
    return null;
end
$function$

Has anyone experienced this before or even has a solution for this?
I'd be very grateful.

Kind regards,
Sebastian
On 11/22/2016 11:41 PM, azhwkd@gmail.com wrote:
> Greetings!
>
> I'm using postgres version 9.5.5 on a ubuntu 16.04.1 server
> installation which was installed through apt-get.
>
> I have a query which if run alone usually completes in about 300ms.
> When run in my application this query constantly locks up and bogs
> down all connections of the connection pool (In the application this
> query is run up to 10 times in parallel with different parameters).
> What's really weird is that I can re-run one of the hung queries from
> the command line while it's hung and it will complete as expected
> while the hung queries continue to use 100% CPU time.

Have you looked in the Postgres logs to see if there are any relevant
messages?

Best guess is that running in parallel is creating a lock problem where
one of the queries is holding a lock that is keeping the others from
completing.

What relationships do the group_history and group_history_* tables have
to other tables?

What is the definition of the trigger that fires the trigger function
you posted?

--
Adrian Klaver
adrian.klaver@aklaver.com
azhwkd@gmail.com writes:
> I have a query which if run alone usually completes in about 300ms.
> When run in my application this query constantly locks up and bogs
> down all connections of the connection pool (In the application this
> query is run up to 10 times in parallel with different parameters).
> What's really weird is that I can re-run one of the hung queries from
> the command line while it's hung and it will complete as expected
> while the hung queries continue to use 100% CPU time.

Judging from the EXPLAIN timing, most of the work is in the trigger,
which leads me to wonder if the parallel calls are likely to be fighting
over inserting/updating the same row in the group_history partition
tables.  Or are you certain that they should be hitting different rows?

                        regards, tom lane
Greetings!
Yes I had a look at the logfiles but there is not a single logfile generated when I'm reproducing this issue.
Concerning locks, I used the queries at https://wiki.postgresql.org/wiki/Lock_Monitoring and they came up empty.
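[Editor's note: the blocked/blocking query on that wiki page is roughly of the
following shape; this is a sketch from memory of the wiki's pg_locks self-join,
not a verbatim copy.]

```sql
-- Pair each backend waiting on an ungranted heavyweight lock
-- with the backend(s) currently holding that lock.
SELECT blocked_locks.pid        AS blocked_pid,
       blocked_activity.query   AS blocked_statement,
       blocking_locks.pid       AS blocking_pid,
       blocking_activity.query  AS blocking_statement
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
       ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
       ON blocking_locks.locktype = blocked_locks.locktype
      AND blocking_locks.database      IS NOT DISTINCT FROM blocked_locks.database
      AND blocking_locks.relation      IS NOT DISTINCT FROM blocked_locks.relation
      AND blocking_locks.page          IS NOT DISTINCT FROM blocked_locks.page
      AND blocking_locks.tuple         IS NOT DISTINCT FROM blocked_locks.tuple
      AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
      AND blocking_locks.pid <> blocked_locks.pid
  JOIN pg_catalog.pg_stat_activity blocking_activity
       ON blocking_activity.pid = blocking_locks.pid
 WHERE NOT blocked_locks.granted;
```

Note that this only reports waiters on heavyweight locks; a backend spinning at
100% CPU without an ungranted entry in pg_locks will not appear here, which is
consistent with the query coming up empty.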
The group_history table and its sub-tables do not have any foreign keys as they only contain calculated data and the source data is removed after some time.
The trigger which calls the partition function below is defined as follows:
create trigger group_history_trigger
    before insert on item_history
    for each row execute procedure group_history_partition_function()
Kind regards,
Sebastian
Adrian Klaver <adrian.klaver@aklaver.com> schrieb am Mi., 23. Nov. 2016 um 16:25 Uhr:
On 11/22/2016 11:41 PM, azhwkd@gmail.com wrote:
> Greetings!
>
> I'm using postgres version 9.5.5 on a ubuntu 16.04.1 server
> installation which was installed through apt-get.
>
> I have a query which if run alone usually completes in about 300ms.
> When run in my application this query constantly locks up and bogs
> down all connections of the connection pool (In the application this
> query is run up to 10 times in parallel with different parameters).
> What's really weird is that I can re-run one of the hung queries from
> the command line while it's hung and it will complete as expected
> while the hung queries continue to use 100% CPU time.
Have you looked in the Postgres logs to see if there are any relevant
messages?
Best guess is that running in parallel is creating a lock problem where
one of the queries is holding a lock that is keeping the others from
completing.
What relationships do the group_history and group_history_* tables have
to other tables?
What is the definition of the trigger that fires the trigger function
you posted?
--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings!
The parallel calls should not be working on the same row. Each query services a different group ID on its own and there is no overlap.
Kind regards,
Sebastian
Tom Lane <tgl@sss.pgh.pa.us> schrieb am Mi., 23. Nov. 2016 um 17:47 Uhr:
Judging from the EXPLAIN timing, most of the work is in the trigger,
which leads me to wonder if the parallel calls are likely to be fighting
over inserting/updating the same row in the group_history partition
tables. Or are you certain that they should be hitting different rows?
regards, tom lane
On 11/23/2016 01:47 PM, azhwkd wrote:
> Greetings!
>
> Yes I had a look at the logfiles but there is not a single logfile
> generated when I'm reproducing this issue.

We are talking about the Postgres logfile, correct?

I have to believe a log file exists, but maybe no entries at that time.

What are your logging settings?

> Concerning locks I used the queries
> at https://wiki.postgresql.org/wiki/Lock_Monitoring and they came up empty.
> The group_history table and its sub-tables do not have any foreign keys
> as they only contain calculated data and the source data is removed
> after some time.
> The trigger which calls the partition function below is defined as follows:
>
> create trigger group_history_trigger
>     before insert on item_history
>     for each row execute procedure group_history_partition_function()

This trigger is on the table item_history, not group_history, is that
correct?

If so, how does the item_history table fit into this?

--
Adrian Klaver
adrian.klaver@aklaver.com
On 11/23/2016 01:52 PM, azhwkd wrote:
> Greetings!
>
> The parallel calls should not be working on the same row. Each query
> services a different group ID on its own and there is no overlap.

Except the INSERT query in the trigger function is working on dates, not
group ids.

--
Adrian Klaver
adrian.klaver@aklaver.com
I'm sorry, I worded this quite poorly. I meant to say that there were no
log lines added to the postgres logfile at the time.

I hope these are the settings you were referring to (I did not change
anything in the config files vs. the default):

log_destination = 'stderr'
logging_collector = off
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0600
log_truncate_on_rotation = off
log_rotation_age = 1d
log_rotation_size = 10MB
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
event_source = 'PostgreSQL'
client_min_messages = notice
log_min_messages = warning
log_min_error_statement = error
log_min_duration_statement = -1
debug_print_parse = off
debug_print_rewritten = off
debug_print_plan = off
debug_pretty_print = on
log_checkpoints = off
log_connections = off
log_disconnections = off
log_duration = off
log_error_verbosity = default
log_hostname = off
log_line_prefix = '%t '
log_lock_waits = off
log_statement = 'none'
log_temp_files = -1
log_timezone = 'UTC'

I'm sorry, it seems I copied the trigger definition from the wrong table
(item_history is a test table I created yesterday while trying things,
which is exactly the same as group_history, just with a different name).
The trigger on the group_history table is exactly the same, except for
the table name.

Kind regards,
Sebastian
The group ID is part of the primary key of the group_history table. My
understanding is that two INSERTs with different group IDs should not
collide in this case, or am I wrong in thinking this?

The table definition for group_history is the following:

CREATE TABLE public.group_history (
    "group" int4 NOT NULL,
    id int4 NOT NULL,
    sub_category int4 NOT NULL,
    "date" date NOT NULL,
    aa int8 NOT NULL,
    ab int8 NOT NULL,
    bb int8 NOT NULL,
    ba int8 NOT NULL,
    quantity int8 NOT NULL,
    "hour" int2 NOT NULL,
    CONSTRAINT group_history_pk PRIMARY KEY ("group", id, sub_category, "date", "hour")
)
WITH (
    OIDS=FALSE
);

Kind regards,
Sebastian
On 11/23/2016 10:26 PM, azhwkd wrote:
> I'm sorry. I worded this quite poorly. I meant to say that there were
> no log lines added to the postgres logfile at the time.
> I hope these are the settings you were referring to (I did not change
> anything in the config files vs. the default)

While troubleshooting this I would enable log_connections and
log_disconnections and change log_statement to 'mod'. This will help you
see when and what is happening when you do the concurrent queries.

FYI, you need to reload the server to have the changes be seen.

> I'm sorry, it seems like I copied the trigger definition from the
> wrong table (item_history is a test table I created yesterday while
> trying things which is exactly the same as group_history just a
> different name).
> The trigger on the group_history table is exactly the same though,
> except for the table name.

--
Adrian Klaver
adrian.klaver@aklaver.com
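[Editor's note: the suggested changes could be made without hand-editing
postgresql.conf; a sketch, assuming a superuser session (ALTER SYSTEM exists
since 9.4 and writes postgresql.auto.conf). The log_lock_waits line is an
addition beyond the advice above, but is often useful in this situation.]

```sql
ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
ALTER SYSTEM SET log_statement = 'mod';  -- log INSERT/UPDATE/DELETE statements
-- Not suggested above, but relevant here: log any lock wait that
-- exceeds deadlock_timeout (1s by default).
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();                 -- all four settings apply on reload
```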
On 11/23/2016 10:41 PM, azhwkd wrote:
> The group ID is part of the primary key of the group_history table. My
> understanding is that two INSERTs with different group IDs should not
> collide in this case, or am I wrong in thinking this?

The suspicion is that this:

insert into group_history ("group", id, sub_category, "date", aa, ab,
bb, ba, quantity, "hour")
....

is not the problem. That this (from the trigger function) is:

execute 'INSERT INTO public.' || quote_ident(_tablename) || ' VALUES ($1.*)
on conflict ("group", id, sub_category, "when", "hour") do
....

where _tablename is:

_tablename := 'group_history_'||_startdate;

It would be nice to see the schema definitions for group_history and at
least one of the group_history_startdate tables.

--
Adrian Klaver
adrian.klaver@aklaver.com
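[Editor's note: the contention being hinted at is easy to demonstrate on a toy
table (hypothetical schema, not the poster's): when two sessions run
INSERT ... ON CONFLICT DO UPDATE for the same key, the second blocks until the
first transaction ends.]

```sql
-- Hypothetical minimal table, not the poster's schema.
CREATE TABLE t (k int PRIMARY KEY, v int);

-- Session 1 (transaction intentionally left open):
BEGIN;
INSERT INTO t VALUES (1, 10)
       ON CONFLICT (k) DO UPDATE SET v = excluded.v;

-- Session 2 blocks on the same key until session 1 commits or rolls back:
INSERT INTO t VALUES (1, 20)
       ON CONFLICT (k) DO UPDATE SET v = excluded.v;
```

With several sessions upserting overlapping keys in different orders, this can
also escalate to deadlocks, which is why the question about key overlap between
the parallel queries matters.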
On 11/23/2016 10:41 PM, azhwkd wrote:
> The group ID is part of the primary key of the group_history table. My
> understanding is that two INSERTs with different group IDs should not
> collide in this case, or am I wrong in thinking this?

Me slaps head.

Windy night last night + lack of sleep + insufficient caffeine = not
seeing the table definition before hitting Send on the previous post.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 11/23/2016 10:41 PM, azhwkd wrote:
> The group ID is part of the primary key of the group_history table. My
> understanding is that two INSERTs with different group IDs should not
> collide in this case, or am I wrong in thinking this?

After fresh coffee:

In your first post you had this:

"(In the application this query is run up to 10 times in parallel with
different parameters)"

where the parameters look to be a timestamp and a group id.

Is it possible that the combination of parameters over the 10 different
queries is not actually selecting rows in each result set that are unique
over (group,id,sub_category,"date","hour") between result sets?

> The table definition for group_history is the following:
>
> CREATE TABLE public.group_history (
>     group int4 NOT NULL,
>     id int4 NOT NULL,
>     sub_category int4 NOT NULL,
>     "date" date NOT NULL,
>     aa int8 NOT NULL,
>     ab int8 NOT NULL,
>     bb int8 NOT NULL,
>     ba int8 NOT NULL,
>     quantity int8 NOT NULL,
>     "hour" int2 NOT NULL,
>     CONSTRAINT group_history_pk PRIMARY KEY (group,id,sub_category,"date","hour")
> )
> WITH (
>     OIDS=FALSE
> );
>
> Kind regards,
> Sebastian

--
Adrian Klaver
adrian.klaver@aklaver.com
It should not be possible, because a group does not return to the update
pool before its update has finished.
I watched the queries in a postgres client and there was no overlap I
could see.
I don't really know what to make of this behavior: sometimes when I start
the application a few updates go through and eventually it locks up
completely, and sometimes it locks up immediately - always with
heap_hot_search_buffer using ~20% of all CPU time on the system.

2016-11-24 19:14 GMT+01:00 Adrian Klaver <adrian.klaver@aklaver.com>:
> On 11/23/2016 10:41 PM, azhwkd wrote:
>>
>> The group ID is part of the primary key of the group_history table. My
>> understanding is that two INSERTs with different group IDs should not
>> collide in this case, or am I wrong in thinking this?
>
> After fresh coffee:
>
> In your first post you had this:
>
> "(In the application this query is run up to 10 times in parallel with
> different parameters)"
>
> where the parameters look to be a timestamp and a group id.
>
> Is it possible that the combination of parameters over the 10 different
> queries is not actually selecting rows in each result set that are unique
> over (group,id,sub_category,"date","hour") between result sets?
>
>> The table definition for group_history is the following:
>>
>> CREATE TABLE public.group_history (
>>     group int4 NOT NULL,
>>     id int4 NOT NULL,
>>     sub_category int4 NOT NULL,
>>     "date" date NOT NULL,
>>     aa int8 NOT NULL,
>>     ab int8 NOT NULL,
>>     bb int8 NOT NULL,
>>     ba int8 NOT NULL,
>>     quantity int8 NOT NULL,
>>     "hour" int2 NOT NULL,
>>     CONSTRAINT group_history_pk PRIMARY KEY
>> (group,id,sub_category,"date","hour")
>> )
>> WITH (
>>     OIDS=FALSE
>> );
>>
>> Kind regards,
>> Sebastian
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
On 11/24/2016 01:23 PM, azhwkd wrote:
> It should not be possible because a group does not return to the
> update pool before the update has finished.

So what is this 'update pool' and what is driving/using it?

In other words, how is the determination of the parameters done?

To be more specific, the implication is that a group id can be reused, so
what determines that?

> I watched the queries in a postgres client and there was no overlap I could see.

Was this a visual inspection, or did you dump the results of the various
query/parameter combinations into tables and do an SQL comparison?

> I don't really know what to make of this behavior, sometimes when I
> start the application a few updates go through and eventually it will
> lock up completely and sometimes it locks up immediately - always with

Is there a common thread with regard to the parameters in use when
things lock up?

> heap_hot_search_buffer using ~20% of all CPU time on the system.
>
> 2016-11-24 19:14 GMT+01:00 Adrian Klaver <adrian.klaver@aklaver.com>:
>> On 11/23/2016 10:41 PM, azhwkd wrote:
>>> The group ID is part of the primary key of the group_history table. My
>>> understanding is that two INSERTs with different group IDs should not
>>> collide in this case, or am I wrong in thinking this?

--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian Klaver <adrian.klaver@aklaver.com> wrote on Thu., 24 Nov 2016 at 22:34:
On 11/24/2016 01:23 PM, azhwkd wrote:
> It should not be possible because a group does not return to the
> update pool before the update has finished.
So what is this 'update pool' and what is driving/using it?
In other words how is the determination of the parameters done?
To be more specific, the implication is that a group id can be reused so
what determines that?
The application is written in Go. Every group ID has its own goroutine, and the routine is blocked until the SQL statement returns.
The update process starts with a check to an external API endpoint and if there is new data available the routine is downloading it, parsing it and inserting the data into 2 tables. Once that is done, the routine continues to execute the statement in question using the data it inserted before for the calculation. Only once this finishes will the routine start over again.
> I watched the queries in a postgres client and there was no overlap I could see.
Was this a visual inspection or did you dump the results of the various
query/parameter combinations into tables and do an SQL comparison?
I inspected it visually and also dumped all variables into a file directly from the application.
> I don't really know what to make from this behavior, sometimes when I
> start the application a few updates go through and eventually it will
> lock up completely and sometimes it locks up immediately - always with
Is there a common thread with regard to the parameters in use when
things lock up?
Do you mean whether it always locks on the same parameters? If so, then it does not, sadly.
> heap_hot_search_buffer using ~20% of all CPU time on the system.
>
> 2016-11-24 19:14 GMT+01:00 Adrian Klaver <adrian.klaver@aklaver.com>:
>> On 11/23/2016 10:41 PM, azhwkd wrote:
>>>
>>> The group ID is part of the primary key of the group_history table. My
>>> understanding is that two INSERTs with different group IDs should not
>>> collide in this case, or am I wrong in thinking this?
>>
--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, 2016-11-24 at 22:23 +0100, azhwkd wrote:
> It should not be possible because a group does not return to the
> update pool before the update has finished.
> I watched the queries in a postgres client and there was no overlap I
> could see.
> I don't really know what to make from this behavior, sometimes when I
> start the application a few updates go through and eventually it will
> lock up completely and sometimes it locks up immediately - always
> with
> heap_hot_search_buffer using ~20% of all CPU time on the system.

Hello Sebastian,

You stated that the application is written using Go (www.golang.org).

Are you able to run the application with the -race flag?

HTH,
Rob

From Wikipedia:

Lack of race condition safety

There are no restrictions on how goroutines access shared data, making
race conditions possible. Specifically, unless a program explicitly
synchronizes via channels or other means, writes from one goroutine
might be partly, entirely, or not at all visible to another, often with
no guarantees about ordering of writes. Furthermore, Go's internal data
structures like interface values, slice headers, hash tables, and
string headers are not immune to race conditions, so type and memory
safety can be violated in multithreaded programs that modify shared
instances of those types without synchronization. Instead of language
support, safe concurrent programming thus relies on conventions; for
example, Chisnall recommends an idiom called "aliases xor mutable",
meaning that passing a mutable value (or pointer) over a channel
signals a transfer of ownership over the value to its receiver.
On 11/24/2016 02:14 PM, azhwkd wrote:
> Adrian Klaver <adrian.klaver@aklaver.com> wrote on Thu., 24 Nov 2016 at 22:34:
>
>     On 11/24/2016 01:23 PM, azhwkd wrote:
>     > It should not be possible because a group does not return to the
>     > update pool before the update has finished.
>
>     So what is this 'update pool' and what is driving/using it?
>
>     In other words, how is the determination of the parameters done?
>
>     To be more specific, the implication is that a group id can be reused, so
>     what determines that?
>
> The application is written in Go. Every group ID has its own goroutine
> and the routine is blocked until the SQL statement returns.
> The update process starts with a check to an external API endpoint and
> if there is new data available the routine is downloading it, parsing it
> and inserting the data into 2 tables. Once that is done, the routine
> continues to execute the statement in question using the data it
> inserted before for the calculation. Only once this finishes will the
> routine start over again.
>
>     > I watched the queries in a postgres client and there was no
>     > overlap I could see.
>
>     Was this a visual inspection or did you dump the results of the various
>     query/parameter combinations into tables and do an SQL comparison?
>
> I inspected it visually and also dumped all variables into a file
> directly from the application.
>
>     > I don't really know what to make from this behavior, sometimes when I
>     > start the application a few updates go through and eventually it will
>     > lock up completely and sometimes it locks up immediately - always with
>
>     Is there a common thread with regard to the parameters in use when
>     things lock up?
>
> Do you mean if it always locks on the same parameters? If so, then it
> does not, sadly.

Yes, that would have been too easy.

I'm out of ideas for the moment. Rob Stone's post looks promising though.

--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings!
I tried running the application with the race detector, but no data races were detected.
To narrow this down further, I removed the call to the function in question from the pool
and instead ran it one by one from a queue. This seems to have helped, and the query didn't lock up anymore.
I took this one step further and left the locked-up queries sitting there for a day, and they eventually finished. Who would have thought a ~2 second query could blow up to ~20 hours.
rob stone <floriparob@gmail.com> wrote on Fri., 25 Nov 2016 at 02:34:
On Thu, 2016-11-24 at 22:23 +0100, azhwkd wrote:
> It should not be possible because a group does not return to the
> update pool before the update has finished.
> I watched the queries in a postgres client and there was no overlap I
> could see.
> I don't really know what to make from this behavior, sometimes when I
> start the application a few updates go through and eventually it will
> lock up completely and sometimes it locks up immediately - always
> with
> heap_hot_search_buffer using ~20% of all CPU time on the system.
>
>
Hello Sebastian,
You stated that the application is written using go (www.golang.org).
Are you able to run the application with the -race flag?
HTH,
Rob
From wikipedia:-
Lack of race condition safety
There are no restrictions on how goroutines access shared data, making
race conditions possible. Specifically, unless a program explicitly
synchronizes via channels or other means, writes from one goroutine
might be partly, entirely, or not at all visible to another, often with
no guarantees about ordering of writes. Furthermore, Go's internal data
structures like interface values, slice headers, hash tables, and
string headers are not immune to race conditions, so type and memory
safety can be violated in multithreaded programs that modify shared
instances of those types without synchronization.
Instead of language support, safe concurrent programming thus relies on
conventions; for example, Chisnall recommends an idiom called "aliases
xor mutable", meaning that passing a mutable value (or pointer) over a
channel signals a transfer of ownership over the value to its receiver.