Thread: zabbix on postgresql - very slow delete of events
Hi
This will be a rather lengthy post, just to give the full (I hope) picture. We're using Zabbix for monitoring and I'm having problems
understanding why the deletion of rows in the events table is so slow.
Zabbix: 4.2 (never mind the name of the db - it is 4.2)
new values per second: ~400
hosts: ~600
items: ~45000
OS: CentOS Linux release 7.6.1810 (Core)
Postgresql was installed from the yum repo on postgresql.org
zabbix_34=> select version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 10.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
(1 row)
The database is analyzed + vacuumed nightly. The server runs Zabbix and the database, has 16 GB memory, 4 vCPUs (modern hardware).
Some parameters:
shared_buffers = 3GB
work_mem = 10MB (I also tested with work_mem = 128MB - no difference)
effective_cache_size = 6 GB
effective_io_concurrency = 40
checkpoint_timeout = 5 min (default)
max_wal_size = 1 GB (default)
checkpoint_completion_target = 0.8
pg_wal is already on a separate device.
events table: ~25 million rows / 2.9 GB
event_recovery table: ~12 million rows / 550 MB
alerts table: ~600000 rows / 530 MB
Generally the database is quite snappy and shows no indication of problems. But now I've seen that housekeeping of events is
very slow - a single (normally hourly) run can take more than one day to finish, so events keep stacking up in the table. A typical slow
delete statement, from the postgres log:
postgresql-10-20190717-031404.log:2019-07-17 03:37:43 CEST [80965]: [4-1] user=zabbix,db=zabbix_34,app=[unknown],client=[local]: LOG: duration: 27298798.930 ms statement: delete from events where (eventid between 5580621 and 5580681 or eventid between 5580689 and 5580762 or eventid between 5580769 and 5580844 or eventid between 5580851 and 5580867 or eventid between 5580869 and 5580926 or eventid between 5580933 and 5580949 or eventid between 5580963 and 5581024
--- 8< --- a lot of similar eventids snipped away -----
or eventid between 5586799 and 5586839 or eventid in (5581385,5581389,5581561,5581563,5581564,5581580,5 581582,5581584,5581585,5581635))
I've analyzed the deletion of a single row in events. First, some table information:
zabbix_34=> \d events
Table "zabbix.events"
Column | Type | Collation | Nullable | Default
--------------+-------------------------+-----------+----------+-----------------------
eventid | numeric | | not null |
source | bigint | | not null | '0'::bigint
object | bigint | | not null | '0'::bigint
objectid | numeric | | not null | '0'::numeric
clock | bigint | | not null | '0'::bigint
value | bigint | | not null | '0'::bigint
acknowledged | bigint | | not null | '0'::bigint
ns | bigint | | not null | '0'::bigint
name | character varying(2048) | | not null | ''::character varying
severity | integer | | not null | 0
Indexes:
"idx_29337_primary" PRIMARY KEY, btree (eventid)
"events_1" btree (source, object, objectid, clock)
"events_2" btree (source, object, clock)
"events_clk_3" btree (clock)
Referenced by:
TABLE "acknowledges" CONSTRAINT "c_acknowledges_2" FOREIGN KEY (eventid) REFERENCES events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE
TABLE "alerts" CONSTRAINT "c_alerts_2" FOREIGN KEY (eventid) REFERENCES events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE
TABLE "alerts" CONSTRAINT "c_alerts_5" FOREIGN KEY (p_eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_recovery" CONSTRAINT "c_event_recovery_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_recovery" CONSTRAINT "c_event_recovery_2" FOREIGN KEY (r_eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_recovery" CONSTRAINT "c_event_recovery_3" FOREIGN KEY (c_eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_suppress" CONSTRAINT "c_event_suppress_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_tag" CONSTRAINT "c_event_tag_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "problem" CONSTRAINT "c_problem_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "problem" CONSTRAINT "c_problem_2" FOREIGN KEY (r_eventid) REFERENCES events(eventid) ON DELETE CASCADE
zabbix_34=> \d event_recovery
Table "zabbix.event_recovery"
Column | Type | Collation | Nullable | Default
---------------+--------+-----------+----------+---------
eventid | bigint | | not null |
r_eventid | bigint | | not null |
c_eventid | bigint | | |
correlationid | bigint | | |
userid | bigint | | |
Indexes:
"event_recovery_pkey" PRIMARY KEY, btree (eventid)
"event_recovery_1" btree (r_eventid)
"event_recovery_2" btree (c_eventid)
Foreign-key constraints:
"c_event_recovery_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE
"c_event_recovery_2" FOREIGN KEY (r_eventid) REFERENCES events(eventid) ON DELETE CASCADE
"c_event_recovery_3" FOREIGN KEY (c_eventid) REFERENCES events(eventid) ON DELETE CASCADE
zabbix_34=> \d alerts
Table "zabbix.alerts"
Column | Type | Collation | Nullable | Default
---------------+-------------------------+-----------+----------+-----------------------
alertid | numeric | | not null |
actionid | numeric | | not null |
eventid | numeric | | not null |
userid | numeric | | |
clock | bigint | | not null | '0'::bigint
mediatypeid | numeric | | |
sendto | character varying(1024) | | not null | ''::character varying
subject | character varying(255) | | not null | ''::character varying
message | text | | not null | ''::text
status | bigint | | not null | '0'::bigint
retries | bigint | | not null | '0'::bigint
error | character varying(2048) | | not null | ''::character varying
esc_step | bigint | | not null | '0'::bigint
alerttype | bigint | | not null | '0'::bigint
p_eventid | bigint | | |
acknowledgeid | bigint | | |
Indexes:
"idx_29120_primary" PRIMARY KEY, btree (alertid)
"alerts_1" btree (actionid)
"alerts_2" btree (clock)
"alerts_3" btree (eventid)
"alerts_4" btree (status)
"alerts_5" btree (mediatypeid)
"alerts_6" btree (userid)
"alerts_7" btree (p_eventid)
Foreign-key constraints:
"c_alerts_1" FOREIGN KEY (actionid) REFERENCES actions(actionid) ON UPDATE RESTRICT ON DELETE CASCADE
"c_alerts_2" FOREIGN KEY (eventid) REFERENCES events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE
"c_alerts_3" FOREIGN KEY (userid) REFERENCES users(userid) ON UPDATE RESTRICT ON DELETE CASCADE
"c_alerts_4" FOREIGN KEY (mediatypeid) REFERENCES media_type(mediatypeid) ON UPDATE RESTRICT ON DELETE CASCADE
"c_alerts_5" FOREIGN KEY (p_eventid) REFERENCES events(eventid) ON DELETE CASCADE
"c_alerts_6" FOREIGN KEY (acknowledgeid) REFERENCES acknowledges(acknowledgeid) ON DELETE CASCADE
Let's look at what's in the tables for event 7123123:
zabbix_34=> select * from events where eventid=7123123;
eventid | source | object | objectid | clock | value | acknowledged | ns | name | severity
---------+--------+--------+----------+------------+-------+--------------+---------+--------------------------------------+----------
7123123 | 3 | 0 | 27562 | 1525264196 | 1 | 0 | 1980875 | Cannot calculate trigger expression. | 0
(1 row)
zabbix_34=> select * from event_recovery where eventid=7123123;
eventid | r_eventid | c_eventid | correlationid | userid
---------+-----------+-----------+---------------+--------
7123123 | 7124371 | | |
(1 row)
zabbix_34=> select * from alerts where eventid=7123123;
alertid | actionid | eventid | userid | clock | mediatypeid | sendto | subject | message | status | retries | error | esc_step | aler
ttype | p_eventid | acknowledgeid
---------+----------+---------+--------+-------+-------------+--------+---------+---------+--------+---------+-------+----------+-----
------+-----------+---------------
(0 rows)
All these queries execute well below 1 ms, using indexes.
Let's delete one row. See explain results here: https://explain.depesz.com/s/aycf . 5 seconds to delete a single row, wow!
This shows that it is the foreign key constraints on event_recovery and alerts that take a lot of time.
But why? I far as I can see, the delete is fully CPU bound during execution.
Deleting the corresponding row directly from event_recovery or alerts executes in less than 0.1 ms.
Any ideas?
I've observed that alerts and event_recovery tables both have more than one foreign key that references events, if that matters.
Regards
Kristian Ejvind
| ||||||||
| ||||||||
| ||||||||
Attachment
On Tue, Jul 23, 2019 at 08:07:55AM +0000, Kristian Ejvind wrote: > Hi > > This will be a rather lengthy post, just to give the full (I hope) picture. We're using Zabbix for monitoring and I'm havingproblems > understanding why the deletion of rows in the events table is so slow. > > Zabbix: 4.2 (never mind the name of the db - it is 4.2) > new values per second: ~400 > hosts: ~600 > items: ~45000 > Hi Kristian, Time series databases like Zabbix work poorly with the Housekeeper service. We had many similar sorts of problems as our Zabbix usage grew. Once we partitioned the big tables, turned off the Housekeeper, and cleaned up by dropping partitions instead everything worked much, much, much better. When we started using partitioning, we used the old inheiritance style. Now you can use the native partitioning. Regards, Ken
Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables and that has been running fine for a year. Performance was vastly improved. But since you can't have a unique index on a partitioned table in postgres 10, we haven't worked on that. Regards Kristian ?On 2019-07-23, 14:58, "Kenneth Marshall" <ktm@rice.edu> wrote: Hi Kristian, Time series databases like Zabbix work poorly with the Housekeeper service. We had many similar sorts of problems as our Zabbix usage grew. Once we partitioned the big tables, turned off the Housekeeper, and cleaned up by dropping partitions instead everything worked much, much, much better. When we started using partitioning, we used the old inheiritance style. Now you can use the native partitioning. Regards, Ken Resurs Bank AB Kristian Ejvind Linux System Administrator IT Operations | Technical Operations Ekslingan 8 Box 222 09, SE-25467 Helsingborg Direkt Tfn: Mobil: +46 728571483 Vxl: +46 42 382000 Fax: E-post: Kristian.Ejvind@resurs.se Webb: http://www.resursbank.se
On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote: > Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables > and that has been running fine for a year. Performance was vastly improved. But since you > can't have a unique index on a partitioned table in postgres 10, we haven't worked on that. > > Regards > Kristian Hi Kristian, Why are you not partitioning the events and alerts tables as well? That would eliminate this problem and you already have the infrastructure in place to support the management since you are using it for the history and trends tables. Regards, Ken
Hi. Well, the events table has both a primary key and foreign keys referencing it, which is not possible on a partitioned table in postgresql 10. How did you work around this issue? On the other hand, if we can get the deletion of rows from the events table run at normal speed, I can't imagine we would have a problem with it in a long time. After all, although our Zabbix installation definitely is larger than "small", it's still far from "large". I think I would need assistance with debugging why postgresql behaves like it does. Is there a defect with deleting data from a table that has multiple foreign keys referencing it from a certain table? Is there a problem with the query optimizer that chooses the wrong plan when working on the foreign key constraints? How do I inspect how the db works on the deletion of rows from the referencing tables? Regards Kristian ?On 2019-07-23, 16:33, "Kenneth Marshall" <ktm@rice.edu> wrote: On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote: > Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables > and that has been running fine for a year. Performance was vastly improved. But since you > can't have a unique index on a partitioned table in postgres 10, we haven't worked on that. > > Regards > Kristian Hi Kristian, Why are you not partitioning the events and alerts tables as well? That would eliminate this problem and you already have the infrastructure in place to support the management since you are using it for the history and trends tables. Regards, Ken Resurs Bank AB Kristian Ejvind Linux System Administrator IT Operations | Technical Operations Ekslingan 8 Box 222 09, SE-25467 Helsingborg Direkt Tfn: Mobil: +46 728571483 Vxl: +46 42 382000 Fax: E-post: Kristian.Ejvind@resurs.se Webb: http://www.resursbank.se
Hi.
Well, the events table has both a primary key and foreign keys referencing it, which is not possible
on a partitioned table in postgresql 10. How did you work around this issue?
On the other hand, if we can get the deletion of rows from the events table run at normal speed, I
can't imagine we would have a problem with it in a long time. After all, although our Zabbix installation
definitely is larger than "small", it's still far from "large".
I think I would need assistance with debugging why postgresql behaves like it does.
Is there a defect with deleting data from a table that has multiple foreign keys referencing it from a certain table?
Is there a problem with the query optimizer that chooses the wrong plan when working on the foreign key constraints?
How do I inspect how the db works on the deletion of rows from the referencing tables?
Regards
Kristian
?On 2019-07-23, 16:33, "Kenneth Marshall" <ktm@rice.edu> wrote:
On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote:
> Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables
> and that has been running fine for a year. Performance was vastly improved. But since you
> can't have a unique index on a partitioned table in postgres 10, we haven't worked on that.
>
> Regards
> Kristian
Hi Kristian,
Why are you not partitioning the events and alerts tables as well? That
would eliminate this problem and you already have the infrastructure in
place to support the management since you are using it for the history
and trends tables.
Regards,
Ken
Resurs Bank AB
Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations
Ekslingan 8
Box 222 09, SE-25467 Helsingborg
Direkt Tfn:
Mobil: +46 728571483
Vxl: +46 42 382000
Fax:
E-post: Kristian.Ejvind@resurs.se
Webb: http://www.resursbank.se
Senior Postgresql DBA
https://dataegret.com/
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"
All these queries execute well below 1 ms, using indexes.
Let's delete one row. See explain results here: https://explain.depesz.com/s/aycf . 5 seconds to delete a single row, wow!
This shows that it is the foreign key constraints on event_recovery and alerts that take a lot of time.
But why? I far as I can see, the delete is fully CPU bound during execution.
Deleting the corresponding row directly from event_recovery or alerts executes in less than 0.1 ms.
Any ideas?
I've observed that alerts and event_recovery tables both have more than one foreign key that references events, if that matters.
Senior Postgresql DBA
https://dataegret.com/
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"
Attachment
Hi Maxim
Thanks for your advice, and let me start with your second email, which I'll copy here:
=====
Hi Kristian,
After comparing structure of zabbix tables with same in my zabbix installation I found one very weird difference.
Why type of events.eventid had been changed from default bigint to numeric?
I suspect that the difference between events.eventid (numeric) type and event_recovery.*_eventid (bigint) types might lead to inability of use index during foreign key checks.
Anyway it will be clearly visible on the pg_stat_xact_user_tables results (I now expect to see 3 sequential scan on event_recovery and may be on some other tables as well).
Kind Regards,
Maxim
=====
Well spotted! On closer examination it seems that data types are wrong in several places. I suspect that this comes
from the time when our Zabbix ran on a MySQL database, which was converted over to PostgreSQL a few years
ago. I agree this discrepancy is suspicious and I will continue to examine it.
Regarding your ideas in the email below, I can say that 1) is not valid, disk latency is in the range of a few ms.
This is the output from your recommended query, which seems to verify your suspicions.
zabbix_34=# begin; delete from zabbix.events where eventid = 7123123; select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc; rollback;
Time: 0.113 ms
Time: 4798.189 ms (00:04.798)
relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd
--------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------
41940 | zabbix | event_recovery | 3 | 35495224 | 0 | 0 | 0 | 0 | 1 | 0
41675 | zabbix | alerts | 1 | 544966 | 1 | 0 | 0 | 0 | 0 | 0
42573 | zabbix | problem | 2 | 13896 | 0 | 0 | 0 | 0 | 0 | 0
41943 | zabbix | event_tag | 1 | 22004 | 0 | 0 | 0 | 0 | 0 | 0
41649 | zabbix | acknowledges | 1 | 47 | 0 | 0 | 0 | 0 | 0 | 0
41951 | zabbix | events | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0
260215 | zabbix | event_suppress | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
(7 rows)
Time: 2.857 ms
Time: 0.162 ms
Regards
Kristian
| ||||||||
| ||||||||
| ||||||||
From: Maxim Boguk <maxim.boguk@gmail.com>
Date: Wednesday, 24 July 2019 at 15:55
To: Kristian Ejvind <Kristian.Ejvind@resurs.se>
Cc: "pgsql-performance@lists.postgresql.org" <pgsql-performance@lists.postgresql.org>
Subject: Re: zabbix on postgresql - very slow delete of events
Hi Kristian,
If you look for explain analyze results for delete,
you will see that 99% of time query spent on the foreign key triggers checks.
In the same time the database have indexes on foreign key side in place.
I recommend try this:
\timing on
BEGIN;
delete from zabbix.events where eventid = [some testing id];
select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc;
ABORT;
And provide result of the last query and how long delete runs.
It might help us understand whats going on.
Currently I have 3 ideas:
1)very very slow and overloaded IO subsystem
2)a lot of stuff being delete by ON DELETE CASCADE
3)some locking prevent foreign key checks run fast
On Wed, Jul 24, 2019 at 11:12 AM Kristian Ejvind <Kristian.Ejvind@resurs.se> wrote:
Hi.
Well, the events table has both a primary key and foreign keys referencing it, which is not possible
on a partitioned table in postgresql 10. How did you work around this issue?
On the other hand, if we can get the deletion of rows from the events table run at normal speed, I
can't imagine we would have a problem with it in a long time. After all, although our Zabbix installation
definitely is larger than "small", it's still far from "large".
I think I would need assistance with debugging why postgresql behaves like it does.
Is there a defect with deleting data from a table that has multiple foreign keys referencing it from a certain table?
Is there a problem with the query optimizer that chooses the wrong plan when working on the foreign key constraints?
How do I inspect how the db works on the deletion of rows from the referencing tables?
Regards
Kristian
?On 2019-07-23, 16:33, "Kenneth Marshall" <ktm@rice.edu> wrote:
On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote:
> Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables
> and that has been running fine for a year. Performance was vastly improved. But since you
> can't have a unique index on a partitioned table in postgres 10, we haven't worked on that.
>
> Regards
> Kristian
Hi Kristian,
Why are you not partitioning the events and alerts tables as well? That
would eliminate this problem and you already have the infrastructure in
place to support the management since you are using it for the history
and trends tables.
Regards,
Ken
Resurs Bank AB
Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations
Ekslingan 8
Box 222 09, SE-25467 Helsingborg
Direkt Tfn:
Mobil: +46 728571483
Vxl: +46 42 382000
Fax:
E-post: Kristian.Ejvind@resurs.se
Webb: http://www.resursbank.se
--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"??????, ?? ??? ?????????? ??? ?? ??????, ?? ?????? ??? ??-???????? ?????? ????? ? ??? ????? ??? ????"
Attachment
Hi Maxim
Thanks for your advice, and let me start with your second email, which I'll copy here:
=====
Hi Kristian,
After comparing structure of zabbix tables with same in my zabbix installation I found one very weird difference.
Why type of events.eventid had been changed from default bigint to numeric?
I suspect that the difference between events.eventid (numeric) type and event_recovery.*_eventid (bigint) types might lead to inability of use index during foreign key checks.
Anyway it will be clearly visible on the pg_stat_xact_user_tables results (I now expect to see 3 sequential scan on event_recovery and may be on some other tables as well).
Kind Regards,
Maxim
=====
Well spotted! On closer examination it seems that data types are wrong in several places. I suspect that this comes
from the time when our Zabbix ran on a MySQL database, which was converted over to PostgreSQL a few years
ago. I agree this discrepancy is suspicious and I will continue to examine it.
Regarding your ideas in the email below, I can say that 1) is not valid, disk latency is in the range of a few ms.
This is the output from your recommended query, which seems to verify your suspicions.
zabbix_34=# begin; delete from zabbix.events where eventid = 7123123; select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc; rollback;
Time: 0.113 ms
Time: 4798.189 ms (00:04.798)
relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd
--------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------
41940 | zabbix | event_recovery | 3 | 35495224 | 0 | 0 | 0 | 0 | 1 | 0
41675 | zabbix | alerts | 1 | 544966 | 1 | 0 | 0 | 0 | 0 | 0
42573 | zabbix | problem | 2 | 13896 | 0 | 0 | 0 | 0 | 0 | 0
41943 | zabbix | event_tag | 1 | 22004 | 0 | 0 | 0 | 0 | 0 | 0
41649 | zabbix | acknowledges | 1 | 47 | 0 | 0 | 0 | 0 | 0 | 0
41951 | zabbix | events | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0
260215 | zabbix | event_suppress | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Senior Postgresql DBA
https://dataegret.com/
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"
Hi.
Just a short message, confirming that after we've altered the tables to have matching
types, deletes now take 1 ms, instead of 5 sec. Indexes are being used now.
Thanks for assistance.
Regards
Kristian
ps. would be nice with some warnings or indications in analyze output when this happens.
| ||||||||
| ||||||||
| ||||||||
From: Maxim Boguk <maxim.boguk@gmail.com>
Date: Wednesday, 24 July 2019 at 19:17
To: Kristian Ejvind <Kristian.Ejvind@resurs.se>
Cc: "pgsql-performance@lists.postgresql.org" <pgsql-performance@lists.postgresql.org>
Subject: Re: zabbix on postgresql - very slow delete of events
On Wed, Jul 24, 2019 at 6:12 PM Kristian Ejvind <Kristian.Ejvind@resurs.se> wrote:
Hi Maxim
Thanks for your advice, and let me start with your second email, which I'll copy here:
=====
Hi Kristian,
After comparing structure of zabbix tables with same in my zabbix installation I found one very weird difference.
Why type of events.eventid had been changed from default bigint to numeric?
I suspect that the difference between events.eventid (numeric) type and event_recovery.*_eventid (bigint) types might lead to inability of use index during foreign key checks.
Anyway it will be clearly visible on the pg_stat_xact_user_tables results (I now expect to see 3 sequential scan on event_recovery and may be on some other tables as well).
Kind Regards,
Maxim
=====
Well spotted! On closer examination it seems that data types are wrong in several places. I suspect that this comes
from the time when our Zabbix ran on a MySQL database, which was converted over to PostgreSQL a few years
ago. I agree this discrepancy is suspicious and I will continue to examine it.
Regarding your ideas in the email below, I can say that 1) is not valid, disk latency is in the range of a few ms.
This is the output from your recommended query, which seems to verify your suspicions.
zabbix_34=# begin; delete from zabbix.events where eventid = 7123123; select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc; rollback;
Time: 0.113 ms
Time: 4798.189 ms (00:04.798)
relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd
--------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------
41940 | zabbix | event_recovery | 3 | 35495224 | 0 | 0 | 0 | 0 | 1 | 0
41675 | zabbix | alerts | 1 | 544966 | 1 | 0 | 0 | 0 | 0 | 0
42573 | zabbix | problem | 2 | 13896 | 0 | 0 | 0 | 0 | 0 | 0
41943 | zabbix | event_tag | 1 | 22004 | 0 | 0 | 0 | 0 | 0 | 0
41649 | zabbix | acknowledges | 1 | 47 | 0 | 0 | 0 | 0 | 0 | 0
41951 | zabbix | events | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0
260215 | zabbix | event_suppress | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Hi Kristian,
This result definitely proves that indexes not used during foreign key checks (see that non-zero seq_scan counters for linked tables).
Only possible reason (IMHO) that wrong usage numeric in place of bigint.
I recommend change types of events.eventid (and any other similar fields) to bigint.
It should resolve your performance issues with deletes on events table (as additional bonus - bigint a lot faster and compact type than numeric).
--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"??????, ?? ??? ?????????? ??? ?? ??????, ?? ?????? ??? ??-???????? ?????? ????? ? ??? ????? ??? ????"