Thread: Suddenly all queries moved to seq scan
Hi,
We are using PostgreSQL 10 in our production database. We have around 890 requests/s at peak time.
We have 1 primary and 4 replica (standby) servers in the same Postgres cluster.
Two days ago we applied some patches on the primary server and restarted it. We didn't do anything on the secondary servers.
The next day, after about 18 hours, all queries on the secondary servers started taking far too long. Queries that used to finish in 2 seconds started taking 80 seconds. Almost all queries behaved the same way.
After half an hour of outage we restarted all DB servers and the system went back to normal.
We are still not able to understand the root cause. We couldn't find any errors or fatal messages in the logs. During the incident, the disk on one of the read servers was full. We couldn't see any replication lag or query cancellations due to replication.
Please help.
Regards
Sreejith
Solutions for Care Anywhere
dWise HealthCare IT Solutions Pvt. Ltd. | www.lifetrenz.com
Disclaimer: The information and attachments contained in this email are intended for exclusive use of the addressee(s) and may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and destroy all copies of this message and any attachments. The views expressed in this email are, unless otherwise stated, those of the author and not those of dWise HealthCare IT Solutions or its management.
On 11/20/24 12:50, Sreejith P wrote:
> The next day, after about 18 hours, all queries on the secondary servers started taking far too long. Queries that used to finish in 2 seconds started taking 80 seconds. Almost all queries behaved the same way.

Still, you say seq scan: how do you know that? Did you run EXPLAIN (BUFFERS, ANALYZE, SETTINGS)? If this is indeed the case, then you need to check your indexes. Check when they were last used/scanned and whether they are still usable. Did the disk error affect any of your PostgreSQL tablespaces / tables / indexes?
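For reference, a minimal sketch of those checks (the query, table, and index names below are placeholders, and EXPLAIN's SETTINGS option only exists in PostgreSQL 12 and later, so on 10 you would drop it):

  -- Capture the actual plan of one of the slow queries (placeholder query):
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM orders WHERE customer_id = 42;

  -- How often each index has been scanned since statistics were last reset:
  SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
  FROM pg_stat_user_indexes
  ORDER BY idx_scan ASC;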
> On 20 Nov 2024, at 11:50, Sreejith P <sreejith@lifetrenz.com> wrote:
>
> We are using PostgreSQL 10 in our production database. We have around 890 requests/s at peak time.

PostgreSQL 10 is well out of support and does not receive bugfixes or security
fixes, you should plan a migration to a supported version sooner rather than
later.

> Two days ago we applied some patches on the primary server and restarted it. We didn't do anything on the secondary servers.

Patches to the operating system, postgres, another application?

> The next day, after about 18 hours, all queries on the secondary servers started taking far too long. Queries that used to finish in 2 seconds started taking 80 seconds. Almost all queries behaved the same way.
>
> After half an hour of outage we restarted all DB servers and the system went back to normal.
>
> We are still not able to understand the root cause. We couldn't find any errors or fatal messages in the logs. During the incident, the disk on one of the read servers was full. We couldn't see any replication lag or query cancellations due to replication.

You say that all queries started doing sequential scans, is that an assumption
from queries being slow or did you capture plans for the queries which can be
compared against "normal" production plans?

--
Daniel Gustafsson
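If no plans were captured during the incident, one way to collect them automatically next time (a sketch, not something from this thread; the threshold is illustrative, and shared_preload_libraries needs a server restart) is the auto_explain contrib module:

  # postgresql.conf (shared_preload_libraries requires a server restart)
  shared_preload_libraries = 'auto_explain'
  auto_explain.log_min_duration = '5s'   # log the plan of any statement slower than 5 seconds
  auto_explain.log_analyze = on          # include actual timings and row counts (adds some overhead)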
Make sure to run ANALYZE on the entire database; using vacuumdb may be faster.
Also, check for invalid indexes.
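For example (a sketch, not from the thread): something like vacuumdb --all --analyze-only --jobs=4 re-analyzes every database with four parallel workers, and a query along these lines lists indexes that are flagged invalid (for instance after a failed CREATE INDEX CONCURRENTLY):

  SELECT n.nspname AS schema_name, c.relname AS index_name
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE NOT i.indisvalid;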
Efrain J. Berdecia
On Wednesday, November 20, 2024 at 08:02:36 AM EST, Daniel Gustafsson <daniel@yesql.se> wrote:
> On 20 Nov 2024, at 11:50, Sreejith P <sreejith@lifetrenz.com> wrote:
> We are using PostgreSQL 10 in our production database. We have around 890 requests/s at peak time.
PostgreSQL 10 is well out of support and does not receive bugfixes or security
fixes, you should plan a migration to a supported version sooner rather than
later.
> Two days ago we applied some patches on the primary server and restarted it. We didn't do anything on the secondary servers.
Patches to the operating system, postgres, another application?
> The next day, after about 18 hours, all queries on the secondary servers started taking far too long. Queries that used to finish in 2 seconds started taking 80 seconds. Almost all queries behaved the same way.
>
> After half an hour of outage we restarted all db servers and system back to normal.
>
> We are still not able to understand the root cause. We couldn't find any errors or fatal messages in the logs. During the incident, the disk on one of the read servers was full. We couldn't see any replication lag or query cancellations due to replication.
You say that all queries started doing sequential scans, is that an assumption
from queries being slow or did you capture plans for the queries which can be
compared against "normal" production plans?
--
Daniel Gustafsson
> On 20 Nov 2024, at 6:32 PM, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> Patches to the operating system, postgres, another application?

PostgreSQL Common 10.23-6.

> You say that all queries started doing sequential scans, is that an assumption
> from queries being slow or did you capture plans for the queries which can be
> compared against "normal" production plans?

Queries that were taking 20 ms started taking 60 seconds, so we analysed the queries to understand the query plans. There we found that the planner was choosing seq scans instead of index scans.

I would like to add one more point: a DELETE query had been running in the DB for 2 days, deleting around 80 million records.

--
Sreejith
On Wed, Nov 20, 2024 at 9:09 AM Sreejith P <sreejith@lifetrenz.com> wrote:
> Queries that were taking 20 ms started taking 60 seconds, so we analysed the queries to understand the query plans. There we found that the planner was choosing seq scans instead of index scans.
> I would like to add one more point: a DELETE query had been running in the DB for 2 days, deleting around 80 million records.
This can cause exactly this problem: the growing number of dead tuples, combined with autovacuum not running, can skew the statistics so that the planner favors a sequential scan over an index scan. Given how long the delete ran, it would also have held locks on a large number of data pages and tuples, which can block other queries and prevent autovacuum/analyze from keeping up.
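A quick way to check that hypothesis (run on the primary, where autovacuum operates; a generic sketch, nothing specific to this system) is to look at dead-tuple counts and the last autovacuum/analyze times:

  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 20;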
It is best to do bulk deletes in batches, with a rest period between batches so that autovacuum and analyze can keep up. Doing deletes in batches also reduces the amount of resources consumed at any one time.
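A rough sketch of that pattern (the table, column, and batch size are made-up placeholders; run the statement in a loop from the application or a script, committing after each batch, until it deletes zero rows):

  DELETE FROM big_table
  WHERE id IN (
      SELECT id
      FROM big_table
      WHERE created_at < '2023-01-01'   -- the rows being purged
      ORDER BY id
      LIMIT 10000                       -- batch size
  );
  -- Optionally pause between batches, and VACUUM (ANALYZE) big_table afterwards,
  -- so autovacuum/analyze can keep the statistics current.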