Hi Team,
Greetings!
We are facing an intermittent performance issue with a PL/pgSQL function and would appreciate your guidance.
**Environment:**
* PostgreSQL version: 16.9
* OS: Ubuntu 22
* Table size: ~80 million rows (~45 GB)
* CPU/Memory/IO normal
**Problem Description:**
We have multiple PL/pgSQL functions. One of the functions is executed around 20 times per day and on peak days more than 400 times. Under normal conditions, it completes in 2–4 minutes depending on input parameters.
However, intermittently (especially during peak load), the same function execution takes more than 35 minutes. We typically cancel the execution at that point.
**Observation:**
We noticed that making a very minor change in the function (for example, adding/removing a `LOWER()` condition in the WHERE clause causes the function to execute again in the normal 2–4 minutes.
Example:
Original condition:
RAISE NOTICE 'Step6 : %', clock_timestamp()::timestamp without time zone;
DROP TABLE IF EXISTS "TempOriginalDocument";
CREATE TEMP TABLE "TempOriginalDocument" AS
SELECT
pdra."Id" "PurchaseDocumentRecoId"
FROM
"Temp2BUnReconciledIds" tid
INNER JOIN report."DocumentD" r_pd ON tid."Id" = r_pd."Id"
INNER JOIN report."DocumentD" pdra
ON LOWER(pdra."OriginalDocumentNumber") = LOWER(r_pd."DocumentNumber")
.
.
WHERE
LOWER(pdra."OriginalDocumentNumber") IS NOT NULL
AND pdra."OriginalDocumentDate" IS NOT NULL;
replace with......
RAISE NOTICE 'Step6 : %', clock_timestamp()::timestamp without time zone;
DROP TABLE IF EXISTS "TempOriginalDocument";
CREATE TEMP TABLE "TempOriginalDocument" AS
SELECT
pdra."Id" "PurchaseDocumentRecoId"
FROM
"Temp2BUnReconciledIds" tid
INNER JOIN report."DocumentD" r_pd ON tid."Id" = r_pd."Id"
INNER JOIN report."DocumentD" pdra
ON LOWER(pdra."OriginalDocumentNumber") = LOWER(r_pd."DocumentNumber")
.
.
WHERE
pdra."OriginalDocumentNumber" IS NOT NULL
AND pdra."OriginalDocumentDate" IS NOT NULL;
Or vice versa.
After such a change, performance returns to normal.
FYI-- these problematic functions will be executing in a sequnce one at time
**Additional Findings:**
* We added `RAISE NOTICE` statements and observed that the delay does not occur at a fixed step—it can happen at different parts of the function at step 6,8, 62 etc.
* Running `VACUUM ANALYZE` on the main table did not resolve the issue.
* Seeing the postgresql logs I have observed that catlogue tables are being vacuumed too frequently
is it query plan caching or parameter-sensitive planning ?
Any guidance on how to diagnose or resolve this issue would be greatly appreciated.
Thanks in advance!
Regards,
OM