Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4 - Mailing list pgsql-performance

From Achilleas Mantzios
Subject Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4
Date
Msg-id 50da382b-1d47-4332-9279-8af0996f8088@cloud.gatewaynet.com
Whole thread Raw
In response to Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4  (Tomas Vondra <tomas@vondra.me>)
Responses Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4
List pgsql-performance


Στις 12/11/24 15:17, ο/η Tomas Vondra έγραψε:
On 11/12/24 11:37, Achilleas Mantzios - cloud wrote:
...

We run perf on both systems for 90 seconds to make sure that it included
at least two runs of the slow system, and we attach both outputs. We run
perf as :

sudo perf record -g -p <PID> -- sleep 90

it strikes me that we dont find similarities between the two, but we are
new to perf.

Not sure, but it seems the slow profile has a lot of writes. Not sure
why. Do both instances have the same work_mem value / available memory,
storatge?
In fact the plain vanilla installation, with all the defaults, runs this fast, please read further.

Maybe EXPLAIN VERBOSE would show if the remote query is the same ...

Maybe try setting log_temp_files=0 before running the query. Writing the
foreign scan result set into a temp file could be happening on one of
the machines only.

No temp files, the remote query shown by EXPLAIN (... VERBOSE) and ms sql it self is :

Remote query: SELECT NULL FROM [db_ro_non_delosnav].[ACDOC]

Aggregate  (cost=168359742.33..168359742.34 rows=1 width=8) (actual time=15100.248..15100.250 rows=1 loops=1)
  Output: count(*)
  ->  Foreign Scan on mssql_bdynacom."ACDOC"  (cost=200.00..168355533.86 rows=1683386 width=100) (actual time=3.783..14967.029 rows=1683386 loops=1)
        Output: "ID", "DOC_TYPE", "PARAST", "TRNS_DATE", "VIA_MNG", "VIA_MNG_ID", "FL_UPD", "NOTES", "REM_DATE", "REM_CLS", "OPN_CLS", "ACSITE_ID", "USERS_ID", "LAST_UPD_DA
TE", "LAST_UPD_USERS_ID", "OTH_TYPE", "OTH_TYPE_DESC", "OTH_APPL", "FL_PRN", "CREATE_DATE", "LOG_NOTE", "PER_ID", "ACDEPT_ID", "FL_DLT", "CMP_ID", "INTPER_ID", "FL_INT_TRNCL
S", "FL_ACTV", "OLD_ACDEPT_ID", "OLD_USERS_ID", "OLD_TRNS_DATE", "OLD_FL_UPD_DATE", "FL_FNLZ", "ERROR_NOTES", "FL_ATTCH", "OPDOC_ID", "FL_ONHOLD", "ONHOLD_NOTES", "FL_RVS",
"RVS_DOC_ID", "FL_REV", "REV_NO", "REV_DOC_ID", "REV_NOTES", "REV_DATE", "REV_USER_ID", "FL_PROT", "INTGR_DOC_ID", "INTGR_DOC_TP", "P_DOC_ID", "FL_OPN", "FL_MDL", "USERUPDAC
C_ID", "DATE_UPDACC", "USERIN_ID", "USERUPD_ID", "DATE_IN", "DATE_UPD", "DOC_SUBTYPE", "DOC_REF", "FRTRN_TYPE", "FL_ACC", "ATCH_DIR", "ATCH_REM", "KEYID", "OTH_APPL1", "RGST
R_ID", "ACDOC_LIST_NO", "EXEC_DATE", "EXEC_USER_ID", "EXEC_NOTES", "CNL_EXEC_DATE", "CNL_EXEC_USER_ID", "FL_EXEC", "CNL_EXEC_NOTES", "LOG_NOTE1", "USERREUPDACC_ID", "DATE_RE
UPDACC", "P_ACDOC_ID", "FL_AUTHORITY_APRV", "FL_APPROVE_TYPE", "FL_NOTINCLUDE_INAUTH", "FL_NO_OTHERCURRENCY", "FL_WFSTATUS", "FL_APRV", "USRACTV_LOG_ID", "USERIN_DNAME", "US
ERUPD_DNAME", "UROLEIN_ID", "UROLEIN_NAME", "UDEPTIN_ID", "UDEPTIN_NAME", "FL_LOAD_5805", "UROLEUPD_ID", "UROLEUPD_NAME", "CNTRY_ID", fl_togetback, "FULLPATH", "FL_UPD_OLD",
"FL_REOINTG", "FL_OINTGR_RESEND"
        Remote query: SELECT NULL FROM [db_ro_non_delosnav].[ACDOC]
Query Identifier: 6812542821581303630
Planning Time: 14029.724 ms
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 0.168 ms, Inlining 2.087 ms, Optimization 3.301 ms, Emission 3.564 ms, Total 9.121 ms
Execution Time: 15102.803 ms
(12 rows)

I gather this is how it computes the plan.

I think we have narrowed down the problem, and this is extremely strange :

It is not a matter of pgsql version 10 or 16, it is a matter of the postgres user, the user that owns the data dir(s) and the user of the postgres process. We  reproduced both the problem and the solution with all combinations of versions.

To sum it up :

Our sysadm created the system debian Debian GNU/Linux 12 (bookworm) with postgres as user 1000. Now at some point we realized that whenever we run a pgsql cluster with another user (I found that after spending two good days testing), the above query runs in about 1 second. With user postgres 1000 in 30 seconds. As you saw the perf output are completely different.

On the test VM, we removed the postgres user, recreated with uid=1003, chown -R all the mount points + table spaces, started postgres with the new postgres user , and bingo . The chown -R on the 5TB is instant... just saying ...

We are puzzled what can be causing this. Tomorrow we dig into GDB , *trace and the like.

If this rings any bells we would be more than grateful to know.

Also, the worse, is the suspicion that maybe our whole infra performance is affected. We hope it is only free-tds and tds_fdw .



regards

pgsql-performance by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4
Next
From: Tom Lane
Date:
Subject: Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4