Re: BUG #19382: Server crash at __nss_database_lookup - Mailing list pgsql-bugs

From surya poondla
Subject Re: BUG #19382: Server crash at __nss_database_lookup
Msg-id CAOVWO5rVBKsjG4YwO_PJQu2OBGp8qUdF1jineYY6Lm3zc6-KWQ@mail.gmail.com
In response to BUG #19382: Server crash at __nss_database_lookup  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #19382: Server crash at __nss_database_lookup
List pgsql-bugs

Hi Yuxiao, Kirill,

Thank you for the test cases.


I can reproduce this issue on PostgreSQL 17.6. I debugged it with lldb and found the root cause.

When a composite type is altered mid-transaction while a PL/pgSQL record variable holds data of that type, the server crashes because it interprets old data using the new type definition without performing type conversion.


The server crashes with this stack trace:

* thread #1, queue = 'main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x117e00000)
    frame #0: 0x0000000183c95320 libsystem_platform.dylib`_platform_memmove + 96
libsystem_platform.dylib`_platform_memmove:
->  0x183c95320 <+96>:  ldnp   q0, q1, [x1]
    0x183c95324 <+100>: add    x1, x1, #0x20
    0x183c95328 <+104>: subs   x2, x2, #0x20
    0x183c9532c <+108>: b.hi   0x183c95318    ; <+88>
Target 0: (postgres) stopped.
(lldb) bt
* thread #1, queue = 'main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x117e00000)
  * frame #0: 0x0000000183c95320 libsystem_platform.dylib`_platform_memmove + 96
    frame #1: 0x00000001030ef368 postgres`text_to_cstring(t=0x0000000117017b1c) at varlena.c:225:2
    frame #2: 0x00000001030f0e58 postgres`textout(fcinfo=0x000000016d5f1b98) at varlena.c:594:2
    frame #3: 0x000000010314ed14 postgres`FunctionCall1Coll(flinfo=0x0000000121808cd8, collation=0, arg1=4680940316) at fmgr.c:1139:11
    frame #4: 0x0000000103150880 postgres`OutputFunctionCall(flinfo=0x0000000121808cd8, val=4680940316) at fmgr.c:1685:25
    frame #5: 0x0000000103075c8c postgres`record_out(fcinfo=0x000000016d5f1d58) at rowtypes.c:435:11
    frame #6: 0x000000010314ed14 postgres`FunctionCall1Coll(flinfo=0x0000000121808a28, collation=0, arg1=4940960546) at fmgr.c:1139:11
    frame #7: 0x0000000103150880 postgres`OutputFunctionCall(flinfo=0x0000000121808a28, val=4940960546) at fmgr.c:1685:25
    frame #8: 0x000000010282fa30 postgres`printtup(slot=0x00000001218087a8, self=0x00000001170102d8) at printtup.c:360:16
    frame #9: 0x0000000102b8fdac postgres`ExecutePlan(queryDesc=0x0000000137010300, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x00000001170102d8) at execMain.c:1679:9
    frame #10: 0x0000000102b8fb98 postgres`standard_ExecutorRun(queryDesc=0x0000000137010300, direction=ForwardScanDirection, count=0, execute_once=false) at execMain.c:360:3
    frame #11: 0x0000000102b8f988 postgres`ExecutorRun(queryDesc=0x0000000137010300, direction=ForwardScanDirection, count=0, execute_once=false) at execMain.c:306:3
    frame #12: 0x0000000102ee2bd4 postgres`PortalRunSelect(portal=0x000000012782c500, forward=true, count=0, dest=0x00000001170102d8) at pquery.c:922:4
    frame #13: 0x0000000102ee2568 postgres`PortalRun(portal=0x000000012782c500, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x00000001170102d8, altdest=0x00000001170102d8, qc=0x000000016d5f21b8) at pquery.c:766:18
    frame #14: 0x0000000102edce9c postgres`exec_simple_query(query_string="SELECT bar();") at postgres.c:1278:10
    frame #15: 0x0000000102edbf6c postgres`PostgresMain(dbname="postgres", username="surya") at postgres.c:4767:7
    frame #16: 0x0000000102ed3594 postgres`BackendMain(startup_data="", startup_data_len=4) at backend_startup.c:106:2
    frame #17: 0x0000000102daf8f8 postgres`postmaster_child_launch(child_type=B_BACKEND, startup_data="", startup_data_len=4, client_sock=0x000000016d5f25b8) at launch_backend.c:277:3
    frame #18: 0x0000000102db7708 postgres`BackendStartup(client_sock=0x000000016d5f25b8) at postmaster.c:3624:8
    frame #19: 0x0000000102db4438 postgres`ServerLoop at postmaster.c:1678:6
    frame #20: 0x0000000102db3324 postgres`PostmasterMain(argc=3, argv=0x000060000321d420) at postmaster.c:1376:11
    frame #21: 0x0000000102c369c0 postgres`main(argc=3, argv=0x000060000321d420) at main.c:199:3
    frame #22: 0x00000001838bab98 dyld`start + 6076
(lldb)



The crash happens because textout() is called on integer data, and it interprets 1073741824 (2^30) as a memory pointer.


I set breakpoints at two critical points to trace the issue:


Breakpoint 1: ExpandedRecordGetDatum (when PL/pgSQL returns the record)

At this point, the record still has complete version information:


(lldb) p erh->er_tupdesc_id
(uint64) 2                      // Record was created with version 2

(lldb) p assign_record_type_identifier(erh->er_typeid, erh->er_typmod)
(uint64) 4                      // Current type is now version 4

(lldb) p erh->er_tupdesc->attrs[1].atttypid
(Oid) 23                        // Field b was INT4 when record was created

(lldb) p ((TypeCacheEntry*)lookup_type_cache(erh->er_typeid, 0x00100))->tupDesc->attrs[1].atttypid
(Oid) 25                        // Field b is now TEXT in current definition


Version mismatch detected (2 != 4). The record has integer data but the type definition changed to TEXT.


Breakpoint 2: record_out (when converting record to text for output)

After ExpandedRecordGetDatum flattens the record to HeapTupleHeader, the version information is lost:


(lldb) p tupType
(Oid) 32770                     // Only type OID preserved

(lldb) p tupTypmod
(int32) -1                      // Only typmod preserved

(lldb) p tupdesc->attrs[1].atttypid
(Oid) 25                        // Uses current definition: TEXT


When ExpandedRecordHeader is flattened to HeapTupleHeader, HeapTupleHeader only stores type OID and typmod but not the version identifier.


As a result, record_out() looks up the tuple descriptor by type OID and typmod alone, which returns the current type definition (version 4, field b = TEXT), while the actual data is still from version 2 (field b = INT, value = 1073741824).
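
To make the loss of version information concrete, here is a minimal standalone C model. It is not PostgreSQL code: the Toy* structs and functions are hypothetical stand-ins for ExpandedRecordHeader and HeapTupleHeader, and they keep only the fields discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the expanded record: it remembers which
 * version of the tuple descriptor its data was built with. */
typedef struct ToyExpandedRecord
{
    uint32_t type_oid;    /* composite type OID, e.g. 32770 */
    int32_t  typmod;      /* type modifier, e.g. -1 */
    uint64_t tupdesc_id;  /* descriptor version the data matches */
} ToyExpandedRecord;

/* Hypothetical stand-in for the flattened tuple: no version field. */
typedef struct ToyFlatTuple
{
    uint32_t type_oid;
    int32_t  typmod;
} ToyFlatTuple;

/* Flattening copies the type OID and typmod; tupdesc_id is dropped. */
static ToyFlatTuple
toy_flatten(const ToyExpandedRecord *erh)
{
    ToyFlatTuple flat;

    flat.type_oid = erh->type_oid;
    flat.typmod = erh->typmod;
    return flat;
}

/* Staleness can only be detected while the expanded form is available. */
static bool
toy_expanded_is_stale(const ToyExpandedRecord *erh, uint64_t current_id)
{
    return erh->tupdesc_id != current_id;
}
```

With tupdesc_id = 2 and a current identifier of 4, toy_expanded_is_stale() reports the mismatch, but nothing in the ToyFlatTuple returned by toy_flatten() lets a later reader make the same check.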


The crash happens in rowtypes.c, when record_out() calls textout() on field b. Since textout() expects a text pointer but receives an integer, it tries to dereference 0x40000000 (1073741824, i.e. 2^30), causing a segfault.



I believe the fix belongs in pl_exec.c, before the record is returned, where we still have access to erh->er_tupdesc_id. There we can compare erh->er_tupdesc_id with the current tupDesc_identifier; if they differ, the type was altered. For each field whose type changed, apply a conversion using exec_cast_value().

If the conversion fails or no cast exists, raise a proper error; otherwise, return the converted record with the updated version.

This prevents crashes by either converting the data (INT to TEXT which should work) or raising a clean error message instead of a segfault.
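
The check-and-convert idea can be sketched as a standalone C model. This is only an illustration under stated assumptions, not the patch: the Toy* names are hypothetical, the only "cast" modeled is INT4 to TEXT, and a false return stands in for raising an error. The OIDs 23 (INT4) and 25 (TEXT) are the ones seen in the debugger output above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_INT4OID 23
#define TOY_TEXTOID 25
#define TOY_MAX_FIELDS 8

/* Hypothetical field: a type tag plus storage for either representation. */
typedef struct ToyField
{
    uint32_t type_oid;
    int32_t  int_val;       /* valid when type_oid == TOY_INT4OID */
    char     text_val[16];  /* valid when type_oid == TOY_TEXTOID */
} ToyField;

/* Hypothetical record carrying the descriptor version it was built with. */
typedef struct ToyRecord
{
    uint64_t tupdesc_id;
    int      nfields;
    ToyField fields[TOY_MAX_FIELDS];
} ToyRecord;

/* Convert one field to target_oid; false means "no cast available". */
static bool
toy_cast_field(ToyField *f, uint32_t target_oid)
{
    if (f->type_oid == target_oid)
        return true;            /* type unchanged, nothing to do */
    if (f->type_oid == TOY_INT4OID && target_oid == TOY_TEXTOID)
    {
        /* 16 bytes is enough for any int32 in decimal plus the NUL */
        snprintf(f->text_val, sizeof f->text_val, "%d", (int) f->int_val);
        f->type_oid = TOY_TEXTOID;
        return true;
    }
    return false;
}

/*
 * If the record's version differs from the current one, convert every
 * field to the current type and stamp the record with the new version.
 * On failure the record may be partially converted; a real fix would
 * raise an error and discard it.
 */
static bool
toy_revalidate_record(ToyRecord *rec, uint64_t current_id,
                      const uint32_t *current_types, int ncurrent)
{
    if (rec->tupdesc_id == current_id)
        return true;            /* type was not altered */
    if (rec->nfields != ncurrent)
        return false;           /* column count changed: not handled here */
    for (int i = 0; i < rec->nfields; i++)
    {
        if (!toy_cast_field(&rec->fields[i], current_types[i]))
            return false;
    }
    rec->tupdesc_id = current_id;
    return true;
}
```

For a version-2 record whose field b holds the integer 1073741824, revalidating against version 4 with types {INT4, TEXT} turns field b into the string "1073741824" and updates the version; a field with no modeled cast makes the function return false instead of leaving mismatched data in place.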


I am working on a patch for this.

Kindly let me know your thoughts.
