Postgres v16.4 crashes on segfault when memory >= 16gb - Mailing list pgsql-bugs

From Weslley Braga
Subject Postgres v16.4 crashes on segfault when memory >= 16gb
Date
Msg-id CAKeJUPB1E16fGBrXTO3JKuPv3VW0wywCDHd3Ece2h2uHwtJSkA@mail.gmail.com
Whole thread Raw
Responses Re: Postgres v16.4 crashes on segfault when memory >= 16gb
List pgsql-bugs

Hi Postgres team,

I've been observing for the past month. I'm running PostgreSQL in Kubernetes through the CloudNativePG project. When I bump the pods resource configuration to anything above 16gb I notice that complex queries such as the one shared below cause the runtime to fail with a segfault see logs for details. The same query with a configuration < 16gb memory works ok.

Segfault log:

server process (PID 197) was terminated by signal 11: Segmentation fault","detail":"Failed │
│  process was running: -- Define the variables\nWITH vars AS (\n    SELECT\n        'f6dd1d │
│ 37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id,\n        'I-39438' AS invoice_number,\n │
│         'SO5626760' AS sales_order_number\n),\nrelated_ids AS (\n    SELECT DISTINCT\n     │
│     f.id AS fulfillment_id,\n        fp.id AS fulfillment_package_id,\n        fpli.id AS  │
│ fulfillment_package_line_item_id,\n        f.sales_order_number,\n        f.brand_id,\n    │
│      li.number AS invoice_number\n    FROM\n        cdm.fulfillment AS f\n    JOIN\n       │
│   cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id\n    JOIN\n        cdm.ful │
│ fillment_package AS fp ON f.id = fp.fulfillment_id\n    JOIN\n        cdm.fulfillment_pack │
│ age_line_item AS fpli ON fp.id = fpli.fulfillment_package_id\n    CROSS JOIN\n        vars │
│ \n    WHERE\n        f.sales_order_number = vars.sales_order_number\n        AND f.brand_i │
│ d = vars.brand_id\n        AND li.number = vars.invoice_number\n),\nreconciled_charges AS  │
│ (\n    SELECT\n        ri.sales_order_number,\n        ri.brand_id,\n        ri.invoice_n" │
│ ,"backend_type":"postmaster","query_id":"0"}}


Query:

WITH vars AS ( SELECT 'f6dd1d37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id, 'I-39438' AS invoice_number, 'SO5626760' AS sales_order_number ), related_ids AS ( SELECT DISTINCT f.id AS fulfillment_id, fp.id AS fulfillment_package_id, fpli.id AS fulfillment_package_line_item_id, f.sales_order_number, f.brand_id, li.number AS invoice_number FROM cdm.fulfillment AS f JOIN cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id JOIN cdm.fulfillment_package AS fp ON f.id = fp.fulfillment_id JOIN cdm.fulfillment_package_line_item AS fpli ON fp.id = fpli.fulfillment_package_id CROSS JOIN vars WHERE f.sales_order_number = vars.sales_order_number AND f.brand_id = vars.brand_id AND li.number = vars.invoice_number ), reconciled_charges AS ( SELECT ri.sales_order_number, ri.brand_id, ri.invoice_number, rsc.difference_root_cause, rsc.difference_amount, COALESCE(sct_real.code, sct_synth.code) AS service_charge_type_code, CASE WHEN rsc.real_service_charge_id IS NOT NULL THEN 'REAL' WHEN rsc.synthetic_service_charge_id IS NOT NULL THEN 'SYNTHETIC' END AS charge_type FROM cdm.reconciled_service_charge AS rsc LEFT JOIN cdm.service_charge AS sc ON rsc.real_service_charge_id = sc.id LEFT JOIN cdm.service_charge_type AS sct_real ON sc.service_charge_type_id = sct_real.id LEFT JOIN cdm.synthetic_service_charge AS ssc ON rsc.synthetic_service_charge_id = ssc.id LEFT JOIN cdm.service_charge_type AS sct_synth ON ssc.charge_type_id = sct_synth.id INNER JOIN related_ids AS ri ON (sc.applied_to_type = 'Fulfillment' AND sc.applied_to_id = ri.fulfillment_id) OR (sc.applied_to_type = 'FulfillmentPackage' AND sc.applied_to_id = ri.fulfillment_package_id) OR (sc.applied_to_type = 'FulfillmentPackageLineItem' AND sc.applied_to_id = ri.fulfillment_package_line_item_id) OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT' AND ssc.applied_to_id = ri.fulfillment_id) OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT_PACKAGE' AND ssc.applied_to_id = ri.fulfillment_package_id) OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT_PACKAGE_LINE_ITEM' AND ssc.applied_to_id = ri.fulfillment_package_line_item_id) WHERE rsc.brand_id = (SELECT brand_id FROM vars) AND rsc.difference_status <> 'MATCHING' ) SELECT sales_order_number, brand_id, invoice_number, difference_root_cause, SUM(difference_amount) AS total_difference_amount, SUM(CASE WHEN difference_amount < 0 THEN difference_amount ELSE 0 END) AS total_unfavorable_amount, SUM(CASE WHEN difference_amount > 0 THEN difference_amount ELSE 0 END) AS total_favorable_amount, COUNT(CASE WHEN difference_amount < 0 THEN 1 END) AS unfavorable_count, COUNT(CASE WHEN difference_amount > 0 THEN 1 END) AS favorable_count, COUNT(*) AS total_error_count, ARRAY_AGG(DISTINCT service_charge_type_code) AS services, ARRAY_AGG(DISTINCT charge_type) AS charge_types FROM reconciled_charges GROUP BY sales_order_number, brand_id, invoice_number, difference_root_cause ORDER BY total_error_count DESC;


Here is my yaml manifest for the cnpg cluster:

apiVersion: postgresql.cnpg.io/v1 kind: Cluster metadata: annotations: config.linkerd.io/skip-inbound-ports: "5432" panfactum.com/admin-role: admin-implentio-pg-7649 panfactum.com/db: "true" panfactum.com/db-type: PostgreSQL panfactum.com/reader-role: reader-implentio-pg-7649 panfactum.com/service: pg-7649-pooler-rw.implentio panfactum.com/service-port: "5432" panfactum.com/superuser-role: superuser-implentio-pg-7649 panfactum.com/vault-mount: db/implentio-pg-7649 creationTimestamp: "2024-07-01T15:44:34Z" generation: 70 labels: id: pg-pg-7649-0cbb8555d636b38a panfactum.com/environment: development panfactum.com/local: "false" panfactum.com/module: kube_pg_cluster panfactum.com/prevent-lifetime-eviction: "true" panfactum.com/region: us-west-2 panfactum.com/root-module: implentio_db panfactum.com/scheduler: "true" panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d panfactum.com/stack-version: edge.24-08-12 panfactum.com/workload: pg-pg-7649 name: pg-7649 namespace: implentio resourceVersion: "69072821" uid: ddb60f66-4b90-419f-8f4d-c3e42d716f55 spec: affinity: enablePodAntiAffinity: true podAntiAffinityType: required tolerations: - effect: NoSchedule key: spot operator: Equal value: "true" - effect: NoSchedule key: burstable operator: Equal value: "true" - effect: NoSchedule key: arm64 operator: Equal value: "true" topologyKey: node.kubernetes.io/instance-type backup: barmanObjectStore: data: compression: bzip2 jobs: 8 destinationPath: s3://implentio-pg-7649-backup-6b62299e849216e0/ s3Credentials: inheritFromIAMRole: true wal: compression: bzip2 maxParallel: 8 retentionPolicy: 7d target: prefer-standby bootstrap: initdb: database: app encoding: UTF8 localeCType: C localeCollate: C owner: app postInitApplicationSQL: - REVOKE ALL ON SCHEMA public FROM PUBLIC; - CREATE ROLE reader NOINHERIT; - GRANT pg_read_all_data TO reader; - GRANT USAGE ON SCHEMA public TO reader; - GRANT USAGE ON SCHEMA cdm TO reader; - GRANT USAGE ON SCHEMA client TO reader; - CREATE ROLE writer NOINHERIT; - GRANT pg_write_all_data, pg_read_all_data TO writer; - GRANT ALL PRIVILEGES ON SCHEMA public TO writer; - GRANT ALL PRIVILEGES ON SCHEMA cdm TO writer; - GRANT ALL PRIVILEGES ON SCHEMA client TO writer; - GRANT CONNECT ON DATABASE app TO cnpg_pooler_pgbouncer; - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer; - GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer; - GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer; - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd FROM pg_shadow WHERE usename=$1;' - REVOKE ALL ON FUNCTION user_search(text) FROM public; - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer; postInitSQL: - REVOKE ALL ON SCHEMA public FROM PUBLIC; - CREATE ROLE cnpg_pooler_pgbouncer WITH LOGIN; - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer; - GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer; - GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer; - GRANT CONNECT ON DATABASE postgres TO cnpg_pooler_pgbouncer; - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd FROM pg_shadow WHERE usename=$1;' - REVOKE ALL ON FUNCTION user_search(text) FROM public; - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer; certificates: clientCASecret: pg-client-certs-331c replicationTLSSecret: pg-client-certs-331c serverCASecret: pg-server-certs-b04c serverTLSSecret: pg-server-certs-b04c enablePDB: false enableSuperuserAccess: true failoverDelay: 5 imageName: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4 inheritedMetadata: annotations: config.linkerd.io/skip-inbound-ports: "5432" linkerd.io/inject: enabled resize.topolvm.io/increase: 100% resize.topolvm.io/storage_limit: 200Gi resize.topolvm.io/threshold: 35% labels: id: pg-pg-7649-0cbb8555d636b38a panfactum.com/environment: development panfactum.com/local: "false" panfactum.com/module: kube_pg_cluster panfactum.com/prevent-lifetime-eviction: "true" panfactum.com/region: us-west-2 panfactum.com/root-module: implentio_db panfactum.com/scheduler: "true" panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d panfactum.com/stack-version: edge.24-08-12 panfactum.com/workload: pg-pg-7649 pg-cluster: implentio-pg-7649 instances: 3 logLevel: info maxSyncReplicas: 0 minSyncReplicas: 0 monitoring: customQueriesConfigMap: - key: queries name: cnpg-default-monitoring disableDefaultQueries: false enablePodMonitor: true postgresGID: 26 postgresUID: 26 postgresql: parameters: archive_mode: "on" archive_timeout: 5min dynamic_shared_memory_type: posix effective_cache_size: 6400MB log_destination: csvlog log_directory: /controller/log log_filename: postgres log_rotation_age: "0" log_rotation_size: "0" log_truncate_on_rotation: "false" logging_collector: "on" maintenance_work_mem: 1600MB max_connections: "150" max_parallel_workers: "32" max_replication_slots: "32" max_worker_processes: "32" shared_buffers: 4000MB shared_memory_type: mmap shared_preload_libraries: "" ssl_max_protocol_version: TLSv1.3 ssl_min_protocol_version: TLSv1.3 wal_keep_size: 1024MB wal_level: logical wal_log_hints: "on" wal_receiver_timeout: 5s wal_sender_timeout: 5s work_mem: 27MB syncReplicaElectionConstraint: enabled: false primaryUpdateMethod: switchover primaryUpdateStrategy: unsupervised priorityClassName: database replicationSlots: highAvailability: enabled: true slotPrefix: _cnpg_ synchronizeReplicas: enabled: true updateInterval: 30 resources: limits: memory: 20800Mi requests: cpu: "6" memory: 16000Mi schedulerName: panfactum serviceAccountTemplate: metadata: annotations: eks.amazonaws.com/role-arn: arn:aws:iam::730335560480:role/pg-7649-20240701154426349300000002 smartShutdownTimeout: 5 startDelay: 600 stopDelay: 70 storage: pvcTemplate: resources: requests: storage: 20Gi storageClassName: ebs-standard resizeInUseVolumes: true superuserSecret: name: pg-7649-superuser-01bb512a09da19f2a01a4f1e79a5bb44dff309cf02c7afbb7b84aca8298765d4 switchoverDelay: 70 topologySpreadConstraints: - labelSelector: matchLabels: id: pg-pg-7649-0cbb8555d636b38a maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule status: availableArchitectures: - goArch: amd64 hash: 94527128605ac5100415106fe26c480531d094b3f36626e562a8135f342b89e4 - goArch: arm64 hash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4 certificates: clientCASecret: pg-client-certs-331c expirations: pg-client-certs-331c: 2024-09-02 15:44:29 +0000 UTC pg-server-certs-b04c: 2024-09-02 15:44:29 +0000 UTC replicationTLSSecret: pg-client-certs-331c serverAltDNSNames: - pg-7649-rw - pg-7649-rw.implentio - pg-7649-rw.implentio.svc - pg-7649-r - pg-7649-r.implentio - pg-7649-r.implentio.svc - pg-7649-ro - pg-7649-ro.implentio - pg-7649-ro.implentio.svc serverCASecret: pg-server-certs-b04c serverTLSSecret: pg-server-certs-b04c cloudNativePGCommitHash: 336ddf53 cloudNativePGOperatorHash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4 conditions: - lastTransitionTime: "2024-08-21T17:28:36Z" message: Cluster is Ready reason: ClusterIsReady status: "True" type: Ready - lastTransitionTime: "2024-08-21T17:28:03Z" message: Continuous archiving is working reason: ContinuousArchivingSuccess status: "True" type: ContinuousArchiving - lastTransitionTime: "2024-08-18T02:06:07Z" message: Backup was successful reason: LastBackupSucceeded status: "True" type: LastBackupSucceeded configMapResourceVersion: metrics: cnpg-default-monitoring: "6071197" currentPrimary: pg-7649-2 currentPrimaryTimestamp: "2024-08-21T17:28:01.989805Z" firstRecoverabilityPoint: "2024-08-11T00:10:35Z" firstRecoverabilityPointByMethod: barmanObjectStore: "2024-08-11T00:10:35Z" healthyPVC: - pg-7649-1 - pg-7649-2 - pg-7649-5 image: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4 instanceNames: - pg-7649-1 - pg-7649-2 - pg-7649-5 instances: 3 instancesReportedState: pg-7649-1: isPrimary: false timeLineID: 192 pg-7649-2: isPrimary: true timeLineID: 192 pg-7649-5: isPrimary: false timeLineID: 191 instancesStatus: healthy: - pg-7649-1 - pg-7649-5 replicating: - pg-7649-2 lastFailedBackup: "2024-07-28T00:09:04Z" lastSuccessfulBackup: "2024-08-18T02:06:06Z" lastSuccessfulBackupByMethod: barmanObjectStore: "2024-08-18T02:06:06Z" latestGeneratedNode: 5 managedRolesStatus: {} phase: Cluster in healthy state poolerIntegrations: pgBouncerIntegration: {} pvcCount: 3 readService: pg-7649-r readyInstances: 2 secretsResourceVersion: applicationSecretVersion: "61181252" clientCaSecretVersion: "66383294" replicationSecretVersion: "66383294" serverCaSecretVersion: "66383293" serverSecretVersion: "66383293" superuserSecretVersion: "61180981" switchReplicaClusterStatus: {} targetPrimary: pg-7649-2 targetPrimaryTimestamp: "2024-08-21T17:27:58.902780Z" timelineID: 192 topology: instances: pg-7649-1: {} pg-7649-2: {} pg-7649-5: {} nodesUsed: 3 successfullyExtracted: true writeService: pg-7649-rw
I'm glad to share more information to further help identify the issue. I've also extracted core dumps from the error.

pgsql-bugs by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: BUG #18507: See C include file "ntstatus.h" for a description of the hexadecimal value.
Next
From: Tom Lane
Date:
Subject: Re: Postgres v16.4 crashes on segfault when memory >= 16gb