Thread: Postgres v16.4 crashes with a segfault when memory >= 16GB
Hi Postgres team,
I've been observing this for the past month. I'm running PostgreSQL in Kubernetes through the CloudNativePG project. When I bump the pod's resource configuration to anything above 16 GB of memory, complex queries such as the one shared below cause the backend to fail with a segfault (see logs for details). The same query with a configuration below 16 GB of memory works fine.
Segfault log:
server process (PID 197) was terminated by signal 11: Segmentation fault","detail":"Failed process was running:
-- Define the variables\nWITH vars AS (\n SELECT\n 'f6dd1d37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id,\n
'I-39438' AS invoice_number,\n 'SO5626760' AS sales_order_number\n),\nrelated_ids AS (\n SELECT DISTINCT\n
f.id AS fulfillment_id,\n fp.id AS fulfillment_package_id,\n fpli.id AS fulfillment_package_line_item_id,\n
f.sales_order_number,\n f.brand_id,\n li.number AS invoice_number\n FROM\n cdm.fulfillment AS f\n JOIN\n
cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id\n JOIN\n cdm.fulfillment_package AS fp ON f.id = fp.fulfillment_id\n
JOIN\n cdm.fulfillment_package_line_item AS fpli ON fp.id = fpli.fulfillment_package_id\n CROSS JOIN\n vars\n
WHERE\n f.sales_order_number = vars.sales_order_number\n AND f.brand_id = vars.brand_id\n
AND li.number = vars.invoice_number\n),\nreconciled_charges AS (\n SELECT\n ri.sales_order_number,\n
ri.brand_id,\n ri.invoice_n","backend_type":"postmaster","query_id":"0"}}
Query:
WITH vars AS (
  SELECT
    'f6dd1d37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id,
    'I-39438' AS invoice_number,
    'SO5626760' AS sales_order_number
),
related_ids AS (
  SELECT DISTINCT
    f.id AS fulfillment_id,
    fp.id AS fulfillment_package_id,
    fpli.id AS fulfillment_package_line_item_id,
    f.sales_order_number,
    f.brand_id,
    li.number AS invoice_number
  FROM cdm.fulfillment AS f
  JOIN cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id
  JOIN cdm.fulfillment_package AS fp ON f.id = fp.fulfillment_id
  JOIN cdm.fulfillment_package_line_item AS fpli ON fp.id = fpli.fulfillment_package_id
  CROSS JOIN vars
  WHERE f.sales_order_number = vars.sales_order_number
    AND f.brand_id = vars.brand_id
    AND li.number = vars.invoice_number
),
reconciled_charges AS (
  SELECT
    ri.sales_order_number,
    ri.brand_id,
    ri.invoice_number,
    rsc.difference_root_cause,
    rsc.difference_amount,
    COALESCE(sct_real.code, sct_synth.code) AS service_charge_type_code,
    CASE
      WHEN rsc.real_service_charge_id IS NOT NULL THEN 'REAL'
      WHEN rsc.synthetic_service_charge_id IS NOT NULL THEN 'SYNTHETIC'
    END AS charge_type
  FROM cdm.reconciled_service_charge AS rsc
  LEFT JOIN cdm.service_charge AS sc ON rsc.real_service_charge_id = sc.id
  LEFT JOIN cdm.service_charge_type AS sct_real ON sc.service_charge_type_id = sct_real.id
  LEFT JOIN cdm.synthetic_service_charge AS ssc ON rsc.synthetic_service_charge_id = ssc.id
  LEFT JOIN cdm.service_charge_type AS sct_synth ON ssc.charge_type_id = sct_synth.id
  INNER JOIN related_ids AS ri
    ON (sc.applied_to_type = 'Fulfillment' AND sc.applied_to_id = ri.fulfillment_id)
    OR (sc.applied_to_type = 'FulfillmentPackage' AND sc.applied_to_id = ri.fulfillment_package_id)
    OR (sc.applied_to_type = 'FulfillmentPackageLineItem' AND sc.applied_to_id = ri.fulfillment_package_line_item_id)
    OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT' AND ssc.applied_to_id = ri.fulfillment_id)
    OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT_PACKAGE' AND ssc.applied_to_id = ri.fulfillment_package_id)
    OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT_PACKAGE_LINE_ITEM' AND ssc.applied_to_id = ri.fulfillment_package_line_item_id)
  WHERE rsc.brand_id = (SELECT brand_id FROM vars)
    AND rsc.difference_status <> 'MATCHING'
)
SELECT
  sales_order_number,
  brand_id,
  invoice_number,
  difference_root_cause,
  SUM(difference_amount) AS total_difference_amount,
  SUM(CASE WHEN difference_amount < 0 THEN difference_amount ELSE 0 END) AS total_unfavorable_amount,
  SUM(CASE WHEN difference_amount > 0 THEN difference_amount ELSE 0 END) AS total_favorable_amount,
  COUNT(CASE WHEN difference_amount < 0 THEN 1 END) AS unfavorable_count,
  COUNT(CASE WHEN difference_amount > 0 THEN 1 END) AS favorable_count,
  COUNT(*) AS total_error_count,
  ARRAY_AGG(DISTINCT service_charge_type_code) AS services,
  ARRAY_AGG(DISTINCT charge_type) AS charge_types
FROM reconciled_charges
GROUP BY sales_order_number, brand_id, invoice_number, difference_root_cause
ORDER BY total_error_count DESC;
I'm happy to share more information to help identify the issue. I've also extracted core dumps from the error. Here is my YAML manifest for the CNPG cluster:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    config.linkerd.io/skip-inbound-ports: "5432"
    panfactum.com/admin-role: admin-implentio-pg-7649
    panfactum.com/db: "true"
    panfactum.com/db-type: PostgreSQL
    panfactum.com/reader-role: reader-implentio-pg-7649
    panfactum.com/service: pg-7649-pooler-rw.implentio
    panfactum.com/service-port: "5432"
    panfactum.com/superuser-role: superuser-implentio-pg-7649
    panfactum.com/vault-mount: db/implentio-pg-7649
  creationTimestamp: "2024-07-01T15:44:34Z"
  generation: 70
  labels:
    id: pg-pg-7649-0cbb8555d636b38a
    panfactum.com/environment: development
    panfactum.com/local: "false"
    panfactum.com/module: kube_pg_cluster
    panfactum.com/prevent-lifetime-eviction: "true"
    panfactum.com/region: us-west-2
    panfactum.com/root-module: implentio_db
    panfactum.com/scheduler: "true"
    panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d
    panfactum.com/stack-version: edge.24-08-12
    panfactum.com/workload: pg-pg-7649
  name: pg-7649
  namespace: implentio
  resourceVersion: "69072821"
  uid: ddb60f66-4b90-419f-8f4d-c3e42d716f55
spec:
  affinity:
    enablePodAntiAffinity: true
    podAntiAffinityType: required
    tolerations:
    - effect: NoSchedule
      key: spot
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: burstable
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: arm64
      operator: Equal
      value: "true"
    topologyKey: node.kubernetes.io/instance-type
  backup:
    barmanObjectStore:
      data:
        compression: bzip2
        jobs: 8
      destinationPath: s3://implentio-pg-7649-backup-6b62299e849216e0/
      s3Credentials:
        inheritFromIAMRole: true
      wal:
        compression: bzip2
        maxParallel: 8
    retentionPolicy: 7d
    target: prefer-standby
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
      postInitApplicationSQL:
      - REVOKE ALL ON SCHEMA public FROM PUBLIC;
      - CREATE ROLE reader NOINHERIT;
      - GRANT pg_read_all_data TO reader;
      - GRANT USAGE ON SCHEMA public TO reader;
      - GRANT USAGE ON SCHEMA cdm TO reader;
      - GRANT USAGE ON SCHEMA client TO reader;
      - CREATE ROLE writer NOINHERIT;
      - GRANT pg_write_all_data, pg_read_all_data TO writer;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO writer;
      - GRANT ALL PRIVILEGES ON SCHEMA cdm TO writer;
      - GRANT ALL PRIVILEGES ON SCHEMA client TO writer;
      - GRANT CONNECT ON DATABASE app TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer;
      - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd FROM pg_shadow WHERE usename=$1;'
      - REVOKE ALL ON FUNCTION user_search(text) FROM public;
      - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
      postInitSQL:
      - REVOKE ALL ON SCHEMA public FROM PUBLIC;
      - CREATE ROLE cnpg_pooler_pgbouncer WITH LOGIN;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer;
      - GRANT CONNECT ON DATABASE postgres TO cnpg_pooler_pgbouncer;
      - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd FROM pg_shadow WHERE usename=$1;'
      - REVOKE ALL ON FUNCTION user_search(text) FROM public;
      - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
  certificates:
    clientCASecret: pg-client-certs-331c
    replicationTLSSecret: pg-client-certs-331c
    serverCASecret: pg-server-certs-b04c
    serverTLSSecret: pg-server-certs-b04c
  enablePDB: false
  enableSuperuserAccess: true
  failoverDelay: 5
  imageName: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4
  inheritedMetadata:
    annotations:
      config.linkerd.io/skip-inbound-ports: "5432"
      linkerd.io/inject: enabled
      resize.topolvm.io/increase: 100%
      resize.topolvm.io/storage_limit: 200Gi
      resize.topolvm.io/threshold: 35%
    labels:
      id: pg-pg-7649-0cbb8555d636b38a
      panfactum.com/environment: development
      panfactum.com/local: "false"
      panfactum.com/module: kube_pg_cluster
      panfactum.com/prevent-lifetime-eviction: "true"
      panfactum.com/region: us-west-2
      panfactum.com/root-module: implentio_db
      panfactum.com/scheduler: "true"
      panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d
      panfactum.com/stack-version: edge.24-08-12
      panfactum.com/workload: pg-pg-7649
      pg-cluster: implentio-pg-7649
  instances: 3
  logLevel: info
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: true
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      dynamic_shared_memory_type: posix
      effective_cache_size: 6400MB
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      maintenance_work_mem: 1600MB
      max_connections: "150"
      max_parallel_workers: "32"
      max_replication_slots: "32"
      max_worker_processes: "32"
      shared_buffers: 4000MB
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      wal_keep_size: 1024MB
      wal_level: logical
      wal_log_hints: "on"
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
      work_mem: 27MB
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: switchover
  primaryUpdateStrategy: unsupervised
  priorityClassName: database
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    synchronizeReplicas:
      enabled: true
    updateInterval: 30
  resources:
    limits:
      memory: 20800Mi
    requests:
      cpu: "6"
      memory: 16000Mi
  schedulerName: panfactum
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::730335560480:role/pg-7649-20240701154426349300000002
  smartShutdownTimeout: 5
  startDelay: 600
  stopDelay: 70
  storage:
    pvcTemplate:
      resources:
        requests:
          storage: 20Gi
      storageClassName: ebs-standard
    resizeInUseVolumes: true
  superuserSecret:
    name: pg-7649-superuser-01bb512a09da19f2a01a4f1e79a5bb44dff309cf02c7afbb7b84aca8298765d4
  switchoverDelay: 70
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        id: pg-pg-7649-0cbb8555d636b38a
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
status:
  availableArchitectures:
  - goArch: amd64
    hash: 94527128605ac5100415106fe26c480531d094b3f36626e562a8135f342b89e4
  - goArch: arm64
    hash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
  certificates:
    clientCASecret: pg-client-certs-331c
    expirations:
      pg-client-certs-331c: 2024-09-02 15:44:29 +0000 UTC
      pg-server-certs-b04c: 2024-09-02 15:44:29 +0000 UTC
    replicationTLSSecret: pg-client-certs-331c
    serverAltDNSNames:
    - pg-7649-rw
    - pg-7649-rw.implentio
    - pg-7649-rw.implentio.svc
    - pg-7649-r
    - pg-7649-r.implentio
    - pg-7649-r.implentio.svc
    - pg-7649-ro
    - pg-7649-ro.implentio
    - pg-7649-ro.implentio.svc
    serverCASecret: pg-server-certs-b04c
    serverTLSSecret: pg-server-certs-b04c
  cloudNativePGCommitHash: 336ddf53
  cloudNativePGOperatorHash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
  conditions:
  - lastTransitionTime: "2024-08-21T17:28:36Z"
    message: Cluster is Ready
    reason: ClusterIsReady
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-21T17:28:03Z"
    message: Continuous archiving is working
    reason: ContinuousArchivingSuccess
    status: "True"
    type: ContinuousArchiving
  - lastTransitionTime: "2024-08-18T02:06:07Z"
    message: Backup was successful
    reason: LastBackupSucceeded
    status: "True"
    type: LastBackupSucceeded
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "6071197"
  currentPrimary: pg-7649-2
  currentPrimaryTimestamp: "2024-08-21T17:28:01.989805Z"
  firstRecoverabilityPoint: "2024-08-11T00:10:35Z"
  firstRecoverabilityPointByMethod:
    barmanObjectStore: "2024-08-11T00:10:35Z"
  healthyPVC:
  - pg-7649-1
  - pg-7649-2
  - pg-7649-5
  image: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4
  instanceNames:
  - pg-7649-1
  - pg-7649-2
  - pg-7649-5
  instances: 3
  instancesReportedState:
    pg-7649-1:
      isPrimary: false
      timeLineID: 192
    pg-7649-2:
      isPrimary: true
      timeLineID: 192
    pg-7649-5:
      isPrimary: false
      timeLineID: 191
  instancesStatus:
    healthy:
    - pg-7649-1
    - pg-7649-5
    replicating:
    - pg-7649-2
  lastFailedBackup: "2024-07-28T00:09:04Z"
  lastSuccessfulBackup: "2024-08-18T02:06:06Z"
  lastSuccessfulBackupByMethod:
    barmanObjectStore: "2024-08-18T02:06:06Z"
  latestGeneratedNode: 5
  managedRolesStatus: {}
  phase: Cluster in healthy state
  poolerIntegrations:
    pgBouncerIntegration: {}
  pvcCount: 3
  readService: pg-7649-r
  readyInstances: 2
  secretsResourceVersion:
    applicationSecretVersion: "61181252"
    clientCaSecretVersion: "66383294"
    replicationSecretVersion: "66383294"
    serverCaSecretVersion: "66383293"
    serverSecretVersion: "66383293"
    superuserSecretVersion: "61180981"
  switchReplicaClusterStatus: {}
  targetPrimary: pg-7649-2
  targetPrimaryTimestamp: "2024-08-21T17:27:58.902780Z"
  timelineID: 192
  topology:
    instances:
      pg-7649-1: {}
      pg-7649-2: {}
      pg-7649-5: {}
    nodesUsed: 3
    successfullyExtracted: true
  writeService: pg-7649-rw
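Since the crash tracks the pod's memory configuration, it may also be worth confirming which memory-related settings the failing instance actually has in effect. A small sketch, assuming psql access to the primary; the parameter names mirror the manifest above:

-- Report the effective memory-related settings and where each value came from
-- (configuration file, default, etc.).
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
               'effective_cache_size', 'max_connections')
ORDER BY name;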
Weslley Braga <wbraga@implentio.com> writes:
> I've been observing this for the past month. I'm running PostgreSQL in
> Kubernetes through the CloudNativePG project. When I bump the pod's resource
> configuration to anything above 16 GB of memory, complex queries such as
> the one shared below cause the backend to fail with a segfault (see logs
> for details). The same query with a configuration below 16 GB of memory
> works fine.

Hmm, does turning jit = off affect it?

> I'm happy to share more information to help identify the issue. I've also
> extracted core dumps from the error.

The core dumps will be useless to anyone on any other platform or with
an even slightly different build of Postgres.  If you could extract
stack traces from them, that might be helpful:

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

                        regards, tom lane
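JIT only engages when the plan's total estimated cost exceeds jit_above_cost (100000 by default), so a quick sanity check is to compare a plain EXPLAIN estimate against the current settings. A rough sketch of that check, assuming psql access to the affected instance; the placeholder below stands in for the full statement from the first message:

SHOW jit;              -- is JIT enabled for this session?
SHOW jit_above_cost;   -- cost threshold at which JIT compilation kicks in (default 100000)

-- Plain EXPLAIN (without ANALYZE) does not run the query; it reports the planner's
-- total cost estimate, which is what the JIT threshold above is compared against.
EXPLAIN
SELECT 1;  -- substitute the full WITH vars AS (...) query from the first message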
Thank you for the reply. I will try with JIT turned off and report back, and I will also try to collect stack traces from the pods.
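A minimal sketch of that session-level test, assuming psql access to the failing pod; the same parameter could also be set cluster-wide under spec.postgresql.parameters in the CNPG manifest above:

-- Disable JIT for this session only, then re-run the failing query.
SET jit = off;
-- <re-run the WITH vars AS (...) query from the first message here>

-- Confirm the setting took effect for the session.
SHOW jit;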
On Wed, Aug 21, 2024 at 4:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Weslley Braga <wbraga@implentio.com> writes:
> > I've been observing this for the past month. I'm running PostgreSQL in
> > Kubernetes through the CloudNativePG project. When I bump the pod's resource
> > configuration to anything above 16 GB of memory, complex queries such as
> > the one shared below cause the backend to fail with a segfault (see logs
> > for details). The same query with a configuration below 16 GB of memory
> > works fine.
>
> Hmm, does turning jit = off affect it?
>
> > I'm happy to share more information to help identify the issue. I've also
> > extracted core dumps from the error.
>
> The core dumps will be useless to anyone on any other platform or with
> an even slightly different build of Postgres.  If you could extract
> stack traces from them, that might be helpful:
>
> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
>
>                         regards, tom lane