Thread: Postgres v16.4 crashes with a segfault when memory >= 16GB
Hi Postgres team,
I've been observing this for the past month. I'm running PostgreSQL in Kubernetes through the CloudNativePG project. When I bump the pod's resource configuration to anything above 16 GB of memory, complex queries such as the one shared below cause the backend to fail with a segfault (see logs for details). The same query with a configuration below 16 GB of memory works fine.
Segfault log:
server process (PID 197) was terminated by signal 11: Segmentation fault","detail":"Failed process was running:
-- Define the variables\nWITH vars AS (\n SELECT\n 'f6dd1d37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id,\n
'I-39438' AS invoice_number,\n 'SO5626760' AS sales_order_number\n),\nrelated_ids AS (\n SELECT DISTINCT\n
f.id AS fulfillment_id,\n fp.id AS fulfillment_package_id,\n fpli.id AS fulfillment_package_line_item_id,\n
f.sales_order_number,\n f.brand_id,\n li.number AS invoice_number\n FROM\n cdm.fulfillment AS f\n JOIN\n
cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id\n JOIN\n cdm.fulfillment_package AS fp ON f.id = fp.fulfillment_id\n
JOIN\n cdm.fulfillment_package_line_item AS fpli ON fp.id = fpli.fulfillment_package_id\n CROSS JOIN\n vars\n
WHERE\n f.sales_order_number = vars.sales_order_number\n AND f.brand_id = vars.brand_id\n
AND li.number = vars.invoice_number\n),\nreconciled_charges AS (\n SELECT\n ri.sales_order_number,\n
ri.brand_id,\n ri.invoice_n","backend_type":"postmaster","query_id":"0"}}
Query:
WITH vars AS (
  SELECT
    'f6dd1d37-d0c6-4295-b0e6-6467437ef464'::uuid AS brand_id,
    'I-39438' AS invoice_number,
    'SO5626760' AS sales_order_number
),
related_ids AS (
  SELECT DISTINCT
    f.id AS fulfillment_id,
    fp.id AS fulfillment_package_id,
    fpli.id AS fulfillment_package_line_item_id,
    f.sales_order_number,
    f.brand_id,
    li.number AS invoice_number
  FROM cdm.fulfillment AS f
  JOIN cdm.logistics_invoice AS li ON f.logistics_invoice_id = li.id
  JOIN cdm.fulfillment_package AS fp ON f.id = fp.fulfillment_id
  JOIN cdm.fulfillment_package_line_item AS fpli ON fp.id = fpli.fulfillment_package_id
  CROSS JOIN vars
  WHERE f.sales_order_number = vars.sales_order_number
    AND f.brand_id = vars.brand_id
    AND li.number = vars.invoice_number
),
reconciled_charges AS (
  SELECT
    ri.sales_order_number,
    ri.brand_id,
    ri.invoice_number,
    rsc.difference_root_cause,
    rsc.difference_amount,
    COALESCE(sct_real.code, sct_synth.code) AS service_charge_type_code,
    CASE
      WHEN rsc.real_service_charge_id IS NOT NULL THEN 'REAL'
      WHEN rsc.synthetic_service_charge_id IS NOT NULL THEN 'SYNTHETIC'
    END AS charge_type
  FROM cdm.reconciled_service_charge AS rsc
  LEFT JOIN cdm.service_charge AS sc ON rsc.real_service_charge_id = sc.id
  LEFT JOIN cdm.service_charge_type AS sct_real ON sc.service_charge_type_id = sct_real.id
  LEFT JOIN cdm.synthetic_service_charge AS ssc ON rsc.synthetic_service_charge_id = ssc.id
  LEFT JOIN cdm.service_charge_type AS sct_synth ON ssc.charge_type_id = sct_synth.id
  INNER JOIN related_ids AS ri
    ON (sc.applied_to_type = 'Fulfillment' AND sc.applied_to_id = ri.fulfillment_id)
    OR (sc.applied_to_type = 'FulfillmentPackage' AND sc.applied_to_id = ri.fulfillment_package_id)
    OR (sc.applied_to_type = 'FulfillmentPackageLineItem' AND sc.applied_to_id = ri.fulfillment_package_line_item_id)
    OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT' AND ssc.applied_to_id = ri.fulfillment_id)
    OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT_PACKAGE' AND ssc.applied_to_id = ri.fulfillment_package_id)
    OR (ssc.applied_to_type = 'SYNTHETIC_FULFILLMENT_PACKAGE_LINE_ITEM' AND ssc.applied_to_id = ri.fulfillment_package_line_item_id)
  WHERE rsc.brand_id = (SELECT brand_id FROM vars)
    AND rsc.difference_status <> 'MATCHING'
)
SELECT
  sales_order_number,
  brand_id,
  invoice_number,
  difference_root_cause,
  SUM(difference_amount) AS total_difference_amount,
  SUM(CASE WHEN difference_amount < 0 THEN difference_amount ELSE 0 END) AS total_unfavorable_amount,
  SUM(CASE WHEN difference_amount > 0 THEN difference_amount ELSE 0 END) AS total_favorable_amount,
  COUNT(CASE WHEN difference_amount < 0 THEN 1 END) AS unfavorable_count,
  COUNT(CASE WHEN difference_amount > 0 THEN 1 END) AS favorable_count,
  COUNT(*) AS total_error_count,
  ARRAY_AGG(DISTINCT service_charge_type_code) AS services,
  ARRAY_AGG(DISTINCT charge_type) AS charge_types
FROM reconciled_charges
GROUP BY sales_order_number, brand_id, invoice_number, difference_root_cause
ORDER BY total_error_count DESC;
I'm happy to share more information to help identify the issue. I've also extracted core dumps from the error. Here is my YAML manifest for the CNPG cluster:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    config.linkerd.io/skip-inbound-ports: "5432"
    panfactum.com/admin-role: admin-implentio-pg-7649
    panfactum.com/db: "true"
    panfactum.com/db-type: PostgreSQL
    panfactum.com/reader-role: reader-implentio-pg-7649
    panfactum.com/service: pg-7649-pooler-rw.implentio
    panfactum.com/service-port: "5432"
    panfactum.com/superuser-role: superuser-implentio-pg-7649
    panfactum.com/vault-mount: db/implentio-pg-7649
  creationTimestamp: "2024-07-01T15:44:34Z"
  generation: 70
  labels:
    id: pg-pg-7649-0cbb8555d636b38a
    panfactum.com/environment: development
    panfactum.com/local: "false"
    panfactum.com/module: kube_pg_cluster
    panfactum.com/prevent-lifetime-eviction: "true"
    panfactum.com/region: us-west-2
    panfactum.com/root-module: implentio_db
    panfactum.com/scheduler: "true"
    panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d
    panfactum.com/stack-version: edge.24-08-12
    panfactum.com/workload: pg-pg-7649
  name: pg-7649
  namespace: implentio
  resourceVersion: "69072821"
  uid: ddb60f66-4b90-419f-8f4d-c3e42d716f55
spec:
  affinity:
    enablePodAntiAffinity: true
    podAntiAffinityType: required
    tolerations:
    - effect: NoSchedule
      key: spot
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: burstable
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: arm64
      operator: Equal
      value: "true"
    topologyKey: node.kubernetes.io/instance-type
  backup:
    barmanObjectStore:
      data:
        compression: bzip2
        jobs: 8
      destinationPath: s3://implentio-pg-7649-backup-6b62299e849216e0/
      s3Credentials:
        inheritFromIAMRole: true
      wal:
        compression: bzip2
        maxParallel: 8
    retentionPolicy: 7d
    target: prefer-standby
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
      postInitApplicationSQL:
      - REVOKE ALL ON SCHEMA public FROM PUBLIC;
      - CREATE ROLE reader NOINHERIT;
      - GRANT pg_read_all_data TO reader;
      - GRANT USAGE ON SCHEMA public TO reader;
      - GRANT USAGE ON SCHEMA cdm TO reader;
      - GRANT USAGE ON SCHEMA client TO reader;
      - CREATE ROLE writer NOINHERIT;
      - GRANT pg_write_all_data, pg_read_all_data TO writer;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO writer;
      - GRANT ALL PRIVILEGES ON SCHEMA cdm TO writer;
      - GRANT ALL PRIVILEGES ON SCHEMA client TO writer;
      - GRANT CONNECT ON DATABASE app TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer;
      - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd FROM pg_shadow WHERE usename=$1;'
      - REVOKE ALL ON FUNCTION user_search(text) FROM public;
      - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
      postInitSQL:
      - REVOKE ALL ON SCHEMA public FROM PUBLIC;
      - CREATE ROLE cnpg_pooler_pgbouncer WITH LOGIN;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA cdm TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA client TO cnpg_pooler_pgbouncer;
      - GRANT CONNECT ON DATABASE postgres TO cnpg_pooler_pgbouncer;
      - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd FROM pg_shadow WHERE usename=$1;'
      - REVOKE ALL ON FUNCTION user_search(text) FROM public;
      - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
  certificates:
    clientCASecret: pg-client-certs-331c
    replicationTLSSecret: pg-client-certs-331c
    serverCASecret: pg-server-certs-b04c
    serverTLSSecret: pg-server-certs-b04c
  enablePDB: false
  enableSuperuserAccess: true
  failoverDelay: 5
  imageName: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4
  inheritedMetadata:
    annotations:
      config.linkerd.io/skip-inbound-ports: "5432"
      linkerd.io/inject: enabled
      resize.topolvm.io/increase: 100%
      resize.topolvm.io/storage_limit: 200Gi
      resize.topolvm.io/threshold: 35%
    labels:
      id: pg-pg-7649-0cbb8555d636b38a
      panfactum.com/environment: development
      panfactum.com/local: "false"
      panfactum.com/module: kube_pg_cluster
      panfactum.com/prevent-lifetime-eviction: "true"
      panfactum.com/region: us-west-2
      panfactum.com/root-module: implentio_db
      panfactum.com/scheduler: "true"
      panfactum.com/stack-commit: d89564842ca3d4275e82c69ff8f01c640b566d0d
      panfactum.com/stack-version: edge.24-08-12
      panfactum.com/workload: pg-pg-7649
      pg-cluster: implentio-pg-7649
  instances: 3
  logLevel: info
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: true
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      dynamic_shared_memory_type: posix
      effective_cache_size: 6400MB
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      maintenance_work_mem: 1600MB
      max_connections: "150"
      max_parallel_workers: "32"
      max_replication_slots: "32"
      max_worker_processes: "32"
      shared_buffers: 4000MB
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      wal_keep_size: 1024MB
      wal_level: logical
      wal_log_hints: "on"
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
      work_mem: 27MB
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: switchover
  primaryUpdateStrategy: unsupervised
  priorityClassName: database
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    synchronizeReplicas:
      enabled: true
    updateInterval: 30
  resources:
    limits:
      memory: 20800Mi
    requests:
      cpu: "6"
      memory: 16000Mi
  schedulerName: panfactum
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::730335560480:role/pg-7649-20240701154426349300000002
  smartShutdownTimeout: 5
  startDelay: 600
  stopDelay: 70
  storage:
    pvcTemplate:
      resources:
        requests:
          storage: 20Gi
      storageClassName: ebs-standard
    resizeInUseVolumes: true
  superuserSecret:
    name: pg-7649-superuser-01bb512a09da19f2a01a4f1e79a5bb44dff309cf02c7afbb7b84aca8298765d4
  switchoverDelay: 70
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        id: pg-pg-7649-0cbb8555d636b38a
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
status:
  availableArchitectures:
  - goArch: amd64
    hash: 94527128605ac5100415106fe26c480531d094b3f36626e562a8135f342b89e4
  - goArch: arm64
    hash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
  certificates:
    clientCASecret: pg-client-certs-331c
    expirations:
      pg-client-certs-331c: 2024-09-02 15:44:29 +0000 UTC
      pg-server-certs-b04c: 2024-09-02 15:44:29 +0000 UTC
    replicationTLSSecret: pg-client-certs-331c
    serverAltDNSNames:
    - pg-7649-rw
    - pg-7649-rw.implentio
    - pg-7649-rw.implentio.svc
    - pg-7649-r
    - pg-7649-r.implentio
    - pg-7649-r.implentio.svc
    - pg-7649-ro
    - pg-7649-ro.implentio
    - pg-7649-ro.implentio.svc
    serverCASecret: pg-server-certs-b04c
    serverTLSSecret: pg-server-certs-b04c
  cloudNativePGCommitHash: 336ddf53
  cloudNativePGOperatorHash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
  conditions:
  - lastTransitionTime: "2024-08-21T17:28:36Z"
    message: Cluster is Ready
    reason: ClusterIsReady
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-21T17:28:03Z"
    message: Continuous archiving is working
    reason: ContinuousArchivingSuccess
    status: "True"
    type: ContinuousArchiving
  - lastTransitionTime: "2024-08-18T02:06:07Z"
    message: Backup was successful
    reason: LastBackupSucceeded
    status: "True"
    type: LastBackupSucceeded
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "6071197"
  currentPrimary: pg-7649-2
  currentPrimaryTimestamp: "2024-08-21T17:28:01.989805Z"
  firstRecoverabilityPoint: "2024-08-11T00:10:35Z"
  firstRecoverabilityPointByMethod:
    barmanObjectStore: "2024-08-11T00:10:35Z"
  healthyPVC:
  - pg-7649-1
  - pg-7649-2
  - pg-7649-5
  image: 730335560480.dkr.ecr.us-west-2.amazonaws.com/github/cloudnative-pg/postgresql:16.4
  instanceNames:
  - pg-7649-1
  - pg-7649-2
  - pg-7649-5
  instances: 3
  instancesReportedState:
    pg-7649-1:
      isPrimary: false
      timeLineID: 192
    pg-7649-2:
      isPrimary: true
      timeLineID: 192
    pg-7649-5:
      isPrimary: false
      timeLineID: 191
  instancesStatus:
    healthy:
    - pg-7649-1
    - pg-7649-5
    replicating:
    - pg-7649-2
  lastFailedBackup: "2024-07-28T00:09:04Z"
  lastSuccessfulBackup: "2024-08-18T02:06:06Z"
  lastSuccessfulBackupByMethod:
    barmanObjectStore: "2024-08-18T02:06:06Z"
  latestGeneratedNode: 5
  managedRolesStatus: {}
  phase: Cluster in healthy state
  poolerIntegrations:
    pgBouncerIntegration: {}
  pvcCount: 3
  readService: pg-7649-r
  readyInstances: 2
  secretsResourceVersion:
    applicationSecretVersion: "61181252"
    clientCaSecretVersion: "66383294"
    replicationSecretVersion: "66383294"
    serverCaSecretVersion: "66383293"
    serverSecretVersion: "66383293"
    superuserSecretVersion: "61180981"
  switchReplicaClusterStatus: {}
  targetPrimary: pg-7649-2
  targetPrimaryTimestamp: "2024-08-21T17:27:58.902780Z"
  timelineID: 192
  topology:
    instances:
      pg-7649-1: {}
      pg-7649-2: {}
      pg-7649-5: {}
    nodesUsed: 3
    successfullyExtracted: true
  writeService: pg-7649-rw
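Since the crash tracks the pod's memory configuration, it may also be worth confirming which memory-related settings the failing instance actually has in effect. A small sketch, assuming psql access to the primary; the parameter names mirror the manifest above:

-- Report the effective memory-related settings and where each value came from
-- (configuration file, default, etc.).
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
               'effective_cache_size', 'max_connections')
ORDER BY name;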
Weslley Braga <wbraga@implentio.com> writes:
> I've been observing this for the past month. I'm running PostgreSQL in
> Kubernetes through the CloudNativePG project. When I bump the pod's resource
> configuration to anything above 16 GB of memory, complex queries such as
> the one shared below cause the backend to fail with a segfault (see logs
> for details). The same query with a configuration below 16 GB of memory
> works fine.

Hmm, does turning jit = off affect it?

> I'm happy to share more information to help identify the issue. I've also
> extracted core dumps from the error.

The core dumps will be useless to anyone on any other platform or with
an even slightly different build of Postgres.  If you could extract
stack traces from them, that might be helpful:

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

                        regards, tom lane
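JIT only engages when the plan's total estimated cost exceeds jit_above_cost (100000 by default), so a quick sanity check is to compare a plain EXPLAIN estimate against the current settings. A rough sketch of that check, assuming psql access to the affected instance; the placeholder below stands in for the full statement from the first message:

SHOW jit;              -- is JIT enabled for this session?
SHOW jit_above_cost;   -- cost threshold at which JIT compilation kicks in (default 100000)

-- Plain EXPLAIN (without ANALYZE) does not run the query; it reports the planner's
-- total cost estimate, which is what the JIT threshold above is compared against.
EXPLAIN
SELECT 1;  -- substitute the full WITH vars AS (...) query from the first message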
Thank you for the reply. I will try with JIT turned off and report back, and I will also try to collect stack traces from the pods.
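A minimal sketch of that session-level test, assuming psql access to the failing pod; the same parameter could also be set cluster-wide under spec.postgresql.parameters in the CNPG manifest above:

-- Disable JIT for this session only, then re-run the failing query.
SET jit = off;
-- <re-run the WITH vars AS (...) query from the first message here>

-- Confirm the setting took effect for the session.
SHOW jit;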
On Wed, Aug 21, 2024 at 4:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Weslley Braga <wbraga@implentio.com> writes:
> > I've been observing this for the past month. I'm running PostgreSQL in
> > Kubernetes through the CloudNativePG project. When I bump the pod's resource
> > configuration to anything above 16 GB of memory, complex queries such as
> > the one shared below cause the backend to fail with a segfault (see logs
> > for details). The same query with a configuration below 16 GB of memory
> > works fine.
>
> Hmm, does turning jit = off affect it?
>
> > I'm happy to share more information to help identify the issue. I've also
> > extracted core dumps from the error.
>
> The core dumps will be useless to anyone on any other platform or with
> an even slightly different build of Postgres.  If you could extract
> stack traces from them, that might be helpful:
>
> https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
>
>                         regards, tom lane