Re: pgsql: Introduce pg_shmem_allocations_numa view - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pgsql: Introduce pg_shmem_allocations_numa view
Date
Msg-id 6c9f9f7e-947b-4fc3-bdb6-b0696d7492e5@vondra.me
Whole thread Raw
In response to Re: pgsql: Introduce pg_shmem_allocations_numa view  (Christoph Berg <myon@debian.org>)
List pgsql-hackers

On 6/23/25 21:57, Christoph Berg wrote:
> Re: Andres Freund
>> How confident are we that this isn't actually because we passed a bogus
>> address to the kernel or such? With this patch, are *any* pages recognized as
>> valid on the machines that triggered the error?
> 
> See upthread - the first 35 pages were ok, then a lot of -14.
> 
>> I wonder if we ought to report the failures as a separate "numa node"
>> (e.g. NULL as node id) instead ...
> 
> Did that now, using N+1 (== 1 here) for errors in this Debian i386
> environment (chroot on an amd64 host):
> 
> select * from pg_shmem_allocations_numa \crosstabview 
> 
>                       name                      │    0     │    1
> ────────────────────────────────────────────────┼──────────┼──────────
>  multixact_offset                               │    69632 │    65536
>  subtransaction                                 │   139264 │   131072
>  notify                                         │   139264 │        0
>  Shared Memory Stats                            │   188416 │   131072
>  serializable                                   │   188416 │    86016
>  PROCLOCK hash                                  │     4096 │        0
>  FinishedSerializableTransactions               │     4096 │        0
>  XLOG Ctl                                       │  2117632 │  2097152
>  Shared MultiXact State                         │     4096 │        0
>  Proc Header                                    │     4096 │        0
>  Archiver Data                                  │     4096 │        0
> .... more 0s in the last column ...
>  AioHandleData                                  │  1429504 │        0
>  Buffer Blocks                                  │ 67117056 │ 67108864
>  Buffer IO Condition Variables                  │   266240 │        0
>  Proc Array                                     │     4096 │        0
> .... more 0s
> (73 rows)
> 
> 
> There is something fishy with pg_buffercache. If I restart PG, I'm
> getting "Bad address" (errno 14), this time as return value of
> move_pages().
> 
> postgres =# select * from pg_buffercache_numa;
> DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:383
> 2025-06-23 19:41:41.315 UTC [1331894] ERROR:  failed NUMA pages inquiry: Bad address
> 2025-06-23 19:41:41.315 UTC [1331894] STATEMENT:  select * from pg_buffercache_numa;
> ERROR:  XX000: failed NUMA pages inquiry: Bad address
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:394
> 
> Repeated calls are fine.
> 

Huh. So it's only the first call that does this?

Can you maybe print the addresses passed to pg_numa_query_pages? I
wonder if there's some bug in how we fill that array. Not sure why would
it happen only on 32-bit systems, though.

I'll create a 32-bit VM so that I can try reproducing this.


regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Next
From: Melanie Plageman
Date:
Subject: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)