Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory |
| Date | |
| Msg-id | aeae820d-9a58-462e-b3aa-94526a031389@vondra.me |
| In response to | Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
On 5/4/26 15:56, Andres Freund wrote:
> Hi,
>
> On 2026-05-04 10:18:31 +0200, Tomas Vondra wrote:
>> A couple of days ago we got a report regarding an incorrect shmem size
>> calculation in btree [1], leading to a buffer overflow / memory
>> corruption. This kind of issue has become much less common in our code,
>> thanks to valgrind and similar tools. But it took me a while to realize
>> valgrind won't catch this one, because we only instrument private
>> memory (palloc et al), while shmem is left alone.
>>
>> I was wondering if it's feasible to improve this. Attached is a trivial
>> patch that adjusts shm_toc.c to add a couple of NOACCESS bytes after
>> each entry in the segment. It seems to do the trick - with this we get
>> a reasonable valgrind report of the invalid accesses (for the
>> reproducer provided in the bug report, before it got fixed by
>> 748d871b7c). See the attached .log for an example. It's much better
>> than the confusing crashes due to corrupted state.
>>
>> There's an issue, though. It seems the valgrind memory markings are
>> not shared between processes. The leader sets the shm_toc up, marks
>> the ranges as NOACCESS, and valgrind then checks them as the memory is
>> accessed. But the parallel workers don't seem to see the markings, and
>> so will produce no reports. I'm assuming this is the case because all
>> the reports come from the leader, never from the workers. Maybe
>> there's a different explanation (e.g. maybe it's just the leader
>> touching the memory?).
>
> I assume the issue is just that the workers don't have the NOACCESS
> markers? I think you'd need to do them in every process using the
> shm_toc. Either by doing it in shm_toc_attach() or in shm_toc().

Right, but the attaching process would also need to know how long each
entry is, which AFAIK it does not. We don't want to redo the calculation
in every worker, but we might add a "len" field to the toc entry.
>> An alternative would be to do mprotect(), but unfortunately it
>> requires page-aligned ranges, which makes it somewhat useless for
>> small overflows of a couple of bytes (like here). It should work
>> cross-process, I think.
>
> It doesn't work across processes. Every process has its own mprotect
> "view".

Ah, OK. So I guessed wrong, and it has the same issue as NOACCESS.

>> FWIW the PoC patch adds a 32-byte chunk, not just a single byte. This
>> is intentional, because if the state is an array, it's quite possible
>> the invalid access steps over a fair number of bytes. I'm actually
>> thinking it should be even larger.
>>
>> This modifies just the dynamic shmem, but maybe we should do this even
>> for the "regular" shmem allocated at startup. Issues in that would
>> likely cause crashes pretty quickly (unlike the btree issue, which
>> requires a somewhat special reproducer), but a nice valgrind report
>> would help.
>
> I can tell you from experience that no, it's not necessarily caught
> quickly. So yes, I think we should definitely do that.

Understood.

>> /*
>>  * Allocate shared memory from a segment managed by a table of contents.
>>  *
>> @@ -119,9 +127,19 @@ shm_toc_allocate(shm_toc *toc, Size nbytes)
>>  	}
>>  	vtoc->toc_allocated_bytes += nbytes;
>>
>> +#ifdef USE_VALGRIND
>> +	vtoc->toc_allocated_bytes += NUM_NOACCESS_BYTES;
>> +#endif
>> +
>>  	SpinLockRelease(&toc->toc_mutex);
>>
>> -	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes);
>> +#ifdef USE_VALGRIND
>> +	/* make the bytes at the end no-access */
>> +	VALGRIND_MAKE_MEM_NOACCESS(((char *) toc) + (total_bytes - allocated_bytes - NUM_NOACCESS_BYTES),
>> +							   NUM_NOACCESS_BYTES);
>> +#endif
>> +
>> +	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes - NUM_NOACCESS_BYTES);
>> }
>
> The size is already rounded up by that point:
>
> 	/*
> 	 * Make sure request is well-aligned. XXX: MAXALIGN is not enough,
> 	 * because atomic ops might need a wider alignment.
> 	 * We don't have a proper definition for the minimum to make atomic
> 	 * ops safe, but BUFFERALIGN ought to be enough.
> 	 */
> 	nbytes = BUFFERALIGN(nbytes);
>
> Which means that you'll often have an up-to-32-byte pad at the end of
> the (ALIGNOF_BUFFER=32) allocation already. I don't care about the
> waste, but the ALIGNOF_BUFFER padding will often prevent detecting
> smaller out-of-bounds accesses.

Good point, I forgot about this alignment rounding. I don't care about
the waste either (it'd be a bit silly to worry about that in a valgrind
build anyway), but not catching smaller mistakes would not be great.

regards

-- 
Tomas Vondra