Thread: any suggestions to detect memory corruption
I see the following log messages at random, and I don't know which commit caused them. I spent a day on it without success.
2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
2019-05-08 21:37:46.692 CST [60110] WARNING: idx: 2 problem in alloc set index info: bad single-chunk 0x2a33a78 in block 0x2a33a18, chsize: 1408, chunkLimit: 1024, chunkHeaderSize: 24, block_used: 768 request size: 2481
2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index info: found inconsistent memory block 0x2a33a18
It looks like memory managed by the "index info" memory context is being overwritten by some unrelated, buggy code.
I didn't change any AllocSetXXX-related code, so I think I am just misusing it in some way.
Thanks
Alex <zhihui.fan1213@gmail.com> writes:
> I can get the following log randomly and I am not which commit caused it.

> 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index
> info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18

I've had success in finding memory stomp causes fairly quickly by setting
a hardware watchpoint in gdb on the affected location. Then you just let
it run to see when the value changes, and check whether that's a "legit"
or "not legit" modification point.

The hard part of that, of course, is to know in advance where the affected
location is. You may be able to make things sufficiently repeatable by
doing the problem query in a fresh session each time.

regards, tom lane
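To make the watchpoint suggestion concrete, here is a minimal sketch of a memory stomp that such a watchpoint would catch. It is a toy, not PostgreSQL code: the block layout and the names (block, HDR_SIZE, fill_payload, run_stomp_demo) are all hypothetical, chosen only to mimic a chunk header sitting directly after another chunk's payload.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: two chunks packed back to back in one block. */
#define HDR_SIZE   8
#define PAYLOAD    16
#define CHUNK_SIZE (HDR_SIZE + PAYLOAD)

static unsigned char block[2 * CHUNK_SIZE];

/* Store the requested size in the header of chunk n. */
static void set_requested_size(int n, uint32_t size)
{
    assert(n >= 0 && n < 2);
    memcpy(block + n * CHUNK_SIZE, &size, sizeof size);
}

/* Read the requested size back from the header of chunk n. */
static uint32_t get_requested_size(int n)
{
    uint32_t size;

    assert(n >= 0 && n < 2);
    memcpy(&size, block + n * CHUNK_SIZE, sizeof size);
    return size;
}

/* Buggy writer: fills len bytes of chunk n's payload without checking
 * len against PAYLOAD, so it can run into the next chunk's header. */
static void fill_payload(int n, unsigned char value, size_t len)
{
    assert(n >= 0 && n < 2);
    memset(block + n * CHUNK_SIZE + HDR_SIZE, value, len);
}

/* Returns 1 if chunk 1's header no longer holds the value stored in it.
 *
 * A possible gdb session (build with -g and call this from main):
 *   (gdb) break run_stomp_demo
 *   (gdb) run
 *   (gdb) watch *(unsigned int *) (block + 24)
 *   (gdb) continue      <- stops in set_requested_size: a legit change
 *   (gdb) continue      <- stops inside fill_payload's memset: not legit
 */
static int run_stomp_demo(void)
{
    memset(block, 0, sizeof block);
    set_requested_size(0, PAYLOAD);
    set_requested_size(1, PAYLOAD);
    fill_payload(0, 0xAB, PAYLOAD + 4);   /* 4 bytes too many */
    return get_requested_size(1) != PAYLOAD;
}
```

In a real backend the address to watch would come from the WARNING line (for example the chunk address it reports), which is why a repeatable run in a fresh session matters: the address has to be the same from one run to the next.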
On Wed, May 8, 2019 at 10:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alex <zhihui.fan1213@gmail.com> writes:
> > I can get the following log randomly and I am not which commit caused it.
>
> > 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index
> > info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
>
> I've had success in finding memory stomp causes fairly quickly by setting
> a hardware watchpoint in gdb on the affected location. Then you just let
> it run to see when the value changes, and check whether that's a "legit"
> or "not legit" modification point.
>
> The hard part of that, of course, is to know in advance where the affected
> location is. You may be able to make things sufficiently repeatable by
> doing the problem query in a fresh session each time.

valgrind might also be a possibility, although that has a lot of overhead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you, Tom and Robert! I tried valgrind, and it looks like it helped me fix the issue.
Someone had added code to backend initialization that calls palloc while CurrentMemoryContext is still PostmasterContext. At the end of backend initialization, PostmasterContext is deleted, and then the error occurs. It happens randomly because some if clauses before the palloc may skip it.
I still can't explain why PostmasterContext would sometimes affect the "index info" memory context, but I can no longer reproduce the problem (before the fix, it happened in about 30% of cases).
Alex <zhihui.fan1213@gmail.com> writes:
> Someone add some code during backend init which used palloc. but at that
> time, the CurrentMemoryContext is PostmasterContext. at the end of
> backend initialization, the PostmasterContext is deleted, then the error
> happens. the reason why it happens randomly is before the palloc, there
> are some other if clause which may skip the palloc.

> I still can't explain why PostmasterContext may have impact "index info"
> MemoryContext sometime, but now I just can't reproduce it (before the
> fix, it may happen in 30% cases).

Well, once the context is deleted, that memory is available for reuse.
Everything will seem fine until it *is* reused, and then boom!

The error would have been a lot more obvious if you'd enabled
MEMORY_CONTEXT_CHECKING, which would overwrite freed data with garbage.
That is normally turned on in --enable-cassert builds. Anybody who's been
hacking Postgres for more than a week does backend code development in
--enable-cassert mode as a matter of course; it turns on a *lot* of
helpful cross-checks.

regards, tom lane
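The "fine until it *is* reused" behavior, and why overwriting freed data makes the bug obvious, can be sketched with a toy bump allocator. This is only an illustration under stated assumptions, not the AllocSet implementation: Arena, arena_reset, stale_read, and the WIPE_BYTE filler value are hypothetical names and choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 64
#define WIPE_BYTE  0x7F   /* arbitrary recognizable filler for this sketch */

/* Toy stand-in for a memory context: a fixed buffer with a bump pointer. */
typedef struct Arena
{
    unsigned char buf[ARENA_SIZE];
    size_t        used;
    int           wipe_on_reset;   /* mimics a debug build that clobbers freed data */
} Arena;

static void arena_init(Arena *a, int wipe_on_reset)
{
    memset(a->buf, 0, sizeof a->buf);
    a->used = 0;
    a->wipe_on_reset = wipe_on_reset;
}

static void *arena_alloc(Arena *a, size_t size)
{
    void *p;

    assert(size <= ARENA_SIZE - a->used);
    p = a->buf + a->used;
    a->used += size;
    return p;
}

/* Stand-in for deleting a context: every chunk becomes reusable at once. */
static void arena_reset(Arena *a)
{
    if (a->wipe_on_reset)
        memset(a->buf, WIPE_BYTE, a->used);
    a->used = 0;
}

/* Allocate a string, "delete" the context, optionally let another caller
 * reuse the memory, then return the first byte seen through the stale
 * pointer. Reading it is well defined here only because the toy arena's
 * buffer is still alive; in a real program this is a use-after-free. */
static unsigned char stale_read(int wipe_on_reset, int after_reuse)
{
    Arena a;
    char *stale;

    arena_init(&a, wipe_on_reset);
    stale = arena_alloc(&a, 8);
    memcpy(stale, "backend", 8);

    arena_reset(&a);

    if (after_reuse)
    {
        char *fresh = arena_alloc(&a, 8);

        memcpy(fresh, "indexinf", 8);   /* a different owner's data */
    }
    return (unsigned char) stale[0];
}
```

Without wiping, stale_read(0, 0) still returns 'b', so the stale pointer appears to work; stale_read(0, 1) returns 'i', because the memory now belongs to someone else. With wiping, stale_read(1, 0) returns 0x7F immediately after the reset, so the mistake shows up at the first use rather than at some later, unrelated allocation.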
On Thu, May 9, 2019 at 9:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alex <zhihui.fan1213@gmail.com> writes:
> > Someone add some code during backend init which used palloc. but at that
> > time, the CurrentMemoryContext is PostmasterContext. at the end of
> > backend initialization, the PostmasterContext is deleted, then the error
> > happens. the reason why it happens randomly is before the palloc, there
> > are some other if clause which may skip the palloc.
>
> > I still can't explain why PostmasterContext may have impact "index info"
> > MemoryContext sometime, but now I just can't reproduce it (before the
> > fix, it may happen in 30% cases).
>
> Well, once the context is deleted, that memory is available for reuse.
> Everything will seem fine until it *is* reused, and then boom!
>
> The error would have been a lot more obvious if you'd enabled
> MEMORY_CONTEXT_CHECKING, which would overwrite freed data with garbage.

Thanks! I didn't know this before: "once the context is deleted, that memory is available for reuse. Everything will seem fine until it *is* reused". I have enabled --enable-cassert now.

> That is normally turned on in --enable-cassert builds. Anybody who's been
> hacking Postgres for more than a week does backend code development in
> --enable-cassert mode as a matter of course; it turns on a *lot* of
> helpful cross-checks.
>
> regards, tom lane