Thread: strange behaviour on pooled alloc
Hi,

I'm continuing with the pooled palloc() stuff and am stuck
with a very strange thing. I've reverted my changes to
palloc() and am doing all the memory block pool handling now
in aset.c.

The benefit from this will be that I can later easily make
palloc() etc. macros.

The new version of the AllocSet...() functions does not use
an ordered set; it manages the block pools itself. It has the
same 10% speedup and I expect some more from the macro version
of palloc(). It aligns small allocations to a power of 2 for
better reusability of free'd chunks, which are held in 8
different free lists per alloc set depending on their size.
It lost the ability of AllocSetDump() - who needs that?

First I found some bad places where memory is used after it
has been free'd. One was in the portal manager with a portal
memory context struct! I'm pretty sure that I found all of
them because I tested by memset()'ing all memory on
AllocSetFree() and AllocSetReset() with different values.

The strange behaviour now is that, depending on the blocksize
and the limit for block/single allocation I use for the pools,
the portals_p2 regression test fails or not. The failure is
that the cursor foo24 does not return data if the pool's
blocksize is greater than or equal to 16K and the smallchunk
limit is 2K. It returns the correct row if one of them is less.
More irritating is that it only fails if run inside 'make
runtest'. If I put multiple portals_p2 tests into the tests
list, all fail the same way. But if the test is run manually
with the same psql switches, it succeeds.

All this behaviour is identical on two Linux 2.1.88
installations. One has gcc-2.8.1 and glibc-2.0.13, the other
gcc-2.7.2.1 and libc.5.

I have absolutely no clue what's going on here. Does anyone
have an idea how to track this down?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
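
The allocation scheme described in this message (small requests
rounded up to a power of 2, 8 free lists per alloc set, freed memory
overwritten to expose use-after-free) can be sketched roughly as
follows. This is a minimal illustration, not the code in aset.c: the
names PoolSet, Chunk, pool_alloc() and pool_free(), the 16-byte
smallest class and the poison value are all assumptions, and each
chunk is taken straight from malloc() instead of being carved out of
a larger block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative constants; only the count of 8 lists comes from the post. */
#define NUM_FREELISTS   8       /* free lists per alloc set */
#define MIN_CHUNK_SIZE  16      /* assumed smallest size class */
#define POISON_BYTE     0x7F    /* assumed fill value for freed memory */

/* Hypothetical header stored in front of every chunk payload. */
typedef struct Chunk {
    struct Chunk *next;         /* next chunk while on a free list */
    size_t        size;         /* payload size in bytes */
} Chunk;

typedef struct PoolSet {
    Chunk *freelist[NUM_FREELISTS];
} PoolSet;

/* Size classes are 16, 32, ..., 2048. Returns -1 for larger requests. */
static int freelist_index(size_t size)
{
    size_t limit = MIN_CHUNK_SIZE;
    int    idx;

    for (idx = 0; idx < NUM_FREELISTS; idx++) {
        if (size <= limit)
            return idx;
        limit <<= 1;
    }
    return -1;
}

static size_t class_size(int idx)
{
    return (size_t) MIN_CHUNK_SIZE << idx;
}

/* Reuse a freed chunk of the same class if there is one, else malloc(). */
static void *pool_alloc(PoolSet *set, size_t size)
{
    int    idx = freelist_index(size);
    size_t payload = (idx >= 0) ? class_size(idx) : size;
    Chunk *chunk;

    if (idx >= 0 && set->freelist[idx] != NULL) {
        chunk = set->freelist[idx];
        set->freelist[idx] = chunk->next;
    } else {
        chunk = malloc(sizeof(Chunk) + payload);
        if (chunk == NULL)
            return NULL;
        chunk->size = payload;
    }
    chunk->next = NULL;
    return chunk + 1;           /* payload starts right after the header */
}

/* Poison the payload, then keep small chunks for reuse. */
static void pool_free(PoolSet *set, void *ptr)
{
    Chunk *chunk;
    int    idx;

    assert(ptr != NULL);
    chunk = (Chunk *) ptr - 1;
    memset(ptr, POISON_BYTE, chunk->size);  /* stale readers now see garbage */

    idx = freelist_index(chunk->size);
    if (idx >= 0) {
        chunk->next = set->freelist[idx];
        set->freelist[idx] = chunk;
    } else {
        free(chunk);            /* oversized chunks go back to malloc() */
    }
}
```

Because a 100-byte and a 70-byte request both land in the 128-byte
class, the second one can reuse the chunk freed by the first, which
is the reusability gain mentioned above. Using a different poison
value in the reset path than in the free path, as the post suggests,
makes it possible to tell from a dump which path released the memory.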
> Hi,
>
> I'm continuing with the pooled palloc() stuff and am stuck
> with a very strange thing. I've reverted my changes to
> palloc() and am doing all the memory block pool handling now
> in aset.c.
>
> The benefit from this will be that I later can easily make
> palloc() etc. macros.

Sounds good.

> The new version of the AllocSet...() functions does not use
> ordered set. it manages the block pools itself. Has the same
> 10% speedup and I expect some more from the macro version of
> palloc(). It aligns small allocations to power of 2 for
> better reusability of free'd chunks which are held in 8
> different free lists per alloc set depending on their size.
> It lost the ability of AllocSetDump() - who needs that?

No one.

>
> First I found some bad places where memory is used after it
> has been free'd. One was in the portal manager with a portal
> memory context struct! I'm pretty sure that I found all
> because I tested by memset() 'ing all memory on
> AllocSetFree() and AllocSetReset() with different values.

Good.

> The strange behaviour now is that depending on the blocksize
> and the limit for block/single allocation I use for the pools,
> the portals_p2 regression test fails or not. The failure is
> that the cursor foo24 does not return data if the pools
> blocksize is greater/equal 16K and the smallchunk limit is
> 2K. It returns the correct row if one of them is less. More
> irritating is that it only fails if run inside 'make
> runtest'. If I put multiple portals_p2 tests into the tests
> list, all fail the same. But if the test is run manually with
> the same psql switches, it succeeds.
>
> All this behaviour is identical on two Linux 2.1.88
> installations. One has gcc-2.8.1 and glibc-2.0.13, the other
> gcc-2.7.2.1 and libc.5.
>
> I have absolutely no clue what's going on here. Anyone an
> idea how to track this down?

My recommendation is to apply the fix and let others debug it. Someone
will find the cause. Just give them a reproducible test case. In many
cases, more eyes or another OS shows the error much more clearly.

--
Bruce Momjian                          |  http://www.op.net/~candle
maillist@candle.pha.pa.us              |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
> > The strange behaviour now is that depending on the blocksize
> > and the limit for block/single allocation I use for the pools,
> > the portals_p2 regression test fails or not.
> > [...]
> > I have absolutely no clue what's going on here. Anyone an
> > idea how to track this down?
>
> My recommendation is to apply the fix and let others debug it. Someone
> will find the cause. Just give them a reproducible test case. In many
> cases, more eyes or another OS shows the error much more clearly.

The new version of the AllocSet...() functions is committed.
palloc() is a macro now. The memory eating problem of COPY FROM,
INSERT ... SELECT and UPDATEs on a table that has constraints is
fixed (new file nodes/freefuncs.c).

The settings in aset.c aren't optimal for now, because the ones
in place force the portals_p2 test to fail (at least here).

Some information for those who want to take a look at it follows.

Reproducing the bug:

The bug can be reproduced after the regression test has been
run by running only portals_p2.sql. To cause the error, the
postmaster must be started with -B64 (default), and at least
one environment variable that causes psql to send a SET on
connection (e.g. PGDATESTYLE) must be set.

If -B is greater than 64, AllocSetAlloc() puts the allocation
for the buffer reference counts in the execution state EState
into its own malloc() area, not into a smallchunk block. The
problem disappears.

If ALLOC_BLOCK_SIZE (in aset.c) is changed to 8192, the problem
also disappears.

If none of the mentioned environment variables is set, the
BEGIN from the regression test is the first command sent to the
backend and the problem disappears too. But adding a simple
BEGIN; END; to the top of the test forces it to appear again,
so it isn't in the variable setting code.

Guesses:

The symptom is that, with many portals on a big table, rows
that are there don't show up. Each cursor declaration results
in its own ExecutorStart(), where the buffer reference counts
are saved into the newly created execution state and reset to
zero. Later, on ExecutorEnd(), the saved counts are restored.
The disappearing rows might have to do with unpinned buffers
that are expected to be pinned.

Since it depends on whether the allocation for the saved
reference counts is taken from a block or allocated separately,
I think some counts get corrupted from somewhere else. That it
also depends on the blocksize is one more hint in that
direction, because the refcount area must then live in the same
block together with some other allocation.

I'll keep on debugging, but would very much appreciate it if
someone could help.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
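
One way to test the guess above, that a neighbouring allocation
overwrites the saved reference counts, is to bracket the saved copy
with guard bytes and check them before restoring. The sketch below
only illustrates that idea under stated assumptions: SavedRefCounts,
save_refcounts() and restore_refcounts() are hypothetical names, the
copy lives in a plain malloc() area rather than in an alloc set, and
the real save/restore code in the executor may look quite different.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN   8           /* assumed guard width in bytes */
#define GUARD_BYTE  0xA5        /* assumed guard pattern */

/* Hypothetical container for the counts saved at executor start. */
typedef struct SavedRefCounts {
    int            nbuffers;
    unsigned char *raw;         /* layout: guard | counts | guard */
} SavedRefCounts;

/* Copy the live counts between two guards, then reset the live ones. */
static int save_refcounts(SavedRefCounts *saved, long *live, int nbuffers)
{
    size_t bytes = (size_t) nbuffers * sizeof(long);

    assert(nbuffers > 0);
    saved->raw = malloc(GUARD_LEN + bytes + GUARD_LEN);
    if (saved->raw == NULL)
        return -1;
    saved->nbuffers = nbuffers;

    memset(saved->raw, GUARD_BYTE, GUARD_LEN);
    memcpy(saved->raw + GUARD_LEN, live, bytes);
    memset(saved->raw + GUARD_LEN + bytes, GUARD_BYTE, GUARD_LEN);

    memset(live, 0, bytes);     /* the new execution state starts at zero */
    return 0;
}

/* Returns 1 if both guards still hold the pattern, 0 otherwise. */
static int guards_intact(const SavedRefCounts *saved)
{
    size_t bytes = (size_t) saved->nbuffers * sizeof(long);
    const unsigned char *front = saved->raw;
    const unsigned char *back = saved->raw + GUARD_LEN + bytes;
    int i;

    for (i = 0; i < GUARD_LEN; i++)
        if (front[i] != GUARD_BYTE || back[i] != GUARD_BYTE)
            return 0;
    return 1;
}

/* Restore the counts; returns -1 if a guard was overwritten meanwhile. */
static int restore_refcounts(SavedRefCounts *saved, long *live)
{
    size_t bytes = (size_t) saved->nbuffers * sizeof(long);
    int    ok = guards_intact(saved);

    memcpy(live, saved->raw + GUARD_LEN, bytes);
    free(saved->raw);
    saved->raw = NULL;
    return ok ? 0 : -1;
}
```

A failing check narrows the search to writers that run over the edge
of an adjacent chunk between save and restore. It does not catch a
stray pointer that writes directly into the middle of the counts, so
a clean result does not rule out corruption.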