Thread: strange behaviour on pooled alloc

strange behaviour on pooled alloc

From
jwieck@debis.com (Jan Wieck)
Date:
Hi,

    I'm  continuing  with  the pooled palloc() stuff and am stuck
    with a very  strange  thing.  I've  reverted  my  changes  to
    palloc()  and am doing all the memory block pool handling now
    in aset.c.

    The benefit from this will be that I later  can  easily  make
    palloc() etc. macros.

    The new version of the AllocSet...() functions does not use
    the ordered set; it manages the block pools itself. It has the
    same 10% speedup, and I expect some more from the macro version
    of palloc(). It aligns small allocations to a power of 2 for
    better reusability of free'd chunks, which are held in 8
    different free lists per alloc set depending on their size.
    It lost the ability of AllocSetDump() - who needs that?
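
    A minimal sketch of how power-of-2 size classes with 8 free
    lists might be computed. The smallest chunk size (16 bytes)
    and all sketch_* names are assumptions for illustration; the
    mail only states "power of 2" and "8 different free lists".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values: only the count of 8 lists comes from the mail. */
#define SKETCH_MIN_CHUNK_BITS 4   /* assumed smallest chunk: 1 << 4 = 16 bytes */
#define SKETCH_NUM_FREELISTS  8   /* "8 different free lists per alloc set" */

/*
 * Return the index of the free list whose chunk size is the smallest
 * power of 2 that holds `size` bytes, or -1 if the request is too
 * large for any list and would need its own allocation.
 */
static int
sketch_freelist_index(size_t size)
{
    size_t chunk = (size_t) 1 << SKETCH_MIN_CHUNK_BITS;
    int    idx;

    for (idx = 0; idx < SKETCH_NUM_FREELISTS; idx++)
    {
        if (size <= chunk)
            return idx;
        chunk <<= 1;            /* next power of 2 */
    }
    return -1;
}

/* Return the rounded-up chunk size served by free list `idx`. */
static size_t
sketch_chunk_size(int idx)
{
    assert(idx >= 0 && idx < SKETCH_NUM_FREELISTS);
    return (size_t) 1 << (SKETCH_MIN_CHUNK_BITS + idx);
}
```

    Under these assumed values a 100 byte request maps to list 3
    (128 byte chunks), so a free'd 128 byte chunk can later serve
    any request between 65 and 128 bytes.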

    First I found some bad places where memory is used after it
    has been free'd. One was in the portal manager, with a portal
    memory context struct! I'm pretty sure that I found them all,
    because I tested by memset()'ing all memory on AllocSetFree()
    and AllocSetReset() with different values.
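
    The memset() technique can be sketched roughly as follows.
    The pattern value and the sketch_* helpers are hypothetical;
    the mail does not say which values were used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical poison byte; a second value could mark memory released by a reset. */
#define SKETCH_FREE_PATTERN 0x7F

/* Overwrite a released chunk so that stale readers see obvious garbage. */
static void
sketch_poison(void *ptr, size_t size, unsigned char pattern)
{
    assert(ptr != NULL);
    memset(ptr, pattern, size);
}

/*
 * Return 1 if every byte still carries the pattern, 0 otherwise.
 * A 0 result on a chunk that is still on a free list means
 * somebody wrote to it after it was released.
 */
static int
sketch_is_poisoned(const void *ptr, size_t size, unsigned char pattern)
{
    const unsigned char *p = (const unsigned char *) ptr;
    size_t               i;

    for (i = 0; i < size; i++)
    {
        if (p[i] != pattern)
            return 0;
    }
    return 1;
}
```

    Reads through a stale pointer show up as the pattern in the
    data; writes through a stale pointer show up when the check
    is run before the chunk is handed out again.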

    The strange behaviour now is that, depending on the blocksize
    and the limit for block/single allocation I use for the pools,
    the portals_p2 regression test fails or not. The failure is
    that the cursor foo24 does not return data if the pool's
    blocksize is greater than or equal to 16K and the smallchunk
    limit is 2K. It returns the correct row if one of them is
    less. More irritating is that it only fails if run inside
    'make runtest'. If I put multiple portals_p2 tests into the
    tests list, all fail the same way. But if the test is run
    manually with the same psql switches, it succeeds.

    All  this  behaviour  is  identical  on  two   Linux   2.1.88
    installations.  One has gcc-2.8.1 and glibc-2.0.13, the other
    gcc-2.7.2.1 and libc.5.

    I have absolutely no clue what's going on here. Does anyone
    have an idea how to track this down?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Re: [HACKERS] strange behaviour on pooled alloc

From
Bruce Momjian
Date:
> Hi,
> 
>     I'm  continuing  with  the pooled palloc() stuff and am stuck
>     with a very  strange  thing.  I've  reverted  my  changes  to
>     palloc()  and am doing all the memory block pool handling now
>     in aset.c.
> 
>     The benefit from this will be that I later  can  easily  make
>     palloc() etc. macros.

Sounds good.

>     The  new  version of the AllocSet...() functions does not use
>     ordered set.  it manages the block pools itself. Has the same
>     10%  speedup and I expect some more from the macro version of
>     palloc(). It aligns small  allocations  to  power  of  2  for
>     better  reusability  of  free'd  chunks  which  are held in 8
>     different free lists per alloc set depending on  their  size.
>     It lost the ability of AllocSetDump() - who needs that?

No one.

> 
>     First  I  found some bad places where memory is used after it
>     has been free'd. One was in the portal manager with a  portal
>     memory  context  struct!  I'm  pretty  sure  that I found all
>     because  I  tested   by   memset()   'ing   all   memory   on
>     AllocSetFree() and AllocSetReset() with different values.


Good.

>     The  strange behaviour now is that depending on the blocksize
>     and the limit for block/single allocation I use for the pools,
>     the  portals_p2  regression test fails or not. The failure is
>     that the cursor foo24 does  not  return  data  if  the  pools
>     blocksize  is  greater/equal  16K and the smallchunk limit is
>     2K. It returns the correct row if one of them is  less.  More
>     irritating  is  that  it  only  fails  if  run  inside  'make
>     runtest'. If I put multiple portals_p2 tests into  the  tests
>     list, all fail the same. But if the test is run manually with
>     the same psql switches, it succeeds.
> 
>     All  this  behaviour  is  identical  on  two   Linux   2.1.88
>     installations.  One has gcc-2.8.1 and glibc-2.0.13, the other
>     gcc-2.7.2.1 and libc.5.
> 
>     I have absolutely no clue what's going  on  here.  Anyone  an
>     idea how to track this down?

My recommendation is to apply the fix and let others debug it.  Someone
will find the cause.  Just give them a reproducible test case.  In many
cases, more eyes or another OS show the error much more clearly.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] strange behaviour on pooled alloc

From
jwieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> >     The  strange behaviour now is that depending on the blocksize
> >     and the limit for block/single allocation I use for the pools,
> >     the  portals_p2  regression test fails or not.
> >     [...]
> >     I have absolutely no clue what's going  on  here.  Anyone  an
> >     idea how to track this down?
>
> My recommendation is to apply the fix and let others debug it.  Someone
> will find the cause.  Just give them a reproducible test case.  In many
> cases, more eyes or another OS show the error much more clearly.

    The new version of the AllocSet...() functions is committed.
    palloc() is a macro now. The memory eating problem of COPY
    FROM, INSERT ... SELECT and UPDATEs on a table that has
    constraints is fixed (new file nodes/freefuncs.c).

    The settings in aset.c aren't optimal for now, because the
    settings in place force the portals_p2 test to fail (at least
    here). Some information for those who want to take a look at
    it follows.

    Reproducing the bug:

        The  bug  can be reproduced after the regression test has
        been run by running only portals_p2.sql.

        To cause the error, the postmaster must be started with
        -B64 (default), and at least one environment variable
        that causes psql to send a SET on connection (e.g.
        PGDATESTYLE) must be set.

        If -B is greater than 64, AllocSetAlloc() puts the
        allocation for the buffer reference counts in the
        execution state EState into its own malloc() area, not
        into a smallchunk block. The problem disappears.

        If the ALLOC_BLOCK_SIZE (in aset.c) is changed  to  8192,
        the problem also disappears.

        If  none  of  the mentioned environment variables is set,
        the BEGIN from the regression test is the  first  command
        sent  to  the backend and the problem disappears too. But
        adding a simple BEGIN; END; to the top of the test forces
        it  to  appear again, so it isn't in the variable setting
        code.
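
        The -B dependency described above comes down to whether
        the refcount area fits under the smallchunk limit. A
        rough sketch of that decision follows; the element size,
        the limit and all sketch_* names are hypothetical and are
        not taken from aset.c.

```c
#include <assert.h>
#include <stddef.h>

/* Where an allocation request ends up in this simplified model. */
typedef enum
{
    SKETCH_FROM_BLOCK,      /* carved out of a shared pool block */
    SKETCH_OWN_MALLOC       /* gets its own malloc() area */
} SketchPlacement;

/* Decide placement by comparing the request against the small-chunk limit. */
static SketchPlacement
sketch_placement(size_t request, size_t small_chunk_limit)
{
    assert(small_chunk_limit > 0);
    return (request <= small_chunk_limit) ? SKETCH_FROM_BLOCK
                                          : SKETCH_OWN_MALLOC;
}

/* Size of a saved refcount area: one counter per shared buffer. */
static size_t
sketch_refcount_bytes(size_t nbuffers, size_t elem_size)
{
    return nbuffers * elem_size;
}
```

        With an assumed 4 byte counter and an assumed limit of
        256 bytes, -B64 gives a 256 byte area that shares a block
        with other allocations, while -B65 gives 260 bytes and a
        separate malloc() area.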

    Guesses:

        The symptom is that, in the case of many portals on a big
        table, rows that are there don't show up. Each cursor
        declaration results in its own ExecutorStart(), where
        the buffer reference counts are saved into the newly
        created execution state and reset to zero. Later, on
        ExecutorEnd(), these states are restored.
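
        The save/reset/restore cycle can be modelled as below.
        This is a simplified sketch, not the backend code: the
        sketch_* names, the counter type and the copy-back
        restore are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_NBUFFERS 64      /* corresponds to postmaster -B64 */

/* Stand-in for the execution state; only the saved counts are modelled. */
typedef struct SketchEState
{
    long *saved_refcount;       /* copy of the counts taken at start */
} SketchEState;

/* Stand-in for the backend's per-buffer private reference counts. */
static long sketch_private_refcount[SKETCH_NBUFFERS];

/* Save the current counts into the new state and reset them to zero. */
static int
sketch_executor_start(SketchEState *estate)
{
    size_t bytes = sizeof(sketch_private_refcount);

    estate->saved_refcount = (long *) malloc(bytes);
    if (estate->saved_refcount == NULL)
        return -1;
    memcpy(estate->saved_refcount, sketch_private_refcount, bytes);
    memset(sketch_private_refcount, 0, bytes);
    return 0;
}

/* Restore the counts saved at start and release the saved copy. */
static void
sketch_executor_end(SketchEState *estate)
{
    assert(estate->saved_refcount != NULL);
    memcpy(sketch_private_refcount, estate->saved_refcount,
           sizeof(sketch_private_refcount));
    free(estate->saved_refcount);
    estate->saved_refcount = NULL;
}
```

        In this model, anything that scribbles over the saved
        area between start and end is restored as a wrong count,
        which would fit the suspicion about buffers that are
        unexpectedly unpinned.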

        These  disappearing  rows  might have to do with unpinned
        buffers that are expected to be pinned.

        Since it depends on whether the allocation for the  saved
        reference  counts  is  taken  from  a  block or allocated
        separately,  I  think  some  counts  get  corrupted  from
        somewhere else.

        It also depends on the blocksize, which is one more hint
        that the corruption comes from somewhere else, because
        the refcount areas must then live in the same block
        together with some other allocation.

    I'll keep on debugging, but it would be much appreciated if
    someone could help.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #