Re: [HACKERS] strange behaviour on pooled alloc - Mailing list pgsql-hackers

From jwieck@debis.com (Jan Wieck)
Subject Re: [HACKERS] strange behaviour on pooled alloc
Msg-id m109BWi-000EBPC@orion.SAPserv.Hamburg.dsh.de
In response to Re: [HACKERS] strange behaviour on pooled alloc  (Bruce Momjian <maillist@candle.pha.pa.us>)
List pgsql-hackers
Bruce Momjian wrote:

> >     The  strange behaviour now is that depending on the blocksize
> >     and the limit for block/single allocation I use for the pools,
> >     the  portals_p2  regression test fails or not.
> >     [...]
> >     I have absolutely no clue what's going  on  here.  Anyone  an
> >     idea how to track this down?
>
> My recommendation is to apply the fix and let others debug it.  Someone
> will find the cause.  Just give them a reproducible test case.  In many
> cases, more eyes or another OS shows the error much clearer.

    A new version of the AllocSet...() functions is committed.
    palloc() is a macro now. The memory eating problem of COPY
    FROM, INSERT ... SELECT and UPDATE on a table that has
    constraints is fixed (new file nodes/freefuncs.c).

    The settings in aset.c aren't optimal for now, because as
    committed they cause the portals_p2 test to fail (at least
    here). Some information for those who want to take a look at
    it follows.

    Reproducing the bug:

        The  bug  can be reproduced after the regression test has
        been run by running only portals_p2.sql.

        To cause the error, the postmaster must be started with
        -B64 (default), and at least one environment variable
        that causes psql to send a SET on connection (e.g.
        PGDATESTYLE) must be set.

        If -B is greater than 64, AllocSetAlloc() puts the
        allocation for the buffer reference counts in the
        execution state EState into its own malloc() area, not
        into a small-chunk block. The problem disappears.

        If the ALLOC_BLOCK_SIZE (in aset.c) is changed  to  8192,
        the problem also disappears.

        If  none  of  the mentioned environment variables is set,
        the BEGIN from the regression test is the  first  command
        sent  to  the backend and the problem disappears too. But
        adding a simple BEGIN; END; to the top of the test forces
        it  to  appear again, so it isn't in the variable setting
        code.
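
        The -B condition above comes down to whether the request
        for the saved reference counts fits under the allocator's
        small-chunk limit. A minimal sketch of that decision
        follows. Every name and number in it is hypothetical:
        DEMO_CHUNK_LIMIT is not the value from aset.c, and the
        4-byte element size is an assumption. They are chosen only
        so that 64 buffers fit in a shared block and 65 do not,
        matching the behaviour described.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit, NOT taken from aset.c: the largest request that is
 * still carved out of a shared block instead of getting its own malloc(). */
#define DEMO_CHUNK_LIMIT 256

typedef enum { DEMO_FROM_BLOCK, DEMO_FROM_MALLOC } DemoPlacement;

/* Bytes needed to save one reference count per shared buffer.
 * elem_size is an assumption about the counter type. */
static size_t demo_refcount_bytes(size_t nbuffers, size_t elem_size)
{
    assert(elem_size > 0);
    return nbuffers * elem_size;
}

/* Decide where a request of the given size would be placed. */
static DemoPlacement demo_placement(size_t request)
{
    return request <= DEMO_CHUNK_LIMIT ? DEMO_FROM_BLOCK : DEMO_FROM_MALLOC;
}
```

        With these assumed numbers, demo_refcount_bytes(64, 4)
        lands in a block shared with other chunks, while
        demo_refcount_bytes(65, 4) gets a separate malloc() area.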

    Guesses:

        The symptom is that with many portals open on a big
        table, rows that exist don't show up. Each cursor
        declaration results in its own ExecutorStart(), where
        the buffer reference counts are saved into the newly
        created execution state and reset to zero. Later, in
        ExecutorEnd(), the saved counts are restored.
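
        The save, reset and restore cycle can be pictured with a
        small stand-alone sketch. This is not the backend code:
        the demo_ names, the int counter type, the fixed buffer
        count and the use of plain malloc() are all assumptions,
        and the restore here simply overwrites the current counts,
        which may differ from what the real functions do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NBUFFERS 64                  /* stands in for the -B setting */

/* Stand-in for the backend's per-buffer reference counts. */
static int demo_refcount[DEMO_NBUFFERS];

/* Stand-in for the execution state that carries the saved counts. */
typedef struct {
    int *saved_refcount;
} DemoEState;

/* Save the current counts into the state, then reset them to zero.
 * Returns 0 on success, -1 if the allocation fails. */
static int demo_executor_start(DemoEState *estate)
{
    assert(estate != NULL);
    estate->saved_refcount = malloc(sizeof demo_refcount);
    if (estate->saved_refcount == NULL)
        return -1;
    memcpy(estate->saved_refcount, demo_refcount, sizeof demo_refcount);
    memset(demo_refcount, 0, sizeof demo_refcount);
    return 0;
}

/* Put the saved counts back and release the save area. */
static void demo_executor_end(DemoEState *estate)
{
    assert(estate != NULL && estate->saved_refcount != NULL);
    memcpy(demo_refcount, estate->saved_refcount, sizeof demo_refcount);
    free(estate->saved_refcount);
    estate->saved_refcount = NULL;
}
```

        If anything scribbles on saved_refcount between the two
        calls, the restore hands back wrong pin counts, which is
        the kind of damage suspected here.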

        These  disappearing  rows  might have to do with unpinned
        buffers that are expected to be pinned.

        Since it depends on whether the allocation for the  saved
        reference  counts  is  taken  from  a  block or allocated
        separately,  I  think  some  counts  get  corrupted  from
        somewhere else.

        It also depends on the block size, which is one more hint
        that the corruption comes from somewhere else, because
        the refcount area must share a block with some other
        allocation.
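
        Why sharing a block matters can be shown with a toy bump
        allocator. This is only an illustration of the suspected
        mechanism, not a diagnosis: the DemoBlock type, the sizes
        and the "buggy writer" are all made up, and real chunks
        carry headers that this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 512

/* One shared block from which chunks are carved back to back. */
typedef struct {
    unsigned char data[DEMO_BLOCK_SIZE];
    size_t        used;
} DemoBlock;

/* Carve size bytes from the block; returns the chunk's offset,
 * or (size_t) -1 if the block is full. */
static size_t demo_block_alloc(DemoBlock *block, size_t size)
{
    size_t offset = block->used;

    if (size > DEMO_BLOCK_SIZE - offset)
        return (size_t) -1;
    block->used += size;
    return offset;
}

/* A caller that writes `written` bytes into its chunk, even if the
 * chunk is smaller. Stays inside the block, so the demo is well defined. */
static void demo_buggy_write(DemoBlock *block, size_t offset, size_t written)
{
    assert(offset + written <= DEMO_BLOCK_SIZE);
    memset(block->data + offset, 0xFF, written);
}

/* Read the i-th int counter of the chunk that starts at offset. */
static int demo_count_at(const DemoBlock *block, size_t offset, size_t i)
{
    int value;

    memcpy(&value, block->data + offset + i * sizeof(int), sizeof value);
    return value;
}
```

        Allocating a 16-byte chunk, then the counter array, and
        letting the first owner write 16 + sizeof(int) bytes
        clobbers counter 0 and leaves counter 1 alone. With the
        counters in a separate malloc() area, the same overrun
        would hit something else, which would fit the way the
        symptom comes and goes with -B and the block size.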

    I'll keep on debugging, but it would be much appreciated if
    someone could help.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
