Re: [HACKERS] Re: indexing words slow - Mailing list pgsql-hackers

From dg@illustra.com (David Gould)
Subject Re: [HACKERS] Re: indexing words slow
Date
Msg-id 9803140648.AA05842@hawk.illustra.com
Whole thread Raw
In response to Re: indexing words slow  (Bruce Momjian <maillist@candle.pha.pa.us>)
List pgsql-hackers
Bruce Momjian writes:
> Just started running CLUSTER, and it appears to be taking forever.  Does
> not seem to be using the disk, but a lot of CPU.  It appears to be
> caught in an endless loop.
...
> Attaching to program `/u/pg/bin/postgres', process 29755
> 0x94999 in WaitIO (buf=0x4c0ccc8, spinlock=0) at bufmgr.c:1107
> 1107            S_LOCK(&(buf->io_in_progress_lock));
>
> ---------------------------------------------------------------------------
>
> Stepping to the next statement shows it is just hung here.
> The backtrace shows:
>
> ---------------------------------------------------------------------------
>
> #0  0x9499b in WaitIO (buf=0x4c0ccc8, spinlock=0) at bufmgr.c:1107
> #1  0x94109 in BufferAlloc (reln=0x19df90, blockNum=11324,
>     foundPtr=0xefbfb903 "\001\030����=\t", bufferLockHeld=0) at bufmgr.c:400
> #2  0x93e91 in ReadBufferWithBufferLock (reln=0x19df90, blockNum=11324,
>     bufferLockHeld=0) at bufmgr.c:255
> #3  0x93dee in ReadBuffer (reln=0x19df90, blockNum=11324) at bufmgr.c:174
> #4  0xb28f in heap_fetch (relation=0x19df90, seeself=0 '\000', tid=0x1b2256,
>     b=0xefbfb974) at heapam.c:1050
> #5  0x303d5 in rebuildheap (OIDNewHeap=11466400, OIDOldHeap=6424406,
>     OIDOldIndex=11466368) at cluster.c:357
> #6  0x30111 in cluster (oldrelname=0x112790 "artist_fti",
>     oldindexname=0x10cdd0 "artist_fti_idx") at cluster.c:160
> #7  0xa29c1 in ProcessUtility (parsetree=0x1127b0, dest=Remote)
>     at utility.c:626

This is all just speculation from memory as I do not have the code in front
of me, but here is what I think might be happening:

heap_fetch -> ReadBuffer -> ReadBufferWithBufferLock
-- trying to fetch the next row, needed to read a new page

 -> BufferAlloc
    -- to read a new page, we need a buffer, so grab one
    -- or possibly we found the buffer we were looking for

    -> WaitIO
       -- but the target buffer was already being read or written
          (by another process???)

       -> S_LOCK(&(buf->io_in_progress_lock))
          -- so spin on the buffer in_progress lock waiting for the I/O to
             complete and clear the lock

             Apparently, the I/O completion is never seen so the lock is
             never cleared so this will wait forever.

Since the system is stuck at this point, somebody dropped the ball on
clearing the io_in_progress_lock spinlock. I can't tell without looking
at the code (and probably not even then) if this is I/O initiated by the
stuck process and just forgotten about, or if some other process was involved.

Were there any other active backends while this was running?

If so, did any of them exit abnormally (ie without cleaning up)?

Is the stuck process holding any other spinlocks (like the bufmgr spinlock
for instance)? (There is an array of held spinlocks in the Proc structure
that gets set by SpinAcquire to record spinlocks owned by the current
process).

If so, is it possible that another process is somehow deadlocked
with this one?

  eg: A has bufmgr spinlock, wants io_in_progress_lock spinlock
      B has io_in_progress_lock spinlock, wants bufmgr spinlock

I hope this is useful...

-dg

ps: what is CLUSTER anyhow? I have a guess, but...

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - I realize now that irony has no place in business communications.

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] casting & type comments
Next
From: Edmund Mergl
Date:
Subject: Re: [HACKERS] postgre install/perl interf