Thread: Need some help on code

Need some help on code

From
Maarten Boekhold
Date:
Hi,

I was trying to change to cluster command to do the its writes clustered
by a 100 tuples, thus hoping to improve performance. However, the code
I've written crashes. This has certainly to do with some internal states
of pgsql that aren't preserved in a HeapTuple.

Could somebody with knowledge have a brief glimpse on my code and perhaps
tell me how to do it properly?

Maarten

_____________________________________________________________________________
| TU Delft, The Netherlands, Faculty of Information Technology and Systems  |
|                   Department of Electrical Engineering                    |
|           Computer Architecture and Digital Technique section             |
|                          M.Boekhold@et.tudelft.nl                         |
-----------------------------------------------------------------------------

static void
rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
{
    Relation    LocalNewHeap,
                LocalOldHeap,
                LocalOldIndex;
    IndexScanDesc ScanDesc;
    RetrieveIndexResult ScanResult;
    ItemPointer HeapTid;
    HeapTuple    LocalHeapTuple;
    Buffer        LocalBuffer[100];
    Oid            OIDNewHeapInsert;
    Dllist      *ScanResList;
    Dlelem      *ListEl;
    int         count, loop;

    /*
     * Open the relations I need. Scan through the OldHeap on the OldIndex
     * and insert each tuple into the NewHeap.
     */
    LocalNewHeap = (Relation) heap_open(OIDNewHeap);
    LocalOldHeap = (Relation) heap_open(OIDOldHeap);
    LocalOldIndex = (Relation) index_open(OIDOldIndex);
    ScanResList = DLNewList();

    ScanDesc = index_beginscan(LocalOldIndex, false, 0, (ScanKey) NULL);

    loop = 1;
    while (loop) {
        count = 0;
        while ((count < 100) &&
               ((ScanResult =
                index_getnext(ScanDesc,
                    ForwardScanDirection)) != NULL))
        {

            HeapTid = &ScanResult->heap_iptr;
            pfree(ScanResult);
            LocalHeapTuple = heap_fetch(LocalOldHeap, false,
                    HeapTid, &LocalBuffer[count]);
            ListEl = DLNewElem(LocalHeapTuple);
            DLAddTail(ScanResList, ListEl);
            count++;
        }

        if (count < 100) loop = 0;

        count = 0;
        while ((ListEl = DLRemHead(ScanResList)) != NULL) {
            LocalHeapTuple = (HeapTuple)ListEl->dle_val;
            DLFreeElem(ListEl);
            OIDNewHeapInsert =
                heap_insert(LocalNewHeap, LocalHeapTuple);
            ReleaseBuffer(LocalBuffer[count]);
            count++;
        }
    }

    index_endscan(ScanDesc);

    index_close(LocalOldIndex);
    heap_close(LocalOldHeap);
    heap_close(LocalNewHeap);
    DLFreeList(ScanResList);
}

Re: [HACKERS] Need some help on code

From
Bruce Momjian
Date:
>
> Hi,
>
> I was trying to change to cluster command to do the its writes clustered
> by a 100 tuples, thus hoping to improve performance. However, the code
> I've written crashes. This has certainly to do with some internal states
> of pgsql that aren't preserved in a HeapTuple.
>
> Could somebody with knowledge have a brief glimpse on my code and perhaps
> tell me how to do it properly?

I did not look at the code, but I can pretty much tell you that bunching
the write will not help performance.  We already do that pretty well
with the cache.

THe problem with the cluster is the normal problem of using an index to
seek into a data table, where the data is not clustered on the index.
Every entry in the index requires a different page, and each has to be
read in from disk.

Often the fastest way is to discard the index, and just read the table,
sorting each in pieces, and merging them in.  That is what psort does,
which is our sort code.  That is why I recommend the SELECT INTO
solution if you have enough disk space.

Once it is clustered, subsequent clusters should be very fast, because
only the out-of-order entries cause random disk seeks.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Need some help on code

From
Maarten Boekhold
Date:
On Sun, 7 Jun 1998, Bruce Momjian wrote:

> >
> > Hi,
> >
> > I was trying to change to cluster command to do the its writes clustered
> > by a 100 tuples, thus hoping to improve performance. However, the code
> > I've written crashes. This has certainly to do with some internal states
> > of pgsql that aren't preserved in a HeapTuple.
> >
> > Could somebody with knowledge have a brief glimpse on my code and perhaps
> > tell me how to do it properly?
>
> I did not look at the code, but I can pretty much tell you that bunching
> the write will not help performance.  We already do that pretty well
> with the cache.
>
> THe problem with the cluster is the normal problem of using an index to
> seek into a data table, where the data is not clustered on the index.
> Every entry in the index requires a different page, and each has to be
> read in from disk.

My thinking was that the reading from the table is very scattered, but
that the writing to the new table could be done 'sequentially'. Therefore
I thought it was interesting to see if it would help to cluster the writes.

> Often the fastest way is to discard the index, and just read the table,
> sorting each in pieces, and merging them in.  That is what psort does,
> which is our sort code.  That is why I recommend the SELECT INTO
> solution if you have enough disk space.

A 'select into ... order by ...' you mean?

Maarten

_____________________________________________________________________________
| TU Delft, The Netherlands, Faculty of Information Technology and Systems  |
|                   Department of Electrical Engineering                    |
|           Computer Architecture and Digital Technique section             |
|                          M.Boekhold@et.tudelft.nl                         |
-----------------------------------------------------------------------------


Re: [HACKERS] Need some help on code

From
Bruce Momjian
Date:
> My thinking was that the reading from the table is very scattered, but
> that the writing to the new table could be done 'sequentially'. Therefore
> I thought it was interesting to see if it would help to cluster the writes.
>
> > Often the fastest way is to discard the index, and just read the table,
> > sorting each in pieces, and merging them in.  That is what psort does,
> > which is our sort code.  That is why I recommend the SELECT INTO
> > solution if you have enough disk space.
>
> A 'select into ... order by ...' you mean?

Yes.  See CLUSTER manual page:

       Another way is to use SELECT  ...  INTO  TABLE  temp  FROM
       ...ORDER  BY ...  This uses the PostgreSQL sorting code in
       ORDER BY to match  the  index,  and  is  much  faster  for
       unordered  data.   You  then drop the old table, use ALTER

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Need some help on code

From
dg@illustra.com (David Gould)
Date:
Maarten wrote:

> I was trying to change to cluster command to do the its writes clustered
> by a 100 tuples, thus hoping to improve performance. However, the code
> I've written crashes. This has certainly to do with some internal states
> of pgsql that aren't preserved in a HeapTuple.
>
> Could somebody with knowledge have a brief glimpse on my code and perhaps
> tell me how to do it properly?
...
> static void
> rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
> {
>     Relation    LocalNewHeap,
>                 LocalOldHeap,
>                 LocalOldIndex;
>     IndexScanDesc ScanDesc;
>     RetrieveIndexResult ScanResult;
>     ItemPointer HeapTid;
>     HeapTuple    LocalHeapTuple;
>     Buffer        LocalBuffer[100];
>     Oid            OIDNewHeapInsert;
>     Dllist      *ScanResList;
>     Dlelem      *ListEl;
>     int         count, loop;
>
>     /*
>      * Open the relations I need. Scan through the OldHeap on the OldIndex
>      * and insert each tuple into the NewHeap.
>      */
>     LocalNewHeap = (Relation) heap_open(OIDNewHeap);
>     LocalOldHeap = (Relation) heap_open(OIDOldHeap);
>     LocalOldIndex = (Relation) index_open(OIDOldIndex);
>     ScanResList = DLNewList();
>
>     ScanDesc = index_beginscan(LocalOldIndex, false, 0, (ScanKey) NULL);
>
>     loop = 1;
>     while (loop) {
>         count = 0;
>         while ((count < 100) &&
>                ((ScanResult =
>                 index_getnext(ScanDesc,
>                     ForwardScanDirection)) != NULL))
>         {
>
>             HeapTid = &ScanResult->heap_iptr;
>             pfree(ScanResult);
                        ^^^^^^^^^^^^^^^^^^
Hmmm, at this point, HeapTid is a pointer to what?

>             LocalHeapTuple = heap_fetch(LocalOldHeap, false,
>                     HeapTid, &LocalBuffer[count]);

Given more than one tuple on a page, then there may exist some
 LocalBuffer[i] == LocalBuffer[j] where i and j are distinct values of count.

>             ListEl = DLNewElem(LocalHeapTuple);
>             DLAddTail(ScanResList, ListEl);
>             count++;
>         }
>
>         if (count < 100) loop = 0;
>
>         count = 0;
>         while ((ListEl = DLRemHead(ScanResList)) != NULL) {
>             LocalHeapTuple = (HeapTuple)ListEl->dle_val;
>             DLFreeElem(ListEl);
>             OIDNewHeapInsert =
>                 heap_insert(LocalNewHeap, LocalHeapTuple);
>             ReleaseBuffer(LocalBuffer[count]);

So here we ReleaseBuffer(LocalBuffer[count]) which if there are more than
one LocalBuffer[] that are in fact the same buffer will release the buffer
multiple times.

>             count++;
>         }
>     }
>
>     index_endscan(ScanDesc);
>
>     index_close(LocalOldIndex);
>     heap_close(LocalOldHeap);
>     heap_close(LocalNewHeap);
>     DLFreeList(ScanResList);
> }


Hope this helps.
-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken

Re: [HACKERS] Need some help on code

From
Maarten Boekhold
Date:
> >             HeapTid = &ScanResult->heap_iptr;
> >             pfree(ScanResult);
>                         ^^^^^^^^^^^^^^^^^^
> Hmmm, at this point, HeapTid is a pointer to what?

I have no idea :) I think I missed the '&' in front of ScanResult, and
thus figured that HeapTid would still be valid after the pfree().

> >                 heap_insert(LocalNewHeap, LocalHeapTuple);
> >             ReleaseBuffer(LocalBuffer[count]);
>
> So here we ReleaseBuffer(LocalBuffer[count]) which if there are more than
> one LocalBuffer[] that are in fact the same buffer will release the buffer
> multiple times.

Well, I don't even know what 'LocalBuffer' is. It would be nice if there
would be some documentation describing the access methods with all
structs and types etc. that they use.

Thanx for trying to explain things,

Maarten

_____________________________________________________________________________
| TU Delft, The Netherlands, Faculty of Information Technology and Systems  |
|                   Department of Electrical Engineering                    |
|           Computer Architecture and Digital Technique section             |
|                          M.Boekhold@et.tudelft.nl                         |
-----------------------------------------------------------------------------