Thread: Loading optimization

Loading optimization

From
Gary Wesley
Date:
Is there any advantage to having data sorted before populating it into a
table?
(In 6.5)

Gary Wesley


Re: Loading optimization

From
"Brett W. McCoy"
Date:
On Wed, 3 Jan 2001, Gary Wesley wrote:

> Is there any advantage to having data sorted before populating it into a
> table?

None at all, AFAIK.  In fact, if you are doing a bulk copy, you should
build your indexes after the data is loaded -- you'll get ebtter
performance during the load.

-- Brett
                                     http://www.chapelperilous.net/~bmccoy/
---------------------------------------------------------------------------
This sentence no verb.


Re: Loading optimization

From
Ian Harding
Date:
Gary Wesley wrote:

> Is there any advantage to having data sorted before populating it into a
> table?
> (In 6.5)
>
> Gary Wesley

Yes.  You have effectively loaded it with a clustered index.  If you
cluster an index on the sort column after loading it, the sort will remain
in effect and will speed queries/joins that use that column.

Ian


Re: Re: Loading optimization

From
Martijn van Oosterhout
Date:
Ian Harding wrote:
>
> Gary Wesley wrote:
>
> > Is there any advantage to having data sorted before populating it into a
> > table?
> > (In 6.5)
> >
> > Gary Wesley
>
> Yes.  You have effectively loaded it with a clustered index.  If you
> cluster an index on the sort column after loading it, the sort will remain
> in effect and will speed queries/joins that use that column.

But does postgres actually use the fact that the data is clustered? I
keep thinking that here I could cluster all our data such that a
sequential
search is almost always a bad idea but I have no idea how to relate that
fact to postgres...
--
Martijn van Oosterhout <kleptog@cupid.suninternet.com>
http://cupid.suninternet.com/~kleptog/

Re: Re: Loading optimization

From
Bruce Momjian
Date:
> Ian Harding wrote:
> >
> > Gary Wesley wrote:
> >
> > > Is there any advantage to having data sorted before populating it into a
> > > table?
> > > (In 6.5)
> > >
> > > Gary Wesley
> >
> > Yes.  You have effectively loaded it with a clustered index.  If you
> > cluster an index on the sort column after loading it, the sort will remain
> > in effect and will speed queries/joins that use that column.
>
> But does postgres actually use the fact that the data is clustered? I
> keep thinking that here I could cluster all our data such that a
> sequential
> search is almost always a bad idea but I have no idea how to relate that
> fact to postgres...

Well, clustering certainly speeds up index access to multiple heap
values because duplicate values are all on the same heap page.  One
thing that is missing is that there is no preference for index scans for
clustered indexes.

Because the clustering is not permanent, but becomes unclustered as data
is added/modified, there is no easy way to know if the clustering is
still valid.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Re: Loading optimization

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@cupid.suninternet.com> writes:
> But does postgres actually use the fact that the data is clustered?

The planner has no idea that the table is clustered, and will estimate
indexscan costs on the assumption that the data is randomly ordered in
the table.  So you're likely to get a seqscan plan for queries where
indexscan would actually be faster.  This is something we need to fix,
but the main problem is accounting for the fact that the clustered order
will degrade over time as data is added/updated.  See past discussions
in pghackers.

The CLUSTER implementation is so shoddy at the moment that I'm hesitant
to encourage people to use it anyway :-(.  We've got to rewrite it so
that it doesn't drop other indexes, lose constraints, break foreign
key and inheritance relationships, etc etc.

            regards, tom lane

Re: Re: Loading optimization

From
kleptog@cupid.suninternet.com (Martijn van Oosterhout)
Date:
On Tue, Jan 09, 2001 at 10:51:35AM -0500, Bruce Momjian wrote:
> Well, clustering certainly speeds up index access to multiple heap
> values because duplicate values are all on the same heap page.  One
> thing that is missing is that there is no preference for index scans for
> clustered indexes.

Maybe that would be the simple way, just a flag. Alternatively, have VACUUM
ANALYZE estimate the "cohesiveness" of the data...

> Because the clustering is not permanent, but becomes unclustered as data
> is added/modified, there is no easy way to know if the clustering is
> still valid.

Well, in our case the table has over 1,000,000 rows and refer to items that
would appear on a bill. Since a bill is never changed after the fact, the
clustering is always in effect.

This table is a WORM table, once data is added, it is never updated. The
question is, is this typical of very large tables? If that is the case then
generally clustering would tend to stay rather than degrade.

Also, in our case, clustering by a single index is not really sufficient.
Within a single bill we would like to cluster the items by service. I was
thinking or writing a program that would do a pg_dump, order by columns as
requested and then dump it back in. I would've done it except that
Postgresql won't use the fact that it's sorted.

I for one am hoping for progress in this area. Unnessesary sequential scans
are painful when someone is waiting on the phone...

Martijn

Re: Re: Loading optimization

From
Tom Lane
Date:
Ian Harding <iharding@pakrat.com> writes:
> Tom Lane wrote:
>> The CLUSTER implementation is so shoddy at the moment that I'm hesitant
>> to encourage people to use it anyway :-(.  We've got to rewrite it so
>> that it doesn't drop other indexes, lose constraints, break foreign
>> key and inheritance relationships, etc etc.

> Are the problems with CLUSTER isolated to the creation of the clustering,
> or the maintenance of it?

I guess you could consider it a bug that the clustered order is not
preserved by subsequent inserts/updates, but I don't.  Otherwise the
problem is just with creation.  That effectively does something like

    SELECT * INTO temp_NNN FROM your_table ORDER BY index_var;

and then drops your_table and renames temp_NNN into place.  So all
that's copied are the column types; you lose all other auxiliary info
about the table.

Now that I look at the code, it'd be very easy to preserve constraints
(a small change in the code would allow copying them to the new table)
so maybe we should do that.  But the other issues like inheritance
relationships can't be fixed without a fundamentally different
implementation method, one that preserves the identity (OID) of the
table.  You can find past discussions about how to do this in the
pghackers archives; it seems within reach given the changes made for
7.1, so perhaps someone will get to it in 7.2 or so.

            regards, tom lane

Re: Re: Loading optimization

From
Bruce Momjian
Date:
> Also, in our case, clustering by a single index is not really sufficient.
> Within a single bill we would like to cluster the items by service. I was
> thinking or writing a program that would do a pg_dump, order by columns as
> requested and then dump it back in. I would've done it except that
> Postgresql won't use the fact that it's sorted.

You can create an index on all the columns, do a CLUSTER, then drop the
index and create the one you need.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Re: Loading optimization

From
Bruce Momjian
Date:
Added to TODO (part of this is reorganization of cluster items):

* CLUSTER
        * cluster all tables at once
        * prent lose of constraints, indexes, permissions, inheritance
        * Automatically keep clustering on a table
        * Keep statistics about clustering, perhaps during VACUUM ANALYZE
          [optimizer]


> Ian Harding <iharding@pakrat.com> writes:
> > Tom Lane wrote:
> >> The CLUSTER implementation is so shoddy at the moment that I'm hesitant
> >> to encourage people to use it anyway :-(.  We've got to rewrite it so
> >> that it doesn't drop other indexes, lose constraints, break foreign
> >> key and inheritance relationships, etc etc.
>
> > Are the problems with CLUSTER isolated to the creation of the clustering,
> > or the maintenance of it?
>
> I guess you could consider it a bug that the clustered order is not
> preserved by subsequent inserts/updates, but I don't.  Otherwise the
> problem is just with creation.  That effectively does something like
>
>     SELECT * INTO temp_NNN FROM your_table ORDER BY index_var;
>
> and then drops your_table and renames temp_NNN into place.  So all
> that's copied are the column types; you lose all other auxiliary info
> about the table.
>
> Now that I look at the code, it'd be very easy to preserve constraints
> (a small change in the code would allow copying them to the new table)
> so maybe we should do that.  But the other issues like inheritance
> relationships can't be fixed without a fundamentally different
> implementation method, one that preserves the identity (OID) of the
> table.  You can find past discussions about how to do this in the
> pghackers archives; it seems within reach given the changes made for
> 7.1, so perhaps someone will get to it in 7.2 or so.
>
>             regards, tom lane
>


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Re: Loading optimization

From
Alfred Perlstein
Date:
* Bruce Momjian <pgman@candle.pha.pa.us> [010111 13:12] wrote:
> Added to TODO (part of this is reorganization of cluster items):
>
> * CLUSTER
>         * cluster all tables at once
>         * prent lose of constraints, indexes, permissions, inheritance
>         * Automatically keep clustering on a table
>         * Keep statistics about clustering, perhaps during VACUUM ANALYZE
>           [optimizer]

Are there any plans (or maybe it's already done?) for a fsck-like
utility for postgresql that will automagically drop all indecies,
scan the table for bad data and then recreate the index, prefereably
with only an exclusive lock on the database being manipulated(*)?

(*) to avoid having to start up the postmaster with weird options
and/or stop activity on other databases under the postmaster's
control.

thanks,
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Re: Re: Loading optimization

From
Bruce Momjian
Date:
We have talked about this, but we have not seen enought corruption cases
to warrant it.


> * Bruce Momjian <pgman@candle.pha.pa.us> [010111 13:12] wrote:
> > Added to TODO (part of this is reorganization of cluster items):
> >
> > * CLUSTER
> >         * cluster all tables at once
> >         * prent lose of constraints, indexes, permissions, inheritance
> >         * Automatically keep clustering on a table
> >         * Keep statistics about clustering, perhaps during VACUUM ANALYZE
> >           [optimizer]
>
> Are there any plans (or maybe it's already done?) for a fsck-like
> utility for postgresql that will automagically drop all indecies,
> scan the table for bad data and then recreate the index, prefereably
> with only an exclusive lock on the database being manipulated(*)?
>
> (*) to avoid having to start up the postmaster with weird options
> and/or stop activity on other databases under the postmaster's
> control.
>
> thanks,
> --
> -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> "I have the heart of a child; I keep it in a jar on my desk."
>


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

RE: Re: Loading optimization

From
"Mikheev, Vadim"
Date:
> We have talked about this, but we have not seen enought
> corruption cases to warrant it.

Huh?! Just pg_ctl -m immediate stop in <= 7.0.X with high
insert activity and ... pray.

It's changed in 7.1 by WAL - btree doesn't lose tuples in split
ops anymore and in after crash restart you'll never see btree
tuples pointing to unexisted heap tuples (because of index tuples
inserted/logged after heap ones).

Nevertheless, WAL doesn't guarantee logical consistency of index
in the case of aborted split ops (there may be pages unreferenced
from parent level) - selects will work but inserts may result in
"my bits moved off..." error. I'll try to fix this in 7.1 if I'll
get some time (it doesn't look too hard to do).

Also, WAL doesn't prevent zero blocks in files after crash - I didn't
want to fsync log on each new block allocation, - but this shouldn't
be a problem (except of lost disk space), afair scans are smart to
handle it, Tom? I'm going to fix this with new smgr.

Vadim

Re: Re: Loading optimization

From
Alfred Perlstein
Date:
* Mikheev, Vadim <vmikheev@SECTORBASE.COM> [010111 14:14] wrote:
> > We have talked about this, but we have not seen enought
> > corruption cases to warrant it.
>
> Huh?! Just pg_ctl -m immediate stop in <= 7.0.X with high
> insert activity and ... pray.

no kidding. :)

>
> It's changed in 7.1 by WAL - btree doesn't lose tuples in split
> ops anymore and in after crash restart you'll never see btree
> tuples pointing to unexisted heap tuples (because of index tuples
> inserted/logged after heap ones).

The main thing is that I consider myself to be pretty ok with
postgresql right now, but it's still a major amount of manual
labor to _absolutely_ make sure no data has been lost and make
sure that no more corruption exists that could get the server
to die or go off into some inifinite loop.  We really haven't
had these problems for a long time but it's be really comforting
to have something that pretty much made sure it wouldn't.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

RE: Re: Loading optimization

From
"Mikheev, Vadim"
Date:
> > It's changed in 7.1 by WAL - btree doesn't lose tuples in split
> > ops anymore and in after crash restart you'll never see btree
> > tuples pointing to unexisted heap tuples (because of index tuples
> > inserted/logged after heap ones).
>
> The main thing is that I consider myself to be pretty ok with
> postgresql right now, but it's still a major amount of manual
> labor to _absolutely_ make sure no data has been lost and make
> sure that no more corruption exists that could get the server
> to die or go off into some inifinite loop.  We really haven't
> had these problems for a long time but it's be really comforting
> to have something that pretty much made sure it wouldn't.

Just remembered from my prev observations in index area that
it's still possible to get infinite loop if root page was
splitted but new root was not created...

Vadim

Re: Re: Loading optimization

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> Also, WAL doesn't prevent zero blocks in files after crash - I didn't
> want to fsync log on each new block allocation, - but this shouldn't
> be a problem (except of lost disk space), afair scans are smart to
> handle it, Tom?

This is OK for table files, unless someone's broken the code that will
auto-initialize a zero page when it comes across one.

I had a note to myself saying that zeroed pages in indexes may not be
OK, but I'm not sure if that's really a risk or not.  btree, at least,
never does any linear scans of an index file; it can only visit pages
that are linked to by parent or sibling links.  If we are careful that
we init a btree page before we modify its sibling(s) and parent, we
should be OK.

            regards, tom lane

RE: Re: Loading optimization

From
"Mikheev, Vadim"
Date:
> This is OK for table files, unless someone's broken the code that will
> auto-initialize a zero page when it comes across one.

Hmmm, I don't see anything like auto-initialization in code -:(
Where did you put these changes?

> I had a note to myself saying that zeroed pages in indexes may not be
> OK, but I'm not sure if that's really a risk or not.  btree, at least,
> never does any linear scans of an index file; it can only visit pages
> that are linked to by parent or sibling links.  If we are careful that
> we init a btree page before we modify its sibling(s) and parent, we
> should be OK.

Should be covered by WALoging of newroot & split ops which only add
new pages.

Vadim

Re: Re: Loading optimization

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> This is OK for table files, unless someone's broken the code that will
>> auto-initialize a zero page when it comes across one.

> Hmmm, I don't see anything like auto-initialization in code -:(
> Where did you put these changes?

I didn't put 'em in, it looked like your work to me: see vacuum.c,
lines 618-622 in current sources.

Awhile back I did fix PageGetFreeSpace and some related macros to
deliver sane results when looking at an all-zero page header, so that
scans and inserts would ignore the page until vacuum fixes it.

Perhaps WAL redo needs to be prepared to do PageInit as well?

Actually, I'd expect the CRC check to catch an all-zeroes page (if
it fails to complain, then you misimplemented the CRC), so that would
be the place to deal with it now.


>> If we are careful that
>> we init a btree page before we modify its sibling(s) and parent, we
>> should be OK.

> Should be covered by WALoging of newroot & split ops which only add
> new pages.

OK, sounds good.

            regards, tom lane

RE: Re: Loading optimization

From
"Mikheev, Vadim"
Date:
> >> This is OK for table files, unless someone's broken the
> >> code that will auto-initialize a zero page when it comes across one.
>
> > Hmmm, I don't see anything like auto-initialization in code -:(
> > Where did you put these changes?
>
> I didn't put 'em in, it looked like your work to me: see vacuum.c,
> lines 618-622 in current sources.

Oh, this code was there from 6.0 days.

> Awhile back I did fix PageGetFreeSpace and some related macros to
> deliver sane results when looking at an all-zero page header, so that
> scans and inserts would ignore the page until vacuum fixes it.

I see now - PageGetMaxOffsetNumber... Ok.

> Perhaps WAL redo needs to be prepared to do PageInit as well?

It calls PageIsNew and uses flag in record to know when a page could
be uninitialized.

> Actually, I'd expect the CRC check to catch an all-zeroes page (if
> it fails to complain, then you misimplemented the CRC), so that would
> be the place to deal with it now.

I've used standard CRC32 implementation you pointed me to -:)
But CRC is used in WAL records only.

Vadim

Re: Re: Loading optimization

From
Ian Harding
Date:
Tom Lane wrote:

> Martijn van Oosterhout <kleptog@cupid.suninternet.com> writes:
> > But does postgres actually use the fact that the data is clustered?
>
> The planner has no idea that the table is clustered, and will estimate
> indexscan costs on the assumption that the data is randomly ordered in
> the table.  So you're likely to get a seqscan plan for queries where
> indexscan would actually be faster.  This is something we need to fix,
> but the main problem is accounting for the fact that the clustered order
> will degrade over time as data is added/updated.  See past discussions
> in pghackers.
>
> The CLUSTER implementation is so shoddy at the moment that I'm hesitant
> to encourage people to use it anyway :-(.  We've got to rewrite it so
> that it doesn't drop other indexes, lose constraints, break foreign
> key and inheritance relationships, etc etc.
>
>                         regards, tom lane

Are the problems with CLUSTER isolated to the creation of the clustering,
or the maintenance of it?  If I cluster an index before I create any
relationships, constraints, or other indexes, (or load any data for that
matter) am I going to be OK?  BTW, Microsoft recommends creating clustered
indexes first, because creating one will cause all other existing indexes
to be dropped and recreated.  That bit makes sense, since rebuilding all
your indexes might take some time, and they have to be recreated since the
data has moved, right?

Ian