Thread: Importing Large Amounts of Data
Appended is a message that I sent to the pgsql-general list, and for which I received no useful reply. (Well, anyway, no reply that has helped me to speed up my imports.) If you've read it already, feel free to ignore it, but if you haven't, I'd appreciate any advice on how to make this work as fast as possible.

There are some things that could be done to help optimize situations like this, though I don't think any can be done in the short term. Here are some of my thoughts, which may or may not be useful:

1. Postgres appears to have a fairly high row overhead (40 bytes or so according to the FAQ), which grieves me slightly, as that's actually larger than the size of the data in my tuples. It would seem that in my case some of the items in that header (the OID and the NULL bitfield) are not used; would it be possible to avoid allocating these in relations that don't use them? Also, is there a good description of the on-disk tuple format somewhere? include/access/htup.h seems to require a fair bit of knowledge about how other parts of the system work to be understood.

2. Does it make sense to be able to do some operations without logging? For the COPY, if the system crashes and I lose some or all of the tuples I'd imported so far, I don't care that much; I can just restart the COPY at an appropriate point. As mentioned below, that would save half a gig of disk writes when importing 5M tuples.

3. How about having a way to take a table off-line to work on it, and bring it back on-line again when done? This would get rid of the logging overhead, locking overhead, and that sort of thing, and in theory might get you something approaching disk-speed data imports.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC

---------- Forwarded message ----------
Date: Thu, 11 Apr 2002 17:28:13 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: pgsql-general@postgresql.org
Subject: [GENERAL] Importing Large Amounts of Data

I've been asked by a client to do some testing of Postgres for what appears to be OLAP on a fairly large data set (about half a billion tuples). I'm probably going to want to try partitioning this in various ways, but the application, not Postgres, will deal with that.

I'm using PostgreSQL 7.2.1, and the schema I'm testing with is as follows:

    CREATE TABLE bigone (
        rec_no   INT PRIMARY KEY,
        day      DATE NOT NULL,
        user_id  CHAR(5) NOT NULL,
        value    VARCHAR(20) NOT NULL
    ) WITHOUT OIDS;

    DROP INDEX bigone_pkey;

    [COPY is done here....]

    CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
    CREATE INDEX bigone_day ON bigone (day);
    CREATE INDEX bigone_user_id ON bigone (user_id);

Unfortunately, the first problem I've run into is that importing is rather slow. With all indexes (including bigone_pkey) dropped, importing five million tuples into the above table, starting from empty, takes about 921 seconds. The second 5M tuples take about 1009 seconds. If I switch to using the -F option, the first 5M take 714 seconds and the second 5M take 742 seconds. At the end, I have about 742 MB of data under the data/base directory. (This is using a fresh database cluster.)

For comparison, MySQL does each import in about 221 and 304 seconds, and the data in the end take up about 427 MB.

Part of the problem here may be that Postgres appears to be logging the COPY operation; I get 27-33 "recycled transaction log file" messages for every 5M tuple COPY that I do.
If there were a way to do an unlogged copy, that might save close to half a gig of writes to the disk.

The other part of the problem may just be the size of the data; why does Postgres take up 75% more space (about 78 bytes per tuple, vs. 45 bytes per tuple) for this table? As well, index builds seem to take about 20% longer (using -F), and they seem to be about 10% larger as well.

Does anybody have any suggestions as to how I can improve performance here, and reduce disk space requirements? If not, I'll probably have to suggest to the client that he move to MySQL for this particular application, unless he needs any of the features that Postgres provides and MySQL doesn't.
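(For concreteness, the "[COPY is done here....]" step above is a single bulk COPY per 5M-row batch; the file path and the default tab-delimited format shown here are placeholders, not the actual ones:)

    -- server-side copy, run as a database superuser
    COPY bigone FROM '/data/import/bigone_batch1.dat';

    -- or, from psql as an ordinary user, streaming the file from the client
    \copy bigone from '/data/import/bigone_batch1.dat'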
> 1. Postgres appears to have a fairly high row overhead (40 bytes or
> so according to the FAQ), which grieves me slightly, as that's
> actually larger than the size of the data in my tuples. It would
> seem that in my case some of the items in that header (the OID and
> the NULL bitfield) are not used; would it be possible to avoid
> allocating these in relations that don't use them?

CREATE TABLE WITHOUT OIDS ...

> As well, index builds seem to take about 20% longer (using -F), and
> they seem to be about 10% larger as well.
>
> Does anybody have any suggestions as to how I can improve performance
> here, and reduce disk space requirements? If not, I'll probably have
> to suggest to the client that he move to MySQL for this particular
> application, unless he needs any of the features that Postgres provides
> and MySQL doesn't.

This conclusion seems to me to be remarkably shortsighted. Does the initial data load into the database occur just once or quite often? If just once, then the initial loading time doesn't matter.

It's a bit hard to say "just turn off all the things that ensure your data integrity so it runs a bit faster" if you actually need data integrity.

Anyway, from what I understand an OLTP application is all about selects and memoising certain aggregate results. Since Postgres has far more advanced indexing and trigger support than MySQL, surely you need to take this kind of difference into account? The fact that you can load stuff quicker in MySQL and it takes up less disk space seems totally irrelevant. Just wait until your MySQL server crashes and your client finds that half his data is corrupted...

Chris
On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:

> > ...the OID and the NULL bitfield) are not used; would it be possible
> > to avoid allocating these in relations that don't use them?
>
> CREATE TABLE WITHOUT OIDS ...

As you can see from the schema I gave later in my message, that's exactly what I did. But does this actually avoid allocating the space in the on-disk tuples? What part of the code deals with this? It looks to me like the four bytes for the OID are still allocated in the tuple, but not used.

> This conclusion seems to me to be remarkably shortsighted. Does the
> initial data load into the database occur just once or quite often?

Well, I'm going to be doing the initial load (.5 billion tuples) quite a few times, in order to test some different partitioning arrangements. So I'll save quite a lot of time initially if I get a faster import. But from the looks of it, the production system will be doing daily imports of fresh data ranging in size from a couple of million rows to a couple of tens of millions of rows.

> It's a bit hard to say "just turn off all the things that ensure your
> data integrity so it runs a bit faster" if you actually need data
> integrity.

I'm not looking for "runs a bit faster"; five percent either way makes little difference to me. I'm looking for a five-fold performance increase.

> Anyway, from what I understand an OLTP application is all about selects
> and memoising certain aggregate results.

I guess that was a typo, and you meant OLAP? Anyway, from the looks of it, this is going to be fairly simple stuff. (Unfortunately, I don't have details of the real application the client has in mind, though I sure wish I did.) What I'm trying to indicate when I say "OLAP" is that it's basically selecting across broad swaths of a large data set, and doing little or nothing in the way of updates. (Except for the daily batches of data, of course.)

> The fact that you can load stuff quicker in MySQL and it takes up less
> disk space seems totally irrelevant.

Yeah, everybody's telling me this. Let me try once again here:

1. Every day, I must import millions, possibly tens of millions, of rows of data. Thus, speed of import is indeed fairly important to me.

2. It looks, at least at this point, as if the application will be doing only fairly simple selects out of the current half-billion rows of data and whatever gets added in the future. Thus, I don't think that using MySQL would be a problem. (If I did, I wouldn't be proposing it.)

I don't want to start a flamewar here, because personally I don't even like MySQL and would prefer always to use PostgreSQL. But it makes it a lot harder to do so when people keep insisting that import speed is not important. Rather than say that, why don't we just admit that PostgreSQL is a fairly crap performer in this regard at the moment (at least the way I'm doing it), and work out ways to fix this?

> Just wait until your MySQL server crashes and your client finds that
> half his data is corrupted...

If there are no updates, why would anything be corrupted? At any rate, I can always restore from backup, since little or nothing would be lost.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
> As you can see from the schema I gave later in my message, that's
> exactly what I did. But does this actually avoid allocating the
> space in the on-disk tuples? What part of the code deals with this?
> It looks to me like the four bytes for the OID are still allocated
> in the tuple, but not used.

OK, well I guess in that case they are -- I'm no expert on the file format.

> But from the looks of it, the production system will be doing daily
> imports of fresh data ranging in size from a couple of million rows
> to a couple of tens of millions of rows.

Well, that definitely makes a difference then...

> > It's a bit hard to say "just turn off all the things that ensure your
> > data integrity so it runs a bit faster" if you actually need data
> > integrity.
>
> I'm not looking for "runs a bit faster"; five percent either way
> makes little difference to me. I'm looking for a five-fold performance
> increase.

> Anyway, from the looks of it, this is going to be fairly simple
> stuff. (Unfortunately, I don't have details of the real application
> the client has in mind, though I sure wish I did.) What I'm trying
> to indicate when I say "OLAP" is that it's basically selecting
> across broad swaths of a large data set, and doing little or nothing
> in the way of updates. (Except for the daily batches of data, of
> course.)

OK, well now it depends on what kind of selects you're doing. Do you regularly select over a certain subset of the data? In that case, partial indices might give you a significant speedup. Do you select functions of columns? If so, then you'll need functional indices. MySQL doesn't have either of these. However, if you're always doing full table scans, then MySQL will probably do these faster.

Now, here's another scenario. Suppose you're often querying aggregate data over particular subsets of the data. Instead of requerying all the time, you can set up triggers to maintain your aggregates for you on the fly. This gives O(1) performance on select, compared to O(n). MySQL's new query cache might help you with this, however.

> I don't want to start a flamewar here, because personally I don't
> even like MySQL and would prefer always to use PostgreSQL. But it
> makes it a lot harder to do so when people keep insisting that
> import speed is not important. Rather than say that, why don't we
> just admit that PostgreSQL is a fairly crap performer in this regard
> at the moment (at least the way I'm doing it), and work out ways
> to fix this?

It depends on your definition. You have to accept a certain overhead if you're to have data integrity and MVCC. If you can't handle that overhead, then you can't have data integrity, and vice versa.

BTW, instead of:

CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);

do:

ALTER TABLE bigone ADD PRIMARY KEY (rec_no);

And remember to run "VACUUM ANALYZE bigone;" or just "ANALYZE bigone;" after the COPY and before trying to use the table. I'm not sure if it's better to analyze before or after the indexes are added, but it's definitely better to vacuum before the indexes are added.

Chris
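(To illustrate the partial and functional indices mentioned above, here is a rough sketch against the bigone schema from earlier; the date cutoff and the choice of lower() are just examples, not recommendations for this data set:)

    -- partial index: only index the rows a common query subset actually touches
    CREATE INDEX bigone_day_recent ON bigone (day)
        WHERE day >= '2002-01-01';

    -- functional index: useful when queries filter on an expression of a column
    CREATE INDEX bigone_value_lower ON bigone (lower(value));

    -- and the primary-key / statistics sequence suggested above, after the COPY:
    ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
    VACUUM ANALYZE bigone;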
On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:

> OK, well now it depends on what kind of selects you're doing. Do you
> regularly select over a certain subset of the data? In that case,
> partial indices might give you a significant speedup.

I believe from the information I've been given that we will indeed be regularly selecting over certain subsets, based on day. (One of the test queries I've been asked to use selects based on user_id and a date range.) But I was intending to partition the tables based on date range (to keep the index rebuild time from getting completely out of hand), so that will handily take care of that requirement anyway.

> Do you select functions of columns?

No.

> It depends on your definition. You have to accept a certain overhead if
> you're to have data integrity and MVCC. If you can't handle that
> overhead, then you can't have data integrity, and vice versa.

Well, a few points:

a) I am not convinced that data integrity should cost a five-fold decrease in performance.

b) In fact, at times I don't need that data integrity. I'm perfectly happy to risk the loss of a table during import, if it lets me do the import more quickly, especially if I'm taking the database off line to do the import anyway. MS SQL Server in fact allows me to specify relaxed integrity (with attendant risks) when doing a BULK INSERT; it would be cool if Postgres allowed that too.

> BTW, instead of:
>
> CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
>
> do:
>
> ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
>
> And remember to run "VACUUM ANALYZE bigone;" or just "ANALYZE bigone;"
> after the COPY and before trying to use the table. I'm not sure if it's
> better to analyze before or after the indexes are added, but it's
> definitely better to vacuum before the indexes are added.

Thanks. This is the kind of useful information I'm looking for. I was doing a vacuum after, rather than before, generating the indices.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
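(The date-range partitioning described here is handled by the application, since 7.2 has no native partitioning support; one plausible shape for it, with made-up table names, would be one table per date range:)

    -- one table per month, created by the application; names are hypothetical
    CREATE TABLE bigone_200203 (
        rec_no   INT PRIMARY KEY,
        day      DATE NOT NULL,
        user_id  CHAR(5) NOT NULL,
        value    VARCHAR(20) NOT NULL
    ) WITHOUT OIDS;
    -- ...and likewise bigone_200204, bigone_200205, etc., with the application
    -- routing each query to the table(s) covering the requested date range, so
    -- each index rebuild only touches one partition.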
> b) In fact, at times I don't need that data integrity. I'm perfectly
> happy to risk the loss of a table during import, if it lets me do the
> import more quickly, especially if I'm taking the database off line
> to do the import anyway. MS SQL Server in fact allows me to specify
> relaxed integrity (with attendant risks) when doing a BULK INSERT; it
> would be cool if Postgres allowed that too.

Well, I guess a TODO item would be to allow COPY to use relaxed constraints. I don't know how this would go over with the core developers, though.

> Thanks. This is the kind of useful information I'm looking for. I
> was doing a vacuum after, rather than before, generating the indices.

That's because the indexes themselves are cleaned out with vacuum, as well as the tables.

Chris
On Monday 15 April 2002 03:53, Christopher Kings-Lynne wrote:

> BTW, instead of:
>
> CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
>
> do:
>
> ALTER TABLE bigone ADD PRIMARY KEY (rec_no);

I am sorry, could you please elaborate more on the difference?

--
Denis
> On Monday 15 April 2002 03:53, Christopher Kings-Lynne wrote:
> > BTW, instead of:
> >
> > CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
> >
> > do:
> >
> > ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
>
> I am sorry, could you please elaborate more on the difference?

They have the same _effect_; it's just that the first syntax does not mark the index as the _primary_ index on the relation.

Chris
On Monday 15 April 2002 05:15, Christopher Kings-Lynne wrote:

> > On Monday 15 April 2002 03:53, Christopher Kings-Lynne wrote:
> > > BTW, instead of:
> > >
> > > CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
> > >
> > > do:
> > >
> > > ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
> >
> > I am sorry, could you please elaborate more on the difference?
>
> They have the same _effect_; it's just that the first syntax does not
> mark the index as the _primary_ index on the relation.

Yes, I know. I mean, how does this affect performance? How can this change the planner's decisions? Does it have any effect other than a cosmetic one?

--
Denis
> Yes, I know. I mean, how does this affect performance? How can this
> change the planner's decisions? Does it have any effect other than a
> cosmetic one?

Only cosmetic. In the example he gave, he wanted a primary key, so I showed him how to make one properly.

Chris
Curt Sampson <cjs@cynic.net> writes:
> On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:
>> CREATE TABLE WITHOUT OIDS ...

> As you can see from the schema I gave later in my message, that's
> exactly what I did. But does this actually avoid allocating the
> space in the on-disk tuples? What part of the code deals with this?
> It looks to me like the four bytes for the OID are still allocated
> in the tuple, but not used.

Curt is correct: WITHOUT OIDS does not save any storage. Having two different formats for the on-disk tuple header seemed more pain than the feature was worth. Also, because of alignment considerations it would save no storage on machines where MAXALIGN is 8. (Possibly my thinking is colored somewhat by the fact that that's so on all my favorite platforms ;-).)

However, as for the NULL values bitmap: that's already compacted out when not used, and always has been AFAIK.

>> It's a bit hard to say "just turn off all the things that ensure your
>> data integrity so it runs a bit faster" if you actually need data
>> integrity.

> I'm not looking for "runs a bit faster"; five percent either way
> makes little difference to me. I'm looking for a five-fold performance
> increase.

You are not going to get it from this; where in the world did you get the notion that data integrity costs that much? When the WAL stuff was added in 7.1, we certainly did not see any five-fold slowdown. If anything, testing seemed to indicate that WAL sped things up. A lot would depend on your particular scenario of course.

Have you tried all the usual speedup hacks? Turn off fsync, if you really think you do not care about crash integrity; use COPY FROM STDIN to bulk-load data, not retail INSERTs; possibly drop and recreate indexes rather than updating them piecemeal; etc. You should also consider not declaring foreign keys, as the runtime checks for reference validity are pretty expensive.

        regards, tom lane
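(Concretely, the fsync knob mentioned here is, in 7.2, a postgresql.conf setting, and is equivalent to the -F option timed in the original message; the data directory path and the pg_ctl form below are from memory, so treat this as a sketch:)

    # postgresql.conf on the test cluster -- same effect as running the backends with -F
    fsync = false

    # alternatively, pass -F through to the backends when starting the postmaster, e.g.:
    #   pg_ctl -D /usr/local/pgsql/data -o '-o -F' start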
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: >> Yes, I know. I mean how does this affect performance? How this can change >> planner decision? Does it have any effect except cosmetical one? > Only cosmetic. In the example he gave, he wanted a primary key, so I showed > him how to make one properly. The ALTER form will complain if any of the columns are not marked NOT NULL, so the difference isn't completely cosmetic. regards, tom lane
On Mon, 15 Apr 2002, Tom Lane wrote:

> > I'm not looking for "runs a bit faster"; five percent either way
> > makes little difference to me. I'm looking for a five-fold performance
> > increase.
>
> You are not going to get it from this; where in the world did you get
> the notion that data integrity costs that much?

Um...the fact that MySQL imports the same data five times as fast? :-)

Note that this is *only* related to bulk-importing huge amounts of data. Postgres seems a little bit slower than MySQL at building the indices afterwards, but this would be expected since (probably due to higher tuple overhead) the size of the data once in Postgres is about 75% larger than in MySQL: 742 MB vs. 420 MB. I've not done any serious testing of query speed, but the bit of toying I've done with it shows no major difference.

> Have you tried all the usual speedup hacks? Turn off fsync, if you
> really think you do not care about crash integrity; use COPY FROM STDIN
> to bulk-load data, not retail INSERTs; possibly drop and recreate
> indexes rather than updating them piecemeal; etc. You should also
> consider not declaring foreign keys, as the runtime checks for reference
> validity are pretty expensive.

Yes, I did all of the above. (This was all mentioned in my initial message, except for turning off foreign key constraints--but the table has no foreign keys.)

What I'm thinking would be really cool would be to have an "offline" way of creating tables using a stand-alone program that would write the files at, one hopes, near disk speed. Maybe it could work by creating the tables in a detached tablespace, and then you'd attach the tablespace when you're done. It might even be extended to be able to do foreign key checks, create indices, and so on. (Foreign key checks would be useful; I'm not sure that creating indices would be any faster than just doing it after the tablespace is attached.)

This would be particularly useful for fast restores of backups. Downtime while doing a restore is always a huge pain for large databases.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote:
> On Mon, 15 Apr 2002, Tom Lane wrote:
>
> > > I'm not looking for "runs a bit faster"; five percent either way
> > > makes little difference to me. I'm looking for a five-fold performance
> > > increase.
> >
> > You are not going to get it from this; where in the world did you get
> > the notion that data integrity costs that much?
>
> Um...the fact that MySQL imports the same data five times as fast? :-)
>
> Note that this is *only* related to bulk-importing huge amounts of
> data. Postgres seems a little bit slower than MySQL at building
> the indices afterwards, but this would be expected since (probably
> due to higher tuple overhead) the size of the data once in Postgres
> is about 75% larger than in MySQL: 742 MB vs. 420 MB. I've not done
> any serious testing of query speed, but the bit of toying I've done
> with it shows no major difference.

Can you check your load and see if there is a PRIMARY key on the table at the time it is being loaded. In the old days, we created indexes only after the data was loaded, but when we added PRIMARY key, pg_dump was creating the table with PRIMARY key then loading it, meaning the table was being loaded while it had an existing index. I know we fixed this recently but I am not sure if it was in 7.2 or not.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Mon, 15 Apr 2002 21:44:26 -0400 (EDT)
"Bruce Momjian" <pgman@candle.pha.pa.us> wrote:

> In the old days, we created indexes only after the data was loaded,
> but when we added PRIMARY key, pg_dump was creating the table with
> PRIMARY key then loading it, meaning the table was being loaded while
> it had an existing index. I know we fixed this recently but I am not
> sure if it was in 7.2 or not.

It's not in 7.2 -- but it's fixed in CVS.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
On Mon, 15 Apr 2002, Bruce Momjian wrote:

> Can you check your load and see if there is a PRIMARY key on the table
> at the time it is being loaded.

There is not. I create the table with a PRIMARY KEY declaration, but I drop that index before doing the import, and do an ALTER TABLE to re-add the primary key afterwards.

At one point I tried doing a load with all indices enabled, but after about eight or nine hours I gave up. (Typically the load takes about 30 minutes. This is using about 2% of the sample data.)

> In the old days, we created indexes only after the data was loaded,
> but when we added PRIMARY key, pg_dump was creating the table with
> PRIMARY key then loading it, meaning the table was being loaded while
> it had an existing index. I know we fixed this recently but I am not
> sure if it was in 7.2 or not.

Ah, I saw that fix. But I'm doing the load by hand, not using pg_restore.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
On Tue, 16 Apr 2002, Curt Sampson wrote:

[snip]

> What I'm thinking would be really cool would be to have an "offline"
> way of creating tables using a stand-alone program that would write
> the files at, one hopes, near disk speed.

Personally, I think there is some merit in this. Postgres can be used for large-scale data mining, an application which does not (usually) need multi-versioning and concurrency, but which can benefit from Postgres's implementation of SQL, as well as backend extensibility.

I don't see any straightforward way of modifying the code to allow a fast path directly to the relations on disk. However, it should be possible to bypass locking, RI, MVCC etc. with the use of a bootstrap-like tool. Such a tool would only be able to be used when the database was offline. It would read data from files passed to it in some format, perhaps that generated by COPY.

Given the very low parsing and 'planning' overhead, the real cost would be WAL (the bootstrapper could fail and render the database unusable) and the subsequent updating of on-disk relations.

Comments?

Gavin
On Tue, 16 Apr 2002, Gavin Sherry wrote:

> I don't see any straightforward way of modifying the code to allow a
> fast path directly to the relations on disk. However, it should be
> possible to bypass locking, RI, MVCC etc. with the use of a
> bootstrap-like tool.

That was my thought. I'm not asking for disk-speed writes through the database server itself; I can make do with taking the server off-line, doing the imports, and bringing it back again, as you suggest.

> Given the very low parsing and 'planning' overhead, the real cost would
> be WAL (the bootstrapper could fail and render the database unusable)
> and the subsequent updating of on-disk relations.

MS SQL Server, when doing a BULK INSERT or BCP, can do it as a fully or "minimally" logged operation. When minimally logged, there's no ability to roll forward or recover inserted data, just the ability to go back to the state at the beginning of the operation. This technique can work even on an on-line database. A bit more information is available at

    http://msdn.microsoft.com/library/en-us/adminsql/ad_impt_bcp_9esz.asp

(You may want to browse this with lynx; the javascript in it is going to force your screen into a configuration with frames.) You can follow some of the links in that page for further information.

Another option, for off-line databases, might just be not to log at all. If you take a backup first, it may be faster to restore the backup and start again than to try to roll back the operation, or roll it forward to partial completion and then figure out where to restart your import. This seems especially likely if you can restore only the files relating to the table that was actually damaged.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
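(For readers unfamiliar with the MS SQL side of this, the operation being described looks roughly like the following; the database name and file path are placeholders, and the option names are from memory of SQL Server 2000, so treat this as a sketch rather than a reference:)

    -- allow minimally-logged bulk operations in this database
    EXEC sp_dboption 'testdb', 'select into/bulkcopy', 'true';

    -- table-locked bulk load, eligible for minimal logging
    BULK INSERT bigone FROM 'c:\data\bigone.dat' WITH (TABLOCK);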
On Tue, 16 Apr 2002, Curt Sampson wrote:

> > Given the very low parsing and 'planning' overhead, the real cost would
> > be WAL (the bootstrapper could fail and render the database unusable)
> > and the subsequent updating of on-disk relations.
>
> MS SQL Server, when doing a BULK INSERT or BCP, can do it as a fully or
> "minimally" logged operation. When minimally logged, there's no ability
> to roll forward or recover inserted data, just the ability to go back
> to the state at the beginning of the operation. This technique can work
> even on an on-line database. A bit more information is available at

The other reason I say that this bootstrap tool would still use WAL is that bypassing WAL would require writing a fairly significant amount of code (unless the pre-WAL heap_insert() code could be used, with relevant modification). On the other hand, I would imagine it to be very difficult to implement an 'interactive' roll back facility with the kind of tool I am describing.

Gavin