Importing Large Amounts of Data - Mailing list pgsql-hackers

From: Curt Sampson
Subject: Importing Large Amounts of Data
Date:
Msg-id: Pine.NEB.4.43.0204151445170.439-100000@angelic.cynic.net
Responses: Re: Importing Large Amounts of Data ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>)
List: pgsql-hackers
Appended is a message that I sent to the pgsql-general list, and
for which I received no useful reply. (Well, anyway, no reply that
has helped me to speed up my imports.) If you've read it already,
feel free to ignore it, but if you haven't, I'd appreciate any
advice on how to make this work as fast as possible.

There are some things that could be done to help optimize situations
like this, though I don't think any can be done in the short term.
Here are some of my thoughts, which may or may not be useful:

1. Postgres appears to have a fairly high row overhead (40 bytes
or so according to the FAQ), which grieves me slightly, as that's
actually larger than the size of the data in my tuples. It would
seem that in my case some of the items in that header (the OID and
the NULL bitfield) are not used; would it be possible to avoid
allocating these in relations that don't use them?

Also, is there a good description of the on-disk tuple format
somewhere? include/access/htup.h seems to require a fair bit of
knowledge about how other parts of the system work to be understood.

2. Does it make sense to be able to do some operations without
logging? For the COPY, if the system crashes and I lose some or
all of the tuples I'd imported so far, I don't care that much; I
can just restart the COPY at an appropriate point. As mentioned
below, that would save half a gig of disk writes when importing 5M
tuples.
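
As a concrete sketch of what I mean by restarting (the file names and
the one-million-row chunk size here are made up), the input could be
split into fixed-size chunks so that a crash costs at most one chunk:

    -- sketch only: paths and chunk boundaries are hypothetical
    COPY bigone FROM '/data/bigone.part01';   -- rows 1 .. 1000000
    COPY bigone FROM '/data/bigone.part02';   -- rows 1000001 .. 2000000
    -- ... and so on; after a crash,
    --   SELECT max(rec_no) FROM bigone;
    -- tells me which chunk to resume from.

That only limits how much work a crash can cost, of course; each chunk
is still logged.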

3. How about having a way to take a table off-line to work on it,
and bring it back on-line again when done? This would get rid of
the logging overhead, locking overhead, and that sort of stuff,
and in theory might be able to get you something approaching
disk-speed data imports.
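
Purely to illustrate the idea -- this syntax doesn't exist, it's just
a mock-up -- I'm picturing something along the lines of:

    -- hypothetical syntax; nothing like this exists in 7.2
    ALTER TABLE bigone SET OFFLINE;       -- no WAL, no locking; table unavailable to others
    COPY bigone FROM '/data/bigone.dat';
    ALTER TABLE bigone SET ONLINE;        -- sync to disk and make it visible again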

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
 

---------- Forwarded message ----------
Date: Thu, 11 Apr 2002 17:28:13 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: pgsql-general@postgresql.org
Subject: [GENERAL] Importing Large Amounts of Data

I've been asked by a client to do some testing of Postgres for what
appears to be OLAP on a fairly large data set (about half a billion
tuples). I'm probably going to want to try partitioning this in various
ways, but the application, not Postgres, will deal with that.

I'm using PostgreSQL 7.2.1, and the schema I'm testing with is as follows:
CREATE TABLE bigone (
    rec_no      INT         PRIMARY KEY,
    day         DATE        NOT NULL,
    user_id     CHAR(5)     NOT NULL,
    value       VARCHAR(20) NOT NULL
) WITHOUT OIDS;
DROP INDEX bigone_pkey;

[COPY is done here....]

CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
CREATE INDEX bigone_day ON bigone (day);
CREATE INDEX bigone_user_id ON bigone (user_id);

Unfortunately, the first problem I've run into is that importing is
rather slow. With all indexes (including the bigone_pkey) dropped,
importing five million tuples into the above table, starting from empty,
takes about 921 seconds. The second 5M tuples takes about 1009 seconds.
If I switch to using the -F option, the first 5M takes 714 seconds and the
second 5M takes 742 seconds. At the end, I have about 742 MB of data under
the data/base directory. (This is using a fresh database cluster.)

For comparison, MySQL does the two imports in about 221 and 304 seconds
respectively, and its data in the end takes up about 427 MB.

Part of the problem here may be that Postgres appears to be logging the
COPY operation; I get from 27-33 "recycled transaction log file" messages
for every 5M tuple COPY that I do. If there were a way to do an unlogged
copy, that might save close to half a gig of writes to the disk.
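
For what it's worth, here is a sketch of the postgresql.conf knobs that
seem to be involved in that log churn (the values are untested guesses
on my part, not recommendations, and none of this removes the logging
itself -- it just spreads the checkpoints out):

    checkpoint_segments = 32    # default is 3; checkpoint less often during the load
    checkpoint_timeout  = 900   # seconds between forced checkpoints
    wal_buffers         = 16    # number of in-memory WAL pages
    fsync = false               # same effect as starting the postmaster with -F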

The other part of the problem may just be the size of the data;
why does Postgres take up 75% more space (about 78 bytes per tuple,
vs. 45 bytes per tuple) for this table?
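
For reference, here's a rough way to cross-check that per-tuple figure
from inside the database (a sketch; it assumes the default 8K block
size, and relpages/reltuples are only current after a VACUUM):

    SELECT relname, relpages, reltuples,
           relpages * 8192.0 / reltuples AS bytes_per_tuple
    FROM   pg_class
    WHERE  relname = 'bigone';

Back of the envelope, a ~40-byte header plus a 4-byte line pointer per
row, plus the data itself (each of the two character columns carries a
4-byte length word) and alignment padding, does get into that
neighbourhood -- though the data-side numbers are my guesses.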

As well, index builds seem to take about 20% longer (using -F), and the
resulting indexes seem to be about 10% larger as well.

Does anybody have any suggestions as to how I can improve performance
here, and reduce disk space requirements? If not, I'll probably have
to suggest to the client that he move to MySQL for this particular
application, unless he needs any of the features that Postgres provides
and MySQL doesn't.


