Re: Importing Large Amounts of Data - Mailing list pgsql-hackers

From Curt Sampson
Subject Re: Importing Large Amounts of Data
Date
Msg-id Pine.NEB.4.43.0204151616020.439-100000@angelic.cynic.net
In response to Re: Importing Large Amounts of Data  ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>)
Responses Re: Importing Large Amounts of Data  ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>)
Re: Importing Large Amounts of Data  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:

> > ...the OID and
> > the NULL bitfield) are not used; would it be possible to avoid
> > allocating these in relations that don't use them?
>
> CREATE TABLE WITHOUT OIDS ...

As you can see from the schema I gave later in my message, that's
exactly what I did. But does this actually avoid allocating the
space in the on-disk tuples? What part of the code deals with this?
It looks to me like the four bytes for the OID are still allocated
in the tuple, but not used.
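
For what it's worth, the tables in my tests look something like
this (the names here are invented for illustration, not my real
schema):

    -- Hypothetical example: no OIDs, and every column NOT NULL,
    -- so in principle neither the OID nor the NULL bitmap should
    -- be needed in the tuple header.
    CREATE TABLE test_data (
        id     integer   NOT NULL,
        ts     timestamp NOT NULL,
        value  integer   NOT NULL
    ) WITHOUT OIDS;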

> This conclusion seems to me to be remarkably shortsighted.  Does the initial
> data load into the database occur just once or quite often?

Well, I'm going to be doing the initial load (.5 billion tuples) quite
a few times, in order to test some different partitioning arrangements.
So I'll save quite a lot of time initially if I get a faster import.

But from the looks of it, the production system will be doing daily
imports of fresh data ranging in size from a couple of million rows
to a couple of tens of millions of rows.
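
To make that concrete: a load here is basically a COPY. One obvious
trick is to build the indexes after the data is in, rather than
maintaining them row by row (the table, index, and file names below
are invented for illustration):

    -- Sketch of one daily batch load.
    DROP INDEX test_data_ts_idx;
    COPY test_data FROM '/data/batches/20020415.txt';
    CREATE INDEX test_data_ts_idx ON test_data (ts);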

> It's a bit hard to say "just turn off all the things that ensure your data
> integrity so it runs a bit faster", if you actually need data integrity.

I'm not looking for "runs a bit faster;" five percent either way
makes little difference to me. I'm looking for a five-fold performance
increase.

> Anyway, from what I understand an OLTP application is all about selects and
> memoising certain aggregate results.

I guess that was a typo, and you meant OLAP?

Anyway, from the looks of it, this is going to be fairly simple
stuff.  (Unfortunately, I don't have details of the real application
the client has in mind, though I sure wish I did.) What I'm trying
to indicate when I say "OLAP" is that it's basically selecting
across broad swaths of a large data set, and doing little or nothing
in the way of updates. (Except for the daily batches of data, of
course.)
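
In other words, queries of roughly this shape, and not much else
(again, the table and columns are just illustrative):

    -- Illustrative only: a broad aggregate scan over a date
    -- range, with no updates involved.
    SELECT ts::date AS day, count(*), avg(value)
    FROM test_data
    WHERE ts >= '2002-01-01' AND ts < '2002-04-01'
    GROUP BY ts::date
    ORDER BY day;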

> The fact that you can load stuff quicker in
> MySQL and it takes up less disk space seems totally irrelevant.

Yeah, everybody's telling me this. Let me try once again here:
1. Every day, I must import millions, possibly tens of millions,
   of rows of data. Thus, speed of import is indeed fairly
   important to me.

2. It looks, at least at this point, as if the application will
   be doing only fairly simple selects out of the current
   half-billion rows of data and whatever gets added in the
   future. Thus, I don't think that using MySQL would be a
   problem. (If I did, I wouldn't be proposing it.)

I don't want to start a flamewar here, because personally I don't
even like MySQL and would prefer always to use PostgreSQL. But it
makes it a lot harder to do so when people keep insisting that
import speed is not important. Rather than say that, why don't we
just admit that PostgreSQL is a fairly crap performer in this regard
at the moment (at least the way I'm doing it), and work out ways
to fix this?

> Just wait until your MySQL server crashes and your client finds that half
> his data is corrupted...

If there are no updates, why would anything be corrupted? At any
rate, I can always restore from backup, since little or nothing
would be lost.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC