Re: [HACKERS] Problems with >2GB tables on Linux 2.0 - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: [HACKERS] Problems with >2GB tables on Linux 2.0
Date:
Msg-id: 17722.918518134@sss.pgh.pa.us
In response to: Re: [HACKERS] Problems with >2GB tables on Linux 2.0 (Thomas Reinke <reinke@e-softinc.com>)
Responses: Re: [HACKERS] Problems with >2GB tables on Linux 2.0 (Peter T Mount <peter@retep.org.uk>)
List: pgsql-hackers
Peter T Mount wrote:
>> How about dropping the suffix, so you would have:
>> .../data/2/tablename
>> Doing that doesn't mean having to increase the filename buffer size, just
>> the format and arg order (from %s.%d to %d/%s).

I thought of that also, but concluded it was a bad idea, because it
means you cannot symlink several of the /n subdirectories to the same
place.  It also seems just plain risky/error-prone to have different
files named the same thing...
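
To make the alternatives concrete, here is a rough sketch of the path
formats involved (relname and segno are just names I made up for the
illustration, and the third variant is only my reading of the above,
not the actual backend code):

    #include <stdio.h>

    int
    main(void)
    {
        const char *relname = "tablename";  /* example relation name */
        int         segno = 2;              /* example segment number */
        char        path[1024];

        /* current scheme: all segments in the data directory, numeric suffix */
        snprintf(path, sizeof(path), "%s.%d", relname, segno);
        printf("%s\n", path);               /* prints: tablename.2 */

        /* the suggestion above: move to a subdirectory, drop the suffix */
        snprintf(path, sizeof(path), "%d/%s", segno, relname);
        printf("%s\n", path);               /* prints: 2/tablename */

        /* keeping the suffix inside the subdirectory means no two segment
         * files share a name, so several /n subdirectories can safely be
         * symlinked to the same place */
        snprintf(path, sizeof(path), "%d/%s.%d", segno, relname, segno);
        printf("%s\n", path);               /* prints: 2/tablename.2 */

        return 0;
    }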

>> I'd think we could add a test when the new segment is created for the
>> symlink/directory. If it doesn't exist, then create it.

Absolutely, the system would need to auto-create a /n subdirectory if
one didn't already exist.
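
Something along these lines would do it; this is only a sketch, and the
datadir argument and path layout are assumptions for the example rather
than the backend's actual spelling:

    /* Make sure data/<n> exists before segment file <n> is created. */
    #include <stdio.h>
    #include <errno.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int
    ensure_segment_dir(const char *datadir, int segno)
    {
        char path[1024];

        snprintf(path, sizeof(path), "%s/%d", datadir, segno);

        /* EEXIST is fine: the directory may already exist, or the admin
         * may have set it up as a symlink onto another filesystem. */
        if (mkdir(path, 0700) < 0 && errno != EEXIST)
            return -1;

        return 0;
    }

    int
    main(void)
    {
        return ensure_segment_dir("/usr/local/pgsql/data", 2);
    }

Ignoring EEXIST is what makes the symlink trick work: the admin can
pre-create data/2 as a symlink and the backend just uses it.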

Thomas Reinke <reinke@e-softinc.com> writes:
> ... I'm not entirely sure that this is an effective
> solution to data distribution.

Well, I'm certain we could do better if we wanted to put some direct
effort into that issue, but we can get a usable scheme this way with
practically no effort except writing a little how-to documentation.

Assume you have N big tables where you know what N is.  (You probably
have a lot of little tables as well, which we assume can be ignored for
the purposes of space allocation.)  If you configure the max file size
as M megabytes, the toplevel data directory will have M * N megabytes
of stuff (plus little files).  If all the big tables are about the same
size, say K * M meg apiece, then you wind up with K-1 subdirectories
each also containing M * N meg, which you can readily scatter across
different filesystems by setting up the subdirectories as symlinks.
In practice the later subdirectories are probably less full because
the big tables aren't all equally big, but you can put more of them
on a single filesystem to make up for that.
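
For instance (all numbers made up purely for illustration), with N = 8
big tables, M = 1024 MB, and each big table around 5 GB, you'd expect
something like this:

    /* Back-of-the-envelope illustration of the layout described above. */
    #include <stdio.h>

    int
    main(void)
    {
        const int N = 8;     /* number of big tables */
        const int M = 1024;  /* configured max segment size, in MB */
        const int K = 5;     /* each big table is roughly K * M MB */
        int n;

        /* segment 0 of every big table sits in the toplevel data directory */
        printf("data/   ~ %d MB (plus the little tables)\n", N * M);

        /* segments 1 .. K-1 each go into data/<n>/, which can be a symlink
         * onto a different filesystem */
        for (n = 1; n < K; n++)
            printf("data/%d/ ~ %d MB\n", n, N * M);

        return 0;
    }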

If N varies considerably over time then this scheme doesn't work so
well, but I don't see any scheme that would cope with a very variable
database without physically moving files around every so often.

When we get to the point where people are routinely complaining about what
a pain in the neck it is to manage big databases this way, it'll be
time enough to improve the design and write some scripts to help
rearrange files on the fly.  Right now, I would just like to see a
scheme that doesn't require the dbadmin to symlink each individual
table file in order to split a big database.  (It could probably be
argued that even doing that much is ahead of the demand, but since
it's so cheap to provide this little bit of functionality we might
as well do it.)

> I'd suggest making the max file size 1 Gig default, configurable
> someplace, and solving the data distribution as a separate effort.

We might actually be saying the same thing, if by that remark you
mean that we can come back later and write "real" data distribution
management tools.  I'm just pointing out that given a configurable
max file size we can have a primitive facility almost for free.
        regards, tom lane

