
From: Mark Mielke
Subject: Re: Re: [COMMITTERS] Can not create more than 32766 databases in ufs file system.
Msg-id: 4AAC2AC5.4010502@mark.mielke.cc
In response to: Re: Re: [COMMITTERS] Can not create more than 32766 databases in ufs file system. (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
On 09/12/2009 04:17 PM, Stephen Frost wrote:
> * Mark Mielke (mark@mark.mielke.cc) wrote:
>    
>> There is no technical requirement for PostgreSQL to separate data in
>> databases or tables on subdirectory or file boundaries. Nothing wrong
>> with having one or more large files that contain everything.
>>      
> Uhh, except where you run into system limitations on file size (eg- a 2G
> max file size..).  You'll note PG creates files up to 1G and then splits
> them into separate files.  It's not done just because it's fun.
>    

This becomes a bit of a side thread - but note that I carefully didn't 
say "exactly one file". I said "one or more large files that contain 
everything". That is, if we have 3 databases each of size 50 Mbytes, 
there is no technical reason why this cannot be stored within a single 
150 Mbyte data file. Sure, if it goes beyond 2G, we can break it into a 
set of files, and treat each file as a 2G "block" in a virtual larger 
storage pool. VMware does this for storing virtual drives.
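
As a rough illustration of what such a pool could look like, here is a 
minimal sketch of the segment arithmetic, assuming a hypothetical fixed 
2G segment size. The function name and the layout are made up for the 
example; this is not PostgreSQL's actual storage code.

#include <stdio.h>
#include <stdint.h>

#define SEGMENT_SIZE (2ULL * 1024 * 1024 * 1024)   /* hypothetical 2G per physical file */

/* Map a byte offset within the virtual pool to (segment number, offset in that segment). */
static void pool_locate(uint64_t pool_offset, unsigned *segno, uint64_t *seg_offset)
{
    *segno = (unsigned) (pool_offset / SEGMENT_SIZE);
    *seg_offset = pool_offset % SEGMENT_SIZE;
}

int main(void)
{
    unsigned segno;
    uint64_t off;

    /* A write at roughly 5G into the pool lands 1G into the third segment file. */
    pool_locate(5ULL * 1024 * 1024 * 1024, &segno, &off);
    printf("segment %u, offset %llu\n", segno, (unsigned long long) off);
    return 0;
}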

If we assume that 32k *databases* is reasonable for a single instance, 
then the real problem is that 32k databases *require* 32k immediate 
sub-directories. This can be solved either by: 1) Adding additional 
depth to the directory hierarchy to work around this limit (what the OP 
and you are proposing), or 2) Storing multiple databases within the 
same files or sub-directories. If you really must have this amount of 
scalability, I am suggesting that you consider all of the resources 
required to access 32k sub-directories in the file system, specifically 
including file descriptors, inodes, the backing bitmaps or extent maps 
that allocate from file system free space, the rather inefficient 
directory layouts of many file systems (many still do LINEAR searches 
for filenames, making lookups linearly slower as the directory grows), 
and the kernel memory caches that track all of these little details. 
The POSIX guarantees required are certainly more heavyweight than the 
requirements that PostgreSQL has, and I am certain it is possible to 
create a targeted solution to this problem that is simpler and faster. 
For only a few databases and a few files, the effort isn't worth it. 
But if supporting 32k+ *databases*, or even 32k+ tables and indexes, is 
a major requirement and a major design target, then PostgreSQL should 
do this stuff itself.
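
To make option 1 concrete, here is a minimal sketch of a hashed 
fan-out, assuming an invented two-level "base/<bucket>/<oid>" layout 
(the bucket count and naming are illustrative, not PostgreSQL's actual 
on-disk scheme). The point is only that no single directory ever has to 
hold tens of thousands of entries.

#include <stdio.h>

/* Spread per-database directories across 256 buckets keyed off the database OID,
 * so each directory level stays small even with tens of thousands of databases. */
static void database_path(unsigned dboid, char *buf, size_t buflen)
{
    unsigned bucket = dboid % 256;                  /* at most 256 entries per level */
    snprintf(buf, buflen, "base/%02x/%u", bucket, dboid);
}

int main(void)
{
    char path[64];

    database_path(70001, path, sizeof(path));
    printf("%s\n", path);                           /* prints base/71/70001 */
    return 0;
}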

Modern file systems don't have the 2G problem. ext2/ext3 with 4 Kbyte 
blocks (the standard) support files up to 2 Tbytes. That also matches 
the practical limit on addressing a single physical disk, at least on 
the platforms I am familiar with. The requirement to stay under 2G for 
a single file is a bit outdated.


>> I guess I'm not seeing how using 32k tables is a sensible model.
>>      
> For one thing, there's partitioning.  For another, there's a large user
> base.  32K tables is, to be honest, not all that many, especially for
> some of these databases which reach into the multi-TB range..
>    

Talking philosophically - the need to use table-based partitioning to 
achieve acceptable performance or storage requirements is somewhat of a 
hacky workaround. It effectively moves the database query logic back 
into the application space, where the application must know which 
tables contain which data. The inherited tables and automatic 
constraint-based query planning help out, but it's still an elaborate 
hack. It exposes data that the application should not need to care 
about, and then makes it possible to hide some of it again. Table 
partitioning should be far more automatic. I don't want to break my 
theoretical table containing every call made on my network into 
per-hour tables, each with a constraint for the time range it covers. 
I want to create a table with a timestamp column, fill it with billions 
of records, provide a few hints, and have the database engine be smart 
enough to partition the table such that my queries "just work".

Back to reality - maybe things have not reached this level of maturity 
yet, and people with practical requirements today have found that they 
need to use very complex manual partitioning schemes that chew up 
thousands of tables.
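
For what it's worth, the pruning side of that hack is simple enough to 
sketch: given per-hour child tables, each annotated with the time range 
it covers, a query over one interval only needs to scan the children 
whose ranges overlap it. The table names and ranges below are made up 
for illustration; this is the idea behind constraint-based exclusion, 
not the planner's actual code.

#include <stdio.h>

struct partition { const char *name; long range_start; long range_end; };

/* Return 1 if the query interval [qstart, qend) overlaps the partition's range. */
static int overlaps(const struct partition *p, long qstart, long qend)
{
    return p->range_start < qend && qstart < p->range_end;
}

int main(void)
{
    struct partition calls[] = {
        { "calls_h00", 0,    3600  },
        { "calls_h01", 3600, 7200  },
        { "calls_h02", 7200, 10800 },
    };
    long qstart = 3000, qend = 4000;    /* query spanning the first two hours */

    for (int i = 0; i < 3; i++)
        if (overlaps(&calls[i], qstart, qend))
            printf("scan %s\n", calls[i].name);   /* calls_h00 and calls_h01 */
    return 0;
}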


>> So yes,
>> things can be done to reduce the cost - but it seems like something is
>> wrong if this is truly a requirement.
>>      
> I have no idea what you've been working with, but I hardly think it
> makes sense for PG to consider over 32k tables as not worth supporting.
>    

I don't advocate any limits. However, I also don't advocate designing 
PostgreSQL specifically for the case of 32k tables. If you want to use 
32k tables, then you had better have a file system that supports 32k+ 
files in a single directory, and a kernel that is able to work 
efficiently when postgres has thousands of file descriptors or more 
open and in use at the same time. The system *supports* 32k tables, 
but if you look at the design, you'll see that it is not optimal for 
32k tables. Even with the changes to reduce the use of flat files and 
such, it's still not a design that makes 32k tables optimal. If you 
want to create 32k tables - I don't want to stop you.
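
To illustrate the file descriptor cost I'm talking about, here is a 
minimal sketch of an open-on-demand descriptor cache with a hard cap, 
using an invented round-robin eviction policy (the names and the tiny 
cap are for illustration only, not PostgreSQL's actual fd management): 
once more files are in active use than the cap allows, the same files 
keep getting closed and re-opened.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_OPEN 3                      /* deliberately tiny cap for the illustration */

static struct { char name[64]; int fd; } cache[MAX_OPEN];
static int next_victim;

/* Return an open fd for name, evicting an older entry when the cap is hit. */
static int fd_for(const char *name)
{
    for (int i = 0; i < MAX_OPEN; i++)
        if (cache[i].fd > 0 && strcmp(cache[i].name, name) == 0)
            return cache[i].fd;         /* still cached */

    int slot = next_victim;
    next_victim = (next_victim + 1) % MAX_OPEN;
    if (cache[slot].fd > 0)
        close(cache[slot].fd);          /* evicted: a later access must re-open it */

    cache[slot].fd = open(name, O_RDONLY);
    snprintf(cache[slot].name, sizeof(cache[slot].name), "%s", name);
    return cache[slot].fd;
}

int main(void)
{
    /* Four distinct files cycled through three slots: the fourth access evicts
     * the first, so touching the first file again costs another open(). */
    const char *files[] = { "/etc/hosts", "/etc/passwd", "/etc/group",
                            "/etc/services", "/etc/hosts" };
    for (int i = 0; i < 5; i++)
        printf("%s -> fd %d\n", files[i], fd_for(files[i]));
    return 0;
}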

But I'm not going to agree that this is optimal, or that PostgreSQL 
should put excessive effort into making it optimal. :-)

Specifically with regard to my comments about creating a truly scalable 
system that would support 32k databases - I don't think it's worth it. I 
know it is possible, but I don't want to see effort put into it.

>> There are alternative models of
>> storage that would not require 32k tables, that likely perform better.
>>      
> Eh?  You would advocate combining tables for no reason other than you
> think it's bad to have alot?
>    

"likely perform better" is not "no reason".

>> Do you agree with me that having 32k open file descriptors (or worse,
>> open on demand file descriptors that need to be re-opened many times) is
>> a problem?
>>      
> Nope.
>    

If you don't think it represents a scalability problem, then I think 
it puts your ability to judge the situation in question. :-)


>> Looking at PostgreSQL today - I don't think it's designed to scale to
>> this. Looking at SQL today, I think I would find it difficult to justify
>> creating a solution that requires this capability.
>>      
> Actually, I find that PG handles it pretty well.  And we used to be an
> Oracle shop.
>    

Everything is relative. Throw enough RAM and CPU at the problem, and 
things can appear acceptable and scalable. Hit a limit, and ... oops.

You do remember we're in a thread talking about how the # of databases 
is limited on some systems, right? :-)

"Pretty well" means "good enough for you". It means nothing more or less.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>


