Thread: Large rows number, and large objects

Large rows number, and large objects

From
Jose Ildefonso Camargo Tolosa
Date:
Greetings,

I have been thinking a lot about pgsql performance when it is dealing
with tables that have lots of rows (several million, maybe thousands
of millions).  Take the Large Object use case:

one of my tables references large objects (it holds a pointer to each
object).  The large object table stores each object in 2000-byte
chunks (IIRC), so if we have something like 1TB of data stored in
large objects, the large objects table would have something like 550M
rows; if we get to 8TB, we will have 4400M rows (or so).
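
(As a rough sanity check on that arithmetic: note that the actual chunk size in pg_largeobject is LOBLKSIZE = BLCKSZ/4, i.e. 2048 bytes on a default build, which gives slightly smaller counts than the 2000-byte figure above.)

-- Estimated pg_largeobject row counts, computed with the 2000-byte figure
-- quoted above and with the actual 2048-byte LOBLKSIZE chunk.
SELECT (1024::numeric^4 / 2000)::bigint     AS rows_1tb_2000b,   -- ~550 million
       (1024::numeric^4 / 2048)::bigint     AS rows_1tb_2048b,   -- ~537 million
       (8 * 1024::numeric^4 / 2048)::bigint AS rows_8tb_2048b;   -- ~4.3 billion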

I have read in several places that huge tables should be partitioned
to improve performance.  Now, my first question: does the large
objects system automatically partition itself?  If not, will large
object performance degrade as we add more data?  (I would guess it
does.)

Now, here is what I can't fully understand: why does performance
actually degrade?  I mean, when we partition, we pick a certain
parameter to "divide" the data, and then use that same parameter to
route each request to the correct table... shouldn't the DB do
something similar with its indexes?  I have always thought about
indexes exactly that way: as an approximation search.  If I know I'm
looking for, say, a date that is less than 2010-03-02, the system
should just position itself on the index around that date and scan
from that point backward.  As far as my understanding goes,
partitioning only adds a kind of "auxiliary" index that makes the
system go to a certain table when, for example, the query targets one
particular year (assuming we partitioned by year).  What if the DB
actually implemented something like an index for the index, so that
the first search on a huge table scans a smaller index that points to
a position in the larger index, avoiding an initial scan of the large
index?

Well, I'm writing all of this half-asleep, so I'll re-read it
tomorrow.  In the meantime, just ignore anything that doesn't make a
lot of sense :) .

Thanks!

Ildefonso Camargo

Re: Large rows number, and large objects

From
Samuel Gendler
Date:


On Sat, Jun 18, 2011 at 9:06 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
Greetings,

I have been thinking a lot about pgsql performance when it is dealing
with tables that have lots of rows (several million, maybe thousands
of millions).  Take the Large Object use case:

one of my tables references large objects (it holds a pointer to each
object).  The large object table stores each object in 2000-byte
chunks (IIRC), so if we have something like 1TB of data stored in
large objects, the large objects table would have something like 550M
rows; if we get to 8TB, we will have 4400M rows (or so).

I have read in several places that huge tables should be partitioned
to improve performance.  Now, my first question: does the large
objects system automatically partition itself?  If not, will large
object performance degrade as we add more data?  (I would guess it
does.)

Now, here is what I can't fully understand: why does performance
actually degrade?  I mean, when we partition, we pick a certain
parameter to "divide" the data, and then use that same parameter to
route each request to the correct table... shouldn't the DB do
something similar with its indexes?  I have always thought about
indexes exactly that way: as an approximation search.  If I know I'm
looking for, say, a date that is less than 2010-03-02, the system
should just position itself on the index around that date and scan
from that point backward.  As far as my understanding goes,
partitioning only adds a kind of "auxiliary" index that makes the
system go to a certain table when, for example, the query targets one
particular year (assuming we partitioned by year).  What if the DB
actually implemented something like an index for the index, so that
the first search on a huge table scans a smaller index that points to
a position in the larger index, avoiding an initial scan of the large
index?

Well, I'm writing all of this half-asleep, so I'll re-read it
tomorrow.  In the meantime, just ignore anything that doesn't make a
lot of sense :) .

Partitioning helps in a number of ways.  First, if a query must scan an entire table and the table is very large, that scan will be expensive.  Partitioning can allow the query planner to do a sequential scan over just some of the data and skip the rest (or process it via some other manner, such as an index lookup).

Also, the larger an index is, the more expensive the index is to maintain.  Inserts and lookups will both take longer.  Partitioning will give you n indexes, each with m/n entries (assuming fairly even distribution of data among partitions), so any given index will be smaller, which means inserts into a partition will potentially be much faster.  Since large tables often also have very high insert rates, this can be a big win.

You can also gain better resource utilization by moving less frequently used partitions onto slower media (via a tablespace), freeing up space on your fastest I/O devices for the most important data.  A lot of partitioning tends to happen by time, and the most frequently run queries are often on the partitions containing the most recent data, so it often can be very beneficial to keep only the most recent partitions on fastest storage.

Then there is caching.  Indexes and tables are cached by page.  Without clustering a table on a particular index, the contents of a single page may be quite arbitrary.  Even with clustering, depending upon the usage patterns of the table in question, it is entirely possible that any given page may only have some fairly small percentage of highly relevant data if the table is very large.  By partitioning, you can (potentially) ensure that any given page in cache will have a higher density of highly relevant entries, so you'll get better efficiency out of the caching layers.  And with smaller indexes, it is less likely that loading an index into shared buffers will push some other useful chunk of data out of the cache.
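
For concreteness, here is a minimal sketch of the kind of range partitioning being described, using the inheritance-plus-CHECK-constraint approach current releases support; all table, column, and index names here are invented for illustration:

-- Parent table; the children carry CHECK constraints that the planner can
-- use (with constraint_exclusion) to skip partitions that cannot match.
CREATE TABLE measurements (
    id         bigserial,
    logged_at  timestamptz NOT NULL,
    payload    text
);

CREATE TABLE measurements_2010 (
    CHECK (logged_at >= '2010-01-01' AND logged_at < '2011-01-01')
) INHERITS (measurements);

CREATE TABLE measurements_2011 (
    CHECK (logged_at >= '2011-01-01' AND logged_at < '2012-01-01')
) INHERITS (measurements);

-- Each partition gets its own, much smaller index; inserts are normally
-- routed to the right child by a trigger on the parent or by the application.
CREATE INDEX measurements_2010_logged_at_idx ON measurements_2010 (logged_at);
CREATE INDEX measurements_2011_logged_at_idx ON measurements_2011 (logged_at);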

As for the large object tables, I'm not sure about the internals.  Assuming that each table gets its own table for large objects, partitioning the main table will have the effect of partitioning the large object table, too - keeping index maintenance more reasonable and ensuring that lookups are as fast as possible.  There's probably a debate to be had on the benefit of storing very large numbers of large objects in the db, too (as opposed to keeping references to them in the db and actually accessing them via some other mechanism; both product requirements and performance are significant factors in that discussion).

As for your suggestion that the db maintain an index on an index, how would the database do so in an intelligent manner? It would have to maintain such indexes on every index and guess as to which values to use as boundaries for each bucket. Partitioning solves the same problem, but allows you to direct the database such that it only does extra work where the dba, who is much more knowledgeable about the structure of the data and how it will be used than the database itself, tells it to. And the dba gets to tell the db what buckets to use when partitioning the database - via the check constraints on the partitions.  Without that, the db would have to guess as to appropriate bucket sizes and the distribution of values within them.
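
Continuing the sketch above, the "buckets" the dba chooses are exactly those CHECK constraints, and the planner uses them to prune partitions at plan time; something along these lines shows the effect:

-- constraint_exclusion = partition has been the default since 8.4; it lets
-- the planner drop children whose CHECK constraint contradicts the WHERE.
SET constraint_exclusion = partition;

-- Only measurements_2011 (plus the empty parent) should show up in the plan;
-- measurements_2010 is excluded without ever being touched.
EXPLAIN
SELECT count(*)
FROM measurements
WHERE logged_at >= '2011-06-01' AND logged_at < '2011-07-01';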

I'm sure there are reasons beyond even those I've listed here.  I'm not one of the postgresql devs, so my understanding of how it benefits from partitioning is shallow, at best.  If the usage pattern of your very large table is such that every query tends to use all of the table, then I'm not sure partitioning really offers much gain.  The benefits of partitioning are, at least in part, predicated on only a subset of the data being useful to any one query, and the benefits get that much stronger if some portion of the data is rarely used by any query.

Re: Large rows number, and large objects

From
Craig James
Date:
On 6/19/11 4:37 AM, Samuel Gendler wrote:
On Sat, Jun 18, 2011 at 9:06 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
Greetings,

I have been thinking a lot about pgsql performance when it is dealing
with tables that have lots of rows (several million, maybe thousands
of millions).  Take the Large Object use case:

one of my tables references large objects (it holds a pointer to each
object).  The large object table stores each object in 2000-byte
chunks (IIRC), so if we have something like 1TB of data stored in
large objects, the large objects table would have something like 550M
rows; if we get to 8TB, we will have 4400M rows (or so).

I have read in several places that huge tables should be partitioned
to improve performance.  Now, my first question: does the large
objects system automatically partition itself?  If not, will large
object performance degrade as we add more data?  (I would guess it
does.)
You should consider "partitioning" your data in a different way: Separate the relational/searchable data from the bulk data that is merely being stored.

Relational databases are just that: relational.  The thing they do well is to store relationships between various objects, and they are very good at finding objects using relational queries and logical operators.

But when it comes to storing bulk data, a relational database is no better than a file system.

In our system, each "object" is represented by a big text object of a few kilobytes.  Searching that text is essentially useless -- the only reason it's there is for visualization and to pass on to other applications.  So it's separated out into its own table, which only has the text record and a primary key.

We then use other tables to hold extracted fields and computed data about the primary object, and the relationships between the objects.  That means we've effectively "partitioned" our data into searchable relational data and non-searchable bulk data.  The result is that we have around 50 GB of bulk data that's never searched, and about 1GB of relational, searchable data in a half-dozen other tables.
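
A minimal sketch of that kind of split, with invented names, just to make the idea concrete:

-- Bulk, never-searched data: one row per object, nothing but a key and
-- the blob-ish text (TOAST keeps the big value out of the main heap pages).
CREATE TABLE object_text (
    object_id  bigint PRIMARY KEY,
    body       text NOT NULL
);

-- Searchable, relational data extracted/computed from each object.
CREATE TABLE object_attrs (
    object_id   bigint PRIMARY KEY REFERENCES object_text (object_id),
    name        text,
    created_on  date,
    weight      numeric
);
CREATE INDEX object_attrs_created_on_idx ON object_attrs (created_on);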

With this approach, there's no need for table partitioning, and full table scans are quite reasonable.

Craig

Re: Large rows number, and large objects

From
Jose Ildefonso Camargo Tolosa
Date:
Hi!

Thanks (you both, Samuel and Craig) for your answers!

On Sun, Jun 19, 2011 at 11:19 AM, Craig James
<craig_james@emolecules.com> wrote:
> On 6/19/11 4:37 AM, Samuel Gendler wrote:
>
> On Sat, Jun 18, 2011 at 9:06 PM, Jose Ildefonso Camargo Tolosa
> <ildefonso.camargo@gmail.com> wrote:
>>
>> Greetings,
>>
>> I have been thinking a lot about pgsql performance when it is dealing
>> with tables with lots of rows on one table (several millions, maybe
>> thousands of millions).  Say, the Large Object use case:
>>
>> one table has large objects (have a pointer to one object).
>> The large object table stores the large object in 2000 bytes chunks
>> (iirc), so, if we have something like 1TB of data stored in large
>> objects, the large objects table would have something like 550M rows,
>> if we get to 8TB, we will have 4400M rows (or so).
>>
>> I have read at several places that huge tables should be partitioned,
>> to improve performance.... now, my first question comes: does the
>> large objects system automatically partitions itself? if no: will
>> Large Objects system performance degrade as we add more data? (I guess
>> it would).
>
> You should consider "partitioning" your data in a different way: Separate
> the relational/searchable data from the bulk data that is merely being
> stored.
>
> Relational databases are just that: relational.  The thing they do well is
> to store relationships between various objects, and they are very good at
> finding objects using relational queries and logical operators.
>
> But when it comes to storing bulk data, a relational database is no better
> than a file system.
>
> In our system, each "object" is represented by a big text object of a few
> kilobytes.  Searching that text file is essential useless -- the only reason
> it's there is for visualization and to pass on to other applications.  So
> it's separated out into its own table, which only has the text record and a
> primary key.

Well, my original schema does exactly that (I mimic the LO schema):

files (searchable): id, name, size, hash, mime_type, number_chunks
files_chunks : id, file_id, hash, chunk_number, data (bytea)

So, my bulk data is in the files_chunks table, but because I restrict
each chunk (by choice) to 2000 bytes, the total number of rows in
files_chunks can get *huge*.

So, the system would search the files table and then the files_chunks
table (to get each of the chunks and, maybe, send them out to the web
client).

So, with a prospect of ~4500M rows for that table, I really thought it
could be a good idea to partition files_chunks.  Because I'm thinking
of relatively small files (<100MB), table partitioning should do great
here: I could arrange for all of the chunks of a given file to be
contained in the same partition.  Even if the system were later to get
larger files (>5GB), this approach should still work.
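
For what it's worth, a sketch of how that could look with inheritance-style partitioning, range-partitioned on file_id so that all chunks of a given file stay together (the partition boundaries are made up):

-- Parent holds no rows; each child covers a range of file ids, so every
-- chunk of a given file lands in exactly one partition.
CREATE TABLE files_chunks (
    id            bigserial,
    file_id       bigint  NOT NULL,
    hash          bytea,
    chunk_number  integer NOT NULL,
    data          bytea   NOT NULL
);

CREATE TABLE files_chunks_p0 (
    CHECK (file_id >= 0 AND file_id < 1000000)
) INHERITS (files_chunks);

CREATE TABLE files_chunks_p1 (
    CHECK (file_id >= 1000000 AND file_id < 2000000)
) INHERITS (files_chunks);

-- The per-partition index the chunk lookups would use.
CREATE INDEX files_chunks_p0_idx ON files_chunks_p0 (file_id, chunk_number);
CREATE INDEX files_chunks_p1_idx ON files_chunks_p1 (file_id, chunk_number);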

The original question was about Large Objects and partitioning...
see, according to the documentation:
http://www.postgresql.org/docs/9.0/static/lo-intro.html

"All large objects are placed in a single system table called pg_largeobject."

So, the question is: if I were to store 8TB worth of data in the
large objects system, would that actually make the pg_largeobject
table slow, unless it were automatically partitioned?

Thanks for taking the time to discuss this matter with me!

Sincerely,

Ildefonso Camargo

Re: Large rows number, and large objects

From
Robert Haas
Date:
On Sun, Jun 19, 2011 at 10:19 PM, Jose Ildefonso Camargo Tolosa
<ildefonso.camargo@gmail.com> wrote:
> So, the question is, if I were to store 8TB worth of data into large
> objects system, it would actually make the pg_largeobject table slow,
> unless it was automatically partitioned.

I think it's a bit of an oversimplification to say that large,
unpartitioned tables are automatically going to be slow.  Suppose you
had 100 tables that were each 80GB instead of one table that is 8TB.
The index lookups would be a bit faster on the smaller tables, but it
would take you some non-zero amount of time to figure out which index
to read in the first place.  It's not clear that you are really
gaining all that much.

Many of the advantages of partitioning have to do with maintenance
tasks.  For example, if you gather data on a daily basis, it's faster
to drop the partition that contains Thursday's data than it is to do a
DELETE that finds the rows and deletes them one at a time.  And VACUUM
can be a problem on very large tables as well, because only one VACUUM
can run on a table at any given time.  If the table needs vacuuming
more often than a single VACUUM pass takes to complete, then you've
got a problem.
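
Roughly, the difference looks like this (partition and table names invented; the sketch assumes daily partitions):

-- Dropping a day's partition is a quick catalog operation that also
-- returns the disk space immediately and creates no dead tuples.
DROP TABLE measurements_2011_06_16;

-- The unpartitioned alternative: a row-by-row DELETE that leaves dead
-- tuples behind for VACUUM to clean up across the whole big table.
DELETE FROM measurements
WHERE logged_at >= '2011-06-16' AND logged_at < '2011-06-17';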

But I think that if we want to optimize pg_largeobject, we'd probably
gain a lot more by switching to a different storage format than we
could ever gain by partitioning the table.  For example, we might
decide that any object larger than 16MB should be stored in its own
file.  Even somewhat smaller objects would likely benefit from being
stored in larger chunks - say, a bunch of 64kB chunks, with any
overage stored in the 2kB chunks we use now.  While this might be an
interesting project, it's probably not going to be anyone's top
priority, because it would be a lot of work for the amount of benefit
you'd get.  There's an easy workaround: store the files in the
filesystem, and a path to those files in the database.
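
A sketch of that workaround, with invented names; the obvious trade-off, which comes up below, is that the file contents are then outside WAL, backups, and replication:

-- Metadata stays relational and searchable; the bytes live on disk.
CREATE TABLE stored_files (
    id         bigserial PRIMARY KEY,
    name       text   NOT NULL,
    size       bigint,
    sha256     text,
    mime_type  text,
    fs_path    text   NOT NULL UNIQUE   -- e.g. '/srv/blobs/ab/abcd1234.bin'
);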

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Large rows number, and large objects

From
Jose Ildefonso Camargo Tolosa
Date:


On Tue, Jul 19, 2011 at 3:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Jun 19, 2011 at 10:19 PM, Jose Ildefonso Camargo Tolosa
> So, the question is, if I were to store 8TB worth of data into large
> objects system, it would actually make the pg_largeobject table slow,
> unless it was automatically partitioned.

I think it's a bit of an oversimplification to say that large,
unpartitioned tables are automatically going to be slow.  Suppose you
had 100 tables that were each 80GB instead of one table that is 8TB.
The index lookups would be a bit faster on the smaller tables, but it
would take you some non-zero amount of time to figure out which index
to read in the first place.  It's not clear that you are really
gaining all that much.

Certainly... but it is still very unclear to me *when* it is better to partition and when it is not.
 

Many of the advantages of partitioning have to do with maintenance
tasks.  For example, if you gather data on a daily basis, it's faster
to drop the partition that contains Thursday's data than it is to do a
DELETE that finds the rows and deletes them one at a time.  And VACUUM
can be a problem on very large tables as well, because only one VACUUM
can run on a table at any given time.  If the table needs vacuuming
more often than a single VACUUM pass takes to complete, then you've
got a problem.

And... doesn't the pg_largeobject table get vacuumed? I mean, isn't it just like any other table?
 

But I think that if we want to optimize pg_largeobject, we'd probably
gain a lot more by switching to a different storage format than we
could ever gain by partitioning the table.  For example, we might
decide that any object larger than 16MB should be stored in its own
file.  Even somewhat smaller objects would likely benefit from being
stored in larger chunks - say, a bunch of 64kB chunks, with any
overage stored in the 2kB chunks we use now.  While this might be an
interesting project, it's probably not going to be anyone's top
priority, because it would be a lot of work for the amount of benefit
you'd get.  There's an easy workaround: store the files in the
filesystem, and a path to those files in the database.

Ok, one reason for storing a file *in* the DB is to be able to do PITR of wrongly deleted (or overwritten) files, and that kind of thing; at the filesystem level you would need a versioning filesystem (and I don't yet know of any that is stable in the Linux world).

Also, you can use streaming replication, and as your data is streamed, your files are streamed to a secondary server as well (yes, at the FS level you could use DRBD or similar).

Ildefonso.

Re: Large rows number, and large objects

From
Robert Haas
Date:
On Wed, Jul 20, 2011 at 11:57 AM, Jose Ildefonso Camargo Tolosa
<ildefonso.camargo@gmail.com> wrote:
> On Tue, Jul 19, 2011 at 3:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Jun 19, 2011 at 10:19 PM, Jose Ildefonso Camargo Tolosa
>> <ildefonso.camargo@gmail.com> wrote:
>> > So, the question is, if I were to store 8TB worth of data into large
>> > objects system, it would actually make the pg_largeobject table slow,
>> > unless it was automatically partitioned.
>>
>> I think it's a bit of an oversimplification to say that large,
>> unpartitioned tables are automatically going to be slow.  Suppose you
>> had 100 tables that were each 80GB instead of one table that is 8TB.
>> The index lookups would be a bit faster on the smaller tables, but it
>> would take you some non-zero amount of time to figure out which index
>> to read in the first place.  It's not clear that you are really
>> gaining all that much.
>
> Certainly.... but it is still very blurry to me on *when* it is better to
> partition than not.

I think that figuring that out is as much an art as it is a science.
It's better to partition when most of your queries are going to touch
only a single partition; when you are likely to want to remove
partitions in their entirety; when VACUUM starts to have trouble
keeping up... but the reality is that in some cases you probably have
to try it both ways and see which one works better.

>> Many of the advantages of partitioning have to do with maintenance
>> tasks.  For example, if you gather data on a daily basis, it's faster
>> to drop the partition that contains Thursday's data than it is to do a
>> DELETE that finds the rows and deletes them one at a time.  And VACUUM
>> can be a problem on very large tables as well, because only one VACUUM
>> can run on a table at any given time.  If the frequency with which the
>> table needs to be vacuumed is less than the time it takes for VACUUM
>> to complete, then you've got a problem.
>
> And.... pg_largeobject table doesn't get vacuumed? I mean, isn't that table
> just as any other table?

Yes, it is.  So, I agree: putting 8TB of data in there is probably
going to hurt.

>> But I think that if we want to optimize pg_largeobject, we'd probably
>> gain a lot more by switching to a different storage format than we
>> could ever gain by partitioning the table.  For example, we might
>> decide that any object larger than 16MB should be stored in its own
>> file.  Even somewhat smaller objects would likely benefit from being
>> stored in larger chunks - say, a bunch of 64kB chunks, with any
>> overage stored in the 2kB chunks we use now.  While this might be an
>> interesting project, it's probably not going to be anyone's top
>> priority, because it would be a lot of work for the amount of benefit
>> you'd get.  There's an easy workaround: store the files in the
>> filesystem, and a path to those files in the database.
>
> Ok, one reason for storing a file *in* the DB is to be able to do PITR of a
> wrongly deleted files (or overwritten, and that kind of stuff), on the
> filesystem level you would need a versioning filesystem (and I don't, yet,
> know any that is stable in the Linux world).
>
> Also, you can use streaming replication and at the same time you stream your
> data, your files are also streamed to a secondary server (yes, on the
> FS-level you could use drbd or similar).

Well, those are good arguments for putting the functionality in the
database and making it all play nicely with write-ahead logging.  But
nobody's felt motivated to write the code yet, so...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Large rows number, and large objects

From
Andrzej Nakonieczny
Date:
On 20.07.2011 17:57, Jose Ildefonso Camargo Tolosa wrote:

[...]

>     Many of the advantages of partitioning have to do with maintenance
>     tasks.  For example, if you gather data on a daily basis, it's faster
>     to drop the partition that contains Thursday's data than it is to do a
>     DELETE that finds the rows and deletes them one at a time.  And VACUUM
>     can be a problem on very large tables as well, because only one VACUUM
>     can run on a table at any given time.  If the frequency with which the
>     table needs to be vacuumed is less than the time it takes for VACUUM
>     to complete, then you've got a problem.
>
>
> And.... pg_largeobject table doesn't get vacuumed? I mean, isn't that
> table just as any other table?

Vacuum is a real problem on a big pg_largeobject table. I have a 1.6 TB
database, mostly large objects, and vacuuming that table on a fast SAN
takes about 4 hours:

         now         |        start        |   time   | datname |                current_query
---------------------+---------------------+----------+---------+-----------------------------------------------
 2011-07-20 20:12:03 | 2011-07-20 16:21:20 | 03:50:43 | bigdb   | autovacuum: VACUUM pg_catalog.pg_largeobject
(1 row)


LO generates a lot of dead tuples when objects are being added:

      relname      | n_dead_tup
------------------+------------
  pg_largeobject   |     246731

Adding LOs is very fast when the table has been vacuumed. But when
there are a lot of dead tuples, adding an LO is very slow (50-100
times slower) and eats 100% of the CPU.

It looks like the better way is to write objects directly as bytea in
partitioned tables, although it's a bit slower than the LO interface
on a vacuumed table.
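
For reference, numbers of this kind can be pulled with queries along these lines (pg_largeobject is a system catalog, so it shows up in pg_stat_all_tables rather than pg_stat_user_tables):

-- Dead tuples accumulated in pg_largeobject.
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
WHERE relname = 'pg_largeobject';

-- Currently running (auto)vacuums and how long they have been at it,
-- producing output much like the table above.
SELECT now(), query_start AS start, now() - query_start AS time,
       datname, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'autovacuum: VACUUM%';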


Regards,
Andrzej

Re: Large rows number, and large objects

From
Jose Ildefonso Camargo Tolosa
Date:
On Wed, Jul 20, 2011 at 3:03 PM, Andrzej Nakonieczny
<dzemik-pgsql-performance@e-list.pingwin.eu.org> wrote:
> On 20.07.2011 17:57, Jose Ildefonso Camargo Tolosa wrote:
>
> [...]
>
>>    Many of the advantages of partitioning have to do with maintenance
>>    tasks.  For example, if you gather data on a daily basis, it's faster
>>    to drop the partition that contains Thursday's data than it is to do a
>>    DELETE that finds the rows and deletes them one at a time.  And VACUUM
>>    can be a problem on very large tables as well, because only one VACUUM
>>    can run on a table at any given time.  If the frequency with which the
>>    table needs to be vacuumed is less than the time it takes for VACUUM
>>    to complete, then you've got a problem.
>>
>>
>> And.... pg_largeobject table doesn't get vacuumed? I mean, isn't that
>> table just as any other table?
>
> Vacuum is a real problem on big pg_largeobject table. I have 1.6 TB database
> mostly with large objects and vacuuming that table on fast SAN takes about 4
> hours:
>
>        now          |        start        |   time   |  datname   |
>  current_query
> ---------------------+---------------------+----------+------------+----------------------------------------------
>  2011-07-20 20:12:03 | 2011-07-20 16:21:20 | 03:50:43 | bigdb      |
> autovacuum: VACUUM pg_catalog.pg_largeobject
> (1 row)
>
>
> LO generates a lot of dead tuples when object are adding:
>
>     relname      | n_dead_tup
> ------------------+------------
>  pg_largeobject   |     246731
>
> Adding LO is very fast when table is vacuumed. But when there is a lot of
> dead tuples adding LO is very slow (50-100 times slower) and eats 100% of
> CPU.
>
> It looks that better way is writing object directly as a bytea on paritioned
> tables althought it's a bit slower than LO interface on a vacuumed table.

Well... yes, I thought about that, but then, what happens when you
need to fetch the file from the DB? Will it be fetched completely at
once?  I'm thinking about large files here; say (hypothetically
speaking) you have 1GB files stored.  If the system fetches the whole
1GB at once, it would take 1GB of RAM (or not?), and that's what I
wanted to avoid by dividing the file into 2kB chunks (bytea chunks,
actually).  I don't quite remember where I got the 2kB size from, but
I decided I wanted to avoid using TOAST too.
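
As a sketch of what I have in mind (assuming the files_chunks layout above; the file id is made up), the chunks can be streamed through a cursor so neither the server's result set nor the client ever has to hold the whole file at once:

-- Stream one file's 2 kB chunks in order; FETCH in batches so only a
-- couple of MB are in flight at any time, regardless of the file size.
BEGIN;

DECLARE file_cur NO SCROLL CURSOR FOR
    SELECT data
    FROM files_chunks
    WHERE file_id = 42
    ORDER BY chunk_number;

FETCH 1000 FROM file_cur;   -- ~2 MB per batch; repeat until no rows come back
-- ... write each batch out to the client/socket, then FETCH again ...

CLOSE file_cur;
COMMIT;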

>
>
> Regards,
> Andrzej
>