Thread: blobs

blobs

From
Steve Holdoway
Date:
I've got the enviable task of redesigning a MySQL-based project from scratch. We need a proper rdbms for this project,
and I want to use PG 8.2.

The table I'm concerned with at the moment has (currently) 5 million rows, with a churn of about 300,000 rows a week.
The table has about a million hits a day, which makes it the main potential bottleneck in this database.

We need to store some large ( 0 -> 100kB ) data with each row. Would you recommend adding it as columns in this table,
given that blobs will be stored in the pg_largeobject table anyway, or would you recommend a daughter table for this?

Any other suggestions on how to avoid performance problems with this table ( hardware is dual Xeon, 4GB mem, 2 hardware
raid channels for storage + 1 for logs, all running Debian 32 bit )?

Cheers,

Steve

Re: blobs

From
"Phillip Smith"
Date:
I don't know about your table question, but may I ask why you're running
32-bit when you have dual Xeon processors?

I have dual Xeons in my DWH, and I used to run 32-bit, which I upgraded to
64-bit over Christmas. We run a nightly import to that database which used
to take around 5 hours and now completes in less than 1 hour.

Many of our large queries also run much faster - the only thing I did was
reload the box with RedHat ES 4 64-bit instead of 32-bit.

My 2.2 cents (Aust. GST inclusive!)

Cheers,
-p


Re: blobs

From
Steve Holdoway
Date:
Availability of hardware monitoring software, and my personally being sick of things falling over. I have to run MySQL
4.0 on this server at the moment, and wasn't prepared to take the risk (:

Maybe by the time we implement, 64 bit will be reliable enough.

Steve

On Thu, 1 Feb 2007 09:25:08 +1100
"Phillip Smith" <phillip.smith@weatherbeeta.com.au> wrote:

> I don't know about your table question, but may I ask why you're running
> 32-bit when you have dual Xeon processors?
>
> I have dual Xeon's in my DWH, and I used to run 32-bit which I upgraded to
> 64-bit over Christmas. We run a nightly import to that database which used
> to take around 5 hours which now completes in less than 1 hour.
>
> Many of our large queries also run much faster - the only thing I did was
> reload the box with RedHat ES 4 64-bit instead of 32-bit.
>
> My 2.2 cents (Aust. GST inclusive!)
>
> Cheers,
> -p

Re: blobs

From
"Chad Wagner"
Date:
On 1/31/07, Steve Holdoway <steve.holdoway@firetrust.com> wrote:
> The table I'm concerned with at the moment has (currently) 5 million rows, with a churn of about 300,000 rows a week. The table has about a million hits a day, which makes it the main potential bottleneck in this database.

Why would it be a "bottleneck"?  If you are updating or deleting 300K a week, definitely make sure you take a look at autovacuum and turn it ON.
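
Getting it running is mostly a postgresql.conf change; a minimal sketch for 8.2 (note that in this release autovacuum also needs the row-level stats collector turned on; the commented thresholds below are only illustrative - tune them to your churn):

stats_start_collector = on    # required for autovacuum in 8.2
stats_row_level = on          # required for autovacuum in 8.2
autovacuum = on
#autovacuum_vacuum_scale_factor = 0.2    # fraction of a table that must change before it is vacuumed
#autovacuum_analyze_scale_factor = 0.1   # fraction of change before it is re-analyzed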


> We need to store some large ( 0 -> 100kB ) data with each row. Would you recommend adding it as columns in this table, given that blobs will be stored in the pg_largeobject table anyway, or would you recommend a daughter table for this?

Depends on how you are querying the table.  This is really a database modeling question, and leads into many many more questions.  I would say if you're frequently range scanning the table (selecting several rows) and in those cases you rarely need the "blob", then I would fork it off into a child table.  If the "blob" is rarely accessed, and only accessed directly, then definitely a child table in my book.

The reason is, if you're frequently fetching rows from this table and rarely using the blob, all you are doing is consuming memory that could be better used for other things, and spinning I/O when it is not necessary.
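
For illustration only, a minimal sketch of that kind of split, with hypothetical names (your real table obviously has more columns):

create table message (
    id       bigint primary key,
    created  timestamptz not null default now()
    -- ...the columns the million-hits-a-day queries actually need
);

create table message_body (
    id    bigint primary key references message (id) on delete cascade,
    body  bytea not null    -- the 0-100kB payload, fetched only when you need it
);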
 

> Any other suggestions on how to avoid performance problems with this table ( hardware is dual Xeon, 4GB mem, 2 hardware raid channels for storage + 1 for logs, all running Debian 32 bit )?

It really has to do with how you design your schema.


--
Chad
http://www.postgresqlforums.com/

Re: blobs

From
Steve Holdoway
Date:
On Wed, 31 Jan 2007 18:45:15 -0500
"Chad Wagner" <chad.wagner@gmail.com> wrote:

> On 1/31/07, Steve Holdoway <steve.holdoway@firetrust.com> wrote:
> >
> > The table I'm concerned with at the moment have (currently) 5 million
> > rows, with a churn of about 300,000 rows a week. The table has about a
> > million hits a day, which makes it the main potential bottleneck in this
> > database.
>
>
> Why would it be a "bottleneck"?  If you are updating or deleting 300K a
> week, definitely make sure you take a look at autovacuum and turn it ON.
It is a potential bottleneck, as I said. I am replacing between 5 and 10 percent of the data in the table every week.
There are, in addition, a daily total of about a million hits ( read/update/delete ), as I said.

This is the busiest table in the database. Therefore, it's the most likely candidate to cause performance problems.
That's what I call a bottleneck.
>
>
> > We need to store some large ( 0 -> 100kB ) data with each row. Would you
> > recommend adding it as columns in this table, given that blobs will be
> > stored in the pg_largeobject table anyway, or would you recommend a daughter
> > table for this?
>
>
> Depends on how you are querying the table.  This is really a database
> modeling question, and leads into many many more questions.  I would say if
> you're frequently range scanning the table (selecting several rows) and in
> those cases you rarely need the "blob", then I would fork it off into a
> child table.  If the "blob" is rarely accessed, and only accessed directly,
> then definitely a child table in my book.
>
> The reason is if you're frequently fetching rows from this table and rarely
> using the blob all you are doing is consuming memory that could be better
> used for other things, and spinning I/O when it is not necessary.
Sorry, you're completely wrong. If you consider that the only way of getting info is select *, then this is true.
Personally, I think that anyone who does this in code is plain lazy and should find a job more suited to them (:

I am asking for input from those who have been in this situation before, and have experience of the tradeoff of running
a separate table for the big stuff as against the extra ( 8 byte? ) column that would be added to the master table. Why
am I asking this? Because Postgres has a unique way of handling this kind of data, different even from the last time I
used Postgres in anger - 7.4. It's different from every other rdbms ( and MySQL ), and I have no practical experience of
it in the wild.
>
>
> > Any other suggestions on how to avoid performance problems with this table (
> > hardware is dual Xeon, 4GB mem, 2 hardware raid channels for storage + 1 for
> > logs, all running debian 32 bit ).
> >
>
> It really has to do with how you design your schema.
No. It really has to do with the gathering of relevant information so that I can design my schema properly.
Academically, 3NF may be the solution to every design, but in the real world there are many other things to take into
account. I've monitored the data flows, sized the hardware to handle the IO, and can either spend a week or two
benchmarking different solutions from cold, or I can take the advice of those with relevant experience ( who I expect to
find on this list ) to point me in the right direction first.
>
>
> --
> Chad
> http://www.postgresqlforums.com/
>
Steve.

PS. Please fix your headers so replies to your mails appear on the list directly.

Re: blobs

From
"Chad Wagner"
Date:
On 2/1/07, Steve Holdoway <steve.holdoway@firetrust.com> wrote:
> > Why would it be a "bottleneck"?  If you are updating or deleting 300K a
> > week, definitely make sure you take a look at autovacuum and turn it ON.
> It is a potential bottleneck as I said. I am replacing between 5 and 10 percent of the data in the table every week. There are, in addition, a daily total of about a million hits ( read/update/delete ), as I said.

> This is the busiest table in the database. Therefore, it's the most likely candidate to cause performance problems. That's what I call a bottleneck.

The reason I am asking why you think it is a bottleneck is that many MySQL users see 'reading' a table as a bottleneck, and this is usually because they are using MyISAM tables.  Under MyISAM the lock is on the whole table: writers block readers and readers block writers.  This is NOT the case with PostgreSQL: readers never block other readers, writers never block readers, and readers never block writers.  PostgreSQL does this using multi-versioning (MVCC), similar to how Oracle works.


> > Depends on how you are querying the table.  This is really a database
> > modeling question, and leads into many many more questions.  I would say if
> > you're frequently range scanning the table (selecting several rows) and in
> > those cases you rarely need the "blob", then I would fork it off into a
> > child table.  If the "blob" is rarely accessed, and only accessed directly,
> > then definitely a child table in my book.
> >
> > The reason is, if you're frequently fetching rows from this table and rarely
> > using the blob, all you are doing is consuming memory that could be better
> > used for other things, and spinning I/O when it is not necessary.
> Sorry, you're completely wrong. If you consider that the only way of getting info is select *, then this is true. Personally, I think that anyone who does this in code is plain lazy and should find a job more suited to them (:

Okay, I would love to see your test case.  My statement was based on the assumption that you were embedding your data in the same row as a bytea column.  Here is a test case that I ran:

NOTE: Fill factor is set to 10% to exaggerate a wide row; this also shows that EMPTY space in a table affects performance.

drop table foo;
create table foo (x integer not null primary key, y text not null) with (fillfactor = 10);
insert into foo (x, y) select generate_series(1,100000), repeat('1234567890', 10240);
checkpoint;
analyze foo;

select relpages,reltuples,pg_size_pretty(pg_relation_size(relname)) from pg_class where relname = 'foo';
 relpages | reltuples | pg_size_pretty
----------+-----------+----------------
   100000 |    100000 | 781 MB
(1 row)


-- fetch just the integer column
explain analyze select x from foo;
                                                   QUERY PLAN                                                   
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=4) (actual time=10.389..19288.848 rows=100000 loops=1)
 Total runtime: 19700.804 ms
(2 rows)

-- fetch just the text column
explain analyze select y from foo;
                                                   QUERY PLAN                                                   
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=32) (actual time=9.234..19863.485 rows=100000 loops=1)
 Total runtime: 20290.618 ms
(2 rows)

-- fetch both the integer and text column
explain analyze select * from foo;
                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=36) (actual time=40.478..20470.648 rows=100000 loops=1)
 Total runtime: 20904.611 ms
(2 rows)


-- Runtime is the same for all three cases against the "wide" table.  This disproves the theory that selecting or not selecting a "column" has much to do with performance here.  Logically it matters when you are sorting and merging data sets, not when we are talking about physical reads.
-- Now let's separate the "large" text column from the rest of the data.

drop table bar;
create table bar (x integer not null primary key);
insert into bar (x) select generate_series(1,100000);
checkpoint;
analyze bar;

select relpages,reltuples,pg_size_pretty(pg_relation_size(relname)) from pg_class where relname = 'bar';
 relpages | reltuples | pg_size_pretty
----------+-----------+----------------
      441 |    100000 | 3528 kB
(1 row)


-- fetch from the "skinny" table
explain analyze select * from bar;
                                                 QUERY PLAN                                                 
-------------------------------------------------------------------------------------------------------------
 Seq Scan on bar  (cost=0.00..1441.00 rows=100000 width=4) (actual time=19.552..416.575 rows=100000 loops=1)
 Total runtime: 790.901 ms
(2 rows)

-- Notice how much faster it is; this suggests that a table with only narrow columns is much faster to scan.

I think this case proves the point that a "wide" blob, if not accessed frequently, shouldn't be included in the core table.  Naturally this makes no sense if you are going to fetch the blob and use it with every single fetch against the table.
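
In other words, with that kind of split the hot queries never drag the wide pages through the buffer cache at all; roughly (hypothetical names again, matching the earlier sketch):

-- the frequent case: metadata only, no blob I/O
select id, created from message where created > now() - interval '1 day';

-- the rare case: pull one payload on demand
select body from message_body where id = 12345;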

> I am asking for input from those who have been in this situation before, and have experience of the tradeoff of running a separate table for the big stuff as against the extra ( 8 byte? ) column that would be added to the master table. Why am I asking this? Because Postgres has a unique way of handling this kind of data, different even from the last time I used Postgres in anger - 7.4. It's different from every other rdbms ( and MySQL ), and I have no practical experience of it in the wild.

I think you are referring to large objects, which are completely different from your original statement.  Your original statement seemed to ask about including binary data in the table versus not in the table.  Large objects are stored separately, chunked into the pg_largeobject catalog, and you store the oid of the object in your own table.  Basically you have bytea (byte arrays), which are stored inline with the rest of the row, and large objects, where you store a reference to a separate object.
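
A rough sketch of the two, with hypothetical table names (lo_creat() is the server-side large object function; the client then streams data in with lo_open()/lowrite() or lo_import()):

-- bytea: the payload is stored inline with the row (TOASTed once it gets big)
create table doc_inline (
    id    integer primary key,
    body  bytea not null
);
insert into doc_inline values (1, decode('0123456789abcdef', 'hex'));

-- large object: the row only carries an oid pointing into pg_largeobject
create table doc_lo (
    id        integer primary key,
    body_oid  oid not null
);
insert into doc_lo values (1, lo_creat(-1));   -- creates an empty large object, returns its oid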


> > It really has to do with how you design your schema.
> No. It really has to do with the gathering of relevant information so that I can design my schema properly. Academically, 3NF may be the solution to every design, but in the real world there are many other things to take into account. I've monitored the data flows, sized the hardware to handle the IO, and can either spend a week or two benchmarking different solutions from cold, or I can take the advice of those with relevant experience ( who I expect to find on this list ) to point me in the right direction first.

That is my point here: the schema should be designed properly.  A poor schema design can yield awful I/O performance, which is essentially the point of my previous message.

Anyway, if you are typically this sarcastic then I would suggest you learn either to phrase your responses better or not to respond at all.  Frankly, no one has to help you, and statements like those above are rude to everyone on this list.

I have plenty of experience building large distributed OLTP systems and multi terabyte data warehouses, so you can either take the advice or leave it.

Re: blobs

From
Steve Holdoway
Date:
On Thu, 1 Feb 2007 23:04:57 -0500
"Chad Wagner" <chad.wagner@gmail.com> wrote:

> I have plenty of experience building large distributed OLTP systems and
> multi terabyte data warehouses, so you can either take the advice or leave
> it.
>
Me too, since Oracle 6, Informix and Ingres in the late 80's until today. What I don't have experience of, and what you
omitted from this informative post, are experiences of Postgres 8.x's handling of blobs - the question I asked - rather
than the effect of using text columns, which is a different question altogether.

Steve