Re: improvement suggestions for performance design - Mailing list pgsql-performance

From Heikki Linnakangas
Subject Re: improvement suggestions for performance design
Date
Msg-id 468D3AF7.4070505@enterprisedb.com
Whole thread Raw
In response to Re: improvement suggestions for performance design  (tfinneid@ifi.uio.no)
Responses Re: improvement suggestions for performance design
List pgsql-performance
tfinneid@ifi.uio.no wrote:
>> I would strongly suggest that you use a proper relational schema,
>> instead of storing everything in two tables. I don't know your
>> application, but a schema like that is called an Entity-Attribute-Value
>> (though your entity seems to be just posx and posy) and it should raise
>> a big red flag in the mind of any database designer. In particular,
>> constructing queries against an EAV schema is a major pain in the ass.
>> This has been discussed before on postgresql lists as well, you might
>> want to search and read the previous discussions.
>
> I get your point, but the thing is the attributes have no particular
> relation to each other, other than belonging to same attribute groups.
> There are no specific rules that states that certain attributes are always
> used together, such as with an address record. It depends on what
> attributes the operator wants to study. This is why I don't find any
> reason to group the attributes into separate tables and columns.

ISTM that a properly normalized schema would look something like this:

create table position (
   posX int not null,
   posY int not null,
   primary key (posX, posY)
);

create table colour (
   posX int not null,
   posY int not null,
   colour varchar(50) not null,
   primary key (posX, posY),
   foreign key (posX, posY) references position (posX, posY)
);

create table population (
   posX int not null,
   posY int not null,
   population int notn u,
   primary key (posX, posY),
   foreign key (posX, posY) references position (posX, posY)
);

where colour and population are examples of attributes you want to
store. If you have 100 different attributes, you'll have 100 tables like
that. That may sound like a lot, but it's not.

This allows you to use proper data types for the attributes, as well as
constraints and all the other goodies a good relational data model gives you

It also allows you to build proper indexes on the attributes. For
example, if you store populations as text, you're going to have a hard
time building an index that allows you to query for positions with a
population between 100-2000 efficiently.

These are all imaginary examples, but what I'm trying to point out here
is that a proper relational schema allows you to manage and query your
data much more easily and with more flexibility, allows for future
extensions.

A normalized schema will also take less space, which means less I/O and
more performance, because there's no need to store metadata like the
data_type, attr_type on every row. For performance reasons, you might
actually want to not store the position-table at all in the above schema.

An alternative design would be to have a single table, with one column
per attribute:

create table position (
   posX int not null,
   posY int not null,
   colour varchar(50),
   population int,
   ...
   primary key (posX, posY)
)

This is more space-efficient, especially if you have a lot of attributes
on same coordinates. You can easily add and drop columns as needed,
using ALTER TABLE.

> I am still looking into the design of the tables, but I need to get at
> proper test harness running before I can start ruling things out. And a
> part of that, is for example, efficient ways of transferring the insert
> data from the client to the db, instead of just single command inserts.
> This is where bulk transfer by arrays probably would be preferable.

Before you start fiddling with functions, I'd suggest that you try
batching the inserts with the JDBC PreparedStatement batch facility.

Splitting the inserts into multiple threads in your application sounds
messy. The inserts would have to be in separate transactions, for
example. Presumably your CORBA ORB will spawn multiple threads for you
when there's a lot requests coming in, so the overall throughput should
be the same with a single thread per request.

BTW, I concur with Y Sidhu that with data volumes as high as you have,
partitioning is a good idea. It's a lot easier to manage 20 10 GB table
partitions, than one 200 GB table. For example, VACUUM, CLUSTER, CREATE
INDEX can be done partition per partition, instead of as a single huge
operatio that runs for hours. Though if you choose to have just one
table per attribute type, each table might be conveniently small by
nature, so that no partitioning is required.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

pgsql-performance by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Query is taking 5 HOURS to Complete on 8.1 version
Next
From: Thomas Finneid
Date:
Subject: Re: improvement suggestions for performance design