Re: Matching unique primary keys - Mailing list pgsql-general

From: Richard Huxton
Subject: Re: Matching unique primary keys
Date:
Msg-id: 200210231126.39242.dev@archonet.com
In response to: Matching unique primary keys (Kevin Old <kold@carolina.rr.com>)
Responses: Re: Matching unique primary keys (Kevin Old <kold@carolina.rr.com>)
List: pgsql-general
On Tuesday 22 Oct 2002 9:12 pm, Kevin Old wrote:
> Hello all,
>
> I have a table of data that is very large (502 columns) and whenever I
> query it, I only query on up to 4 columns and on average only 20 fields
> are returned.  I've come up with a way to increase the speed of the
> queries.  I could put the 20 fields that are searched and returned
> into a separate table and then link them by a unique id.  Sounds
> easy......but I'm stuck on something.
>
> I am dealing with about 1 million records per day.

Just to clarify - that's a table that's expanding by 1 million rows a day?

> One option is to put a sequence on the tables, but dealing with so many
> records I'd have to use a BIGINT and with a ceiling of
> 9223372036854775807 it seems to me that numbers this large might slow
> down the query process.  I realize it would take quite a while to get to
> this point, but would like other's opinions.

Well, an int4 gives you about 2 billion positive ids (2^31) - roughly 2,000
days' worth at a million rows a day, or 4,000-odd if you're prepared to use the
negative range too - which may or may not be enough for you. One thing I'd say
is that querying a table of a couple of billion 500-column rows is going to
require a hefty machine.

Using a bigint (int8) means practically unlimited expansion, but int8
arithmetic is slower than int4 on a 32-bit system. The details will depend on
your hardware; on a 64-bit system I'd expect no difference (but test - I
haven't).

Having said that, I can't believe the choice of key type matters much compared
with the overhead of scanning 500-column rows.
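
Either way, declaring it is easy enough. A minimal sketch with made-up names
(foo, col_a etc. are placeholders, nothing to do with your real schema):

  CREATE TABLE foo (
      id     SERIAL PRIMARY KEY,   -- int4 fed from a sequence: ~2 billion ids
      col_a  integer,
      col_b  text
  );

  -- If you decide you need the headroom, 7.2 and later also have:
  --     id  BIGSERIAL PRIMARY KEY   -- int8 fed from a sequence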

> Another option is that I have 3 fields that, when combined, make each
> record unique.  Is there some way I can combine these dynamically and
> then use "views" of them to reference my records and display them from
> the larger table?

So - those 3 fields are a candidate key, and in the absence of any other info
they're your primary key. A first design would be to repeat these 3 fields in
each table (if you split the 500-column one), and then, if performance sucks,
add a sequence to the "core" table and use that as the key instead.
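
To make that concrete, here's a rough sketch - field_a/b/c stand in for your
three identifying columns, and the table names are invented:

  CREATE TABLE foo_core (
      field_a  integer NOT NULL,
      field_b  integer NOT NULL,
      field_c  timestamp NOT NULL,
      -- ...plus the ~20 columns you search on and return...
      PRIMARY KEY (field_a, field_b, field_c)
  );

  CREATE TABLE foo_detail (
      field_a  integer NOT NULL,
      field_b  integer NOT NULL,
      field_c  timestamp NOT NULL,
      -- ...plus the remaining ~480 columns...
      FOREIGN KEY (field_a, field_b, field_c)
          REFERENCES foo_core (field_a, field_b, field_c)
  );

A join on the three key columns then reassembles the full row on the rare
occasions you need all of it.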

I'd think it's unlikely that your table should (in design terms) be over 500
columns wide. If lots of these values are defaults or nulls then that's a
sign that things need to be split up.

If the 20 columns you mention querying and returning are logically related,
they should form your foo_core table. If the columns can be split into logical
groups, that's where I'd start. If nothing else, a logical split is less likely
to need changes in the future - an important consideration once you have a few
hundred million records.

From a performance point of view, I'd think it helpful to have your 4 query
columns in one table and your 20 result columns in one table (possibly the same
table). However, if they aren't logically related, I'd still be tempted to
split things logically and set up a separate foo_query table, maintained by
triggers.
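
Roughly like this - again just a sketch with invented names, building on the
foo_core above. Note that plpgsql needs to be installed in your database
(createlang plpgsql), and that versions before 7.3 spell the function's return
type OPAQUE rather than trigger:

  CREATE TABLE foo_query (
      field_a  integer NOT NULL,
      field_b  integer NOT NULL,
      field_c  timestamp NOT NULL
      -- ...plus whatever columns your queries need returned...
  );

  CREATE FUNCTION foo_query_sync() RETURNS trigger AS '
  BEGIN
      -- copy the searchable columns into the narrow table
      INSERT INTO foo_query (field_a, field_b, field_c)
          VALUES (NEW.field_a, NEW.field_b, NEW.field_c);
      RETURN NEW;
  END;
  ' LANGUAGE 'plpgsql';

  CREATE TRIGGER foo_query_ins
      AFTER INSERT ON foo_core
      FOR EACH ROW EXECUTE PROCEDURE foo_query_sync();

You'd want matching UPDATE and DELETE triggers too if rows ever change once
loaded.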

> I realize this is very confusing, please get back with me if I should
> elaborate on something.

Difficult for anyone on the list to give specific advice without knowing your
hardware, performance requirements and schema details. If my post hasn't
helped, try another message with more details.

Bruce Momjian has written some notes on the internals of PG - how it accesses
disk blocks and interacts with the OS disk cache. They're either in the
developers' section of the website or on techdocs.postgresql.org - I can't
remember which, sorry. They might give you something to think about while
you're looking at this problem.

--
  Richard Huxton
