Re: Compare rows - Mailing list pgsql-performance

From Josh Berkus
Subject Re: Compare rows
Date
Msg-id 200310081010.26576.josh@agliodbs.com
Whole thread Raw
In response to Re: Compare rows  (Greg Spiegelberg <gspiegelberg@cranel.com>)
Responses Re: Compare rows
List pgsql-performance
Greg,

> The data represents metrics at a point in time on a system for
> network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
> speed, and whatever else can be gathered.
>
> We arrived at this one 642 column table after testing the whole
> process from data gathering, methods of temporarily storing then
> loading to the database.  Initially, 37+ tables were in use but
> the one big-un has saved us over 3.4 minutes.

Hmmm ... if few of those columns are NULL, then you are probably right ...
this is probably the most normalized design.   If, however, many of columns
are NULL the majority of the time, then the design you should be using is a
vertial child table, of the form  ( value_type  | value ).

Such a vertical child table would also make your comparison between instances
*much* easier, as it could be executed via a simple 4-table-outer-join and 3
where clauses.  So even if you don't have a lot of NULLs, you probably want
to consider this.

> The reason for my initial question was this.  We save changes only.
> In other words, if system S has row T1 for day D1 and if on day D2
> we have another row T1 (excluding our time column) we don't want
> to save it.

If re-designing the table per the above is not a possibility, then I'd suggest
that you locate 3-5 columns that:
1) are not NULL for any row;
2) combined, serve to identify a tiny subset of rows, i.e. 3% or less of the
table.

Then put a multi-column index on those columns, and do your comparison.
Hopefully the planner should pick up on the availablity of the index and scan
only the rows retrieved by the index.   However, there is the distinct
possibility that the presence of 637 WHERE criteria will confuse the planner,
causing it to resort to a full table seq scan; in that case, you will want to
use a subselect to force the issue.

Or, as Joe Conway suggested, you could figure out some kind of value hash that
uniquely identifies your rows.

--
Josh Berkus
Aglio Database Solutions
San Francisco

pgsql-performance by date:

Previous
From: Jeff
Date:
Subject: Re: Presentation
Next
From: Jason Hihn
Date:
Subject: Re: Compare rows