Re: Compare rows - Mailing list pgsql-performance

From Jason Hihn
Subject Re: Compare rows
Date
Msg-id NGBBLHANMLKMHPDGJGAPIEBECOAA.jhihn@paytimepayroll.com
Whole thread Raw
In response to Re: Compare rows  (Greg Spiegelberg <gspiegelberg@cranel.com>)
Responses Re: Compare rows
List pgsql-performance

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Greg
> Spiegelberg
> Sent: Wednesday, October 08, 2003 3:11 PM
> To: PgSQL Performance ML
> Subject: Re: [PERFORM] Compare rows
>
>
> Josh Berkus wrote:
> > Greg,
> >
> >
> >>The data represents metrics at a point in time on a system for
> >>network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
> >>speed, and whatever else can be gathered.
> >>
> >>We arrived at this one 642 column table after testing the whole
> >>process from data gathering, methods of temporarily storing then
> >>loading to the database.  Initially, 37+ tables were in use but
> >>the one big-un has saved us over 3.4 minutes.
> >
> >
> > Hmmm ... if few of those columns are NULL, then you are
> probably right ...
> > this is probably the most normalized design.   If, however,
> many of columns
> > are NULL the majority of the time, then the design you should
> be using is a
> > vertial child table, of the form  ( value_type  | value ).
> >
> > Such a vertical child table would also make your comparison
> between instances
> > *much* easier, as it could be executed via a simple
> 4-table-outer-join and 3
> > where clauses.  So even if you don't have a lot of NULLs, you
> probably want
> > to consider this.
>
> You lost me on that one.  What's a "vertical child table"?

Parent table Fkey | Option | Value
------------------+--------+-------
                  | OS     | Solaris
                  | DISK1  | 30g
                   ^^^^^^^^   ^^^-- values
                      fields are values in a column rather than 'fields'


> Statistically, about 6% of the rows use more than 200 of the columns,
> 27% of the rows use 80-199 or more columns, 45% of the rows use 40-79
> columns and the remaining 22% of the rows use 39 or less of the columns.
> That is a lot of NULLS.  Never gave that much thought.
>
> To ensure query efficiency, hide the NULLs and simulate the multiple
> tables I have a boatload of indexes, ensure that every query makees use
> of an index, and have created 37 views.  It's worked pretty well so
> far
>
>
> >>The reason for my initial question was this.  We save changes only.
> >>In other words, if system S has row T1 for day D1 and if on day D2
> >>we have another row T1 (excluding our time column) we don't want
> >>to save it.
> >
> >
> > If re-designing the table per the above is not a possibility,
> then I'd suggest
> > that you locate 3-5 columns that:
> > 1) are not NULL for any row;
> > 2) combined, serve to identify a tiny subset of rows, i.e. 3%
> or less of the
> > table.
>
> There are always, always, always 7 columns that contain data.
>
>
> > Then put a multi-column index on those columns, and do your
> comparison.
> > Hopefully the planner should pick up on the availablity of the
> index and scan
> > only the rows retrieved by the index.   However, there is the distinct
> > possibility that the presence of 637 WHERE criteria will
> confuse the planner,
> > causing it to resort to a full table seq scan; in that case,
> you will want to
> > use a subselect to force the issue.
>
> That's what I'm trying to avoid is a big WHERE (c1,c2,...,c637) <>
> (d1,d2,...,d637) clause.  Ugly.
>
>
> > Or, as Joe Conway suggested, you could figure out some kind of
> value hash that
> > uniquely identifies your rows.
>
> I've given that some though and though appealing I don't think I'd care
> to spend the CPU cycles to do it.  Best way I can figure to accomplish
> it would be to generate an MD5 on each row without the timestamp and
> store it in another column, create an index on the MD5 column, generate
> MD5 on each line I want to insert.  Makes for a simple WHERE...
>
> Okay.  I'll give it a whirl.  What's one more column, right?
>
> Greg
>
> --
> Greg Spiegelberg
>   Sr. Product Development Engineer
>   Cranel, Incorporated.
>   Phone: 614.318.4314
>   Fax:   614.431.8388
>   Email: gspiegelberg@Cranel.com
> Cranel. Technology. Integrity. Focus.
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

pgsql-performance by date:

Previous
From: pgsql-performance@sysd.com
Date:
Subject: Re: Sun performance - Major discovery!
Next
From: Jean-Luc Lachance
Date:
Subject: Re: Compare rows