Re: Compare rows - Mailing list pgsql-performance

From Thomas Swan
Subject Re: Compare rows
Date
Msg-id 3F86A5AA.6060705@idigx.com
Whole thread Raw
In response to Re: Compare rows  (Jason Hihn <jhihn@paytimepayroll.com>)
List pgsql-performance
I took this approach with a former company in designing an dynamic
e-commerce system.   This kept the addition of new products from
requiring an alteration of the schema.   With an ORB manager and cache
control the performance was not significantly, but the automatic
extensibility and the ease of maintainabilty was greatly enhanced.

Thomas


Jason Hihn wrote:

>
>
>>-----Original Message-----
>>From: pgsql-performance-owner@postgresql.org
>>[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Greg
>>Spiegelberg
>>Sent: Wednesday, October 08, 2003 3:11 PM
>>To: PgSQL Performance ML
>>Subject: Re: [PERFORM] Compare rows
>>
>>
>>Josh Berkus wrote:
>>
>>
>>>Greg,
>>>
>>>
>>>
>>>
>>>>The data represents metrics at a point in time on a system for
>>>>network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
>>>>speed, and whatever else can be gathered.
>>>>
>>>>We arrived at this one 642 column table after testing the whole
>>>>process from data gathering, methods of temporarily storing then
>>>>loading to the database.  Initially, 37+ tables were in use but
>>>>the one big-un has saved us over 3.4 minutes.
>>>>
>>>>
>>>Hmmm ... if few of those columns are NULL, then you are
>>>
>>>
>>probably right ...
>>
>>
>>>this is probably the most normalized design.   If, however,
>>>
>>>
>>many of columns
>>
>>
>>>are NULL the majority of the time, then the design you should
>>>
>>>
>>be using is a
>>
>>
>>>vertial child table, of the form  ( value_type  | value ).
>>>
>>>Such a vertical child table would also make your comparison
>>>
>>>
>>between instances
>>
>>
>>>*much* easier, as it could be executed via a simple
>>>
>>>
>>4-table-outer-join and 3
>>
>>
>>>where clauses.  So even if you don't have a lot of NULLs, you
>>>
>>>
>>probably want
>>
>>
>>>to consider this.
>>>
>>>
>>You lost me on that one.  What's a "vertical child table"?
>>
>>
>
>Parent table Fkey | Option | Value
>------------------+--------+-------
>                  | OS     | Solaris
>                  | DISK1  | 30g
>                   ^^^^^^^^   ^^^-- values
>                      fields are values in a column rather than 'fields'
>
>
>
>
>>Statistically, about 6% of the rows use more than 200 of the columns,
>>27% of the rows use 80-199 or more columns, 45% of the rows use 40-79
>>columns and the remaining 22% of the rows use 39 or less of the columns.
>>That is a lot of NULLS.  Never gave that much thought.
>>
>>To ensure query efficiency, hide the NULLs and simulate the multiple
>>tables I have a boatload of indexes, ensure that every query makees use
>>of an index, and have created 37 views.  It's worked pretty well so
>>far
>>
>>
>>
>>
>>>>The reason for my initial question was this.  We save changes only.
>>>>In other words, if system S has row T1 for day D1 and if on day D2
>>>>we have another row T1 (excluding our time column) we don't want
>>>>to save it.
>>>>
>>>>
>>>If re-designing the table per the above is not a possibility,
>>>
>>>
>>then I'd suggest
>>
>>
>>>that you locate 3-5 columns that:
>>>1) are not NULL for any row;
>>>2) combined, serve to identify a tiny subset of rows, i.e. 3%
>>>
>>>
>>or less of the
>>
>>
>>>table.
>>>
>>>
>>There are always, always, always 7 columns that contain data.
>>
>>
>>
>>
>>>Then put a multi-column index on those columns, and do your
>>>
>>>
>>comparison.
>>
>>
>>>Hopefully the planner should pick up on the availablity of the
>>>
>>>
>>index and scan
>>
>>
>>>only the rows retrieved by the index.   However, there is the distinct
>>>possibility that the presence of 637 WHERE criteria will
>>>
>>>
>>confuse the planner,
>>
>>
>>>causing it to resort to a full table seq scan; in that case,
>>>
>>>
>>you will want to
>>
>>
>>>use a subselect to force the issue.
>>>
>>>
>>That's what I'm trying to avoid is a big WHERE (c1,c2,...,c637) <>
>>(d1,d2,...,d637) clause.  Ugly.
>>
>>
>>
>>
>>>Or, as Joe Conway suggested, you could figure out some kind of
>>>
>>>
>>value hash that
>>
>>
>>>uniquely identifies your rows.
>>>
>>>
>>I've given that some though and though appealing I don't think I'd care
>>to spend the CPU cycles to do it.  Best way I can figure to accomplish
>>it would be to generate an MD5 on each row without the timestamp and
>>store it in another column, create an index on the MD5 column, generate
>>MD5 on each line I want to insert.  Makes for a simple WHERE...
>>
>>Okay.  I'll give it a whirl.  What's one more column, right?
>>
>>Greg
>>
>>--
>>Greg Spiegelberg
>>  Sr. Product Development Engineer
>>  Cranel, Incorporated.
>>  Phone: 614.318.4314
>>  Fax:   614.431.8388
>>  Email: gspiegelberg@Cranel.com
>>Cranel. Technology. Integrity. Focus.
>>
>>
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 2: you can get off all lists at once with the unregister command
>>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>>
>>
>>
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
>



pgsql-performance by date:

Previous
From: Jeff
Date:
Subject: Re: backup/restore - another area.
Next
From: "Nick Barr"
Date:
Subject: go for a script! / ex: PostgreSQL vs. MySQL