Thread: Questions about indexes?
Hello postgres hackers,

Been a while since I have participated on this list ... but I have a new
itch to scratch....

Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:

    host_id   integer  (not null)
    timestamp datetime (not null)
    category  text     (not null) [<= 5 chars]
    anomaly   text     (not null) [<= 1024 chars]

This table is used to store archived data, so each row in the table must
be unique. Currently I am using a primary key across each column to
enforce this uniqueness.

This table currently has ~86 million rows and is 16+ GB in size. The
primary key index is also 16+ GB in size, because it appears all the data
is duplicated in the index. (I have only done some preliminary looking at
the database file with strings, etc., so this assumption is purely based
on those observations.)

I am not sure why all the data is duplicated in the index ... but I bet it
has to do with performance, since it would save a lookup in the main
table. Are there any benchmarks or papers related to this topic I should
locate and read? I am curious about this because it seems the only
advantage gained is searching the index for the specified values.... Once
the entry is found, the full entry needs to be pulled from the main table
anyhow, since the index does not contain all the data. Also, with the
increased size, it seems additional pressure would be put on the shared
memory caches (no idea how this really works, just guessing! :))

Since my only requirement is that the rows be unique, I have developed a
custom MD5 function in C, and created an index on the MD5 hash of the
concatenation of all the fields. This has reduced the disk space usage
considerably, as shown below against my test database (~6 million rows at
1+ GB). All this data is based off the test database running 7.3.2:

    Type                 Size
    ---------------------------------
    Database table       1188642816
    All-columns pkey     1510252544
    MD5-column pkey       370999296

Just using the MD5 hash data instead of all the columns is a considerable
disk space win, going from 1.5 GB to 370 MB.

Has anyone else solved this problem? Has anyone else looked into something
like this and mind sharing, so I do not have to re-invent the wheel? :)

Also (assuming there are no papers / benchmarks proving data in the index
is a good idea), how difficult would it be to implement an index type that
extracts the data from the main table?

Thanks for reading. I will be happy to field any questions that I can, or
read any papers, research, etc. that relate to this topic.

- Ryan

P.S. the production database is running 7.2.4 if that makes a difference.

--
Ryan Bradetich <rbradetich@uswest.net>
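For concreteness, here is a minimal sketch of this idea as an expression
index, assuming a release with the built-in md5() function (7.4 and later)
and a hypothetical table name anomaly_archive. Unlike Ryan's custom
all-columns C hash, this variant hashes only the wide anomaly column and
keeps the narrow columns raw, which keeps the index small without any
custom code:

    -- Unique expression index: narrow columns stored as-is, only the wide
    -- anomaly text is reduced to its 32-character MD5 digest.
    CREATE UNIQUE INDEX anomaly_archive_md5_key
        ON anomaly_archive (host_id, "timestamp", category, md5(anomaly));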
Ryan Bradetich <rbradetich@uswest.net> writes:
> Although the table schema is immaterial, I will provide it so we have a
> common framework for this discussion:

> host_id integer (not null)
> timestamp datetime (not null)
> category text (not null) [<= 5 chars]
> anomaly text (not null) [<= 1024 chars]

> This table is used to store archived data, so each row in the table must
> be unique. Currently I am using a primary key across each column to
> enforce this uniqueness.

It's not real clear to me why you bother enforcing a constraint that the
complete row be unique.  Wouldn't a useful constraint be that the first
three columns be unique?  Even if that's not correct, what's wrong with
tolerating a few duplicates?  You can't tell me it's to save on storage ;-)

> I am not sure why all the data is duplicated in the index ... but I bet
> it has to do with performance since it would save a lookup in the main
> table.

An index that can't prevent looking into the main table wouldn't be
worth anything AFAICS ...

			regards, tom lane
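Tom's three-column suggestion, spelled out as DDL (a sketch only; the
table and constraint names are hypothetical, and as the follow-ups show,
Ryan's data actually does repeat these three columns, so this only
illustrates what the suggestion would look like):

    ALTER TABLE anomaly_archive
        ADD CONSTRAINT anomaly_archive_host_ts_cat_key
        UNIQUE (host_id, "timestamp", category);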
On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
> Ryan Bradetich <rbradetich@uswest.net> writes:
> > Although the table schema is immaterial, I will provide it so we have a
> > common framework for this discussion:
>
> > host_id integer (not null)
> > timestamp datetime (not null)
> > category text (not null) [<= 5 chars]
> > anomaly text (not null) [<= 1024 chars]
>
> > This table is used to store archived data, so each row in the table must
> > be unique. Currently I am using a primary key across each column to
> > enforce this uniqueness.
>
> It's not real clear to me why you bother enforcing a constraint that the
> complete row be unique.  Wouldn't a useful constraint be that the first
> three columns be unique?  Even if that's not correct, what's wrong with
> tolerating a few duplicates?  You can't tell me it's to save on storage
> ;-)

The table holds system policy compliance data. The category is basically
the policy, and the anomaly is the detailed text explaining why the system
is out of compliance. So the anomaly data is important (and often the
reason why the key is unique).

The reason we are archiving the data is to generate reports and graphs
showing policy compliance over time. Duplicated rows would artificially
inflate the numbers in the reports and graphs. The other option we had was
to perform a DISTINCT select at report / graph time; we chose not to go
that route because of the sort it adds to the query. (Also it just seemed
tidier to only store good data :))

The disk storage is a minor concern :), but I was actually looking at this
as a possible performance enhancement. I am curious how it affects the
shared buffer cache, and there should also be fewer pages to read on
average since the index is smaller. Does this make sense? Or am I out in
left field again? :)

> > I am not sure why all the data is duplicated in the index ... but I bet
> > it has to do with performance since it would save a lookup in the main
> > table.
>
> An index that can't prevent looking into the main table wouldn't be
> worth anything AFAICS ...

Ok, scratch that idea then :) I will continue looking at other ideas like
the MD5 data hashing etc.

Thanks for your input Tom!

- Ryan

--
Ryan Bradetich <rbradetich@uswest.net>
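The trade-off Ryan describes, sketched with a hypothetical table name
(anomaly_archive): tolerating duplicates pushes a DISTINCT, and the sort
behind it, into every report query, while enforcing uniqueness at insert
time lets the reports use a plain aggregate:

    -- If duplicates are allowed, every report must de-duplicate first:
    SELECT category, count(*) AS violations
    FROM (SELECT DISTINCT host_id, "timestamp", category, anomaly
          FROM anomaly_archive) AS uniq
    GROUP BY category;

    -- If the table is kept unique, a straight aggregate suffices:
    SELECT category, count(*) AS violations
    FROM anomaly_archive
    GROUP BY category;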
Ryan Bradetich <rbradetich@uswest.net> writes:
> On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
>> It's not real clear to me why you bother enforcing a constraint that the
>> complete row be unique.  Wouldn't a useful constraint be that the first
>> three columns be unique?

> The table holds system policy compliance data. The category is
> basically the policy, and the anomaly is the detailed text explaining
> why the system is out of compliance. So the anomaly data is important
> (and often the reason why the key is unique).

Well, sure the anomaly is important: it's the payload, the reason why
you bother to have the table in the first place.  But that doesn't mean
it's part of the key.  Generally the key would be the info you use to
look up a particular anomaly text.  In this example, it's not clear to
me why you'd need/want two different anomaly texts entered for the same
host_id and the same category at the same instant of time.  ISTM there's
something inadequate about your category column if you need that.

			regards, tom lane
On Mon, 2003-02-17 at 00:15, Tom Lane wrote:
> Ryan Bradetich <rbradetich@uswest.net> writes:
> > On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
> >> It's not real clear to me why you bother enforcing a constraint that the
> >> complete row be unique.  Wouldn't a useful constraint be that the first
> >> three columns be unique?
>
> > The table holds system policy compliance data. The category is
> > basically the policy, and the anomaly is the detailed text explaining
> > why the system is out of compliance. So the anomaly data is important
> > (and often the reason why the key is unique).
>
> Well, sure the anomaly is important: it's the payload, the reason why
> you bother to have the table in the first place.  But that doesn't mean
> it's part of the key.  Generally the key would be the info you use to
> look up a particular anomaly text.  In this example, it's not clear to
> me why you'd need/want two different anomaly texts entered for the same
> host_id and the same category at the same instant of time.  ISTM there's
> something inadequate about your category column if you need that.

Ok, I understand what you are asking now :)  Let me make up a contrived
example to show how the table is used.

    host_id 1 = hosta.somewhere.com
    host_id 2 = hostb.somewhere.com

The categories are coded like so (made-up examples):

    cat p101 = /etc/passwd check
    cat f101 = filesystem check

The table would look like:

    1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
    1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
    1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
    2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
    etc...

So I do not need the anomaly to be part of the index; I only need it to
make each row unique.

I agree with you that I would not normally add the anomaly to the index,
except for the unique row requirement. Thinking about it now, maybe I
should guarantee unique rows via a check constraint...

Thanks for making me think about this in a different way!

- Ryan

--
Ryan Bradetich <rbradetich@uswest.net>
>>> Ryan Bradetich said:
> The table would look like:
>
>     1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
>     1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
>     1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
>     2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
>     etc...
>
> So I do not need the anomaly to be part of the index; I only need it to
> make each row unique.
>
> I agree with you that I would not normally add the anomaly to the index,
> except for the unique row requirement. Thinking about it now, maybe I
> should guarantee unique rows via a check constraint...
>
> Thanks for making me think about this in a different way!

(sorry this is a bit long)

Ryan,

I use somewhat similarly structured data (archived records of various
events), and when the database was set up (back when this baby was called
postgres95), I too used indexes on all possible fields.

My database consists of an 'operations' table, which holds data for the
last x days (for example), and several tables with archived records (per
month or per year - see below). The operations table can have frequent
updates, which add new data. Data is never modified, but lookups are
frequent. The archived tables are generated once and forever from the
operations table (possibly merging them in the future, but I haven't yet
made up my mind on this) - after that, access is read-only, although
sufficiently frequent.

What I found over the many years of operating this database on different
PostgreSQL versions and hardware is that indexes have considerable
cost. :)  So does the need to not miss anything from the operations table
(that is, to collect data from many places and have it all in there).

I ended up with only a few indexes on the operations table, because the
processes that fill it up do minimal lookups to see if data is already in
the table, and if not do inserts. Then at regular intervals the table is
cleaned up - that is, a process to remove the duplicates is run. This
unfortunately costs OIDs, but I found no other reasonable way to do the
fast inserts. Perhaps the best way is to create the table without OIDs
(but wouldn't this still waste OIDs?), use COPY, and then clean up
afterwards?

The archived tables are generated, then cleaned up. Then, as Tom
suggested, indexes are put on the archived tables, only for the fields
that are used in queries. Once such a table is created, there is no way
duplicated data will appear in it, as it is never inserted into again.
Therefore there is no need for UNIQUE index enforcement.

If you need to have one large 'history' table, then perhaps you will just
have to do (slow :)) selects for each record before each insert, or just
insert the data and then run the cleanup process.

Daniel
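One way to spell the periodic cleanup Daniel describes (a sketch only;
the table and column names are hypothetical, and rebuilding the table like
this sidesteps any need for oid/ctid tricks, at the cost of locking the
table while it runs):

    BEGIN;
    -- Rebuild the table keeping one copy of each distinct row.
    CREATE TABLE operations_clean AS
        SELECT DISTINCT host_id, "timestamp", category, anomaly
        FROM operations;
    DROP TABLE operations;
    ALTER TABLE operations_clean RENAME TO operations;
    -- Any indexes on the old table must be recreated here.
    COMMIT;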
> I ended up with only a few indexes on the operations table, because the
> processes that fill it up do minimal lookups to see if data is already in
> the table, and if not do inserts. Then at regular intervals the table is
> cleaned up - that is, a process to remove the duplicates is run. This
> unfortunately costs OIDs, but I found no other reasonable way to do the
> fast inserts. Perhaps the best way is to create the table without OIDs
> (but wouldn't this still waste OIDs?), use COPY, and then clean up
> afterwards?

No, WITHOUT OIDS is implemented specifically to not waste OIDs.

Chris
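For reference, the syntax in question (a sketch; the column list follows
Ryan's schema, and on PostgreSQL 12 and later user tables no longer have
OIDs at all, so the clause only matters on older releases):

    CREATE TABLE operations (
        host_id     integer   NOT NULL,
        "timestamp" timestamp NOT NULL,
        category    text      NOT NULL,
        anomaly     text      NOT NULL
    ) WITHOUT OIDS;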
Ryan Bradetich <rbradetich@uswest.net> writes:
> The table would look like:
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.

Ah, I see your point now.

(Thinks: what about separating the "anomaly" column into an "identifier"
and a "complaint" column:

1 | Mon Feb 17 00:34:24 MST 2003 | p101 | x    | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y    | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y    | user has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | /foo | file has improper owner.

No, that doesn't quite work either, unless you are willing to make the
categories more specific.  At which point the category and the anomaly
text become equivalent.  Actually I'm wondering why you bother with the
category at all; isn't it implied by the anomaly text?)

> I agree with you that I would not normally add the anomaly to the
> index, except for the unique row requirement. Thinking about it now,
> maybe I should guarantee unique rows via a check constraint...

A check constraint won't be efficient either, at least not without a
supporting index.  Possibly you could index just the host and timestamp
columns, which would not be unique, but it would cut the number of rows
the constraint would need to examine to something manageable.

But I'm still thinking that enforcing uniqueness is a waste of time.
What exactly is so harmful about it if

1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.

appears twice?  How likely is that anyway (especially if you don't
truncate the timestamp precision)?

			regards, tom lane
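Tom's supporting-index idea, spelled out as a sketch (anomaly_archive and
the split table are hypothetical names, shown only to illustrate the two
suggestions above):

    -- Narrow, non-unique index: enough to make a per-insert duplicate
    -- check cheap without carrying the anomaly text in the index.
    CREATE INDEX anomaly_archive_host_ts_idx
        ON anomaly_archive (host_id, "timestamp");

    -- The identifier/complaint split from the parenthetical, as DDL:
    CREATE TABLE anomaly_archive_split (
        host_id     integer   NOT NULL,
        "timestamp" timestamp NOT NULL,
        category    text      NOT NULL,
        identifier  text      NOT NULL,   -- e.g. the user name or file path
        complaint   text      NOT NULL    -- e.g. 'user has an invalid shell.'
    );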
On Mon, 16 Feb 2003, Ryan Bradetich wrote:

> I am not sure why all the data is duplicated in the index ...

Well, you have to have the full key in the index, or how would you know,
when you look at a particular index item, whether it actually matches what
you're searching for?

MS SQL Server does have an interesting option that would help you a lot in
this case: clustered indexes. A table may have a single clustered index,
and each leaf node of the index stores not just the key but actually the
entire row. Thus, in a case like yours, you'd store the row only once, not
twice.

Without thinking too hard about it (my usual mode of operation on this
list :-)), this could probably be implemented in postgresql. But I don't
think it would be entirely trivial, and your case is unusual enough that I
very much doubt whether it would be worth implementing to fix that alone.
It would also offer the advantage that any lookup using the clustered
index would save fetching the heap page after that as well, but it's hard
to say if the savings would be worth the work.

> Since my only requirement is that the rows be unique, I have developed a
> custom MD5 function in C, and created an index on the MD5 hash of the
> concatenation of all the fields.

Well, that won't guarantee uniqueness, since it's perfectly possible to
have two different rows hash to the same value. (If that weren't possible,
your hash would have to contain as much information as the row itself, and
your space savings wouldn't be nearly so dramatic.)

cjs
--
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Curt Sampson wrote:
> On Mon, 16 Feb 2003, Ryan Bradetich wrote:
> > Since my only requirement is that the rows be unique, I have developed a
> > custom MD5 function in C, and created an index on the MD5 hash of the
> > concatenation of all the fields.
>
> Well, that won't guarantee uniqueness, since it's perfectly possible
> to have two different rows hash to the same value. (If that weren't
> possible, your hash would have to contain as much information as the row
> itself, and your space savings wouldn't be nearly so dramatic.)

That's true, but even if he has 4 billion rows, the probability of a
duplicate is still vanishingly small, so it's probably a safe enough bet.
His application doesn't require absolute uniqueness, fortunately, so md5
works well enough in this case. Otherwise md5 wouldn't be a terribly good
hash...

--
Kevin Brown                          kevin@sysexperts.com
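To put a rough number on it (a back-of-the-envelope birthday bound, not a
figure from the thread): with n = 2^32 rows there are about n^2 / 2 = 2^63
row pairs, each colliding under a 128-bit hash with probability 2^-128, so
the chance of any collision at all is around 2^-65:

    -- Approximate birthday bound for 2^32 rows under a 128-bit hash.
    SELECT 2.0::float8 ^ 63 / 2.0::float8 ^ 128 AS approx_collision_probability;
    -- result is roughly 2.7e-20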
Ryan,

I am crossing this discussion to the PGSQL-PERFORMANCE list, which is the
proper place for it. Anyone interested, please follow us there!

>>> Ryan Bradetich said:
> The table would look like:
>
>     1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
>     1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
>     1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
>     2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
>     etc...
>
> So I do not need the anomaly to be part of the index; I only need it to
> make each row unique.
>
> I agree with you that I would not normally add the anomaly to the index,
> except for the unique row requirement. Thinking about it now, maybe I
> should guarantee unique rows via a check constraint...
>
> Thanks for making me think about this in a different way!

First off, I'm not clear on why a duplicate anomaly would necessarily be
invalid, given the above. Not useful, certainly, but legitimate real data.

I realize that you have performance considerations to work with. However,
I must point out that your database design is not *at all* normalized ...
and that normalizing it might solve some of your indexing problems. A
normalized structure would look something like:

TABLE operations
    id           serial not null primary key,
    host_id      int not null,
    timeoccurred timestamp not null default now(),
    category     varchar(5) not null,
    constraint operations_unq unique (host_id, timeoccurred, category)

TABLE anomalies
    id           serial not null primary key,
    operation_id int not null references operations(id) on delete cascade,
    anomaly      text

This structure would have two advantages for you:

1) The operations table would be *much* smaller than what you have now,
as you would not be repeating rows for each anomaly.

2) In neither table would you be including the anomaly text in an index
... thus reducing index size tremendously.

As a warning, though: you may find that for insert speed the referential
integrity key is better maintained at the middleware layer. We've had some
complaints about slow FK enforcement in Postgres.

--
Josh Berkus
Aglio Database Solutions
San Francisco
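Josh's outline turned into complete DDL, with an example reporting query
to show how the pieces join back together (a sketch; the names follow his
proposal, the report query is only an illustration, and the foreign key
could be dropped in favor of middleware checks as he warns):

    CREATE TABLE operations (
        id           serial PRIMARY KEY,
        host_id      integer NOT NULL,
        timeoccurred timestamp NOT NULL DEFAULT now(),
        category     varchar(5) NOT NULL,
        CONSTRAINT operations_unq UNIQUE (host_id, timeoccurred, category)
    );

    CREATE TABLE anomalies (
        id           serial PRIMARY KEY,
        operation_id integer NOT NULL
                     REFERENCES operations(id) ON DELETE CASCADE,
        anomaly      text NOT NULL
    );

    -- Example report: anomaly counts per category per day.
    SELECT o.category,
           date_trunc('day', o.timeoccurred) AS day,
           count(*) AS anomalies
    FROM operations o
    JOIN anomalies a ON a.operation_id = o.id
    GROUP BY o.category, date_trunc('day', o.timeoccurred)
    ORDER BY day, o.category;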