Thread: quick question abt pg_dump and restore
Hi,
When restoring the pg_dumped data thro psql does the rows of the table are restored in the same order? ie for example if
Table A has rows r1,r2,r3,r4,r5 in this order, then if I pg_dump and restore it to another database, will it have the rows in the same order r1,r2,r3,r4,r5? Does this apply to big tables also?
Thanks
josh
When restoring the pg_dumped data thro psql does the rows of the table are restored in the same order? ie for example if
Table A has rows r1,r2,r3,r4,r5 in this order, then if I pg_dump and restore it to another database, will it have the rows in the same order r1,r2,r3,r4,r5? Does this apply to big tables also?
Thanks
josh
On 09/01/2008 14:02, Josh Harrison wrote: > When restoring the pg_dumped data thro psql does the rows of the table > are restored in the same order? ie for example if > Table A has rows r1,r2,r3,r4,r5 in this order, then if I pg_dump and > restore it to another database, will it have the rows in the same order > r1,r2,r3,r4,r5? Does this apply to big tables also? If you use the text dump format, you can see the order in which the rows are restored. Ray. --------------------------------------------------------------- Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland rod@iol.ie ---------------------------------------------------------------
am Wed, dem 09.01.2008, um 9:02:23 -0500 mailte Josh Harrison folgendes: > Hi, > When restoring the pg_dumped data thro psql does the rows of the table are > restored in the same order? ie for example if > Table A has rows r1,r2,r3,r4,r5 in this order, then if I pg_dump and restore it > to another database, will it have the rows in the same order r1,r2,r3,r4,r5? No. If you need an order than you need an ORDER BY in the SELECT-Statement. Andreas -- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header) GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net
am Wed, dem 09.01.2008, um 14:07:13 +0000 mailte Raymond O'Donnell folgendes: > On 09/01/2008 14:02, Josh Harrison wrote: > > >When restoring the pg_dumped data thro psql does the rows of the table > >are restored in the same order? ie for example if > >Table A has rows r1,r2,r3,r4,r5 in this order, then if I pg_dump and > >restore it to another database, will it have the rows in the same order > >r1,r2,r3,r4,r5? Does this apply to big tables also? > > If you use the text dump format, you can see the order in which the rows > are restored. Right, but within the table the rows hasn't an order. You can see this when you select the ctid-column: test=# create table order_test (i int); CREATE TABLE test=*# insert into order_test values (1); INSERT 0 1 test=*# insert into order_test values (2); INSERT 0 1 test=*# insert into order_test values (3); INSERT 0 1 test=*# select ctid, i from order_test ; ctid | i -------+--- (0,1) | 1 (0,2) | 2 (0,3) | 3 (3 rows) test=*# update order_test set i=20 where i=2; UPDATE 1 test=*# update order_test set i=2 where i=20; UPDATE 1 test=*# select ctid, i from order_test ; ctid | i -------+--- (0,1) | 1 (0,3) | 3 (0,5) | 2 (3 rows) Now a pg_dump: ALTER TABLE public.order_test OWNER TO webmaster; -- -- Data for Name: order_test; Type: TABLE DATA; Schema: public; Owner: webmaster -- COPY order_test (i) FROM stdin; 1 3 2 \. Now the question: what is the correct order? Andreas -- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header) GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net
On Jan 9, 2008 9:12 AM, A. Kretschmer <andreas.kretschmer@schollglas.com> wrote:
am Wed, dem 09.01.2008, um 9:02:23 -0500 mailte Josh Harrison folgendes:> Hi,No. If you need an order than you need an ORDER BY in the
> When restoring the pg_dumped data thro psql does the rows of the table are
> restored in the same order? ie for example if
> Table A has rows r1,r2,r3,r4,r5 in this order, then if I pg_dump and restore it
> to another database, will it have the rows in the same order r1,r2,r3,r4,r5?
SELECT-Statement.
Fine. I can use order by when I want to order it in terms of some columns. But What if I want to maintain the same order as in the database1? ie., I want my rows of TableABC in Database2 to be the same order as the rows in TableABC in Database 1 ???
Thanks
josh
Josh Harrison escribió: > Fine. I can use order by when I want to order it in terms of some columns. > But What if I want to maintain the same order as in the database1? ie., I > want my rows of TableABC in Database2 to be the same order as the rows in > TableABC in Database 1 ??? You can't. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Jan 9, 2008 9:35 AM, A. Kretschmer <andreas.kretschmer@schollglas.com> wrote:
All my requirement is that the dumped table in database2 should be in the same order as the original table(at the time of dump) in database1 . am Wed, dem 09.01.2008, um 14:07:13 +0000 mailte Raymond O'Donnell folgendes:> On 09/01/2008 14:02, Josh Harrison wrote:Right, but within the table the rows hasn't an order. You can see this
>
> >When restoring the pg_dumped data thro psql does the rows of the table
> >are restored in the same order? ie for example if
> >Table A has rows r1,r2,r3,r4,r5 in this order, then if I pg_dump and
> >restore it to another database, will it have the rows in the same order
> >r1,r2,r3,r4,r5? Does this apply to big tables also?
>
> If you use the text dump format, you can see the order in which the rows
> are restored.
when you select the ctid-column:
test=# create table order_test (i int);
CREATE TABLE
test=*# insert into order_test values (1);
INSERT 0 1
test=*# insert into order_test values (2);
INSERT 0 1
test=*# insert into order_test values (3);
INSERT 0 1
test=*# select ctid, i from order_test ;
ctid | i
-------+---
(0,1) | 1
(0,2) | 2
(0,3) | 3
(3 rows)
test=*# update order_test set i=20 where i=2;
UPDATE 1
test=*# update order_test set i=2 where i=20;
UPDATE 1
test=*# select ctid, i from order_test ;
ctid | i
-------+---
(0,1) | 1
(0,3) | 3
(0,5) | 2
(3 rows)
Now a pg_dump:
ALTER TABLE public.order_test OWNER TO webmaster;
--
-- Data for Name: order_test; Type: TABLE DATA; Schema: public; Owner:
webmaster
--
COPY order_test (i) FROM stdin;
1
3
2
\.
Now the question: what is the correct order?
Thanks
josh
am Wed, dem 09.01.2008, um 9:45:11 -0500 mailte Josh Harrison folgendes: > What if I want to maintain the same order as in the database1? ie., I want my > rows of TableABC in Database2 to be the same order as the rows in TableABC in > Database 1 ??? For what reason? Again: there is no order within the database. Andreas -- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header) GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net
On Jan 9, 2008 9:59 AM, A. Kretschmer <andreas.kretschmer@schollglas.com> wrote:
Thanks...am Wed, dem 09.01.2008, um 9:45:11 -0500 mailte Josh Harrison folgendes:> What if I want to maintain the same order as in the database1? ie., I want myFor what reason?
> rows of TableABC in Database2 to be the same order as the rows in TableABC in
> Database 1 ???
Again: there is no order within the database.
Another quick question...When you issue a query like this
select * from dummy limit 10
What 10 rows are fetched? like first 10 or last 10 or the first 10 from first block or.... ?
And this query always returns the same 10 rows (when there are no updates/deletes)
josh
Josh Harrison escribió: > Another quick question...When you issue a query like this > select * from dummy limit 10 > What 10 rows are fetched? like first 10 or last 10 or the first 10 from > first block or.... ? Any 10. (First 10 in the physical table _if_ a seqscan is used). > And this query always returns the same 10 rows (when there are no > updates/deletes) Yes, assuming there is no VACUUM or CLUSTER either. And no inserts either if there is free space on the first pages. You really shouldn't trust it. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Josh Harrison escribi�: >> Fine. I can use order by when I want to order it in terms of some columns. >> But What if I want to maintain the same order as in the database1? ie., I >> want my rows of TableABC in Database2 to be the same order as the rows in >> TableABC in Database 1 ??? > You can't. According to the SQL standard, a table is an *unordered* collection of rows, and the results of any query are produced in an unspecified order (unless you use ORDER BY). The ambiguity about row ordering is intentional and is exploited by most DBMSes including Postgres to improve implementation efficiency. If you assume there is such a thing as a specific ordering within a table, you'll live to regret it eventually. regards, tom lane
On Jan 9, 2008 10:27 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Josh Harrison escribió:Any 10. (First 10 in the physical table _if_ a seqscan is used).
> Another quick question...When you issue a query like this
> select * from dummy limit 10
> What 10 rows are fetched? like first 10 or last 10 or the first 10 from
> first block or.... ?
Okay. Here is another scenario where Im confused.
I have a a table with around 30,000,000 recs. This is not a production system but a test system. So in the test system generally we upload the rows in some order say rows corresponding to a particular patient or something like that. But in the production system, it generally doesn't happen like that. The rows of 1 particular patient can be shuffled anywhere (ie., inserted in any order). We r trying to duplicate the same by shuffling te data in the table so that the rows are not in any order and also not stored in contiguous blocks
So now I have a table Dummy with 30,000,000 recs and a table Shuffled_Dummy (Create table Shuffled_Dummy as select * from Dummy order by random() ) with the same shuffled rows of dummy.
My questions
1. I pg_dumped dummy and Shuffled_dummy (from database1) to another database(database2)
When I issued the query in both database (database1 and database2)
select * from dummy limit 1000 ( the planner chooses seq scan for this query)
----- the output results from dummy are different in the 2 databases
But
select * from shuffled_dummy limit 1000 (planner chooses seq scan)
----- the outputs from shuffled_dummy are same from both the database
Why?
2. Also when does the planner switch from choosing index scan to bitmap index scan? Is it dependent on the number of rows to be retrieved or the position of the relevant data in the blocks or something else?
Thanks
josh
On Wed, 09 Jan 2008 10:54:21 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Josh Harrison escribió: > >> Fine. I can use order by when I want to order it in terms of > >> some columns. But What if I want to maintain the same order as > >> in the database1? ie., I want my rows of TableABC in Database2 > >> to be the same order as the rows in TableABC in Database 1 ??? > > > You can't. > > According to the SQL standard, a table is an *unordered* collection > of rows, and the results of any query are produced in an > unspecified order (unless you use ORDER BY). The ambiguity about > row ordering is intentional and is exploited by most DBMSes > including Postgres to improve implementation efficiency. If you > assume there is such a thing as a specific ordering within a table, > you'll live to regret it eventually. Does it make any sense *knowing* how the implementation works to load records in a table in a specific order to improve performances? And yeah I know that once you start deleting/updating row you may lose the advantage you gained betting on some peculiarity of the implementation... but in case you're dealing with a mostly static table? eg. if I'm importing a table does it make any sense to pre-sort it before importing it in postgres? -- Ivan Sergio Borgonovo http://www.webthatworks.it
On Wed, Jan 09, 2008 at 05:28:15PM +0100, Ivan Sergio Borgonovo wrote: > Does it make any sense *knowing* how the implementation works to load > records in a table in a specific order to improve performances? Well, this is more or less what CLUSTER does. There are some cases where happening to know about the order the table is in will yield happy effects, yes. A
On Jan 9, 2008 11:28 AM, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
Okay. Let me explain this againDoes it make any sense *knowing* how the implementation works to loadOn Wed, 09 Jan 2008 10:54:21 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera < alvherre@commandprompt.com> writes:
> > Josh Harrison escribió:
> >> Fine. I can use order by when I want to order it in terms of
> >> some columns. But What if I want to maintain the same order as
> >> in the database1? ie., I want my rows of TableABC in Database2
> >> to be the same order as the rows in TableABC in Database 1 ???
>
> > You can't.
>
> According to the SQL standard, a table is an *unordered* collection
> of rows, and the results of any query are produced in an
> unspecified order (unless you use ORDER BY). The ambiguity about
> row ordering is intentional and is exploited by most DBMSes
> including Postgres to improve implementation efficiency. If you
> assume there is such a thing as a specific ordering within a table,
> you'll live to regret it eventually.
records in a table in a specific order to improve performances?
And yeah I know that once you start deleting/updating row you may
lose the advantage you gained betting on some peculiarity of the
implementation... but in case you're dealing with a mostly static
table?
eg. if I'm importing a table does it make any sense to pre-sort it
before importing it in postgres?
Lets say you load the data related to a particular person sequentially into a table in the test database. This results in all of the data for any one person being located one or a very few sequential data blocks. Testing access to the person's data then reveals access to be very fast. However, the data is never loaded in that way in the production database. It is almost always spread out across many data blocks within the database, roughly organized by the date-time in which the data arrived.In this case access to a particular person's data is not as fast as compared to the previous one where the data are located close to each other.
we have this problem when we compare Oracle's performance with postgres since Oracle has index-only scan where it can access the data just from the index when the query involves only indexed columns. But since postgres currently doesn't implement index-only scan and goes to the heap for fetching the tuples it becomes very slow when the data are shuffled
Let me know if it makes sense now
Thanks
josh
On Jan 9, 2008 11:39 AM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
On Wed, Jan 09, 2008 at 05:28:15PM +0100, Ivan Sergio Borgonovo wrote:Well, this is more or less what CLUSTER does. There are some cases where
> Does it make any sense *knowing* how the implementation works to load
> records in a table in a specific order to improve performances?
happening to know about the order the table is in will yield happy effects,
yes.
You are right. Sometimes when i cluster the table according to the frequently accessed indexes then it makes queries pretty fast. But its not a feasible solution always since some tables have more indexes which are accessed frequently. So clustering the table according to one index will yield poor performance to queries involving other indexes. Index-only scan is a good solution for this I guess for queries involving indexed columns (like in oracle) !!!
josh
Josh Harrison <joshques@gmail.com> schrieb: > My questions > 1. I pg_dumped dummy and Shuffled_dummy (from database1) to another database > (database2) > When I issued the query in both database (database1 and database2) > > select * from dummy limit 1000 ( the planner chooses seq scan for this query) > select * from shuffled_dummy limit 1000 (planner chooses seq scan) > > > 2. Also when does the planner switch from choosing index scan to bitmap index > scan? Is it dependent on the number of rows to be retrieved or the position of > the relevant data in the blocks or something else? For a select * ... without a WHERE the db can't use an index, this query forced a seq-scan. A index is used when: - a index are created - a propper WHERE or ORDER BY in the query - this index is useful (a index isn't useful, for instance, for a small table or when almost all rows are in the result set) A bitmap index scan performed when: - 2 or more propper indexes available - see above Andreas -- Really, I'm not out to destroy Microsoft. That will just be a completely unintentional side effect. (Linus Torvalds) "If I was god, I would recompile penguin with --enable-fly." (unknow) Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°
On Wed, Jan 09, 2008 at 11:51:16AM -0500, Josh Harrison wrote: > accessed frequently. So clustering the table according to one index will > yield poor performance to queries involving other indexes. Maybe not poor, but certainly not optimised. > Index-only scan is a good solution for this I guess for queries involving > indexed columns (like in oracle) !!! I think I don't know what you mean by "index-only scan". Oracle can't be returning you data just by scanning an index, can it? It needs to get the tuple if you need it back. My bet is that you haven't tuned your vacuums correctly, or you aren't doing ANALYSE often enough on the affected tables, or you need to SET STATISTICS higher on some of the columns in order to get better estimates (and therefore better plans). A
On Jan 9, 2008 11:56 AM, Andreas Kretschmer <akretschmer@spamfence.net> wrote:
ThanksJosh Harrison <joshques@gmail.com> schrieb:> My questions
> 1. I pg_dumped dummy and Shuffled_dummy (from database1) to another database
> (database2)
> When I issued the query in both database (database1 and database2)
>
> select * from dummy limit 1000 ( the planner chooses seq scan for this query)> select * from shuffled_dummy limit 1000 (planner chooses seq scan)
>
>> 2. Also when does the planner switch from choosing index scan to bitmap indexFor a select * ... without a WHERE the db can't use an index, this query
> scan? Is it dependent on the number of rows to be retrieved or the position of
> the relevant data in the blocks or something else?
forced a seq-scan.
A index is used when:
- a index are created
- a propper WHERE or ORDER BY in the query
- this index is useful
(a index isn't useful, for instance, for a small table or when almost
all rows are in the result set)
A bitmap index scan performed when:
- 2 or more propper indexes available
- see above
In my database, I have a table 'person' containing roughly 30,000,000 records
explain select count(*) from person where person_id > 1147000000
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Aggregate (cost=307708.20..307708.21 rows=1 width=0)
-> Index Scan using person_pk on person (cost=0.00..307379.79 rows=131364 width=0)
Index Cond: (person_id > 1147000000::numeric)
3 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms]
This returns the result
count
--------
78718
But for this query where the condition is slightly different the query plan is different. The planner goes for bitmap index
explain select count(*) from person where person_id > 1146000000
QUERY PLAN
-----------------------------------------------------------------------------------------------
Aggregate (cost=342178.51..342178.52 rows=1 width=0)
-> Bitmap Heap Scan on person (cost= 3120.72..341806.71 rows=148721 width=0)
Recheck Cond: (person_id > 1146000000::numeric)
-> Bitmap Index Scan on person_pk (cost=0.00..3083.53 rows=148721 width=0)
Index Cond: (person_id > 1146000000::numeric)
5 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms]
and the result is
count
--------
90625
How does the planner choose the plan?
josh
On Jan 9, 2008 12:11 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
I had vacuumed and analysed the tables ...tried increasing the statistics too. But the performance compared to oracle for these types of queries(that i had mentioned above) is pretty slow
On Wed, Jan 09, 2008 at 11:51:16AM -0500, Josh Harrison wrote:Maybe not poor, but certainly not optimised.
> accessed frequently. So clustering the table according to one index will
> yield poor performance to queries involving other indexes.I think I don't know what you mean by "index-only scan". Oracle can't be
> Index-only scan is a good solution for this I guess for queries involving
> indexed columns (like in oracle) !!!
returning you data just by scanning an index, can it? It needs to get the
tuple if you need it back.
For example if I have a query like
select column2 from ABC where column1 > 20
and table ABC is indexed on (column1,column2) then Oracle will not goto the heap to fetch the tuples. It will return them from the index itself since the column information is available in the index. But postgres always goes to the heap even if it has all the queried columns in the index.
For queries that involves all the columns(like select * from ABC where column1>20) ... of course oracle goes to the heap to fetch them
select column2 from ABC where column1 > 20
and table ABC is indexed on (column1,column2) then Oracle will not goto the heap to fetch the tuples. It will return them from the index itself since the column information is available in the index. But postgres always goes to the heap even if it has all the queried columns in the index.
For queries that involves all the columns(like select * from ABC where column1>20) ... of course oracle goes to the heap to fetch them
My bet is that you haven't tuned your vacuums correctly, or you aren't doing
ANALYSE often enough on the affected tables, or you need to SET STATISTICS
higher on some of the columns in order to get better estimates (and
therefore better plans).
I had vacuumed and analysed the tables ...tried increasing the statistics too. But the performance compared to oracle for these types of queries(that i had mentioned above) is pretty slow
josh
On Wed, Jan 09, 2008 at 12:24:37PM -0500, Josh Harrison wrote: > For example if I have a query like > select column2 from ABC where column1 > 20 > and table ABC is indexed on (column1,column2) then Oracle will not goto the > heap to fetch the tuples. It will return them from the index itself since > the column information is available in the index. Ah, I didn't know that. If your description is right, there would seem to be some pretty severe I/O penalties when using an index that doesn't cover all the fields you want. I mean, I can see the gain, but it would seem that's a significant cost. Anyway, no point in speculating about the implementation details of a system I don't know in detail. > I had vacuumed and analysed the tables ...tried increasing the statistics > too. But the performance compared to oracle for these types of queries(that > i had mentioned above) is pretty slow I see. Without rather more information, we're not going to be able to help you. What you could do is take some detailed examples to the -performance list. Oracle is terrifically capable, but Postgres can usually hold its own when correctly tuned (there are some cases where it can't, though). A
"Josh Harrison" <joshques@gmail.com> writes: > Aggregate (cost=342178.51..342178.52 rows=1 width=0) > -> Bitmap Heap Scan on person (cost=3120.72..341806.71 rows=148721 width=0) > Recheck Cond: (person_id > 1146000000::numeric) > -> Bitmap Index Scan on person_pk (cost=0.00..3083.53 rows=148721 width=0) > Index Cond: (person_id > 1146000000::numeric) ... > How does the planner choose the plan? The short answer is that it looks at the "cost" for all the reasonable plans it can come up with and picks the plan with the lowest cost. Scans which return many rows will tend to prefer bitmap index scans (when they don't do a full sequential scan of the t table) since it avoids random access. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!