Thread: Re: Updated COPY CSV patch
Bruce Momjian said:

> OK, here is a new version of the patch that includes the grammar changes
> we agreed upon, SGML changes, and \copy support. I will not make any more
> changes without contacting you so feel free to make adjustments and repost.

Excellent. Quick work :-) I will test later today. Meanwhile, quick response to your comments.

> I have two open issues. First, CSV should support WITH OIDS, no?

Why on earth would you want to? OIDs only have meaning to postgresql. Dumping to/from CSVs should not be seen as an alternative to Postgresql's normal text or binary file formats. Rather, it is a way of exchanging data with other programs that can't conveniently read or write those formats, but can read/write CSVs. I expect the data to be making a one way trip. OIDs make no sense to those other programs.

For all those reasons I disallowed use of WITH OIDS for CSV mode.

If you think that we should have it, it will be simple enough to do. It just strikes me as a very odd thing to do.

> Second, I found a problem with NULLs. If I do:
>
>     test=> create table test (x text, y text);
>     CREATE TABLE
>     test=> insert into test values ('', NULL);
>     INSERT 17221 1
>     test=>
>
> then this:
>
>     test=> copy test to '/tmp/b' with csv;
>
> creates:
>
>     "",
>
> and this:
>
>     test=> copy test to '/tmp/b' with csv NULL 'fred';
>
> creates:
>
>     ,fred
>
> Is that logical? A non-null field went from "" to nothing.

Yes, I think it is logical, although it is strange, I agree. See below.

> I think it is caused by this code:
>
>     bool force_quote = (strcmp(string, null_print) == 0);
>     CopyAttributeOutCSV(string, delim, quote, escape, force_quote);
>
> The reason it happens is that when the null string is '', it matches a
> zero-length string, so the value is quoted. When the null string isn't
> blank, a zero-length string doesn't match the null string so it isn't
> quoted.
> I think we need to add special logic for zero-length strings so they are
> always quoted, even if there is a special null string. This will make our
> dumps more consistent, I think, or maybe the current behavior is OK. It
> just struck me as strange.

I agree it seems strange, but that's because you probably aren't that used to dealing with CSVs. The code you quote is there so that if *we* reload a CSV *we* created the null property is preserved. That's actually quite a neat property, and one I would have said is "beyond contract requirements" :-). The effect you have seen seems perverse because you have chosen a perverse null marker - I expect everyone (well, nearly everyone) who uses this will do what seems to me natural, and use '' as the null marker, and then, as you saw, the results look quite intuitive.

The real test of this is how well it plays with other programs, rather than how well data makes a round trip from postgresql to postgresql. In that sense, we need to let it out into the wild a bit and see what happens. There is a bit of an impedance mismatch between the way databases look at data and the way spreadsheets look at data. And CSV is a horribly brain-dead data format. Its one saving grace is that it is a lowest common denominator format that practically every data handling program under the sun understands. So there is going to be a bit of oddness. And we may have to make the odd tweak here and there.

One area that we should think about as an enhancement is NOT NULL fields. As it stands now, we will get what we normally get when we try to insert a null into a NOT NULL field, namely an error. If the field has a simple literal default we could force that. And in the special case of text/varchar fields, it would be reasonable to force an empty string even if no default is set. There isn't a nice easy answer, I'm afraid. We shouldn't hold up putting this in on that account, but handling this better is certainly a TODO.
(MySQL solves this problem by requiring every NOT NULL field to have a default, and silently supplying one if it is not specified, and then inserting a null is the same as inserting DEFAULT. I don't think we want to go down that road .... this bit me badly a few weeks back when I applied some schema changes someone had carelessly designed relying on this behaviour. Of course I was running PostgreSQL and the whole thing blew up on me.)

> I did a dump/reload test with a null string and null, and it worked fine.
>
> Is there any data that can not be dumped/reloaded via CSV?

I know it works fine for geometric types. I have seen it dump ACLs quite happily, although I didn't try to reload them. I will do some testing soon with arrays and byteas.

cheers

andrew
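[The force_quote rule under discussion can be sketched outside the backend. A minimal Python emulation of the export-side decision - the helper name is hypothetical, this is not the patch's C code - showing why a non-null value equal to the null marker must be force-quoted to survive a round trip:]

```python
def csv_field_out(value, null_print=""):
    """Write one CSV field: None becomes the bare null marker; a non-null
    value that happens to equal the null marker is force-quoted so it
    cannot be mistaken for NULL on reload."""
    if value is None:
        return null_print
    if value == null_print or any(c in value for c in ',"\r\n'):
        return '"%s"' % value.replace('"', '""')
    return value


# Bruce's example row ('', NULL) with the default null marker '':
print(",".join(csv_field_out(v) for v in ["", None]))          # -> "",
# and with NULL 'fred': '' no longer matches the marker, so it is bare:
print(",".join(csv_field_out(v, "fred") for v in ["", None]))  # -> ,fred
```

[Reloading with the same null marker can then tell the quoted empty string apart from the bare null marker, which is the round-trip property Andrew defends above.]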
On Tue, Apr 13, 2004 at 06:58:24 -0400, Andrew Dunstan <andrew@dunslane.net> wrote:

> One area that we should think about as an enhancement is NOT NULL fields.
> As it stands now, we will get what we normally get when we try to insert
> a null into a NOT NULL field, namely an error. If the field has a simple
> literal default we could force that. And in the special case of
> text/varchar fields, it would be reasonable to force an empty string even
> if no default is set. There isn't a nice easy answer, I'm afraid. We
> shouldn't hold up putting this in on that account, but handling this
> better is certainly a TODO.

If you try to insert NULLs into a nonnull field you should get an error. If you have unquoted empty strings, and are not using the empty string as the NULL marker, then you can just not set the NULL code to be the empty string. If you need to turn this on and off by column, then perhaps it would be better to do that externally.

As for setting default values, I think that is a good idea. I suggested it a while back. There could be another keyword, DEFAULT, on the COPY FROM command that is used to define a code that will be replaced by the default value (or NULL if there is no default for a column) similar to how the NULL code is replaced by NULL.
Bruno Wolff III wrote: >On Tue, Apr 13, 2004 at 06:58:24 -0400, > Andrew Dunstan <andrew@dunslane.net> wrote: > > >>One area that we should think about as an enhancement is NOT NULL fields. >>As it stands now, we will get what we normally get when we try to insert >>a null into a NOT NULL field, namely an error. If the field has a simple >>literal default we could force that. And in the special case of >>text/varchar fields, it would be reasonable to force an empty string even >>if no default is set. There isn't a nice easy answer, I'm afraid. We >>shouldn't hold up putting this in on that account, but handling this >>better is certainly a TODO. >> >> > >If you try to insert NULLs into a nonnull field you should get an error. >If you have unquoted empty strings, and are not using the empty string as >the NULL marker, then you can just not set the NULL code to be the empty >string. If you need to turn this on and off by column, then perhaps it >would be better to do that externally. > >As for setting default values, I think that is a good idea. I suggested >a while back. There could be another keyword, DEFAULT, on the COPY FROM >command that is used to define a code that will be replaced by the >default value (or NULL if there is no default for a column) similar to >how the NULL code is replaced by NULL. > > Well, as I indicated we can deal with this in a subsequent round, I think. However, here's an idea. We know (or can easily discover) if there is a NOT NULL constraint that can apply to the attribute (or domain if it is a domain type). If isnull is set on the read-in value in such a case, instead of trying to insert null, and knowing we would fail, try to insert the value we actually read (usually ''), even though we think we found a null. This will succeed with text fields, and fail with, for example, int fields. That strikes me as quite reasonable behaviour, although perhaps qualifying for a warning. Or perhaps we could enable such behaviour with a flag. 
Of course, this would be for CSV mode only - standard TEXT mode should work as now. cheers andrew (trying to be creative here)
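[The fallback Andrew floats here can be emulated in a few lines. Names are hypothetical; the real backend would consult pg_attribute.attnotnull and the input function of the column's type:]

```python
def read_csv_value(raw, looks_null, col_not_null, parse):
    """If the field matched the null marker but the column is NOT NULL,
    try the literally-read value (usually '') instead of NULL; text
    accepts it, while e.g. int('') still raises an error."""
    if looks_null and not col_not_null:
        return None
    return parse(raw)


print(read_csv_value("", True, False, str))   # None: nullable column keeps NULL
print(repr(read_csv_value("", True, True, str)))   # '': NOT NULL text gets ''
try:
    read_csv_value("", True, True, int)       # NOT NULL int: still an error
except ValueError:
    print("error, as before")
```

[This is exactly the "succeed with text fields, fail with int fields" behaviour described above; Tom's objection downthread is to the guessing itself, not to the mechanics.]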
Andrew Dunstan wrote:

>> I have two open issues. First, CSV should support WITH OIDS, no?
>
> Why on earth would you want to? OIDs only have meaning to postgresql.
> Dumping to/from CSVs should not be seen as an alternative to Postgresql's
> normal text or binary file formats. Rather, it is a way of exchanging
> data with other programs that can't conveniently read or write those
> formats, but can read/write CSVs. I expect the data to be making a one
> way trip. OIDs make no sense to those other programs.
>
> For all those reasons I disallowed use of WITH OIDS for CSV mode.
>
> If you think that we should have it, it will be simple enough to do. It
> just strikes me as a very odd thing to do.

While doing OIDs seems atypical, it seems like a reasonable thing that CSV should be able to do. Basically, I see no reason to disable it.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: >While doing OIDs seems atypical, it seems like a reasonable thing that >CSV should be able to do. Basically, I see no reason to disable it. > > > OK. We have bigger fish to fry ;-) cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Well, as I indicated we can deal with this in a subsequent round, I > think. However, here's an idea. We know (or can easily discover) if > there is a NOT NULL constraint that can apply to the attribute (or > domain if it is a domain type). If isnull is set on the read-in value in > such a case, instead of trying to insert null, and knowing we would > fail, try to insert the value we actually read (usually ''), even though > we think we found a null. This will succeed with text fields, and fail > with, for example, int fields. That strikes me as quite reasonable > behaviour, although perhaps qualifying for a warning. Or perhaps we > could enable such behaviour with a flag. I think this is all a *really really* bad idea. COPY is not supposed to be an "intuit what the user meant" facility. It is supposed to give reliable, predictable results, and in particular it *is* supposed to raise errors if the data is wrong. If you want intuition in interpreting bogus CSV files then an external data conversion program is the place to do it. > Of course, this would be for CSV mode only - standard TEXT mode should > work as now. The CSV flag should not produce any fundamental changes in COPY's behavior, only change the external format it handles. Guessing has no place here. regards, tom lane
Andrew Dunstan wrote:

>> As for setting default values, I think that is a good idea. I suggested
>> it a while back. There could be another keyword, DEFAULT, on the COPY FROM
>> command that is used to define a code that will be replaced by the
>> default value (or NULL if there is no default for a column) similar to
>> how the NULL code is replaced by NULL.
>
> Well, as I indicated we can deal with this in a subsequent round, I
> think. However, here's an idea. We know (or can easily discover) if
> there is a NOT NULL constraint that can apply to the attribute (or
> domain if it is a domain type). If isnull is set on the read-in value in
> such a case, instead of trying to insert null, and knowing we would
> fail, try to insert the value we actually read (usually ''), even though
> we think we found a null. This will succeed with text fields, and fail
> with, for example, int fields. That strikes me as quite reasonable
> behavior, although perhaps qualifying for a warning. Or perhaps we
> could enable such behavior with a flag.
>
> Of course, this would be for CSV mode only - standard TEXT mode should
> work as now.

I see that the default NULL for CSV mode is ''. I was hoping the default was something more special. Right now, by default, comma-comma is a null and comma-double-quote-double-quote-comma is a zero-length string. I am thinking there should be a way to set NULL to be either of those, or neither of those, in which case comma-comma is a zero-length string too.

To me, these characteristics are a property of the file, not of the individual fields. For example, WITH NULL BOTH would allow ,, and ,"", to both be null, while using WITH NULL NONE, both ,, and ,"", are zero-length strings. And, finally, the default is WITH NULL STRICT (or SOME) where ,, is NULL and ,"", is the zero-length string. Those are all existing keywords, and those special NULL values would only be available in CSV mode.

I am not sure what NULL '' should do in these cases.
I am thinking we would actually disable it for CSV mode because you would need to define which '' you are talking about. If you specify an actual string for NULL like WITH NULL 'fred', then both ,, and ,"", are zero-length strings, I think. Again, I can assist in making these modifications to the patch. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
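[A sketch of the import-side meaning of the three proposed settings. The WITH NULL BOTH|NONE|STRICT syntax is only a proposal at this point; the function name is illustrative:]

```python
def interpret_empty(was_quoted, null_mode):
    """How ,, (unquoted empty) and ,"", (quoted empty) would read back
    under the proposed file-level modes."""
    if null_mode == "BOTH":
        return None          # both forms are NULL
    if null_mode == "NONE":
        return ""            # both forms are zero-length strings
    # STRICT, the patch's current default: quoting decides
    return "" if was_quoted else None


for mode in ("BOTH", "NONE", "STRICT"):
    print(mode, repr(interpret_empty(False, mode)), repr(interpret_empty(True, mode)))
```

[Only STRICT preserves the distinction on reload, which is why Andrew objects below that BOTH and NONE break the unambiguous round trip.]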
Andrew Dunstan wrote: > Bruce Momjian wrote: > > >While doing OIDs seems atypical, it seems like a reasonable thing that > >CSV should be able to do. Basically, I see no reason to disable it. > > > > > > > > OK. We have bigger fish to fry ;-) Uh, sorry, what I meant was that we should have it working in this patch. No reason to leave it for later. We have to get this 100% right the first time because people will start relying on it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:

> Andrew Dunstan wrote:
>> Bruce Momjian wrote:
>>> While doing OIDs seems atypical, it seems like a reasonable thing that
>>> CSV should be able to do. Basically, I see no reason to disable it.
>>
>> OK. We have bigger fish to fry ;-)
>
> Uh, sorry, what I meant was that we should have it working in this
> patch. No reason to leave it for later. We have to get this 100% right
> the first time because people will start relying on it.

Yes. What *I* meant was that allowing it was OK with me, and not worth arguing over.

Incidentally, the patch looks OK at first glance, and seems to work fine, modulo today's little controversies, with this exception:

    if (csv_mode)
    {
        if (!quote)
            quote = "\"";
        if (!escape)
            escape = "\"";    /* should be "escape = quote;" */
    }

cheers

andrew
Andrew Dunstan wrote: > Yes. What *I* meant was that allowing it was OK with me, and not worth > arguing over. OK, thanks. > Incidentally, the patch looks OK at first glance, and seems to work > fine, modulo today's little controversies, with this exception: > > if (csv_mode) > { > if (!quote) > quote = "\""; > if (!escape) > escape = "\""; /* should be "escape = quote;" */ > } Oh, excellent point. On my other WITH NULL ALL|STRICT|NONE email, we could have '' be ALL and have it be WITH NULL ''|STRICT|NONE. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Andrew Dunstan wrote: >> For all those reasons I disallowed use of WITH OIDS for CSV mode. > While doing OIDs seems atypical, it seems like a reasonable thing that > CSV should be able to do. Basically, I see no reason to disable it. I agree. It's an orthogonal thing, and we shouldn't make CSV mode work differently from normal mode where we don't have to. regards, tom lane
Bruce Momjian wrote:

> I see that the default NULL for CSV mode is ''. I was hoping the
> default was something more special. Right now, by default, comma-comma
> is a null and comma-double-quote-double-quote-comma is a zero-length
> string. I am thinking there should be a way to set NULL to be either
> of those, or neither of those, in which case comma-comma is a
> zero-length string too.
>
> To me, these characteristics are a property of the file, not of the
> individual fields.
>
> For example, WITH NULL BOTH would allow ,, and ,"", to both be null,

I can't see a real world use for this setting. And I think it would break the property of the patch as it currently stands, that we can unambiguously import what we exported, no matter what the settings. I don't think we should abandon that lightly. Quite apart from any other reason because it makes testing easier (just compare what you wrote with what you read back).

> while using WITH NULL NONE, both ,, and ,"", are zero-length strings.

Again, I think this will break that property. But if that's what it takes to be able to import to a table with NOT NULL in at least some cases I could live with it. Just. But in the general case it won't work. Say you are importing into a table with the following defn: (a text, b text not null, c int). Then the line 'x,,' will fail on b if '' is null, and will fail on c if '' is empty string. And yet this sort of output is exactly what is to be expected from a spreadsheet.

> And, finally, the default is WITH NULL STRICT (or SOME) where ,, is NULL
> and ,"", is the zero-length string.

That's what happens now with the default.

> Those are all existing keywords, and those special NULL values would
> only be available in CSV mode.
>
> I am not sure what NULL '' should do in these cases. I am thinking we
> would actually disable it for CSV mode because you would need to define
> which '' you are talking about.
> If you specify an actual string for NULL like WITH NULL 'fred', then
> both ,, and ,"", are zero-length strings, I think.

I don't believe '' should be special, any more than 'fred' should be. As it stands now, NULL 'fred' does say that ,, and ,"", are empty strings.

> Again, I can assist in making these modifications to the patch.

I appreciate your efforts. But as indicated elsewhere, right now I'm leaning towards reworking this into a client, because the road seems to be blocked on doing what I regard as necessary in the backend.

cheers

andrew
Andrew Dunstan wrote:

> I don't believe '' should be special, any more than 'fred' should be. As
> it stands now, NULL 'fred' does say that ,, and ,"", are empty strings.
>
>> Again, I can assist in making these modifications to the patch.
>
> I appreciate your efforts. But as indicated elsewhere, right now I'm
> leaning towards reworking this into a client, because the road seems to
> be blocked on doing what I regard as necessary in the backend.

I don't agree about moving this to a client. It is too important for that. Many admin apps and psql need this capability and having it in a client just requires everyone to reinvent the wheel.

Let me think over your issues and see what I can come up with.

The patch isn't that big and already does 90% of what we need. We just need to get that extra 10%.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian said:

> Andrew Dunstan wrote:
>> I don't believe '' should be special, any more than 'fred' should be.
>> As it stands now, NULL 'fred' does say that ,, and ,"", are empty
>> strings.
>>
>>> Again, I can assist in making these modifications to the patch.
>>
>> I appreciate your efforts. But as indicated elsewhere, right now I'm
>> leaning towards reworking this into a client, because the road seems
>> to be blocked on doing what I regard as necessary in the backend.
>
> I don't agree about moving this to a client. It is too important for
> that. Many admin apps and psql need this capability and having it in a
> client just requires everyone to reinvent the wheel.
>
> Let me think over your issues and see what I can come up with.
>
> The patch isn't that big and already does 90% of what we need. We just
> need to get that extra 10%.

Yeah. Yesterday was a bad day - sorry if I was a bit liverish. Regrouping ...

The likely failures I see are these:

. on import, a null value is attempted to be inserted into a NOT NULL field, and
. on export, a column that should have not-null values quoted does not, confusing the foreign program.

I was trying to be clever and handle these automatically, causing a small explosion somewhere near Pittsburgh. How about if we are dumb and simply give finer grained control to the user? Let's say instead of your per table/file ALL|NONE|STRICT proposal, we have something like a "FORCE list-of-columns" option. On import, this would force isnull to be false and on export it would force quoting of non-null values just for those columns. The problems are not symmetric, but the solution is the same - make the user decide on a per column basis.

This would have the following happy results:

. what we write would remain unambiguously null-preserving
. you would never need to use (and shouldn't) this option for a postgresql<->postgresql round trip, assuming the table definitions were the same.
. even if you wrote a file using this option you could still reimport it correctly without using the option.
. (fingers crossed) Tom is kept happy.

cheers

andrew
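[The per-column FORCE idea, sketched in Python. This is the proposal, not committed syntax, and the names are hypothetical: on export a listed column always quotes non-null values, on import it never yields NULL.]

```python
def export_field(value, col, force_cols, null_print=""):
    """Columns in force_cols are always quoted when non-null, so a foreign
    program sees them unambiguously as text."""
    if value is None:
        return null_print
    if col in force_cols or value == null_print or any(c in value for c in ',"\r\n'):
        return '"%s"' % value.replace('"', '""')
    return value


def import_field(raw, was_quoted, col, force_cols, null_print=""):
    """Columns in force_cols force isnull to false: the raw text is kept
    even when it matches the null marker."""
    if col not in force_cols and not was_quoted and raw == null_print:
        return None
    return raw


print(export_field("02134", "zip", {"zip"}))        # "02134" - quoted, unambiguous text
print(repr(import_field("", False, "b", {"b"})))    # '' goes into a NOT NULL text column
print(repr(import_field("", False, "a", set())))    # None - an unforced column stays NULL
```

[Note the asymmetry Andrew mentions: forcing on export changes only presentation, while forcing on import changes what value reaches the table, yet both are a per-column user decision.]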
Bruce Momjian wrote: > I don't agree about moving this to a client. It is too important for > that. Many admin apps and psql need this capability and having it in > a client just requires everyone to reinvent the wheel. So far the only need I've heard of is dumping the data to an appropriately formatted text file and reading it into some external application. That surely sounds like something a specialized client could handle.
Peter Eisentraut wrote:

> Bruce Momjian wrote:
>> I don't agree about moving this to a client. It is too important for
>> that. Many admin apps and psql need this capability and having it in
>> a client just requires everyone to reinvent the wheel.
>
> So far the only need I've heard of is dumping the data to an
> appropriately formatted text file and reading it into some external
> application. That surely sounds like something a specialized client
> could handle.

But too many folks need this functionality, and if you create a client, how do you access it from other computers? It is a logical extension for COPY, I think.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Andrew Dunstan wrote:

> Again, I think this will break that property. But if that's what it
> takes to be able to import to a table with NOT NULL in at least some
> cases I could live with it. Just. But in the general case it won't work.
> Say you are importing into a table with the following defn: (a text, b
> text not null, c int). Then the line 'x,,' will fail on b if '' is null,
> and will fail on c if '' is empty string. And yet this sort of output is
> exactly what is to be expected from a spreadsheet.

I thought about this case. I think the big objection to the original idea was trying to tie the CSV behavior to specific types. With a type-agnostic system like PostgreSQL, doing that will certainly fail to catch all cases and be a maintenance nightmare. However, your idea of testing pg_attribute.attnotnull is acceptable, I think. It isn't type-specific.

One way of solving the above issue is to allow comma-comma to promote to a zero-length string if the column is NOT NULL. One idea would be to print a warning the first time a NOT NULL column sees a comma-comma, and state that it will be changed to a zero-length string. You could use STRICT to disable that and throw an error.

I don't see why this has to be type-specific. If a zero-length string doesn't match an int column, you throw an error. If you see comma-comma for an INT NOT NULL column, you have to throw an error. There is no proper value for the column. You might also want to do that anyway in some cases even if the column isn't NOT NULL. This is where your FORCE col1, col2 came from, where you can specify which columns should be NULL and which zero-length strings.

I guess the issue is can we get a single specification for the entire table, or do we need per-column specifications? Basically, we are trying to overload the CSV system to store null information by using the distinction of whether a column has quote-quote or nothing.

Let's look at your example again:

    CREATE TABLE test (col1 TEXT NOT NULL, col2 INT)

and you have a CSV input file of:

    ,

If we say both are NULL, it fails, and if we say neither is NULL, it fails. Do we need control for each column?

What if we go with preferring NULL for comma-comma, and then print warnings for NOT NULL columns and try the promotion. If you want comma-comma to be a zero-length string, you can create the column with NOT NULL, load the file, then ALTER TABLE to allow NULLs again. Basically, the NOT NULL specification on the column is the COPY CSV control method, rather than having it be in COPY.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: > >Do we need control for each column? What if we go with preferring NULL >for comma-comma, and then print warnings for NOT NULL columns and try >the promote. If you want comma-comma to be zero-length string, you can >create the column with NOT NULL, load the file, then ALTER TABLE to >allow NULL's again. Basically, the NOT NULL specification on the column >is the COPY CSV control method, rather than having it be in COPY. > > > If we can't do type specific stuff then we need to be able to have column-specific controls on export, at least. Consider a text column containing US 5-digit ZIP codes. If they are not quoted, a spreadsheet will almost certainly not preserve the leading zero some of them have, producing very undesirable results. However, a genuine numeric-type field must not be quoted, or the same spreadsheet won't see that value as a number. Unless we do stuff based on type, we have no way of knowing from the text representation of the data what we really need. Thus my proposal from this morning for column-specific user control over this aspect. And if we are going to have per column user control on export, why not on import too, to handle the NOT NULL problem? It might make life easier for us code-wise than chasing down nullability (e.g. in domains). cheers andrew
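[The ZIP-code hazard in concrete terms: a spreadsheet typically type-guesses unquoted fields, so the same five characters survive only when quoted. A crude emulation of such guessing - real spreadsheets vary, and this handles only integers:]

```python
def spreadsheet_guess(field):
    """Crude emulation of spreadsheet type-guessing: a quoted field stays
    text, an unquoted field that looks like an integer becomes a number."""
    if field.startswith('"') and field.endswith('"') and len(field) >= 2:
        return field[1:-1]            # quoted: keep as text
    try:
        return int(field)             # unquoted: coerce if it looks numeric
    except ValueError:
        return field


print(spreadsheet_guess('02134'))     # 2134 - leading zero lost
print(spreadsheet_guess('"02134"'))   # 02134 - preserved as text
```

[So a text column of ZIP codes must be quoted, while a genuine numeric column must not be, and nothing in the text representation alone tells the exporter which is which - hence the per-column (or type-based) control discussed above.]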
Andrew Dunstan wrote: > Bruce Momjian wrote: > > > > >Do we need control for each column? What if we go with preferring NULL > >for comma-comma, and then print warnings for NOT NULL columns and try > >the promote. If you want comma-comma to be zero-length string, you can > >create the column with NOT NULL, load the file, then ALTER TABLE to > >allow NULL's again. Basically, the NOT NULL specification on the column > >is the COPY CSV control method, rather than having it be in COPY. > > > > > > > > > If we can't do type specific stuff then we need to be able to have > column-specific controls on export, at least. > > Consider a text column containing US 5-digit ZIP codes. If they are not > quoted, a spreadsheet will almost certainly not preserve the leading > zero some of them have, producing very undesirable results. However, a > genuine numeric-type field must not be quoted, or the same spreadsheet > won't see that value as a number. Unless we do stuff based on type, we > have no way of knowing from the text representation of the data what we > really need. Thus my proposal from this morning for column-specific user > control over this aspect. And if we are going to have per column user > control on export, why not on import too, to handle the NOT NULL > problem? It might make life easier for us code-wise than chasing down > nullability (e.g. in domains). Wow, that is certainly an excellent point. When we import, we know the resulting data type, but spreadsheets don't, and rely on the quoting to know what to do with the value. The zipcode is an excellent example. You can't even test for leading zeros because then some spreadsheet values in the column are text and some numeric. We do have a column list capability with COPY already: COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN } Maybe we should extend that to control quoting on export and NULL handling on import. Does that solve our problems? FYI, do you have IM? 
I am: AIM bmomjian ICQ 151255111 Yahoo bmomjian MSN root@candle.pha.pa.us IRC bmomjian via FreeNode or EFNet -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:

> Wow, that is certainly an excellent point. When we import, we know the
> resulting data type, but spreadsheets don't, and rely on the quoting to
> know what to do with the value.
>
> The zipcode is an excellent example. You can't even test for leading
> zeros because then some spreadsheet values in the column are text and
> some numeric.

I talked to Andrew on IRC and we went over the open CSV issues.

We talked about how we could do quoting for zipcode in TEXT fields and not quote true numeric values without hardcoding datatypes into the system. The most tricky case was NUMERIC vs. TEXT with zipcodes. NUMERIC and TEXT have almost identical pg_type entries, so there is nothing there to help us.

I found parse_coerce.c::TypeCategory(), which contains information about which data type oids are in which grouping, e.g. DATETIME, STRING, NUMERIC, etc. It seems that function, if called with pg_type.typbasetype, could help determine if quotes should be used. My idea is to skip quotes for NUMERIC and DATETIME types, and quote everything else. This means that user-defined types will always be output with quotes, which is probably OK.

So, for open CSV items we have:

    o add oid dump/reload
    o handle loading of comma-comma into NOT NULL columns
    o handle quoting of TEXT type with zipcodes, etc.

For the NOT NULL cases, I am thinking we can just throw a warning the first time a comma-comma is loaded into a TEXT column and promote to a zero-length string. If the column is INT, it throws an error. If we head in this direction, we will not need any additional syntax except that which is in the patch already.

I have the master version of the patch and made his suggested changes for the default for 'escape'.

Comments?

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
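[Bruce's category-based rule, sketched. The category sets here are illustrative stand-ins for what parse_coerce.c::TypeCategory would report, not its actual groupings:]

```python
# Illustrative stand-ins for the NUMERIC and DATETIME type categories
NUMERIC_CATEGORY = {"int2", "int4", "int8", "float4", "float8", "numeric"}
DATETIME_CATEGORY = {"date", "time", "timestamp", "timestamptz"}


def should_quote(base_type):
    """Skip quotes only for numeric and datetime categories; quote
    everything else, including user-defined types."""
    return base_type not in NUMERIC_CATEGORY and base_type not in DATETIME_CATEGORY


print(should_quote("text"))      # True  - a text ZIP code stays text
print(should_quote("numeric"))   # False - spreadsheets see a number
print(should_quote("my_type"))   # True  - unknown/user-defined types are quoted
```

[Passing pg_type.typbasetype rather than the type oid itself is what would let a domain over text inherit text's quoting; Tom's objection below is to depending on TypeCategory at all.]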
Bruce Momjian wrote:

> I talked to Andrew on IRC and we went over the open CSV issues.
>
> We talked about how we could do quoting for zipcode in TEXT fields and
> not quote true numeric values without hardcoding datatypes into the
> system. The most tricky case was NUMERIC vs. TEXT with zipcodes.
> NUMERIC and TEXT have almost identical pg_type entries, so there is
> nothing there to help us.
>
> I found parse_coerce.c::TypeCategory(), which contains information about
> which data type oids are in which grouping, e.g. DATETIME, STRING,
> NUMERIC, etc. It seems that function, if called with
> pg_type.typbasetype could help determine if quotes should be used. My
> idea is to skip quotes for NUMERIC and DATETIME types, and quote
> everything else. This means that user-defined types will always be
> output with quotes, which is probably OK.
>
> So, for open CSV items we have:
>
>     o add oid dump/reload
>     o handle loading of comma-comma into NOT NULL columns
>     o handle quoting of TEXT type with zipcodes, etc.

Oh, one more item. CSV needs a way to output date values in DateStyle format, for use by spreadsheets that understand mm/dd/yy dates, but not ISO dates. This will not be on by default.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> Oh, one more item.  CSV needs a way to output date values in DateStyle
> format, for use by spreadsheets that understand mm/dd/yy dates but not
> ISO dates.  This will not be on by default.

Ah, I see COPY already honors DateStyle on output.  I will mention that in the COPY docs as a tip for CSV.
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I found parse_coerce.c::TypeCategory(), which contains information
> about which data type oids are in which grouping, e.g. DATETIME,
> STRING, NUMERIC, etc.  It seems that function, if called with
> pg_type.typbasetype, could help determine if quotes should be used.

TypeCategory is a crock that should have been done away with long ago.  We need to be working to eliminate it, not expand our dependency on it.

			regards, tom lane
Tom Lane wrote:
> TypeCategory is a crock that should have been done away with long ago.
> We need to be working to eliminate it, not expand our dependency on it.

Ah, so do we have any other way to identify the type of a field?  In particular, how do we identify numerics and dates?

If we don't find something, we have to push those decisions onto the user, which is kind of strange because deep down we must know this.

The only other thing we could do would be to add something to pg_type that says whether a type needs quotes.  That seems like overkill.
Bruce Momjian said:
> So, for open CSV items we have:
>
>     o  add OID dump/reload

There seems to be agreement on this at least.  All you need to do is remove these lines - AFAICS the OID code should be happily non-CSV-aware:

    /*
     * Don't allow OIDs in CSV mode
     */
    if (csv_mode && oids)		// FIX ME bjm
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("Cannot specify OIDS in CSV mode ")));

cheers

andrew
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> In particular, how do we identify numerics and dates?

We don't, and I'm not convinced that we should.  The entire concept is suspect in a type-agnostic system.  In particular, I've really got a problem with the fact that TypeCategory uses a fixed, nonexpansible set of categories.

Even if you try to push ahead with using TypeCategory, I'd lay a side bet that it doesn't work.  Will those spreadsheets that make this nonstandard assumption about quotes having semantic significance be able to cope with all the possible output formats from timestamptz, for instance?  How about timetz, which is in the same category?

> The only other thing we could do would be to add something to pg_type
> that says whether it needs quotes.  Seems like overkill.

Seems like wrong.  The real problem here is that "whether it needs quotes" depends on the program you are intending to export to, the semantic behavior you want, and likely also the phase of the moon.  I don't think we can hope to invent a COPY behavior rule that automagically gets this right.  What's needed is a tool that can output a user-customizable data format, and that seems to me to be outside COPY's charter.

			regards, tom lane
Tom Lane wrote:
> The real problem here is that "whether it needs quotes" depends on the
> program you are intending to export to, the semantic behavior you
> want, and likely also the phase of the moon.  I don't think we can
> hope to invent a COPY behavior rule that automagically gets this
> right.  What's needed is a tool that can output a user-customizable
> data format, and that seems to me to be outside COPY's charter.

Well, certainly one option is to quote everything.  That would import into everything just fine.  However, numbers really want to be numbers, and dates want to be dates.  Most spreadsheets understand them, though you are right that some apps might get confused.

One idea Andrew had was to allow the user to specify which fields get quotes and which don't.  I figured we could just skip quotes for our built-in numeric types, and maybe dates/times, and quote everything else.  I just checked OpenOffice and it understands those values.  The other date fields have spaces and such and are just treated as text anyway.

Sure, this is a hard problem, but with the patch I am working on, it gets pretty close with very little code change.  It isn't perfect, but it is closer than most folks are going to get with some external utility that will never get written or maintained properly.  We can't let the perfect be the enemy of the good on this one.  Too many folks ask for it.
Andrew Dunstan wrote:
> There seems to be agreement on this at least.  All you need to do is
> remove these lines - AFAICS the OID code should be happily
> non-CSV-aware.

Thanks.  Done.
Bruce Momjian said:
> Ah, so do we have any other way to identify the type of a field?  In
> particular, how do we identify numerics and dates?
>
> If we don't find something, we have to push those decisions onto the
> user, which is kind of strange because deep down we must know this.

To be honest, I would feel safer with the decision in the hands of users.  They know their data and needs better than we do.  Continuing with the zipcode example, if the foreign program is another database, the same problem should not arise as it does with a spreadsheet, because it should know the data type it is importing into, just like we do.  Since we don't know what the foreign program is, we would be making a "worst case" guess.  With user choice, we would be quoting in exactly these circumstances:

    .  looks like the null string but isn't null
    .  contains the quote or delimiter chars
    .  isn't null and the user told us to

I don't think that's unduly burdensome.  Using the syntax I suggested earlier, something like:

    copy mytable to 'mydata.csv' csv force zipcode;

seems OK to me.  I'm all in favor of low-tech solutions where appropriate. ;-)

cheers

andrew
Andrew Dunstan wrote:
> To be honest, I would feel safer with the decision in the hands of
> users.  They know their data and needs better than we do.  ...  With
> user choice, we would be quoting in exactly these circumstances:
>
>     .  looks like the null string but isn't null
>     .  contains the quote or delimiter chars
>     .  isn't null and the user told us to
>
> I don't think that's unduly burdensome.  Using the syntax I suggested
> earlier, something like:
>
>     copy mytable to 'mydata.csv' csv force zipcode;
>
> seems OK to me.

We could do smart defaults and let folks control it too.  That would work.

What would FORCE do?  We actually need quote control on output and null control on input, or am I wrong?  Do we have the input side OK with the warning idea?