Thread: Solution to the file name problem of COPY on Windows
Hi Tom-san.

I would like to solve one problem before the release of 8.4. However, since it also looks like a new feature, you may suggest deferring it to 8.5 if it is not ready in time for 8.4.

In Japan, local file names on a server are handled in SJIS. For example, with the present Postgres:

    server_encoding = UTF-8
    client_encoding = SJIS

In this case the file name given to COPY is in UTF-8, which is troublesome to handle. :-(
So I made this proposed patch.

Regression test:

    =======================
     All 120 tests passed.
    =======================

With a UTF-8 database:

    HIROSHI=# \l
                            List of databases
       Name    |  Owner  | Encoding | Collation | Ctype | Access privileges
    -----------+---------+----------+-----------+-------+-------------------
     HIROSHI   | HIROSHI | UTF8     | C         | C     |
     eucdb     | HIROSHI | EUC_JP   | C         | C     |

    HIROSHI=# create table 日本語てすと (きー text);
    CREATE TABLE
    HIROSHI=# insert into 日本語てすと values('わーい');
    INSERT 0 1
    HIROSHI=# copy 日本語てすと to 'C:/tmp/日本語UTF8.txt';
    COPY 1
    HIROSHI=# delete from 日本語てすと;
    DELETE 1
    HIROSHI=# copy 日本語てすと from 'C:/tmp/日本語UTF8.txt';
    COPY 1
    HIROSHI=# select * from 日本語てすと;
      きー
    --------
     わーい
    (1 row)

With an EUC_JP database:

    HIROSHI=# \c eucdb
    psql (8.4devel)
    You are now connected to database "eucdb".
    eucdb=# \d
                 List of relations
     Schema |     Name     | Type  |  Owner
    --------+--------------+-------+---------
     public | 日本語てすと | table | HIROSHI
    (1 row)

    eucdb=# select * from 日本語てすと;
      きー
    --------
     わーい
    (1 row)

    eucdb=# copy 日本語てすと to 'C:/tmp/日本語eucdb.txt';
    COPY 1
    eucdb=# delete from 日本語てすと;
    DELETE 1
    eucdb=# copy 日本語てすと from 'C:/tmp/日本語eucdb.txt';
    COPY 1
    eucdb=# select * from 日本語てすと;
      きー
    --------
     わーい
    (1 row)

The resulting files on disk:

    C:\tmp>dir 日本語*
     Volume in drive C is SYS
     Volume Serial Number is 1433-2C7C

     Directory of C:\tmp

    2009/04/07  13:58                 8 日本語eucdb.txt
    2009/04/07  13:58                 8 日本語utf8.txt
                   2 File(s)             16 bytes

It seems to work very comfortably!! What do you think?

Regards,
Hiroshi Saito
Attachment
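To make the reported symptom concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the attached patch. It assumes Python's `cp932` codec as a stand-in for the SJIS code page of a Japanese Windows system, and it only shows that the same file name is a different byte sequence in each encoding.

```python
# Illustration only: the same file name as bytes in two encodings.
name = "日本語UTF8.txt"

utf8_bytes = name.encode("utf-8")   # the form a UTF-8 server holds
sjis_bytes = name.encode("cp932")   # the form a Japanese Windows ANSI API expects

print(len(utf8_bytes), len(sjis_bytes))

# If the UTF-8 bytes are passed on unchanged and read as cp932, the name
# is garbled (errors="replace" marks bytes that cannot be decoded).
garbled = utf8_bytes.decode("cp932", errors="replace")
print(garbled == name)

# Converting from the server encoding to the OS encoding keeps the name.
converted = utf8_bytes.decode("utf-8").encode("cp932")
print(converted == sjis_bytes)
```

The point of the sketch is only that some conversion step has to sit between the server encoding and the file system; where that step belongs is what the rest of the thread discusses.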
"Hiroshi Saito" <z-saito@guitar.ocn.ne.jp> writes: > I want to solve one problem before the release of 8.4. > However, since it also seems to be the new feature, > if not enough for 8.4, you may suggest that it is 8.5. I'm not too clear on what this is really supposed to accomplish, but we are hardly going to put code like that into every single file access in Postgres, which is what seems to be the logical implication. Shouldn't we just tell people to use a database encoding that matches their system environment? regards, tom lane
Hi,

"Hiroshi Saito" <z-saito@guitar.ocn.ne.jp> wrote:
> At this time, a copy file name is UTF-8. It was troubled by handling.:-(
> Then, I make this proposal patch.

I think the problem is not specific to Windows; it exists on every platform where the database encoding doesn't match the OS's encoding.

Instead of Windows-specific code, how about adding GetPlatformEncoding() and converting all *absolute* paths? The conversion would be performed at the lowest API layer, i.e., BasicOpenFile(). Standard database file accesses with RelFileNode are not affected because they use *relative* paths.

There are some issues:
* Is it possible to determine the platform encoding?
* The above cannot handle a non-ASCII path under $PGDATA. Is that acceptable?
* On Windows, the native encoding is UTF-16, but we will use SJIS if we take the above approach. Is that limitation acceptable?

Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
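The idea of converting only absolute paths can be sketched as follows. This is a Python illustration under stated assumptions, not the proposed C code: `get_platform_encoding` and `convert_path` are hypothetical names, the platform encoding is hard-coded to `cp932`, and `ntpath` is used so that Windows-style paths are judged the same way on any OS.

```python
import ntpath

def get_platform_encoding() -> str:
    """Hypothetical stand-in for the proposed GetPlatformEncoding()."""
    return "cp932"  # assumption: a Japanese Windows system

def convert_path(path: bytes, db_encoding: str) -> bytes:
    """Convert an absolute path from the database encoding to the
    platform encoding; leave relative paths and matching encodings alone."""
    platform_encoding = get_platform_encoding()
    if db_encoding == platform_encoding:
        return path                      # nothing to do
    text = path.decode(db_encoding)
    if not ntpath.isabs(text):
        return path                      # e.g. relation files under the data directory
    return text.encode(platform_encoding)

copy_target = "C:/tmp/日本語UTF8.txt".encode("utf-8")
print(convert_path(copy_target, "utf-8") == "C:/tmp/日本語UTF8.txt".encode("cp932"))
print(convert_path(b"base/1/1259", "utf-8"))
```

Relative paths pass through untouched, which is why ordinary relation file access would not pay for the conversion in this scheme.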
Tom Lane wrote:
> "Hiroshi Saito" <z-saito@guitar.ocn.ne.jp> writes:
>> I want to solve one problem before the release of 8.4.
>> However, since it also seems to be the new feature,
>> if not enough for 8.4, you may suggest that it is 8.5.
>
> I'm not too clear on what this is really supposed to accomplish, but
> we are hardly going to put code like that into every single file access
> in Postgres, which is what seems to be the logical implication.
> Shouldn't we just tell people to use a database encoding that matches
> their system environment?

Unfortunately (as usual) under Japanese Windows there's no database encoding that matches the system environment.

As for the file name in the COPY command, there's little point in converting it to the server encoding because the file name is irrelevant to the database. Because Windows is Unicode (UTF-16) based, it seems natural to convert the file name to wide characters once.

regards,
Hiroshi Inoue
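The suggestion to convert the file name to wide characters once can be illustrated with a small Python sketch. `to_wide` is a hypothetical helper, and `cp932` again stands in for SJIS; the sketch only shows that a single conversion to UTF-16 yields the same result whichever encoding the name arrived in.

```python
def to_wide(path: bytes, source_encoding: str) -> bytes:
    """Decode the path once and re-encode it as UTF-16 (little-endian),
    the form used by wide-character Windows file APIs."""
    return path.decode(source_encoding).encode("utf-16-le")

name = "C:/tmp/日本語.txt"

from_client = to_wide(name.encode("cp932"), "cp932")   # name as sent by an SJIS client
from_server = to_wide(name.encode("utf-8"), "utf-8")   # name as held by a UTF-8 server

print(from_client == from_server)
```

With this approach the SJIS limitation raised earlier in the thread would not apply, since UTF-16 can represent names that the ANSI code page cannot.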
Hi.

----- Original Message -----
From: "Hiroshi Inoue" <inoue@tpf.co.jp>

> Tom Lane wrote:
>> "Hiroshi Saito" <z-saito@guitar.ocn.ne.jp> writes:
>>> I want to solve one problem before the release of 8.4.
>>> However, since it also seems to be the new feature,
>>> if not enough for 8.4, you may suggest that it is 8.5.
>>
>> I'm not too clear on what this is really supposed to accomplish, but
>> we are hardly going to put code like that into every single file access
>> in Postgres, which is what seems to be the logical implication.
>> Shouldn't we just tell people to use a database encoding that matches
>> their system environment?
>
> Unfortunately (as usual) under Japanese Windows there's no database
> encoding that matches the system environment.
> As for the file name in COPY command, there's little meaning to
> convert it to the server encoding because the file name is irrelevant
> to the database. Because Windows is Unicode(UTF-16) based, it seems
> natural to convert the file name to wide characters once.

Yes. If a server encoding matching Windows could be chosen, everything would work well. Regrettably, that is not possible.

Regards,
Hiroshi Saito
Hi Itagaki-san.

Um, my focus was on helping with the problem that cannot be avoided. I am less concerned with problems that can be avoided depending on usage. However, I would gladly work on it if it helps a lot.

Regards,
Hiroshi Saito

----- Original Message -----
From: "Itagaki Takahiro" <itagaki.takahiro@oss.ntt.co.jp>

> Hi,
>
> "Hiroshi Saito" <z-saito@guitar.ocn.ne.jp> wrote:
>
>> At this time, a copy file name is UTF-8. It was troubled by handling.:-(
>> Then, I make this proposal patch.
>
> I think the problem is not only in Windows but also in all platforms
> where the database encoding doesn't match their OS's encoding.
>
> Instead of Windows specific codes, how about adding GetPlatformEncoding()
> and convert all of *absolute* paths? It would be performed at the lowest
> API layer; i.e., BasicOpenFile(). Standard database file accesses with
> RelFileNode are not affected because it uses *relative* paths.
>
> There are some issues:
> * Is it possible to determine the platform encoding?
> * The above cannot handle non-ascii path under $PGDATA.
>   Is it acceptable?
> * In Windows, the native encoding is UTF-16, but we will use SJIS
>   if we take on the above method. Is the limitation acceptable?
>
> Comments welcome.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
"Hiroshi Saito" <z-saito@guitar.ocn.ne.jp> wrote: > Um, I had a focus in help the problem which is not avoided. > I am not sensitive to a problem being avoided depending on usage. > However, I will wish to work spontaneously, when it is help much. I'll research whether encoding of filesystem path is affected by locale settings or not in some platforms. Also, we need to research where we should get the system encoding when the locale is set to "C", which is popular in Japanese users. I'll report to you the progress :) Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:
> "Hiroshi Saito" <z-saito@guitar.ocn.ne.jp> wrote:
>
> > Um, I had a focus in help the problem which is not avoided.
> > I am not sensitive to a problem being avoided depending on usage.
> > However, I will wish to work spontaneously, when it is help much.
>
> I'll research whether encoding of filesystem path is affected by
> locale settings or not in some platforms. Also, we need to research
> where we should get the system encoding when the locale is set to "C",
> which is popular in Japanese users.

Here is a patch that implements GetPlatformEncoding() and converts absolute file paths from the database encoding to the platform encoding.

Since path encodings are converted in AllocateFile() and BasicOpenFile(), the patch covers not only COPY TO/FROM but also almost all other file operations. Callers of the file access functions don't have to modify their code.

Please test the patch on a variety of platforms. I tested it on Windows and Linux, and found that {PG_UTF8, "ANSI_X3.4-1968"} is required in encoding_match_list in src/port/chklocale.c on Linux (FC6).

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Attachment
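As background to the encoding_match_list remark: "ANSI_X3.4-1968" is, as far as I know, the codeset name glibc reports for the "C" locale, and it denotes plain ASCII. The Python sketch below is an illustration only. `MATCH_LIST` and `encoding_for_codeset` are hypothetical and merely mimic the shape of a codeset-to-encoding table; the last entry mirrors the pair mentioned in the message above, and the other entries are assumptions.

```python
import codecs

# Python's codec registry treats this codeset name as an alias of ASCII.
print(codecs.lookup("ANSI_X3.4-1968").name)

# Hypothetical match table in the spirit of encoding_match_list:
# (encoding to use, codeset name reported by the platform).
MATCH_LIST = [
    ("utf-8", "UTF-8"),
    ("euc_jp", "EUC-JP"),
    ("utf-8", "ANSI_X3.4-1968"),  # mirrors {PG_UTF8, "ANSI_X3.4-1968"} from the message
]

def encoding_for_codeset(codeset: str):
    """Return the encoding matched to a codeset name, or None if unknown."""
    for encoding, known in MATCH_LIST:
        if known.lower() == codeset.lower():
            return encoding
    return None

print(encoding_for_codeset("ANSI_X3.4-1968"))
print(encoding_for_codeset("KOI8-R"))
```

A codeset that is missing from the table yields no answer at all, which is presumably why the extra entry was needed for the test on Linux.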
Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Here is a patch to implement GetPlatformEncoding() and convert absolute
> file paths from database encoding to platform encoding.

This seems like a fairly significant overhead added to solve a really minor problem (if it's not minor why has it never come up before?).

I'm also not convinced by any of the details --- why are GetACP and pg_get_encoding_from_locale the things to look at, and why is fd.c an appropriate place to hook in? Surely if we need it here, we need it in places like initdb as well. But really this is much too low a level to be solving the problem at. If we have to convert path encodings in the backend, we should be doing it once somewhere around the place where we identify the value of PGDATA. It should not be necessary to repeat all this for every file access within the database directory.

regards, tom lane
Hi.

In any case, I appreciate the discussion.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>

> Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
>> Here is a patch to implement GetPlatformEncoding() and convert absolute
>> file paths from database encoding to platform encoding.
>
> This seems like a fairly significant overhead added to solve a really
> minor problem (if it's not minor why has it never come up before?).
>
> I'm also not convinced by any of the details --- why are GetACP and
> pg_get_encoding_from_locale the things to look at, and why is fd.c an
> appropriate place to hook in? Surely if we need it here, we need it in
> places like initdb as well. But really this is much too low a level to
> be solving the problem at. If we have to convert path encodings in the
> backend, we should be doing it once somewhere around the place where we
> identify the value of PGDATA. It should not be necessary to repeat all
> this for every file access within the database directory.

Ahh, I agree that this is a sensitive problem that requires careful handling. However, I show the following test to help your understanding. This is a case that cannot work unless Itagaki-san's patch is applied.

    C:\work>set PGDATA=C:\tmp\日本語 data
    C:\work>set PGPORT=5444
    C:\work>set PGHOME=C:\MinGW\local\pgsql
    C:\work>cmd.exe
    Microsoft Windows XP [Version 5.1.2600]
    (C) Copyright 1985-2001 Microsoft Corp.

    C:\work>initdb -E UTF-8 --no-locale
    The files belonging to this database system will be owned by user "HIROSHI".
    This user must also own the server process.

    The database cluster will be initialized with locale C.
    The default text search configuration will be set to "english".

    setting permissions on directory C:/tmp/日本語 data ... ok
    creating subdirectories ... ok
    selecting default max_connections ... 100
    selecting default shared_buffers ... 32MB
    creating configuration files ... ok
    creating template1 database in C:/tmp/日本語 data/base/1 ... ok
    initializing pg_authid ... ok
    initializing dependencies ... ok
    creating system views ... ok
    loading system objects' descriptions ... ok
    creating conversions ... ok
    creating directories ... ok
    setting privileges on built-in objects ... ok
    creating information schema ... ok
    vacuuming database template1 ... ok
    copying template1 to template0 ... ok
    copying template1 to postgres ... ok

    WARNING: "trust" authentication is enabled for local connections.
    You can change this by editing pg_hba.conf or by using the -A option
    the next time you run initdb.

    Success. You can now start the database server using:

        "postmaster" -D "C:/tmp/日本語 data"
    or
        "pg_ctl" -D "C:/tmp/日本語 data" -l logfile start

    C:\work>set PGCLIENTENCODING=SJIS
    C:\work>psql postgres
    psql (8.4beta1)
    Type "help" for help.

    postgres=# create table 日本語(きー text);
    CREATE TABLE
    postgres=# insert into 日本語 values('いれた');
    INSERT 0 1
    postgres=# copy 日本語 to 'C:/tmp/日本語 data/日本語utf8.txt';
    COPY 1
    postgres=# delete from 日本語;
    DELETE 1
    postgres=# copy 日本語 from 'C:/tmp/日本語 data/日本語utf8.txt';
    COPY 1
    postgres=# select * from 日本語;
      きー
    --------
     いれた
    (1 row)

    C:\work>dir "C:\tmp\日本語 data"
     Volume in drive C is SYS
     Volume Serial Number is 1433-2C7C

     Directory of C:\tmp\日本語 data

    2009/04/13  23:22    <DIR>          .
    2009/04/13  23:22    <DIR>          ..
    2009/04/13  23:18    <DIR>          base
    2009/04/13  23:19    <DIR>          global
    2009/04/13  23:17    <DIR>          pg_clog
    2009/04/13  23:17             3,616 pg_hba.conf
    2009/04/13  23:17             1,611 pg_ident.conf
    2009/04/13  23:17    <DIR>          pg_multixact
    2009/04/13  23:23    <DIR>          pg_stat_tmp
    2009/04/13  23:17    <DIR>          pg_subtrans
    2009/04/13  23:17    <DIR>          pg_tblspc
    2009/04/13  23:17    <DIR>          pg_twophase
    2009/04/13  23:17                 4 PG_VERSION
    2009/04/13  23:17    <DIR>          pg_xlog
    2009/04/13  23:17            17,112 postgresql.conf
    2009/04/13  23:19                38 postmaster.opts
    2009/04/13  23:19                24 postmaster.pid
    2009/04/13  23:22                 8 日本語utf8.txt
                   7 File(s)         22,413 bytes
                  11 Dir(s)  42,780,246,016 bytes free
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > Here is a patch to implement GetPlatformEncoding() and convert absolute
> > file paths from database encoding to platform encoding.
>
> This seems like a fairly significant overhead added to solve a really
> minor problem (if it's not minor why has it never come up before?).

It's not always a minor problem in Japan. It has been discussed in Japanese user groups several times.

However, I certainly should pay attention to performance. One solution might be to cache the encoding in GetPlatformEncoding(). There will be no overhead when the database encoding and the platform encoding are the same, which would be the typical case.

> It should not be necessary to repeat all
> this for every file access within the database directory.

That's why I added the is_absolute_path() check there. We can avoid conversion in normal file accesses under PGDATA because relative paths are used for them.

But I should have checked all file accesses, not only in the backend but also in client programs. I'll research them...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
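The caching idea mentioned above might look roughly like this. It is a Python sketch under assumptions, not the patch itself: `get_platform_encoding` and `needs_conversion` are hypothetical names, the platform encoding is taken from the process locale, and the `lookups` counter exists only to show that the lookup runs once.

```python
import codecs
import locale
from functools import lru_cache

lookups = {"count": 0}  # instrumentation for the demonstration only

@lru_cache(maxsize=None)
def get_platform_encoding() -> str:
    """Determine the platform encoding once and cache the answer."""
    lookups["count"] += 1
    return codecs.lookup(locale.getpreferredencoding(False)).name

def needs_conversion(db_encoding: str) -> bool:
    """Fast path: no conversion when both encodings are the same codec."""
    return codecs.lookup(db_encoding).name != get_platform_encoding()

for _ in range(3):
    needs_conversion("utf-8")

print(lookups["count"])
```

After the first call, each check costs a cache hit and a name comparison, so in the common case where the two encodings match no path is ever converted.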
Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> There are some issues:
> * Is it possible to determine the platform encoding?

There is no platform encoding on Linux. The file name encoding depends on the user's locale, so different users can have different file name encodings.

--
Sergey Burladyan
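This last point can be illustrated in Python. On Linux the kernel treats a file name as a plain byte string, so how it reads depends on the locale of whoever looks at it. The sketch below is an illustration only; `as_seen_by` is a hypothetical helper that decodes the stored bytes the way a tool running under a given locale encoding might.

```python
name = "日本語.txt"
on_disk = name.encode("utf-8")   # bytes stored by a user with a UTF-8 locale

def as_seen_by(raw: bytes, locale_encoding: str) -> str:
    """Decode raw file-name bytes as a tool under that locale would
    (errors="replace" marks bytes that are invalid in that encoding)."""
    return raw.decode(locale_encoding, errors="replace")

print(as_seen_by(on_disk, "utf-8") == name)    # same locale as the creator
print(as_seen_by(on_disk, "euc_jp") == name)   # a user with an EUC-JP locale
```

The bytes on disk are identical in both cases; only the interpretation differs, which is why a single platform-wide answer is hard to define there.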