Index: openacs-4/packages/acs-lang/www/doc/i18n-requirements.html =================================================================== RCS file: /usr/local/cvsroot/openacs-4/packages/acs-lang/www/doc/i18n-requirements.html,v diff -u -r1.1 -r1.2 --- openacs-4/packages/acs-lang/www/doc/i18n-requirements.html 20 Apr 2001 20:51:09 -0000 1.1 +++ openacs-4/packages/acs-lang/www/doc/i18n-requirements.html 8 Aug 2003 12:21:28 -0000 1.2 @@ -1,688 +1,466 @@ - - - -
- - -by Henry Minsky, Yon Feldman, Lars Pind, others
- -- -
-
-internationalization (i18n) --- -The provision within a computer program of the capability of making -itself adaptable to the requirements of different native languages, -local customs and coded character sets. - - -
-locale -
- -The definition of the subset of a user's environment that depends on -language and cultural conventions. - -
-localization (L10n) -
-The process of establishing information within a computer system -specific to the operation of particular native languages, local -customs and coded character sets. - -
-globalization -
-A product development approach which ensures that software products -are usable in the worldwide markets through a combination of -internationalization and localization. - -
- -
-Building an application often involves making a number of assumptions -on the part of the developers which depend on their own culture. These -include constant strings in the user interface and system error -messages, names of countries, cities, order of given and family names -for people, syntax of numeric and date strings and collation order of -strings. - -
- -The ACS should be able to operate in languages and regions beyond US -English. The goal of ACS Globalization is to provide a clean and -efficient way to factor out the locale dependent functionality from -our applications, in order to be able to easily swap in alternate -localizations. -
-This in turn will reduce redundant, costly, and error prone rework -when targeting the toolkit or applications built with the toolkit to -another locale. -
- -The cost of porting the ACS to another locale without some kind of -globalization support would be large and ongoing, since without a -mechanism to incorporate the locale-specific changes cleanly back into -the code base, it would require making a new fork of the source code -for each locale. - - -
-
-
-
-
-
- - -
-
-
-
-
- -Since the internationalization APIs may potentially be used on every -page in an application, the overhead for adding internationalization to a -module or application must not cause a significant time delay in -handling page requests. -
-In many cases there are facilities in Oracle to perform various -localization functions, and also there are facilities in Java which we -will want to move to. So the design to meet the requirements will tend -to rely on these capabilities, or close approximations to them where -possible, in order to make it easier to maintain Tcl and Java ACS -versions. -
- - -
-
-What do they need to modify to make this work? Can their localization work -be easily folded in to future releases of ACS? - -
-
-The site would have an end-user visible UI to support these languages, -and the content management system must allow articles to be posted in -these languages. In some cases it may be necessary to make the -modules' admin UI's operate in more than one supported language, while in other -cases the backend admin interface can operate in a single language. - -
-
-
- - -
-
-Other application servers: ATG Dynamo, Broadvision, Vignette, ... ? Anyone -know how they deal with i18n ? - -
- -
-Mozilla i18N Guidelines: http://www.mozilla.org/docs/refList/i18n/l12yGuidelines.html - -
- - -
-
-See Content Repository Requirement 100.20 -- --10.10 Provide a consistent representation and API for creating and referencing a locale -
- -10.20 There will be a Tcl library of locale-aware formatting -and parsing functions for numbers, dates and times. Note that Java -has built-in support for these already. - -
-10.30 For each locale there will be default date, number and currency formats. -
- -
---20.10 The locale for a request should be computed by the following method, in descending -order of priority: -
-
get locale associated with subsite or package id - get locale from user preference - get locale from site wide default - - -20.20 An API will be provided for getting the current request locale from -the
ad_conn
structure. - -
-For example, what approaches could be used to implement a localizable -nav-bar mechanism for a site? A navigation bar might be made up of a -set of text strings and graphics, where the graphics themselves are -locale-specific, such as images of English or Japanese text (as on -www.arsdigita.com). It should be easy to specify alternate -configurations of text and graphics to lay out the page for different -locales. -
-Design note: Alternative mechanisms to implement this functionality -might include using templates, Java ResourceBundles, content-item -containers in the Content Repository, or some convention assigning a -common prefix to key strings in the message catalog. - -
- -
- ---40.10 Each message will be referenced via a unique key. - -
- -40.20 The key for a message will have some hierarchical structure to it, -so that sets of messages can be grouped with respect to a module name -or package path. - -
- -40.30 The API for lookup of a message will take a locale and message key as -arguments, and return the appropriate translation of that message for -the specified locale. - -
-40.40 The API for lookup of a message will accept an optional default string -which can be used if the message key is not found in the catalog. This -lets the developer get code working and tested in a single -language before having to initialize or update a message catalog. - -
- - -40.50 For use within templates, custom tags which invoke the message lookup -API will be provided. - -
- -40.60 Provide a method for importing and exporting a flat file of -translation strings, in order to make it as easy as possible to create -and modify message translations in bulk without having to use a web -interface. - -
-40.70 Since translations may be in different character sets, there must -be provision for writing and reading catalog files in different -character sets. A mechanism must exist for identifying the character -set of a catalog file before reading it. - -
-40.80 There should be a mechanism for tracking dependencies in the message -catalog, so that if a string is modified, the other translations of -that string can be flagged as needing update. - -
-40.90 The message lookup must be as efficient as possible so as not to slow -down the delivery of pages. - -
- -
-Design question: Is there any reason to implement the message catalog on top of the content repository as -the underlying storage and retrieval service, with a layer of caching for -performance? Would we get a nice user interface and version control -almost for free? - - - -
-50.0 A locale will have a primary associated character set -which is used to encode text in the language. When given a locale, we -can query the system for the associated character set to use. -
-The assumption is that we are going to use Unicode in our database to -hold all text data. Our current programming environments (Tcl/Oracle -or Java/Oracle) operate on Unicode data internally. However, since -Unicode is not yet commonly used in browsers and authoring tools, the -system must be able to read and write other character sets. In -particular, conversions to and from Unicode will need to be explicitly -performed at the following times: - -
-
-Same question for script and template files, how do we know what
-language and character set they are authored in? Should we overload
-the filename suffix (e.g., '.shiftjis.adp', '.ja_JP.euc.adp')?
-
-The simplest design is probably just to assign a default mapping from
-each locale to a character set: e.g. ja_JP -> ShiftJIS, fr_FR ->
-ISO-8859-1. +++ (see new ACS/Java notes) +++
Design question: Do we want to mandate that all template files
-be stored in UTF8? I don't think so, because most people don't have Unicode
-editors, or don't want to be bothered with an extra step to convert
-files to UTF8 and back when editing them in their favorite editor.
-
- -
-- -Tcl Source File Character Set
- -There are two classes of Tcl files loaded by the system; library files -loaded at server startup, and page script files, which are run on -each page request. - --
Should we require all Tcl files be stored as UTF8? That -seems too much of a burden on developers. - --50.10 Tcl library files can be authored in any character set. The system -must have a way to determine the character set before loading the files, probably from the filename. -
-50.20 Tcl page script files can be authored in any character set. The system -must have a way to determine the character set before loading the files, probably from the filename. -
- -
Submitted Form Data Character Set
- -50.30 Data which is submitted with a HTTP request using a GET or POST -method may be in any character set. The system must be able -to determine the encoding of the form data and convert it -to Unicode on demand. - --50.35 The developer must be able to override the default system -choice of character set when parsing and validating user form data. - -
-50.30.10 Extra hair: In Japan and some other Asian languages where there are multiple -character set encodings in common use, the server may need to attempt to -do an auto-detection of the character set, because buggy browsers may submit -form data in an unexpected alternate encoding. - - -
-
Output Character Set
- -50.40 The output character set for a page request will be determined by default by the -locale associated with the request (see requirement 20.0). - --50.50 It must be possible for a developer to manually override the output -character set encoding for a request using an API function. -
- -
-60.10 All ACS error messages must use the message catalog and the request locale -to generate error messages for the appropriate locale. -- --60.20 Web server error messages such as 404, 500, etc must also be delivered -in the appropriate locale. -
-60.30 Where files are written or read from disk, their filenames must use a -character set and character values which are safe for the underlying -operating system. -
- -
- -70.0 For a given abstract URL, the designer may create multiple locale-specific template files (one per locale or language) ---70.10 For a given page request, the system must be able to select -an appropriate locale-specific template file to use. -The request locale is computed as per requirement 20.0. - -
Design note: this would probably be implemented -by suffixing the locale or a locale abbreviation to the template filename, such as foo.ja.adp or foo.en_GB.adp. - -
- -
-70.20 A template file may be created for a partial locale (language only, without -a territory), and the request processor should be able to find the closest match for -the current request locale. - -
- -70.30 A template file may be created in any character set. The system must have a -way to know which character set a template file contains, so it can -properly process it. -
- - -
Formatting Datasource Output in Templates
- -70.50 The properties of a datasource column may include a datatype so that -the templating system can format the output for the current -locale. The datatype is defined by a standard ACS datatype plus a -format token or format string, for example: a date column might be -specified as 'current_date:date LONG,' or 'current_date:date -"YYYY-Mon-DD"' --
Forms
- -70.60 The forms API must support construction of locale-specific HTML form -widgets, such as date entry widgets, and form validation of user input -data for locale-specific data, such as dates or numbers. - -- -70.70 For forms which allow users to upload files, a standard -method for a user to indicate the charset of a text file being -uploaded must be provided. - -
Design note: -this presumably applies to uploading data to the content repository as -well -
- - -
-80.10 Support API for correct collation (sorting order) on lists of strings in a locale-dependent way. - --- -80.20 For the Tcl API, we will say that locale-dependent sorting will use Oracle SQL -operations (i.e., we won't provide a Tcl API for this). We require -a Tcl API function to return the correct incantation of NLS_SORT to use -for a given locale with
ORDER BY
clauses in queries. - --80.40 The system must handle full-text search in any supported language. - - -
-90.10 Provide API support for specifying a time zone --- -90.20 Provide an API for computing time and date operations which are aware -of timezones. So for example a calendar module can properly -synchronize items inserted into a calendar from users in different -time zones using their own local times. - -
-90.30 Store all dates and times in universal time zone, UTC. - -
- -90.40 For a registered user, a time zone preference should be stored. -
- -90.50 For a non-registered user a time zone preference should -be attached via a session or else UTC should be used to display -every date and time. -
- -90.60 The default if we can't determine a time zone is to display - - all dates and times in some universal time zone such as GMT. - -
- - -
- ---100.10 Since UTF8 strings can use up to three (UCS2) or six (UCS4) bytes -per character, make sure that column size declarations in the schema -are large enough to accommodate required data (such as email addresses -in Japanese). - -
- -
- -- - --110.10 The email message sending API will allow for a character set encoding to be specified. -
-110.20 The email accepting API will allow for character set to be parsed correctly (hopefully -a well formatted message will have a MIME character set content type header) - -
- - -
Document Revision # | -Action Taken, Notes | -When? | -By Whom? | -
---|---|---|---|
0.1 | -Creation | -11/08/2000 | -Henry Minsky | -
0.2 | -Minor typos fixed, clarifications to wording | -11/14/2000 | -Henry Minsky | -
0.3 | -comments from Christian | -1/14/2000 | -Henry Minsky | -
Last modified: $Date$
- - + + + + + + + +by Henry Minsky, Yon Feldman, Lars Pind, others
+internationalization (i18n) ++The provision within a computer program of the capability of +making itself adaptable to the requirements of different native +languages, local customs and coded character sets.
+locale
+The definition of the subset of a user's environment that +depends on language and cultural conventions.
+localization (L10n)
+The process of establishing information within a computer system +specific to the operation of particular native languages, local +customs and coded character sets.
+globalization
+A product development approach which ensures that software +products are usable in the worldwide markets through a combination +of internationalization and localization.
+
Building an application often involves making a number of +assumptions on the part of the developers which depend on their own +culture. These include constant strings in the user interface and +system error messages, names of countries, cities, order of given +and family names for people, syntax of numeric and date strings and +collation order of strings.
+The ACS should be able to operate in languages and regions +beyond US English. The goal of ACS Globalization is to provide a +clean and efficient way to factor out the locale dependent +functionality from our applications, in order to be able to easily +swap in alternate localizations.
+This in turn will reduce redundant, costly, and error prone +rework when targeting the toolkit or applications built with the +toolkit to another locale.
+The cost of porting the ACS to another locale without some kind +of globalization support would be large and ongoing, since without +a mechanism to incorporate the locale-specific changes cleanly back +into the code base, it would require making a new fork of the +source code for each locale.
+Since the internationalization APIs may potentially be used on +every page in an application, the overhead for adding +internationalization to a module or application must not cause a +significant time delay in handling page requests.
+In many cases there are facilities in Oracle to perform various +localization functions, and also there are facilities in Java which +we will want to move to. So the design to meet the requirements +will tend to rely on these capabilities, or close approximations to +them where possible, in order to make it easier to maintain Tcl and +Java ACS versions.
+What do they need to modify to make this work? Can their +localization work be easily folded in to future releases of +ACS?
+The site would have an end-user visible UI to support these +languages, and the content management system must allow articles to +be posted in these languages. In some cases it may be necessary to +make the modules' admin UI's operate in more than one supported +language, while in other cases the backend admin interface can +operate in a single language.
+Other application servers: ATG Dynamo, Broadvision, Vignette, +... ? Anyone know how they deal with i18n ?
+Mozilla +i18N Guidelines: +http://www.mozilla.org/docs/refList/i18n/l12yGuidelines.html
+ + + +See +Content Repository Requirement 100.20 ++10.10 Provide a consistent representation and API for +creating and referencing a locale
+10.20 There will be a Tcl library of locale-aware +formatting and parsing functions for numbers, dates and times. +Note that Java has built-in support for these already.
+10.30 For each locale there will be default date, number +and currency formats.
+
++20.10 The locale for a request should be computed by the +following method, in descending order of priority:
++
+- get locale associated with subsite or package id
+- get locale from user preference
+- get locale from site wide default +
+20.20 An API will be provided for getting the current +request locale from the
+ad_conn
structure.
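The priority order in 20.10 can be sketched as follows. This is a minimal illustration in Python rather than the toolkit's Tcl; the function name `resolve_request_locale`, its parameters, and the `en_US` site default are assumptions made for the example, not part of any ACS API.

```python
from typing import Optional


def resolve_request_locale(
    package_locale: Optional[str],
    user_locale: Optional[str],
    site_default: str = "en_US",  # assumed site-wide default, for illustration only
) -> str:
    """Return the request locale using the descending priority of requirement 20.10."""
    # 1. locale associated with the subsite or package id
    # 2. locale from the user preference
    for candidate in (package_locale, user_locale):
        if candidate:
            return candidate
    # 3. site-wide default
    return site_default
```

A caller would pass `None` for any source that has no locale configured, so the first configured source wins.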
For example, what approaches could be used to implement a +localizable nav-bar mechanism for a site? A navigation bar might be +made up of a set of text strings and graphics, where the graphics +themselves are locale-specific, such as images of English or +Japanese text (as on www.arsdigita.com). It should be easy to +specify alternate configurations of text and graphics to lay out +the page for different locales.
+Design note: Alternative mechanisms to implement this +functionality might include using templates, Java ResourceBundles, +content-item containers in the Content Repository, or some +convention assigning a common prefix to key strings in the message +catalog.
+++40.10 Each message will be referenced via a unique key.
+40.20 The key for a message will have some hierarchical +structure to it, so that sets of messages can be grouped with +respect to a module name or package path.
+40.30 The API for lookup of a message will take a locale +and message key as arguments, and return the appropriate +translation of that message for the specified locale.
+40.40 The API for lookup of a message will accept an +optional default string which can be used if the message key is not +found in the catalog. This lets the developer get code working and +tested in a single language before having to initialize or update a +message catalog.
+40.50 For use within templates, custom tags which invoke +the message lookup API will be provided.
+40.60 Provide a method for importing and exporting a flat +file of translation strings, in order to make it as easy as +possible to create and modify message translations in bulk without +having to use a web interface.
+40.70 Since translations may be in different character +sets, there must be provision for writing and reading catalog files +in different character sets. A mechanism must exist for identifying +the character set of a catalog file before reading it.
+40.80 There should be a mechanism for tracking +dependencies in the message catalog, so that if a string is +modified, the other translations of that string can be flagged as +needing update.
+40.90 The message lookup must be as efficient as possible +so as not to slow down the delivery of pages.
++
+Design question: Is there any reason to +implement the message catalog on top of the content repository as +the underlying storage and retrieval service, with a layer of +caching for performance? Would we get a nice user interface and +version control almost for free?
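Requirements 40.10 through 40.40 could be met by something like the sketch below. It is written in Python for illustration only; the in-memory `CATALOG`, the dotted key convention `package.message`, and the names `lookup_message` and `keys_for_package` are all hypothetical and say nothing about how the real catalog is stored.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory catalog keyed by (locale, message key).
# Keys are hierarchical: "<package>.<message>" (requirement 40.20).
CATALOG: Dict[Tuple[str, str], str] = {
    ("en_US", "acs-lang.welcome"): "Welcome",
    ("ja_JP", "acs-lang.welcome"): "ようこそ",
}


def lookup_message(locale: str, key: str, default: Optional[str] = None) -> str:
    """Return the translation for (locale, key), or the default if given (40.30, 40.40)."""
    try:
        return CATALOG[(locale, key)]
    except KeyError:
        if default is not None:
            return default
        raise KeyError(f"no message {key!r} for locale {locale!r}")


def keys_for_package(prefix: str) -> List[str]:
    """List the distinct message keys grouped under one package prefix."""
    return sorted({key for (_, key) in CATALOG if key.startswith(prefix + ".")})
```

The optional default lets a page work in one language before any catalog entry exists, which is the development workflow 40.40 describes.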
50.0 A locale will have a primary associated character +set which is used to encode text in the language. When given a +locale, we can query the system for the associated character set to +use.
+The assumption is that we are going to use Unicode in our +database to hold all text data. Our current programming +environments (Tcl/Oracle or Java/Oracle) operate on Unicode data +internally. However, since Unicode is not yet commonly used in +browsers and authoring tools, the system must be able to read and +write other character sets. In particular, conversions to and from +Unicode will need to be explicitly performed at the following +times:
+
+Design question: Do we want to mandate
+that all template files be stored in UTF8? I don't think so,
+because most people don't have Unicode editors, or don't want to be
+bothered with an extra step to convert files to UTF8 and back when
+editing them in their favorite editor.
Same question for script and template +files, how do we know what language and character set they are +authored in? Should we overload the filename suffix (e.g., +'.shiftjis.adp', '.ja_JP.euc.adp')?
+The simplest design is probably just to +assign a default mapping from each locale to a character set: e.g. +ja_JP -> ShiftJIS, fr_FR -> ISO-8859-1. +++ (see new ACS/Java +notes) +++
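The default locale-to-character-set mapping suggested above might look like this. The sketch is in Python for illustration; the table contents beyond the two examples in the text, the UTF-8 fallback, and the function names are assumptions.

```python
# Default mapping from locale to character set, using Python codec names.
# Only the two entries named in the text are included; the rest is assumed.
DEFAULT_CHARSET = {
    "ja_JP": "shift_jis",
    "fr_FR": "iso-8859-1",
}


def charset_for_locale(locale: str, fallback: str = "utf-8") -> str:
    """Return the primary character set for a locale (requirement 50.0)."""
    return DEFAULT_CHARSET.get(locale, fallback)


def encode_for_locale(text: str, locale: str) -> bytes:
    """Convert internal Unicode text to the locale's character set for output."""
    return text.encode(charset_for_locale(locale))


def decode_from_locale(data: bytes, locale: str) -> str:
    """Convert bytes received in the locale's character set to Unicode."""
    return data.decode(charset_for_locale(locale))
```

A single default per locale is simple, but it does not by itself cover the override cases in 50.35 and 50.50; those would need an extra argument or a per-request setting.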
+++Tcl Source File Character Set
+There are two classes of Tcl files loaded by the system; library +files loaded at server startup, and page script files, which are +run on each page request. ++
+Should we require all Tcl files be stored +as UTF8? That seems too much of a burden on +developers.50.10 Tcl library files can be authored in any character +set. The system must have a way to determine the character set +before loading the files, probably from the filename.
+50.20 Tcl page script files can be authored in any +character set. The system must have a way to determine the +character set before loading the files, probably from the +filename.
+Submitted Form Data Character Set
+50.30 Data which is submitted with a HTTP request using a +GET or POST method may be in any character set. The system must be +able to determine the encoding of the form data and convert it to +Unicode on demand. +50.35 The developer must be able to override the default +system choice of character set when parsing and validating user +form data.
+50.30.10 Extra hair: In Japan and some other Asian +languages where there are multiple character set encodings in +common use, the server may need to attempt to do an auto-detection +of the character set, because buggy browsers may submit form data +in an unexpected alternate encoding.
+Output Character Set
+50.40 The output character set for a page request will be +determined by default by the locale associated with the request +(see requirement 20.0). +50.50 It must be possible for a developer to manually +override the output character set encoding for a request using an +API function.
+
60.10 All ACS error messages must use the +message catalog and the request locale to generate error messages +for the appropriate locale. ++60.20 Web server error messages such as 404, 500, etc +must also be delivered in the appropriate locale.
+60.30 Where files are written or read from disk, their +filenames must use a character set and character values which are +safe for the underlying operating system.
+
70.0 For a given abstract URL, the designer may +create multiple locale-specific template files (one +per locale or language) ++70.10 For a given page request, the system must be able +to select an appropriate locale-specific template file to use. The +request locale is computed as per requirement 20.0.
+Design note: this would probably be +implemented by suffixing the locale or a locale abbreviation to the +template filename, such as foo.ja.adp or +foo.en_GB.adp.
+70.20 A template file may be created for a partial locale +(language only, without a territory), and the request processor +should be able to find the closest match for the current request +locale.
+70.30 A template file may be created in any character +set. The system must have a way to know which character set a +template file contains, so it can properly process it.
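The closest-match rule in 70.10 and 70.20, combined with the filename-suffix convention from the design note, can be sketched as below. Python is used for illustration; the helper names and the fallback to an unsuffixed `foo.adp` are assumptions, since the requirements do not say what happens when no locale-specific file exists.

```python
from typing import Iterable, List, Optional


def candidate_templates(stem: str, locale: str, extension: str = "adp") -> List[str]:
    """List template filenames to try, from most to least specific."""
    language = locale.split("_")[0]
    names = [f"{stem}.{locale}.{extension}"]         # full locale, e.g. foo.en_GB.adp
    if language != locale:
        names.append(f"{stem}.{language}.{extension}")  # language only, e.g. foo.en.adp
    names.append(f"{stem}.{extension}")              # assumed default: unsuffixed file
    return names


def select_template(stem: str, locale: str, available: Iterable[str]) -> Optional[str]:
    """Return the closest matching template among the available filenames."""
    present = set(available)
    for name in candidate_templates(stem, locale):
        if name in present:
            return name
    return None
```

For a request in `ja_JP` with only `foo.ja.adp` and `foo.adp` on disk, this picks `foo.ja.adp`.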
+Formatting Datasource Output in Templates
+70.50 The properties of a datasource column may include a +datatype so that the templating system can format the output for +the current locale. The datatype is defined by a standard ACS +datatype plus a format token or format string, for example: a date +column might be specified as 'current_date:date LONG,' or +'current_date:date "YYYY-Mon-DD"' +Forms
+70.60 The forms API must support construction of +locale-specific HTML form widgets, such as date entry widgets, and +form validation of user input data for locale-specific data, such +as dates or numbers. +70.70 For forms which allow users to upload files, a +standard method for a user to indicate the charset of a text file +being uploaded must be provided.
+Design note: this presumably applies to +uploading data to the content repository as well
+
80.10 Support API for correct collation (sorting +order) on lists of strings in a locale-dependent way. ++80.20 For the Tcl API, we will say that locale-dependent +sorting will use Oracle SQL operations (i.e., we won't provide a +Tcl API for this). We require a Tcl API function to return the +correct incantation of NLS_SORT to use for a given locale with +
+ORDER BY
clauses in queries.80.40 The system must handle full-text search in any +supported language.
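The helper asked for in 80.20 could return an Oracle `NLSSORT` expression for the request locale. The sketch below is Python for illustration; the locale-to-sort-name table and the `BINARY` fallback are assumptions, and the real function would be a Tcl proc.

```python
# Assumed mapping from locale to an Oracle linguistic sort name.
NLS_SORT_BY_LOCALE = {
    "fr_FR": "FRENCH",
    "de_DE": "GERMAN",
}


def order_by_clause(column: str, locale: str) -> str:
    """Build an ORDER BY clause that collates a column for the given locale."""
    # Accept only simple identifiers, since the column name is interpolated into SQL.
    if not column.replace("_", "").isalnum():
        raise ValueError(f"unsafe column name: {column!r}")
    sort_name = NLS_SORT_BY_LOCALE.get(locale, "BINARY")  # assumed fallback
    return f"ORDER BY NLSSORT({column}, 'NLS_SORT={sort_name}')"
```

Keeping the sort inside the query means the database does the collation, which matches the decision not to provide a sorting API on the Tcl side.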
+
90.10 Provide API support for specifying a time +zone ++90.20 Provide an API for computing time and date +operations which are aware of timezones. So for example a calendar +module can properly synchronize items inserted into a calendar from +users in different time zones using their own local times.
+90.30 Store all dates and times in universal time zone, +UTC.
+90.40 For a registered user, a time zone preference +should be stored.
+90.50 For a non-registered user a time zone preference +should be attached via a session or else UTC should be used to +display every date and time.
+90.60 The default if we can't determine a time zone is to +display all dates and times in some universal time zone such as +GMT.
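Requirements 90.30 through 90.60 amount to converting to UTC on input and to the user's zone on output. The Python sketch below illustrates this with fixed UTC offsets, which is a simplifying assumption: a real implementation would need named time zones with daylight-saving rules. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def to_utc(local_naive: datetime, utc_offset_hours: float) -> datetime:
    """Interpret a naive local time at the given offset and return it in UTC (90.30)."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=zone).astimezone(timezone.utc)


def for_display(stored_utc: datetime, utc_offset_hours: Optional[float] = None) -> datetime:
    """Convert a stored UTC time to the user's zone, or leave it in UTC if unknown (90.60)."""
    if utc_offset_hours is None:
        return stored_utc.astimezone(timezone.utc)
    return stored_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
```

With this split, a calendar entry made at 09:00 by a user at UTC+9 and viewed by a user at UTC-5 refers to the same instant for both.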
+
++100.10 Since UTF8 strings can use up to three (UCS2) or +six (UCS4) bytes per character, make sure that column size +declarations in the schema are large enough to accommodate required +data (such as email addresses in Japanese).
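The sizing concern in 100.10 is easy to check by measuring encoded length rather than character count. The sketch below is Python for illustration, and the helper names are hypothetical. Note that UTF-8 as currently defined in RFC 3629 uses at most four bytes per character; the six-byte figure reflects the earlier definition.

```python
def utf8_length(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_column(text: str, column_bytes: int) -> bool:
    """Check whether the UTF-8 encoding of the text fits a byte-sized column."""
    return utf8_length(text) <= column_bytes
```

Three Japanese characters occupy nine bytes, so a column declared for three bytes of ASCII would reject them.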
+
++110.10 The email message sending API will allow for a +character set encoding to be specified.
+110.20 The email accepting API will allow for character +set to be parsed correctly (hopefully a well formatted message will +have a MIME character set content type header)
+
Document Revision # | +Action Taken, Notes | +When? | +By Whom? | +
---|---|---|---|
0.1 | +Creation | +11/08/2000 | +Henry Minsky | +
0.2 | +Minor typos fixed, clarifications to +wording | +11/14/2000 | +Henry Minsky | +
0.3 | +comments from Christian | +1/14/2000 | +Henry Minsky | +
Last modified: $Date$
+ +