Dear Colleagues,
I'd like to discuss some problems which arise when creating Web servers
in many non-English-speaking countries:
1) the need to present information in more than one language;
2) the existence of many character coding tables, especially in Russia, where
5 Cyrillic charsets exist.
The attached text is Konstantin Chuguev's report on this problem
(please send all questions about the report to joy@urc.ac.ru).
HTTP/1.1 proposes "Accept-Language" and "Accept-Charset" headers with
the list of acceptable languages/charsets, and corresponding "Content-Language"
and "text/x-charset" fields in the document body. But this approach is
incompatible with all Web caching systems.
One solution is to use the "Content-Language" and "text/x-charset" fields together
with the URL as the key for all cached documents.
I'd like to ask Duane and the other Squid developers: is this possible?
Could anybody propose another (better) solution?
Serge Krashakov, administrator of Chg-FREEnet
Multiple character set enabled WWW server
and conversion API
At present, with the rapid development of Russian, as well as
ex-USSR and East European, wide area network infrastructure, a very
important task is the creation and maintenance of information
resources in all fields of activities. It is desirable to keep many
of these resources in several languages, suitable both for native and
foreign users.
There is a problem with processing distributed text
information in many non-English-speaking countries: differences
between the character sets used on different operating systems. For example,
5 different charsets are currently used in Russia:
- CP866 (also known as DOS alternative Cyrillic charset);
- CP1251 - MS Windows Cyrillic charset;
- KOI8 (or KOI8-R) - de facto standard on Unix systems and
on the Russian Internet (RFC 1489);
- ISO-8859-5 - the only Cyrillic charset supported in MIME
and by large Unix-manufacturers such as
Sun Microsystems etc.;
- MacOS Cyrillic - on Macintosh computers.
Furthermore, there are many other countries where more than
one charset is used.
The most popular, widespread and powerful Internet service today
is the WWW. That is why the most useful task within
the general internationalization/localization problem is the creation
of multilingual and multiple character set enabled Web clients and
servers. Client software, however, is not considered
in this paper.
Multilingual means that the server contains documents in
different languages, organized in the way most convenient for the Web master,
and that a user can choose any language for any document. There are various
methods of doing this (at least in Russia, but most likely in other
countries as well):
1. The language negotiation mechanism defined in the HTTP/1.1
standard: the client sends the Accept-Language field with
a list of acceptable languages, and the server returns
the name of the chosen language in the Content-Language
header field of the response. In this case the same URL,
containing no information about the language, is used
for all versions of the document in different languages.
The value sent in the Accept-Language field can be
set in the browser's options just before loading a document.
Advantages of this method are the following:
- the document hierarchy on the server does not depend
on the languages supported by the server. A user can
change the document language from any place
in this hierarchy, not just on the main page. After
changing the language, the user remains at the same point
of the hierarchy, and can continue navigation from this
place;
- there is no information about language in hypertext
references from other documents. The server determines
the language every time it receives a request. If
the document version in the requested language is
missing, the list of all the languages available for this
document can be given to the user.
Although it is the preferable method, not all client
and server software supports it. What is more, all
existing proxy servers cache documents only by URL,
ignoring the Content-Language field, which makes this
method unsuitable for Web navigation through caching
proxies.
2. Explicitly setting the preferred language in a URL:
a) top-level directories are named after
the languages:
http://www.server.org/english/info/main.html
http://www.server.org/russian/info/main.html
http://www.server.org/french/info/main.html
The shortcoming of this method is the difficulty
of implementing a language change that leaves the user
in the same place in the document hierarchy.
Navigation within a single server is
usually implemented by means of relative
references. With this method, it is simple
to provide a language change only for documents
situated near the root (top level) of the hierarchy.
Since language names are in the root, absolute
references must be used to reach another
language version of a document. This makes it difficult
to move subtrees within the Web server hierarchy.
b) each directory has subdirectories named in the same way
as the languages:
http://www.server.org/info/english/main.html
http://www.server.org/info/russian/main.html
http://www.server.org/info/french/main.html
c) language name is in the file suffix:
http://www.server.org/info/main.en.html
http://www.server.org/info/main.ru.html
http://www.server.org/info/main.fr.html
d) the same with another suffix order:
http://www.server.org/info/main.html.en
http://www.server.org/info/main.html.ru
http://www.server.org/info/main.html.fr
The advantage of all 4 methods is that they allow
navigation through caching proxy servers. Their
disadvantage is the difficulty of maintaining references
from one language version of a document to the other
language versions, because all the versions must be
modified when a new language is added.
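The server-side choice in method 1 can be sketched as follows. This is only an illustration, assuming a simplified Accept-Language syntax (comma-separated tags with optional q values, no wildcard handling); the function name choose_language is hypothetical and not taken from any server.

```python
def choose_language(accept_language, available):
    """Pick the best available language for an Accept-Language value.

    Returns None when nothing matches, so that the caller can offer
    the list of available languages instead.
    """
    prefs = []
    for position, item in enumerate(accept_language.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by order of appearance.
        prefs.append((-quality, position, tag))
    for neg_quality, _, tag in sorted(prefs):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if tag in available:
            return tag
    return None


print(choose_language("ru, en;q=0.5", {"en", "fr"}))
```

Returning None rather than a default language keeps the second advantage listed above: the caller can answer with the list of languages that exist for the document.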
As it has been mentioned before, in many countries different
character sets are used for representing a text in the native
language on different operating systems. And since WWW has
the client-server architecture, and client software requesting a Web
server through the Internet runs on various operating systems,
the multilingual server should deal with multiple charsets.
The universal character set suitable for all the languages
(Unicode or ISO 10646) is necessary only when there is information
in several languages in the same document (dictionary pages, foreign
language manuals etc.). This is only one of many possible kinds of information
resource. Since support for several languages
in the same document requires the use of new protocol versions
(HTTP/1.1 and HTML/3.0), which are at an early stage of practical
application now (as are Unicode viewers and editors), the use
of the universal character set is not discussed in this paper.
It is expedient for the server to keep documents in one
designated character set for each language and to recode them on the fly
when replying to the client.
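As an illustration of on-the-fly recoding, here is a minimal sketch that uses Python's built-in codecs rather than the library described later in this paper. The function name recode and the choice to replace unrepresentable characters with "?" are assumptions made for the example.

```python
def recode(data, stored_charset, client_charset):
    """Recode bytes from the server's storage charset to the client's charset."""
    if stored_charset.lower() == client_charset.lower():
        return data  # same charset: send the document unchanged
    text = data.decode(stored_charset)
    # Characters missing from the target charset are replaced with "?".
    return text.encode(client_charset, errors="replace")


stored = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")  # kept in KOI8-R
for charset in ("cp1251", "cp866", "iso-8859-5", "mac-cyrillic"):
    print(charset, recode(stored, "koi8-r", charset).hex())
```

The document is stored once (here in KOI8-R) and each reply is converted to whichever of the five Russian charsets the client needs.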
Existing Web servers use various methods for choosing
the required character set:
1. Charset negotiation: the client sends a list of preferred
charsets in the Accept-Charset field and the server answers
with the charset name written in the Content-Type field:
Content-Type: text/html; charset=iso-8859-5
The same URL is used for different charset versions of
a document. This URL does not contain information about
a character set. The value sent in the Accept-Charset
field can be set in the browser's options just before
loading a document. At present this method is supported
by even less software than the language
negotiation method, and by no caching proxy server.
This method has the same advantages and the same disadvantage as
the language negotiation method.
2. Non-standard extension of the negotiation method:
a) by means of the content type negotiation - the client
sends the Accept field in the request header, and
indicates a "pseudotype" among its preferred content
types (e.g. text/x-charset-koi8). The preferred charset
is determined from such a "pseudotype". But, again,
many client programs do not allow the Accept field
to be set, which makes this method impossible to use.
b) the server can determine client's operating system or
some language related features, and choose
the corresponding charset for any language by means
of the User-Agent field. Thus, for the Russian language,
if the word "Windows" is present in the User-Agent
field, the server sends the document in the windows-1251
charset, and if the word "DOS" is present,
the charset is x-ibm866.
Unfortunately the contents of the User-Agent field have no
standardized format. In addition, there are situations
where different charsets are used on the same or similar
operating systems (e.g. ISO-8859-5 and KOI8-R for
Russian on Unix systems). A user may also prefer a charset
which is not standard for the given language. Therefore
this method is sometimes ineffective.
These 2 methods have the same advantages as the standard
negotiation method.
c) the server can keep the database with the "client host
(IP address or domain name) - charset name" pairs.
Clients interact with the database by means of HTML
forms located on the same server. The server always
knows the IP address of its peer. But if the client
works through a proxy, the server knows only the proxy's
address, and this method does not work.
3. Explicit indication of the charset name in a URL:
a) pseudo directories with conventional charset names
(here iso=iso-8859-5, koi=koi8-r, win=windows-1251):
http://www.server.org/iso/info/main.html
http://www.server.org/koi/info/main.html
http://www.server.org/win/info/main.html
b) different domain names of servers:
http://iso.www.server.org/info/main.html
http://koi.www.server.org/info/main.html
http://win.www.server.org/info/main.html
Conventional (non-standard, tied to some OS) charset names
can be used only for one definite language. Though such
names are shorter and clearer for the end user, these
methods are oriented to servers dealing with a single
language (not counting English and some other languages
using ASCII, which is a subset of almost all modern 8-bit
character sets), and are not applicable to a multilingual
server. Thus, for example, win=windows-1251 and
dos=x-ibm866 are true only for Russian.
c) different server ports for different charsets:
http://www.server.org:8080/info/main.html
http://www.server.org:8081/info/main.html
http://www.server.org:8082/info/main.html
This method is also oriented to a unilingual server
(4-5 charsets).
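Method 2a above (a "pseudotype" in the Accept field) can be sketched as follows. The text/x-charset- prefix comes from the example above; the function name charset_from_accept is hypothetical, and the parsing is deliberately simplified (parameters after ";" are ignored).

```python
PSEUDOTYPE_PREFIX = "text/x-charset-"


def charset_from_accept(accept):
    """Return the charset named by a text/x-charset-<name> pseudotype, or None."""
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if (media_type.startswith(PSEUDOTYPE_PREFIX)
                and len(media_type) > len(PSEUDOTYPE_PREFIX)):
            return media_type[len(PSEUDOTYPE_PREFIX):]
    return None


print(charset_from_accept("text/html, text/x-charset-koi8, */*;q=0.1"))
```

A request without any pseudotype gives None, in which case the server has to fall back to one of the other methods.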
As the above examples indicate, all the methods have
certain shortcomings. Obviously, a server supporting a greater number
of methods will have fewer shortcomings. Thus, at present,
when only a very small part of the client software supports the HTTP/1.1
negotiation method, the best multilingual server
supporting multiple charsets is one which can determine
character sets by both standard and non-standard negotiation methods,
and which allows the client program to set a language or charset
explicitly in a URL.
In addition, if the client program is a browser with a user
interface, it is necessary to give the user a simple and convenient means
of changing the language/charset without using the negotiation method.
The user should be able to choose a language or charset
by these means from any Web page (or any HTML document).
For example, this can be implemented as a row of buttons with language or
charset names in the upper or lower part of an HTML document, where
the button corresponding to the current choice looks pressed.
Alternatively, an HTML page can have a reference to the language/charset choice
page, and after making the choice the user automatically returns to the same
page (with the new language or charset).
A very useful feature of the multilingual WWW server is that
the structure and contents of HTML documents do not depend on the number
of supported languages and charsets. This means that, when adding a new
language or charset, the Web master does not have to modify already
existing pages, and if the document version in some
language is missing, the user can continue navigation simply by changing
the language.
Some languages (mainly Asian ones) use multibyte character
sets. It is practically impossible to build multibyte charset support
into existing public domain WWW server software without serious
changes to the programs.
MultiWeb is the system being developed at the South Ural
Regional Center of FREEnet (FREEnet is the Russian Network for
Research, Education and Engineering). At present the system includes
the following parts relatively independent of each other:
1. Character set recoding API.
2. Patch for an existing Web-server.
The character set recoding API consists of 3 parts:
- The library module itself, libcharset.a, with the API described in
the charset.h header file. The API is very simple and does not
aspire to become a standard. It supports only 8-bit
fixed-width charsets. However, it is enough for building
multiple charset support for almost all European languages
into existing software with minimal changes.
- The charset database, containing a description of each
charset used in the system. Each charset is given in a
simplified version of the format described in [RFC1345].
This format has been chosen because of the clear character
mnemonics it uses to represent each character. Here is
an example:
ISO_8859-5
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
NS IO D% G% IE DS II YI J% LJ NJ Ts KJ -- V% DZ
A= B= V= G= D= E= Z% Z= I= J= K= L= M= N= O= P=
R= S= T= U= F= H= C= C% S% Sc =" Y= %" JE JU JA
a= b= v= g= d= e= z% z= i= j= k= l= m= n= o= p=
r= s= t= u= f= h= c= c% s% sc =' y= %' je ju ja
N0 io d% g% ie ds ii yi j% lj nj ts kj SE v% dz
The recoding table from one charset to another is
built by a library function from the descriptions of these
two charsets. In almost all existing recoders, the initial
data are recoding tables. However, keeping the initial data
in the form of charset descriptions makes it possible to create recoding
tables from any charset to any other one. Besides, such
a representation of the information is clearer for a user.
- A set of utilities based on the library, used for different
purposes, e.g.:
+ output of the recoding table from one charset to another
in various formats (useful for example for including such
a table as an array into C source code);
+ recoding from standard input (or any file) to standard
output (this utility is used in our FTP server for
recoding Russian text files on-the-fly; see
ftp://ftp.urc.ac.ru/);
+ the recoding filter providing a user with transparent
recoding when working with a text in one charset from
a terminal supporting another one. Our center's staff
uses this utility during telnet sessions from DOS or
Windows machines (with the CP866 and CP1251 charsets respectively)
to the UNIX server (with KOI8-R).
The beta version of this API is currently available at
http://www.urc.ac.ru/staff/joy/ (as Charset Library).
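The idea of building a recoding table from two charset descriptions can be sketched as follows. This is not the libcharset API: the function names are hypothetical, each description is assumed to be a list of 256 mnemonics indexed by byte value, and characters absent from the target charset are mapped to "?" as one possible policy.

```python
def build_recoding_table(src, dst, fallback=ord("?")):
    """Build a 256-byte table mapping bytes of charset src to bytes of charset dst.

    src and dst are sequences of 256 character mnemonics (as in the
    charset description above), indexed by byte value.
    """
    if len(src) != 256 or len(dst) != 256:
        raise ValueError("each charset description needs 256 mnemonics")
    position = {}
    for code, mnemonic in enumerate(dst):
        position.setdefault(mnemonic, code)  # keep the first occurrence
    # For every source byte, find where the same character sits in dst.
    return bytes(position.get(mnemonic, fallback) for mnemonic in src)


def recode_with_table(data, table):
    """Recode a byte string with a table built by build_recoding_table."""
    return data.translate(table)


# Toy descriptions: made-up mnemonics, with two characters swapped in dst.
toy_src = ["c%d" % i for i in range(256)]
toy_dst = list(toy_src)
toy_dst[0x41], toy_dst[0x42] = toy_dst[0x42], toy_dst[0x41]
print(recode_with_table(b"ABC", build_recoding_table(toy_src, toy_dst)))
```

With N charset descriptions, any of the N*(N-1) tables can be generated on demand, whereas keeping ready-made tables would require storing each pair separately.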
The second part of the system is a Web server. It is a slightly
modified version of the Apache Web Server (version 1.1.1). The reasons
for choosing this particular server are:
- public domain software;
- one of the fastest Web servers among commercial,
shareware and public domain ones;
- high stability;
- content type and language negotiation methods are already
built into it;
- modular structure: the server consists of the kernel and
separate modules, each of them implements strictly defined
functions. Such a structure allows new features to be added
to the server by creating separate modules, without
changing the server's kernel and other modules. There are 29
modules included in the 1.1.1 version, and dozens of others
can be found on the Internet.
Due to these features the Apache server is now the most
widespread Web server on the Internet.
Unfortunately, the original structure of Apache
version 1.1.1 does not allow multicharset support to be provided without
kernel modification. Therefore about 50 lines of the source code had
to be added or changed.
The main work of determining the language (set in the URL) and
the character set (set both by means of the negotiation method and explicitly
in the URL) is done by a separate module,
mod_charset. The module calls a CGI script with all the necessary
arguments to give a list of available languages or charsets
to a user. The CGI script provides a user interface for choosing
a language or charset, and can be modified by the Web master for
seamless inclusion into the pages of any Web server.
The module also defines several directives, which can be
set both in the server's main configuration file and in
per-directory ones.
First of all, the original Apache server already has a
language negotiation mechanism, which uses the following directives:
AddLanguage (from the mod_mime module) allows you to specify
the language of a document. You can then use content negotiation
to give a browser a file in a language it can understand. Note that
the suffix does not have to be the same as the language keyword -
those with documents in Polish (whose net-standard language code is
pl) may wish to use "AddLanguage pl .po" to avoid the ambiguity
with the common suffix for perl scripts.
AddLanguage de .de
AddLanguage en .en
AddLanguage es .es
AddLanguage fr .fr
AddLanguage it .it
AddLanguage pl .pl
AddLanguage pt .pt
AddLanguage ru .ru
LanguagePriority (from the mod_negotiation module) allows
you to give precedence to some languages in case of a tie during
content negotiation. Just list the languages in decreasing order
of preference.
LanguagePriority en es ru pt de fr it pl
And the following instructions have been added by the new
module:
AddCharset defines character set name and its aliases and
loads a charset description table.
AddCharset iso-8859-1 iso_8859-1 latin1
AddCharset iso-8859-2 iso_8859-2 latin2
AddCharset iso-8859-3 iso_8859-3 latin3
AddCharset iso-8859-4 iso_8859-4 latin4
AddCharset iso-8859-5 iso_8859-5 iso-ir-144 iso-cyr cyrillic iso
AddCharset iso-8859-9 iso_8859-9 latin5
AddCharset x-cp1250 win-ee ms-ee win-1250 cp1250
AddCharset x-cp1251 win-cyr win-1251 cp1251
AddCharset x-cp1252 win-ansi win-1252 cp1252
AddCharset x-cp1253 win-greek ms-greek win-1253 cp1253
AddCharset x-cp1254 win-turk ms-turk win-1254 cp1254
AddCharset ibm437 cp437
AddCharset ibm850 cp850
AddCharset ibm852 cp852
AddCharset ibm855 cp855
AddCharset ibm857 cp857
AddCharset ibm860 cp860
AddCharset ibm861 cp861 cp-is
AddCharset ibm863 cp863
AddCharset ibm865 cp865
AddCharset x-ibm866 cp866 x-cp866 ibm866 dos-rus-alt
AddCharset ibm869 cp869 cp-gr
AddCharset koi8-r koi-8 koi8 koi
AddCharset x-ru-mac MacOS_Cyrillic macOS-cyr
AddUAType binds substrings of the User-Agent field received by
the server to a User Agent Type.
AddUAType DOS-OS2 DosLynx WebExplorer
AddUAType KOI-Unix Linux FreeBSD "via PRD" Arena Ariadna Lynx
AddUAType ISO-Unix X11 "X Window"
AddUAType Windows Win AIR_Mosaic IWENG/1 MSIE
AddUAType Macintosh Macintosh
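The substring matching behind AddUAType can be sketched as follows. This is an illustration only, not the mod_charset code: the function names are hypothetical, the sample configuration is an abbreviated copy of the directives above, and it is an assumption that the first matching directive in configuration order wins (which matters here, because "DosLynx" contains "Lynx" and "X Window" contains "Win").

```python
import shlex

SAMPLE_CONFIG = '''
AddUAType DOS-OS2 DosLynx WebExplorer
AddUAType KOI-Unix Linux FreeBSD "via PRD" Arena Ariadna Lynx
AddUAType ISO-Unix X11 "X Window"
AddUAType Windows Win AIR_Mosaic
AddUAType Macintosh Macintosh
'''


def parse_ua_types(config):
    """Return a list of (ua_type, substrings) in configuration order."""
    rules = []
    for line in config.splitlines():
        words = shlex.split(line)  # honours quoted substrings such as "via PRD"
        if len(words) >= 3 and words[0] == "AddUAType":
            rules.append((words[1], words[2:]))
    return rules


def detect_ua_type(user_agent, rules):
    """Return the first User Agent Type whose substring occurs in user_agent."""
    for ua_type, substrings in rules:
        if any(substring in user_agent for substring in substrings):
            return ua_type
    return None


rules = parse_ua_types(SAMPLE_CONFIG)
print(detect_ua_type("Mozilla/2.0 (Win95; I)", rules))
```

A User-Agent value that matches no substring yields None, so the server can go on to its next way of choosing a charset.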
AddLangCS sets the unique character set for the given language
and User Agent Type (the <UAT>:<charset> form), and also defines all
other character sets suitable for the given language (the <charset>
form). The first charset name encountered for the given language
defines the charset used on the server itself for this language.
AddLangCS de iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-9
AddLangCS de x-cp1250 Windows:x-cp1252 x-cp1254
AddLangCS de DOS-OS2:ibm850 ibm852 ibm857
AddLangCS en iso-8859-1 Windows:x-cp1252
AddLangCS es iso-8859-1 iso-8859-9 Windows:x-cp1252 x-cp1254
AddLangCS es DOS-OS2:ibm850 ibm857 ibm860
AddLangCS fr iso-8859-1 iso-8859-3 Windows:x-cp1252 x-cp1254
AddLangCS fr DOS-OS2:ibm850 ibm857 ibm863
AddLangCS it iso-8859-1 Windows:x-cp1252 DOS-OS2:ibm850 ibm857
AddLangCS pl iso-8859-2 Windows:x-cp1250 DOS-OS2:ibm852
AddLangCS pt iso-8859-1 iso-8859-9 Windows:x-cp1252 x-cp1254
AddLangCS pt DOS-OS2:ibm850 ibm857 ibm860
AddLangCS ru KOI-Unix:KOI8-R ISO-Unix:ISO-8859-5 Windows:x-cp1251
AddLangCS ru DOS-OS2:x-ibm866 ibm855 Macintosh:x-ru-mac
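One way to read the AddLangCS directives into a lookup table is sketched below. The function name and the dictionary layout are hypothetical; it is also an assumption that a charset given in the <UAT>:<charset> form counts as suitable for the language, and that charset names are case-insensitive.

```python
def parse_addlangcs(lines):
    """Return {language: {"server": cs, "by_ua": {uat: cs}, "allowed": [cs, ...]}}."""
    table = {}
    for line in lines:
        words = line.split()
        if len(words) < 3 or words[0] != "AddLangCS":
            continue
        entry = table.setdefault(
            words[1], {"server": None, "by_ua": {}, "allowed": []})
        for item in words[2:]:
            ua_type, separator, charset = item.rpartition(":")
            charset = charset.lower()
            if separator:  # the <UAT>:<charset> form
                entry["by_ua"][ua_type] = charset
            if entry["server"] is None:
                # The first charset met for a language is the one used on the server.
                entry["server"] = charset
            if charset not in entry["allowed"]:
                entry["allowed"].append(charset)
    return table


russian = parse_addlangcs([
    "AddLangCS ru KOI-Unix:KOI8-R ISO-Unix:ISO-8859-5 Windows:x-cp1251",
    "AddLangCS ru DOS-OS2:x-ibm866 ibm855 Macintosh:x-ru-mac",
])["ru"]
print(russian["server"], russian["by_ua"]["Windows"], russian["allowed"])
```

Several directives for the same language accumulate into one entry, which matches the way the configuration above splits long lists over two lines.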
AddLocale sets the locale for the given language(s) for use in
CGI scripts and executable includes called from server parsed HTML
files. This may be useful for search and other word processing tasks.
AddLocale de de
AddLocale en en
AddLocale es es
AddLocale fr fr
AddLocale it it
AddLocale pt pt
AddLocale ru_RU ru
LangChoiceHandler and CSChoiceHandler set the paths to the CGI
scripts which allow a user to choose any available language or charset
respectively. It can be the same script, since the choice category
(language or charset) can be determined from the script parameters.
LangChoiceHandler /cgi-bin/avail_choice
CSChoiceHandler /cgi-bin/avail_choice
The way MultiWeb operates is:
1. Document language detection: if the document URL does not
contain a language name (see further about explicit
language or charset setting format), then the language
negotiation method from the mod_mime and mod_negotiation
modules is used. Otherwise the language is determined from
the URL. If both methods yield no result, no information
about language is passed to the client. Accordingly, no
charset recoding is done and no information about charset
is passed.
2. Detection of the charset required by the client: if the URL
does not contain a charset name, the server tries to
determine it in several ways, in the order given below. If no
way yields a result, or if the requested charset is not
used for the given language, the server sends the document
in the charset in which the document is kept on
the server, i.e. no recoding is done. The ways are:
a) the standard charset negotiation method (the
Accept-Charset field);
b) the extension of the negotiation method (the
text/x-charset-<name> keywords in the Accept field);
c) the same with other keywords - text/x-cyrillic-<name>
(for compatibility with the Evgeny Mironov's Cyrillic
enabled Web server, widespread in FREEnet);
d) client software type detection from the User-Agent
field.
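The fallback logic of step 2 can be sketched as follows. This is one possible reading of the description, not the module's code: the function name choose_charset is hypothetical, each detection way is modelled as a callable returning a charset name or None, and it is assumed that the first way that yields a result decides the outcome.

```python
def choose_charset(ways, allowed, server_charset):
    """Return the charset in which the document should be sent.

    ways           - callables tried in order (URL, Accept-Charset,
                     Accept pseudotypes, User-Agent); each returns a
                     charset name or None
    allowed        - charsets used for the document's language
    server_charset - charset in which the document is kept on the server
    """
    for way in ways:
        requested = way()
        if requested is not None:
            requested = requested.lower()
            # A charset not used for this language means no recoding.
            return requested if requested in allowed else server_charset
    return server_charset  # no way yielded a result


allowed_for_russian = ["koi8-r", "iso-8859-5", "x-cp1251", "x-ibm866"]
print(choose_charset([lambda: None, lambda: "X-CP1251"],
                     allowed_for_russian, "koi8-r"))
```

Passing callables rather than ready values means that the later, less reliable ways (such as User-Agent matching) are evaluated only when the earlier ones give nothing.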
The language or charset can be set explicitly in the URL
in the following format:
http://<server name>[:<port>][/LANG=<language>][/CS=<charset>]/<path>
In this case the negotiation method is not used. The explicit
language or charset setting can be used by those client programs
which cannot be set up to use any of the negotiation methods cited above.
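Splitting such a URL path into its parts can be sketched as follows. The function name is hypothetical, and it is an assumption of this sketch that the keywords are matched in upper case and that LANG, when present, always comes before CS, as in the format above.

```python
def split_multiweb_path(path):
    """Split '/LANG=ru/CS=koi8-r/info/main.html' into (lang, cs, real_path).

    lang and cs are None when the keyword is absent, and '' when the
    keyword is present without a value.
    """
    lang = cs = None
    segments = path.split("/")[1:]  # drop the empty piece before the first '/'
    if segments and segments[0].startswith("LANG="):
        lang = segments.pop(0)[len("LANG="):]
    if segments and segments[0].startswith("CS="):
        cs = segments.pop(0)[len("CS="):]
    return lang, cs, "/" + "/".join(segments)


print(split_multiweb_path("/LANG=ru/CS=koi8-r/info/main.html"))
```

Because the keywords are ordinary leading path segments, the rest of the path is left untouched and can be handed to the server's normal file lookup.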
There are special variants of the LANG and CS keywords:
1. If the '*' sign is before or instead of a language or
charset name (e.g.
http://www.server.org/LANG=*/info/main.html), the server
gives the list of languages or charsets (correspondingly)
available for the current document (/info/main.html
in the example). For these purposes the server calls
the CGI script (set by Web master) with the arguments
"/<path>?lang=...&flang=...&alang=...&cs=...&scs=...&fcs=...&acs=...",
where:
- <path> - the current document path (that is written after
the LANG and CS keywords);
- lang - the language chosen for the current document;
- flang - the explicitly set language, if any (that is
written after 'LANG=');
- alang - the list of available languages (the line with
the language names separated by spaces);
- cs - the charset chosen for the current document;
- scs - the charset used for the given language on
the server;
- fcs - the explicitly set charset, if any (that is written
after 'CS=');
- acs - the list of available charsets for the given
language (separated by spaces).
After choosing a language or charset, the user receives
the current document. If both keywords are used
(/LANG=*/CS=*/...), then after choosing a language,
the user receives a charset choice page, and after that
comes back to the current document.
The Web master can use the language or charset choice list by
either making an HTML reference to it from any document
( ), or including it into a server parsed
(SHTML) document. In this case the path of the initial
document (the SHTML document, or the one from which the reference is made)
should be written after the LANG and CS keywords so that
the user returns to this document after the choice is made.
As a result, the document name is contained in the document
itself, and when the name changes it is necessary to modify
the document contents. To avoid this, the following keywords
can be used.
2. The '.' sign is used before or instead of the language or
charset names (e.g. http://www.server.org/LANG=./). In this
case a user receives the language or charset list as well,
but the current document (i.e. the one, which a user
receives after the choice) is determined automatically: it
is either the document, from which the reference
to the choice list is made, or the SHTML document,
into which the list is included.
Because the LANG and CS keywords are located
in a URL before the document's real path, when the user navigates
to another document through a relative reference,
the explicit language and charset settings are retained. But there
is a problem with absolute references. If a user is in the
http://www.server.org/LANG=ru/CS=koi8-r/info/main.html URL,
and passes through the ...
reference, the new URL will be
http://www.server.org/second.html, and information about
the explicit language and charset setting is lost. To
preserve this information the Web master can use the following
method.
3. When using either 'LANG=' or 'CS=' keyword without the
language or charset name (e.g.
<A Href="/CS=/second.html">...</A>), the language AND
charset are inherited from the current document. This
method works even when a user navigates between different
multiple charset enabled servers.
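The three special variants can be told apart by looking at the text that follows 'LANG=' or 'CS='. The sketch below is illustrative only: the function name and the action labels ("choose", "choose-auto", "inherit" and so on) are hypothetical and are not keywords of the MultiWeb server.

```python
def classify_setting(value):
    """Classify the text after 'LANG=' or 'CS=' in a URL.

    value is None when the keyword is absent from the URL.
    Returns an (action, name) pair.
    """
    if value is None:
        return ("negotiate", None)  # keyword absent: use negotiation
    if value == "":
        return ("inherit", None)  # variant 3: inherit from the current document
    if value[0] == "*":
        # Variant 1: choice list, return path written in the URL.
        return ("choose", value[1:] or None)
    if value[0] == ".":
        # Variant 2: choice list, return document determined automatically.
        return ("choose-auto", value[1:] or None)
    return ("explicit", value)  # an ordinary language or charset name


for sample in (None, "", "*", ".", "koi8-r"):
    print(repr(sample), classify_setting(sample))
```

When '*' or '.' is written before a name rather than instead of it, the name is returned alongside the action, so the caller can decide how to use it.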
The demo URL of the MultiWeb server is
http://multiweb.urc.ac.ru/demo.html.
Received on Fri Nov 08 1996 - 14:46:42 MST