Multilingual and multicharset Web server from Serge Krashakov on 1996-11-08 (squid-users)

From: Serge Krashakov <sakr@dont-contact.us>
Date: Sat, 9 Nov 1996 01:29:44 +0300

Dear Colleagues,

I'd like to discuss some problems which arise when creating Web servers
in many non-English speaking countries:
1) the need to present information in more than one language;
2) the existence of many character coding tables, especially in Russia where
5 Cyrillic charsets exist.
The attached text is Konstantin Chuguev's report on this problem
(please send all questions about the report to joy@urc.ac.ru).

HTTP/1.1 proposes "Accept-Language" and "Accept-Charset" headers with
the list of acceptable languages/charsets, and corresponding "Content-Language"
and "text/x-charset" fields in the document body. But this approach is
incompatible with all Web caching systems.

One solution is to use the "Content-Language" and "text/x-charset" fields together
with the URL for all cached documents.
I'd like to ask Duane and the other Squid developers: is this possible?
Could anybody propose another (better) solution?
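
The idea of caching by URL plus language and charset can be sketched in Python as follows. This is only an illustration under stated assumptions: VariantCache is a hypothetical in-memory cache, not part of Squid or any other proxy, and it assumes the proxy already knows which language and charset a stored response carries.

```python
class VariantCache:
    """Toy cache keyed by (URL, language, charset) instead of the URL alone."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(url, language, charset):
        # Normalise case so that "KOI8-R" and "koi8-r" share one entry.
        return (url, (language or "").lower(), (charset or "").lower())

    def put(self, url, language, charset, body):
        """Store one variant of the document."""
        self._store[self._key(url, language, charset)] = body

    def get(self, url, language, charset):
        """Return the cached variant, or None on a miss."""
        return self._store.get(self._key(url, language, charset))
```

With such a key, the KOI8-R and CP1251 versions of one URL no longer overwrite each other. A real proxy would still have to decide, from the Accept-Language and Accept-Charset fields of each request, which variant to look up.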

Serge Krashakov, administrator of Chg-FREEnet

                Multiple character set enabled WWW server
                           and conversion API

        At present, with the rapid development of Russian, as well as
ex-USSR and East European, wide area network infrastructure, a very
important task is the creation and maintenance of information
resources in all fields of activities. It is desirable to keep many
of these resources in several languages, suitable both for native and
foreign users.
        There is a problem with processing distributed text
information in many non-English speaking countries: distinctions
between character sets on different operating systems. Thus, up to
the present there are 5 different charsets used in Russia:
        - CP866 (also known as DOS alternative Cyrillic charset);
        - CP1251 - MS Windows Cyrillic charset;
        - KOI8 ( or KOI8-r ) - de-facto standard on Unix systems and
              on the Russian Internet (RFC 1489);
        - ISO-8859-5 - the only Cyrillic charset supported in MIME
              and by large Unix-manufacturers such as
              Sun Microsystems etc.;
        - MacOS Cyrillic - on Macintosh computers.
        Furthermore, there are many other countries where more than
one charset is used.

        Now the most popular, widespread and powerful Internet service
is WWW. That is why the most useful task which can be selected from
the common internationalization/localization problem is the creation
of multilingual and multiple character set enabled Web clients and
servers. But the client software is not taken into consideration
in this paper.

        Multilingual means that the server contains documents in
different languages in the way most convenient for the Web master, and a user
can choose any language for any document. Now there are various
methods to do this (at least in Russia, but most likely in other
countries as well):
        1. The language negotiation mechanism defined in the HTTP/1.1
           standard: the client sends the Accept-Language field with
           a list of acceptable languages, and the server returns
           the name of the chosen language in the Content-Language
           header field of the response. In this case the same URL
           containing no information about the language, is used
           for all document versions in different languages. And
           the value being sent in the Accept-Language field can be
           set up in browser's options just before loading a document.
           Advantages of this method are the following:
           - document hierarchy on the server does not depend
             on the languages supported by the server. A user can
             alter the document language while being in any place
             of this hierarchy, not just on the main page. After
             changing the language, the user remains in the same point
             of the hierarchy, and can continue navigation from this
             place;
           - there is no information about language in hypertext
             references from other documents. The server determines
             the language every time it receives a request. In case
             of missing the document version in the requested
             language, the list of all the languages possible for this
             document can be given to the user.
           Although it is the most preferable method, not all client
           and server software supports it. And what is more, all
           existing proxy-servers cache documents only by URLs,
           ignoring the Content-Language field, which makes this
           method unsuitable for Web navigation through caching
           proxies.
        2. Explicitly setting the preferred language in a URL:
           a) top level directories have the names identical with
              languages names:
              http://www.server.org/english/info/main.html
              http://www.server.org/russian/info/main.html
              http://www.server.org/french/info/main.html
              The shortcoming of this method is the difficulty
              of changing the language while the user remains
              at the same place in the document hierarchy.
              Navigation within the limits of the same server is
              realized, as a rule, by means of the relative
              references. And using this method, it is simple
              to provide language change only for the documents
              situated near the root (top level) of the hierarchy.
              Since language names are in the root, it is necessary
              to use the absolute references to come to another
              language version of a document. This makes moving
              the subtrees in the Web server hierarchy difficult.
           b) each directory has subdirectories named in the same way
              as the languages:
              http://www.server.org/info/english/main.html
              http://www.server.org/info/russian/main.html
              http://www.server.org/info/french/main.html
           c) language name is in the file suffix:
              http://www.server.org/info/main.en.html
              http://www.server.org/info/main.ru.html
              http://www.server.org/info/main.fr.html
           d) the same with another suffix order:
              http://www.server.org/info/main.html.en
              http://www.server.org/info/main.html.ru
              http://www.server.org/info/main.html.fr
           The advantage of all 4 methods is that they allow
           navigation through caching proxy servers. And their
           disadvantage is the difficulty of supporting references
           from one language version of a document to the other
           language versions, because one should modify all
           the versions when adding a new language.
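
The language negotiation described in method 1 can be sketched in Python as follows. This is a minimal illustration, not the code of any particular server: the function names parse_accept and choose_language are hypothetical, and the sketch ignores the "*" wildcard and language prefix matching.

```python
def parse_accept(header):
    """Parse 'ru, en;q=0.8' into [('ru', 1.0), ('en', 0.8)], best first."""
    items = []
    for part in header.split(","):
        fields = part.strip().split(";")
        name = fields[0].strip().lower()
        if not name:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        items.append((name, q))
    # sorted() is stable, so equal q-values keep the order of the header.
    return sorted(items, key=lambda item: -item[1])


def choose_language(accept_language, available):
    """Return the most preferred language the server has, or None."""
    for name, q in parse_accept(accept_language):
        if q > 0 and name in available:
            return name
    return None
```

When choose_language returns None, the server could answer with the list of languages available for the document, as described above.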

        As it has been mentioned before, in many countries different
character sets are used for representing a text in the native
language on different operating systems. And since WWW has
the client-server architecture, and client software requesting a Web
server through the Internet runs on various operating systems,
the multilingual server should deal with multiple charsets.

        The universal character set suitable for all the languages
(Unicode or ISO 10646) is necessary only when there is information
in several languages in the same document (dictionary pages, foreign
language manuals etc.). This is only one of many possible kinds of information
resources. As the realization of supporting several languages
in the same document requires the usage of new protocol versions
(HTTP/1.1 and HTML/3.0), which are in the early stage of practical
application now (like Unicode viewers and editors), the usage
of the universal character set is not mentioned in the paper.
        It is expedient that the server keeps documents in one
designated character set for each language and does on-the-fly recoding
when replying to the client.
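
On-the-fly recoding of this kind can be illustrated with a short Python sketch. It uses Python's built-in codecs rather than the charset library described later in this paper, and the function name recode is hypothetical.

```python
def recode(data, stored_charset, client_charset):
    """Recode bytes from the server's storage charset to the client's charset."""
    if stored_charset.lower() == client_charset.lower():
        return data  # same charset: send the document unchanged
    text = data.decode(stored_charset)
    # Characters missing from the target charset are replaced by '?'.
    return text.encode(client_charset, errors="replace")
```

For example, a Russian document stored in KOI8-R can be sent to a Windows client with recode(document, "koi8-r", "cp1251").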
        Existing Web servers use various methods for choosing
the required character set:
        1. Charset negotiation: the client sends a list of preferred
           charsets in the Accept-Charset field and the server answers
           with the charset name written in the Content-Type field:
                Content-Type: text/html; charset=iso-8859-5
           The same URL is used for different charset versions of
           a document. This URL does not contain information about
           a character set. The value being sent in the Accept-Charset
           field can be set up in browser's options just before
           loading a document. At present this method is supported
           by even less software than the language negotiation
           method, and by no caching proxy server at all.
           This method has the same advantages and the same
           disadvantage as the language negotiation method.
        2. Non-standard extension of the negotiation method:
           a) by means of the content type negotiation - the client
              sends the Accept field in the request header, and
              indicates a "pseudotype" among its preferred content
              types (e.g. text/x-charset-koi8). The preferred charset
              is determined from such a "pseudotype". But, again,
              many client programs do not allow the Accept field to be
              set up, which makes the use of this method impossible.
           b) the server can determine client's operating system or
              some language related features, and choose
              the corresponding charset for any language by means
              of the User-Agent field. Thus, for the Russian language,
              if the word "Windows" is present in the User-Agent
              field, the server sends a document in the windows-1251
              charset, and if the word "DOS" is present,
              the charset is x-ibm866.
              Unfortunately the User-Agent field contents have no
              standardized format. In addition, there are situations,
              when different charsets are used on the same or similar
              operating systems (e.g. ISO-8859-5 and KOI8-R for
              Russian on Unix systems). And a user can prefer a charset
              which is not standard for the given language. Therefore
              this method is sometimes ineffective.
           These 2 methods have the same advantages as the standard
           negotiation method.
           c) the server can keep the database with the "client host
              (IP address or domain name) - charset name" pairs.
              Clients interact with the database by means of HTML
              forms located on the same server. The server always
              knows the IP address of its peer. But if the client
              works through a proxy, the server knows only the proxy's
              address, and this method does not work.
        3. Explicit indication of the charset name in a URL:
           a) pseudo directories with conventional charset names
              (here iso=iso-8859-5, koi=koi8-r, win=windows-1251):
              http://www.server.org/iso/info/main.html
              http://www.server.org/koi/info/main.html
              http://www.server.org/win/info/main.html
           b) different domain names of servers:
              http://iso.www.server.org/info/main.html
              http://koi.www.server.org/info/main.html
              http://win.www.server.org/info/main.html
           Conventional (non-standard, tied to some OS) charset names
           can be used only for one definite language. Though such
           names are shorter and clearer for the end user, these
           methods are oriented to servers dealing with a single
           language (not counting English and some other languages
           using ASCII, which is the subset of almost all modern 8-bit
           character sets), and not applicable for the multilingual
           server. Thus, for example, win=windows-1251 and
           dos=x-ibm866 are true only for Russian.
           c) different server ports for different charsets:
              http://www.server.org:8080/info/main.html
              http://www.server.org:8081/info/main.html
              http://www.server.org:8082/info/main.html
              This method is also oriented to a unilingual server
              (4-5 charsets).
        As the examples above indicate, all the methods have
certain shortcomings. Obviously, a server supporting a greater number
of methods will have fewer shortcomings. Thus, at present,
when a very small part of the client software supports the HTTP/1.1
negotiation method, the best version of the multilingual server
supporting multiple charsets is the server which can determine
character sets by both standard and non-standard negotiation methods,
and which allows the client program to set a language or charset
explicitly in a URL.
        In addition, if the client program is a browser with a user
interface, it is necessary to give a user handy and simple means
for changing language/charset without using negotiation method.
A user should have the possibility to choose a language or charset
by these means from any Web page (or any HTML document).
For example, it can be realized as a row of buttons with language or
charset names in the upper or lower part of an HTML document, and
the button corresponding to the current choice can look like pressed.
Or an HTML page can have a reference to the language/charset choice
page, and after choosing it the user automatically returns to the same
page (with the new language or charset).
        A very useful feature of the multilingual WWW server is that
the structure and contents of HTML documents do not depend on the number
of supported languages and charsets. This means that, when adding a new
language or charset, the Web master does not have to modify already
existing pages, and if a document version in some language is missing,
the user can continue navigation simply by changing
the language.

        Some languages (mainly Asian ones) use multibyte character
sets. It is practically impossible to build multibyte charset support
into existing public domain WWW server software without serious
changes in the programs.

        MultiWeb is the system being developed at the South Ural
Regional Center of FREEnet (FREEnet is the Russian Network for
Research, Education and Engineering). At present the system includes
the following parts relatively independent of each other:
        1. Character set recoding API.
        2. Patch for an existing Web-server.

        Character set recoding API consists of 3 parts:
        - Library module itself, libcharset.a, with API described in
          charset.h header file. The API is very simple and does not
          aim to become a standard. It supports only 8-bit
          fixed-width charsets. However, it is enough for building
          multiple charset support for almost all European languages
          into existing software with minimal changes.
        - Charset database, containing a description of each charset
          used in the system. Each charset is given in a simplified
          version of the format described in [RFC1345].
          This format has been chosen because of its clear character
          mnemonics used for representing each character. Here is
          an example:

                ISO_8859-5

                NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
                DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
                SP ! " Nb DO % & ' ( ) * + , - . /
                0 1 2 3 4 5 6 7 8 9 : ; < = > ?
                At A B C D E F G H I J K L M N O
                P Q R S T U V W X Y Z <( // )> '> _
                '! a b c d e f g h i j k l m n o
                p q r s t u v w x y z (! !! !) '? DT
                PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
                DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
                NS IO D% G% IE DS II YI J% LJ NJ Ts KJ -- V% DZ
                A= B= V= G= D= E= Z% Z= I= J= K= L= M= N= O= P=
                R= S= T= U= F= H= C= C% S% Sc =" Y= %" JE JU JA
                a= b= v= g= d= e= z% z= i= j= k= l= m= n= o= p=
                r= s= t= u= f= h= c= c% s% sc =' y= %' je ju ja
                N0 io d% g% ie ds ii yi j% lj nj ts kj SE v% dz

          The recoding table from one charset to another is built
          by a library function from the descriptions of these
          two charsets. In almost all existing recoders, the initial
          data are recoding tables. However, keeping the initial data
          in the form of charset descriptions allows recoding tables
          to be created from any charset to any other one. Besides, such
          a representation of the information is clearer for a user.
        - A set of utilities based on the library, used for different
          purposes, e.g.:
          + output of the recoding table from one charset to another
            in various formats (useful for example for including such
            a table as an array into C source code);
          + recoding from standard input (or any file) to standard
            output (this utility is used in our FTP server for
            recoding Russian text files on-the-fly; see
            ftp://ftp.urc.ac.ru/);
          + the recoding filter providing a user with transparent
            recoding when working with a text in one charset from
            a terminal supporting another one. Our center's staff
            uses this utility during telnet sessions from DOS or
            Windows (with CP866 and CP1251 charsets correspondingly)
            machines to the UNIX server (with KOI8-R).
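
The construction of a recoding table from two charset descriptions can be sketched in Python. This is not the libcharset.a API, whose functions are not listed in this paper. As an assumption for illustration, each description is a list of 256 symbols taken from Python's own codecs, with the Unicode character standing in for the RFC 1345 mnemonic; the names describe and build_recoding_table are hypothetical.

```python
def describe(codec):
    """Return a 256-entry charset description: one symbol per byte value.

    Assumption: the character decoded by Python's codec stands in for
    the RFC 1345 mnemonic used in the real charset database.
    """
    return [bytes([code]).decode(codec, errors="replace") for code in range(256)]


def build_recoding_table(src_desc, dst_desc, fallback=ord("?")):
    """Build a 256-byte table mapping source byte values to destination ones."""
    dst_index = {}
    for code, symbol in enumerate(dst_desc):
        if symbol != "\ufffd":  # skip byte values undefined in the destination
            dst_index.setdefault(symbol, code)
    # Symbols absent from the destination charset map to the fallback byte.
    return bytes(dst_index.get(symbol, fallback) for symbol in src_desc)


# Example: a KOI8-R to CP1251 table, applied with bytes.translate().
table = build_recoding_table(describe("koi8-r"), describe("cp1251"))
recoded = "Привет".encode("koi8-r").translate(table)
```

Because the table is derived from the two descriptions, adding one new description is enough to recode between it and every charset already in the database.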

        The beta version of this API is currently available at
        http://www.urc.ac.ru/staff/joy/ (as Charset Library).

        The second part of the system is a Web-server. It is a slightly
modified version of the Apache Web Server (version 1.1.1). The reasons
for choosing this particular server are:
        - public domain software;
        - one of the fastest Web servers, both commercial and
          shareware and public domain;
        - high stability;
        - content type and language negotiation methods are already
          built into it;
        - modular structure: the server consists of the kernel and
          separate modules, each of them implements strictly defined
          functions. Such a structure allows new features to be added
          to the server by means of creating separate modules, without
          changing the server's kernel and other modules. There are 29
          modules included in the 1.1.1 version, and dozens of others
          can be found on the Internet.
        Due to these features the Apache server is the most
widespread Web server on the Internet now.
        Unfortunately, the original structure of Apache
version 1.1.1 does not allow multicharset support to be provided without
kernel modification. Therefore about 50 lines of the source code had
to be added or changed.
        The main work of determining the language (set in the URL) and
the character set (set both by means of the negotiation method and
explicitly in the URL) is performed by a separate module,
mod_charset. The module calls a CGI script with all necessary
arguments for giving a list of available languages or charsets
to a user. The CGI script provides a user interface for choosing
language or charset, and can be modified by the Web master for
organic inclusion into pages of any Web server.
        The module also defines several directives, which can be
set both in the server's main configuration file and in
per-directory ones:

        First of all, the original Apache server already has the
language negotiation mechanism. These are the following directives:

        AddLanguage (from the mod_mime module) allows you to specify
the language of a document. You can then use content negotiation
to give a browser a file in a language it can understand. Note that
the suffix does not have to be the same as the language keyword -
those with documents in Polish (whose net-standard language code is
pl) may wish to use "AddLanguage pl .po" to avoid the ambiguity
with the common suffix for perl scripts.
AddLanguage de .de
AddLanguage en .en
AddLanguage es .es
AddLanguage fr .fr
AddLanguage it .it
AddLanguage pl .pl
AddLanguage pt .pt
AddLanguage ru .ru

        LanguagePriority (from the mod_negotiation module) allows
you to give precedence to some languages in case of a tie during
content negotiation. Just list the languages in decreasing order
of preference.
LanguagePriority en es ru pt de fr it pl

        And the following instructions have been added by the new
module:

        AddCharset defines character set name and its aliases and
loads a charset description table.
AddCharset iso-8859-1 iso_8859-1 latin1
AddCharset iso-8859-2 iso_8859-2 latin2
AddCharset iso-8859-3 iso_8859-3 latin3
AddCharset iso-8859-4 iso_8859-4 latin4
AddCharset iso-8859-5 iso_8859-5 iso-ir-144 iso-cyr cyrillic iso
AddCharset iso-8859-9 iso_8859-9 latin5
AddCharset x-cp1250 win-ee ms-ee win-1250 cp1250
AddCharset x-cp1251 win-cyr win-1251 cp1251
AddCharset x-cp1252 win-ansi win-1252 cp1252
AddCharset x-cp1253 win-greek ms-greek win-1253 cp1253
AddCharset x-cp1254 win-turk ms-turk win-1254 cp1254
AddCharset ibm437 cp437
AddCharset ibm850 cp850
AddCharset ibm852 cp852
AddCharset ibm855 cp855
AddCharset ibm857 cp857
AddCharset ibm860 cp860
AddCharset ibm861 cp861 cp-is
AddCharset ibm863 cp863
AddCharset ibm865 cp865
AddCharset x-ibm866 cp866 x-cp866 ibm866 dos-rus-alt
AddCharset ibm869 cp869 cp-gr
AddCharset koi8-r koi-8 koi8 koi
AddCharset x-ru-mac MacOS_Cyrillic macOS-cyr

        AddUAType binds substrings from strings, received by
the server in the User-Agent field, with the User Agent Type.
AddUAType DOS-OS2 DosLynx WebExplorer
AddUAType KOI-Unix Linux FreeBSD "via PRD" Arena Ariadna Lynx
AddUAType ISO-Unix X11 "X Window"
AddUAType Windows Win AIR_Mosaic IWENG/1 MSIE
AddUAType Macintosh Macintosh
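
The substring matching behind AddUAType might look like the following Python sketch. The table holds a subset of the substrings from the lines above. The matching rule (case-sensitive substring search, with the first configured type winning) is an assumption, since the paper does not specify it, and detect_ua_type is a hypothetical name.

```python
# Subset of the AddUAType configuration above, in the same order.
UA_TYPES = [
    ("DOS-OS2", ["DosLynx", "WebExplorer"]),
    ("KOI-Unix", ["Linux", "FreeBSD", "via PRD", "Arena", "Ariadna", "Lynx"]),
    ("ISO-Unix", ["X11", "X Window"]),
    ("Windows", ["Win", "AIR_Mosaic", "IWENG/1"]),
    ("Macintosh", ["Macintosh"]),
]


def detect_ua_type(user_agent):
    """Return the first User Agent Type whose substring occurs in the field."""
    for ua_type, substrings in UA_TYPES:
        if any(substring in user_agent for substring in substrings):
            return ua_type
    return None  # unknown client: no charset can be derived from User-Agent
```

Under this rule the order of the lines matters: "DosLynx" also contains "Lynx" and "X Window" also contains "Win", so the more specific types have to be listed first.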

        AddLangCS sets the unique character set for the given language
and User Agent Type (the <UAT>:<charset> form), and also defines all
other character sets suitable for the given language (the <charset>
form). The first charset name encountered for the given language
defines the charset used on the server itself for this language.
AddLangCS de iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-9
AddLangCS de x-cp1250 Windows:x-cp1252 x-cp1254
AddLangCS de DOS-OS2:ibm850 ibm852 ibm857
AddLangCS en iso-8859-1 Windows:x-cp1252
AddLangCS es iso-8859-1 iso-8859-9 Windows:x-cp1252 x-cp1254
AddLangCS es DOS-OS2:ibm850 ibm857 ibm860
AddLangCS fr iso-8859-1 iso-8859-3 Windows:x-cp1252 x-cp1254
AddLangCS fr DOS-OS2:ibm850 ibm857 ibm863
AddLangCS it iso-8859-1 Windows:x-cp1252 DOS-OS2:ibm850 ibm857
AddLangCS pl iso-8859-2 Windows:x-cp1250 DOS-OS2:ibm852
AddLangCS pt iso-8859-1 iso-8859-9 Windows:x-cp1252 x-cp1254
AddLangCS pt DOS-OS2:ibm850 ibm857 ibm860
AddLangCS ru KOI-Unix:KOI8-R ISO-Unix:ISO-8859-5 Windows:x-cp1251
AddLangCS ru DOS-OS2:x-ibm866 ibm855 Macintosh:x-ru-mac
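
How the AddLangCS lines could be turned into a lookup structure is sketched below in Python. This illustrates the rules stated above and is not the mod_charset source: the function name parse_addlangcs and the dictionary layout are hypothetical.

```python
def parse_addlangcs(lines):
    """Parse AddLangCS lines into per-language charset information."""
    config = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != "AddLangCS":
            continue
        language, entries = fields[1], fields[2:]
        info = config.setdefault(
            language, {"server": None, "by_ua": {}, "allowed": []}
        )
        for entry in entries:
            # "<UAT>:<charset>" or plain "<charset>"
            ua_type, sep, charset = entry.rpartition(":")
            charset = charset.lower()
            if info["server"] is None:
                info["server"] = charset  # first charset seen: storage charset
            if sep:
                info["by_ua"][ua_type] = charset
            if charset not in info["allowed"]:
                info["allowed"].append(charset)
    return config


cfg = parse_addlangcs([
    "AddLangCS ru KOI-Unix:KOI8-R ISO-Unix:ISO-8859-5 Windows:x-cp1251",
    "AddLangCS ru DOS-OS2:x-ibm866 ibm855 Macintosh:x-ru-mac",
])
```

Here cfg["ru"]["server"] is the charset in which Russian documents are kept, cfg["ru"]["by_ua"] maps a User Agent Type to its charset, and cfg["ru"]["allowed"] lists every charset accepted for Russian.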

        AddLocale sets the locale for the given language(s) for use in
CGI scripts and executable includes called from server parsed HTML
files. This may be useful for search and other word processing tasks.
AddLocale de de
AddLocale en en
AddLocale es es
AddLocale fr fr
AddLocale it it
AddLocale pt pt
AddLocale ru_RU ru

        LangChoiceHandler and CSChoiceHandler set paths to CGI
scripts, which allow user to choose any available language or charset
correspondingly. It can be the same script, since choice category
(language or charset) can be determined from script parameters.

LangChoiceHandler /cgi-bin/avail_choice
CSChoiceHandler /cgi-bin/avail_choice

        The way MultiWeb operates is:
        1. Document language detection: if the document URL does not
           contain a language name (see further about explicit
           language or charset setting format), then the language
           negotiation method from the mod_mime and mod_negotiation
           modules works. Otherwise the language is determined from
           the URL. If both methods yield no result, no information
           about language is passed to a client. And, accordingly, no
           charset recoding is made and no information about charset
           is passed.
        2. Detection of the charset, required by a client: if the URL
           does not contain the charset name, the server tries to
           determine it by several ways in consecutive order. If no
           way yields a result, or if the requested charset is not
           used for the given language, the server sends the document
           in the charset, in which the document is kept on
           the server, i.e. no recoding is made. The ways are:
           a) the standard charset negotiation method (the
              Accept-Charset field);
           b) the extension of the negotiation method (the
              text/x-charset-<name> keywords in the Accept field);
           c) the same with other keywords - text/x-cyrillic-<name>
              (for compatibility with the Evgeny Mironov's Cyrillic
              enabled Web server, widespread in FREEnet);
           d) client software type detection from the User-Agent
              field.
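
The detection order of step 2 can be sketched in Python as follows. This is an illustration only: requested_charset is a hypothetical function, q-values are ignored, and falling through to the next candidate when a charset is not allowed for the language is an assumption, since the text above can also be read as stopping at the first charset found.

```python
def requested_charset(headers, allowed, ua_charset=None):
    """Return the charset requested by the client, or None (no recoding).

    headers:    dict of request header name -> value
    allowed:    charsets usable for the document's language
    ua_charset: charset already derived from the User-Agent field, if any
    """
    def names(header):
        # Names from a comma-separated header, parameters dropped.
        return [part.split(";")[0].strip().lower()
                for part in header.split(",") if part.strip()]

    # a) standard negotiation
    candidates = names(headers.get("Accept-Charset", ""))
    # b) and c) pseudotypes in the Accept field
    accept = names(headers.get("Accept", ""))
    for prefix in ("text/x-charset-", "text/x-cyrillic-"):
        candidates += [t[len(prefix):] for t in accept if t.startswith(prefix)]
    # d) client software type
    if ua_charset:
        candidates.append(ua_charset.lower())

    for charset in candidates:
        if charset in allowed:
            return charset
    return None
```

A None result corresponds to sending the document in the charset in which it is kept on the server.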

        The language or charset can be set explicitly in the URL
in the following format:
http://<server name>[:<port>][/LANG=<language>][/CS=<charset>]/<path>
In this case the negotiation method is not called. The explicit
language or charset setting can be used by those client programs
which cannot be set up to use any of the negotiation methods cited above.
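
Splitting such a URL path into its parts can be sketched in Python. This is a hypothetical helper, not the mod_charset code. It assumes the keywords are case-sensitive and appear in the order shown in the format above.

```python
def split_multiweb_path(path):
    """Split '/LANG=ru/CS=koi8-r/info/main.html' into its three parts.

    Returns (language, charset, real_path); language and charset are None
    when the corresponding keyword is absent from the path.
    """
    language = charset = None
    segments = path.split("/")[1:]  # drop the empty string before the first '/'
    if segments and segments[0].startswith("LANG="):
        language = segments.pop(0)[len("LANG="):]
    if segments and segments[0].startswith("CS="):
        charset = segments.pop(0)[len("CS="):]
    return language, charset, "/" + "/".join(segments)
```

The values are returned unchanged, so a caller can tell an ordinary name such as "ru" from an empty or special value.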
        There are special variants of the LANG and CS keywords:
        1. If the '*' sign is before or instead of a language or
           charset name (e.g.
           http://www.server.org/LANG=*/info/main.html), the server
           gives the list of languages or charsets (correspondingly)
           available for the current document (/info/main.html
           in the example). For these purposes the server calls
           the CGI script (set by Web master) with the arguments
"/<path>?lang=...&flang=...&alang=...&cs=...&scs=...&fcs=...&acs=...",
           where:
           - <path> - the current document path (that is written after
             the LANG and CS keywords);
           - lang - the language chosen for the current document;
           - flang - the explicitly set language, if any (that is
             written after 'LANG=');
           - alang - the list of available languages (the line with
             the language names separated by spaces);
           - cs - the charset chosen for the current document;
           - scs - the charset used for the given languages on
             the server;
           - fcs - the explicitly set charset, if any (that is written
             after 'CS=');
           - acs - the list of available charsets for the given
             language (separated by spaces).
           After choosing a language or charset, the user receives
           the current document. If both keywords are used
           (/LANG=*/CS=*/...), then after choosing a language,
           the user receives a charset choice page, and after that
           comes back to the current document.
        The Web master can use the language or charset choice list by
        either making an HTML reference to it from any document
        ( ), or including it into a server parsed
        (SHTML) document. In this case the path of the initial
        document (an SHTML or that, from which the reference is made)
        should be written after the LANG and CS keywords to provide
        a user's return to this document after the choice is made.
        In that way, the document name is contained in the document
        itself, and when changing the name it is necessary to modify
        the document contents. To avoid this the following keywords
        can be used.
        2. The '.' sign is used before or instead of the language or
           charset names (e.g. http://www.server.org/LANG=./). In this
           case a user receives the language or charset list as well,
           but the current document (i.e. the one, which a user
           receives after the choice) is determined automatically: it
           is either the document, from which the reference
           to the choice list is made, or the SHTML document,
           into which the list is included.
        Due to the fact that the LANG and CS keywords are located
        in a URL before the document's real path, when the navigation
        to another document through a relative reference takes place,
        explicit language and charset settings are retained. But there
        is a problem with absolute references. If a user is in the
        http://www.server.org/LANG=ru/CS=koi8-r/info/main.html URL,
        and passes through the ...
        reference, the new URL will be
        http://www.server.org/second.html, and information about
        the explicit language and charset setting is lost. To
        preserve this information the Web master can use the following
        method.
        3. When using either 'LANG=' or 'CS=' keyword without the
           language or charset name (e.g.
           <A Href="/CS=/second.html">...</A>), the language AND
           charset are inherited from the current document. This
           method works even when a user navigates between different
           multiple charset enabled servers.

        The demo URL of the MultiWeb server is
http://multiweb.urc.ac.ru/demo.html.
Received on Fri Nov 08 1996 - 14:46:42 MST
