Non-Western European Languages (Java Servlet Programming)

Let's continue now with a look at how a servlet outputs a page written in a non-Western European language, such as Russian, Japanese, Chinese, Korean, or Hebrew. To understand how to work with these languages, we must first understand how things work behind the scenes of our previous examples.

12.3.1. Charsets

Let's begin looking at the situation from the perspective of the browser. Imagine having the browser's job. You make an HTTP request to some URL and receive a response. That response, in the basest terms, is nothing more than a long sequence of bytes. How do you know how to display that response?

A common way, and in fact the default way, is to assume that every byte represents one of 256 possible characters and to further assume that the character a byte represents can be determined by looking up the byte value in some table. The default table is specified by the ISO-8859-1 standard, also called Latin-1. It contains byte-to-character mappings for the characters most commonly used in Western European languages. So, by default, you (acting as the browser) can receive a sequence of bytes and convert them to a sequence of Western European characters.

Now whatdo you do if you want to receive text that isn't written in a Western European language? You have to take the long sequence of bytes in the response and interpret it differently, using some other byte-sequence to character mapping. Technically put, you need to use a different charset.[2] There are an infinite number of potential charsets. Fortunately, there are only a few dozen that are commonly used.

Some charsets use single-byte characters in a fashion similar to ISO-8859-1, though with a different byte-to-character mapping. For example, ISO-8859-5 defines a byte-to-character mapping for the characters of the Cyrillic (Russian) alphabet, while ISO-8859-8 defines a mapping for the Hebrew alphabet.[3]

Other charsets use multibyte characters, where it may take more than one byte to represent a single character. This is most common with languages that contain thousands of characters, such as Chinese, Japanese, and Korean--often referred to collectively as CJK. Charsets used to display these languages include Big5 (Chinese), Shift_JIS ( Japanese), and EUC-KR (Korean). A table listing languages and their corresponding charsets can be found in Appendix E, "Charsets".

What this boils down to is that if you (as the browser again) know the charset in which the response was encoded, you can determine how to interpret the bytes you receive. Just one question remains: how can you determine the charset? You can do it in one of two ways. First, you can require your user to tell you the charset. With Netscape Navigator 3, this is done through Options | Document Encoding; with Netscape Navigator 4, it is done through View | Encoding. With Microsoft Internet Explorer 4, it's done through View | Fonts. This approach often requires the user to try a few charsets until the display makes sense. The second possibility is that the server (or servlet) specifies the charset in the Content-Type header you receive. For example, the following Content-Type value:

indicates that the charset is Shift_JIS. Unfortunately, a few older browsers can be confused by the inclusion of a charset in the Content-Type header.

12.3.2. Writing Encoded Output

Now that we understand charsets from the perspective of the browser, it's time to return to the perspective of the servlet. A servlet's role is to do the following:

Example 12-5 demonstrates with a servlet that says "Hello World" and displays the current date and time in Japanese. A screen shot is shown in Figure 12-4.

Example 12-5. Hello to Japanese speakers

import java.io.*;
import java.text.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloJapan extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {
    res.setContentType("text/plain; charset=Shift_JIS");
    PrintWriter out = res.getWriter();
    res.setHeader("Content-Language", "ja");

    Locale locale = new Locale("ja", "");
    DateFormat full = DateFormat.getDateTimeInstance(DateFormat.LONG,
                                                     DateFormat.LONG,
                                                     locale);
    out.println("In Japanese:");
    out.println("\u4eca\u65e5\u306f\u4e16\u754c");  // Hello World
    out.println(full.format(new Date()));
  }
}

Figure 12-4. A Japanese Hello

This servlet starts by setting the content type to "text/plain" and the charset to "Shift_JIS". Then it calls res.getWriter() just like always--except in this case the PrintWriter it receives is special. This PrintWriter encodes all the servlet's output in the Shift_JIS charset because that charset is specified in the Content-Type header. This second line is therefore equivalent to the following:

Note that the call to res.getWriter() may throw an UnsupportedEncodingException if the charset is not recognized by Java[4] or an IllegalStateException if getOutputStream() has been called already on this request.

The servlet next creates a Locale with the language "ja" to represent a generic Japanese environment and then creates a DateFormat to match. Finally, it prints the equivalent of "Hello World" in Japanese, using Unicode escapes for the characters, and outputs the current date and time.

For this servlet to work, your server's classpath must include the sun.io.CharToByte* converter classes or their equivalent. On some platforms, these are not always included by default. Also, for the Japanese glyphs (or glyphs from other languages) to display correctly in the browser, the browser has to support the charset and have access to the necessary fonts to display the charset.

12.3.3. Reading and Writing Encoded Output

It can often be prohibitively slow to enter hundreds or thousands of Unicode escapes manually in Java source files. An easier option is to read localized text from an encoded file. For example, let's assume the "Hello World" Japanese text we want to output is saved by someone on the localization team in a file named HelloWorld.ISO-2022-JP, using the ISO-2022-JP encoding to make things more interesting. A servlet can read this file and send the content to the browser using the Shift_JIS encoding, as shown in Example 12-6.

Example 12-6. Sending localized output read from a file

import java.io.*;
import java.text.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloJapanReader extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {
    res.setContentType("text/plain; charset=Shift_JIS");
    PrintWriter out = res.getWriter();
    res.setHeader("Content-Language", "ja");

    Locale locale = new Locale("ja", "");
    DateFormat full = DateFormat.getDateTimeInstance(DateFormat.LONG,
                                                     DateFormat.LONG,
                                                     locale);
    out.println("In Japanese:");

    try {
      FileInputStream fis =
        new FileInputStream(req.getRealPath("/HelloWorld.ISO-2022-JP"));
      InputStreamReader isr = new InputStreamReader(fis, "ISO-2022-JP");
      BufferedReader reader = new BufferedReader(isr);
      String line = null;
      while ((line = reader.readLine()) != null) {
        out.println(line);
      }
    }
    catch (FileNotFoundException e) {
      // No Hello for you
    }

    out.println(full.format(new Date()));
  }
}

This servlet is essentially a character encoding converter. It reads the HelloWorld.ISO-2022-JP text encoded with ISO-2022-JP and internally converts it to Unicode. Then, it outputs the same text by converting from Unicode to Shift_JIS.