Locale Settings In Linux

Dealing with multiple languages

When we start dealing with multiple character sets and languages, we get into a confusing area that's poorly understood by most American computer users. Here's a quick crash course on how to deal with these issues on our Linux cluster.


Most software now understands UTF-8 by default. If your file is in UTF-8, and if the correct font is installed, you probably won't have any trouble, either in X or in a terminal.
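If you aren't sure whether a file really is UTF-8, one rough check is to ask iconv to decode it. This is only a sketch: the function name is_utf8 is made up for this page, and it assumes iconv is installed on the machine where you run it.

```shell
# Hypothetical helper: succeed if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets a byte sequence that is not valid
# in the source encoding, so the exit status is all we need.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" >/dev/null 2>&1
}

# Example use (the file name is a placeholder):
# if is_utf8 myfile.txt; then echo "looks like UTF-8"; fi
```

Plain ASCII files also pass this check, since ASCII is a subset of UTF-8. A pass does not prove the file was meant to be UTF-8, only that it contains no invalid byte sequences.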

If you're running a text terminal program, it's the fonts on your local machine and your terminal program's understanding of the encoding that make the difference.

If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X.

Other character sets

This is where it gets tricky, since there's often no standard way for the system to know what encoding your file is in. You'll need to tell it which to use.

Emacs in X

Emacs 22 defaults to UTF-8 encoding. If you need a different encoding, you must select this before you open the file. There are two ways to do this.

You can click the Options drop-down menu, then click Mule, Set Language Environment and choose the environment that matches your file's encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of the language environment. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the language environment is set, you can open your file with C-x C-f or by using the File drop-down menu.

A good example to test this with is the file /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm. This file is in GB2312 Simplified Chinese. The matching Emacs language environment is "Chinese-GB".

Emacs in text terminals

Four conditions must be met for this to work:

  • Your system must have a font for the language in question
  • Your terminal must expect the proper encoding
  • Linux must be told what encoding your terminal is using
  • Emacs must be told what encoding to use

The first two items are system-dependent. In MacOS X Terminal, for example, you can set the encoding by clicking the Terminal drop-down menu, choosing Window Settings, and selecting Display. The Character Set Encoding drop-down box will be at the bottom of the dialog.

Programs on Linux work out which encoding your terminal is using from the LANG environment variable. You can override it for a single command by setting it on the same command line, like so:

LANG=zh_CN.gb2312 emacs

Or you can override it for an entire session by using the export command:

export LANG=zh_CN.gb2312

Once Emacs is loaded, you can set the encoding with C-x RET l, as before.
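If overriding LANG seems to have no effect, the likely cause is that LC_ALL or LC_CTYPE is also set in your environment, since both take precedence over LANG for character encoding. The sketch below shows that lookup order. The function name effective_ctype is invented for illustration. Running the locale command with no arguments shows the same information in full.

```shell
# Hypothetical helper: print the locale name that governs character encoding.
# Precedence: LC_ALL first, then LC_CTYPE, then LANG.
# An empty value counts as unset; "POSIX" is the fallback if none is set.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-POSIX}}}"
}

# Example use:
# effective_ctype
```

If this prints something other than the value you gave LANG, unset or override the variable that is winning, for example by putting LC_ALL=zh_CN.gb2312 on the command line instead.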

Other text-based programs

The LANG variable also affects other commands. For example, if your terminal is configured for GB2312, you can do the following:

LANG=zh_CN.gb2312 more /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm

and get a correct display.
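If your terminal is configured for UTF-8 rather than GB2312, another option is to convert the text on the fly and leave LANG alone. This is a sketch only: it assumes iconv is available, and the function name gb2312_to_utf8 is made up for this page.

```shell
# Hypothetical helper: print a file, assumed to be GB2312, as UTF-8.
# The original file is read only and is not modified.
gb2312_to_utf8() {
    iconv -f GB2312 -t UTF-8 < "$1"
}

# Example use in a UTF-8 terminal:
# gb2312_to_utf8 /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm | more
```

iconv stops with an error if the input contains bytes that are not valid GB2312, which is a useful hint that you guessed the wrong encoding.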

To list all the valid settings for LANG that the system knows about, run locale -a.
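The list can be long, so it may help to check for one specific name before you use it. The following is a minimal sketch and the function name has_locale is invented here. It ignores case and hyphens, because glibc systems typically list a name such as en_US.UTF-8 as en_US.utf8.

```shell
# Hypothetical helper: succeed if the named locale appears in `locale -a`.
# Both sides are lowercased and stripped of hyphens before comparing,
# so "en_US.UTF-8" matches a listing of "en_US.utf8".
has_locale() {
    want=$(printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    locale -a 2>/dev/null | tr 'A-Z' 'a-z' | tr -d '-' | grep -Fqx -- "$want"
}

# Example use:
# if has_locale zh_CN.gb2312; then
#     LANG=zh_CN.gb2312 emacs
# else
#     echo "zh_CN.gb2312 is not installed on this machine" >&2
# fi
```

If the locale is missing, setting LANG to it generally has no useful effect, and programs fall back to the default C locale.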

-- DavidBrodbeck - 28 Sep 2007

Topic revision: r1 - 2007-09-28 - 21:42:34 - DavidBrodbeck
