Last revised: June 13, 2001

Tips for handling WWW foreign language e-texts

Electronic texts in foreign languages are now common —¡olé!— and easily available on the World Wide Web. However, there are still problems involved, especially with long texts.

If you wish to use the text for electronic searches, you may not need to download the entire text, since some WWW sites offer their own search engines; one example is the Project Cervantes 2001 site, http://www.csdl.tamu.edu/cervantes/. Besides selecting the string to search for, you may choose which Cervantes text to search, whether to do a ranked or Boolean search, whether the results should be given with page numbers, three-line context, or full-page context, and how large the output file may be (500 K maximum).
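
For a text you have already saved locally, a similar search with context can be approximated in a few lines of Python. This is only an illustrative sketch and not the Project Cervantes search engine: the function name search_with_context is hypothetical, and it assumes that "three-line context" means the matching line plus one line before and one line after.

```python
def search_with_context(lines, term, context=1):
    """Return (line_number, excerpt) pairs for each line containing term.

    The match is case-insensitive. With context=1 the excerpt holds the
    line before, the matching line, and the line after (three lines),
    or fewer at the very start or end of the text.
    """
    hits = []
    needle = term.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            # Line numbers are reported 1-based, as a reader would count them.
            hits.append((i + 1, lines[start:end]))
    return hits


if __name__ == "__main__":
    sample = [
        "En un lugar de la Mancha,",
        "de cuyo nombre no quiero acordarme,",
        "no ha mucho tiempo que vivía",
        "un hidalgo",
    ]
    for number, excerpt in search_with_context(sample, "nombre"):
        print(number)
        print("\n".join(excerpt))
```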

When you download a text from a WWW page, you normally have two options. You can save it to your own computer or to disk in the form of an HTML (HyperText Markup Language) file or as a TXT ([ASCII] text) file. There are advantages to each format:

Adapting text to a different character set can be accomplished in various ways:

  1. With a powerful modern word processor you can usually load up an HTML file in your program and then save it in whatever format you wish; the program will do the conversion for you, or at least most of it.

  2. Global search and replace can accomplish the same thing: for example, you can change all instances of &eacute; to é, &ntilde; to ñ, &iexcl; to ¡, etc. This can be tedious, but it works. Some have written macros to handle the task of making conversions.

  3. A conversion program will convert between various character sets. For example, TRANS2 is a DOS program, available at http://users.ipfw.edu/jehle/cervante.htm, that allows approximate translations or conversions between the PC-Graphics (“ASCII”), ISO Latin-1, and HTML character sets. Exact conversions are often impossible, since HTML and Latin-1 can represent many more foreign language characters than PC-Graphics, and PC-Graphics features box characters not available in the other two sets. The program does not transform files to and from HTML format; that is, it does not add or remove HTML tags like <BODY>. Hint: If you are trying to create an HTML document from a foreign language text, use TRANS2 first, before adding the HTML tags; otherwise, the brackets of the HTML tags (< and >) might themselves be converted to HTML character entities.
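
The conversions described in the list above can also be sketched in Python. This is a rough illustration under stated assumptions, not a reimplementation of TRANS2: the function names are hypothetical, Python's cp437 codec is assumed to be a reasonable stand-in for the PC-Graphics set, and characters missing from the target set are simply replaced with a question mark.

```python
import html
from html.entities import codepoint2name


def entities_to_text(s):
    """Replace HTML character references such as &eacute; with the characters they name.

    Note that html.unescape also converts &lt;, &gt; and &amp;.
    """
    return html.unescape(s)


def text_to_entities(s):
    """Replace non-ASCII characters with HTML entities.

    ASCII characters, including < and >, are left untouched, so any
    tags already present in the text are not damaged.
    """
    out = []
    for ch in s:
        code = ord(ch)
        if code <= 127:
            out.append(ch)
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        else:
            # No named entity exists: fall back to a numeric reference.
            out.append("&#%d;" % code)
    return "".join(out)


def to_charset(s, encoding):
    """Encode text for a target character set.

    Characters the set cannot represent become '?', which is why such
    conversions are only approximate.
    """
    return s.encode(encoding, errors="replace")


if __name__ == "__main__":
    print(entities_to_text("&iexcl;ol&eacute;!"))
    print(text_to_entities("<b>año</b>"))
    print(to_charset("Á", "cp437"))
```

Because text_to_entities skips every ASCII character, it avoids the bracket problem mentioned in the hint; a converter that treats < and > like any other special character would not.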


Fred Jehle jehle@ipfw.edu
Indiana University - Purdue University Fort Wayne
Fort Wayne, IN 46805-1499, USA
URL: http://users.ipfw.edu/jehle/courses/etxttips.htm
Works of Cervantes page http://users.ipfw.edu/jehle/cervante.htm
Home page http://users.ipfw.edu/jehle/