Last revised: June 13, 2001
Tips for handling WWW foreign language e-texts
¡olé!
Electronic texts in foreign languages are now common
and easily available on the World Wide
Web. However, there are still problems involved, especially with long texts.
If you only want to search a text electronically, you may be able to avoid
downloading the entire text, since some WWW sites offer their own search
engines, as does, for example, the
Project Cervantes 2001 site,
http://www.csdl.tamu.edu/cervantes/. Besides selecting the string to search
for, you may determine which Cervantes text you wish to search, whether to
do a ranked or boolean search, whether the results should be given with page
numbers, three-line context, or full-page context, and how large the output
file may be (500 K maximum).
When you download a text from a WWW page, you normally have two options.
You can save it to your own computer or to disk in the form of an HTML (HyperText
Markup Language) file or as a TXT ([ASCII] text) file. There are advantages
to each format:
-
HTML text files maintain all the formatting you saw when you first
viewed them in your WWW browser program. That is, all the foreign and
special characters will remain, plus other features such as bolding, italics,
underlining, centering, fonts, and pitch. This is wonderful if you need those
visual features, although the tags which make those features possible can
hinder textual analysis; the simple Spanish phrase given at the top of this
document (¡olé!) ends up looking like
<B>&iexcl;ol&eacute;!</B>. Still, it might
be possible to use text in this format, employing, for example, the following
methods:
-
An Internet browser. Browsers such as Netscape and Internet Explorer
can load up any HTML file stored on your hard drive or disk; just click on
the File option on your toolbar, then use Open to
open the file you've saved. Once the file is loaded up, use the browser to
view the text, to print out a copy, or to search the text electronically.
The length of some e-texts, such as a novel or even one volume of the
Schevill-Bonilla version of Don Quijote, can tax the power of
the search engine of older browsers, but recent versions seem to have
eliminated that problem.
-
A word processor. Many modern word processors allow you to create and read
HTML files, displaying them much like a browser might. Some programs may
have problems with preformatted text, especially if very long files are involved;
if this is the case, it may be necessary to adapt the text to a different
format as explained below.
-
Plain TXT files (here I'm referring specifically to the
.txt files created by an Internet browser such as Netscape)
are stripped of text attributes such as bolding, italics, font, and pitch,
but they still retain the foreign language characters of their HTML counterparts.
The problem is that the character set used is usually ISO Latin-1, whose
special characters may not be recognizable when such a file is used by many
DOS/Windows programs. One notable exception is the multi-lingual word
processor Accent, which uses Latin-1 as the default character set. In most
other cases you will have to change the text to a different character set
(see below) and then you can use:
-
A word processor. Most modern programs offer reasonably easy ways of entering
foreign language characters, permit electronic searches and manipulation of
the text, and are capable of handling very large text files. Each of
the four Schevill-Bonilla volumes of Don Quijote contains about 15,000
lines of text. All four of them could be combined into one file and used
with some modern word processing programs, although a given system might
run slowly if hindered by memory capacity or by processor speed when tackling
that much text at once. The main disadvantage of word processing programs
is that most of them have poor search engines: typically they don't allow
the user to employ wild cards in the search string or specify parameters
for it.
-
A search program. I have used various search programs with some success;
one example is Search and Replace by Funduc Software. One nice feature is
that the user can employ wild cards and search several files or even whole
directories at once, and usually these programs permit the writing of the
output directly to a file which you can save, modify, and use for other purposes.
On the other hand, sometimes the use of foreign language characters makes
entering the search string and reading the output somewhat difficult.
-
A concordance program. For serious textual analysis, it's hard to beat a
concordancer, because this type of program is designed specifically for that
task. Lingua Systems Power Concordancer, for example, allows the user to
select a number of distinct files to form the corpus, search for a string
using wildcards and various options, see the output of the search in columns
(the find, together with what immediately precedes and follows it), manipulate
the output in various ways, and save it. It will also let you create
a word list of all the words in the text, or those above a chosen frequency
count. Programs such as this one can often handle many foreign languages
reasonably well, although in some cases inverted Spanish punctuation characters
may not be recognized as true punctuation (so, for example, a search for the
word qué may not find ¿qué).
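The kind of search a concordancer performs can be sketched in a few lines of Python. This is only an illustration of the idea, not how Power Concordancer or any other program mentioned here works; the function names kwic and word_list and the sample sentence are invented for the example. It assumes the text has already been read into a Unicode string. Because words are matched as runs of letters, inverted punctuation such as ¿ and ¡ stays in the context column instead of being glued to the word.

```python
import re
import unicodedata
from collections import Counter
from fnmatch import translate

# In Python 3, \w is Unicode-aware: accented letters count as word characters,
# while ¿ and ¡ do not, so they never become part of a word.
WORD = re.compile(r"\w+")

def kwic(text, pattern, width=30):
    """Return (left context, hit, right context) for each matching word.

    pattern uses shell-style wildcards: * for any run of characters,
    ? for exactly one character. Matching ignores case.
    """
    # Normalize so that "e + combining accent" becomes the single letter é.
    text = unicodedata.normalize("NFC", text)
    matcher = re.compile(translate(pattern), re.IGNORECASE)
    rows = []
    for m in WORD.finditer(text):
        if matcher.match(m.group()):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            rows.append((left, m.group(), right))
    return rows

def word_list(text, min_count=1):
    """Return (word, count) pairs, most frequent first, at or above min_count."""
    text = unicodedata.normalize("NFC", text)
    counts = Counter(w.lower() for w in WORD.findall(text))
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

# Hypothetical sample text, invented for this sketch.
sample = "¿Qué dices? No sé qué quieres. ¡Qué locura!"

# Print the hits in three columns: what precedes, the find, what follows.
for left, hit, right in kwic(sample, "qu?", width=12):
    print(f"{left:>12} | {hit} | {right}")

# Words that occur at least twice.
print(word_list(sample, min_count=2))
```

Here the pattern qu? finds all three occurrences of qué, including the ones preceded by ¿ and ¡, but not quieres, since ? stands for exactly one character.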
Adapting text to a different character set can be accomplished in various
ways:
-
With a powerful modern word processor you can usually load up an HTML file
in your program and then save it in whatever format you wish; the program
will do the conversion for you, or at least most of it.
-
Global search and replace can accomplish the same thing: for example, you
can change all instances of &eacute; to é,
&ntilde; to ñ, &iexcl;
to ¡, etc. This can be tedious, but it works. Some have
written macros to handle the task of making conversions.
-
A conversion program will allow for conversion between various character
sets. For example, TRANS2 is a DOS program, available at
http://users.ipfw.edu/jehle/cervante.htm,
that allows approximate translations or conversions between
PC-Graphics (ASCII), ISO Latin-1, and HTML character sets. Exact
conversions are often impossible since HTML and Latin-1 are capable of producing
many more foreign language characters than PC-Graphics, and PC-Graphics features
box characters not available in the other two sets. The program does
not transform files to and from HTML format; that is, it does not add or
remove HTML tags like <BODY>. Hint: If you are trying to create
an HTML document from a foreign language text, use TRANS2 first, before adding
the other HTML tags; otherwise, the brackets of the HTML tags
(< and >) might themselves be converted to HTML character entities.
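For readers comfortable with a scripting language, the conversions described above can also be sketched in Python. This is a rough illustration under stated assumptions, not a description of how TRANS2 works: the function names are invented for the example, and Python's cp437 codec is used as a stand-in for the PC-Graphics character set. Characters with no equivalent in the target set are replaced by a question mark, which is one simple way of making the conversion "approximate".

```python
import html
from html.entities import codepoint2name
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect the text of an HTML document and drop the tags."""
    def __init__(self):
        # convert_charrefs=True turns &eacute; into é before we see the text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(markup):
    """Strip tags and decode character entities."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

def to_html_entities(text):
    """Escape &, < and >, then replace non-ASCII characters with entities."""
    out = []
    for ch in html.escape(text, quote=False):
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")  # no named entity: use a numeric one
    return "".join(out)

def latin1_to_pc(data):
    """Convert ISO Latin-1 bytes to code page 437 bytes (approximate)."""
    return data.decode("latin-1").encode("cp437", errors="replace")

def pc_to_latin1(data):
    """Convert code page 437 bytes to ISO Latin-1 bytes (approximate)."""
    return data.decode("cp437").encode("latin-1", errors="replace")

print(html_to_text("<B>&iexcl;ol&eacute;!</B>"))
print(to_html_entities("¡olé!"))
print(latin1_to_pc("¡olé!".encode("latin-1")))
# Á exists in Latin-1 but not in code page 437, so it is lost.
print(latin1_to_pc("Á".encode("latin-1")))
```

Note that to_html_entities also escapes the brackets, so it must be applied to the bare text before any tags are added: applied to <B>ñ</B> it would turn the tag brackets themselves into entities, which is the problem described in the hint above.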
Fred Jehle
jehle@ipfw.edu
Indiana University - Purdue University Fort Wayne
Fort Wayne, IN 46805-1499, USA
URL: http://users.ipfw.edu/jehle/courses/etxttips.htm
Works of Cervantes page
http://users.ipfw.edu/jehle/cervante.htm
Home page
http://users.ipfw.edu/jehle/