Multilingualism on the Web. 5. Language-Related Research

NEF - Le Livre 010101 de Marie Lebert - Multilingualism on the Web

Multilingualism on the Web (1999)
5. Language-Related Research

5.1. Machine Translation Research
5.2. Computational Linguistics
5.3. Language Engineering
5.4. Internationalization and Localization

5.1. Machine Translation Research

The CL/MT Research Group (Computational Linguistics (CL) and Machine Translation (MT) Group) is a research group in the Department of Language and Linguistics at the University of Essex, United Kingdom. It serves as a focus for research in computational, and computationally oriented, linguistics. It has been in existence since the late 1980s, and has played a role in a number of important computational linguistics research projects.

Founded in 1986, the Center for Machine Translation (CMT) is now a research center within the new Language Technologies Institute at the School of Computer Science at Carnegie Mellon University (CMU), Pittsburgh, Pennsylvania. It conducts advanced research and development in a suite of technologies for natural language processing, with a primary focus on high-quality multilingual machine translation.

Within the CLIPS Laboratory (CLIPS: Communication langagière et interaction personne-système = Language Communication and Person-System Interaction) of the French IMAG Federation, the Groupe d'étude pour la traduction automatique (GETA) (Study Group for Machine Translation) is a multi-disciplinary team of computer scientists and linguists. Its research topics cover all the theoretical, methodological and practical aspects of computer-assisted translation (CAT), and more generally of multilingual computing. GETA participates in the UNL (Universal Networking Language) project, initiated by the Institute of Advanced Studies (IAS) of the United Nations University (UNU).

"UNL (Universal Networking Language) is a language that - with its companion "enconverter" and "deconverter" software - enables communication among peoples of differing native languages. It will reside, as a plug-in for popular World Wide Web browsers, on the Internet, and will be compatible with standard network servers. The technology will be shared among the member states of the United Nations. Any person with access to the Internet will be able to "enconvert" text from any native language of a member state into UNL. Just as easily, any UNL text can be "deconverted" from UNL into native languages. United Nations University's UNL Center will work with its partners to create and promote the UNL software, which will be compatible with popular network servers and computing platforms."

The Natural Language Group (NLG) at the Information Sciences Institute (ISI) of the University of Southern California (USC) is currently involved in various aspects of computational/natural language processing. The group's projects are: machine translation; automated text summarization; multilingual verb access and text management; development of large concept taxonomies (ontologies); discourse and text generation; construction of large lexicons for various languages; and multimedia communication.

Eduard Hovy, Head of the Natural Language Group, explained in his e-mail of August 27, 1998:

"Your presentation outline looks very interesting to me. I do wonder, however, where you discuss the language-related applications/functionalities that are not translation, such as information retrieval (IR) and automated text summarization (SUM). You would not be able to find anything on the Web without IR! -- all the search engines (AltaVista, Yahoo!, etc.) are built upon IR technology. Similarly, though much newer, it is likely that many people will soon be using automated summarizers to condense (or at least, to extract the major contents of) single (long) documents or lots of (any length) ones together. [...]

In this context, multilingualism on the Web is another complexifying factor. People will write their own language for several reasons -- convenience, secrecy, and local applicability -- but that does not mean that other people are not interested in reading what they have to say! This is especially true for companies involved in technology watch (say, a computer company that wants to know, daily, all the Japanese newspaper and other articles that pertain to what they make) or some Government Intelligence agencies (the people who provide the most up-to-date information for use by your government officials in making policy, etc.). One of the main problems faced by these kinds of people is the flood of information, so they tend to hire 'weak' bilinguals who can rapidly scan incoming text and throw out what is not relevant, giving the relevant stuff to professional translators. Obviously, a combination of SUM and MT (machine translation) will help here; since MT is slow, it helps if you can do SUM in the foreign language, and then just do a quick and dirty MT on the result, allowing either a human or an automated IR-based text classifier to decide whether to keep or reject the article.

For these kinds of reasons, the US Government has over the past five years been funding research in MT, SUM, and IR, and is interested in starting a new program of research in Multilingual IR. This way you will be able to one day open Netscape or Explorer or the like, type in your query in (say) English, and have the engine return texts in *all* the languages of the world. You will have them clustered by subarea, summarized by cluster, and the foreign summaries translated, all the kinds of things that you would like to have.

You can see a demo of our version of this capability, using English as the user language and a collection of approx. 5,000 texts of English, Japanese, Arabic, Spanish, and Indonesian, by visiting MuST Multilingual Information Retrieval, Summarization, and Translation System.

Type your query word (say, 'baby', or whatever you wish) in and press 'Enter/Return'. In the middle window you will see the headlines (or just keywords, translated) of the retrieved documents. On the left you will see what language they are in: 'Sp' for Spanish, 'Id' for Indonesian, etc. Click on the number at left of each line to see the document in the bottom window. Click on 'Summarize' to get a summary. Click on 'Translate' for a translation (but beware: Arabic and Japanese are extremely slow! Try Indonesian for a quick word-by-word 'translation' instead).

This is not a product (yet); we have lots of research to do in order to improve the quality of each step. But it shows you the kind of direction we are heading in."

"How do you see the future of Internet-related activities as regards languages?"

"The Internet is, as I see it, a fantastic gift to humanity. It is, as one of my graduate students recently said, the next step in the evolution of information access. A long time ago, information was transmitted orally only; you had to be face-to-face with the speaker. With the invention of writing, the time barrier broke down -- you can still read Seneca and Moses. With the invention of the printing press, the access barrier was overcome -- now *anyone* with money to buy a book can read Seneca and Moses. And today, information access becomes almost instantaneous, globally; you can read Seneca and Moses from your computer, without even knowing who they are or how to find out what they wrote; simply open AltaVista and search for 'Seneca'. This is a phenomenal leap in the development of connections between people and cultures. Look how today's Internet kids are incorporating the Web in their lives.

The next step? -- I imagine it will be a combination of computer and cellular phone, allowing you as an individual to be connected to the Web wherever you are. All your diary, phone lists, grocery lists, homework, current reading, bills, communications, etc., plus AltaVista and the others, all accessible (by voice and small screen) via a small thing carried in your purse or on your belt. That means that the barrier between personal information (your phone lists and diary) and non-personal information (Seneca and Moses) will be overcome, so that you can get to both types anytime. I would love to have something that tells me, when next I am at a conference and someone steps up, smiling to say hello, who this person is, where last I met him/her, and what we said then!

But that is the future. Today, the Web has made big changes in the way I shop (I spent 20 minutes looking for plane routes for my next trip with a difficult transition on the Web, instead of waiting for my secretary to ask the travel agent, which takes a day). I look for information on anything I want to know about, instead of having to make a trip to the library and look through complicated indexes. I send e-mail to you about this question, at a time that is convenient for me, rather than your having to make a phone appointment and then us talking for 15 minutes. And so on."

The Computing Research Laboratory (CRL) at New Mexico State University (NMSU) is a non-profit research enterprise committed to basic research and software development in advanced computing applications concentrated in the areas of natural language processing, artificial intelligence and graphical user interface design. Applications developed from basic research endeavors include a variety of configurations of machine translation, information extraction, knowledge acquisition, intelligent teaching, and translator workstation systems.

Maintained by the Translation Research Group of the Department of Linguistics at Brigham Young University (BYU), Utah, TTT.org (Translation, Theory and Technology) provides information about language theory and technology, particularly relating to translation. Translation technology includes translator workbench tools and machine translation. In addition to translation tools, TTT.org is interested in data exchange standards that allow tools from multiple vendors to interoperate and be integrated in the multilingual document production chain.

In the area of data exchange standards, TTT.org is actively involved in the development of MARTIF (machine-readable terminology interchange format). MARTIF is a format to facilitate the interchange of terminological data among terminology management systems. This format is the result of several years of intense international collaboration among terminologists and database experts from various organizations, including academic institutions, the Text Encoding Initiative (TEI), and the Localisation Industry Standards Association (LISA).

5.2. Computational Linguistics

The Laboratoire de recherche appliquée en linguistique informatique (RALI) (Laboratory of Applied Research in Computational Linguistics) is a laboratory of the University of Montreal, Quebec. The RALI's staff includes computer scientists and linguists experienced in natural language processing, both in classical symbolic methods and in newer probabilistic methods.

Thanks to the Incognito laboratory, which was founded in 1983, the University of Montreal's Computer Science and Operational Research Department (DIRO) established itself as a leading research centre in the area of natural language processing. In June 1997, Industry Canada agreed to transfer to the DIRO all the activities of the machine-aided translation program (TAO), which had been conducted at the Centre for Information Technology Innovation (CITI) since 1984. A new laboratory -- the RALI -- was opened in order to promote and develop the results of the CITI's research, allowing the members of the former TAO team to pursue their work within the university community. The RALI's areas of expertise include work in: automatic text alignment, automatic text generation, automatic reaccentuation, language identification and finite state transducers.

The RALI produces the "TransX family" of what it calls "a new generation" of translation support tools (TransType, TransTalk, TransCheck and TransSearch), which are based on probabilistic translation models that automatically calculate the correspondences between the text produced by a translator and the original source language text.

"TransType speeds up the keying-in of a translation by anticipating a translator's choices and critiquizing them when appropriate. In proposing its suggestions, TransType takes into account both the source text and the partial translation that the translator has already produced.

TransTalk is an automatic dictation system that makes use of a probabilistic translation model in order to improve the performance of its voice recognition model.

TransCheck automatically detects certain types of translation errors by verifying that the correspondences between the segments of a draft and the segments of the source text respect well-known properties of a good translation.

TransSearch allows translators to search databases of pre-existing translations in order to find ready-made solutions to all sorts of translation problems. In order to produce the required databases, the translations and the source language texts must first be aligned."
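The idea of searching pre-existing translations can be illustrated with a minimal sketch in Python. It is not the RALI's implementation: the sample sentence pairs and the function name `search_aligned` are hypothetical, and the alignment of source and target sentences is assumed to have been done beforehand.

```python
import re

# Hypothetical, hand-aligned English/French sentence pairs.
# A real system would first align large bilingual corpora automatically.
ALIGNED_PAIRS = [
    ("The committee adopted the report.", "Le comité a adopté le rapport."),
    ("The report was tabled yesterday.", "Le rapport a été déposé hier."),
    ("The House adjourned at noon.", "La Chambre s'est ajournée à midi."),
]


def search_aligned(query, pairs):
    """Return every (source, target) pair whose source sentence
    contains the query as a whole word or phrase (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]


# Show each matching source sentence next to its existing translation.
for src, tgt in search_aligned("report", ALIGNED_PAIRS):
    print(src, "=>", tgt)
```

Because each source sentence is stored with its translation, a match on the source side immediately shows how the expression was translated before. This is why the texts must be aligned first.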

Some of RALI's other projects are:

- the SILC Project, concerning language identification. When a document is submitted to the system, SILC attempts to determine what language the document is written in and the character set in which it is encoded.

- the Finite Automata Package (FAP), a project concerning finite-state transducers. The finite-state automaton is a simple and efficient computational device for describing sets of symbol sequences (words, characters, etc.) known as the regular languages. The finite-state transducer is a device for linking pairs of these sequences under the control of a grammar of local correspondences, and thus provides a means of rewriting one sequence as another. Applications of these techniques in NLP include: dictionaries, morphological analysis, part-of-speech tagging, syntactic analysis, and speech processing.
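As an illustration of how a finite-state transducer rewrites one sequence as another, here is a minimal sketch in Python. It is unrelated to the FAP code itself. The toy rule (English plural "-es" after "s" or "x", "-s" otherwise), the state names and the `+PL` tag are all assumptions made for the example.

```python
import string

# Letters after which this toy rule inserts "e" before the plural "s".
SIBILANTS = {"s", "x"}


def build_transitions():
    """Build a table mapping (state, input symbol) -> (next state, output)."""
    table = {}
    for state in ("plain", "sibilant"):
        for letter in string.ascii_lowercase:
            # Each letter is copied to the output; the state remembers
            # only whether the last letter read was a sibilant.
            nxt = "sibilant" if letter in SIBILANTS else "plain"
            table[(state, letter)] = (nxt, letter)
    # The output for the plural tag depends on the current state.
    table[("plain", "+PL")] = ("done", "s")
    table[("sibilant", "+PL")] = ("done", "es")
    return table


TRANSITIONS = build_transitions()


def transduce(symbols, start="plain"):
    """Run the transducer over a list of input symbols and return the output string."""
    state, output = start, []
    for symbol in symbols:
        if (state, symbol) not in TRANSITIONS:
            raise ValueError(f"no transition from {state!r} on {symbol!r}")
        state, out = TRANSITIONS[(state, symbol)]
        output.append(out)
    return "".join(output)


print(transduce(list("cat") + ["+PL"]))  # cats
print(transduce(list("box") + ["+PL"]))  # boxes
```

The transition table plays the role of the "grammar of local correspondences": each entry pairs one input symbol with an output string, and the state carries the small amount of context needed to choose between them.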

Two main projects at the Xerox Palo Alto Research Center (PARC) concern languages: Inter-Language Unification (ILU) and Natural Language Theory and Technology (NLTT).

The Inter-Language Unification (ILU) System is a multi-language object interface system. The object interfaces provided by ILU hide implementation distinctions between different languages, between different address spaces, and between operating system types. ILU can be used to build multilingual object-oriented libraries ("class libraries") with well-specified language-independent interfaces. It can also be used to implement distributed systems, or to define and document interfaces between the modules of non-distributed programs.

The goal of Natural Language Theory and Technology (NLTT) is to develop theories of how information is encoded in natural language and technologies for mapping information to and from natural language representations. This will enable the efficient and intelligent handling of natural language text in critical phases of document processing, such as recognition, summarizing, indexing, fact extraction and presentation, document storage and retrieval, and translation. It will also increase the power and convenience of communicating with machines in natural language.

Based in Cambridge, United Kingdom, and Grenoble, France, the Xerox Research Centre Europe (XRCE) is another research organization of the international company Xerox. It focuses on increasing productivity in the workplace through new document technologies, with several tools and projects relating to languages.

One of Xerox's research activities is MultiLingual Theory and Technology (MLTT), which studies how to analyze and generate text in many languages (English, French, German, Italian, Spanish, Russian, Arabic, etc.). The MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and of the relations between them. Currently under development are phrasal parsers for French and German, a lexical functional grammar (LFG) for French, and projects on multilingual information retrieval, translation and generation.

Founded in 1979, the American Association for Artificial Intelligence (AAAI) is a non-profit scientific society devoted to advancing the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines. AAAI also aims to increase public understanding of artificial intelligence, improve the teaching and training of AI practitioners, and provide guidance for research planners and funders concerning the importance and potential of current AI developments and future directions.

The Institut Dalle Molle pour les études sémantiques et cognitives (ISSCO) (Dalle Molle Institute for Semantic and Cognitive Studies) is a research laboratory attached to the University of Geneva, Switzerland, which conducts basic and applied research in computational linguistics (CL) and artificial intelligence (AI). The site presents the ISSCO projects (European projects, projects of the Swiss National Science Foundation, projects of the French-speaking community, etc.).

Created by the Foundation Dalle Molle in 1972 for research into cognition and semantics, ISSCO has come to specialize in natural language processing and, in particular, in multilingual language processing, in a number of areas: machine translation, linguistic environments, multilingual generation, discourse processing, data collection, etc. The University of Geneva provides administrative support and infrastructure for ISSCO. The research is funded solely by grants and by contracts with public and private bodies.

ISSCO is multi-disciplinary and multi-national, "drawing its staff and its visitors from the disciplines of computer science, linguistics, mathematics, psychology and philosophy. The long-term staff of the Institute is relatively small in number; with a much larger number of visitors coming for stays ranging from a month to two years. This ensures a continual exchange of ideas and encourages flexibility of approach amongst those associated with the Institute."

The International Conferences on Computational Linguistics (COLINGs) are organized every two years by the International Committee on Computational Linguistics (ICCL).

"The International Committee on Computational Linguistics was set up by David Hays in the mid-Sixties as a permanent body to run international computational linguistics conferences in an original way, with no permanent secretariat, subscriptions or funds. It was ahead of its time in that and other ways. COLING has always been distinguished by pleasant venues and atmosphere, rather than by the clinical efficiency of an airport conference hotel: COLINGs are simply nice conferences to be at. [...] In recent years, the ACL [Association for Computational Linguistics] has given great assistance and cooperation in keeping COLING proceedings available and distributed."

5.3. Language Engineering

Launched in January 1999 by the European Commission, the website HLTCentral (HLT: Human Language Technologies) gives a short definition of language engineering:

"Through language engineering we can find ways of living comfortably with technology. Our knowledge of language can be used to develop systems that recognise speech and writing, understand text well enough to select information, translate between different languages, and generate speech as well as the printed word.

By applying such technologies we have the ability to extend the current limits of our use of language. Language enabled products will become an essential and integral part of everyday life."

A full presentation of language engineering can be found in Language Engineering: Harnessing the Power of Language.

From 1992 to 1998, the Language Engineering Sector was part of the Telematics Applications Programme of the European Commission. Its aim was to facilitate the use of telematics applications and to increase the possibilities for communication in and between European languages. RTD (research and technological development) work focused on pilot projects that integrated language technologies into information and communications applications and services. A key objective was to improve their ease of use and functionality and broaden their scope across different languages.

In January 1999, the Language Engineering Sector was rebranded as Human Language Technologies (HLT), a sector of the IST Programme (IST: Information Society Technologies) of the European Commission for 1999-2002. HLTCentral has been set up by the LINGLINK Project as the springboard for access to language technology resources on the Web: information, news, downloads, links, events, discussion groups and a number of specially commissioned studies (e-commerce, telecommunications, call centres, localization, etc.).

The Multilingual Application Interface for Telematic Services (MAITS) is a consortium formed to specify an application programming interface (API) for multilingual applications in telematic services. A number of telematic applications, such as X.500, WWW, X.400, Internet mail and databases, are to be enhanced to use this internationalization (i18n) API, and products implemented using the API are planned.

FRANCIL (Réseau francophone de l'ingénierie de la langue) (Francophone Network in Language Engineering) is a programme launched in June 1994 by the Agence universitaire de la francophonie (AUPELF-UREF) (University Agency for Francophony) to strengthen activities in linguistic engineering, particularly for automatic language processing. This quickly-growing sector includes research and development for text analysis and generation, and for speech recognition, comprehension and synthesis. It also includes some applications in the following fields: document management, communication between the human being and the machine, writing aid, and computer-assisted translation.

5.4. Internationalization and Localization

"Towards communicating on the Internet in any language..." Babel is an Alis Technologies/Internet Society joint project to internationalize the Internet. Its multilingual site (English, French, German, Italian, Portuguese, Spanish and Swedish) has two main sections: languages (the world's languages; typographical and linguistic glossary; Francophonie (French-speaking countries)); and the Internet and multilingualism (developing your multilingual Web site; coding the world's writing).

The Localisation Industry Standards Association (LISA) is a major organization for the localization and internationalization industry. Its current membership of 130 leading players from all around the world includes software publishers, hardware manufacturers, localization service vendors, and an increasing number of companies from related IT sectors. LISA defines its mission as "promoting the localization and internationalization industry and providing a mechanism and services to enable companies to exchange and share information on the development of processes, tools, technologies and business models connected with localization, internationalization and related topics". Its site is housed and maintained by the University of Geneva, Switzerland.

W3C Internationalization/Localization is part of the World Wide Web Consortium (W3C), an international industry consortium founded in 1994 to develop common protocols for the World Wide Web. The site gives in particular a definition of protocols used for internationalization/localization: HTML; base character set; new tags and attributes; HTTP; language negotiation; URLs & other identifiers including non-ASCII characters; etc. It also offers some help with creating a multilingual site.
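To give an idea of what HTTP language negotiation involves, here is a simplified sketch in Python of how a server might choose among the languages of a multilingual site from the browser's `Accept-Language` request header. The function names and the fallback to English are assumptions made for the example. Real negotiation has additional rules (the `*` wildcard, for instance) that are not handled here.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (language tag, quality) pairs,
    ordered from most to least preferred."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a tag without an explicit q-value counts as 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so tags of equal quality keep the header's order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(header, available, default="en"):
    """Return the best available language for the header, or the default."""
    by_lower = {lang.lower(): lang for lang in available}
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in by_lower:
            return by_lower[tag]
        # Fall back from a regional tag such as "fr-ca" to plain "fr".
        primary = tag.split("-")[0]
        if primary in by_lower:
            return by_lower[primary]
    return default


print(negotiate("fr-CA, fr;q=0.8, en;q=0.5", ["en", "fr", "de"]))  # fr
print(negotiate("ja, de;q=0.7", ["en", "fr", "de"]))               # de
```

The reader's browser states its preferences once, and each site serves the closest version it has, so the same address can serve pages in several languages.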
