Dossiers du NEF - Computer-Assisted Translation (CAT)

NEF (Net des études françaises) - Dossiers du NEF

Computer-Assisted Translation (CAT): Glossary

by Marie Lebert, May 2007

This version replaces a previous version dated October 2006. Please use the "Find (on this page)..." command in your browser to locate specific terms in this glossary.

= alphabet =

The alphabet is a writing system that consists of letters for writing both consonants and vowels. Consonants and vowels have equal status as letters. A letter usually corresponds to a sound. The term "alphabet" is derived from the names of the first two letters of the Greek alphabet (alpha, beta). A system of phonetic notation is provided by the International Phonetic Alphabet (IPA). In computing, letters are encoded with character sets such as ASCII (American standard code for information interchange, mainly for English) and Unicode (for any language). [See also: ASCII, International Phonetic Alphabet, letter, Unicode.]

= ANSI (American National Standards Institute) =

During the early days of computers, ANSI (American National Standards Institute) proposed a character encoding named ASCII (American standard code for information interchange) in 1963 and finalized it in 1968. ANSI is also the Microsoft collective name for all Windows code pages. [See also: ASCII.]

= ASCII (American standard code for information interchange) =

ASCII (American standard code for information interchange) is a 7-bit coded character set for information interchange in English. Its 128 codes cover unaccented Latin letters, digits, punctuation and control characters. It was proposed by ANSI (American National Standards Institute) in 1963 and finalized in 1968. A more recent character set is Unicode, a universal character encoding first published in 1991 to support any language and any platform. [See also: ANSI, Unicode.]
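As a small illustration of the 7-bit limit, the following Python sketch checks whether a text fits in ASCII. The function name is_ascii is chosen here for illustration and does not come from any CAT tool.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character has a code in the 7-bit range 0-127."""
    return all(ord(ch) < 128 for ch in text)


# Unaccented English text fits in ASCII; accented French text does not.
print(is_ascii("translation memory"))
print(is_ascii("mémoire de traduction"))
```

A text that fails this check needs a larger character set, such as Unicode.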

= automated translation =

Automated translation is a synonym of machine translation. [See also: machine translation.]

= case =

Case is a feature of certain alphabets in which the letters have two distinct forms. These variants, which often differ in shape and size, are called the uppercase letter and the lowercase letter. The uppercase letter is also known as "capital" or "majuscule". The lowercase letter is also known as "small" or "minuscule". [See also: alphabet.]

= computational linguistics =

Computational linguistics is an interdisciplinary field dealing with the statistical and logical modeling of natural language. Research involves the work of linguists, computer scientists, experts in artificial intelligence, cognitive psychologists and logicians, among others. Machine translation (MT) is a subfield of computational linguistics. [See also: machine translation.]

= computer-assisted translation =

A computer-assisted translation (CAT) tool rests on two steps, segmentation and translation memory (TM), to boost the productivity of a translator. It also offers other functions: concordance, glossaries, context search, reference search, terminology management, quality control, etc. Computer-assisted translation (CAT) is different from machine translation (MT). In computer-assisted translation, the computer program supports the translator, who translates the text. In machine translation, the computer program translates the text, with no human intervention during the translation process. [See also: concordance, glossary, machine translation, segmentation, terminology, translation memory.]

= concordance =

Concordance is a method of displaying sentences or phrases that contain similar or identical words or expressions, to be able to copy and paste them in the translation. Concordance is an option provided in a computer-assisted translation (CAT) tool. [See also: computer-assisted translation.]
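A concordance search can be sketched in a few lines of Python. This is only a minimal illustration, assuming the previous translations are held as a list of (source, target) pairs. The sample sentences and the name concordance are hypothetical, and real CAT tools use indexed and often fuzzy searches.

```python
# Hypothetical sample of previously translated segments (English to French).
MEMORY = [
    ("Open the file.", "Ouvrez le fichier."),
    ("Save the file before closing.", "Enregistrez le fichier avant de fermer."),
    ("Print the report.", "Imprimez le rapport."),
]


def concordance(term, pairs):
    """Return every (source, target) pair whose source contains the term."""
    needle = term.casefold()  # case-insensitive comparison
    return [(src, tgt) for src, tgt in pairs if needle in src.casefold()]


for src, tgt in concordance("file", MEMORY):
    print(src, "|", tgt)
```

The translator can then reuse the displayed target sentences in the current translation.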

= DTD (document type definition) =

A DTD (document type definition) specifies the rules for the structure of an SGML (standard generalized markup language) document. Standardizing DTDs makes it easier to share different types of documents. [See also: SGML.]

= glossary =

A glossary is an alphabetical list of terms in a special area of knowledge with the definitions for those terms. In computer-assisted translation (CAT), a glossary is a bilingual listing of terminology or software strings used to define the key terms and their translations. [See also: terminology.]

= HTML (hypertext markup language) =

Created by Tim Berners-Lee, who invented the web in 1989, HTML (hypertext markup language) is a text description language related to SGML (standard generalized markup language). It mixes formatting markup with plain text content to describe formatted text. HTML is the source language for web pages. [See also: SGML.]

= human-computer interaction (HCI) =

Human-computer interaction (HCI) is the study of interaction between people and computers. It is an interdisciplinary subject relating computer science to other fields of study and research: psychology, sociology, cognitive science, visualization, design, information science, ergonomics, etc.

= human-machine interface (HMI) =

A human-machine interface (HMI) is any point where people interact with a machine. Examples are the user interface of a data entry program or a voice command system.

= ideograph =

An ideograph or ideogram is a graphic symbol used to express an idea, for example the Chinese characters or the Egyptian hieroglyphs, rather than a group of letters as in alphabetic languages. An ideograph is also any symbol that primarily denotes an idea (or meaning) in contrast to a sound (pronunciation), for example an icon showing a printer, which the user clicks to print a document. [See also: alphabet.]

= IPA (International Phonetic Alphabet) =

The International Phonetic Alphabet (IPA) is a system of phonetic notation devised by linguists to provide a standardized and unique way of representing the sounds of any spoken language. Most dictionaries use the International Phonetic Alphabet to offer pronunciations of words. [See also: alphabet.]

= ITD (intermediate translation document) =

An ITD (intermediate translation document) is a special file created at the beginning of the translation process to store the segments once they have been split off from the main source text. A partially complete translation can then be saved as an ITD file, and resumed later by reopening the ITD file. ITD is a proprietary file format of SDL International, a major provider of global information management (GIM) solutions, including translation and multilingual content. [See also: segment, source file.]

= language pair =

A language pair is the combination of one source language and one target language. [See also: source language, target language.]

= letter =

A letter is an element of an alphabet. In a broad sense, it also includes elements of syllabaries and ideographs. [See also: alphabet, ideograph, syllabary.]

= linguistics =

Linguistics is the scientific study of human language. Theoretical linguistics develops models for individual languages and universal aspects of languages, in various fields: syntax, phonology, morphology, semantics, etc. Applied linguistics deals with the practical issues and challenges of linguistics: language teaching and learning, second language acquisition, speech therapy, speech synthesis (artificial production of human speech), psycholinguistics, semantics, etc.

= LISA (Localization Industry Standards Association) =

LISA (Localization Industry Standards Association) is the leading international forum for organizations doing global business. Its 500 corporate members are public and private institutions, government ministries and trade organizations. LISA is responsible for the specification of the TMX (translation memory exchange) format. [See also: localization, TMX.]

= localization =

Localization is the means of adapting products such as publications, hardware or software for non-native environments, for example for other nations and cultures. Localization is also the process of making a product ready for a specific market, or customized for a specific region, after this product has been internationalized. [See also: LISA.]

= machine translation =

Also called automated translation, machine translation (MT) uses a computer program to translate text or speech from one natural language to another. Machine translation is different from computer-assisted translation (CAT). A CAT tool is meant to support human translators in their work, to speed up the process and provide consistent terminology, while machine translation is meant to stand alone as much as possible. [See also: computer-assisted translation.]

= match =

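A perfect match (also called a 100% match) is an occurrence of a sentence or phrase in a file that is identical (words, structure and formatting) to a sentence or phrase stored in a translation memory (TM). A fuzzy match is an imperfect match: the sentence or phrase is similar but not identical to one stored in the translation memory, and the degree of similarity is usually expressed as a percentage below 100%. [See also: translation memory.]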
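The difference between a perfect match and a fuzzy match can be sketched with the similarity ratio of Python's standard difflib module. This is an illustration only: the function name classify_match and the 0.75 threshold are assumptions made here, and each CAT tool computes its match percentages in its own way.

```python
import difflib


def classify_match(segment, stored, threshold=0.75):
    """Classify a new segment against one segment stored in the memory."""
    if segment == stored:
        return "perfect"
    # ratio() returns a similarity score between 0.0 and 1.0.
    score = difflib.SequenceMatcher(None, segment, stored).ratio()
    return "fuzzy" if score >= threshold else "none"


print(classify_match("Open the file.", "Open the file."))
print(classify_match("Open the files.", "Open the file."))
print(classify_match("Goodbye.", "Open the file."))
```

Note that this sketch compares plain text only, whereas a perfect match in a CAT tool also takes formatting into account.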

= PDF (portable document format) =

PDF (portable document format) is a proprietary Adobe file format that represents documents with a fixed layout, so that they can be shared across all platforms. PDF files are created with Adobe Acrobat and viewed with Adobe Reader (called Acrobat Reader until 2003).

= placeable =

A placeable is an element in the source text that cannot be translated (the HTML code of a web page, for example) and is therefore "placed as is" inside the target text. This is one of the many options provided by a CAT (computer-assisted translation) tool. [See also: computer-assisted translation.]
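As a rough sketch of the idea, the Python code below treats HTML tags as placeables and checks that a translation carries them over unchanged. The regular expression and the function names are assumptions made for this example. Real CAT tools recognize many more kinds of placeables.

```python
import re

# Naive pattern for an HTML tag; enough for simple inline markup.
TAG = re.compile(r"<[^>]+>")


def extract_placeables(segment):
    """Return the HTML tags found in a segment, in order of appearance."""
    return TAG.findall(segment)


def placeables_preserved(source, target):
    """Check that the target contains exactly the same tags as the source."""
    return sorted(extract_placeables(source)) == sorted(extract_placeables(target))


source = "Click <b>Save</b> to continue."
target = "Cliquez sur <b>Enregistrer</b> pour continuer."
print(extract_placeables(source))
print(placeables_preserved(source, target))
```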

= pre-translation =

Pre-translation is the preparation of a file for translation. The file is "filled" with the related segments of previously translated material when there is a perfect or fuzzy match. The result is a hybrid file containing both source-language and target-language text, to speed up the translation process and make it more consistent. [See also: match, source language, target language, terminology.]
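Pre-translation can be sketched as follows in Python, assuming the translation memory is a simple dictionary from source segments to target segments. The function name pretranslate, the sample sentences and the 0.8 cutoff are hypothetical. The fuzzy lookup relies on difflib.get_close_matches from the standard library.

```python
import difflib


def pretranslate(segments, memory, cutoff=0.8):
    """Fill each segment with a stored translation when a match exists.

    Returns a list of (source, target or None, kind of match).
    """
    result = []
    for seg in segments:
        if seg in memory:
            result.append((seg, memory[seg], "perfect"))
            continue
        close = difflib.get_close_matches(seg, list(memory), n=1, cutoff=cutoff)
        if close:
            result.append((seg, memory[close[0]], "fuzzy"))
        else:
            result.append((seg, None, "none"))  # left for the translator
    return result


memory = {
    "Open the file.": "Ouvrez le fichier.",
    "Save the file.": "Enregistrez le fichier.",
}
for row in pretranslate(["Open the file.", "Open the files.", "Print the report."], memory):
    print(row)
```

Segments filled from a fuzzy match still need to be reviewed and corrected by the translator.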

= segment =

A segment is the elementary unit of the source document to translate. Segments are usually sentences, and sometimes phrases or paragraphs. [See also: segmentation, translation unit.]

= segmentation =

Segmentation is the process of organizing the source document into segments. It is one of the two steps provided by a CAT (computer-assisted translation) tool, the second one being the use of the translation memory (TM). [See also: computer-assisted translation, translation memory.]
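A very naive sentence-level segmentation can be sketched in Python with a regular expression. This is an illustration under a strong assumption, namely that every period, exclamation mark or question mark followed by whitespace ends a segment. Abbreviations such as "etc." would be split wrongly, which is why real CAT tools use more elaborate segmentation rules.

```python
import re


def segment(text):
    """Split a text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


for seg in segment("Open the file. Save it! Are you sure?"):
    print(seg)
```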

= SGML (standard generalized markup language) =

SGML (standard generalized markup language) is not a format in itself, but a set of rules to define formats, or a standard framework to define specific text markup languages. HTML (hypertext markup language) is an application of SGML, and XML (extensible markup language) is a simplified subset of SGML. [See also: HTML, XML.]

= source file =

The source file is the file containing the document to translate from a source language to a target language. [See also: source language, target language.]

= source language =

The source language is the language in which the product was originally developed. Translation is done from a source language into one or several target languages. [See also: target language.]

= syllabary =

A syllabary is a set of written symbols representing syllables, which make up words. These symbols usually represent a consonant followed by a vowel. [See also: alphabet.]

= target language =

The target language is the language into which the document is translated. A translation project can have one or several target languages. [See also: source language.]

= template =

A template is a model document that offers a presentation layout.

= terminology =

Terminology is the usage and study of terms. It is also the vocabulary of terms used in a specific field, for example technical terminology in computing. As a discipline, terminology is related to translation. A computer-assisted translation (CAT) tool includes terminology management, to speed up the translation process and to ensure the quality of the translation. [See also: computer-assisted translation.]

= TMX (translation memory exchange) =

TMX (translation memory exchange) is an open XML standard for the exchange of translation memory (TM) data created by computer-assisted translation (CAT) and localization tools. The purpose of TMX is to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process. In existence since 1998, TMX is developed and maintained by OSCAR (Open Standards for Container/Content Allowing Re-use), a Special Interest Group of LISA (Localization Industry Standards Association). [See also: LISA, translation memory, XML.]
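To give an idea of what TMX data looks like, the Python sketch below reads a minimal TMX fragment with the standard xml.etree.ElementTree module. The sample content and the function name read_translation_units are made up for this example, and the fragment is reduced to the bare elements (header, body, tu, tuv, seg). It is not a full TMX reader.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: one translation unit in English and French.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en-US" srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Ouvrez le fichier.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one dictionary {language code: segment text} per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        units.append({tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")})
    return units


print(read_translation_units(SAMPLE_TMX))
```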

= TMX language code =

A TMX language code is the code used by TMX to identify a language, optionally followed by a country or region. Here are a few examples of TMX language codes: EN-US (English, USA), EN-CA (English, Canada), EN-GB (English, UK), FR-CA (French, Canada), FR-FR (French, France). [See also: TMX.]
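The two parts of such a code can be separated with a short Python function. This is a simplified sketch: the name split_language_code is hypothetical, and only codes of the form "language" or "language-region" like the examples above are handled.

```python
def split_language_code(code):
    """Split a code such as 'FR-CA' into (language, region)."""
    language, _, region = code.partition("-")
    # The region is None when the code contains a language only.
    return language.lower(), region.upper() or None


print(split_language_code("FR-CA"))
print(split_language_code("EN"))
```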

= translation =

Translation is the process of adapting meaning from one language to another. This is not a literal, word-for-word process from a source language to a target language. Rather, it is a choice of words that convey the same meaning in the target language. As a discipline, translation is related to terminology. [See also: source language, target language, terminology.]

= translation memory =

A translation memory (TM) is a database consisting of a set of segments (phrases and sentences) in a source language, with the corresponding translation of each segment in the target language. A translation memory is built from previous translations of a document or series of documents. This is the second step provided by a CAT (computer-assisted translation) tool, the first one being segmentation. [See also: computer-assisted translation, segmentation.]
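In its simplest form, a translation memory can be pictured as a lookup table for one language pair. The Python class below is only a toy model with hypothetical names. It stores segments in a dictionary and finds exact matches only, whereas a real translation memory is a database that also supports fuzzy matching.

```python
class TranslationMemory:
    """Toy translation memory for a single language pair."""

    def __init__(self, source_lang, target_lang):
        self.source_lang = source_lang
        self.target_lang = target_lang
        self._entries = {}  # source segment -> target segment

    def add(self, source, target):
        """Store a translated segment for later reuse."""
        self._entries[source] = target

    def lookup(self, source):
        """Return the stored translation, or None if the segment is unknown."""
        return self._entries.get(source)


tm = TranslationMemory("EN-US", "FR-FR")
tm.add("Open the file.", "Ouvrez le fichier.")
print(tm.lookup("Open the file."))
print(tm.lookup("Close the file."))
```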

= translation unit =

A translation unit (TU) is a set of source and target segments. It shows up as an entry consisting of aligned segments of text in two or more languages. The format used for a translation unit is TMX (translation memory exchange). [See also: segment, source language, target language, TMX.]

= TTX (TRADOStag) =

TTX stands for TRADOStag. It is a special bilingual, XML-based (XML: extensible markup language) intermediary document format. TTX is a proprietary file format of SDL International, a major provider of global information management (GIM) solutions, including translation and multilingual content. [See also: XML.]

= Unicode =

Unicode is the universal character encoding maintained by the Unicode Consortium. "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." First published in 1991, this platform-independent encoding provides a basis for the processing, storage and interchange of text data in any language, and in any modern software and information technology protocols. Originally designed as a 16-bit (double-byte) code, Unicode now defines more than one million code points, which are stored through encoding forms such as UTF-8 and UTF-16.
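The "unique number" of a character is called its code point, and the number of bytes needed to store it depends on the encoding form. The Python sketch below shows both. The function name describe is chosen for this example.

```python
def describe(ch):
    """Return (code point, bytes in UTF-8, bytes in UTF-16) for one character."""
    code_point = f"U+{ord(ch):04X}"
    return code_point, len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# A Latin letter, an accented letter, a Chinese character and an emoji.
for ch in ["A", "é", "中", "😀"]:
    print(ch, describe(ch))
```

The emoji takes four bytes in both encoding forms, which shows that not every Unicode character fits in two bytes.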

= XML (extensible markup language) =

XML (extensible markup language) is a text markup language intended for the interchange of structured data. This simple and flexible text format is derived from SGML (standard generalized markup language). XML is a specification of the W3C (World Wide Web Consortium). TMX (translation memory exchange) is an open XML standard for the exchange of translation memory (TM) data. [See also: SGML, TMX, W3C.]

= W3C (World Wide Web Consortium) =

W3C (World Wide Web Consortium) develops interoperable technologies (specifications, guidelines, software and tools) for the web, as a forum for information, commerce, communication and collective understanding. W3C was founded in October 1994 to develop common protocols to lead the evolution of the web. For example, W3C is responsible for the specification of the HTML (hypertext markup language) and the XML (extensible markup language) formats. [See also: HTML, XML.]

With many thanks to Michael Hart for his kind help.
