Professional translation requires skilled translators to use multiple tools — each taking care of different aspects of the translation work.
Computer-Aided Translation (CAT) tools, QA checkers and translation memory extractors all complement each other and work together on a single translation job.
To make the pipeline work, it is crucial to design an effective way for these tools to exchange data and talk to each other.
This is a common problem in information technology.
The Interchange Problem
In operating systems, a similar need has led to the design principle known as the Unix Philosophy:
“Write programs that do one thing and do it well. Write programs to work together.” — attributed to M.D. McIlroy
The user should use programs that perform one specific task, and then connect the output to the input of other programs, to accomplish more complex tasks.
In the translation field, this has been made possible by the introduction of a standard interchange format. Via this format, the translator, or more often the translation systems engineer, can build processing pipelines by passing information in a way that is understood by all tools.
The XML Localization Interchange File Format (XLIFF) is the file format that addresses the interchange problem for the translation and localization domains. It is an XML vocabulary, and is therefore based on a textual markup language.
Being XML-based, it has the advantage of being human-readable, easily editable and processable with a wide range of libraries in most programming languages.
It is an open standard, maintained and evolved by a technical committee at OASIS, an organization that fosters the development of open standards across many domains.
Well-known standards developed at OASIS include those for SOAP web services, messaging protocols (AMQP, MQTT), office applications (OpenDocument), security (SAML, PKCS) and interchange formats for the legal, healthcare and energy sectors.
XLIFF focuses on the representation of translatable text blocks in a source language and their respective translation in a set of target languages — free of other concerns like graphic layout and formatting. Such blocks are called Translation Units.
An XLIFF file contains multiple Translation Units, from one or more document files, expressed as <trans-unit> tags. Such units are hierarchically organized in <file>s and arbitrarily nested <group>s.
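The hierarchy described above can be walked with any XML library. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the sample document, its file name and the helper `list_units` are made up for illustration and do not come from any specific tool.

```python
import xml.etree.ElementTree as ET

# Namespace of the XLIFF 1.2 core elements
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical sample document, invented for this example
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="hello.html" source-language="en" target-language="it" datatype="html">
    <body>
      <group id="g1">
        <trans-unit id="1">
          <source>Hello world</source>
          <target>Ciao mondo</target>
        </trans-unit>
        <group id="g2">
          <trans-unit id="2">
            <source>Goodbye</source>
          </trans-unit>
        </group>
      </group>
    </body>
  </file>
</xliff>
"""


def list_units(xliff_text):
    """Return (original file, unit id, source, target) for every trans-unit."""
    root = ET.fromstring(xliff_text)
    rows = []
    for file_el in root.findall("x:file", NS):
        # iter() walks the whole subtree, so arbitrarily nested <group>s are covered
        for tu in file_el.iter("{%s}trans-unit" % NS["x"]):
            source = tu.find("x:source", NS)
            target = tu.find("x:target", NS)
            rows.append((
                file_el.get("original"),
                tu.get("id"),
                "".join(source.itertext()),
                # <target> is optional: untranslated units have none
                "".join(target.itertext()) if target is not None else None,
            ))
    return rows


for row in list_units(SAMPLE):
    print(row)
```

Because the traversal is recursive, the second unit is found even though it sits two groups deep, and the missing `<target>` is reported as `None` rather than raising an error.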
Within the source and target parts of a Translation Unit, a set of inline code tags can represent formatting placeholders from the original document format, or untranslatable content.
Other tags, then, can divide a paragraph into sentences.
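To illustrate how inline codes sit inside translatable text, here is a small sketch that flattens an XLIFF 1.2 `<source>` into the text a translator would see, replacing each code with a `{id}` marker. The sample content and the `{id}` marker convention are assumptions chosen for this example, and the sketch deliberately ignores `<sub>` elements nested inside codes.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"

# XLIFF 1.2 inline elements whose content is native code, not translatable text
CODE_TAGS = {"bpt", "ept", "ph", "it"}
# Empty XLIFF 1.2 inline elements that stand in for native code
EMPTY_TAGS = {"x", "bx", "ex"}


def local_name(el):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return el.tag.split("}", 1)[-1]


def visible_text(el):
    """Flatten a <source> or <target>, replacing inline codes with {id} markers."""
    parts = [el.text or ""]
    for child in el:
        name = local_name(child)
        if name in CODE_TAGS or name in EMPTY_TAGS:
            parts.append("{%s}" % child.get("id"))
        else:
            # Elements such as <g> and <mrk> wrap translatable text: recurse into them
            parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical source element, invented for this example
SAMPLE = (
    '<source xmlns="%s">Click <g id="1">Save</g> to continue.'
    '<ph id="2">&lt;br/&gt;</ph>Page <x id="3"/> of 10</source>' % XLIFF_NS
)

print(visible_text(ET.fromstring(SAMPLE)))
# Click Save to continue.{2}Page {3} of 10
```

The text wrapped by `<g>` stays visible because it is translatable, while the `<ph>` and `<x/>` codes collapse into markers that a tool can later map back to the original formatting.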
The XLIFF schema was designed to be extensible, providing extension points in its structure that allow insertion of tags with foreign namespaces and schemas. This also proved to be a major weakness — more on this later.
History and Evolution
XLIFF is a mature standard. Its first version, XLIFF 1.0, was published in April 2002. It was refined in two successive versions: 1.1 (October 2003) and 1.2 (February 2008), with the latter still being widely used today.
The standard then made a generational jump with version 2.0 in August 2014, subsequently refined with version 2.1 in February 2018.
The 2.0 version is composable, introducing a modular XML schema made up of a core namespace and several additional namespaces, each one representing an extension module for a specific purpose — such as change tracking, translation matches, generic metadata or glossaries. Each module adds specific tags to the core to express the additional concerns.
The core schema of the new version is incompatible with that of version 1.2. For instance, the translatable text blocks of a translation unit, now represented by a <unit> tag, are further broken down into <segment>s and <ignorable>s.
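The 2.0 structure can be sketched in the same way. The example below lists the `<segment>` and `<ignorable>` parts of each `<unit>` in document order; the sample document and the helper `unit_parts` are hypothetical and only meant to show the shape of the data.

```python
import xml.etree.ElementTree as ET

# Namespace of the XLIFF 2.0 core elements
NS20 = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

# Hypothetical sample document, invented for this example
SAMPLE_20 = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en" trgLang="it">
  <file id="f1">
    <unit id="u1">
      <segment id="s1">
        <source>First sentence.</source>
        <target>Prima frase.</target>
      </segment>
      <ignorable>
        <source> </source>
      </ignorable>
      <segment id="s2">
        <source>Second sentence.</source>
      </segment>
    </unit>
  </file>
</xliff>
"""


def unit_parts(xliff_text):
    """Return (unit id, kind, source text) for each segment or ignorable."""
    root = ET.fromstring(xliff_text)
    parts = []
    for unit in root.iter("{%s}unit" % NS20["x"]):
        for child in unit:
            kind = child.tag.split("}", 1)[-1]
            if kind not in ("segment", "ignorable"):
                continue  # skip anything that is not a text-bearing part
            source = child.find("x:source", NS20)
            parts.append((unit.get("id"), kind, "".join(source.itertext())))
    return parts


parts = unit_parts(SAMPLE_20)
for part in parts:
    print(part)

# Joining all parts in order rebuilds the full source text of the unit
print("".join(text for _, _, text in parts))
```

Here the `<ignorable>` carries the whitespace between the two sentences, so concatenating the parts in order gives back the original paragraph.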
Adoption and Controversy
Version 1.2 of the standard was quickly adopted by the major players in the translation space and has influenced the wider software industry as a localization standard.
Over time, however, several tools leveraged the format's extensibility to add proprietary features that never made it into the standard. As a result, different XLIFF flavours cropped up and jeopardized the interoperability that the standard was meant to provide.
The lack of precise processing requirements in the specification also led tools to drop unsupported or poorly understood parts of XLIFF files, which in turn caused data loss in the pipelines.
Market adoption of version 2 has been slow, due to the bulk of proprietary extensions that have been built on top of version 1.2 over time. Some of these extensions became de facto standards in their own right, like SDLXLIFF (used by SDL Trados Studio), MXLIFF (used by Memsource) and MQXLIFF (used by MemoQ).
Backward incompatibility has not helped either, since it rules out a smooth migration path. As a result, version 2.x is generally used only in newer products and applications.
From a design perspective, the modular approach of XLIFF v2.0 is slightly superior. The new specification explicitly dictates processing requirements, i.e. what an application is expected to do when modifying an XLIFF file, especially when dealing with unsupported or unrecognized tags or namespaces.
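The spirit of these requirements, leaving alone what you do not understand, can be illustrated with a short sketch. It adds a `<target>` to an XLIFF 2.0 segment while keeping an element from a foreign namespace intact. The `urn:example:vendor` namespace, the `vnd:hint` element and the helper `set_target` are hypothetical, and the sketch is not a full implementation of the processing requirements in the specification.

```python
import xml.etree.ElementTree as ET

XLIFF20_NS = "urn:oasis:names:tc:xliff:document:2.0"
VENDOR_NS = "urn:example:vendor"  # hypothetical foreign namespace

# Keep the original prefixes when the tree is serialized again
ET.register_namespace("", XLIFF20_NS)
ET.register_namespace("vnd", VENDOR_NS)

# Hypothetical sample document with one vendor extension element
SAMPLE_EXT = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" xmlns:vnd="urn:example:vendor" version="2.0" srcLang="en" trgLang="it">
  <file id="f1">
    <unit id="u1">
      <vnd:hint>Greeting on the home page</vnd:hint>
      <segment id="s1">
        <source>Hello</source>
      </segment>
    </unit>
  </file>
</xliff>
"""


def set_target(xliff_text, segment_id, translation):
    """Set the target of the matching segment(s); leave everything else untouched."""
    root = ET.fromstring(xliff_text)
    for segment in root.iter("{%s}segment" % XLIFF20_NS):
        # Simplification: matches on the segment id alone, ignoring the parent unit
        if segment.get("id") != segment_id:
            continue
        target = segment.find("{%s}target" % XLIFF20_NS)
        if target is None:
            # <target> follows <source> inside a segment, so appending is enough here
            target = ET.SubElement(segment, "{%s}target" % XLIFF20_NS)
        target.text = translation
    return ET.tostring(root, encoding="unicode")


print(set_target(SAMPLE_EXT, "s1", "Ciao"))
```

The design choice is to edit the parsed tree in place rather than rebuild the file from the parts the tool understands, so unknown elements travel through the pipeline unchanged.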
XLIFF at Translated
Two Translated products are based on XLIFF: MateCat and MateCat Filters.
MateCat is a web-based CAT tool, originally developed by Translated together with Fondazione Bruno Kessler (FBK), Université du Maine and University of Edinburgh. It uses XLIFF v1.2 as its main input format, but it also natively understands the SDLXLIFF flavour and, to a lesser extent, other flavours as well.
With MateCat you can create translation projects by loading a set of XLIFF files, translate them, revise the translations and download the finalised XLIFFs.
MateCat Filters is a REST API that extracts translatable text from documents in several formats, producing XLIFF v1.2 files directly compatible with MateCat.
Filters can also perform the opposite conversion, generating the translated document in the original format (provided they are fed with their own XLIFF files). Finally, they can perform Optical Character Recognition on images and some intermediate conversions on certain legacy document formats (e.g. old Microsoft Office files).
Integration with MateCat is tight: MateCat uses Filters under the hood when you upload documents to translation projects, and again to produce the translated drafts and final versions in the original format.