Generate a Concordance from an XML File

August 21st, 2008

A concordance is a list of all the words in a document and their respective word counts.

For example, if you had the following sentences: I like XML. I like computers. Do you like XML?

A concordance for those sentences would look like this:

3 like
2 I
2 XML
1 you
1 computers
1 do

Concordances are especially useful for finding the words used most often when building glossaries or multilingual dictionaries.

However, it is nontrivial to generate a concordance from an XML file, because element names, attribute names, and attribute values are all plain text that would be counted as words and skew the results. I came up with a way to easily generate a concordance from an XML document using only the GNU/Linux command line, in the form of a concordance shell script.

To run the concordance script, you need either:

Here is the concordance shell script:

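# Strip XML tags. A tag must open and close on the same line to be removed.
# Then turn every character that cannot be part of a word into a newline,
# drop empty lines and stand-alone numbers, and count the remaining words.
sed -e 's/<[^>]*>//g' < inputfile.xml |
tr -c "a-zA-Z0-9'-" "\012" |
grep -v '^[0-9]*$' |
sort -f | uniq -ic | sort -nr > outputfile.txt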

inputfile.xml is your XML file, and outputfile.txt is the concordance file created by the script.

The script does a number of things. First it has to remove all the XML, so it strips the tags to leave plain text. Next it turns spaces, punctuation, stand-alone numbers, Windows line endings, and so on into standard newline characters. At this point, every line has one word on it. Then it does the actual work of building the concordance: it sorts the lines, counts each distinct line while ignoring case, and sorts the counts in reverse numerical order.
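As a rough illustration of those stages, here is a minimal sketch that wraps a similar pipeline in a shell function and runs it on the sample sentences from the top of this post. The function name concordance and the sample markup are made up for the example, and the sketch uses tr -c and grep -v for the word-splitting stage, which is one way to do it rather than the only way.

```shell
# Hypothetical helper: read XML on stdin, write a concordance to stdout.
concordance() {
  # 1. Strip tags (only tags that open and close on the same line).
  sed -e 's/<[^>]*>//g' |
  # 2. Turn every non-word character into a newline: one word per line.
  tr -c "a-zA-Z0-9'-" '\n' |
  # 3. Drop empty lines and stand-alone numbers.
  grep -v '^[0-9]*$' |
  # 4. Sort ignoring case, count duplicates, largest counts first.
  sort -f | uniq -ic | sort -nr
}

# Sample input: the three example sentences wrapped in made-up markup.
printf '%s\n' \
  '<p id="a1">I like XML.</p>' \
  '<p>I like computers. Do you like XML?</p>' | concordance
```

The word like comes out on top with a count of 3, and the tag and attribute names never reach the word list. The order of words with equal counts depends on your locale.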

It’s actually not that complicated. It just uses a few GNU command line tools to process the data, and strings them all together to form a script that takes an XML document and builds a concordance.

The concordance file generated is plain text, but you can import it into Microsoft Excel or any other spreadsheet program by using spaces to delimit the cells. In a lot of business settings, a plain text file won’t do, but that same data in an Excel spreadsheet becomes business data.
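If splitting on spaces gives you empty cells because of the padding that uniq -c puts in front of each count, one option is to convert the file to comma-separated values first. This is a small sketch, not part of the original script; the function name to_csv and the file name outputfile.csv are placeholders.

```shell
# Hypothetical helper: turn "      3 like" lines into "3,like" lines.
to_csv() {
  sed -E 's/^ *([0-9]+) +(.*)$/\1,\2/'
}

# Example usage, assuming outputfile.txt came from the concordance script:
#   to_csv < outputfile.txt > outputfile.csv
printf '      3 like\n      2 XML\n' | to_csv
```

Because the words in the concordance contain only letters, digits, apostrophes, and hyphens, no CSV quoting is needed.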

Translating Sentences for Trados Rather Than Ideas

March 30th, 2008

Translation memory tools such as Trados have numerous benefits for translators, but they have drawbacks as well. They encourage the translator to translate everything on a sentence-by-sentence basis, so that every source sentence has a corresponding target sentence. It is not always ideal to translate in this manner.

For example, consider the following Japanese sentence and its translation:

すしが好きですが、ウニはぜんぜんダメです。

I like sushi. However, I cannot eat sea urchin.

Notice that in Japanese it is natural to say all of that in one sentence. English, on the other hand, works better as two sentences.

If you translate that Japanese sentence using Trados, you can split the English translation into two sentences. However, if you then use that translation memory for translating from English, you will get no matches for the sentences “I like sushi.” or “However, I cannot eat sea urchin.” You would have to know to expand the segment to span two sentences.
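To make the asymmetry concrete, here is a toy sketch in shell. It treats a translation memory as nothing more than a list of exact segment pairs, which is a big simplification (real tools such as Trados also do fuzzy matching), and the names tm and tm_lookup are made up for the illustration.

```shell
# Toy translation memory: one "source<TAB>target" segment pair.
tm=$(printf '%s\t%s' 'すしが好きですが、ウニはぜんぜんダメです。' \
  'I like sushi. However, I cannot eat sea urchin.')

# Hypothetical helper: exact-match lookup.
# $1 = column to search (1 = Japanese, 2 = English), $2 = segment to look up.
tm_lookup() {
  printf '%s\n' "$tm" | awk -F '\t' -v col="$1" -v seg="$2" '
    $col == seg { print $(3 - col); found = 1 }
    END { if (!found) print "no match" }'
}

# Japanese to English: the stored segment matches.
tm_lookup 1 'すしが好きですが、ウニはぜんぜんダメです。'
# English to Japanese, one sentence at a time: no match.
tm_lookup 2 'I like sushi.'
# English to Japanese, segment expanded to both sentences: match.
tm_lookup 2 'I like sushi. However, I cannot eat sea urchin.'
```

Only the expanded two-sentence segment finds the Japanese sentence; the single English sentence finds nothing.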

Translation memory CAT tools like Trados encourage you to translate with a one-to-one correspondence so that the translation memory is useful in both directions. Misaligning sentences is wasteful because the resulting TM will not work if the language direction is reversed. Therefore, a translator using Trados will probably translate the above sentence as “I like sushi, but I cannot eat sea urchin.” This sentence is fine by itself, but it doesn’t have the same impact as expressing the two ideas in separate sentences.

This is just a simple example, but the problem is much bigger than style choices. When using Trados, you translate entire paragraphs line by line. Every source has a matching target. However, the way you organize a paragraph and express an idea in one language may not be the same as in another language. With Trados, you don’t have the freedom you would have if you were translating by hand. You are given a sentence to translate, and then another, and another. If you choose to expand the source segment to encompass the entire paragraph, you have essentially made that segment worthless with respect to the translation memory.

Trados and translation memory CAT software are great tools, but they encourage translation of single sentences, rather than ideas or concepts. A test often used after a translation is to run the translation memory that was created against the original source document. You expect to get 100% matches for the entire document. However, a good translator will not translate everything line by line with one-to-one correspondence between source and target.

Translation is more than converting a sentence from one language to another. It’s about expressing something naturally in a different language. CAT tools like Trados don’t encourage the natural translation of ideas, but rather the conversion of sentences.

First Ever to be Trados Certified

February 17th, 2008

In 2006, SDL unveiled their Trados Certification program. I had been using Trados extensively at work and thought it would be neat to have the official Trados certification on my resume.

Soon after they released the Trados training program and certification tests, I signed up online to take the tests. To my surprise, the tests were hard. They had questions about what the specific menu names were and little details like that. If I had not used Trados as much as I had, I don’t think I would have passed. You had to be really familiar with the entire suite of tools.

I passed the test and got my own personal certification page generated:

http://oos.sdl.com/asp/products/certified/index.asp?userid=14706

What was surprising was what came next. A few weeks later I got a package at work from SDL Trados. They sent a congratulatory card informing me that I was the first person to pass the Trados certification, and a bottle of vintage champagne! I certainly wasn’t expecting any of that.

Following that, they contacted me again for a quote and profile to put up on their certification Web site (http://www.translationzone.com/en/certification/Default.asp). They also asked for a picture of me, but I guess I wasn’t photogenic enough for their site, because they put a generic image of someone else above my quote. The current version of the page has a woman and multiple quotes now.

SDL Trados Certification Page

In the end, it’s kind of neat. I can tell people I was the first person ever to be certified by SDL Trados. Since then I have also passed their SDL Trados 2007 certification.

Can You Translate This?

January 19th, 2008

At work I’m often asked things such as “How long will it take to translate 10 pages?” Managers usually don’t like my answer; they want to hear a specific time frame to fill in a Gantt chart or something. The reality is, it depends. Most managers don’t understand what goes into translating something. It’s not as straightforward as just translating the words. More goes into the localization process than translation alone.

Expertise. No one is an expert on everything. If you have a technical document that needs translating, you first have to understand the content of the document. If you don’t understand electromagnetic fields in your native language, how are you going to translate that subject from another language? You will often have to research the subject matter before and during the translation process. It takes time to get familiar with a topic. It takes more time to look up industry- and field-specific terminology and concepts. If you have SMEs at your company, you are at the mercy of their schedules when you cannot locate information yourself.

Working with others. Unfortunately, not all translation projects can be done solely by the translator. For example, if you are given a video and asked to subtitle it in a different language, there are many steps involved that most people don’t realize:

  • Transcribe the audio
  • Translate the text
  • Match the text to the video
  • Reedit the video with the translated subtitles
  • QA check the subtitled video

Ideally, the translator is provided with a transcription of the audio and a copy of the video so they can start translating immediately. The translator then works with the editor to set cut points for the translated text. Finally, the editor reedits the video and sends it to the translator to check.

Unfortunately, what usually happens is that the translator is sent the video and asked to translate subtitles for it. Now the translator has to spend time transcribing the audio and then translating it. Next, they must come up with cut points themselves and hope the editor understands them. The editor will then receive the translation and edit it into the video. It will probably never be checked, and most likely the subtitles aren’t going to match the on-screen dialog.

File formats. How long it will take to translate a document depends on what format it’s in. An XML file with pure text content and no markup can be translated easily. The text can be extracted and run through the translator’s favorite translation software.

A PDF, on the other hand, is not as easily accessible. Text may be extractable with some amount of effort, but the original document structure and style cannot be rebuilt automatically. Therefore, the translator will have to spend considerable time doing page layout work.

The worst case scenario is a scanned document, or raster graphics files. The text cannot even be extracted from the document, so translation software can’t be used. With a language like Japanese where a translator may not know the pronunciation of hard technical terms, the inability to cut and paste those words into an online dictionary creates lots of problems.

Most people don’t consider the file format when sending something to be translated. They just want it translated, and don’t want to pay for page layout and text extraction, because they don’t think those are part of the translation process. If you send a Word document to a translator, but that Word document has 50 JPGs in it with text to be translated, you are asking the translator to be a graphics specialist as well.

A lot goes into the translation process. Translators often have to do much more than just translate words to do a good translation. Managers need to understand what goes into this process and provide translators with the resources they need so they can focus on what they do best.

Localization and the Japanese Language

January 12th, 2008

This is a blog about localization, and some unique issues with the Japanese language when it comes to translation and localization.

My name is Mark. I work for a large Japanese semiconductor company as a localization engineer. I write documentation, translate documentation, and use software to increase translation efficiency. I also write software and create publishing systems to assist in the documentation and translation process. I have degrees in Computer Sciences and Japanese. I went to college in the U.S.A. and in Japan.

Documentation and localization have a number of interesting issues that I want to talk about. Also, the Japanese language increases the complexity of our work and adds many unique considerations to the job that I want to cover.

Localizing Japanese is hard, but very interesting. As I work and discover new things, I want to share them here.