ISO 639-2/B language codes in Java

15/11/2013

While integrating the eZ Publish REST API, I came across a type of locale I had never seen before. They use a combination of ISO 639-2 language codes and ISO 3166-1 alpha-2 country codes, e.g. 'eng-US'. But something was off about it, and it took me a while to find out what: they use the legacy ISO 639-2/B variant of the standard, which is the bibliographic version that was in use in libraries before computers were around.

The main difference is that some language codes (22 in total) look more like the English word for the language, e.g. 'ger' for German instead of 'deu'.
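
To make the difference concrete, here is a minimal sketch of a lookup from bibliographic (B) codes to their terminology (T) counterparts. The class `BibliographicCodes` and its method `toTerminology` are names I made up for illustration, and the table holds only a handful of the affected languages, not the complete list.

```java
import java.util.Map;

// Hypothetical helper for illustration; covers only a few of the B/T pairs.
final class BibliographicCodes {

    // ISO 639-2/B code -> ISO 639-2/T code (partial table).
    private static final Map<String, String> B_TO_T = Map.of(
            "ger", "deu",  // German
            "fre", "fra",  // French
            "dut", "nld",  // Dutch
            "chi", "zho",  // Chinese
            "cze", "ces"); // Czech

    private BibliographicCodes() { }

    // Returns the T code for a known B code. Any other code is returned unchanged,
    // because most languages use the same code in both variants.
    static String toTerminology(final String code) {
        return B_TO_T.getOrDefault(code, code);
    }

    public static void main(final String[] args) {
        System.out.println(toTerminology("ger")); // deu
        System.out.println(toTerminology("eng")); // eng
    }
}
```

A hand-written table like this is fine for a demo, but for real input a maintained code list is the safer choice.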

This made their API quite difficult to integrate, as there is no built-in support for these language codes in Java itself or the ICU library.
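
A quick way to see the gap in the JDK is to search its own language list for a three-letter code. The sketch below is only an illustration: `JdkIso3Lookup` and `findAlpha2` are hypothetical names, and the result for 'ger' reflects the JDKs I am aware of, where `Locale.getISO3Language()` reports the terminology codes.

```java
import java.util.Locale;
import java.util.Optional;

// Illustrative only: shows which three-letter codes the JDK itself can resolve.
final class JdkIso3Lookup {

    private JdkIso3Lookup() { }

    // Searches the JDK's ISO 639 language list for a three-letter code and
    // returns the first matching two-letter code, if there is one.
    static Optional<String> findAlpha2(final String iso3Code) {
        for (final String alpha2 : Locale.getISOLanguages()) {
            if (Locale.forLanguageTag(alpha2).getISO3Language().equals(iso3Code)) {
                return Optional.of(alpha2);
            }
        }
        return Optional.empty();
    }

    public static void main(final String[] args) {
        // The JDK reports the terminology code for German.
        System.out.println(Locale.GERMAN.getISO3Language());
        System.out.println(findAlpha2("deu"));
        // The bibliographic code is not expected to be found.
        System.out.println(findAlpha2("ger"));
    }
}
```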

The only library with support for ISO 639-2/B I could find is the Neovisionaries I18n library, available under an Apache license.

Long story short, here is the code that transforms the eZ Publish comma-separated string of funny codes into ISO 639-1 language codes, using Neovisionaries I18n:

import com.neovisionaries.i18n.LanguageAlpha3Code;
import org.springframework.util.Assert;

import javax.annotation.Nonnull;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extracts ISO 639-1 language codes from the string the eZ Publish API delivers.
 */
public final class EzPublishLanguageCodeConverter {

    private static final Pattern EZ_LANGUAGE_CODE_PATTERN = Pattern.compile("([a-z]{3})-[A-Z]{2}");

    private EzPublishLanguageCodeConverter() { }

    /**
     * Extracts ISO 639-1 language codes from the comma-separated list of language codes the 
     * eZ Publish API delivers.
     * @param ezPublishLanguageCodes e.g. 'ger-DE,eng-GB'
     * @return A set of language codes in ISO 639-1 format.
     */
    @Nonnull
    public static Set<String> extractIso2LanguageCodes(
            @Nonnull final String ezPublishLanguageCodes) {
        Assert.hasText(ezPublishLanguageCodes, "Input must not be null or blank.");

        final String[] ezLanguageCodes = ezPublishLanguageCodes.split(",");
        final Set<String> result = new HashSet<>(ezLanguageCodes.length);
        for (final String ez : ezLanguageCodes) {
            result.add(convertToIso6391LanguageCode(ez));
        }
        return result;
    }

    private static String convertToIso6391LanguageCode(final String ezLanguageCode) {
        final Matcher matcher = EZ_LANGUAGE_CODE_PATTERN.matcher(ezLanguageCode);
        if (!matcher.find()) {
            throw new IllegalArgumentException("Unable to extract language code from input: " 
               + ezLanguageCode);
        }
        final String iso3LanguageCode = matcher.group(1);

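        // eZ Publish uses ISO 639-2/B (bibliographic) language codes, e.g. German = 'ger'. Wow.
        final LanguageAlpha3Code languageCode = LanguageAlpha3Code.getByCode(iso3LanguageCode);
        Assert.notNull(languageCode, "Language code " + iso3LanguageCode + " could not be found.");
        // Not every ISO 639-2 language has a two-letter code; getAlpha2() returns null for those.
        Assert.notNull(languageCode.getAlpha2(),
                "Language code " + iso3LanguageCode + " has no ISO 639-1 equivalent.");
        return languageCode.getAlpha2().toString();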
    }
}

And here is the Spock unit test that goes with it. Phew.

import spock.lang.Specification
import spock.lang.Unroll

class EzPublishLanguageCodeConverterTest extends Specification {

    @Unroll
    def 'extractIso2LanguageCodes() checks argument'() {
        when:
        EzPublishLanguageCodeConverter.extractIso2LanguageCodes(input)

        then:
        thrown(IllegalArgumentException)

        where:
        input << [null, '', ' ']
    }

    @Unroll
    def 'extractIso2LanguageCodes() throws exceptions on malformed input'() {
        when:
        EzPublishLanguageCodeConverter.extractIso2LanguageCodes(input)

        then:
        thrown(IllegalArgumentException)

        where:
        input << ['de', 'ger', 'yada', 'de_DE', 'de-DE', 'ger-de', 'GER-DE']
    }

    @Unroll
    def 'extractIso2LanguageCodes() throws exceptions on unknown language codes'() {
        when:
        EzPublishLanguageCodeConverter.extractIso2LanguageCodes(input)

        then:
        thrown(IllegalArgumentException)

        where:
        input << ['zzz-DE', 'abc-DE', 'www-DE']
    }

    @Unroll
    def 'extractIso2LanguageCodes() converts to ISO 639-1 language codes'() {
        when:
        def result = EzPublishLanguageCodeConverter.extractIso2LanguageCodes(input)

        then:
        result == expected

        where:
        input               | expected
        'eng-GB'            | ['en'] as Set
        'ger-DE, eng-GB'    | ['de', 'en'] as Set
    }

}
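
The malformed-input cases in the test come down to how the regular expression is applied. Here is a stand-alone sketch of just that matching step, without the Neovisionaries lookup. `EzCodePatternDemo` and `languagePart` are hypothetical names used only for this illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone illustration of the matching step; not part of the converter itself.
final class EzCodePatternDemo {

    // Same expression as EZ_LANGUAGE_CODE_PATTERN in the converter.
    private static final Pattern PATTERN = Pattern.compile("([a-z]{3})-[A-Z]{2}");

    private EzCodePatternDemo() { }

    // Returns the three-letter language part, or empty if the pattern is not found.
    static Optional<String> languagePart(final String ezLanguageCode) {
        final Matcher matcher = PATTERN.matcher(ezLanguageCode);
        // find() searches anywhere in the input, so surrounding whitespace is tolerated.
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(final String[] args) {
        System.out.println(languagePart("ger-DE"));  // Optional[ger]
        System.out.println(languagePart(" eng-GB")); // Optional[eng]
        System.out.println(languagePart("ger-de"));  // Optional.empty
        System.out.println(languagePart("GER-DE"));  // Optional.empty
        System.out.println(languagePart("de-DE"));   // Optional.empty
    }
}
```

Because find() is used rather than matches(), the leading space in 'ger-DE, eng-GB' does no harm, but stray characters around an otherwise valid code are also let through. Trimming each entry and calling matches() would be the stricter alternative.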
