NLP Projects at Reykjavik University
Our current main project is developing an open source natural language processing (NLP) toolkit, IceNLP, for analysing and processing the Icelandic language.
IceNLP currently consists of a tokeniser, a
morphological analyser (IceMorphy), a linguistic rule-based
part-of-speech tagger (IceTagger), a trigram tagger (TriTagger), and a shallow (finite-state) parser
(IceParser). IceNLP is written as a collection of Java classes.
IceTagger uses a tagset constructed in the compilation of the Icelandic Frequency Dictionary corpus.
IceParser produces output according to a specific shallow annotation scheme.
You can test IceNLP
Source and/or executables for IceNLP can be obtained from SourceForge.net.
IceNLP is a step towards the goal of developing a Basic Language Resource Kit (BLARK) for Icelandic.
A BLARK for a language is the minimal set of basic resources (software modules, corpora, dictionaries, etc.) that is necessary to do further research and development in the field of Language Technology.
Current research projects
Viable Language Technology Beyond English
In January 2009, the Icelandic Centre for Language Technology (ICLT) received a Grant of Excellence from the Icelandic Research Fund.
The project's primary objective is to make it feasible to develop three particular types of language technology (LT) modules with limited resources, without sacrificing the quality of the work. The three types of modules are a database of semantic relations, a machine translation system, and a treebank.
Further information can be found here.
Construction of a new gold standard corpus
In this project, we build a new gold standard, a PoS-tagged corpus of modern Icelandic text. The corpus consists of a sample, about 1 million tokens, from Mörkuð íslensk málheild [PoS-tagged Icelandic corpus].
This sample will become a new gold standard to be used to train and evaluate taggers on Icelandic text.
In addition to the corpus itself, the main product of this project is a tool which automates the tagging process.
This process consists of i) sentence segmentation and word tokenisation; ii) PoS tagging using five different taggers and a combination method; and iii) the detection of tagging errors.
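Step i) of the process can be illustrated with a minimal Python sketch. The regular expressions and the function names (segment_sentences, tokenise, preprocess) are assumptions made for illustration only; they are not the rules used by the project's tool, and a realistic tokeniser would also need abbreviation lists and language-specific handling.

```python
import re

# Hypothetical, deliberately naive rules (assumptions, not the project's own).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # whitespace after ., ! or ?
TOKEN = re.compile(r"\w+|[^\w\s]")            # a word, or one punctuation mark


def segment_sentences(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenise(sentence):
    """Return words and individual punctuation marks as separate tokens."""
    return TOKEN.findall(sentence)


def preprocess(text):
    """Segment text into sentences, then tokenise each sentence."""
    return [tokenise(s) for s in segment_sentences(text)]
```

The output of preprocess is a list of token lists, one per sentence, which is the shape a PoS tagger in step ii) would typically consume.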
Improved tagging accuracy of Icelandic text
Icelandic is a morphologically complex language for which the task of part-of-speech (PoS) tagging has turned out to be difficult, both for data-driven and linguistic rule-based taggers. This project aims to improve the tagging accuracy using several methods.
First, we use data from a large morphological database to extend the dictionaries used by the taggers. Second, we search for cost-effective ways to increase the tagging accuracy of our linguistic rule-based tagger IceTagger. Third, we design an external tagset (the tagset used for evaluation) by removing information from the internal tagset (the tagset used by a tagger) which reflects distinctions that are not morphologically based. Lastly, we use six different taggers to build a combined tagger for the purpose of further increasing the tagging accuracy.
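The relation between an internal and an external tagset can be sketched in Python as follows. The sketch assumes that a tag is a string in which each character position encodes one feature, and that the external tag is obtained by deleting selected positions. The function names, the tags, and the choice of position are hypothetical and do not describe the mapping actually designed in the project.

```python
def to_external(tag, dropped_positions):
    """Map an internal tag to an external tag by deleting the characters
    at the given positions (hypothetical position-encoded tags)."""
    return "".join(ch for i, ch in enumerate(tag) if i not in dropped_positions)


def accuracy(predicted, gold, dropped_positions=frozenset()):
    """Share of tokens whose predicted and gold tags are equal after both
    have been mapped to the external tagset."""
    matches = sum(
        to_external(p, dropped_positions) == to_external(g, dropped_positions)
        for p, g in zip(predicted, gold)
    )
    return matches / len(gold)


# Made-up tags: the tagger errs only on the third feature of the second token.
predicted = ["abc", "abd"]
gold = ["abc", "abc"]
internal_accuracy = accuracy(predicted, gold)        # evaluated on full tags
external_accuracy = accuracy(predicted, gold, {2})   # third feature ignored
```

Because the tagger still works with the full internal tags, only the evaluation changes: errors that concern a removed distinction no longer count against the tagger.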
Detecting errors in a PoS-tagged corpus
Part-of-speech (PoS) tagged corpora are valuable resources for developing PoS taggers. Corpora in various languages have been used to train (in the case of data-driven methods) and develop (in the case of linguistic rule-based methods) different taggers, and to evaluate their accuracy. Consequently, the quality of the PoS annotation in a corpus (the gold standard annotation) is crucial.
In this project, we experiment with three different methods of PoS error detection using the Icelandic Frequency Dictionary (IFD) corpus.
First, we use the variation n-gram method proposed by Dickinson and Meurers (2003).
Secondly, we run five different taggers on the corpus and examine those cases where all the taggers agree on a tag, but, at the same time, disagree with the gold standard annotation.
Lastly, we use IceParser to generate shallow parses of sentences in the corpus and then develop various patterns, based on feature agreement, for finding candidates for annotation errors.
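The first two methods can be sketched in Python as follows. Both functions are illustrative assumptions. In particular, variation_trigrams is a strongly simplified stand-in that fixes the context at one word on each side, whereas the method of Dickinson and Meurers (2003) considers contexts of varying length. Neither function is the code used in the project.

```python
from collections import defaultdict


def variation_trigrams(tagged_tokens):
    """Simplified variation n-gram check with n fixed at 3: report word
    trigrams whose middle word carries more than one tag in the corpus."""
    seen = defaultdict(set)
    trigrams = zip(tagged_tokens, tagged_tokens[1:], tagged_tokens[2:])
    for (w1, _), (w2, t2), (w3, _) in trigrams:
        seen[(w1, w2, w3)].add(t2)
    return {ngram: tags for ngram, tags in seen.items() if len(tags) > 1}


def unanimous_disagreements(gold_tags, tagger_outputs):
    """Return the token positions where all taggers propose the same tag
    and that tag differs from the gold standard tag."""
    candidates = []
    for i, gold in enumerate(gold_tags):
        proposed = {output[i] for output in tagger_outputs}
        if len(proposed) == 1 and gold not in proposed:
            candidates.append(i)
    return candidates
```

Both functions only produce candidates. Each flagged position still has to be inspected manually, since the gold standard tag may be correct and the taggers, or the differing corpus occurrences, may be wrong.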
The tagging accuracy of a particular text in a given language can usually be increased by combining taggers which are based on different tagging methods. In most cases, each combined tagger has been written from scratch, i.e. each developer has written the necessary program code to build the combined tagger. This is unfortunate because, generally, it entails the reproduction of code already written.
To tackle this problem, we are developing CombiTagger, a language and tagset independent system for developing and evaluating combined taggers.
The system provides algorithms for simple and weighted voting, but it is extensible so that other combination algorithms can be added easily.
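The two voting schemes can be illustrated with a short Python sketch. This is not CombiTagger's code or API: the function names, the weights, and the tie-breaking rule are assumptions chosen for illustration.

```python
from collections import defaultdict


def vote(proposals, weights=None):
    """Choose a tag for one token from the tags proposed by the taggers.

    proposals: one tag per tagger.
    weights:   one weight per tagger, or None for simple voting, in which
               case every tagger counts equally.
    """
    if weights is None:
        weights = [1.0] * len(proposals)
    scores = defaultdict(float)
    for tag, weight in zip(proposals, weights):
        scores[tag] += weight
    # On a tie, max() returns the tag that was proposed first, so the
    # order of the taggers acts as the tie-breaker (an assumption).
    return max(scores, key=scores.get)


def combine(tagger_outputs, weights=None):
    """Combine the outputs of several taggers token by token.

    tagger_outputs: one tag sequence per tagger, all of equal length.
    """
    return [vote(list(column), weights) for column in zip(*tagger_outputs)]


outputs = [["N", "V"], ["N", "N"], ["A", "N"]]
simple = combine(outputs)                # ['N', 'N']
weighted = combine(outputs, [3, 1, 1])   # ['N', 'V']
```

In weighted voting, a weight would typically reflect how much a tagger is trusted, for example its accuracy on held-out data. In the example, the high weight of the first tagger overrides the majority on the second token.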
CombiTagger is an open source system which can be obtained here.
A shallow-transfer translation system
This project, which is joint work with Universitat d'Alacant, has the following three main objectives:
- To find the most economical methods for creating the rules and data needed for a successful implementation of a shallow-transfer machine translation (STMT) system.
- To find ways of incorporating existing language technology (LT) tools into the STMT system Apertium.
- To use these means to develop a prototype of an STMT system.
These objectives will be reached by using and adapting our existing LT tools, such as IceTagger, Lemmald (a lemmatiser), and IceParser, to develop a prototype of an Icelandic-English STMT system.
The prototype will be the first open-source MT system for Icelandic. All software and data will be released as open source, which will allow other researchers and developers to improve the system further.