Journal

Measuring diachronic language distance using perplexity. Application to English, Portuguese and Spanish.

The objective of this work is to set a corpus-driven methodology to quantify automatically diachronic language distance
between chronological periods of several languages. We apply a perplexity-based measure to written text representing
different historical periods of three languages: European English, European Portuguese and European Spanish. For this
purpose, we have built historical corpora for each period, which have been compiled from different open corpus sources

Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool

Lately, discourse structure has received considerable attention due to the benefits carried out by its application in several NLP task such as opinion mining, summarization, question answering, text simplification, among others.

Neural Machine Translation of clinical texts between long distance languages

ABSTRACT

Objective: To analyze techniques for machine translation of electronic health records (EHRs) between long distance languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts or health records in Basque and Spanish.

LINGUATEC: Desarrollo de recursos lingüı́sticos para avanzar en la digitalización de las lenguas de los Pirineos

El objetivo del proyecto es desarrollar, probar y difundir nuevos recursos, nuevas herramientas y aplicaciones lingüı́sticas innovadoras para mejorar el nivel de digitalización del aragonés, vasco y occitano.

Large Scale Linguistic Processing of Tweets to Understand Social Interactions among Speakers of Less Resourced Languages: The Basque Case

Social networks like Twitter are increasingly important in the creation of new ways of communication. They have also become useful tools for social and linguistic research due to the massive amounts of public textual data available. This is particularly important for less resourced languages, as it allows to apply current natural language processing techniques to large amounts of unstructured data. In this work, we study the linguistic and social aspects of young and adult people’s behaviour based on their tweets’ contents and the social relations that arise from them.

Adapting NMT to caption translation in Wikimedia Commons for low-resource languages

This paper presents a successful domain adaptation of a general neural machine
translation (NMT) system using a bilingual corpus created with captions for images in Wiki-
media Commons for the Spanish-Basque and English-Irish pairs.
Keywords: Machine Translation, Low-resource languages, Bilingual corpora, Language
resources from Wikipedia

Pages

Subscribe to RSS - Journal