Romanian Journal of Information Science and Technology (ROMJIST)

An open – access publication

  |  HOME  |   GENERAL INFORMATION  |   ROMJIST ON-LINE  |  KEY INFORMATION FOR AUTHORS  |   COMMITTEES  |  

ROMJIST is a publication of Romanian Academy,
Section for Information Science and Technology

Editor – in – Chief:
Radu-Emil Precup

Honorary Co-Editors-in-Chief:
Horia-Nicolai Teodorescu
Gheorghe Stefan

Secretariate (office):
Adriana Apostol
Adress for correspondence: romjist@nano-link.net (after 1st of January, 2019)

Founding Editor-in-Chief
(until 10th of February, 2021):
Dan Dascalu

Editing of the printed version: Mihaela Marian (Publishing House of the Romanian Academy, Bucharest)

Technical editor
of the on-line version:
Lucian Milea (University POLITEHNICA of Bucharest)

Sponsor:
• National Institute for R & D
in Microtechnologies
(IMT Bucharest), www.imt.ro

ROMJIST Volume 24, No. 4, 2021, pp. 384-401
 

Vasile PĂIS, Radu ION, Andrei-Marius AVRAM, Maria MITROFAN, Dan TUFIȘ
In-depth evaluation of Romanian natural language processing pipelines

ABSTRACT: With the increased size of Universal Dependencies tree banks, several basic language processing kits (BLARK) for multiple languages appeared in recent years, indicating improved performances on different languages. Nevertheless, published results are not directly comparable for the Romanian language since different tools make use of different Universal Dependencies versions and different additional resources, such as pre-trained word embeddings. In this paper, we re-train several state-of-the-art tools for processing Romanian language by using a common methodology comprising of training and evaluating on the same version of RoRefTrees corpus and using the same pre-trained word embeddings from the representative corpus of contemporary Romanian language (CoRoLa). Furthermore, we also explore the capabilities of the trained models when faced with unseen text from a different domain. For this purpose, we further test the resulting model on the SiMoNERo corpus. We employ different metrics to assess the performance on operations like tokenization, sentence splitting, lemmatization, part-of-speech tagging and dependency parsing.

KEYWORDS: Natural language processing; BLARK; performance evaluation;Romanian text processing

Read full text (pdf)






  |  HOME  |   GENERAL INFORMATION  |   ROMJIST ON-LINE  |  KEY INFORMATION FOR AUTHORS  |   COMMITTEES  |