IMFD researchers organize lexical semantic change detection competition

How has the meaning of words changed over time? Did “plant” or “villain” mean the same thing in the 1800s as they do in 2000? Discovering these variations using computational models was the challenge posed to researchers from around the world who participated in the “Lexical Semantic Change Discovery Shared Task”, a competition organized by Felipe Bravo Márquez, IMFD researcher and academic at the Computer Science Department of the University of Chile (DCC U. Chile); Frank Zamora, IMFD researcher and doctoral student at DCC U. Chile; and Dominik Schlechtweg, researcher at the Institute for Natural Language Processing of the University of Stuttgart.

The aim of the competition was to detect Spanish words whose meaning has changed over time. It was held within the framework of the 3rd International Workshop on Computational Approaches to Historical Language Change 2022 (LChange ’22), hosted at ACL 2022, the premier conference in the field of Natural Language Processing (NLP). Although the workshop took place on May 26 and 27 in Dublin, Ireland, the competition ran beforehand, from February 28 to March 31, so that the results obtained by both organizers and participants could be presented during the event in May.

The competition was divided into two phases, with six teams participating in the first and seven in the second. Each team had three to four members, mostly postgraduate students, from countries such as Canada, Spain and Russia, with the participants from Russia winning each phase.

Frank Zamora explains that the participants had to develop models that make predictions about a given set of words: “The objective was to detect words that change their meaning over time. For this, we provided two datasets with words in common, one from an older period and the other from a modern period. Competitors had to create models that would allow them to detect whether these common words had changed their meaning, that is, whether they had acquired a new meaning or lost one”. To illustrate, he gives an example: “Take ‘plant’, which in 1810 perhaps most commonly referred to a flower or a tree, but today can refer to that and also to a floor in a building or an industrial plant”.
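The idea of comparing a word across an older and a modern dataset can be illustrated with a minimal Python sketch. It is not the method used by the organizers or by any participating team: it simply counts the words that appear near a target word in each period and measures how different those contexts are with a cosine distance. The function names, the window size and the toy sentences (written without accents) are all assumptions made for illustration.

```python
from collections import Counter
import math


def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no context in one period: treat as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def change_score(old_corpus, modern_corpus, target, window=2):
    """Higher score = the target's contexts differ more between periods."""
    return cosine_distance(
        context_counts(old_corpus, target, window),
        context_counts(modern_corpus, target, window),
    )


# Toy, invented sentences (hypothetical data, not from the shared task).
old_corpus = [
    "la planta crece en el jardin",
    "la planta tiene flores rojas",
    "el perro duerme en el jardin",
]
modern_corpus = [
    "la oficina esta en la tercera planta",
    "la planta industrial produce acero",
    "el perro duerme en el jardin",
]

for word in ("planta", "perro"):
    print(word, round(change_score(old_corpus, modern_corpus, word), 3))
```

In this toy data, “planta” scores higher than “perro”, whose contexts are identical in both periods. Real systems would need far larger corpora and more robust representations than raw co-occurrence counts.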

To build the two datasets, the researchers used Project Gutenberg for the older period, which spanned from 1806 to 1910, and the OPUS project for the modern period, which covered 1994 to 2020. Both Frank Zamora and Professor Felipe Bravo Márquez highlight the enormous challenge of annotating the words contained in both datasets: “Words in Spanish are generally polysemous, so it was a challenge because a lot of data had to be annotated in order to have, in advance, the results that the prediction models developed by the participants in the competition should produce”. Thanks to this work, it was possible to create the most heavily annotated dataset in this field, with 62,000 annotations of Spanish words.

Innovating in the generation of new knowledge

After the competition ended in March, the next step for Felipe Bravo Márquez, Frank Zamora and Dominik Schlechtweg was to present the paper “LSCDiscovery: A shared task on semantic change discovery and detection in Spanish” at the Workshop on Computational Approaches to Historical Language Change 2022 (LChange’22). The paper describes how the competition was organized and run, including how the annotations were made, the problem posed to the competitors, the participating teams and the models they developed. In addition, each participating team presented its own paper describing the model it developed for the competition.

The researchers highlight the way in which new knowledge is generated from these types of events. In particular, Frank Zamora says that this area of NLP, called Semantic Change Detection, is relatively new and has become very popular thanks to this type of competition. “We entered this field by participating in a competition of this same style two years ago. At that time the competition covered four languages: Swedish, Latin, German and English. Then there was a competition in Italian and then another in Russian. So we saw a niche in the Spanish language, for which there are very few resources in the field of NLP, this being the fourth competition organized worldwide”, says Zamora.

Along these lines, Felipe Bravo Márquez points out that these competitions are especially attractive for those who are starting out in research: “What is most difficult when starting out in research is finding a problem. In this type of event the problem is defined, and there is access to data and evaluation metrics. Then you just have to solve it, which makes the task much easier”. The academic also highlights the collaborative way in which the research is carried out, pointing out that “it is like doing meta-research, because you propose a problem, standardize it, invite the community to solve it, and within months you have several studies by different groups of researchers, in which you can see what works and what doesn’t, and you know that everything was developed under the same conditions and in a very transparent way”.

Source: Communications DCC U. Chile
