Dr. Georgi Cholakov, Assist. Prof. 1), Dr. Emil Doychev, Assoc. Prof. 1),
Prof. Dr. Svetla Koeva 2)
1) Plovdiv University „Paisii Hilendarski“
Faculty of Mathematics and Informatics
2) Institute for Bulgarian Language „Prof. Lyubomir Andreychin“ -
Bulgarian Academy of Sciences
https://doi.org/10.53656/math2024-1-1-cha
Abstract. The article presents the challenges of implementing a system for data retrieval and visualisation from the Internet by crawling language resources from the Hugging Face repository and extracting the associated data. The data in the system is updated at regular intervals to track the dynamics of language resource creation over different time periods. The article presents: a) the analysis of the available data and its structure; b) the chosen method for crawling the pages and extracting the data. The experience shared here in overcoming these specific challenges can help solve similar problems in extracting data from the Internet, a task that arises in many projects (including school projects).
Keywords: web crawling; automatic data extraction; linguistic datasets