Buldialect - Measuring Linguistic Unity and Diversity in Europe (2006-2010) was a joint project of the University of Tübingen, the University of Groningen and the Bulgarian Academy of Sciences, sponsored by the Volkswagen Stiftung. The aims of the project were to a) create a digital database of Bulgarian phonetic and lexical dialect data; b) analyze the data using existing statistical methods from dialectometry and compare the results with traditional scholarship; c) develop new quantitative methods for analyzing dialect and language data. During the project, pronunciation variants of 156 words from 197 villages in Bulgaria were compiled and digitized. The lexical database consists of lexical variants of 110 words collected from the same set of villages. Applying different quantitative techniques, both established ones and those developed during the project, to the Bulgarian pronunciation data has shown that some of the traditional divisions of this area have to be questioned and that they were not based purely on linguistic criteria.
- Erhard Hinrichs, PI, University of Tübingen
- Georgi Kolev, University of Sofia
- John Nerbonne, University of Groningen
- Petya Osenova, Bulgarian Academy of Sciences
- Jelena Prokić, University of Groningen
- Petar Shishkov, University of Sofia
- Kiril Simov, Bulgarian Academy of Sciences
- Thomas Zastrow, University of Tübingen
- Vladimir Zhobov, University of Sofia
The data was prepared by Prof. Dr. Vladimir Zhobov, Georgi Kolev and Petar Shishkov of Sofia University. It is free for research purposes.
If you use phonetic data from the Buldialect project, please cite the following paper:
Jelena Prokić, John Nerbonne, Vladimir Zhobov, Petya Osenova, Kiril Simov, Thomas Zastrow, and Erhard Hinrichs. The computational analysis of Bulgarian dialect pronunciation. Serdica Journal of Computing, 3(3):269–298, 2009.
Manually corrected, multiple-string-aligned Buldialect data (IPA-transcribed) is available in Data Packages format. It consists of the phonetic variants of 152 words collected at 197 sites and is annotated for various linguistic phenomena such as metathesis and phonetic mergers and splits. This data is suitable both for (geo)linguistic analyses and for testing alignment algorithms. To obtain the data in this format, please send an e-mail to jelena dot prokic at gmail dot com.
A subset of the Buldialect data specifically designed for testing alignment algorithms is part of the BDPA database and can be found here.
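As a rough illustration of the kind of pairwise comparison that alignment algorithms tested on such data perform, the following minimal sketch computes the Levenshtein (edit) distance between two segmented transcriptions. It is not the project's own implementation. The function name `edit_distance` and the two example transcriptions are hypothetical and were made up for illustration; they are not taken from the Buldialect data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phonetic segments.

    Insertions, deletions and substitutions all cost 1 (an assumption;
    dialectometric work often uses graded segment distances instead).
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty sequence
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # delete seg_a
                            curr[j - 1] + 1,      # insert seg_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical, made-up variants of one word at two sites (not project data).
variant_a = ["m", "l", "ɛ", "k", "o"]
variant_b = ["m", "l", "j", "a", "k", "o"]

distance = edit_distance(variant_a, variant_b)
# Normalizing by the longer variant keeps word length from dominating.
normalized = distance / max(len(variant_a), len(variant_b))
print(distance, round(normalized, 3))
```

Transcriptions are passed as lists of segments rather than raw strings so that a segment written with several Unicode code points (for example a base symbol plus a diacritic) can be treated as one unit.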
This is a list of publications in which data from the Buldialect project was used. Please send me an e-mail if any publications need to be added.
Johann-Mattis List and Jelena Prokić. A benchmark database of phonetic alignments in historical linguistics and dialectology. In Proceedings of the International Conference on Language Resources and Evaluation (LREC) 2014, 26 May - 31 May 2014, Reykjavik, Iceland.
Jelena Prokić and Michael Cysouw. Combining regular sound correspondences and geographic spread. Language Dynamics and Change 3.2. Special issue, "Phylogeny and Beyond: Quantitative Diachronic Approaches to Language Diversity", edited by Michael Dunn, pages 147-168, 2013.
Jelena Prokić and John Nerbonne. Analyzing dialects biologically. In Classification and Evolution in Biology, Linguistics and the History of Science. Concepts, Methods, Visualization. Steiner Verlag, Stuttgart, 2013.
Johann-Mattis List. Multiple sequence alignment in historical linguistics. A sound class based approach. Proceedings of ConSOLE XIX (2011), pages 241-260. 2012. [pdf]
Martijn Wieling, Eliza Margaretha, and John Nerbonne. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2):307–314, 2012.
Martijn Wieling, Eliza Margaretha, and John Nerbonne. Inducing phonetic distances from dialect variation. CLIN Journal, 1:109–118, 2011.
Jelena Prokić. Families and Resemblances. PhD thesis, University of Groningen, 2010.
Jelena Prokić and Tim Van de Cruys. Exploring dialect phonetic variation using PARAFAC. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 46–53, Uppsala, Sweden, July 2010. Association for Computational Linguistics. [pdf]
Peter Houtzagers, John Nerbonne, and Jelena Prokić. Quantitative and traditional classifications of Bulgarian dialects compared. Scando Slavica, 56(2):163–188, 2010.
John Nerbonne, Jelena Prokić, Martijn Wieling, and Charlotte Gooskens. Some further dialectometrical steps. In: G. Aurrekoexea & J.L. Ormaetxea (eds.) Tools for Linguistic Variation. Bilbao: Supplements of the Anuario de Filologia Vasca "Julio de Urquijo", LIII, pp. 41-56. 2010.
Jelena Prokić, Martijn Wieling, and John Nerbonne. Multiple sequence alignments in linguistics. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009), pages 18–25, Athens, Greece, March 2009. Association for Computational Linguistics. [pdf]
Martijn Wieling, Jelena Prokić, and John Nerbonne. Evaluating the pairwise string alignment of pronunciations. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009), pages 26–34, Athens, Greece, March 2009. Association for Computational Linguistics. [pdf]
Jelena Prokić, John Nerbonne, Vladimir Zhobov, Petya Osenova, Kiril Simov, Thomas Zastrow, and Erhard Hinrichs. The computational analysis of Bulgarian dialect pronunciation. Serdica Journal of Computing, 3(3):269–298, 2009. [pdf]
Jelena Prokić and John Nerbonne. Recognizing groups among dialects. International Journal of Humanities and Arts Computing, Special Issue on Language Variation, pages 153–172, 2008. [pdf]
Jelena Prokić. Application of phylogenetic methods on the dialect pronunciation data. In Proceedings of the RANLP Workshop on Computational Phonology, Borovets, 2007.
Erhard Hinrichs and Thomas Zastrow. Novel approaches to computational dialectometry - vector analysis and information theory. In Proceedings of the RANLP Workshop on Computational Phonology, pages 37–41, Borovets, 2007.
Erhard Hinrichs and Thomas Zastrow. A vector-based approach to dialectometry. In Proceedings of the 17th Meeting of Computational Linguistics in the Netherlands, pages 21–33, Leuven, 2007.
Jelena Prokić. Identifying linguistic structure in a quantitative analysis of dialect pronunciation. In Proceedings of the ACL 2007 Student Research Workshop, pages 61–66, Prague, Czech Republic, June 2007. Association for Computational Linguistics. [pdf]