CORPUS-BASED TERMS EXTRACTION IN LINGUISTICS DOMAIN FOR INDONESIAN LANGUAGE

Authors

  • Wahyu Maulana Universitas Sumatera Utara
  • Eddy Setia Universitas Sumatera Utara

DOI:

https://doi.org/10.22216/kata.v6i2.908

Keywords:

Corpus, Terminology, Term Extraction

Abstract

This research aims to extract mono-lexical and poly-lexical terms from the linguistics domain in Indonesian. Because the boundary between terminology and lexicology is blurry, the research applies the Communicative Theory of Terminology (CTT) by Cabré to the term extraction procedure. A corpus-based terminology method is used to obtain the best possible mono-lexical and poly-lexical terms, and AntConc serves as the instrument for compiling the general and the specialized corpus. Although the extracted output is noisy, further manual analysis to delimit the terms makes the procedure semi-automatic. The results show that constraints on language and word structure help to delimit the extracted mono-lexical terms. Furthermore, the extracted mono-lexical terms act as the starting point for the poly-lexical terms.
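The general idea of comparing a specialized corpus against a general one, and then using the resulting single-word candidates as seeds for multi-word candidates, can be sketched as follows. This is only an illustrative sketch and not the authors' procedure: the abstract does not say which keyness statistic or settings were used, so the log-likelihood score, the 3.84 threshold, the bigram-based seed expansion, the function names, and the toy Indonesian tokens are all assumptions made for this example.

```python
import math
from collections import Counter


def log_likelihood(freq_spec, size_spec, freq_ref, size_ref):
    """Log-likelihood keyness of one word (assumed statistic, not from the source)."""
    combined = freq_spec + freq_ref
    total = size_spec + size_ref
    expected_spec = size_spec * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    if freq_spec > 0:
        score += freq_spec * math.log(freq_spec / expected_spec)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * score


def mono_lexical_candidates(spec_tokens, ref_tokens, threshold=3.84):
    """Words that are unusually frequent in the specialized corpus.

    The threshold 3.84 (chi-square critical value, p < 0.05, 1 d.f.) is an
    assumption chosen for illustration.
    """
    spec_counts = Counter(spec_tokens)
    ref_counts = Counter(ref_tokens)
    n_spec, n_ref = len(spec_tokens), len(ref_tokens)
    candidates = {}
    for word, f_spec in spec_counts.items():
        f_ref = ref_counts.get(word, 0)
        # Skip words that are not relatively more frequent in the specialized corpus.
        if f_spec / n_spec <= f_ref / n_ref:
            continue
        score = log_likelihood(f_spec, n_spec, f_ref, n_ref)
        if score >= threshold:
            candidates[word] = score
    return candidates


def poly_lexical_candidates(spec_tokens, seeds, min_freq=2):
    """Bigrams containing at least one mono-lexical seed (hypothetical expansion step)."""
    bigrams = Counter(zip(spec_tokens, spec_tokens[1:]))
    return {
        pair: count
        for pair, count in bigrams.items()
        if count >= min_freq and (pair[0] in seeds or pair[1] in seeds)
    }


# Toy corpora invented for this example; they are not data from the study.
specialized = "fonem dan morfem adalah satuan bahasa struktur fonem dan struktur morfem".split() * 5
general = "saya dan kamu pergi ke pasar dan membeli buah".split() * 5

mono = mono_lexical_candidates(specialized, general)
poly = poly_lexical_candidates(specialized, set(mono))
print(sorted(mono, key=mono.get, reverse=True))
print(sorted(poly, key=poly.get, reverse=True))
```

A candidate list produced this way still contains non-terms, which is consistent with the abstract's point that the automatic output is noisy and needs manual delimitation afterwards.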

References

Anthony, L. (2014) AntConc (Windows, Macintosh OS X, and Linux) Getting Started (No installation necessary).

Anthony, L. (2017) ‘AntFileConverter’. Tokyo. Available at: https://www.laurenceanthony.net/software.

Anthony, L. (2019) ‘AntConc’. Tokyo. Available at: https://www.laurenceanthony.net/software.

Atkielski, A. (2005) ‘Using Phonetic Transcription in Class’, Using Phonetic Transcription in Class, pp. 1–12.

Bourigault, D. and Jacquemin, C. (1999) ‘TERM EXTRACTION + TERM CLUSTERING: An Integrated Platform for Computer-Aided Terminology’, in Proceedings of EACL ’99 of the ninth conference on European chapter of the Association for Computational Linguistics, pp. 15–22.

Cabré, M. T. (1998) Terminology: Theory, methods and applications. 1st edn. Edited by J. C. Sager. Amsterdam: John Benjamins Publishing Company.

Christanty, V., Pragantha, J. and Victor (2016) ‘Part-of-Speech Tagging untuk Bahasa Indonesia Menggunakan Stanford POS-Tagging’, in Seminar Nasional Teknologi Informasi 2016, pp. 179–185.

Dima, G. (2012) ‘A Terminological Approach to Dictionary Entries. A Case Study’, in Procedia - Social and Behavioral Sciences, pp. 93–98. doi: 10.1016/j.sbspro.2012.10.016.

Dinakaramani, A. et al. (2014) ‘Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus’, Proceedings of the International Conference on Asian Language Processing 2014, IALP 2014, pp. 66–69. doi: 10.1109/IALP.2014.6973519.

Elfkih, F. and Omri, M. N. (2012) ‘A Linguistic Model for Terminology Extraction based Conditional Random Fields’, in ICCRK’2012: International Conference on Computer Related Knowledge. doi: 10.13140/RG.2.1.3530.7685.

Faber, P. (2014) ‘Frames as a framework for terminology’, in Kockaert, H. and Steurs, F. (eds) Handbook of Terminology. Hardbound. John Benjamins, pp. 14–33. doi: 10.1075/hot.1.02fra1.

Faber, P. and Martinez, S. M. (2019) ‘Terminology’, The ASHA Leader, 24(6), pp. 28–29. doi: 10.1044/leader.ppl.24062019.28.

Faber, P. and Rodríguez, C. I. L. (2012) ‘Terminology and specialized language’, in A Cognitive Linguistics View of Terminology and Specialized Language, pp. 9–32.

Fu, S. et al. (2018) ‘Towards Indonesian Part-of-Speech Tagging: Corpus and Models’, in Yang, E. and Sun, L. (eds) Proceedings of LREC 2018 Workshop on Belt and Road LRE. European Language Resources Association (ELRA), pp. 2–7. Available at: http://universaldependencies.org/.

Goh, G.-Y. (2011) ‘Choosing a Reference Corpus for Keyword Calculation’, Linguistic Research, 28(1), pp. 239–256. doi: 10.17250/khisli.28.1.201104.013.

Kamayani, M. (2019) ‘Perkembangan Part-of-Speech Tagger Bahasa Indonesia’, Jurnal Linguistik Komputasional (JLK), 2(2), p. 34. doi: 10.26418/jlk.v2i2.20.

Leech, G. (2002) ‘The Importance of Reference Corpora’, UZEI, pp. 1–11.

Marzá, N. E. (2008) The communicative theory of Terminology (CTT) applied to the development of a corpus-based specialised dictionary of the ceramics industry. Universitat Jaume I. Available at: http://cbueg-mt.iii.com/iii/encore/record/C__Rb5001995__Scorpus dictionary__P0%2C2__Orightresult__X2;jsessionid=191AC477BB19F65A149F8705AE2B575C?lang=cat&suite=def.

Nelson, G. (2000) ‘An Introduction to Corpus Linguistics’, Journal of English Linguistics, 28(2), pp. 193–196. doi: 10.1177/00754240022004965.

Noortyani, R. (2017) Buku Ajar Sintaksis. 1st edn. Edited by M. Arsyad. Yogyakarta: Penebar Pustaka Media. Available at: https://scholar.google.co.id/scholar?hl=id&as_sdt=0%2C5&q=jurnal+artikel+ilmiah&btnG=.

Pasanen, P. (2005) ‘A Term List or a Noise List? How Helpful is Term Extraction Software when Finnish Terms are Concerned?’, in Madsen, B. N. and Thomsen, H. E. (eds) 7th International Conference on Terminology and Knowledge Engineering, pp. 375–384.

Pazienza, M. T., Pennacchiotti, M. and Zanzotto, F. M. (2005) ‘Terminology extraction: An analysis of linguistic and statistical approaches’, Studies in Fuzziness and Soft Computing, 185(2005), pp. 255–279. doi: 10.1007/3-540-32394-5_20.

Peñas, A., Verdejo, F. and Gonzalo, J. (2001) ‘Corpus-based terminology extraction applied to information access’, Proceedings of Corpus Linguistics 2001, (August 2013).

Peñas, A., Verdejo, F. and Gonzalo, J. (2002) ‘Terminology Retrieval: Towards a synergy between thesaurus and free text searching’, Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), 2527(Hertzberg), pp. 684–693. doi: 10.1007/3-540-36131-6_70.

Toriida, M.-C. (2017) ‘Steps for Creating a Specialized Corpus and Developing an Annotated Frequency-Based Vocabulary List’, TESL Canada Journal, 34(1), pp. 87–105. doi: 10.18806/tesl.v34i1.1257.

Yuliawati, S., Suhardijanto, T. and Hidayat, R. S. (2018) ‘A Corpus-based Analysis of the Terminology of the Social Sciences and Humanities’, in IOP Conference Series: Earth and Environmental Science. IOP Publishing. doi: 10.1088/1755-1315/175/1/012109.

Published

2022-10-30