
Automatic Indexing for Agriculture: Designing a Framework by Deploying Agrovoc, Agris and Annif


Affiliations
1 SRF, Department of Library and Information Science, Kalyani University, Nadia – 741235, West Bengal, India
     



Machine learning offers several ways to automate subject indexing. One popular strategy is supervised learning: a model is trained on a set of documents that have been manually indexed by subject using a standard vocabulary. The resulting model can then predict the subjects of new, previously unseen documents by applying the patterns it learned from the training data. The first step is to gather a large dataset of documents in which each document has been manually assigned a set of subject keywords/descriptors from a controlled vocabulary (e.g., Agrovoc). Next, the dataset (obtained from Agris) is divided into i) a training dataset, used to train the model, and ii) a test dataset, used to evaluate the model's performance. This research is an attempt to apply Annif (http://annif.org/), an open-source AI/ML framework, to autogenerate subject keywords/descriptors for documentary resources in the domain of agriculture. The training dataset is obtained from Agris, which applies the Agrovoc thesaurus as a vocabulary tool (https://www.fao.org/agris/download).
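The train/test workflow described in the abstract can be illustrated with a minimal Python sketch. This is not Annif's API or the author's implementation: the function names (`split_dataset`, `train`, `suggest`, `f1_score`), the word co-occurrence "model", and the sample records are all hypothetical, chosen only to show how manually indexed records are split, how a model learns from the training part, and how its suggestions might be scored against the held-out descriptors.

```python
import random
from collections import Counter, defaultdict


def split_dataset(records, test_fraction=0.2, seed=42):
    """Shuffle records reproducibly and split them into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(train_records):
    """Count how often each word co-occurs with each assigned descriptor.

    This toy co-occurrence table stands in for a real supervised model.
    """
    word_to_subjects = defaultdict(Counter)
    for text, subjects in train_records:
        for word in set(text.lower().split()):
            word_to_subjects[word].update(subjects)
    return word_to_subjects


def suggest(model, text, limit=2):
    """Return up to `limit` descriptors with the highest co-occurrence score."""
    scores = Counter()
    for word in set(text.lower().split()):
        scores.update(model.get(word, {}))
    return {subject for subject, _ in scores.most_common(limit)}


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for two sets of descriptors."""
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented (text, descriptors) pairs; real records would come from Agris
    # with descriptors drawn from Agrovoc.
    records = [
        ("rice paddy irrigation methods", {"rice", "irrigation"}),
        ("wheat rust disease control", {"wheat", "plant diseases"}),
        ("drip irrigation for rice fields", {"rice", "irrigation"}),
        ("fungal disease in wheat crops", {"wheat", "plant diseases"}),
        ("irrigation scheduling in rice", {"rice", "irrigation"}),
    ]
    train_set, test_set = split_dataset(records, test_fraction=0.2)
    model = train(train_set)
    for text, actual in test_set:
        predicted = suggest(model, text)
        print(text, sorted(predicted), f1_score(predicted, actual))
```

The score is computed only on the test records, which the model never saw during training, so it estimates how well the suggestions would generalize to new documents.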

Keywords

Agriculture, Annif, Automatic Subject Indexing, Ensemble, Neural Network, Openrefine, Subject Indexing.
About The Author

Mustak Ahmed
SRF, Department of Library and Information Science, Kalyani University, Nadia – 741235, West Bengal, India




DOI: https://doi.org/10.17821/srels%2F2023%2Fv60i2%2F170966