
Training Tree Adjoining Grammars with Huge Text Corpus Using Spark Map Reduce


Affiliations
1 Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, India
     



Tree adjoining grammars (TAGs) are mildly context-sensitive formalisms used mainly for modelling natural languages. Use of and research on these psycholinguistic formalisms have been erratic over the past decade, because the grammars are demanding to construct and difficult to parse. However, they hold promise for formalism-based NLP in multilingual scenarios. In this paper we demonstrate a basic synchronous tree adjoining grammar for the English-Tamil language pair that can be used readily for machine translation. We have also developed a multithreaded chart parser that produces ambiguous deep structures and a par dependency structure known as the TAG derivation. We then focus on a model for training this TAG for each language on a large text corpus, using a map reduce frequency count model in Spark, followed by estimation of various probabilistic parameters for the grammar trees; these parameters can be used to perform statistical parsing with the trained grammar.

Keywords

TAGs, Spark, Probabilistic Grammar, RDDs, Parsing.
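
The frequency-count training step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python to imitate the map and reduce phases that Spark would run over an RDD, and it assumes that each parsed sentence yields a list of elementary tree names (the names `alpha_NP`, `alpha_VP` and `beta_Adj` are hypothetical). The abstract does not say which probabilistic parameters are estimated, so simple relative frequency is used here as a stand-in.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: one list of elementary tree names per parsed sentence.
# In the paper's setting these would come from TAG derivations of a large corpus.
derivations = [
    ["alpha_NP", "alpha_VP", "beta_Adj"],
    ["alpha_NP", "alpha_VP"],
    ["alpha_NP", "beta_Adj", "beta_Adj"],
]


def map_phase(derivation):
    """Emit one (tree_name, 1) pair per tree occurrence (the 'map' step)."""
    return [(tree, 1) for tree in derivation]


def reduce_phase(pairs):
    """Sum the counts for each tree name (the 'reduce by key' step)."""
    def merge(acc, pair):
        tree, n = pair
        acc[tree] += n
        return acc
    return reduce(merge, pairs, Counter())


def relative_frequencies(counts):
    """Assumed estimator: each tree's share of all tree occurrences."""
    total = sum(counts.values())
    return {tree: n / total for tree, n in counts.items()}


# Flatten the per-sentence pairs, then aggregate and normalise.
pairs = [pair for d in derivations for pair in map_phase(d)]
counts = reduce_phase(pairs)
probs = relative_frequencies(counts)

for tree, p in sorted(probs.items()):
    print(f"{tree}: count={counts[tree]}, p={p:.3f}")
```

In an actual Spark job the same shape would typically be expressed with a flat-map over the corpus followed by a keyed reduction, so that counting is distributed across partitions; only the final normalisation needs the global total.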

Authors

Vijay Krishna Menon
Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, India
S. Rajendran
Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, India
K. P. Soman
Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, India
