Classification of SDSS Photometric Data Using Machine Learning on A Cloud


Affiliations
1 International Institute of Information Technology-Bangalore, 26/C, Electronics City, Bengaluru - 560 100, India
 

Astronomical datasets are typically very large, and manually classifying the data in them is effectively impossible. We use machine learning algorithms to classify (as stars, quasars and galaxies) more than one billion objects observed photometrically in the Third Data Release of the Sloan Digital Sky Survey (SDSS-III). We used kNN, SVM and random forest algorithms in a distributed environment over the cloud to classify 1,183,850,913 unclassified photometric objects in the SDSS-III catalog. This catalog contains photometric data for all objects observed through the telescope and spectroscopic data for a small fraction of them. Although all the objects could in principle be classified using spectroscopic data, it is impractical to obtain such data for every one of them. Classifying such a large dataset on a single machine would be impractically slow, so we used the Spark cluster computing framework to implement a distributed computing environment over the cloud. We found that writing results (dozens of gigabytes) to cloud storage is very slow when using kNN. Although writing the results with SVM is faster because it is done in parallel, its accuracy is only around 87%, because Spark lacks a kernel SVM implementation. We then used the random forest algorithm to classify the entire set of 1,183,850,913 objects with an accuracy of 94% in about 17 hours of processing time. The result set is significant because even collecting spectroscopic data for this many objects would take decades, and our classifications can help astronomers and astrophysicists carry out further studies.
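
The workflow summarized in the abstract (learn from the small set of objects that have spectroscopic labels, then assign classes to objects that have only photometric data) can be illustrated with a minimal kNN sketch. This is not the authors' implementation: their work ran on Spark in the cloud, whereas the sketch below is single-machine, plain Python. The two features are hypothetical stand-ins for photometric colours, the training values are made up for illustration and are not SDSS measurements, and the names `LABELED`, `knn_classify` and `classify_all` are assumptions introduced here.

```python
import math
from collections import Counter

# Toy, hand-made training set: (feature vector, class label).
# The two features are hypothetical stand-ins for photometric colours;
# the numbers are illustrative only and are NOT SDSS measurements.
LABELED = [
    ((0.20, 0.10), "star"),
    ((0.30, 0.20), "star"),
    ((0.25, 0.15), "star"),
    ((1.50, 0.90), "galaxy"),
    ((1.60, 1.00), "galaxy"),
    ((1.40, 0.80), "galaxy"),
    ((0.10, 1.80), "quasar"),
    ((0.20, 1.90), "quasar"),
    ((0.15, 2.00), "quasar"),
]


def knn_classify(point, labeled, k=3):
    """Return the majority label among the k labeled points nearest to `point`."""
    # Rank every labeled example by Euclidean distance to the query point
    # and keep the k closest ones.
    nearest = sorted(labeled, key=lambda item: math.dist(point, item[0]))[:k]
    # Count the labels of those neighbours and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classify_all(points, labeled, k=3):
    """Classify each unlabeled point independently of the others."""
    return [knn_classify(p, labeled, k) for p in points]


if __name__ == "__main__":
    # Objects with photometric features only (again, made-up values).
    unlabeled = [(0.22, 0.12), (1.55, 0.95), (0.12, 1.85)]
    print(classify_all(unlabeled, LABELED))
```

Because each unlabeled object is classified independently of the others, the prediction step parallelizes naturally across many machines, which is consistent with the distributed setting the abstract describes.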

Keywords

Astronomical Data, Classification, Cloud Computing, Distributed Algorithms, Machine Learning.

Authors

Vishwanath Acharya
International Institute of Information Technology-Bangalore, 26/C, Electronics City, Bengaluru - 560 100, India
Piyush Singh Bora
International Institute of Information Technology-Bangalore, 26/C, Electronics City, Bengaluru - 560 100, India
Karri Navin
International Institute of Information Technology-Bangalore, 26/C, Electronics City, Bengaluru - 560 100, India
Anisha Nazareth
International Institute of Information Technology-Bangalore, 26/C, Electronics City, Bengaluru - 560 100, India
P. S. Anusha
International Institute of Information Technology-Bangalore, 26/C, Electronics City, Bengaluru - 560 100, India
Shrisha Rao
International Institute of Information Technology-Bangalore, 26/C, Electronics City, Bengaluru - 560 100, India

DOI: https://doi.org/10.18520/cs%2Fv115%2Fi2%2F249-257