Classification of SDSS Photometric Data Using Machine Learning on A Cloud

Vishwanath Acharya; Piyush Singh Bora; Karri Navin; Anisha Nazareth; P. S. Anusha; Shrisha Rao

Abstract

Astronomical datasets are typically very large, and manually classifying the data in them is effectively impossible. We use machine learning algorithms to classify (as stars, quasars or galaxies) more than one billion objects observed photometrically in the Third Data Release of the Sloan Digital Sky Survey (SDSS-III). We have used kNN, SVM and random forest algorithms in a distributed environment over the cloud to classify 1,183,850,913 unclassified photometric objects present in the SDSS-III catalog. This catalog contains photometric data for all objects viewed through the telescope and spectroscopic data for a small fraction of them. Although all the objects could in principle be classified from spectroscopic data, it is impractical to obtain such data for every one of them. Classifying such a large dataset on a single machine would be impractically slow, so we have used the Spark cluster computing framework to implement a distributed computing environment over the cloud. We found that writing the results (dozens of gigabytes) to cloud storage is very slow when using kNN. Writing the results with SVM is faster because it is done in parallel, but its accuracy is only around 87%, because Spark lacks a kernel implementation of SVM. We then used the random forest algorithm to classify the entire set of 1,183,850,913 objects with an accuracy of 94% in about 17 hours of processing time. The result set is significant because even collecting spectroscopic data for this many objects would take decades, and our classifications can help astronomers and astrophysicists carry out further studies.
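
To give a sense of scale, the two figures quoted above (1,183,850,913 objects and about 17 hours) imply an average throughput for the random forest run. The short sketch below only does that arithmetic. The function name is hypothetical, and because the abstract says "about 17 hours", the result should be read as an approximate average over the whole cluster, not a measured rate.

```python
# Figures quoted in the abstract; "about 17 hours" is approximate,
# so the derived rate is only a rough average.
TOTAL_OBJECTS = 1_183_850_913
HOURS = 17


def throughput_per_second(n_objects: int, hours: float) -> float:
    """Average number of objects classified per second of processing time."""
    return n_objects / (hours * 3600)


rate = throughput_per_second(TOTAL_OBJECTS, HOURS)
print(f"{rate:,.0f} objects per second")  # prints "19,344 objects per second"
```

At this rate the cluster as a whole classifies roughly nineteen thousand objects every second.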

Keywords

Astronomical Data, Classification, Cloud Computing, Distributed Algorithms, Machine Learning.

Current Science
