
Parallel Processing Scheme for Minimizing Computational and Communication Cost of Bioinformatics Data


Affiliations
1 Division of Information and Communication Convergence Engineering, Mokwon University, Korea, Republic of
 

Since the completion of the Human Genome Project, the volume of genomic data produced routinely has far exceeded the capacity of humans to digest it and identify the underlying phenomena. Bioinformatics plays a significant role in interpreting large amounts of genomic data and helps scientists develop new treatments for genetic diseases. Databases are essential to bioinformatics research and applications, so the hundreds of bioinformatics databases, together with the inconsistent terminology and data formats of their varied sources, must be managed efficiently. This paper proposes a parallel processing scheme that aids the understanding of bioinformatics data in heterogeneous network environments. The proposed scheme creates a hierarchical representation of bioinformatics datasets in Hadoop using fuzzy relation theory. The internal relations of the data are computed via the fuzzy relational product, and the external relations are computed via data exchanges among network nodes. By considering only these internal and external relations, irrespective of the data's types, functionalities, and characteristics, the scheme reduces the computational cost of analyzing, correlating, and visualizing bioinformatics data. Hadoop provides distributed storage and parallel processing of huge volumes of data, which speeds up processing and communication. In addition, the scheme adopts Apache Hive to improve the analysis of distributed bioinformatics data in Hadoop.
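
The abstract does not state which fuzzy relational product the scheme uses. As an illustration only, the sketch below assumes the Bandler-Kohout triangle subproduct with the Łukasiewicz implication, averaged over the intermediate set. The function names and the small matrices in the usage note are hypothetical and are not taken from the paper.

```python
def lukasiewicz(x, y):
    """Lukasiewicz implication x -> y on membership degrees in [0, 1]."""
    return min(1.0, 1.0 - x + y)


def fuzzy_relational_product(R, S, implies=lukasiewicz):
    """Mean-based triangle subproduct of two fuzzy relations (assumed form).

    R is an |A| x |B| matrix and S is a |B| x |C| matrix of membership
    degrees. Entry (a, c) of the result is the average over b of
    implies(R[a][b], S[b][c]), i.e. the degree to which everything
    related to a by R is also related to c by S.
    """
    n_b = len(S)
    if n_b == 0 or any(len(row) != n_b for row in R):
        raise ValueError("inner dimensions of R and S must match and be non-empty")
    n_c = len(S[0])
    return [
        [
            sum(implies(R[a][b], S[b][c]) for b in range(n_b)) / n_b
            for c in range(n_c)
        ]
        for a in range(len(R))
    ]
```

For example, with the hypothetical relations `R = [[1.0, 0.0], [0.5, 0.5]]` and `S = [[1.0, 0.0], [0.0, 1.0]]`, the call `fuzzy_relational_product(R, S)` yields `[[1.0, 0.5], [0.75, 0.75]]`. A different implication operator can be passed through the `implies` parameter if another definition of the product is intended.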

Keywords

Bioinformatics, Fuzzy Relation, Hadoop, Heterogeneous Environment, Visualization

Authors

Yoon-Su Jeong
Division of Information and Communication Convergence Engineering, Mokwon University, Korea, Republic of




DOI: https://doi.org/10.17485/ijst%2F2015%2Fv8i15%2F75336