HDF5 Parallelization for Hierarchical Semi-Sparse Data Cubes


Paper:	HDF5 Parallelization for Hierarchical Semi-Sparse Data Cubes
Volume:	535, Astronomical Data Analysis Software and Systems XXXI
Page:	115
Authors:	Nadvornik, J.; Skoda, P.; Tvrdik, P.
Abstract:	Big Data is not only about large volumes but also about a higher number of data dimensions. For every observed astronomical object, we usually have multiple observations at different times, wavelengths, and polarizations, sometimes made by different instrument types. Intuitively, taking all of the relevant information into account will produce higher-quality results for classification or clustering algorithms than focusing on a single aspect of the object. Most often these are spectroscopic and photometric observations, which can be combined into data cubes. With the Hierarchical Semi-Sparse data cubes (HiSS cubes) engine, we combine spectral and imaging data within the HDF5 format for efficient use by machine learning algorithms and visualization. The HiSS cube ensures this efficiency by implementing an indexing mechanism within HDF5 that also takes advantage of the native chunking feature. The preprocessing that rescales the spectral and photometric measurements so that they are directly comparable takes significant time. It therefore needs to be parallelized, and this parallelization also takes advantage of the native HDF5 parallel I/O feature. This contribution focuses on the parallel performance of the Python (h5py) version of the HDF5-based solution in the construction of the HiSS cube.
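
The combination of chunking and parallel I/O mentioned in the abstract can be illustrated with a minimal h5py sketch. This is not the HiSS cube implementation: the function names, the dataset name "spectra", the chunk shape, and the block-wise split of rows across MPI ranks are all assumptions made for illustration, and the rescaling step is replaced by a placeholder array. The writer function assumes h5py built against parallel HDF5 together with mpi4py.

```python
def rank_slice(n_items, n_ranks, rank):
    """Return the contiguous [start, stop) block of items owned by `rank`.

    Items are split as evenly as possible; the first `n_items % n_ranks`
    ranks receive one extra item each.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def write_cube_parallel(path, n_spectra, n_bins, chunk_rows=64):
    """Hypothetical writer: each MPI rank fills its own block of rows."""
    # Imported lazily so rank_slice() stays usable without an MPI-enabled
    # h5py build. These imports need parallel HDF5 and mpi4py.
    import numpy as np
    import h5py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    start, stop = rank_slice(n_spectra, comm.size, comm.rank)

    # The "mpio" driver opens the file collectively on all ranks.
    with h5py.File(path, "w", driver="mpio", comm=comm) as f:
        # Dataset creation modifies file metadata, so every rank must
        # make the same call with the same arguments.
        dset = f.create_dataset(
            "spectra",  # assumed dataset name
            shape=(n_spectra, n_bins),
            dtype="f4",
            chunks=(min(chunk_rows, n_spectra), n_bins),  # assumed chunk shape
        )
        # Placeholder for the preprocessing (rescaling) result of this rank.
        block = np.full((stop - start, n_bins), comm.rank, dtype="f4")
        # Each rank writes only the rows it owns.
        dset[start:stop, :] = block
```

A script that calls write_cube_parallel would typically be launched with something like `mpiexec -n 4 python script.py`. Aligning each rank's row block with chunk boundaries is a common way to keep ranks from touching the same chunk, though whether that matches the HiSS cube layout is not stated in the abstract.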