Towards Large-scale RoI Indexing for Content-aware Data Discovery


Paper:	Towards Large-scale RoI Indexing for Content-aware Data Discovery
Volume:	522, Astronomical Data Analysis Software and Systems XXVII
Page:	57
Authors:	Araya, M.; Caceres, R.; Gutierrez, L.; Mendoza, M.; Ponce, C.; Valenzuela, C.
Abstract:	Data discovery within large archives is a key issue for modern astronomy: multi-source, multi-wavelength, multi-instrument and large-scale verifications need proper data discovery tools to filter the very large datasets of observations now available. The Virtual Observatory and file format standards have enabled data discovery at the metadata level, where filtering is limited to what was explicitly annotated at the observation, calibration or data reduction stages. The next step is to perform data discovery at the content level, where content descriptors are automatically gathered from the observations to support content-aware search. In a very general sense, this corresponds to automatically generating catalogs from large and diverse datasets. In this work, we consider the public spectroscopic data products from ALMA (FITS cubes), and we apply the fast Region of Interest Seek and Extraction algorithm (RoiSE) to obtain content descriptors of the spatial forms, positions, intensities and wavelengths of the source emissions. Despite the efficiency of the algorithm, it is impractical to process all the data in a batch/sequential manner. The problem was therefore to decide which tools and architecture to use to distribute the tasks across the datacenter. Among the several distributed/parallel computing alternatives, we selected the Dask packages to build the distributed pipeline that we outline in this paper, mainly because the current RoiSE implementation is written in Python. The main challenge of this pipeline is the diversity of data products: different resolutions, signal-to-noise ratios, densities, morphologies, imaging parameters, etc. Therefore, we include an adaptive parameter tuning mechanism to cope with this diversity. Finally, we present an example of content-aware data discovery over the resulting database.
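
The pipeline shape described in the abstract (per-cube extraction run in parallel, parameters adapted to each cube, descriptors collected into a searchable index) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `describe_cube` is a hypothetical threshold-based stand-in and not the RoiSE algorithm, the cubes are toy lists of (channel, intensity) pairs rather than FITS data, the adaptive rule (lowering a signal-to-noise threshold until something is detected) is only one possible tuning scheme, and the standard-library `ThreadPoolExecutor` is swapped in for Dask so the example runs without extra packages.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def estimate_noise(values):
    """Robust noise estimate: median absolute deviation scaled to a sigma."""
    med = median(values)
    return 1.4826 * median([abs(v - med) for v in values])


def describe_cube(cube, snr_start=5.0, snr_floor=3.0, step=0.5):
    """Hypothetical stand-in for a RoI extractor (NOT RoiSE).

    Lowers the signal-to-noise threshold step by step until at least one
    sample is detected or the floor is reached, then returns one
    descriptor per detected sample.
    """
    noise = estimate_noise([v for _, v in cube["samples"]])
    snr = snr_start
    while True:
        hits = [(ch, v) for ch, v in cube["samples"] if v > snr * noise]
        if hits or snr <= snr_floor:
            break
        snr -= step
    return [
        {"cube": cube["name"], "channel": ch, "intensity": v, "snr_used": snr}
        for ch, v in hits
    ]


def build_index(cubes, workers=4):
    """Process cubes in parallel and flatten the descriptors into one list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_cube = list(pool.map(describe_cube, cubes))  # keeps input order
    return [row for rows in per_cube for row in rows]


def query(index, min_intensity, channel_range):
    """Content-aware search: filter descriptors by intensity and channel."""
    lo, hi = channel_range
    return [
        r for r in index
        if r["intensity"] >= min_intensity and lo <= r["channel"] <= hi
    ]


# Toy data: a bright source, a faint source, and a noise-only cube.
background = [0.1, -0.1, 0.2, -0.2, 0.0, 0.1, -0.1]
demo_cubes = [
    {"name": "A", "samples": list(enumerate(background + [9.0]))},
    {"name": "B", "samples": list(enumerate(background + [0.9]))},
    {"name": "C", "samples": list(enumerate(background + [0.2]))},
]

if __name__ == "__main__":
    index = build_index(demo_cubes)
    for row in query(index, min_intensity=0.5, channel_range=(0, 7)):
        print(row)
```

In this toy run the faint source in cube B is only detected after the threshold has been lowered, which is the kind of per-product adaptation the abstract motivates. The `submit`/`map` style used here carries over to a distributed scheduler, since Dask's distributed client offers a similar futures interface.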