Paper: |
Towards Large-scale RoI Indexing for Content-aware Data Discovery |
Volume: |
522, Astronomical Data Analysis Software and Systems XXVII |
Page: |
57 |
Authors: |
Araya, M.; Caceres, R.; Gutierrez, L.; Mendoza, M.; Ponce, C.; Valenzuela, C. |
Abstract: |
Data discovery within large archives is a key issue for modern astronomy:
multi-source, multi-wavelength, multi-instrument and large-scale verifications
need proper data discovery tools for filtering the very large datasets of
observations available nowadays. The Virtual Observatory and file format
standards have helped enable data discovery at the metadata level, where
filtering is limited to what was explicitly annotated at the
observation, calibration or data reduction stages. The next step is to perform
data discovery at the content level, where content descriptors are automatically
gathered from the observations to perform content-aware search. In a very
general sense, this corresponds to automatically generating catalogs from large
and diverse datasets. In this work, we consider the public spectroscopic data
products from ALMA (FITS cubes), and we apply the fast Region of Interest Seek
and Extraction algorithm (RoiSE) to obtain content descriptors of the spatial
forms, positions, intensities and wavelengths of the source emissions. Despite
the efficiency of the algorithm, it is impractical to process all the data in a
batch/sequential manner. The problem, then, was to decide which tools and
architecture to use for distributing tasks across the datacenter. Among the
several distributed/parallel computing alternatives, we selected the Dask
packages to build the distributed pipeline that we outline in this paper, mainly
because the current RoiSE implementation is written in Python. The main
challenge of this pipeline is the diversity of data products: different
resolutions, signal-to-noise ratios, densities, morphologies, imaging
parameters, etc. Therefore, we include an adaptive parameter tuning mechanism to
cope with this diversity. Finally, we present an example of content-aware data
discovery over the obtained database. |
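
The pipeline idea in the abstract (independent per-cube tasks, each retried with adapted parameters) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `detect` is a hypothetical stand-in for RoiSE, not its implementation; the threshold ladder in `describe` is one simple way to adapt a parameter, not the authors' tuning mechanism; flat lists of intensities replace FITS cubes; and the standard-library `ThreadPoolExecutor` is swapped in for Dask so the example has no dependencies.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def robust_sigma(values):
    """Noise estimate from the median absolute deviation (MAD)."""
    m = median(values)
    return 1.4826 * median(abs(v - m) for v in values)


def detect(values, k):
    """Hypothetical stand-in for RoiSE: indices of samples more than
    k robust sigmas above the median."""
    m = median(values)
    sigma = robust_sigma(values)
    return [i for i, v in enumerate(values) if v - m > k * sigma]


def describe(item, thresholds=(5.0, 4.0, 3.0)):
    """Illustrative adaptive tuning: relax the detection threshold step
    by step until something is found, and record which value was used."""
    name, values = item
    for k in thresholds:
        hits = detect(values, k)
        if hits:
            return {"name": name, "k": k, "hits": hits}
    return {"name": name, "k": None, "hits": []}


def run(items, workers=4):
    """Map one independent task per data product over a worker pool.
    Results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(describe, items))


if __name__ == "__main__":
    background = [0, 1, -1, 0, -2, 1, 0, -1]
    products = [
        ("bright", background + [20]),
        ("faint", background + [5]),
        ("empty", background + [2]),
    ]
    for record in run(products):
        print(record)
```

Because each data product is handled by a pure function of its own input, the same map-style structure should carry over to a distributed scheduler; Dask's `Client.map` and `Client.gather` expose a comparable interface. Storing the threshold that succeeded alongside the descriptors keeps the resulting records comparable across products of very different quality.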