Data Sharing and Publication Using the SciDrive Service


Paper:	Data Sharing and Publication Using the SciDrive Service
Volume:	485, Astronomical Data Analysis Software and Systems XXIII
Page:	465
Authors:	Mishin, D.; Medvedev, D.; Szalay, A. S.; Plante, R.; Graham, M.
Abstract:	Despite the progress made in cloud data storage in recent years, the problem of fast and reliable data storage for the scientific community remains open. The SciDrive project meets the need for a free, open-source scientific data publishing platform. Although astronomers, as the largest data producers, are the primary target audience, the platform is not bound to any scientific domain and can be used by other communities. Our current installation provides a free and safe storage platform where scientists can publish their data and share it with the community with the simplicity of Dropbox. The system allows service providers to harvest metadata from the files and derive their broader context in a fairly automated fashion. Collecting various scientific data files in a single location, or in multiple connected sites, makes it possible to build an intelligent system of metadata extractors. Our system aims to simplify the cataloging and processing of large file collections for the long tail of scientific data. We propose an extensible plugin architecture for automatic metadata extraction and storage. The current implementation targets some of the data formats commonly used by the astronomy community, including FITS, ASCII and Excel tables, TIFF images, and YT simulation data archives. Along with generic metadata, format-specific metadata is also processed. For example, basic information about celestial objects is extracted from FITS files and TIFF images, if present. This approach turns simple BLOB storage into a smart system that provides access to each kind of data in its own representation, such as a database for files containing tables, and offers additional search and access features such as full-text search, image pyramid or thumbnail creation, and simulation dataset ID extraction for fast search. A 100 TB implementation has just been put into production at Johns Hopkins University.
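
The plugin architecture for metadata extraction mentioned in the abstract can be illustrated with a minimal sketch. This is not SciDrive's actual code or API: the registry, the `register` decorator, the function names, and the returned field names are all hypothetical, and dispatching on the file extension is an assumption made for illustration. The FITS reader relies only on the standard 80-character header card layout and is deliberately simplified.

```python
import csv
import io
import os
from typing import Callable, Dict

# Hypothetical registry (not SciDrive's API): maps a lowercase file
# extension to the extractor plugin that handles it.
Extractor = Callable[[bytes], Dict[str, object]]
_EXTRACTORS: Dict[str, Extractor] = {}


def register(*extensions: str):
    """Decorator that registers an extractor plugin for the given extensions."""
    def wrap(func: Extractor) -> Extractor:
        for ext in extensions:
            _EXTRACTORS[ext.lower()] = func
        return func
    return wrap


@register(".csv", ".txt")
def extract_ascii_table(data: bytes) -> Dict[str, object]:
    """Treat the first row as column names and count the remaining rows."""
    text = data.decode("utf-8", errors="replace")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {"columns": [], "row_count": 0}
    return {"columns": rows[0], "row_count": len(rows) - 1}


@register(".fits", ".fit")
def extract_fits_header(data: bytes) -> Dict[str, object]:
    """Read keyword/value cards from the primary header, stopping at END."""
    header: Dict[str, str] = {}
    for offset in range(0, len(data) - 79, 80):
        card = data[offset:offset + 80].decode("ascii", errors="replace")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":
            # Simplification: everything after the first '/' is treated as a
            # comment, so string values that contain '/' are truncated.
            value = card[10:].split("/", 1)[0].strip().strip("'").strip()
            header[keyword] = value
    return {"fits_header": header}


def extract_metadata(filename: str, data: bytes) -> Dict[str, object]:
    """Collect generic metadata, then add format-specific metadata if a plugin matches."""
    metadata: Dict[str, object] = {"name": filename, "size_bytes": len(data)}
    extension = os.path.splitext(filename)[1].lower()
    extractor = _EXTRACTORS.get(extension)
    if extractor is not None:
        metadata.update(extractor(data))
    return metadata
```

Under these assumptions, support for a new format is added by writing one function and decorating it with `register`; files with an unrecognized extension still receive the generic metadata (name and size).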