||High-Performance Compute Infrastructure in Astronomy: 2020 Is
Only Months Away
||461, Astronomical Data Analysis Software and Systems XXI
||Berriman, B.; Deelman, E.; Juve, G.; Rynge, M.; Vöckler, J. S.
||By 2020, astronomy will be awash with as much as 60 PB of public data.
Full scientific exploitation of such massive volumes of data will
require high-performance computing on server farms co-located with the
data. Development of this computing model will be a community-wide
enterprise that has profound cultural and technical implications.
Astronomers must be prepared to develop environment-agnostic
applications that support parallel processing. The community must
investigate the applicability and cost-benefit of emerging
technologies such as cloud computing to astronomy, and must engage the
Computer Science community to develop science-driven
cyberinfrastructure such as workflow schedulers and optimizers.
We report here the results of collaborations between a science center,
IPAC, and a Computer Science research institute, ISI. These
collaborations may be considered pathfinders in developing a
high-performance compute infrastructure in astronomy. These
collaborations investigated two exemplar large-scale science-driver
applications: 1) Calculation of an infrared atlas of the
Galactic Plane at 18 different wavelengths by placing data from
multiple surveys on a common plate scale and co-registering all the
pixels; 2) Calculation of an atlas of periodicities present in the
public Kepler data sets, which currently contain 380,000 light curves.
These products have been generated with two workflow applications,
written in C for performance and designed to support parallel
processing on multiple environments and platforms, but with different
compute resource needs: the Montage image mosaic engine is I/O-bound,
and the NASA Star and Exoplanet Database periodogram code is
CPU-bound. Our presentation will report cost and performance metrics
and lessons learned for continuing development.
Applicability of Cloud Computing: Commercial cloud providers generally
charge for all operations, including processing, transfer of input and
output data, and storage of data, so the cost of running an
application varies widely according to how it uses resources. The
cloud is well suited to CPU-bound (and memory-bound) workflows such as
the periodogram code, given the relatively low cost of processing in
comparison with I/O operations. I/O-bound applications such as Montage
perform best on high-performance clusters with fast networks and
parallel file systems.
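
The dependence of cost on resource usage can be illustrated with a
minimal sketch. The unit prices and workload sizes below are
placeholders chosen only to contrast a CPU-bound run with an I/O-bound
one; they are not actual provider rates and not the metrics reported by
the authors. All names (`Rates`, `Workload`, `cost_breakdown`,
`processing_share`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit prices in USD; placeholders, not real provider rates."""
    cpu_hour: float
    transfer_gb: float
    storage_gb_month: float


@dataclass
class Workload:
    """Hypothetical resource usage of one workflow run."""
    cpu_hours: float
    input_gb: float
    output_gb: float
    stored_gb: float
    months_stored: float


def cost_breakdown(work: Workload, rates: Rates) -> dict:
    """Split the bill into the three charge types named in the text."""
    return {
        "processing": work.cpu_hours * rates.cpu_hour,
        "transfer": (work.input_gb + work.output_gb) * rates.transfer_gb,
        "storage": work.stored_gb * work.months_stored * rates.storage_gb_month,
    }


def processing_share(work: Workload, rates: Rates) -> float:
    """Fraction of the total bill that is spent on processing."""
    costs = cost_breakdown(work, rates)
    return costs["processing"] / sum(costs.values())


# Placeholder numbers chosen only to contrast the two workload shapes.
RATES = Rates(cpu_hour=0.10, transfer_gb=0.10, storage_gb_month=0.10)
CPU_BOUND = Workload(cpu_hours=100, input_gb=1, output_gb=1,
                     stored_gb=1, months_stored=1)
IO_BOUND = Workload(cpu_hours=10, input_gb=500, output_gb=200,
                    stored_gb=700, months_stored=1)

if __name__ == "__main__":
    for name, work in (("CPU-bound", CPU_BOUND), ("I/O-bound", IO_BOUND)):
        print(name, cost_breakdown(work, RATES),
              round(processing_share(work, RATES), 3))
```

With these placeholder values, processing dominates the bill for the
CPU-bound workload, while transfer and storage dominate for the
I/O-bound one.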
Science-driven Cyberinfrastructure: Montage has been widely used as a
driver application to develop workflow management services, such as
task scheduling in distributed environments, designing fault tolerance
techniques for job schedulers, and developing workflow orchestration
Running Parallel Applications Across Distributed Cloud Environments:
Data processing will eventually take place in parallel, distributed
across cyberinfrastructure environments with different architectures.
We have used the Pegasus Workflow Management System (WMS) to
successfully run applications across three very different
environments: TeraGrid, OSG (Open Science Grid), and FutureGrid.
Provisioning resources across different grids and clouds (also
referred to as Sky Computing) involves establishing a distributed
environment in which issues such as remote job submission, data
management, and security must be addressed. This environment also
requires building virtual machine images that can run in different
environments. Usually, each cloud provides basic images that can be
customized with additional software and services. In most of our work,
we provisioned compute resources using a custom application, called
Wrangler. Pegasus WMS abstracts the architectures of the compute
environments away from the end-user, and can be considered a
first-generation tool suitable for scientists to run their
applications on disparate environments.
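
The idea of keeping a workflow description independent of the compute
environment can be sketched in a few lines. This is not the Pegasus
API: the task names, the site names, the launcher strings, and the
`plan` function are all hypothetical, and a real workflow manager also
handles data staging, retries, and security. The sketch only shows one
abstract task graph being turned into site-specific commands in an
order that respects dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each task maps to the tasks it depends on.
# The names loosely echo mosaic-building steps and are illustrative only.
ABSTRACT_WORKFLOW = {
    "reproject_a": set(),
    "reproject_b": set(),
    "fit_background": {"reproject_a", "reproject_b"},
    "coadd": {"fit_background"},
}

# Hypothetical launcher prefixes, one per execution site. Real sites differ
# in far more than the submit command.
SITE_LAUNCHERS = {
    "cluster": "cluster-submit",
    "cloud_vm": "vm-run",
}


def plan(workflow: dict, site: str) -> list:
    """Return site-specific command strings in a dependency-respecting order."""
    launcher = SITE_LAUNCHERS[site]
    order = TopologicalSorter(workflow).static_order()
    return [f"{launcher} {task}" for task in order]


if __name__ == "__main__":
    # The same abstract workflow is planned for two different sites.
    for site in SITE_LAUNCHERS:
        print(site, plan(ABSTRACT_WORKFLOW, site))
```

The workflow dictionary never mentions a site, so adding a new
environment in this sketch means adding one entry to `SITE_LAUNCHERS`
rather than editing the science workflow.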