Paper: |
Producing an Infrared Multiwavelength Galactic Plane Atlas Using Montage, Pegasus, and Amazon Web Services |
Volume: |
485, Astronomical Data Analysis Software and Systems XXIII |
Page: |
211 |
Authors: |
Rynge, M.; Juve, G.; Kinney, J.; Good, J.; Berriman, B.; Merrihew, A.; Deelman, E. |
Abstract: |
In this paper, we describe how to leverage cloud resources to generate
large-scale mosaics of the galactic plane in multiple wavelengths. Our
goal is to generate a 16-wavelength infrared Atlas of the Galactic Plane
at a common spatial sampling of 1 arcsec, with the images processed so that they appear
to have been measured with a single instrument. This will be achieved by
using the Montage image mosaic engine to process observations from the 2MASS, GLIMPSE, MIPSGAL, MSX and WISE datasets, over a wavelength range
of 1 μm to 24 μm, and by using the Pegasus Workflow Management
System for managing the workload. When complete, the Atlas will be made
available to the community as a data product.
We are generating images that cover ±180° in Galactic longitude
and ±20° in Galactic latitude, to the extent permitted by the
spatial coverage of each dataset. Each image will be 5°x5° in
size (including an overlap of 1° with neighboring tiles), resulting
in an atlas of 1,001 images. The final size will be about 50 TB.
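
As a rough check on the tiling arithmetic, the sketch below enumerates tile centers on a regular grid. The layout is an assumption made for illustration, not taken from the paper: centers are spaced by the tile size minus the overlap (4°) and placed on every grid point from -180° to +180° in longitude and -20° to +20° in latitude, endpoints included. Under that assumption the grid has 91 x 11 = 1,001 centers, which matches the image count quoted above. The function name `tile_centers` is hypothetical.

```python
# Illustrative sketch only: this tile layout is an assumption, not the paper's stated method.
TILE_SIZE_DEG = 5   # tile edge length given in the abstract
OVERLAP_DEG = 1     # overlap with neighboring tiles given in the abstract
STEP_DEG = TILE_SIZE_DEG - OVERLAP_DEG  # assumed spacing between adjacent tile centers


def tile_centers(lon_half_range=180, lat_half_range=20, step=STEP_DEG):
    """Return (longitude, latitude) tile centers in degrees on a regular grid.

    Both endpoints of each range are included (an assumption).
    """
    lons = range(-lon_half_range, lon_half_range + 1, step)
    lats = range(-lat_half_range, lat_half_range + 1, step)
    return [(lon, lat) for lon in lons for lat in lats]


centers = tile_centers()
print(len(centers), "tile centers")
```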
This paper will focus on the computational challenges, solutions,
and lessons learned in producing the Atlas. To manage the computation we
are using the Pegasus Workflow Management System, a mature, highly
fault-tolerant system now in release 4.2.2 that has found wide
applicability across many science disciplines. A scientific workflow
describes the dependencies between the tasks and in most cases the
workflow is described as a directed acyclic graph, where the nodes are
tasks and the edges denote the task dependencies. A defining property
of a scientific workflow is that it manages data flow between
tasks. Applied to the galactic plane project, each 5°x5° mosaic is a
Pegasus workflow. Pegasus is used to fetch the source images, execute
the image mosaicking steps of Montage, and store the final outputs in a
storage system.
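
The directed-acyclic-graph view of a workflow can be illustrated with a minimal sketch using Python's standard `graphlib` module. This is not the Pegasus API, and the task names are hypothetical stand-ins for the fetch, mosaicking, and storage stages described above; a real Montage workflow contains many more tasks per stage.

```python
from graphlib import TopologicalSorter

# Hypothetical, heavily simplified workflow for one mosaic tile.
# Each key is a task (node); its value is the set of tasks it depends on (edges).
workflow = {
    "fetch_images": set(),
    "reproject": {"fetch_images"},
    "match_backgrounds": {"reproject"},
    "coadd_mosaic": {"reproject", "match_backgrounds"},
    "store_output": {"coadd_mosaic"},
}

# A topological order runs every task only after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
for task in order:
    print(task)
```

A workflow manager performs the same kind of ordering, while also scheduling independent tasks in parallel, moving data between them, and retrying failures.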
Because these workflows are very I/O intensive, care must be taken when
choosing the infrastructure on which to execute them. In our setup,
we use dynamically provisioned compute clusters running on the
Amazon Elastic Compute Cloud (EC2). All our instances use the same
base image, which is configured to come up as a master node by
default. The master node is a central instance from where the workflow
can be managed. Additional worker instances are provisioned and
configured to accept work assignments from the master node. The system
allows for adding/removing workers in an ad hoc fashion, and could be
run in large configurations.
To date we have performed 245,000 CPU hours of computing and generated
7,029 images totaling 30 TB. With the current setup our
runtime would be 340,000 CPU hours for the whole project. Using spot
m2.4xlarge instances, the cost would be approximately $5,950. Using
faster AWS instances, such as cc2.8xlarge, could potentially decrease the
total CPU hours and further reduce the compute costs. The paper will
explore these tradeoffs. |
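
The figures quoted in the abstract imply a few derived quantities, worked out in the short sketch below. It assumes that the 245,000 CPU hours already spent count toward the 340,000-hour project estimate and that the $5,950 estimate covers the full 340,000 hours; both are readings of the abstract rather than statements from it.

```python
# Figures quoted in the abstract.
CPU_HOURS_DONE = 245_000
CPU_HOURS_TOTAL = 340_000
ESTIMATED_COST_USD = 5_950

# Derived quantities (assumes the hours done are part of the project total
# and that the cost estimate applies to the whole project).
fraction_done = CPU_HOURS_DONE / CPU_HOURS_TOTAL
remaining_cpu_hours = CPU_HOURS_TOTAL - CPU_HOURS_DONE
cost_per_cpu_hour = ESTIMATED_COST_USD / CPU_HOURS_TOTAL

print(f"fraction of compute done: {fraction_done:.1%}")
print(f"remaining CPU hours:      {remaining_cpu_hours:,}")
print(f"implied cost per CPU hour: ${cost_per_cpu_hour:.4f}")
```

Under these assumptions, roughly 72% of the compute is complete and the implied spot price is about $0.0175 per CPU hour.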