Next: Tools for Coordinating Planning Between Observatories
Up: Enabling Technologies for Astronomy
Previous: Advanced Architecture for the Infrared Science Archive
Thakar, A., Kunszt, Z., & Szalay, A. 2001, in ASP Conf. Ser., Vol. 238, Astronomical Data Analysis Software and Systems X, eds. F. R. Harnden, Jr., F. A. Primini, & H. E. Payne (San Francisco: ASP), 40
A Parallel, Distributed Archive Template for the VO
Aniruddha R. Thakar, Peter Z. Kunszt, Alexander S. Szalay
The Johns Hopkins University
Abstract:
In the proposed Virtual Observatory (VO), there is an urgent need for a
prototype distributed archive that a) uses standard interfaces to the outside
world, b) contains a parallel and scalable query agent, and c) can serve as a
virtual data grid node in the VO. We propose to use the current SDSS Science
Archive as a basis for developing such an archive template for the VO. This
effort will involve extending the current capabilities of the Science Archive
query agent as well as redesigning certain aspects of it. We
describe the steps that this effort will entail.
The multi-Terabyte astronomical archives that are in the process of being
built today will give rise to unprecedented sky and wavelength coverage along
with an incredibly rich dataset. But the enormous potential for scientific
discovery that these archives promise will not be fully realized until they
are interconnected in such a way that data from individual archives can be
combined, compared, and mined in a seamless fashion. The creation of such a
multi-wavelength digital universe-at-your-fingertips is the ambitious and
far-sighted goal of the effort to build a Virtual Observatory.
The enormous size of the upcoming archives, along with the multiplicity of
standards, software tools, and hardware platforms at the disposal of the
scientists who are building these archives, creates a daunting challenge with
respect to integrating them into a virtual observatory framework. The very
task of defining what a virtual observatory is and must provide has consumed
several months of discussions and meetings among astronomers. In short,
there has been much talk but little action on defining a VO.
What is sorely needed is a prototype for a "VO-ready" archive that can serve
as a template for what future archives should (or should not) be, and help to
crystallize the concepts and priorities for the VO. We believe that a VO
archive must have at least the following features:
- well-defined, standardized interfaces to the outside world;
- a parallel, distributed, and scalable query agent to enable efficient
data mining for a large and widespread user community; and
- the proper hardware and software framework to serve as a virtual data
grid node.
In order to build such an archive template, we propose to use the Sloan
Digital Sky Survey's current Science Archive as a basis and convert it into a
VO archive template by making the modifications described below.
The Sloan Digital Sky Survey (SDSS) is a multi-institution project to build a
map of a large part of the northern sky in five wavelength bands (Szalay 1999).
The SDSS Science Archive (abbreviated SX) is the science database that
will result from the survey when it is completed (2005/2006). It is expected
to be several Terabytes in size and will contain a catalog of more than 200
million objects and 1 million spectra.
The SX has a client/server architecture that features a lightweight, portable
GUI client, a parallel (multi-threaded) distributed server (query agent), and
a commercial object-oriented Database Management System (DBMS) -
Objectivity (Thakar et al. 2000). It also includes a fast spatial
indexing scheme--the Hierarchical Triangular Mesh or HTM
(Kunszt et al. 2000),
as well as a multi-dimensional flux index. Although the Science Archive is
already considerably optimized for distributed data mining (Szalay et
al. 2000), it is still not sufficiently equipped to be a VO data grid node.
We aim to take the following specific steps to rectify this and create a
prototype VO archive.
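The idea behind a hierarchical spatial index such as the HTM can be illustrated with a minimal sketch: the sphere is covered by eight spherical triangles, each of which is recursively split into four children, and a sky position is labeled by the chain of triangles that contain it. The root labels, vertex ordering, child numbering, and the function name `htm_name` below are assumptions made for this illustration and may differ from the conventions of the Science Archive implementation.

```python
import math

# Octahedron vertices and eight root triangles, listed counter-clockwise as
# seen from outside the sphere. Labels and ordering are illustrative assumptions.
V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
ROOTS = {
    "S0": (1, 5, 2), "S1": (2, 5, 3), "S2": (3, 5, 4), "S3": (4, 5, 1),
    "N0": (1, 0, 4), "N1": (4, 0, 3), "N2": (3, 0, 2), "N3": (2, 0, 1),
}
EPS = 1e-12  # tolerance for points lying on a triangle edge


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def midpoint(a, b):
    """Midpoint of two unit vectors, projected back onto the unit sphere."""
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    n = math.sqrt(dot(s, s))
    return (s[0] / n, s[1] / n, s[2] / n)


def contains(tri, p):
    """True if p lies on the inner side of all three great-circle edges."""
    return all(dot(cross(tri[i], tri[(i + 1) % 3]), p) >= -EPS for i in range(3))


def radec_to_xyz(ra_deg, dec_deg):
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def htm_name(ra_deg, dec_deg, depth):
    """Return a label such as 'N3012' for the triangle containing the point."""
    p = radec_to_xyz(ra_deg, dec_deg)
    for name, idx in ROOTS.items():
        tri = tuple(V[i] for i in idx)
        if contains(tri, p):
            break
    else:
        raise ValueError("point not found in any root triangle")
    for _ in range(depth):
        v0, v1, v2 = tri
        w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
        children = [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
        for k, child in enumerate(children):
            if contains(child, p):
                name, tri = name + str(k), child
                break
        else:
            raise ValueError("point not found in any child triangle")
    return name
```

Because the label of a deeper triangle extends the label of its parent, nearby objects share a common prefix, which is what lets a spatial query be reduced to a small set of prefix ranges.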
A fundamental property of a VO-compatible archive must be the standardization
of its interfaces with the outside world. Toward this goal, we recognize the
eXtensible Markup Language (XML) as an emerging standard for data interchange
on the Internet, and seek to make our archive XML-compliant in a way that will
allow data and metadata to be exchanged with any other entity that can decode
XML. The attractiveness of XML is that it is flexible and self-describing, so
that specific information about reading even the most complex data can be
encoded using standard XML Document Type Definitions (DTDs).
One of the biggest advantages of XML is the wealth of public-domain software
and tools that is already available (and rapidly increasing) on the Internet.
We list below some of those that are of particular interest to our
application:
- Schema Definition: These are tools for defining the structure of
the data, i.e., the data model. We propose to provide a public XML version of
our Abstract, which is essentially a runtime abstraction of our data model,
and functions as a type manager and metadata source. Currently available
tools include DDML, SOX, and XML Schema.
- Query Packaging: These are tools that enable the packaging of
queries using XML, such as XML Query, XML-QL, and Quilt.
- Parsers: There are also ready-to-use XML parsers available, such
as Expat, XML4J, and Xerces (Apache).
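As a rough illustration of the schema-definition and query-packaging roles listed above, the following sketch writes a small piece of data-model metadata and a query request as XML and reads the metadata back. It uses Python's standard `xml.etree.ElementTree` module (which is built on the Expat parser); the element and attribute names (`class`, `attribute`, `queryRequest`, `query`) and the sample attributes are hypothetical and do not come from the Science Archive or from any of the tools named above.

```python
import xml.etree.ElementTree as ET


def schema_to_xml(class_name, attributes):
    """Describe one data-model class; attributes maps name -> (type, unit)."""
    root = ET.Element("class", name=class_name)
    for attr_name, (attr_type, unit) in attributes.items():
        ET.SubElement(root, "attribute", name=attr_name, type=attr_type, unit=unit)
    return ET.tostring(root, encoding="unicode")


def read_schema(xml_text):
    """Recover the name -> (type, unit) mapping from the XML description."""
    root = ET.fromstring(xml_text)
    return {a.get("name"): (a.get("type"), a.get("unit"))
            for a in root.iter("attribute")}


def package_query(query_text, output_format):
    """Wrap a query string in an XML envelope; special characters are escaped."""
    request = ET.Element("queryRequest", format=output_format)
    ET.SubElement(request, "query").text = query_text
    return ET.tostring(request, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical class and attributes, for illustration only.
    schema = {"ra": ("float64", "deg"), "dec": ("float64", "deg")}
    print(schema_to_xml("PhotoObject", schema))
    print(package_query("SELECT ra FROM PhotoObject WHERE r < 20", "xml"))
```

Any client that can parse XML can then discover the attribute names, types, and units without prior knowledge of the archive's internal data model.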
The creation of a data grid will be one of the primary challenges and benefits
of a Virtual Observatory. The ability to generate virtual data--the
complex and often voluminous data that are created on the fly from complex
analyses of archival data--efficiently will be crucial for future
astronomical research in cosmology and other fields. Virtual Data Grids
(VDGs) will be an indispensable component of the VO. The GriPhyN project
(Grid Physics Network, www.griphyn.org) envisages that
Petascale Virtual Data Grids (PVDGs) will be necessary in the near future to
meet the virtual data needs of the age of multi-Terabyte and Petabyte digital
archives. Indeed, these archives will need to be designed as VDG nodes.
This essentially means that each archive must provide a scalable, parallel, and
distributed framework for executing complex, compute- and I/O-intensive queries
and analysis tasks as close to the archive data as possible so as to minimize
network traffic. The GriPhyN proposal contains examples of queries that will
become possible (and frequent) with VDGs.
Our current model, although it is parallel, distributed and moderately
scalable, is not very well-suited for the large-scale grid computation
involving very complex query and analysis tools that is anticipated in a VO
context. In order to make it a fully functional grid node, we need to build a
massively parallel distributed framework with dynamic load-balancing,
resource-scheduling, and message-passing communication between intelligent
agents. We propose to reconfigure our current parallel model in order to
achieve this objective, as shown in Figure 1. Our master/slave
configuration will be replaced by a computational grid of intelligent query
agents loosely coupled to a distributed data grid via an MPI (Message Passing
Interface)-based communication toolkit like Globus (www.globus.org). It will
also be necessary to have a grid agent at the top level that will interface to
the outside world and serve as a listener/scheduler for the query agents.
Figure 1: (a) Current and (b) proposed distributed query computation models.
The current master/slave model will be replaced by a loosely-coupled,
MPI-based massively parallel (MPP) model.
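The dynamic load-balancing aspect of the proposed model can be sketched in a few lines. In this illustration a top-level scheduler places data partitions on a shared work queue and each query agent pulls the next partition as soon as it is free, so faster agents naturally take on more work. Threads and an in-process queue are used here purely as a stand-in for the MPI/Globus-based communication described above; the function `run_query_agents` and its parameters are hypothetical.

```python
import queue
import threading


def run_query_agents(partitions, predicate, n_agents=4):
    """Evaluate predicate over all partitions with n_agents pulling work.

    Returns the sorted matching rows and the number of partitions each
    agent ended up handling.
    """
    work = queue.Queue()
    for part in partitions:
        work.put(part)

    results = []
    handled = [0] * n_agents
    lock = threading.Lock()

    def agent(agent_id):
        while True:
            try:
                part = work.get_nowait()  # pull the next free partition
            except queue.Empty:
                return  # no work left: this agent is done
            matches = [row for row in part if predicate(row)]
            with lock:
                results.extend(matches)
                handled[agent_id] += 1

    threads = [threading.Thread(target=agent, args=(i,)) for i in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results), handled


if __name__ == "__main__":
    parts = [list(range(i * 10, i * 10 + 10)) for i in range(8)]
    rows, per_agent = run_query_agents(parts, lambda x: x % 7 == 0, n_agents=3)
    print(rows, per_agent)
```

Unlike a static master/slave split, no agent is assigned a fixed share in advance; the distribution of work is decided at run time by whichever agent is idle.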
Our current query language, SXQL, is a subset of SQL (Structured Query Language)
that also includes several object-oriented extensions and astronomical and
mathematical macros. We plan to augment the language further so as
ultimately to produce a versatile SQL-based scientific query language that
includes at least the following features, several of which have already been
incorporated into SXQL: the SQL SELECT-FROM-WHERE syntax, aliasing and
nesting, the ability to follow links through associations, including language
extensions for specifying to-many links, the ability to query on object
methods, generic mathematical macro support, specific astronomical macro
support, and support for spatial querying using the HTM.
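To give a feel for how astronomical macros can extend the SELECT-FROM-WHERE syntax, the sketch below registers a user-defined function in SQLite and uses it in an aliased query. This is an analogy only: it is not SXQL, and the macro name `ANGSEP`, the table layout, and the sample rows are assumptions introduced for the example.

```python
import math
import sqlite3


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Angular distance in degrees between two sky positions (haversine form)."""
    ra1, dec1, ra2, dec2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(rows, ra0, dec0, radius_deg):
    """Return ids of objects within radius_deg of (ra0, dec0)."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL as a hypothetical 'ANGSEP' macro.
    conn.create_function("ANGSEP", 4, angular_separation_deg)
    conn.execute("CREATE TABLE objects (id INTEGER, ra REAL, decl REAL)")
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT o.id FROM objects AS o "
        "WHERE ANGSEP(o.ra, o.decl, ?, ?) < ? ORDER BY o.id",
        (ra0, dec0, radius_deg),
    )
    ids = [row[0] for row in cursor]
    conn.close()
    return ids


if __name__ == "__main__":
    sample = [(1, 180.2, 0.1), (2, 185.0, 0.0), (3, 179.5, -0.3)]
    print(cone_search(sample, 180.0, 0.0, 1.0))
```

A production archive would evaluate such a predicate through a spatial index rather than row by row; the point here is only that the macro reads like any other SQL function.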
The output from VO queries will often be very complex and voluminous, and will
need to be packaged in a self-describing, lightweight format. Again, XML
provides the answer. The eXtensible Scientific Interchange Language (XSIL) is
an XML DTD for scientific output that is extensible to any discipline
(Williams 2000). It contains an extensible object model with a Java API and
comes bundled with the Xlook browser. We hope to use XSIL to obtain a
flexible, general object transport protocol, a portable ASCII and binary
output format, and an ultra-light data format that includes support for binary
streams.
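The notion of a self-describing output format with an embedded binary stream can be sketched as follows. The example writes column metadata as XML elements and the row data as a base64-encoded block of little-endian doubles, then reads both back. The element names (`table`, `column`, `stream`) and the layout are assumptions made for illustration; they are not taken from the XSIL DTD.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def pack_table(name, columns, rows):
    """columns is a list of (name, unit); rows is a list of float tuples."""
    root = ET.Element("table", name=name)
    for col_name, unit in columns:
        ET.SubElement(root, "column", name=col_name, type="float64", unit=unit)
    row_format = "<%dd" % len(columns)  # little-endian doubles, one per column
    payload = b"".join(struct.pack(row_format, *row) for row in rows)
    stream = ET.SubElement(root, "stream", encoding="base64", byteorder="little")
    stream.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unpack_table(xml_text):
    """Return (column names, rows) using only what the document describes."""
    root = ET.fromstring(xml_text)
    names = [c.get("name") for c in root.findall("column")]
    payload = base64.b64decode(root.find("stream").text or "")
    row_format = "<%dd" % len(names)
    size = struct.calcsize(row_format)
    rows = [struct.unpack_from(row_format, payload, offset)
            for offset in range(0, len(payload), size)]
    return names, rows


if __name__ == "__main__":
    doc = pack_table("result", [("ra", "deg"), ("dec", "deg")],
                     [(180.2, 0.1), (179.5, -0.3)])
    print(doc)
    print(unpack_table(doc))
```

Keeping the metadata in readable XML while the bulk data travels as a compact binary stream is one way to reconcile portability with the size of typical query results.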
Although there are significant challenges facing the creation of a
Virtual Observatory, the lack of the necessary technology is not one
of them. The time is ripe for the implementation of the VO archive template
described above. Such a template is sorely needed, and the existing state of
software and hardware technology--in terms of storage capacity, network
bandwidth, CPU speed, and software standards and technology--makes it
achievable.
Within the next few years, the demand for virtual data will see a sharp rise,
and the computational power afforded by virtual data grids will be
indispensable for scientific research in large-scale structure and other
fields within astronomy. Archives like this one will be poised to meet those
challenges.
References
Kunszt, P. Z., Szalay, A. S., & Thakar, A. R. 2000, in ASP Conf. Ser., Vol. 216, Astronomical Data
Analysis Software and Systems IX, ed. N. Manset, C. Veillet, &
D. Crabtree (San Francisco: ASP), 40
Szalay, A. S. 1999, Comp. in Sci. & Eng., Mar/Apr 1999, 54
Szalay, A. S., Kunszt, P. Z., Thakar, A., Gray, J., Slutz,
D., & Brunner, R. J. 2000, Proc. 2000 ACM SIGMOD
on Management of Data, 451
Thakar, A. R., Kunszt, P. Z., & Szalay, A. S. 2000, in ASP Conf. Ser., Vol. 216, Astronomical Data
Analysis Software and Systems IX, ed. N. Manset, C. Veillet, &
D. Crabtree (San Francisco: ASP),
231
Williams, R. 2000, http://www.cacr.caltech.edu:/XSIL
© Copyright 2001 Astronomical Society of the Pacific, 390 Ashton Avenue, San Francisco, California 94112, USA
adass-editors@head-cfa.harvard.edu