Challenges of Data Processing for Earth Observation in Distributed Environments
Dana Petcu

Abstract. The capabilities of remote sensing systems are growing continuously, and today they can be handled only by distributed systems. In this context, this paper reviews the challenges that the Earth observation field poses for distributed systems. Moreover, the technological solutions used to build a platform for Earth observation data processing are presented as a proof of concept of current distributed system capabilities.

1 Introduction
Earth observation systems gather large amounts of information about our planet every day and are now used intensively to monitor and assess the status of the natural and built environments. Earth observation (EO) most often refers to satellite imagery or satellite remote sensing, the result of the sensing process being an image or a map. Remote sensing refers to receiving and measuring reflected or emitted radiation from different parts of the electromagnetic spectrum (ultraviolet, visible, reflected infrared, thermal infrared, or microwave). Remote sensing systems involve not only the collection of the data, but also their processing and distribution. The volume of remote sensing data is growing at an ever-increasing rate. Moreover, the number of users and applications is also increasing, and data and resource sharing has become a key issue in remote sensing systems. Furthermore, EO scientists are often hindered by difficulties in locating and accessing data and services. These needs have led to a shift in the design of remote sensing systems from centralized environments towards wide-area distributed environments.

The current paper is a short survey of the current challenges that the remote sensing application field imposes on distributed systems. It is based on several recent research reports of the EO and distributed systems communities. The paper is organized as follows. The next section presents the identified challenges. The third section reviews the benefits of Grid usage for EO. The fourth section describes a case study on building a distributed environment for training in EO. A short list of conclusions is provided in the last section.

Dana Petcu, Computer Science Department, West University of Timisoara, Romania, e-mail: petcu@info.uvt.ro
G.A. Papadopoulos and C. Badica (Eds.): Intelligent Distributed Computing III, SCI 237, pp. 9-19. © Springer-Verlag Berlin Heidelberg 2009, springerlink.com

2 Challenges of Earth Observation on Distributed Environments
Satellite image processing is usually demanding in both computation and storage, and special techniques are required for both data storage and processing in distributed environments. In what follows we point out some of the main topics. This section surveys the ideas presented in recent scientific reports such as [2, 4, 5].

Data Management. The management of the distribution of data, from storage to long-term archiving, is currently an important topic in EO systems. The first issue is the data format, which varies among image files, databases, and structured files. EO data usually contain metadata describing the data, such as the dimensionality or reference coordinates. Another issue is related to the users' need to access EO data remotely. Due to the size of EO data, a distributed file system is needed. For more than three decades, distributed file systems have existed that enable multiple distributed servers to be federated under the same file namespace. Another issue is data discovery, which is currently done mostly by exploiting metadata catalogs. Moreover, for EO, replica management services are essential: they determine an optimal physical location for data access based on the data destination, aiming to reduce network traffic and response time. Secure data transfer protocols were developed to extend the traditional file transfer protocol.

While several stable and standardized solutions exist for the basic needs mentioned above, the current key issue in EO data management is to make the data reachable and useful for any application through interoperability. Interoperability is achieved through the use of standard interfaces and protocols. Interoperable interfaces are attractive to users because they allow the fast design of distributed applications based on multiple components. Achieving interoperability also includes building adapter interfaces, which provide different front ends to basic services, and bridging protocols (that is, making them interoperate).
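To make the replica management idea concrete, the following minimal sketch picks, among several copies of a dataset, the one with the lowest estimated transfer time to the requesting site. The cost model (latency plus size over bandwidth), the `Replica` fields, and the site names are illustrative assumptions; the surveyed reports do not prescribe a particular selection rule.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Replica:
    site: str              # hypothetical storage site name
    latency_ms: float      # assumed round-trip time to the destination
    bandwidth_mbps: float  # assumed sustained throughput to the destination


def estimated_transfer_seconds(replica: Replica, size_mb: float) -> float:
    """Crude cost model: latency plus payload size divided by throughput."""
    return replica.latency_ms / 1000.0 + (size_mb * 8.0) / replica.bandwidth_mbps


def select_replica(replicas: Sequence[Replica], size_mb: float) -> Replica:
    """Return the replica with the smallest estimated transfer time."""
    if not replicas:
        raise ValueError("no replicas registered for this dataset")
    return min(replicas, key=lambda r: estimated_transfer_seconds(r, size_mb))


# Invented figures: one nearby but slow site, one distant but fast site.
replicas = [
    Replica("site-a", latency_ms=5.0, bandwidth_mbps=100.0),
    Replica("site-b", latency_ms=40.0, bandwidth_mbps=1000.0),
]
small = select_replica(replicas, size_mb=0.01)    # e.g. a metadata record
large = select_replica(replicas, size_mb=2000.0)  # e.g. a full scene
```

With these invented figures, the small metadata record is served from the low-latency site, while the large scene is served from the high-bandwidth one.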
There are at least two layers of interoperability: resource-level interoperability, covering resource formats and domain encoding, and semantic interoperability. Interoperability solutions for resource structures and content are often dependent on the application field. The solutions relate to different levels: device, communication, middleware, and deployment. At the device level, the solutions are mostly standardized and refer to the interfaces to the storage devices. At the communication level, there are standardized data transfer protocols (such as HTTP, HTTPS, or GridFTP), standardized protocols for Web services, and less standardized data movers for heterogeneous computing environments. At the middleware level there are fewer standard solutions. For example, data storage requires a single consistent interface to different storage systems; one solution comes from the Grid community through the open standard Storage Resource Manager, a control protocol for accessing mass storage. As for file catalogs, there are no current standards, but several implementations are available in Grid environments that use special file catalogs allowing data replication. The same situation holds for metadata catalogs; fortunately, in the particular case of EO this issue is pursued by the Open Geospatial Consortium (http://www.opengeospatial.org). As for the interoperability of federated databases, a standard again proposed by the Grid community is the Open Grid Services Architecture Data Movement Interface (OGSA-DMI, http://forge.gridforum.org/sf/projects/ogsa-dmi-wg). At the deployment level, interoperability degrades when new deployments occur: currently there are no automated tools or standard interfaces allowing the propagation of updates.

While resource-level interoperability ensures the compatibility of implementations at the hardware and software levels, semantic interoperability enables data and information flows to be understood at a conceptual level. Research efforts are currently devoted to the definition of generic data models for specific structured linguistic data types, with the intention of representing a wide class of documents without losing the essential characteristics of the linguistic data type.

EO data particularities. Data provision services in EO do not satisfy today's user needs due to current application and infrastructure limitations. The process of identifying and accessing data takes up a lot of time, according to [4], due to: physical discontinuity of data, diversity of metadata formats, large volume of data, unavailability of historic data, and the many different actors involved.
In this context, there is a clear need for an efficient data infrastructure able to provide reliable long-term access to EO data via the Internet, and to allow users to easily and quickly derive information and share knowledge. Recognizing these needs, the European INSPIRE Directive (http://inspire.jrc.ec.europa.eu) requires all public authorities holding spatial data to provide access to that data through common metadata, data and network service standards. OPeNDAP (http://opendap.org/) is a data transport architecture and protocol widely used in EO; it is based on HTTP and includes standards for encapsulating structured data, annotating the data with attributes, and adding semantics that describe the data. Moreover, it is widely used by governmental agencies to provide EO data [4]. The Committee on EO Satellites (www.ceos.org) maintains a Working Group on Information Systems and Services with the responsibility to promote the development of interoperable systems for the management of EO data internationally. This group plans to build in the next decade the Global EO System of Systems (GEOSS), targeting the development of a global, interoperable geospatial services architecture [8].

Data Processing. To address the computational requirements introduced by time-critical satellite image applications, several research efforts have been oriented towards parallel processing strategies. According to the Top500 list of supercomputer sites, NASA, for example, maintains two massively parallel clusters for remote sensing applications. The recent book [14] presents the latest achievements in the field of high performance computing (HPC).



Ongoing research efforts also aim at the efficient distributed processing of remote sensing data. Recent reports relate to the use of new versions of data processing algorithms developed for heterogeneous clusters, as in [13]. Moreover, distributed application frameworks have been developed specifically for remotely sensed data processing, like JDAF [18]. EO applications are also good candidates for building architectures based on components that encapsulate complex data processing algorithms and are exposed through standard interfaces, as in [7].

Web services technology has emerged as a standard for integrating applications using open standards. In EO, Web services play a key role. A concrete example is the Web mapping implementation specification proposed by OpenGIS (http://www.opengis.org). Web technologies also allow the distribution of scientific data in a decentralized approach and expose catalogue services of dataset metadata.

The promise of a Grid for the EO community is to be a shared environment that provides access to a wide range of resources: instrumentation, data, HPC resources, and software tools. There are at least three reasons for using Grids for EO: (a) the required computing performance is not available locally, the solution being remote computing; (b) the required computing performance is not available in one location, the solution being cooperative computing; (c) the required services are only available in specialized centres, the solution being application-specific computing.
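As an illustration of the Web mapping specification mentioned in this section, the sketch below assembles a GetMap request URL using the parameter names of WMS version 1.1.1. The server address, layer name, and bounding box are hypothetical; a real service advertises its layers through its GetCapabilities response.

```python
from typing import Sequence, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit


def build_getmap_url(base_url: str,
                     layers: Sequence[str],
                     bbox: Tuple[float, float, float, float],
                     width: int,
                     height: int,
                     srs: str = "EPSG:4326",
                     image_format: str = "image/png") -> str:
    """Build a WMS 1.1.1 GetMap URL; bbox is (minx, miny, maxx, maxy)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests the default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name, used only to show the URL shape.
url = build_getmap_url("http://example.org/wms", ["landsat"],
                       (20.0, 45.0, 22.0, 46.0), width=512, height=256)
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
```

Because the request is a plain HTTP GET with standardized parameters, any client that can build such a URL can ask any conforming server for a rendered map, which is the interoperability benefit the section describes.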

3 Large Distributed Environments for Earth Observation
Realizing the potential of Grid computing for EO, several projects were launched to make the idea of Grid usage a reality. We review the most important ones.

Grid-based EO Initiatives. Within the DataGrid project funded by the European Commission, an experiment aiming to demonstrate the use of Grid technology for remote sensing applications was carried out; the results can be found, for example, in [9]. Several other international Grid projects focused on EO, like SpaceGrid (http://www.spacegrid.org), Earth Observation Grid (http://www.e-science.clrc.ac.uk/web/projects/earthobservation), or Genesis [22]. The MediGrid project (http://www.eu-medigrid.org) aimed to integrate and homogenize data and techniques for managing multiple natural hazards. The authors of [1] present an overview of SARA Digital Puglia, a remote sensing environment that shows how Grid and HPC technologies can be used efficiently to build dynamic EO systems for the management of space mission data and for their on-demand processing and delivery to final users.

A frequent approach is to use the Grid as an HPC facility for processor-intensive operations. The paper [17], for example, focuses on the Grid-enabled parallelization of the computation-intensive satellite image geo-rectification problem. The aim of the classification middleware on the Grid proposed in [19] is to divide jobs into several assignments and submit them to a computing pool. The parallel remote-sensing image processing software PRIPS was encapsulated into a Grid service in [20]. The paper [21] discusses the architecture of a spatial information Grid computing environment based on Globus Toolkit, OpenPBS, and Condor-G; it proposes a model of image division that can compute the most appropriate image pieces and shorten the processing time.

CrossGrid (http://www.crossgrid.org) aimed at developing techniques for real-time, large-scale Grid-enabled simulations and visualizations; the issues addressed included the distribution of source data and the usefulness of the Grid in crisis scenarios. DEGREE (http://www.eu-degree.eu) delivered a study on the challenges that the Earth Sciences impose on Grid infrastructure. D4Science (http://www.d4science.org) studied the data management of satellite images on Grid infrastructures. G-POD (http://eogrid.esrin.esa.int/) aims to offer a Grid-based platform for the remote processing of satellite images provided by the European Space Agency. The GlobAEROSOL service of BEinGRID [15] processes data gathered from satellite sensors and generates multi-year global aerosol information in near real time. The GEOGrid project [16] provides an eScience infrastructure for the Earth sciences community; it integrates a wide variety of existing data sets, including satellite imagery, geological data, and ground-sensed data, through Grid technology, and is accessible as a set of services. LEAD (https://portal.leadproject.org/) is creating an integrated, scalable infrastructure for meteorology research; its applications are characterized by large amounts of streaming data from sensors. The Landsat Data Continuity Mission Grid Prototype (LGP) offers a specific example of distributed processing of remotely sensed data [5], generating single, cloud and shadow scenes from the composite of multiple input scenes.
GENESI-DR (http://genesi-dr.eu) intends to provide reliable long-term access to Earth Science data, allowing scientists to locate, access, combine and integrate data from space, airborne and in-situ sensors archived in large distributed repositories; its discovery service allows querying information about data existing in heterogeneous catalogues, and can be accessed by users via a Web portal, or by external applications via open standardized interfaces (OpenSearch-based) exposed by the system [4]. Several other smaller projects, like MedioGrid [11], were also initiated to provide Grid-based services at national levels.

Remote Sensing Grid. A Remote Sensing Grid (RSG) is defined in [5] as a highly distributed system that includes resources supporting the collection, processing, and utilization of remote sensing data. The resources are not under a single central control. Nowadays it is possible to construct an RSG using standard, open protocols and interfaces. In the vision of [5], an RSG is made up of resources from a variety of organizations that provide specific capabilities, like observing elements, data management elements, data processing and utilization elements, communications, command, and control elements, and core infrastructure. If a service-oriented architecture is used, modular services can be discovered and used by clients to build complex applications. The services should have the following characteristics [5]: composition, communication, workflow, interaction, and advertisement. These requirements are mapped into the definition of specific services for workflow management, data management and processing, resource management, infrastructure core functions, policy specification, and performance monitoring. The services proposed in [5] are distributed in four categories: workflow management services, data management services, applications in the form of services, and core Grid services.

In the next section we describe a case study of a recent Grid-based satellite imagery system that follows the RSG concepts.

4 Case Study: GiSHEO
The rapid evolution of remote sensing technology is not matched by the pace of development of training and higher education in this field. Currently only a small number of resources are involved in educational activities in EO. The CEOS Working Group of Education, Training and Capacity Building, for example, is collecting an index of free EO educational materials (http://oislab.eumetsat.org/CEOS/webapps/). Recognizing the gap between research activities and educational ones, we have recently developed a platform, GiSHEO (On Demand Grid Services for Training and High Education in EO, http://gisheo.info.uvt.ro), addressing the issue of specialized services for training in EO. In contrast to the existing platforms that provide tutorials and training materials, GiSHEO intends to be a living platform where experimentation and extensibility are the key words. Moreover, special solutions were proposed for data management, image processing service deployment, workflow-based service composition, and user interaction. Particular attention is given to the basic services for image processing, which reuse free image processing tools like GDAL. A special feature of the platform is the connection with the GENESI-DR catalog mentioned in the previous section.

While the Grid is usually employed to respond to researchers' requirements to consume resources for computation-intensive or data-intensive tasks, we aim to use it for near-real-time applications with short-time data-intensive tasks. The data sets used for each application are rather big (at least several tens of GBs), and the tasks are specific to image processing (most of them very simple). In this particular case, a scheme for instantiating a service where the data are located is required in order to obtain a response in near-real time.
Grid services are a convenient solution in this case: a fabric service is available at the server of the platform that serves the user interface, and this service instantiates the processing service where the selected data reside. In our platform the Web services serve as interfaces for the processing algorithms. These interfaces allow remote access and application execution on a Grid using different strategies for fast response. The platform design concepts were briefly presented in [10], and the details about the e-learning component can be found in [6]. The EO services were described in [12] and the data management is detailed in [11]. In this section we describe the approaches that were undertaken to solve the previously mentioned problems.

Conceptual View of the Platform's Architecture. The GiSHEO architecture is a Grid-enabled platform for satellite image processing. Figure 1 presents the conceptual view of the implemented service-oriented architecture. The WMS is the standard Web Mapping Service ensuring access to the distributed database. WAS (Web Application Service) is invoked by the user interface at run-time and allows the description of workflows. GDIS is a data index service, a Web service providing information about the available data to its clients; it mediates access to data repositories, stores the processing results, ensures role-based access control to the data, retrieves data from various information sources, queries external data sources, and has a simple interface that is usable by various data consumers. The platform has distributed data repositories; it uses PostGIS for storing raster extent information and in some cases vector data. The data search is based on PostGIS spatial operators. The Workflow Service Composition (WSC) and Workflow Manager (WfM) are the engines behind WAS and are connected with the task manager. Each basic image processing operation is viewed as a task; several tasks can be linked together to form a workflow in an order that is decided on the client side. The GTD-WS (Grid Task Dispatcher Web Service) is a service-enabled interface for interoperability with the Grid environment. Digital certificates are required to access the full facilities of the platform. A particular component of WAS is eGLE, which uses templates to allow teachers specialized in EO to develop new lessons that use EO data; details are given in [6].

Fig. 1 GiSHEO platform components. Numbers represent the stages in its usage process: authenticate and authorize; select area of interest and available datasets; select processing unit (list of applications) and submit it to WSC; submit a task to GTD-WS; discover datasets, schedule the task, retrieve datasets for processing; upload output to GDB, register output to GDIS, notify WSC; UI requests WMS to display the output, query GDIS, select output datasets from GDB, display

The physical platform is based on four clusters that are geographically distributed. Due to the low security restrictions between the four institutions, data distribution between the clusters is done using the Apache Hadoop distributed file system. Data transfer from and to external databases is done using GridFTP; this is, for example, the case of the connection with the GENESI-DR database.
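The task and workflow notions described above can be sketched as follows: the client sends an ordered list of task names, and the server resolves each name to a basic operation and applies them in that order. The registry, the task names, and the list-of-lists image representation are assumptions made for illustration; they are not the GiSHEO workflow language, which is described in [3].

```python
from typing import Callable, Dict, List, Sequence

Image = List[List[int]]          # stand-in for one raster band, values 0..255
Task = Callable[[Image], Image]  # one basic image processing operation


def quantize(image: Image) -> Image:
    """Reduce 256 gray levels to 4 by rounding each pixel down to a multiple of 64."""
    return [[(p // 64) * 64 for p in row] for row in image]


def threshold(image: Image) -> Image:
    """Map pixels at or above 128 to 255 and all others to 0."""
    return [[255 if p >= 128 else 0 for p in row] for row in image]


# Hypothetical server-side registry: task name -> operation.
TASKS: Dict[str, Task] = {"quantize": quantize, "threshold": threshold}


def run_workflow(task_names: Sequence[str], image: Image) -> Image:
    """Apply the named tasks in the order chosen by the client."""
    for name in task_names:
        if name not in TASKS:
            raise ValueError(f"unknown task: {name}")
        image = TASKS[name](image)  # each task returns a new image
    return image


band = [[0, 100], [200, 255]]
result = run_workflow(["quantize", "threshold"], band)
```

Keeping the workflow as data (a list of names) rather than code is one plausible way to let the client decide the order while the server keeps control over which operations may run.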



Fig. 2 GiSHEO interface: on the left side a snapshot of the Web interface and on the right side a snapshot of the workflow designer

Platform's Levels. The platform architecture has several levels: user, service, security, processing, and data. The user level is in charge of access to the Web user interface (built using DHTML technologies). A workflow language was developed, together with a set of tools for users not familiar with programming, which can be used for visually creating a workflow (details in [3]). After defining the workflow, the user can select a region containing one or more images to which the workflow is to be applied (Figure 2). The service level exposes internal mechanisms belonging to the platform and consists of: EO services (processing applications); the workflow service (the internal workflow engine, accessible through a special Web service); and data indexing and discovery services (allowing access to the platform's data management mechanisms). The security level provides the security context for both users and services. Each user is identified either by a username-password pair or by a canonical name provided by a digital certificate. The services use a digital certificate for authentication, authorization, and trust delegation. A VOMS service is used for authorization.

At the processing level the platform enables two models for data processing: direct job submission through Condor's specific Web services, or through the WS-GRAM tool of Globus Toolkit 4 (GT4). The Grid service interface GTD (Grid Task Dispatcher) is responsible for the interaction with other internal services, such as the Workflow Composition Engine, in order to facilitate access to the processing platform. It receives tasks from the workflow engine or directly from the user interface. A task description language (for example, the ClassAd meta-language in the case of Condor HTC) is used to describe a job unit, to submit jobs and check their status inside the workload management system, and to retrieve job logs for debugging purposes.
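To give a flavour of what a job unit description looks like on the Condor side, the sketch below renders a submit description in the style accepted by Condor's `condor_submit`, from which Condor builds the job's ClassAd. The executable path, arguments, and file names are hypothetical, and the function is not part of GTD; it only illustrates the kind of information a task description carries.

```python
from typing import Sequence


def render_submit_description(executable: str,
                              arguments: Sequence[str],
                              input_files: Sequence[str],
                              job_name: str) -> str:
    """Render a Condor-style submit description for one processing job."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {' '.join(arguments)}",
        f"transfer_input_files = {', '.join(input_files)}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"output = {job_name}.out",  # captured standard output
        f"error = {job_name}.err",   # captured standard error
        f"log = {job_name}.log",     # job event log, useful for debugging
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical thresholding job on one image tile.
description = render_submit_description(
    executable="/opt/eo/bin/threshold",
    arguments=["--level", "128", "tile_001.tif"],
    input_files=["tile_001.tif"],
    job_name="threshold_tile_001",
)
```

The `log` entry corresponds to the job logs that the text says are retrieved for debugging purposes.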
GTD with Condor is used mainly for development purposes (application development and testing). For production, GTD with GT4 is used because it offers a complete job management and tracking system.

At the data level two different types of data are involved: the datasets database, which contains the satellite imagery repository, and the processing application datasets, used by applications to manipulate satellite images. The GiSHEO Data Indexing and Storage Service (GDIS) provides features for data storage, indexing data, finding data by various conditions, querying external services, and keeping track of temporary data generated by other components. GDIS is available to other components or external parties through a specialized Grid service. This service is also responsible for enforcing data access rules based on specific Grid credentials (VO attributes, etc.). The data storage component of GDIS is responsible for storing the data using the available storage back-ends. The data distributed across various storage domains should be exposed through a unique interface; this is achieved by implementing a front-end GridFTP service that has native access to the Hadoop distributed file system (HDFS), offering access to data stored inside the internal HDFS and providing the required access control facilities. Data indexing is performed by PostGIS, which indexes the metadata and location of the geographical data available in the storage layer. The PostGIS layer provides advanced geographical operations that allow searching the data using various criteria. An advanced and highly flexible interface for searching the project's geographical repository was also designed and built around a custom query language designed to provide fine-grained access to the data in the repository and to query external services (TerraServer, GENESI-DR, etc.); details can be found in [3].

As for the platform's EO services, we divided the remote sensing processing operations into basic and complex types. Basic operations represent image processing algorithms that can be applied to a satellite image (obtaining the negative, gray level conversion, histogram equalization, quantization, thresholding, band extraction, embossing, equalization, layers subtraction, etc.); see an example in Figure 3. Complex operations are represented by complex image processing algorithms (e.g. topographic effect regression) or by a composition of two or more basic operations; see an example in Figure 4. Applications for training in archaeology are presented in detail in [12].

Fig. 3 An example of a GiSHEO simple service: multi-image transformation using a binary decision tree to detect areas with water, clouds, forest, non-forest and scrub; the top two images in gray scale are the input data (infrared and red bands), while the bottom image is the output color image

Fig. 4 An example of a GiSHEO complex service: identification of soil marks of burial mounds; on the left the original image and on the right the result of a sequence of operations allowing the detection of round shapes
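The basic operations and the decision-tree service of Figure 3 can be illustrated with a small sketch on plain Python lists. The threshold values and the order of the tests in `classify_pixel` are invented for illustration; the text does not give the actual decision tree used by GiSHEO, and a production service would operate on real rasters (for example through GDAL) rather than nested lists.

```python
from typing import List

Band = List[List[int]]  # one image band as rows of 0..255 pixel values


def negative(band: Band) -> Band:
    """Basic operation: invert every pixel value."""
    return [[255 - p for p in row] for row in band]


def subtract_layers(a: Band, b: Band) -> Band:
    """Basic operation: pixel-wise a - b, clamped at 0."""
    return [[max(pa - pb, 0) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def classify_pixel(infrared: int, red: int) -> str:
    """Binary decision tree over two bands; all thresholds are hypothetical."""
    if infrared < 40:                 # very dark in infrared
        return "water"
    if infrared > 200 and red > 200:  # bright in both bands
        return "clouds"
    difference = infrared - red       # larger gap suggests denser vegetation
    if difference > 80:
        return "forest"
    if difference > 30:
        return "scrub"
    return "non-forest"


def classify_image(infrared: Band, red: Band) -> List[List[str]]:
    """Apply the decision tree to every pixel pair of the two input bands."""
    return [[classify_pixel(i, r) for i, r in zip(row_i, row_r)]
            for row_i, row_r in zip(infrared, red)]


# Tiny invented 2x2 bands, one pixel per class of interest.
infrared_band = [[10, 230], [180, 120]]
red_band = [[30, 220], [60, 70]]
labels = classify_image(infrared_band, red_band)
```

In the terms used above, `negative` and `subtract_layers` are basic operations, while a service that chains several such steps would count as a complex operation.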



5 Conclusions
The paper reviewed the challenges imposed on current distributed environments serving the EO community. Data processing and management are the key issues. In the last decade, special emphasis has been put on using wide-area distributed systems, namely Grids. The benefits of their usage were underlined in this paper. The current standards and specific services for EO allow the design of new platforms focusing on user needs. One such distributed platform, aiming to serve the training needs in EO, was presented as a proof of the concepts discussed in this paper.
Acknowledgements. This research is supported by ESA PECS Contract no. 98061 GiSHEO - On Demand Grid Services for High Education and Training in Earth Observation.

References
1. Aloisio, G., Cafaro, M.: A dynamic Earth observation system. Parallel Computing 29(10), 1357-1362 (2003)
2. Coghlan, B., et al.: e-IRG Report on Interoperability Issues in Data Management (2009)
3. Frincu, M.E., Panica, S., Neagul, M., Petcu, D.: Gisheo: On demand Grid service based platform for EO data processing. In: Procs. HiperGrid 2009, pp. 415-422 (2009)
4. Fusco, L., Cossu, R., Retscher, C.: Open Grid services for Envisat and Earth observation applications. In: High Performance Computing in Remote Sensing, pp. 237-280 (2008)
5. Gasster, S.D., Lee, C.A., Palko, J.W.: Remote sensing Grids: architecture and implementation. In: High Performance Computing in Remote Sensing, pp. 203-236 (2008)
6. Gorgan, D., Stefanut, T., Bacu, V.: Grid based training environment for Earth observation. LNCS, vol. 5529, pp. 98-109 (2009)
7. Larson, J.W., et al.: Components, the common component architecture, and the climate/weather/ocean community. In: Procs. 84th AMS Annual Meeting (2004)
8. Lee, C.A.: An introduction to Grids for remote sensing applications. In: Plaza, A., Chang, C. (eds.) High Performance Computing in Remote Sensing, pp. 183-202 (2008)
9. Nico, G., Fusco, L., Linford, J.: Grid technology for the storage and processing of remote sensing data: description of an application. SPIE, vol. 4881, pp. 677-685 (2003)
10. Panica, S., Neagul, M., Petcu, D., Stefanut, T., Gorgan, D.: Designing a Grid-based training platform for Earth observation. In: Procs. SYNASC 2008, pp. 394-397 (2009)
11. Petcu, D., Gorgan, D., Pop, F., Tudor, D., Zaharie, D.: Satellite image processing on a Grid-based platform. International Scientific Journal of Computing 7(2), 51-58 (2008)
12. Petcu, D., Zaharie, D., Neagul, M., Panica, S., Frincu, M., Gorgan, D., Stefanut, T., Bacu, V.: Remote sensed image processing on Grids for training in Earth observation. In: Kordic, V. (ed.) Image Processing, In-Tech, Vienna (2009)
13. Plaza, A., Plaza, J., Valencia, D.: Ameepar: Parallel morphological algorithm for hyperspectral image classification in heterogeneous NoW. LNCS, vol. 3391, pp. 888-891 (2006)
14. Plaza, A., Chang, C. (eds.): High Performance Computing in Remote Sensing. Chapman & Hall/CRC, Taylor & Francis Group, Boca Raton (2008)
15. Portela, O., Tabasco, A., Brito, F., Goncalves, P.: A Grid enabled infrastructure for Earth observation. Geophysical Research Abstracts 10 (2008)
16. Sekiguchi, et al.: Design principles and IT overviews of the GEOGrid. IEEE Systems Journal 2(3), 374-389 (2008)
17. Teo, Y.M., Tay, S.C., Gozali, J.P.: Distributed geo-rectification of satellite images using Grid computing. In: Procs. IPDPS 2003, pp. 152-157 (2003)
18. Votava, P., Nemani, R., Golden, K., Cooke, D., Hernandez, H.: Parallel distributed application framework for Earth science data processing. In: Procs. IGARSS 2002, pp. 717-719 (2002)
19. Wang, J., Sun, X., Xue, Y., et al.: Preliminary study on unsupervised classification of remotely sensed images on the Grid. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3039, pp. 981-988. Springer, Heidelberg (2004)
20. Yang, X.J., Chang, Z.M., Zhou, H., Qu, X., Li, C.J.: Services for parallel remote-sensing image processing based on computational Grid. In: Jin, H., Pan, Y., Xiao, N., Sun, J. (eds.) GCC 2004. LNCS, vol. 3252, pp. 689-696. Springer, Heidelberg (2004)
21. Yang, C., Guo, D., Ren, Y., Luo, X., Men, J.: The architecture of SIG computing environment and its application to image processing. In: Zhuge, H., Fox, G.C. (eds.) GCC 2005. LNCS, vol. 3795, pp. 566-572. Springer, Heidelberg (2005)
22. Yunck, T., Wilson, B., Braverman, A., Dobinson, E., Fetzer, E.: GENESIS: the general Earth science investigation suite. In: Procs. 4th annual NASA's Earth Technology Conference (2008)