The increasingly computationally- and data-intensive nature of experimental science motivates recent interest in workflows, as a way to specify complex data processing and integration pipelines in a fairly intuitive way. Such workflows orchestrate the invocation of data retrieval services in a way that resembles, to some extent, Search Computing query plans. While the former are manually specified, however, the latter are the result of an automated translation process. Using lessons learnt from experience in workflow design, in this chapter we discuss some of the requirements on service curation that make automated, on-demand data integration processes possible and realistic.
Incremental workflow improvement through analysis of its data provenance. Missier, P. 2011. In Procs. TAPP'11 (Theory and Practice of Provenance), Heraklion, Crete, Greece, June. Abstract:
Repeated executions of resource-intensive workflows over a large number of runs are commonly observed in e-science practice. We explore the hypothesis that, in some cases, provenance traces recorded for past runs of a workflow can be used to make future runs more efficient. This investigation is an initial step into the systematic study of the role that provenance analysis can play in the broader context of self-managing software systems. We have tested our hypothesis on a concrete case study involving a Chemical Engineering workflow deployed on a cloud infrastructure, where we can measure the cost of its repeated execution. Our approach involves augmenting the workflow with a feedback loop in which incremental analysis of the provenance of past runs is used to control some of the workflow steps in subsequent executions. We present initial experimental results and hint at future improvements as part of ongoing work.
Simulating Taverna workflows using stochastic process algebras. Curcin, V.; Missier, P.; and De Roure, D. 2011. Concurrency and Computation: Practice and Experience, in press.
Workflows to Open Provenance Graphs, round-trip. Missier, P., and Goble, C. 2011. Future Generation Computer Systems (FGCS), 27(6):812–819, April. Abstract:
The Open Provenance Model is designed to capture relationships amongst data values, and amongst processors that produce or consume those values. While OPM graphs are able to describe aspects of a workflow execution, capturing the structure of the workflows themselves is understandably beyond the scope of the OPM specification, since the graphs may be generated by a broad variety of processes, which may not be formal workflows at all. In particular, OPM does not address two questions: firstly, whether for any OPM graph there exists a plausible workflow, in some model, which could have generated the graph. And secondly, which information should be captured as part of an OPM graph that is derived from the execution of some known type of workflow, so that the workflow structure and the execution trace can both be inferred back from the graph. Motivated by the need to address the Third Provenance Challenge using Taverna workflows and provenance, in this paper we explore this notion of lossless-ness of OPM graphs relative to Taverna workflows. For the first question, we show that Taverna is a suitable model for representing plausible OPM-generating processes. For the second question, we show how augmenting OPM with two types of annotations makes it lossless with respect to Taverna. We support this claim by presenting a two-way mapping between OPM graphs and Taverna workflows.
Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning. Woodman, S.; Hiden, H.; Watson, P.; and Missier, P. 2011. In Procs. WORKS 2011, Seattle, WA, USA.
Towards the preservation of scientific workflows. De Roure, D.; Belhajjame, K.; Missier, P.; et al. 2011. In Procs. of the 8th International Conference on Preservation of Digital Objects (iPRES 2011), Singapore.
Scientific collaboration increasingly involves data sharing between separate groups. We consider a scenario where data products of scientific workflows are published and then used by other researchers as inputs to their workflows. For proper interpretation, shared data must be complemented by descriptive metadata. We focus on provenance traces, a prime example of such metadata which describes the genesis and processing history of data products in terms of the computational workflow steps. Through the reuse of published data, virtual, implicitly collaborative experiments emerge, making it desirable to compose the independently generated traces into global ones that describe the combined executions as single, seamless experiments. We present a model for provenance sharing that realizes this holistic view by overcoming the various interoperability problems that emerge from the heterogeneity of workflow systems, data formats, and provenance models. At the heart lie (i) an abstract workflow and provenance model in which (ii) data sharing becomes itself part of the combined workflow. We then describe an implementation of our model that we developed in the context of the Data Observation Network for Earth (DataONE) project and that can “stitch together” traces from different Kepler and Taverna workflow runs. It provides a prototypical framework for seamless cross-system, collaborative provenance management and can be easily extended to include other systems. Our approach also opens the door to new ways of workflow interoperability not only through often elusive workflow standards but through shared provenance information from public repositories.
This paper presents a formal semantics for the Taverna 2 scientific workflow system. Taverna 2 is a successor to Taverna, an open-source workflow system broadly adopted within the e-science community worldwide. The new version improves upon the existing model in two main ways: (i) by adding support for data pipelining, which in turn enables input streams of indefinite length to be processed efficiently; and (ii) by providing new extensibility points that make it possible to add new operators to the workflow model. Consistent with previous work by some of the authors, we use trace semantics to describe the effect of workflow computations, and we show how they can be used to describe the new features in the Taverna 2 model.
A Comparison of Using Taverna and BPEL in Building Scientific Workflows: the case of caGrid. Tan, W.; Missier, P.; Foster, I.; Madduri, R.; and Goble, C. 2009. Concurrency and Computation: Practice and Experience.
Practical data quality certification: model, architecture, and experiences. Missier, P.; Oliaro, A.; and Raffa, S. 2006. In IQIS, International Workshop on Information Quality in Information Systems, 30 June 2006, Chicago, USA (SIGMOD 2006 Workshop), ACM.
The Grid's vision, of sharing diverse resources in a flexible, coordinated and secure manner through dynamic formation and disbanding of virtual communities, strongly depends on metadata. Currently, Grid metadata is generated and used in an ad hoc fashion, much of it buried in the Grid middleware's code libraries and database schemas. This ad hoc expression and use of metadata causes chronic dependency on human intervention during the operation of Grid machinery, leading to systems which are brittle when faced with frequent syntactic changes in resource coordination and sharing protocols. The Semantic Grid is an extension of the Grid in which rich resource metadata is exposed and handled explicitly, and shared and managed via Grid protocols. The layering of an explicit semantic infrastructure over the Grid Infrastructure potentially leads to increased interoperability and greater flexibility. In recent years, several projects have embraced the Semantic Grid vision. However, the Semantic Grid lacks a Reference Architecture or any kind of systematic framework for designing Semantic Grid components or applications. The Open Grid Service Architecture (OGSA) aims to define a core set of capabilities and behaviours for Grid systems. We propose a Reference Architecture that extends OGSA to support the explicit handling of semantics, and defines the associated knowledge services to support a spectrum of service capabilities. Guided by a set of design principles, Semantic-OGSA (S-OGSA) defines a model, the capabilities and the mechanisms for the Semantic Grid. We conclude by highlighting the commonalities and differences that the proposed architecture has with respect to other Grid frameworks.
An Information Quality Management Framework for Cooperative Information Systems. Missier, P., and Batini, C. 2003. In Procs. ISE 2003, Montreal, Canada, July.
A model for Information Quality management in Cooperative Information Systems. Missier, P., and Batini, C. 2003. In SEBD, 191–206.
A Multidimensional Model for Information Quality in Cooperative Systems. Missier, P., and Batini, C. 2003. In Proceedings of 8th International Conference on Information Quality (IQ'03), 25–40.
Integration of Highly Fragmented Legacy Information Systems Through Object Modeling and Layered Wrappers. Mecella, M.; Missier, P.; Massari; and Batini, C. 1999. In Procs. AICA99, Italy.
A Knowledge-based Decision Support Workbench for Enterprise Resource Integration and Migration. Umar, A., and Missier, P. 1999. In Procs. First International Workshop on Enterprise Management and Resource Planning Systems (EMRPS99), Venice, Italy.