EGI-LW-CC/TRUFA/SoA - eScienceWiki

State of the art in Workflow

Kepler

Overview

Kepler is designed to help scientists, analysts and computer programmers design, execute and share models and analyses across scientific and engineering fields. It offers a graphical user interface in which users create their own scientific workflows. It also helps users share and reuse workflows created by other members of the scientific community, which improves usability and reduces the time needed to create certain workflows. Kepler is a Java-based application maintained for the Windows, OS X and Linux operating systems.

Implementation

Kepler inherits modeling and design capabilities from Ptolemy II, a framework developed at the University of California, Berkeley, including the Ptolemy graphical user interface and its workflow scheduling and execution patterns. Kepler also inherits from Ptolemy the actor-oriented modeling paradigm, which separates workflow components from the overall workflow orchestration and so makes the components reusable. Kepler also includes components aimed at scientific applications: remote data and metadata access, data transformations, data analysis, interfacing with legacy applications, Web service invocation and deployment, and provenance tracking. Kepler also offers web-based execution solutions such as Hydrant and SciencePipes: Hydrant provides the means for users to deploy their workflows on the web, and SciencePipes allows users to connect to real biodiversity data and create visualizations.
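The separation between components and orchestration in the actor-oriented paradigm can be illustrated with a minimal Python sketch. This is not Kepler or Ptolemy code (both are Java-based); the class names `Actor` and `SequentialDirector` and the two example actors are hypothetical and only show the idea that an actor can be reused in any workflow because it does not know how it is scheduled.

```python
from typing import Callable, List


class Actor:
    """A reusable workflow component: it transforms data but does not
    know where it sits in a workflow or how it is scheduled."""

    def __init__(self, name: str, func: Callable[[object], object]) -> None:
        self.name = name
        self.func = func

    def fire(self, token: object) -> object:
        return self.func(token)


class SequentialDirector:
    """Orchestration lives here, separate from the actors themselves."""

    def run(self, actors: List[Actor], token: object) -> object:
        for actor in actors:
            token = actor.fire(token)
        return token


# Hypothetical actors, for illustration only.
parse = Actor("parse", lambda text: [float(x) for x in text.split(",")])
mean = Actor("mean", lambda values: sum(values) / len(values))

# The same actors can be reused in different workflows.
result = SequentialDirector().run([parse, mean], "1,2,3")
parsed_only = SequentialDirector().run([parse], "4,5")
```

Swapping `SequentialDirector` for a different scheduling strategy would not require any change to the actors, which is the reusability benefit described above.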

Applications

Kepler offers support for a wide range of fields, including chemistry, ecology, geology, molecular biology, oceanography and phylogeny.

Conclusion

Kepler is an open-source workflow application that works particularly well for scientists in the fields of biology, ecoinformatics and geoinformatics. However, Kepler offers only a few templates for creating workflows in these fields; offering different templates for other domains would extend its scope.

Taverna

Overview

Taverna automates multi-step analyses that use several services, as well as processes that rely on web services. Users create their models by defining their final goals, without regard to how the services or processes will be executed; Taverna automates the execution and builds a pipeline model that meets the user's demands. Taverna also offers data conversion between services that are not entirely data compatible. Another feature is the quick incorporation of new services without the need to write code; as a result, Taverna offers more than 3500 services, including local and remote resources and several analysis tools. Taverna also covers the results of experiments thoroughly, providing detailed information about the execution: which services were executed, when, which inputs each service used and which outputs it produced. Finally, workflows created in Taverna can be conveniently shared on myExperiment.
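The kind of execution record described above (which service ran, when, with which inputs and outputs) can be sketched in a few lines of Python. This is only an illustration of the idea, not Taverna's provenance format or API; `run_service`, `provenance_log` and the two toy services are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

# One record per executed service, in execution order.
provenance_log: List[Dict[str, Any]] = []


def run_service(name: str, service: Callable[..., Any], **inputs: Any) -> Any:
    """Run a service and record what ran, when, with which inputs and output."""
    started = datetime.now(timezone.utc)
    output = service(**inputs)
    provenance_log.append(
        {
            "service": name,
            "started": started.isoformat(),
            "inputs": inputs,
            "output": output,
        }
    )
    return output


# Hypothetical services, for illustration only.
def to_upper(sequence: str) -> str:
    return sequence.upper()


def gc_content(sequence: str) -> float:
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


cleaned = run_service("to_upper", to_upper, sequence="gattaca")
gc = run_service("gc_content", gc_content, sequence=cleaned)
```

After the run, `provenance_log` holds one entry per step, so the whole analysis can be inspected afterwards.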

Implementation

The Taverna suite is written in Java and includes the Taverna Engine that powers both Taverna Workbench (the desktop client application) and Taverna Server (which executes remote workflows). Taverna is also available as a Command Line Tool for faster execution of workflows from a terminal without the overhead of a GUI.

Applications

Taverna can be used in bioinformatics, astronomy, cheminformatics, health informatics and other fields. Many examples of its use lie in bioinformatics, although Taverna is actually domain independent and can be applied to a wide range of fields. For example, it has been used for the composition of music using Web services for synthesis.

Conclusion

Taverna automates experimental methods that use a number of different services from a diverse set of domains: biology, chemistry and medicine, as well as music, meteorology and social sciences. It enables a scientist with a limited background in computing and limited technical resources and support to construct highly complex analyses over data and computational resources that are both public and private, all from a standard Windows PC, UNIX system or Apple computer.

myExperiment

Overview

myExperiment is based on the principle of reuse, which is effective in several cases:

1. A workflow can be reused with different parameters and input data, and users can modify the workflow for their own purposes.

2. Workflows can be shared with other scientists who work on similar projects, so they can help each other with coding and spread the workflow designer’s practice.

3. Workflows, their components and workflow patterns can be used to support fields for which they were not initially designed.

The main objective of myExperiment is to provide the means scientists need to share workflows and reuse them collaboratively, while addressing the social and technical challenges involved.

Implementation

myExperiment was designed according to an interpretation of the Web 2.0 design principles in the context of the virtual research environment. All the interfaces to myExperiment are accessed via HTTP. Users access myExperiment through an HTML-based Web interface, while external applications can access the other interfaces. The HTML interface uses JavaScript and AJAX to improve the interactive experience, while the RESTful API makes it possible to build Rich Web Applications and mashups.

Applications

myExperiment is used exclusively for sharing scientific workflows, so its audience is the scientific and research community. It hosts different types of workflows, including Taverna, Kepler and Galaxy workflows, so its fields of application follow from the nature of those workflows.

Conclusion

myExperiment is the largest public repository of scientific workflows. It offers a service that none of the other workflow applications offer: the possibility of taking an existing workflow and adapting it for another purpose by changing the parameters and the data used. Because myExperiment is about sharing and collaborating on workflows, the way the user interacts with the site has a large impact on the user experience. For that reason, improving the search engine (with a better presentation of results and good search filters) and upgrading the web interface are necessary for the success of myExperiment.

Galaxy

Overview

Galaxy is an open-source, web-based platform for data-intensive biomedical research. It allows users to organize and manipulate data from existing resources in different ways. One of the important features of Galaxy is its memory: every user action is recorded and stored in the history system, so any user can repeat and understand a complete computational analysis. Galaxy allows users to conduct independent queries on genomic data from different sources and then combine or refine them, perform calculations, or extract and visualize corresponding sequences or alignments.

Galaxy differs from existing systems in its specificity for access to, and comparative analysis of, genomic sequences and alignments.

Implementation

Galaxy consists of several independent software components that work together to perform tasks. The central core component orchestrates the action, executing the queries and keeping track of user histories, while the user interface and the operation, tool and output libraries are implemented separately. Communication with other sites is handled by the core component. The user interface communicates with the core via HTTP requests, using the GET or POST methods. The core provides an API consisting of the requests it is prepared to handle, for example running a tool or retrieving a user’s query history for a particular assembly of a genome. The Galaxy core component and operation libraries are written in C. The initial UI (called HUI, for History User Interface) is written in Perl for convenient text manipulation and CGI access, but any language that can generate an HTTP request could be used. The HTTP API was chosen because it is compatible with a wide variety of user interfaces, which do not have to run on the same server as the core. Any site on the Web can therefore create its own user interface for Galaxy by crafting the appropriate HTTP requests, and individual researchers can use the API directly for programmatic access to Galaxy’s features. The benefits of this design are extensibility (ease of adding new tools and interfaces) and a convenient division of labor and expertise among programmers.
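Because the core only expects well-formed HTTP requests, a client in any language can talk to it. The Python sketch below shows how such a GET request could be assembled. The host `galaxy.example.org`, the `/core` path and the parameter names `action`, `user` and `assembly` are all hypothetical placeholders, not Galaxy's actual API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_request_url(base_url: str, action: str, **params: str) -> str:
    """Build a GET URL for a (hypothetical) core endpoint."""
    query = urlencode({"action": action, **params})
    return f"{base_url}?{query}"


# Hypothetical request: fetch a user's query history for one genome assembly.
url = build_request_url(
    "http://galaxy.example.org/core",
    "get_history",
    user="alice",
    assembly="hg17",
)

# Any interface, on any server, can produce and parse the same request.
decoded = parse_qs(urlsplit(url).query)
```

Sending the request is left out on purpose; under these assumptions it would be a single call such as `urllib.request.urlopen(url)` against a real server.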

Applications

Galaxy offers a new set of interactive tools for large-scale genome analysis. Its field of application is restricted to bioinformatics.

Conclusion

Galaxy allows large-scale analyses that previously required users to have some programming experience and database management skills. The Galaxy history page is simple to use and can handle large genome annotation data sets. Users can perform multiple types of analyses (e.g., query intersections, subtractions, and proximity searches) and then display the results using existing browsers (e.g., the UCSC Genome Browser or Ensembl). The only thing Galaxy lacks is support for areas not related to bioinformatics.

TRUFA

Overview

TRUFA (TRanscriptome User-Friendly Analysis) is an informatics platform with a web interface that generates the outputs commonly used in de novo RNA-seq analysis and comparative transcriptomics. TRUFA offers the following services: raw read cleaning, transcript assembly and annotation, and expression quantification. TRUFA is highly parallelized and benefits from the use of high performance computing resources. It gives the user an easy, fast and valid analysis of RNA-seq data.

Implementation

The platform is written in JavaScript, Python and Bash and is currently installed on the ALTAMIRA supercomputer at the Instituto de Fisica de Cantabria (IFCA, Spain). The platform is highly parallelized at both the pipeline and the program level, and can access up to 256 cores per execution instance for certain components of the pipeline.
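Pipeline-level parallelism means that steps which do not depend on each other can run at the same time. The Python sketch below illustrates that idea with a thread pool on a single machine. It is not TRUFA code: the step functions are hypothetical toy stand-ins, and the real platform dispatches its work on a supercomputer rather than on local threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def clean_reads(reads: List[str]) -> List[str]:
    """Hypothetical cleaning step: drop reads with ambiguous bases."""
    return [read for read in reads if "N" not in read]


def read_length_stats(reads: List[str]) -> Dict[str, int]:
    """Hypothetical reporting step: shortest and longest read length."""
    lengths = [len(read) for read in reads]
    return {"min": min(lengths), "max": max(lengths)}


def base_counts(reads: List[str]) -> Dict[str, int]:
    """Hypothetical reporting step: count each base over all reads."""
    return dict(Counter("".join(reads)))


def run_pipeline(reads: List[str], max_workers: int = 2) -> Dict[str, Dict[str, int]]:
    # Cleaning must finish first because both later steps depend on it.
    cleaned = clean_reads(reads)
    # The two reporting steps are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        lengths = pool.submit(read_length_stats, cleaned)
        bases = pool.submit(base_counts, cleaned)
        return {"lengths": lengths.result(), "bases": bases.result()}


report = run_pipeline(["ACGT", "ACNT", "GGCCAA"])
```

Program-level parallelism, the second kind mentioned above, would instead come from each individual tool using several cores internally.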

Applications

TRUFA is designed exclusively for tasks related to de novo RNA-seq analysis. For this reason, its most important fields of application are evolutionary biology, ecology, biomedicine and computational biology.

Conclusion

TRUFA provides a set of the most common tasks needed to perform a whole de novo RNA-seq analysis. It allows scientists who do not have bioinformatics skills or access to fast computing services to carry out such an analysis. TRUFA works in an efficient, consistent and user-friendly manner, based on a pipeline approach. It integrates some widely used quality control programs in order to optimize the assembly process in the RNA-seq analysis.

LifeWatch Marine VRE

Overview

The LifeWatch Marine Virtual Research Environment (VRE) assembles several marine resources (databases, data systems, web services, tools, etc.) into one marine virtual research environment. The Marine VRE allows researchers to retrieve and access data resources holding marine biodiversity and ecosystem data, as well as a range of data systems on species names, traits, distribution and genes.

A set of online tools is available to facilitate data analysis of marine biodiversity and ecosystem data, and analysis can be performed on data from known data resources and/or data uploaded by the users themselves. Should a researcher need a specifically adapted service, the Marine VRE gives the possibility to build his/her own marine virtual lab, making use of the web services that access and process data. Service catalogues and 'how to' manuals will guide the users during the development of their own system. The Marine VRE is already looking to the future, working to further increase the integration and interaction between its components.

Implementation

The LifeWatch Marine VRE is a web portal that organizes a set of web services, applications and scientific workflows. It does not execute or perform any operation by itself; for workflows, it merely hosts references to BioVeL (Biodiversity Virtual e-Laboratory) or Taverna.

Applications

The LifeWatch Marine VRE brings together relevant resources for Web-based marine research (data systems, Web services, workflows, online tools, etc.) in one environment in the context of LifeWatch.

Conclusion

The LifeWatch Marine VRE supports marine environmental research and enables scientists to access resources and conduct analyses without having to install or configure any additional software. With everything accessible in one place, scientists can access data resources and tools and collaborate. They can analyze their data in conjunction with data from other sources.

Chipster

Overview

Chipster is a versatile data analysis platform with interactive visualizations and workflows. It offers a comprehensive collection of analysis tools for next generation sequencing (NGS), microarray and proteomics data. The NGS functionality covers analysis from quality control and alignment to downstream applications such as pathway analysis and motif detection, and more analysis tools are added all the time. The built-in Chipster genome browser allows users to visualize reads and results in their genomic context. The microarray functionality covers expression and allows users to integrate expression data with other data types.

Implementation

Chipster's platform is technically based on a desktop application user interface, a flexible distributed architecture, and the ability to integrate many types of analysis tools. The Chipster client software is a full graphical Java desktop application, offering an intuitive user interface with highly interactive visualizations and an overall smooth user experience. To make the client installation and updates as easy and automatic as possible, Chipster uses the Java Web Start technology.

Chipster's high compatibility makes it possible to integrate almost any kind of tool, regardless of its implementation. As R/Bioconductor provides a rich collection of analysis functionality for microarray and NGS data, Chipster offers strong support for R integration: wrappers manage communication with R processes and pool them for rapid responsiveness, and several R versions can be run side by side. Integration of command line tools is also supported and can even be done automatically. The tool selection offered by the local server can be augmented by external Web services (SOAP). Chipster is a client-server system. The server architecture allows tasks to be performed in the most suitable place: for example, interactive visualizations happen in the client, whereas the actual analysis tasks are processed by computing services, which can run on server machines with ample CPU and memory resources. This way users can run several analysis tasks simultaneously without burdening their own computer. In addition, there is no need to install any analysis tools or libraries on the user's computer, as they are installed and maintained centrally on the computing servers. To avoid transferring data multiple times between the client and server, a caching mechanism is used. The caching extends to multi-user scenarios thanks to Chipster's cryptographically strong data identifiers: when a previously saved analysis session is opened from a different computer, possibly by a different user, the system still uses the original cached copy of the data and does not transfer it again to the server side.
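The caching behaviour described above can be sketched in Python as a cache keyed by a strong identifier, so that data already on the server is never uploaded twice, whichever client asks. This is an assumption-laden illustration rather than Chipster's implementation: deriving the identifier from a SHA-256 hash of the content is one possible choice made here for the sketch, and `ServerCache` and `ensure_on_server` are hypothetical names.

```python
import hashlib
from typing import Dict


def data_identifier(data: bytes) -> str:
    """Assumed scheme: identify data by the SHA-256 hash of its content."""
    return hashlib.sha256(data).hexdigest()


class ServerCache:
    """Hypothetical server-side store keyed by data identifier."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.uploads = 0  # counts actual transfers to the server

    def has(self, data_id: str) -> bool:
        return data_id in self._store

    def upload(self, data_id: str, data: bytes) -> None:
        self._store[data_id] = data
        self.uploads += 1


def ensure_on_server(cache: ServerCache, data: bytes) -> str:
    """Transfer the data only if the server does not already hold it."""
    data_id = data_identifier(data)
    if not cache.has(data_id):
        cache.upload(data_id, data)
    return data_id


cache = ServerCache()
# First user saves a session; the data is transferred once.
first_id = ensure_on_server(cache, b"expression matrix")
# A different user on a different computer opens the same data later.
second_id = ensure_on_server(cache, b"expression matrix")
```

In this sketch the second call finds the identifier already present and skips the transfer, which is the multi-user benefit the paragraph describes.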

Applications

Chipster enables biologists to access a powerful collection of data analysis and integration tools and to visualize data interactively. Consequently, the most important fields of application for Chipster are evolutionary biology, ecology, biomedicine and computational biology.

Conclusion

Taken together, Chipster is user-friendly, open-source analysis software for microarray and other high-throughput data. Its intuitive user interface brings a comprehensive collection of analysis methods within the reach of experimental biologists, enabling them to analyze and integrate different data types such as gene expression, miRNA and aCGH.
