Abstract
Cloud computing technologies have made it possible to analyze big data sets on scalable computing infrastructure. In many scientific fields, such as bioinformatics and astronomy, applications are composed of complex workflow tasks and continuously generate huge amounts of data, requiring both large storage space and high-speed computing resources. DNA sequence analysis, where very large data sets are now generated at reduced cost using Next-Generation Sequencing (NGS) methods, is an area that can greatly benefit from cloud-based infrastructures. Transferring, storing, and analyzing these datasets has become a major challenge. Although many distributed solutions have been proposed, they focus on static scheduling with a batch processing scheme over a local computing farm and local data storage. For a large-scale workflow system, it is essential and valuable to outsource all or part of its tasks to public clouds to reduce resource cost. However, transferring huge datasets between these remote resources and the local node becomes a major challenge. Reducing transfer time, as well as the unbalanced completion times caused by differing problem sizes, is very important for making the overall process faster.
In this thesis, we discuss current issues in resource provisioning, scheduling, and computing models in distributed environments, survey relevant approaches to solving them, and propose an adaptive workflow scheduling scheme that includes a run-time data distribution and collection service to reduce data transfer time. The proposed scheme optimizes the allocation ratio of computing elements to the different datasets in order to minimize the total makespan under resource constraints. We present an initial implementation and evaluation of this approach in a Workflow Management System (WMS) built around a well-known sequence alignment algorithm, and the experimental results show that our proposed scheme is promising.
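The core allocation idea can be illustrated with a minimal sketch: under a linear cost model (processing time proportional to dataset size divided by worker count), assigning computing elements to datasets in proportion to their sizes equalizes partition completion times and thus reduces the makespan. The function names and the proportional heuristic below are illustrative assumptions, not the thesis implementation.

```python
def allocate(dataset_sizes, total_workers):
    """Split a fixed pool of workers across datasets in proportion to size."""
    total_size = sum(dataset_sizes)
    # Give each dataset at least one worker, split the rest by size ratio.
    shares = [max(1, round(total_workers * s / total_size)) for s in dataset_sizes]
    # Trim any rounding overshoot from the largest allocations.
    while sum(shares) > total_workers:
        shares[shares.index(max(shares))] -= 1
    return shares

def makespan(dataset_sizes, shares):
    """Completion time of the slowest partition under a linear cost model."""
    return max(s / w for s, w in zip(dataset_sizes, shares))
```

For example, with dataset sizes 100, 300, and 600 and 10 workers, the proportional split (1, 3, 6) finishes every partition at time 100, whereas a near-equal split such as (4, 3, 3) leaves the largest dataset running until time 200.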