GridFactory is a research project investigating usability, security and scalability in distributed computing.
The mission of GridFactory is to help scientists take a next step in collaborating and use the Internet not just for communication, but for concrete collaborative work through the sharing of computational resources.
In particular by making it simple to:
- form and dissolve collaborations
- provision compute clusters of heterogeneous, distributed resources
- use such clusters in an intuitive and collaborative way for scientific research
The project takes a fresh look at the now 10-year-old grid computing concept and incorporates ideas from utility and cloud computing, focusing on improving the user experience.
The long story
Batch systems
Batch systems allow the sharing of a single large computing facility by multiple users. A central concept is that of “jobs”, which users can submit to job queues. Users are typically offered a command-line interface for this. Commonly used batch systems include: LSF, PBS, SGE, LoadLeveler and Condor. These batch systems have one thing in common: once a job is submitted to a server, the files making up the job are pushed to a worker node by the server. This means that all worker nodes must run a network service, listening for jobs. Moreover, most of the systems rely on a shared file system and shared user accounts to exist on the server and the worker nodes. All this effectively means that server and worker nodes must be on the same local area network.
Grids
Most of the batch systems listed above offer a common API for job submission: DRMAA. The DRMAA specification (v1.0) does not support transferring input and output files from/to remote locations. It also does not specify how to authenticate with a remote system. To allow submitting jobs to remote sites, these sites must therefore have a service in front of the batch system. In the grid community, a widely supported web service interface for remote job submission is BES. Examples of middleware projects that implement or plan to implement BES are: EGEE, NorduGrid and Unicore. There also exists a more stand-alone BES implementation based on gSOAP. The many grid computing projects of the past decade aimed at commoditizing computing power and enabling global, distributed computing. In our view, the main outcome of these largely academic projects was a conceptual one: a consensus that the future of high performance computing lies in distributing and sharing computing and storage resources over the Internet.
Clouds
Server rental offerings with varying degrees of automation have been around for some time. Examples include RackSpace, ServePath and ThePlanet. Recently, Amazon’s Elastic Compute Cloud has been offering rental of virtual machines. These machines are provisioned to the user through a fully automated RESTful web service interface. Such a system is often referred to as a compute cloud. More recently, a few vendors, notably OpenNebula, Enomaly and Eucalyptus, have started providing the software to allow anyone to set up a compute cloud.
GridFactory
Idea
The idea behind GridFactory is to start with a traditional job-oriented batch system interface and extend it with a catalog of operating systems and software packages plus file staging capabilities. The system shares many characteristics with traditional grid systems, but in contrast to these it is not built on top of batch systems, but is a batch system extended with WAN capabilities. The system also shares many characteristics with cloud systems, but in contrast to these it is not machine oriented, but job oriented: a virtual machine is seen as just another piece of software that a given job may need in order to run. GridFactory also shares some characteristics with utility computing systems like BOINC and XtremWeb, but in contrast to these it supports virtualization, is a general-purpose batch system that can run any executable, and moreover allows any user to build up a virtual machine and software catalog.
Target audience
GridFactory is designed for running large numbers of independent compute jobs, i.e. jobs that don’t communicate with each other. While multi-CPU provisioning and MPI support could be added, this is not currently a priority. MapReduce also falls outside the scope of GridFactory.
As such, potential users include high-performance computing users who want:
- on-the-fly provisioning of virtual machines and software
- easy installation and configuration - at the price of fewer scheduling features than traditional batch systems
- an intuitive GUI for managing large numbers of independent compute jobs
- scaling across multiple sites
Potential users also include:
- companies and academic institutions looking for a simple way to analyze large amounts of scientific data, process log files, render 3D images or convert media files
- companies and academic institutions looking for a way to streamline data processing workflows - keeping track of input and output files and operating system and software requirements.
Features
What distinguishes GridFactory from traditional batch computing systems is the following:
- WAN support: There is no dependence on shared file systems, ssh login or other LAN artifacts. Remote job submission is inherently supported.
- Network security: All peers in a grid are (optionally) identified by an X.509 certificate and all communication is secure.
- Ease of use: To make network configuration as simple as possible, a pull paradigm is employed for the communication between a server and its worker nodes.
- Host security: Worker nodes can be protected from rogue computing jobs by configuring them to run all jobs inside virtual machines.
- Fewer abstraction layers, less bloat: Since wide area networks are inherently supported, the complication of interfacing a grid middleware stack with batch systems is eliminated. Moreover, no attempt is made to support transport mechanisms other than HTTPS.
- Standards compliance: Only standard security libraries are used. Only plain and standard HTTPS is used. Only proven open-source technology is used: Apache, MySQL, VirtualBox, Java. For job submission, neither BES nor WSRF is used, but instead a simple RESTful interface.
- Platform support: All common platforms are supported: most of the GridFactory code is 100% pure Java - the non-Java code is restricted to the server and consists of two Apache modules which compile on all major Linux distributions (binaries are provided).
- No central bottlenecks: The hierarchical pull architecture leaves the decision of which jobs to run to the resources running them and therefore avoids central bottlenecks and single points of failure like resource brokers and central information systems.
- Scalability: The hierarchical pull architecture supports multiple parents and the submission clients support multiple submission target hosts. Therefore, load balancing is built in and performance scales linearly with the number of servers.
- Virtual organizations can easily be created and destroyed.
- Software catalogs can easily be created, published and used.
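The pull paradigm and the multiple-parent arrangement mentioned in the list above can be illustrated with a small in-memory sketch. This is not GridFactory code: the class names, the `request_job` method and the `vm-image` requirement are hypothetical, and a real deployment would presumably make these requests over HTTPS rather than through direct method calls. The point is only that the server never contacts a worker; each worker asks its parents for a job it is able to run.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """A queued job; `requires` names hypothetical catalog entries it needs."""
    name: str
    requires: frozenset = frozenset()


@dataclass
class Server:
    """Holds a job queue and answers requests; it never contacts a worker."""
    name: str
    queue: list = field(default_factory=list)

    def submit(self, job):
        self.queue.append(job)

    def request_job(self, provides):
        """Hand out the first queued job the asking worker can satisfy."""
        for i, job in enumerate(self.queue):
            if job.requires <= provides:
                return self.queue.pop(i)
        return None


@dataclass
class Worker:
    """Polls one or more parent servers; it needs no listening service."""
    name: str
    provides: frozenset
    parents: list

    def pull_once(self):
        """Ask each parent in turn; return (server name, job name) or None."""
        for server in self.parents:
            job = server.request_job(self.provides)
            if job is not None:
                return server.name, job.name
        return None


if __name__ == "__main__":
    a, b = Server("a"), Server("b")
    a.submit(Job("render", frozenset({"vm-image"})))
    b.submit(Job("convert"))
    plain_worker = Worker("w1", frozenset(), [a, b])
    print(plain_worker.pull_once())  # ('b', 'convert')
```

In this sketch the worker without `vm-image` skips the job on server `a` and takes the one on server `b`, so the decision about which job runs where is made by the resource itself, and a worker with two parents spreads its requests across both.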