Habanero Java Tutorial

Warning: This tutorial uses an older version of Habanero Java (HJ). For up-to-date materials on the current version, please look at the materials created by the HJ group.

Contents

  1. Introduction
    1. About this tutorial
    2. What is Habanero Java?
    3. Why Habanero Java?
    4. Getting Habanero Java
  2. Parallelism in HJ
    1. Forall
    2. Isolated
    3. Async
    4. Finish
    5. Future
    6. Barrier Synchronization and the Next Statement

1. Introduction


1.1. About This Tutorial

This tutorial is designed to quickly get you started programming in Habanero Java (HJ). It is written to be accessible to a broad audience, so we assume no background other than familiarity with Java.

This tutorial is a combination of contributions from David Bunde, Johnathan Ebbers, Maxwell Galloway-Carson, Michael Graf, Sung Joo Lee, Andrei Papancea, and Casey Samoore. It was created as part of the "Teaching parallel computing with higher-level languages and activity-based laboratories" project, with support from National Science Foundation's "Transforming Undergraduate Education in Science, Technology, Engineering and Mathematics (TUES)" program under Award No. DUE-1044299 and the Andrew W. Mellon Foundation. It is currently being maintained by David Bunde. Please let us know (dbunde@knox.edu) what you think of the tutorial so that we can continue to improve it.

1.2. What is Habanero Java?

Habanero Java (HJ) is part of the Habanero Multicore Software Research Project at Rice University. HJ is an extension of Java designed to teach high-level parallel programming to undergraduate and graduate students. This tutorial is designed to walk someone familiar with Java through the additional features afforded by Habanero.

1.3. Why Habanero Java?

Before Habanero Java, the standard way to write concurrent programs in Java was with the Thread class. A parent task creates child threads, which may run on different processors; each thread is given a piece of code to run, and upon completion it is joined back with the parent. Habanero Java takes these concepts and simplifies them with dedicated keywords and constructs.
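The Thread pattern just described can be sketched in plain Java. This is an illustration of the fork/join style that HJ simplifies, not HJ code; the class and method names here are ours:

```java
// Plain-Java fork/join: spawn a child thread, do work in the
// parent, then join the child before combining results.
public class ThreadDemo {
    // Sums the integers in [lo, hi).
    static long partialSum(int lo, int hi) {
        long s = 0;
        for (int i = lo; i < hi; i++) s += i;
        return s;
    }

    // Splits the range [0, n) between a child thread and the parent.
    static long parallelSum(int n) {
        long[] results = new long[2];
        Thread child = new Thread(() -> results[0] = partialSum(0, n / 2));
        child.start();                       // fork the child task
        results[1] = partialSum(n / 2, n);   // parent does its half
        try {
            child.join();                    // wait for the child to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results[0] + results[1];
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000));
    }
}
```

HJ's constructs, introduced below, express this same fork and join without any explicit Thread management.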

1.4. Getting Habanero Java

For directions to install the new library-based implementation of HJ, go here. The older version described in this tutorial was installed using the instructions here, but that version is out of date.


2. Parallelism in HJ


2.1. Forall

One of the simplest forms of parallelization in HJ is done via a forall loop. A forall loop works similarly to a for loop and must be written in terms of points and regions.

The idea of a forall loop is to run the iterations of a for loop in parallel. As the parent task enters the forall loop, it creates a task for each iteration; these tasks run in parallel and join back up with the parent when the loop exits. Let's take a look at a quick example: a program that sums the numbers in an integer array, like this:

for (int i = 0; i < A.length; i++) {
    sum += A[i];
}
To parallelize this section of the code we can use a forall loop like this:
forall (point [i] : [0:(A.length)-1]) {
    sum += A[i];
}
However, when you compile and run this code, you will likely find that the result differs from run to run: the parallel updates to sum race with one another. To fix this issue, we must use a construct called isolated, which we will discuss later in this tutorial.

Concepts

Points are elements of a k-dimensional Cartesian space (k > 0), where k is the rank of the point. Points can be set up like:
//1-dimensional point
point(1);

//multi-dimensional point (k > 0)
point(1,2,3,...,k);
It is unlikely that you will use points outside forall loop iterations.

A k-dimensional region is a set of k-dimensional points, defined as a Cartesian product of low:high contiguous subranges in each dimension. For example, the 1-dimensional region [1:5] consists of the 5 points [1],...,[5], while the 2-dimensional region [1:5,1:5] consists of 25 points [1,1],[1,2],...,[5,4],[5,5].

Regions are useful when rewriting regular for loops into forall loops in order to parallelize the code. For instance, a basic declaration of a loop in Java is:
for (int i = 0; i < 10; i++) {
    statements;
}
When rewriting a regular for loop into a forall loop using points and regions, the syntax looks as follows (note that region bounds are inclusive, so [0:9] covers the same 10 iterations):
forall (point [i] : [0:9]) {
    statements;
}
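Regions can also be multi-dimensional, which lets a single forall cover a nest of loops. A sketch, following the same inclusive-bound region syntax used above (m and n here are assumed array dimensions):

```
forall (point [i,j] : [0:m-1, 0:n-1]) {
    statements;
}
```

This creates one iteration for each of the m*n points in the two-dimensional region.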

Exercise

Let's now try to parallelize something! Below is code that performs a matrix multiplication serially and prints the runtime in milliseconds. The variable n keeps track of the size of the matrices, and A, B, and C are two-dimensional arrays. Try to convert this program into one that runs in parallel using the forall construct. After you have parallelized it, run both versions a couple of times and compare their runtimes. What do you notice? Also, think about why the program behaves this way, and try different values for n.

Note that this is not the best way to parallelize this program. Also note that, to check that the program is actually doing matrix multiplication, reduce n to 10 and uncomment the m1.display(); line in the main method.

(Create separate files for the serial and parallel programs, since you will want to compare their runtimes.)

public class MatrixMultiply{
   int n = 500;
   int[][] A = new int[n][n];
   int[][] B = new int[n][n];
   int[][] C = new int[n][n];

   public void display() {
       System.out.println("Matrix 1: ");
       printMatrix(A);
       System.out.println("Matrix 2: ");
       printMatrix(B);
       System.out.println("Result of multiplying the two matrices: ");
       printMatrix(C);
   }
   
   public void printMatrix(int[][] mat){
       for (int i = 0; i < n; i++){
           for (int j = 0; j < n; j++)
               System.out.print(" " + mat[i][j]);
           System.out.println();
       }
   }

   public void multiply() {
       for (int i = 0; i < n; i++)
           for (int j = 0; j < n; j++)
               A[i][j] = i+1;

       for (int i = 0; i < n; i++)
           for (int j = 0; j < n; j++)
               B[i][j] = i+1;

       long timeKeep = System.currentTimeMillis();

       for (int i = 0; i < n; i++)
           for (int j = 0; j < n; j++)
               C[i][j] = 0;

       for (int i = 0; i < n; i++)
           for (int j = 0; j < n; j++)
               for (int k = 0; k < n; k++)
                   C[i][j] += A[i][k] * B[k][j];

       System.out.println(System.currentTimeMillis() - timeKeep);
   }

   public static void main(String[] args) {
       MatrixMultiply m1 = new MatrixMultiply();

       m1.multiply();
       //m1.display();
   }
}

public class MatrixMultiplyP{
   public static int n = 500;
   public static int[][] A = new int[n][n];
   public static int[][] B = new int[n][n];
   public static int[][] C = new int[n][n];
   
   public void display() {
       System.out.println("Matrix 1: ");
       printMatrix(A);
       System.out.println("Matrix 2: ");
       printMatrix(B);
       System.out.println("Result of multiplying the two matrices: ");
       printMatrix(C);
   }
   
   public void printMatrix(int[][] mat){
       for (int i = 0; i < n; i++){
           for (int j = 0; j < n; j++)
               System.out.print(" " + mat[i][j]);
           System.out.println();
       }
   }

   public void multiply() {

       for (int i = 0 ; i < n ; i ++)
           for (int j = 0 ; j < n ; j ++)
               A[i][j] = i+1;

       for (int i = 0 ; i < n ; i ++)
           for (int j = 0 ; j < n ; j ++)
               B[i][j] = i+1;

       long timeKeep = System.currentTimeMillis();

       for (int i = 0; i < n; i++)
           for (int j = 0; j < n; j++)
               C[i][j] = 0;

       forall (point [i] : [0:n-1])
           for (int j = 0; j < n; j++)
               for (int k = 0; k < n; k++)
                   C[i][j] += A[i][k] * B[k][j];

       System.out.println(System.currentTimeMillis() - timeKeep);
   }
   

   public static void main(String[] args)  {
       MatrixMultiplyP m1 = new MatrixMultiplyP();
       
       m1.multiply();
       //m1.display();
   }
}

2.2. Isolated

A typical problem in concurrent programs occurs when a shared resource is accessed concurrently by more than one thread of execution when it should not be. The isolated construct fixes this problem by ensuring that the enclosed statement executes in mutual exclusion with all other potentially interfering parallel tasks. An example use of the isolated construct, applied to our previous example, is:

// Note that an isolated block cannot contain
// forall, async, or finish statements within the construct.

forall (point [i] : [0:(A.length)-1]) {
    isolated {
        sum += A[i];
    }
}
This eliminates the race condition from the previous example. However, you may realize that this is not an ideal way to parallelize the task: the forall creates more tasks than necessary and the isolated block serializes their updates, resulting in a slower run time. Constructs introduced later in this tutorial can aid in attaining better speed-up.
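Outside HJ, plain Java offers a comparable mutual-exclusion mechanism in the synchronized block. As a point of comparison only, here is a plain-Java sketch (the class and method names are ours, not part of HJ):

```java
// Plain-Java analogue of isolated: a synchronized block ensures
// that at most one thread updates the shared counter at a time.
public class IsolatedSketch {
    static int sum = 0;

    static void add(int x) {
        synchronized (IsolatedSketch.class) {  // plays the role of isolated
            sum += x;
        }
    }

    // Starts `threads` threads, each adding 1 `iters` times.
    static int runDemo(int threads, int iters) {
        sum = 0;
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < iters; i++) add(1);
            });
            ts[t].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 5000));  // race-free total
    }
}
```

Without the synchronized block, this program would exhibit the same lost-update race as the forall sum above.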

Try the following exercise to get better acquainted with the isolated construct:

Exercise

The following code increments n 20,000 times. Starting from 0, the correct final value of n is 20,000. Run the program a few times and observe the results. You will notice that the program does not consistently print 20,000; it gives a variety of results. Add an isolated statement to the code so that the result is consistently correct.

public class isoex{
    static int n=0;
    public static void main(String[] args){
       forall(point[i]:[0:19999]){
           n++;
       }
       System.out.println(n);
    }
}

public class isoex{
    static int n=0;
    public static void main(String[] args){
       forall(point[i]:[0:19999]){
           isolated{
               n++;
           }
       }
       System.out.println(n);
    }
}

2.3. Async

Internally, a forall loop is just a way to create a collection of tasks, one for each iteration of a for loop. HJ also allows the programmer to create tasks individually, with the async keyword. Simply stated, async causes the statement that follows it to execute as its own task. For example:

async function1();
async {
    function2();
    function3();
}

This creates two new tasks; the first calls function1 and the second calls both function2 and function3.

Exercise

The starter code below increments n up to 444,444,444 and m up to 555,555,555, then sums them and prints the result and the runtime. Run the given code and check the runtime and the output. Now wrap each loop with async. What do you notice?

public class Asyncex {
   public static int n = 0;
   public static int m = 0;
   public static long timeKeep;
 
   public static void main(String[] args){
      timeKeep = System.currentTimeMillis();

      for (int i = 0; i < 444444444; i++) {
          n++;
      }

      for (int i = 0; i < 555555555; i++) {
          m++;
      }

      System.out.println("Time " + (System.currentTimeMillis() - timeKeep));
      System.out.println("Sum " + (n + m));
   }
}

public class AsyncexP {
   public static int n = 0;
   public static int m = 0;
   public static long timeKeep;
 
   public static void main(String[] args){
      timeKeep = System.currentTimeMillis();

      async {
         for (int i = 0; i < 444444444; i++) {
            n++;
         }
      }

      async {
         for (int i = 0; i < 555555555; i++) {
            m++;
         }
      }

      System.out.println("Time " + (System.currentTimeMillis() - timeKeep));
      System.out.println("Sum " + (n + m));

      /** Notice how the runtime has gone down considerably, but the result is wrong?
       *  The advantage of running code asynchronously is that a new task is created
       *  for each block of code wrapped in async. The disadvantage is that code which
       *  depends on a result computed asynchronously may run before the asynchronous
       *  part has finished. This can obviously cause errors, just like you see above.
       *
       *  This issue can be fixed using the finish statement, which we'll introduce next.
       */
   }
}

2.4. Finish

Just as in low-level task parallelism, it is often important for parallel tasks to join together when finished, so that one thread does not continue on before the others have completed. In HJ this is accomplished via finish.

Let's take another look at our async example. After the two async statements launch their tasks, we print the sum of the incremented values of n and m. The main problem here is that the print statements can execute before both loops finish. To fix this, we need to wrap the async blocks in a finish statement.

Exercise

Use the finish statement on the given code to ensure the loops finish before printing.

public class Finishex {
   public static int n = 0;
   public static int m = 0;
   public static long timeKeep;
 
   public static void main(String[] args){
      timeKeep = System.currentTimeMillis();
		
      async {
         for(int i=0; i<444444444; i++){
            n++;
         }
      }
		
      async {
         for(int i=0; i<555555555; i++){
            m++;
         }
      }
		
      System.out.println("Time "+(System.currentTimeMillis()-timeKeep));
      System.out.println("Sum "+(n + m));
   }
}

public class FinishexP {
   public static int n = 0;
   public static int m = 0;
   public static long timeKeep;
 
   public static void main(String[] args){
      timeKeep = System.currentTimeMillis();
		
      finish {
         async {
            for(int i=0; i<444444444; i++){
                n++;
            }
         }
		
         async {
            for(int i=0; i<555555555; i++){
                m++;
            }
         }
      }
		
      System.out.println("Time "+(System.currentTimeMillis()-timeKeep));
      System.out.println("Sum "+(n + m));
   }
}

Because the program cannot continue until the finish block is complete, the runtime will increase, but it will still be lower than in the serial version. Note that the greater the workload, the better the speed-up from parallelization.

Remark: Forall revisited

Now that we've walked you through the forall, async, and finish statements, we can explain how forall loops actually work. Basically, a forall loop can be rewritten as follows:
finish {
    for (int i = 0; i < n; i++) {
        async {
            statement;
        }
    }
}
While forall loops are easier to write, the example above should help you understand what is actually going on, and why a forall loop with only a small amount of work per iteration can run slower than the serial version: every iteration pays the cost of creating a task.

2.5. Future

A future is a container of type <T> that holds the value produced by a parallelized task. This construct avoids the need to communicate a task's result through a static field, instance field, or array element. The syntax of future looks like this:

final future<T> var = async<T> {
    function;

    // The return value will be of type <T>
    return retVal;
};
Note that a semicolon is needed after the closing bracket. It is also important that the container be declared with the modifier final, which prevents the variable from being reassigned after initialization.

In order to retrieve the value of var from above as soon as the async block finishes, you use the future's get() method:
final future<T> var = async<T> {
    function;

    // The return value will be of type <T>
    return retVal;
};

// T is the data type of var, e.g. int
T result = var.get();
Note that you do not need a finish block when you use get(), because get() waits for the future's task to complete before returning its value.
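HJ's future/get pairing closely resembles java.util.concurrent.Future in the Java standard library. For comparison, here is a plain-Java sketch of the same idea (the class and helper names are ours, not HJ's):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureSketch {
    static int compute() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // submit() plays the role of HJ's async<T>: the task runs
            // concurrently and its return value is captured in a Future.
            Future<Integer> fut = pool.submit(() -> {
                int retVal = 6 * 7;   // stand-in for real work
                return retVal;
            });
            // get() blocks until the task completes, as in HJ.
            return fut.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(compute());
    }
}
```

As in HJ, no explicit join is needed: the call to get() itself waits for the task.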

2.6. Barrier Synchronization and the Next Statement

It should now be clear how to synchronize programs with features like finish and isolated. However, there are still more ways to synchronize them. Consider a program that iterates through guests at a party and gives them cake and ice cream.

public class cakeice{
    public static void main(String[] args){
        forall(point[guest]: [1:numguests]){
            givecake(guest);
            giveicecream(guest);
        }
    }
}
In its current version, the program will iterate through your guests and give them cake and ice cream. However, as it runs, it is very possible for some guests to receive both cake and ice cream before others even receive their cake. Let's say that, as a fair host, you want to wait until everybody has cake before starting to give out ice cream. To solve such a problem we use the next statement. When the next statement is placed in a forall loop, it acts as a barrier: every iteration must reach it before any iteration continues.
public class cakeice{
    public static void main(String[] args){
        forall(point[guest]: [1:numguests]){
            givecake(guest);
            next;
            giveicecream(guest);
        }
    }
}
Now, in its altered version, the program waits until every iteration runs the givecake function before continuing on to the giveicecream function, ensuring that all guests receive cake before any ice cream is given. Admittedly this example is a bit contrived, but it illustrates one use of barrier synchronization. The same idea applies to applications such as weather prediction, where all the nodes within a 1-mile radius must be computed before the nodes in the 2-mile radius.
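The next statement is similar in spirit to java.util.concurrent.CyclicBarrier in plain Java. Here is a sketch of the cake-then-ice-cream pattern using a barrier (the class and method names are ours, and the event log is our addition to make the ordering visible):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CyclicBarrier;

public class BarrierSketch {
    // Each guest records "cake", waits at the barrier, then records
    // "ice cream"; the barrier guarantees every "cake" entry appears
    // in the log before any "ice cream" entry.
    static List<String> run(int guests) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        CyclicBarrier barrier = new CyclicBarrier(guests);
        Thread[] ts = new Thread[guests];
        for (int g = 0; g < guests; g++) {
            ts[g] = new Thread(() -> {
                try {
                    log.add("cake");
                    barrier.await();   // plays the role of HJ's next
                    log.add("ice cream");
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            });
            ts[g].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```

The order of guests within each phase is still nondeterministic, but no "ice cream" entry can precede the last "cake" entry.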

Exercise

Consider the following code, which prints "Hello" and "Goodbye" once per loop iteration. Use barrier synchronization (the next statement) to make sure that all the "Hello"s are printed before any of the "Goodbye"s.

public class nextex{
    public static void main(String[] args){
        forall(point[i]:[1:5]){
            System.out.println("Hello" + i);
            System.out.println("Goodbye" + i);
        }
    }
}

public class nextex{
    public static void main(String[] args){
        forall(point[i]:[1:5]){
            System.out.println("Hello" + i);
            next;
            System.out.println("Goodbye" + i);
        }
    }
}