This tutorial is designed as an introductory guide to parallelizing C and C++ code using the Cilk language extension. This tutorial assumes that you have a fair knowledge of C and/or C++. The Intel® Cilk™ documentation was used to build this tutorial.
The authors of this tutorial are Michael Graf and Andrei Papancea. Its creation was partially supported by National Science Foundation's "Transforming Undergraduate Education in Science, Technology, Engineering and Mathematics (TUES)" program under Award No. DUE-1044299, the Andrew W. Mellon Foundation, and the Baker-Velde Award. It is currently being maintained by David Bunde. Please let us know what you think of the tutorial so that we can continue to improve it.
Intel® Cilk™ Plus is a user-friendly language add-on to the C and C++ programming languages, implemented by the Intel® C++ Compiler. Since almost all modern day devices have a multicore processor, parallelism is becoming increasingly relevant. The problem is that most popular languages were not created with the idea of parallelism in mind, and if they do support this feature it is usually unintuitive and difficult to implement. Intel® Cilk™ Plus provides a simple to use library that makes parallelizing C and C++ code trivial.
Intel® Cilk™ Plus adds only 3 keywords to C and C++: cilk_for, cilk_spawn, and cilk_sync. With only three keywords you can start writing C and C++ parallel code in minutes, that runs significantly faster than its serial counterpart, so you can better take advantage of multicore machines.
The cilk_spawn keyword tells the compiler that the function that cilk_spawn precedes may run in parallel with the caller - but is not required to. A cilk_spawn statement can be called in the following ways:
Looking at the previous example you can see some side effects of running things in parallel - tasks will run out of order most of the time. Sometimes you want certain tasks to run in order because they might be dependent on each other, case in which you would use cilk_sync.
When you place cilk_sync somewhere in the code it causes all previously spawed tasks to wait for each other to complete before the program can continue. Getting back to our previous "Hello world! Done!" example, placing a single cilk_sync statement right before "Done!" is printed can ensure that "Done!" is printed only after the two spawed tasks, hello() and world(), respectively, have finished their work.
Imagine you are a car manufacturer and that you need to write a computer program to make and place the parts of the car. The creation of the parts should begin at the same time, yet the order in which they are finished does not matter. On the other hand, the car parts cannot be placed until they are created, and they have to be placed in a specific order: chassis and frame must be placed before everything else, the wheels and engine have to be placed before the steering wheel, which is placed last. Once all parts have been placed, print the message "The car has been built.". With this scenario in mind, use the code below and finish the program to satisfy these conditions.
#include <stdio.h> #include <cilk/cilk.h> void make(char* str){ int i=0; for(i=0;i<1000000;i++) printf(""); printf("%s has/have been created.\n",str); } void place(char* str){ int i=0; for(i=0;i<1000000;i++) printf(""); printf("%s has/have been placed.\n",str); } int main(){ //Place your code here }
#include <iostream> #include <string> #include <cilk/cilk.h> using namespace std; void make(string str){ for(int i=0;i<1000000;i++) cout << ""; cout << str << " has/have been created." << endl; } void place(string str){ for(int i=0;i<1000000;i++) cout << ""; cout << str << " has/have been placed." << endl; } int main(){ //Place your code here }
#include <stdio.h> #include <cilk/cilk.h> void make(char* str){ int i=0; for(i=0;i<1000000;i++) printf(""); printf("%s has/have been created.\n",str); } void place(char* str){ int i=0; for(i=0;i<1000000;i++) printf(""); printf("%s has/have been placed.\n",str); } int main(){ //These 5 parts will finish in no particular order. //Run it a couple of time to see the variation in ordering. cilk_spawn make("Wheels"); cilk_spawn make("Chassis"); cilk_spawn make("Engine"); cilk_spawn make("Frame"); cilk_spawn make("Steering wheel"); cilk_sync; //wait for the parts to be created cilk_spawn place("Chassis"); cilk_spawn place("Frame"); cilk_sync; //Wait for chassis and frame to be placed cilk_spawn place("Wheels"); cilk_spawn place("Engine"); cilk_sync; //Wait for chassis, frame, wheels, and engine to be placed cilk_spawn place("Steering wheel"); cilk_sync; //wait for all parts to be placed printf("The car has been built.\n"); }
#include <stdio.h> #include <iostream> #include <string> #include <cilk/cilk.h> using namespace std; void make(string str){ for(int i=0;i<1000000;i++) cout << ""; cout << str << " has/have been created." << endl; } void place(string str){ for(int i=0;i<1000000;i++) cout << ""; cout << str << " has/have been placed." << endl; } int main(){ //These 5 parts will finish in no particular order. //Run it a couple of time to see the variation in ordering. cilk_spawn make("Wheels"); cilk_spawn make("Chassis"); cilk_spawn make("Engine"); cilk_spawn make("Frame"); cilk_spawn make("Steering wheel"); cilk_sync; //wait for the parts to be created cilk_spawn place("Chassis"); cilk_spawn place("Frame"); cilk_sync; //Wait for chassis and frame to be placed cilk_spawn place("Wheels"); cilk_spawn place("Engine"); cilk_sync; //Wait for chassis, frame, wheels, and engine to be placed cilk_spawn place("Steering wheel"); cilk_sync; //wait for all parts to be placed cout << "The car has been built." << endl; }
The cilk_for construct specifies a loop that permits loop iterations to run in parallel. This is a the parallel version of a normal C/C++ for loop.
cilk_for divides a loop into chunks containing one or more loop iterations. Once the loop is broken down, each chunk is executed on a specific thread of execution.
Let's take a look at a quick example, where we use cilk_for to sum up in parallel the first 10,000 integers:
Recall for a second our previous example, in which we sum up the first 10,000 integers. Whenever we run it we get a different result, because of a race condition. One way to solve this problem is to use locks. Locks are synchronization mechanisms that prevent multiple threads from changing a variable concurrently. Thus, locks help to eliminate data races.
Here is how you can use locks in C (using the pthread.h library) and in C++ (using the tbb/mutex.h library):
Recall our example at the end of the cilk_for section, where we count all the prime numbers up to 10,000,000. The issue with that example is that a race condition occurs when different threads try to increase the prime number counter. Your task is to use locks to fix the race condition and output the correct result, 664579 prime numbers.
Play around with the grainsize value to see what's the best value for you. We ran the program on a 16 core machine, so the same grainsize might not work as well for a machine with fewer cores.
//Run the code with the '-lm' compiler flag, to allow sqrt to work #include <stdio> #include <cilk/cilk.h> #include <sys/time.h> #include <math.h> int isPrime(int n){ int limit = sqrt(n); int i = 0; for(i=2; i<=limit; i++) if(n%i == 0) return 0; return 1; } int main(){ int n = 10000000; int gs = 25000; int numPrimes = 0; int i; struct timeval start,end; gettimeofday(&start,NULL); #pragma grainsize = gs cilk_for(i = 0; i < n/gs; i++){ int j; for(j = i*gs+1; j < (i+1)*gs; j += 2){ if(isPrime(j)) numPrimes++; } } gettimeofday(&end,NULL); double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000); printf("Found %d primes in %lf seconds.\n",numPrimes,myTime); }
#include <iostream> #include <cilk/cilk.h> #include <sys/time.h> #include <math.h> using namespace std; bool isPrime(int n){ int limit = sqrt(n); for(int i=2; i <=limit; i++) if(n%i == 0) return false; return true; } int main(){ int n = 10000000; int gs = 25000; //grainsize int numPrimes = 0; struct timeval start,end; gettimeofday(&start,NULL); //Start timing the computation #pragma grainsize = gs cilk_for(int i = 0; i < n/gs; i++){ for(int j = i*gs+1; j < (i+1)*gs; j += 2){ if(isPrime(j)) numPrimes++; } } gettimeofday(&end,NULL); //Stop timing the computation double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000); cout << "Found " << numPrimes.get_value() << " primes in " << myTime << " seconds.\n"; }
//Run the code with the '-lm' compiler flag, to allow sqrt to work #include <stdio.h> #include <cilk/cilk.h> #include <pthread.h> #include <sys/time.h> #include <math.h> int isPrime(int n){ int limit = sqrt(n); int i = 0; for(i=2; i<=limit; i++) if(n%i == 0) return 0; return 1; } int main(){ int n = 10000000; int gs = 25000; int numPrimes = 0; int i; pthread_mutex_t m; //create the lock struct timeval start,end; gettimeofday(&start,NULL); pthread_mutex_init(&m,NULL); //initialize the lock #pragma grainsize = gs cilk_for(i = 0; i < n/gs; i++){ int j; for(j = i*gs+1; j < (i+1)*gs; j += 2){ if(isPrime(j)){ pthread_mutex_lock(&m); //lock the code below numPrimes++; pthread_mutex_unlock(&m); //unlock numPrimes } } } gettimeofday(&end,NULL); double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000); printf("Found %d primes in %lf seconds.\n",numPrimes,myTime); }
#include <iostream> #include <cilk/cilk.h> #include <tbb/mutex> #include <sys/time.h> #include <math.h> using namespace std; bool isPrime(int n){ int limit = sqrt(n); for(int i=2; i <=limit; i++) if(n%i == 0) return false; return true; } int main(){ int n = 10000000; int gs = 25000; //grainsize int numPrimes = 0; tbb::mutex m; //create the lock struct timeval start,end; gettimeofday(&start,NULL); //Start timing the computation #pragma grainsize = gs cilk_for(int i = 0; i < n/gs; i++){ for(int j = i*gs+1; j < (i+1)*gs; j += 2){ if(isPrime(j)){ m.lock(); //lock the code below numPrimes++; m.unlock(); //unlock numPrimes } } } gettimeofday(&end,NULL); //Stop timing the computation double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000); cout << "Found " << numPrimes.get_value() << " primes in " << myTime << " seconds.\n"; }
You have seen how locks can be used to solve data race, but some of the problems associated with them can make them a poor solution to the problem. The better solution in Cilk™ are reducers. By definition, a reducer is a variable that can be safely used by multiple threads running in parallel.
The runtime ensures that each thread has access to a private copy of the variable, eliminating the possibility of races without requiring locks. When the threads synchronize, the reducer copies are merged (or reduced) into a single variable. The runtime creates copies only when needed, minimizing overhead.
Getting back to our summation example, where we add up the first 10,000 integers, take a look below at the reducer solution for the race condition problem:
Create an array with 1,000,000 elements, and fill it with random numbers. Use a reducer from the list provided by Intel® to find the maximum value of the array.
#include <cilk/cilk.h> #include <cilk/reducer_max.h> #include <iostream> using namespace std; int main(){ cilk::reducer_max<int> maxVal; int A [1000000]; for (int i = 0; i < 1000000; i++) A[i] = i; cilk_for (int i = 0; i < 1000000; i++) cilk::max_of(maxVal,A[i]); cout << maxVal.get_value() << "\n"; }
Recall (again, we know) our example at the end of the cilk_for section, where we count all the prime numbers up to 10,000,000. The issue with that example is that a race condition occurs when different threads try to increase the prime number counter. Your task is to use one of the available reducers to fix the race condition and output the correct result, 664579 prime numbers.
Play around with the grainsize value to see what's the best value for you. We ran the program on a 16 core machine, so the same grainsize might not work as well for a machine with fewer cores.
#include <iostream> #include <cilk/cilk.h> #include <sys/time.h> #include <math.h> using namespace std; bool isPrime(int n){ int limit = sqrt(n); for(int i=2; i <=limit; i++) if(n%i == 0) return false; return true; } int main(){ int n = 10000000; int gs = 25000; //grainsize int numPrimes = 0; struct timeval start,end; gettimeofday(&start,NULL); //Start timing the computation #pragma grainsize = gs cilk_for(int i = 0; i < n/gs; i++){ for(int j = i*gs+1; j < (i+1)*gs; j += 2){ if(isPrime(j)) numPrimes++; } } gettimeofday(&end,NULL); //Stop timing the computation double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000); cout << "Found " << numPrimes << " primes in " << myTime << " seconds.\n"; }
#include <iostream> #include <cilk/cilk.h> #include <cilk/reducer_opadd.h> #include <sys/time.h> #include <math.h> using namespace std; bool isPrime(int n){ int limit = sqrt(n); for(int i=2; i <=limit; i++) if(n%i == 0) return false; return true; } int main(){ int n = 10000000; int gs = 25000; //grainsize cilk::reducer_opadd<int> numPrimes; struct timeval start,end; gettimeofday(&start,NULL); //Start timing the computation #pragma grainsize = gs cilk_for(int i = 0; i < n/gs; i++){ for(int j = i*gs+1; j < (i+1)*gs; j += 2){ if(isPrime(j)) numPrimes++; } } gettimeofday(&end,NULL); //Stop timing the computation double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000); cout << "Found " << numPrimes.get_value() << " primes in " << myTime << " seconds.\n"; }
The runtime system provides a small number of functions that allow the user to control certain details of the program's behavior. To get access to these functions you need to include cilk/cilk_api.h in the header of your program.
There are four run time system functions: