Cilk Tutorial

Contents

  1. Introduction
    1. About this Tutorial
    2. About Cilk
  2. Parallelism
    1. cilk_spawn
    2. cilk_sync
    3. cilk_for
    4. Locks
    5. Reducers (C++ Only)
    6. Run Time System Functions

1. Introduction


1.1. About this Tutorial

This tutorial is designed as an introductory guide to parallelizing C and C++ code using the Cilk language extension. This tutorial assumes that you have a fair knowledge of C and/or C++. The Intel® Cilk™ documentation was used to build this tutorial.

The authors of this tutorial are Michael Graf and Andrei Papancea. Its creation was partially supported by the National Science Foundation's "Transforming Undergraduate Education in Science, Technology, Engineering and Mathematics (TUES)" program under Award No. DUE-1044299, the Andrew W. Mellon Foundation, and the Baker-Velde Award. It is currently maintained by David Bunde. Please let us know what you think of the tutorial so that we can continue to improve it.

1.2. About Cilk

Intel® Cilk™ Plus is a language extension for C and C++, implemented by the Intel® C++ Compiler. Since almost all modern devices have a multicore processor, parallelism is becoming increasingly relevant. The problem is that most popular languages were not designed with parallelism in mind, and where they do support it, the support is often unintuitive and difficult to use. Intel® Cilk™ Plus provides a small set of keywords that makes parallelizing C and C++ code straightforward.

Intel® Cilk™ Plus adds only three keywords to C and C++: cilk_for, cilk_spawn, and cilk_sync. With these three keywords you can start writing parallel C and C++ code in minutes and take better advantage of multicore machines. Given enough work, that code can run significantly faster than its serial counterpart.

2. Parallelism


2.1. cilk_spawn

The cilk_spawn keyword tells the compiler that the function that cilk_spawn precedes may run in parallel with the caller, but is not required to. A cilk_spawn statement can be written in the following ways:

type var = cilk_spawn func(args); // func () returns a value

var = cilk_spawn func(args); // func () returns a value

cilk_spawn func(args); // func () may return void
When a function spawns another function, the original function is known as the parent, while the spawned function is known as the child. It is illegal to spawn a function as the argument of another function, as in:
f(cilk_spawn g()); //Forbidden
Here's a more concrete use of cilk_spawn. In the example below, we want to print the message "Hello world!" and then "Done!" right before the program ends. Run the code a couple of times.
C version first, then the C++ version:
#include <stdio.h>
#include <cilk/cilk.h>

static void hello(){
int i=0;
for(i=0;i<1000000;i++)
printf("");
printf("Hello ");
}

static void world(){
int i=0;
for(i=0;i<1000000;i++)
printf("");
printf("world! ");
}

int main(){
cilk_spawn hello();
cilk_spawn world();
printf("Done! ");
}
#include <iostream>
#include <cilk/cilk.h>

using namespace std;

static void hello(){
for(int i=0;i<1000000;i++)
cout << "";
cout << "Hello " << endl;
}

static void world(){
for(int i=0;i<1000000;i++)
cout << "";
cout << "world! " << endl;
}

int main(){
cilk_spawn hello();
cilk_spawn world();
cout << "Done! ";
}
What you probably noticed is that the messages print out of order most of the time. This is because cilk_spawn lets the functions run asynchronously. Since the hello() and world() functions each have loops inside them, "Done!" can sometimes print before hello(), world(), or both. Note that we used the loops to generate more work for the program. If you remove the loops, the statements will most likely print in order, which means that you need a substantial amount of work to take full advantage of cilk_spawn.

2.2. cilk_sync

Looking at the previous example, you can see a side effect of running things in parallel: tasks finish out of order most of the time. Sometimes you want certain tasks to run in order because they depend on each other, in which case you would use cilk_sync.

When you place cilk_sync in a function, the function waits at that point until all the tasks it has previously spawned have completed, and only then continues. Getting back to our previous "Hello world! Done!" example, placing a single cilk_sync statement right before "Done!" is printed ensures that "Done!" is printed only after the two spawned tasks, hello() and world(), have finished their work.

C version first, then the C++ version:
#include <stdio.h>
#include <cilk/cilk.h>

.
.
.

int main(){
cilk_spawn hello();
cilk_spawn world();
cilk_sync;
printf("Done! ");
}
#include <iostream>
#include <cilk/cilk.h>
using namespace std;
.
.
.

int main(){
cilk_spawn hello();
cilk_spawn world();
cilk_sync;
cout << "Done! ";
}

Exercise

Imagine you are a car manufacturer and that you need to write a computer program to make and place the parts of the car. The creation of the parts should begin at the same time, yet the order in which they are finished does not matter. On the other hand, the car parts cannot be placed until they are created, and they have to be placed in a specific order: chassis and frame must be placed before everything else, the wheels and engine have to be placed before the steering wheel, which is placed last. Once all parts have been placed, print the message "The car has been built.". With this scenario in mind, use the code below and finish the program to satisfy these conditions.

C version first, then the C++ version:
#include <stdio.h>
#include <cilk/cilk.h>

void make(char* str){
  int i=0;
  for(i=0;i<1000000;i++)
    printf("");
  printf("%s has/have been created.\n",str);
}

void place(char* str){
  int i=0;
  for(i=0;i<1000000;i++)
    printf("");
  printf("%s has/have been placed.\n",str);
}

int main(){
  //Place your code here
}
#include <iostream>
#include <string>
#include <cilk/cilk.h>
using namespace std;

void make(string str){
  for(int i=0;i<1000000;i++)
    cout << "";
  cout << str << " has/have been created." << endl;
}

void place(string str){
  for(int i=0;i<1000000;i++)
    cout << "";
  cout << str << " has/have been placed." << endl;
}

int main(){
  //Place your code here	
}

Solution (C version first, then the C++ version):
#include <stdio.h>
#include <cilk/cilk.h>

void make(char* str){
  int i=0;
  for(i=0;i<1000000;i++)
    printf("");
  printf("%s has/have been created.\n",str);
}

void place(char* str){
  int i=0;
  for(i=0;i<1000000;i++)
    printf("");
  printf("%s has/have been placed.\n",str);
}

int main(){
	//These 5 parts will finish in no particular order.
	//Run it a couple of times to see the variation in ordering.
	
	cilk_spawn make("Wheels");
	cilk_spawn make("Chassis");
	cilk_spawn make("Engine");
	cilk_spawn make("Frame");
	cilk_spawn make("Steering wheel");

	cilk_sync; //wait for the parts to be created

	cilk_spawn place("Chassis");
	cilk_spawn place("Frame");

	cilk_sync; //Wait for chassis and frame to be placed

	cilk_spawn place("Wheels");
	cilk_spawn place("Engine");

	cilk_sync; //Wait for chassis, frame, wheels, and engine to be placed

	cilk_spawn place("Steering wheel");

	cilk_sync; //wait for all parts to be placed
	
	printf("The car has been built.\n");
}
#include <stdio.h>
#include <iostream>
#include <string>
#include <cilk/cilk.h>

using namespace std;

void make(string str){
  for(int i=0;i<1000000;i++)
    cout << "";
  cout << str << " has/have been created." << endl;
}

void place(string str){
  for(int i=0;i<1000000;i++)
    cout << "";
  cout << str << " has/have been placed." << endl;
}

int main(){
	//These 5 parts will finish in no particular order.
	//Run it a couple of times to see the variation in ordering.
	
	cilk_spawn make("Wheels");
	cilk_spawn make("Chassis");
	cilk_spawn make("Engine");
	cilk_spawn make("Frame");
	cilk_spawn make("Steering wheel");

	cilk_sync; //wait for the parts to be created

	cilk_spawn place("Chassis");
	cilk_spawn place("Frame");

	cilk_sync; //Wait for chassis and frame to be placed

	cilk_spawn place("Wheels");
	cilk_spawn place("Engine");

	cilk_sync; //Wait for chassis, frame, wheels, and engine to be placed

	cilk_spawn place("Steering wheel");

	cilk_sync; //wait for all parts to be placed
	
	cout << "The car has been built." << endl;
}

2.3. cilk_for

The cilk_for construct specifies a loop whose iterations are permitted to run in parallel. It is the parallel version of a normal C/C++ for loop. cilk_for divides a loop into chunks containing one or more loop iterations. Once the loop is broken down, each chunk is executed on a single thread of execution.

Let's take a look at a quick example, where we use cilk_for to sum up in parallel the first 10,000 integers:

C version first, then the C++ version:
#include <stdio.h>
#include <cilk/cilk.h>

int main(){
int sum = 0;
int i = 0;
cilk_for (i = 0; i <= 10000; i++)
sum += i;
printf("%d\n",sum);
}
#include <iostream>
#include <cilk/cilk.h>
using namespace std;

int main(){
int sum = 0;
cilk_for (int i = 0; i <= 10000; i++)
sum += i;
cout << sum << "\n";
}
Note that in order to use Cilk keywords and features, you have to include the cilk/cilk.h header, as above. Furthermore, notice that the only difference between the plain C version of this code and the Cilk version is the replacement of the for keyword with cilk_for. Once again, very little has to change to parallelize already-written C/C++ code using Cilk. Finally, note that the program above can return a different answer almost every time. That's because the parallel iterations all update sum, which creates a race condition. We will discuss and fix it later in the tutorial.

There are certain restrictions on cilk_for that you should take into account. First, you cannot change the loop control variable in the loop body. So, the following is illegal:
cilk_for (int i = 0; i <= 10000; i++)
i = someFunction();
Moving on, you cannot declare the loop control variable outside the loop in C++, as opposed to C. More exactly, the following code will not work in C++:
int i = 0;
cilk_for (i = 0; i <= 10000; i++)
//work
To fix the above code for C++, declare the loop control variable i in the header of the cilk_for loop:
cilk_for (int i = 0; i <= 10000; i++)
//work
As we mentioned previously, the cilk_for statement divides the loop into multiple smaller chunks, each of which runs on a single thread of execution. The maximum number of iterations in each chunk is defined as the grain size. The actual number of iterations run as a chunk will often be less than the grain size.

In a loop with many iterations, a large grain size can significantly reduce overhead. Alternatively, in a loop with few iterations, a small grain size can improve the parallelization of the program and thus increase performance as the number of processors increases.

In order to define the grain size, you need to use the cilk grainsize pragma (a pragma is a directive that tells the compiler to use an implementation-dependent feature). To change the grain size from its default value, add the following right before the cilk_for statement that you would like to tune:
#pragma cilk grainsize = expression
The expression above can be a number, an arithmetic expression, or a function call. The default value of the grainsize, which works well in most cases, is:
#pragma cilk grainsize = min(2048, N / (8*p))
where N is the number of loop iterations and p is the number of workers (threads) created during the current program execution.

When the loop executes, it is repeatedly split in half until each piece contains no more iterations than the grainsize. For example, if the grainsize is 4 and the number of loop iterations is 64, the loop is broken down into 16 chunks of 4 iterations each, which is exactly the number of iterations divided by the grain size. When that division leaves a remainder, the number of chunks can differ from the quotient. With 64 iterations and a grain size of 5, for instance, halving stops at pieces of 4 iterations (a piece of 8 is still too large), so the loop is again split into 16 chunks rather than the 12 or 13 that dividing 64 by 5 would suggest.

Note that if you change the grain size, test the performance of your program to ensure that you have made the loop faster, not slower.

The example below uses cilk_for to count all the prime numbers up to 10,000,000. Rather than giving every number its own cilk_for iteration, we group the numbers into blocks of gs numbers and let each iteration handle a whole block. This keeps the amount of work per parallel task large. Scheduling 10,000,000 tiny iterations, each followed by an implicit cilk_sync, is probably not something that you want. The inner loop tests only odd numbers, so it counts 1 in place of the prime 2 and the total still comes out right. Let's take a look:
C version first, then the C++ version:
//Compile with the '-lm' flag so that sqrt links correctly
#include <stdio.h>
#include <cilk/cilk.h>
#include <sys/time.h>
#include <math.h>

int isPrime(int n){
int limit = sqrt(n);
int i = 0; for(i=2; i<=limit; i++)
if(n%i == 0)
return 0;
return 1;
}

int main(){
int n = 10000000;
int gs = 25000; //grainsize
int numPrimes = 0;
int i;
struct timeval start,end;

gettimeofday(&start,NULL); //Start timing the computation

#pragma grainsize = gs
cilk_for(i = 0; i < n/gs; i++){
int j; for(j = i*gs+1; j < (i+1)*gs; j += 2){
if(isPrime(j))
numPrimes++;
}
}

gettimeofday(&end,NULL); //Stop timing the computation

double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

printf("Found %d primes in %lf seconds.\n",numPrimes,myTime);
}
#include <iostream>
#include <cilk/cilk.h>
#include <sys/time.h>
#include <math.h>

using namespace std;

bool isPrime(int n){
int limit = sqrt(n);
for(int i=2; i<=limit; i++)
if(n%i == 0)
return false;
return true;
}

int main(){
int n = 10000000;
int gs = 25000;
int numPrimes = 0;
struct timeval start,end;

gettimeofday(&start,NULL); //Start timing the computation

#pragma grainsize = gs
cilk_for(int i = 0; i < n/gs; i++){
for(int j = i*gs+1; j < (i+1)*gs; j += 2){
if(isPrime(j))
numPrimes++;
}
}

gettimeofday(&end,NULL); //Stop timing the computation

double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

cout << "Found " << numPrimes << " primes in " << myTime << " seconds.\n";
}
Once again, just like in the summation example above, there is a race condition, which we will solve in the next two sections: Locks and Reducers (C++ only).

2.4. Locks

Recall for a second our previous example, in which we sum up the first 10,000 integers. Whenever we run it we get a different result, because of a race condition. One way to solve this problem is to use locks. Locks are synchronization mechanisms that prevent multiple threads from changing a variable concurrently. Thus, locks help to eliminate data races.

Here is how you can use locks in C (using the pthread.h library) and in C++ (using the tbb/mutex.h library):

C version first, then the C++ version:
#include <stdio.h>
#include <cilk/cilk.h> //needed for cilk_for
#include <pthread.h> //pthread library

int main(){
int sum = 0;
int i = 0;
pthread_mutex_t m; //define the lock

pthread_mutex_init(&m,NULL); //initialize the lock
cilk_for (i = 0; i <= 10000; i++){
pthread_mutex_lock(&m); //lock - prevents other threads from running this code
sum += i;
pthread_mutex_unlock(&m); //unlock - allows other threads to access this code
}
printf("%d\n",sum);
}
//Compile with the '-ltbb' flag so that the TBB mutex links correctly
#include <iostream>
#include <cilk/cilk.h> //needed for cilk_for
#include <tbb/mutex.h> //mutex library

using namespace std;

int main(){
int sum = 0;
tbb::mutex m; //define the lock
cilk_for (int i = 0; i <= 10000; i++){
m.lock(); //lock - prevents other threads from running this code
sum += i;
m.unlock(); //unlock - allows other threads to access this code
}
cout << sum << "\n";
}
Even though locks are a solution to data races, there are a few things that can go wrong. First, deadlock might occur, which is when the threads are all waiting on each other and none of them can proceed. Second, since the threads have to wait on each other, the locked part of the code is serialized, causing performance issues. In Cilk™, the constructs that solve most of the issues associated with locks are called reducers, which we discuss in the next section.

Exercise

Recall our example at the end of the cilk_for section, where we count all the prime numbers up to 10,000,000. The issue with that example is that a race condition occurs when different threads try to increase the prime number counter. Your task is to use locks to fix the race condition and output the correct result, 664579 prime numbers.

Play around with the grainsize value to see what's the best value for you. We ran the program on a 16 core machine, so the same grainsize might not work as well for a machine with fewer cores.

C version first, then the C++ version:
//Compile with the '-lm' flag so that sqrt links correctly
#include <stdio.h>
#include <cilk/cilk.h>
#include <sys/time.h>
#include <math.h>

int isPrime(int n){
  int limit = sqrt(n);
  int i = 0;
  for(i=2; i<=limit; i++)
    if(n%i == 0)
      return 0;
  return 1;
}

int main(){
  int n = 10000000;
  int gs = 25000;
  int numPrimes = 0;
  int i;
  struct timeval start,end;

  gettimeofday(&start,NULL);

  #pragma grainsize = gs
  cilk_for(i = 0; i < n/gs; i++){
    int j; for(j = i*gs+1; j < (i+1)*gs; j += 2){
      if(isPrime(j))
        numPrimes++;
    }
  }

  gettimeofday(&end,NULL);

  double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

  printf("Found %d primes in %lf seconds.\n",numPrimes,myTime);
}
#include <iostream>
#include <cilk/cilk.h>
#include <sys/time.h>
#include <math.h>

using namespace std;

bool isPrime(int n){
  int limit = sqrt(n);
  for(int i=2; i <=limit; i++)
    if(n%i == 0)
      return false;
  return true;
}

int main(){
  int n = 10000000;
  int gs = 25000; //grainsize
  int numPrimes = 0;
  struct timeval start,end;

  gettimeofday(&start,NULL); //Start timing the computation

  #pragma grainsize = gs
  cilk_for(int i = 0; i  < n/gs; i++){
    for(int j = i*gs+1; j  < (i+1)*gs; j += 2){
      if(isPrime(j))
        numPrimes++;
    }
  }

  gettimeofday(&end,NULL); //Stop timing the computation

  double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

  cout  << "Found "  << numPrimes  << " primes in "  << myTime  << " seconds.\n";
}

Solution (C version first, then the C++ version):
//Compile with the '-lm' flag so that sqrt links correctly
#include <stdio.h>
#include <cilk/cilk.h>
#include <pthread.h>
#include <sys/time.h>
#include <math.h>

int isPrime(int n){
  int limit = sqrt(n);
  int i = 0;
  for(i=2; i<=limit; i++)
    if(n%i == 0)
      return 0;
  return 1;
}

int main(){
  int n = 10000000;
  int gs = 25000;
  int numPrimes = 0;
  int i;
  pthread_mutex_t m; //create the lock
  struct timeval start,end;

  gettimeofday(&start,NULL);
  
  pthread_mutex_init(&m,NULL); //initialize the lock

  #pragma grainsize = gs
  cilk_for(i = 0; i < n/gs; i++){
    int j; for(j = i*gs+1; j < (i+1)*gs; j += 2){
      if(isPrime(j)){
        pthread_mutex_lock(&m); //lock the code below
        numPrimes++;
        pthread_mutex_unlock(&m); //unlock numPrimes
      }
    }
  }

  gettimeofday(&end,NULL);

  double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

  printf("Found %d primes in %lf seconds.\n",numPrimes,myTime);
}
#include <iostream>
#include <cilk/cilk.h>
#include <tbb/mutex.h>
#include <sys/time.h>
#include <math.h>

using namespace std;

bool isPrime(int n){
  int limit = sqrt(n);
  for(int i=2; i <=limit; i++)
    if(n%i == 0)
      return false;
  return true;
}

int main(){
  int n = 10000000;
  int gs = 25000; //grainsize
  int numPrimes = 0;
  tbb::mutex m; //create the lock
  struct timeval start,end;

  gettimeofday(&start,NULL); //Start timing the computation

  #pragma grainsize = gs
  cilk_for(int i = 0; i  < n/gs; i++){
    for(int j = i*gs+1; j  < (i+1)*gs; j += 2){
      if(isPrime(j)){
        m.lock(); //lock the code below
        numPrimes++;
        m.unlock(); //unlock numPrimes
      }
    }
  }

  gettimeofday(&end,NULL); //Stop timing the computation

  double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

  cout  << "Found "  << numPrimes  << " primes in "  << myTime  << " seconds.\n";
}

2.5. Reducers (C++ Only)

You have seen how locks can be used to solve data races, but some of the problems associated with them can make them a poor solution. The better solution in Cilk™ is a reducer. By definition, a reducer is a variable that can be safely used by multiple threads running in parallel.

The runtime ensures that each thread has access to a private copy of the variable, eliminating the possibility of races without requiring locks. When the threads synchronize, the reducer copies are merged (or reduced) into a single variable. The runtime creates copies only when needed, minimizing overhead.

Getting back to our summation example, where we add up the first 10,000 integers, take a look below at the reducer solution for the race condition problem:

#include <stdio.h>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h> //needs to be included to use the addition reducer

int main(){
cilk::reducer_opadd<int> sum;
//defining the sum as a reducer with an int value
cilk_for (int i = 0; i <= 10000; i++)
sum += i;
printf("%d\n",sum.get_value()); //notice that sum is now an object
}
The first thing you need to do in order to use a reducer is to include the Cilk reducer header that best fits the needs of your program. There are multiple types of reducers: for mathematical operations, for strings, for determining the minimum and maximum values of a list, and so on. For the full list of reducers, check out the Intel® Cilk™ documentation. In our case, we need a reducer for a summation, so we include the reducer_opadd.h header. Next, define the variable susceptible to a race condition as a reducer. Once the operation is complete, call the get_value() function on the reducer to retrieve the final value of the computation (reducers are C++ hyperobjects, which is why this section is dedicated to C++ only).

Exercise (C++ only)

Create an array with 1,000,000 elements, and fill it with random numbers. Use a reducer from the list provided by Intel® to find the maximum value of the array.

#include <cilk/cilk.h>
#include <cilk/reducer_max.h>
#include <iostream>

using namespace std;

int main(){
  cilk::reducer_max<int> maxVal;
  int A [1000000];
  for (int i = 0; i < 1000000; i++)
    A[i] = i;

  cilk_for (int i = 0; i < 1000000; i++)
    cilk::max_of(maxVal,A[i]);

  cout << maxVal.get_value() << "\n";
}

Exercise (C++ only)

Recall (again, we know) our example at the end of the cilk_for section, where we count all the prime numbers up to 10,000,000. The issue with that example is that a race condition occurs when different threads try to increase the prime number counter. Your task is to use one of the available reducers to fix the race condition and output the correct result, 664579 prime numbers.

Play around with the grainsize value to see what's the best value for you. We ran the program on a 16 core machine, so the same grainsize might not work as well for a machine with fewer cores.

#include  <iostream>
#include <cilk/cilk.h>
#include <sys/time.h>
#include <math.h>

using namespace std;

bool isPrime(int n){
  int limit = sqrt(n);
  for(int i=2; i <=limit; i++)
    if(n%i == 0)
      return false;
  return true;
}

int main(){
  int n = 10000000;
  int gs = 25000; //grainsize
  int numPrimes = 0;
  struct timeval start,end;

  gettimeofday(&start,NULL); //Start timing the computation

  #pragma grainsize = gs
  cilk_for(int i = 0; i  < n/gs; i++){
    for(int j = i*gs+1; j  < (i+1)*gs; j += 2){
      if(isPrime(j))
        numPrimes++;
    }
  }

  gettimeofday(&end,NULL); //Stop timing the computation

  double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

  cout  << "Found "  << numPrimes  << " primes in "  << myTime  << " seconds.\n";
}

#include  <iostream>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <sys/time.h>
#include <math.h>

using namespace std;

bool isPrime(int n){
  int limit = sqrt(n);
  for(int i=2; i <=limit; i++)
    if(n%i == 0)
      return false;
  return true;
}

int main(){
  int n = 10000000;
  int gs = 25000; //grainsize
  cilk::reducer_opadd<int> numPrimes;
  struct timeval start,end;

  gettimeofday(&start,NULL); //Start timing the computation

  #pragma grainsize = gs
  cilk_for(int i = 0; i  < n/gs; i++){
    for(int j = i*gs+1; j  < (i+1)*gs; j += 2){
      if(isPrime(j))
        numPrimes++;
    }
  }

  gettimeofday(&end,NULL); //Stop timing the computation

  double myTime = (end.tv_sec+(double)end.tv_usec/1000000) - (start.tv_sec+(double)start.tv_usec/1000000);

  cout  << "Found "  << numPrimes.get_value()  << " primes in "  << myTime  << " seconds.\n";
}

2.6. Run Time System Functions

The runtime system provides a small number of functions that allow the user to control certain details of the program's behavior. To get access to these functions, you need to include cilk/cilk_api.h at the top of your program.

There are four run time system functions:

int  __cilkrts_set_param(const char* name, const char* value);
This function allows you to change runtime control parameters, given a name-value pair of strings. The name is the name of the parameter to be changed and the value is its new value. The name currently accepted is nworkers, which allows you to change the number of threads that the program uses. The value can be written in decimal, hexadecimal, or octal.

int  __cilkrts_get_nworkers(void);
Returns the number of threads assigned to handle Intel® Cilk™ Plus programs. If the code is running serially the default returned value is 1.

int  __cilkrts_get_worker_number (void);
Returns the number of the thread in which this function is called. If more than one user-created thread calls __cilkrts_get_worker_number, they may get identical results because thread IDs are not unique across user threads. If the code is running serially, the default returned value is 0.

int  __cilkrts_get_total_workers (void);
Returns the total number of threads assigned by the run time system. Because there might be more than one user-created thread, the run time system may allocate more thread slots than are active at a given time. If the code is running serially, the default returned value is 1.

Here's an example of how you would call these functions:
C version first, then the C++ version:
#include <stdio.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>


int main(){
int numWorkers = __cilkrts_get_nworkers();

printf("There are %d workers by default.\n",numWorkers);
__cilkrts_set_param("nworkers","20");

numWorkers = __cilkrts_get_nworkers();
printf("We changed the number of workers to %d.\n",numWorkers);

int workerNum = __cilkrts_get_worker_number();
printf("The current worker number is %d. That's because we are running serially.\n",workerNum);

int totalWorkers = __cilkrts_get_total_workers();
printf("The total number of threads assigned to the RTS is %d.\n",totalWorkers);
}
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>
#include <iostream>
using namespace std;

int main(){
int numWorkers = __cilkrts_get_nworkers();

cout << "There are " << numWorkers << " workers by default.\n";
__cilkrts_set_param("nworkers","20");

numWorkers = __cilkrts_get_nworkers();
cout << "We changed the number of workers to " << numWorkers << ".\n";

int workerNum = __cilkrts_get_worker_number();
cout << "The current worker number is " << workerNum << ". That's because we are running serially.\n";

int totalWorkers = __cilkrts_get_total_workers();
cout << "The total number of threads assigned to the RTS is " << totalWorkers << ".\n";
}