
So, here is a first program in MPI, which I did about six months ago in Prof. Calvin Lin’s Parallel Systems class.

MPI (Message Passing Interface) is a communication standard for multiprocessors. The way it works, in short, is that there is a root process (somewhat like the root thread in a multi-threaded application). The difference in MPI is that all the processes exist simultaneously (just as physical processors already exist rather than being spawned). The root process divides up the work and sends it to all the other processes via messages; hence, message passing.

Each process can send and receive messages, but message passing demands more control and discipline: it is trickier to get the messages in the right order and structure. But, hands-on, here is a basic example.

This is what a basic application should be structured like. main initializes MPI, then gets the total number of processes participating in the computation as well as the current process’s rank. This is where it differs from multi-threading: rather than spawning threads and telling each one what function to execute, all MPI processes execute the same code, so you need the process ID to “divide” the work. In the skeleton below, doInitialize() does the initialization and division of work, and doWork() does the actual work. So if you were to sum an array, the initialization would read the array in from a file, and doWork() would sum the chunk of data allocated to that process. The initialization can be done by any process; here I have used MY_ROOT_PROCESS to represent the root process.

#include <mpi.h>
#include <cstdio>

#define MY_ROOT_PROCESS 0   // rank 0 acts as the root here

void doInitialize();        // problem-specific: read input, divide up the work
void doWork();              // problem-specific: process this node's share

int main(int argc, char** argv)
{
    int mynode, totalnodes;
    int mpiResCode = MPI_Init(&argc, &argv);
    if (mpiResCode != MPI_SUCCESS)
    {
        printf("\nFailed to initialize MPI.");
        MPI_Abort(MPI_COMM_WORLD, mpiResCode);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);  // how many processes in total
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);      // which one am I?
    for (int i = 0; i < 100; i++)
    {
        if (mynode == MY_ROOT_PROCESS)
            doInitialize();
        else
            doWork();
    }
    MPI_Finalize();
    return 0;
}
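To try it out, you would typically compile with the mpicc wrapper (or mpicxx for C++) that MPI implementations ship, and launch with something like mpiexec -n 4 ./program to run it as four processes.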

The trick is getting the messages right. For instance, in the worker process, which waits for work to be allocated to it, you would have something like the following:

// Worker side: receive the size first, then allocate a buffer and receive the data.
MPI_Recv(&localDataSize, 1, MPI_INT, MY_ROOT_PROCESS, MY_SIGNAL_CONTROL, MPI_COMM_WORLD, &status);
localbuff = new int[localDataSize];
MPI_Recv(localbuff, localDataSize, MPI_INT, MY_ROOT_PROCESS, MY_SIGNAL_CONTROL, MPI_COMM_WORLD, &status);
   ...
   ...
   ...
// Root side: send the size first, then the data, to worker i+1.
MPI_Send(&numVals, 1, MPI_INT, i+1, MY_SIGNAL_CONTROL, MPI_COMM_WORLD);
MPI_Send(globalData+startIndex, numVals, MPI_INT, i+1, MY_SIGNAL_CONTROL, MPI_COMM_WORLD);

and in the root process, which is responsible for sending out the work, you would have similar send commands (the last two lines above). Notice that the sequence must match, since the message passing is stateless and maintains no information between communications.
The first three lines are in the recipient of the work: in this example, it first receives the size of the work it should expect, then allocates a buffer of that size, and then receives the work itself. The sender (the root process in our example) first sends the size and then all the data. Once the data transfer has taken place, the worker can process its data. But once the processing has completed, a similar transfer has to take place in the other direction so the results can be combined. That is easy to do naively; if you use a tree-structured reduce operation instead, it becomes just a bit trickier.
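To make the whole round trip concrete, here is a minimal, self-contained sketch of the array-sum example. The variable names follow the fragments above, but the array contents, the tag value, and the use of MPI_Reduce at the end are my own choices for illustration; MPI_Reduce is MPI’s built-in collective that does the tree-structured combine for you.

#include <mpi.h>
#include <cstdio>

#define MY_ROOT_PROCESS   0
#define MY_SIGNAL_CONTROL 1   // message tag used for the work hand-off

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int mynode, totalnodes;
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    if (totalnodes < 2)   // this sketch assumes one root plus at least one worker
    {
        if (mynode == MY_ROOT_PROCESS)
            printf("Run with at least 2 processes.\n");
        MPI_Finalize();
        return 1;
    }

    int localSum = 0;
    MPI_Status status;

    if (mynode == MY_ROOT_PROCESS)
    {
        // Root: carve the global array into chunks, one per worker,
        // sending the size first and the data second (the order must match!).
        const int N = 1000;
        int globalData[N];
        for (int i = 0; i < N; i++) globalData[i] = 1;

        int chunk = N / (totalnodes - 1);
        for (int i = 0; i < totalnodes - 1; i++)
        {
            int startIndex = i * chunk;
            int numVals = (i == totalnodes - 2) ? N - startIndex : chunk;
            MPI_Send(&numVals, 1, MPI_INT, i + 1, MY_SIGNAL_CONTROL, MPI_COMM_WORLD);
            MPI_Send(globalData + startIndex, numVals, MPI_INT, i + 1, MY_SIGNAL_CONTROL, MPI_COMM_WORLD);
        }
    }
    else
    {
        // Worker: receive the size, allocate, receive the data, sum the chunk.
        int localDataSize;
        MPI_Recv(&localDataSize, 1, MPI_INT, MY_ROOT_PROCESS, MY_SIGNAL_CONTROL, MPI_COMM_WORLD, &status);
        int* localbuff = new int[localDataSize];
        MPI_Recv(localbuff, localDataSize, MPI_INT, MY_ROOT_PROCESS, MY_SIGNAL_CONTROL, MPI_COMM_WORLD, &status);
        for (int i = 0; i < localDataSize; i++) localSum += localbuff[i];
        delete[] localbuff;
    }

    // Combine the partial sums at the root. MPI_Reduce is typically implemented
    // as a tree internally, so the combine takes O(log P) stages.
    int globalSum = 0;
    MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, MY_ROOT_PROCESS, MPI_COMM_WORLD);

    if (mynode == MY_ROOT_PROCESS)
        printf("Total = %d\n", globalSum);

    MPI_Finalize();
    return 0;
}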

I was working on finishing a Parallel Systems assignment in which we had to implement a prefix framework. The nice thing about the framework is that if you have N threads (or processes) working on a problem in parallel, a sequential approach would require N stages to combine their results, but by pairing up results the prefix framework combines them in log N stages. With 8 threads, for example, the pairs (0,1)(2,3)(4,5)(6,7) combine first, then (0,2)(4,6), then (0,4): three stages in all.
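Just to see why it is log N, here is a tiny single-threaded sketch of the pairing (the partial values are made up for the example):

// Combine 8 partial results in log2(8) = 3 stages instead of 7 sequential adds.
int partial[8] = {3, 1, 4, 1, 5, 9, 2, 6};           // hypothetical per-thread results
for (int stride = 1; stride < 8; stride *= 2)        // stages: stride = 1, 2, 4
    for (int i = 0; i + stride < 8; i += 2 * stride) // the pairs within one stage
        partial[i] += partial[i + stride];
// partial[0] now holds the total (31).

In the threaded version, every pair within a stage is combined concurrently, which is exactly what the stride loop in combineTree below does.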

I had worked with Java threads and found them quite interesting, particularly because you need only minimal knowledge of threads to start working with them. pthreads, however, are slightly …un-object-like, so it was a bit of a mess to get object-oriented behaviour out of them.

The core of the tree combination is the combineTree function, which is a method of the thread instance (I made up classes to represent different aspects of the problem). The function is as follows:

void LocalThreadContext::combineTree()
{
	for (int stride = 1; stride < (NUM_THREADS - sequenceID); stride *= 2)
	{
		// At each stage, only threads whose ID is a multiple of 2*stride
		// survive; each combines with the partner stride positions away.
		if ((sequenceID % (2 * stride)) == 0)
		{
			g->combineFunction(&lRes, g->children[sequenceID + stride].getLocalResult());
			/// Display logic... just something to make it look like a tree.
			#if DEBUG_CONSOLE
				pthread_mutex_lock(&consoleMutex);
				printf("\n");
				for (int j = 1; j <= stride; j *= 2)
					printf("\t");
				printf("Combining %d with %d", sequenceID, sequenceID + stride);
				pthread_mutex_unlock(&consoleMutex);
			#endif
			///// End of display logic!
		}
	}
}

DEBUG_CONSOLE is something you might enable while debugging if you would like a nice tree-like representation of the combining logic.

It was fun, but the problem of expressing arbitrary problems in such a framework remains something I would need to work on (if I get some time, which I won’t!).

The threads start off by locking on their result variable, so that no thread attempts to combine with a result before that result has been computed. Each thread is given a global thread context, which defines how the thread is to use the data provided to it (an int array, a double array, etc.) and how to combine its result with the result of another thread (the sum problem behaves differently from the two_max problem). The thread performs the function on each of the data elements it is given and then combines the result using the combining function provided. The complete code is available here.
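As a rough sketch of that locking discipline (my own illustration, not the assignment’s actual classes; I pair the mutex with a condition variable, which avoids a race if a parent tries to combine before its child has even started):

#include <pthread.h>

// One result slot per thread. Initialize lock and done with
// pthread_mutex_init()/pthread_cond_init() before spawning the threads.
struct ThreadResult
{
    int value;
    bool ready;              // set once the local result is computed
    pthread_mutex_t lock;
    pthread_cond_t done;
};

void* worker(void* arg)
{
    ThreadResult* r = static_cast<ThreadResult*>(arg);
    int local = 0;
    // ... apply the per-element function to this thread's slice of the data ...
    pthread_mutex_lock(&r->lock);
    r->value = local;
    r->ready = true;                 // publish the local result
    pthread_cond_signal(&r->done);   // wake any parent waiting to combine
    pthread_mutex_unlock(&r->lock);
    return 0;
}

// Parent side of a combine step: block until the child's result is ready.
int getLocalResult(ThreadResult* child)
{
    pthread_mutex_lock(&child->lock);
    while (!child->ready)                          // guard against spurious wakeups
        pthread_cond_wait(&child->done, &child->lock);
    int v = child->value;
    pthread_mutex_unlock(&child->lock);
    return v;
}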