Squeeze more performance from Parallelism
November 2nd, 2009
Squeeze more performance from Parallelism
In many posts, such as: The Future of the Parallelism and its Challenges I mentioned that synchronization the access to the shared resource is the major challenge to write parallel code.
The synchronization and coordination take long time from the overall execution time, which reduce the benefits of the parallelism; the synchronization and coordination also reduce the scalability.
There are many forms of synchronization and coordination, such as:
- Create Task object in frameworks such as: Microsoft TPL, TDD, and Parallel Runtime Library. Create and enqueue task objects require synchronization that it’s takes long time especially if we create it into recursive work such as: Quick Sort algorithm.
- Synchronization the access to shared data.
But there are a few techniques to avoid these issues, such as: Shared-Nothing, Actor Model, and Hyper Object (AKA Combinable Object). Simply if we reduce the shared data by re-architect our code this will gives us a huge benefits in performance and scalability.
Shared-Nothing and Actor Model
Personally I ‘m fan of these concepts; Shared-Nothing is architecture in which each node (i.e. execution thread) is independent and self-sufficient, and there is no single point of contention across the system. The principle behind Shared-Nothing is make every execution thread independent from the other threads, which improve the scalability and the performance. Also the pure Shared-Nothing can scale almost infinitely simply by adding nodes in the form.
Processing Huge File – Case
Let’s assume that we want process multi Gigabyte file in forward-only fashion. For example: count word frequency, indexing, etc. So our parallelism implementation could be:
Load chunk of data, and process it in parallel. Like this pseudo Cilk C++ code:
While (! Stream.EndOfFile())
{
data = Stream.LoadChunk();
for (i = 0; I < data. Length; ++i)
spawn processLine(data[i])
}
I used Cilk C++ because it’s very simple, however you can use your parallelism framework.
This way to processing huge file is ineffective because:
- Load every chunk done in synchronization which means the processors will wait until every chunk get loaded.
We may reduce size of the chunks to minimize the waiting time but this will introduce another problem: The Parallelism overhead, the chunks should be big-enough to be feasible.
- The partitioning, thread constructing, and other coordinating code is reduce the scalability and the performance of this code. Because every time you will load chunk you will call the parallel infrastructure to prepare free thread and partition the data and start execute ‘Parallel.For(), etc’.
Actually there are a few to methods to ignore the above issues, such as: Shared-Nothing Architecture, and Actor Model. So let’s fix it…
We will re-architect our code to make every piece independent. So we will have components such as:
- File Reader: This component will read from the file to the buffer.
- Buffer: Chunks buffer between the File Reader and Chunk Processor.
- Chunk Processor: Process Chunk of data, by loop and call ProcessLine function without any parallelism
- ProcessLine: The above showed ProcessLine function
The File Reader and every Chunk Processor will work independent. At the start you will initialize the Buffer (Fixed size concurrent queue) and File Reader in another thread, then initialize the workers everyone in different threads and map it to physical thread. For just start a new Task into your parallelism framework. The worker should workers in parallel try to get chunk from the buffer and process it.
What the different? This method will remove the direct dependence between the components, so everyone can scale independently. Now you can reduce the chunk size to reduce the waiting time and because every worker already working and there is no parallel overhead at all attached into the main thread after reading the data.
Combinable Object (A.K.A. Hyper Object)
Combinable Object or as Cilk like to call it ‘Hyper Object’ it’s a technique to avoid the synchronization and it’s problems by create combinable object per thread that at end of the task could combined to give you the final object or result. With Combinable object every thread access it is own copy, which means this is no need to synchronization and this copy will be cached effectively into the processors cache. See the following C sharp with Parallel Runtime Library simple code:
Combinable<List<string>> CO = new Combinable<List<string>>();
Parallel.For(0, ItemCount, i =>
{
CO.Local.Add(i.ToString());
});
List<string> result = CO.Combine((r, temp)=> {r.AddRange(temp); });
The List<> object itself not thread-safe, but the Combinable object solve the problem by create different List<> object per thread. And at the end you will be able to combine the objects together. But access the combinable objects itself require internal synchronization
However most of the Parallel frameworks today implement Prepare (i.e. local initialize) technique to maximize the benefits of the Combinable object. See the following C sharp with Parallel Runtime Library simple code:
Combinable<List<string>> CO = new Combinable<List<string>>();
Parallel.For(0, ItemCount, () => CO.Local, (i, local) =>
{
local.Add(i.ToString());
});
List<string> result = CO.Combine((r, temp)=> {r.AddRange(temp); });
It’s almost the same except we prepare and cache the thread local copy of the List<> object before the loop body, which reduce Local call to one time only.
In general Combinable Object, Shared-Nothing, and Actor model can improve you performance and scalability, but you should








November 3rd, 2009 at 1:45 pm
[...] Continue > http://www.hfadeel.com/Blog/?p=160 [...]
November 21st, 2009 at 7:45 am
Haytham,
I have been looking through a blog catalog in search of a blogger that writes substantive articles, with particular knowledge of parallel programming. After about 115 blogs, I came across your blog. Yours is by far the most comprehensive I’ve see.
The reason for my search is to locate somone suitable to perform an independent analysis of the just released QuickThread parallel programming software development kit.
See my website for info should you be interested.