A few months ago, my colleague Mike Barker tackled a word split
algorithm (see
here)
and showed that taking the parallel approach may not always be right.
In some cases, a simple 20-line, single-threaded “naive” solution can
out-perform your carefully thought-out, complex multi-threaded monster.
Still - the problem intrigued me because, on the face of it, this is
an ideal problem for a parallel algorithm. It is a classic “Data
parallelism” problem, borrowed from Guy Steele’s examples of
parallel computing using
Fortress.
So - in the interest of science - I devised an experiment.
I created three different solutions to the word split problem - all written in Java.
- Brute Force - The same simple 20-line, single-threaded approach - iterate, split, done (sketched after this list). (source)
- Simple Divide and Conquer - A trivial implementation of the divide and conquer approach. The text is divided between a fixed number of worker threads. Each thread splits its section of the text into words and possibly chunks (the two character sequences at the edges, which may be part of a bigger word). Those chunks are joined together in the merge phase. One thread is dedicated to blocking and waiting until all the splitting is done by all threads - it is that trivial/dumb. After all worker threads are done, that single thread merges the results (a sketch follows this list). (source)
- Flexible Divide - No merge at all. Every thread gets a section of the text and decides by itself whether its first characters are a chunk of a bigger word (by looking at the character just before its section). If they are, it skips them. It then splits all the words in its section, carrying on - perhaps even beyond the border with the next thread's section - until it has split the last word that started in its section. Notice: no contention, no need for threads to combine anything between them. (Have you spotted the weakness in this algorithm? See conclusions, point no. 4.) A sketch appears after this list. (source)
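To make the three approaches concrete, here are rough sketches of each. The class and method names are my own illustrations, not the ones from the linked sources. First, the brute-force single pass: walk the text once and collect maximal runs of letters as words.

```java
import java.util.ArrayList;
import java.util.List;

public class BruteForceSplit {
    // Walk the text once, collecting maximal runs of letters as words.
    public static List<String> split(String text) {
        List<String> words = new ArrayList<>();
        int wordStart = -1;                       // -1 means "not inside a word"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                if (wordStart < 0) wordStart = i; // a word begins here
            } else if (wordStart >= 0) {
                words.add(text.substring(wordStart, i));
                wordStart = -1;                   // the word just ended
            }
        }
        if (wordStart >= 0) words.add(text.substring(wordStart)); // trailing word
        return words;
    }
}
```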
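Next, a hedged sketch of the merge-based divide and conquer. Each worker splits a fixed slice into tokens and notes whether the first or last token touches a slice edge (and so may be a chunk of a bigger word); a single thread then glues adjacent chunks together. The `Partial` class and its flags are my own assumption about how the bookkeeping could look, not the original data structures.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleDivideAndConquerSketch {

    // What one worker produces for its slice [start, end).
    static class Partial {
        List<String> tokens = new ArrayList<>(); // maximal letter runs inside the slice
        boolean firstTouchesLeft;  // first run starts exactly at 'start' - maybe a chunk
        boolean lastTouchesRight;  // last run ends exactly at 'end' - maybe a chunk
    }

    // Worker phase: split one slice without looking outside it.
    static Partial splitSlice(String text, int start, int end) {
        Partial p = new Partial();
        int i = start;
        while (i < end) {
            while (i < end && !Character.isLetter(text.charAt(i))) i++; // skip separators
            int runStart = i;
            while (i < end && Character.isLetter(text.charAt(i))) i++;  // consume a run
            if (i > runStart) {
                p.tokens.add(text.substring(runStart, i));
                if (p.tokens.size() == 1) {
                    p.firstTouchesLeft = (runStart == start && start > 0);
                }
                p.lastTouchesRight = (i == end && end < text.length());
            }
        }
        return p;
    }

    // Merge phase: one thread joins edge chunks of adjacent slices into whole words.
    static List<String> merge(List<Partial> parts) {
        List<String> words = new ArrayList<>();
        String carry = ""; // word fragment carried over from the previous slice
        for (Partial p : parts) {
            if (p.tokens.isEmpty()) {
                if (!carry.isEmpty()) { words.add(carry); carry = ""; } // fragment was a whole word
                continue;
            }
            for (int t = 0; t < p.tokens.size(); t++) {
                String tok = p.tokens.get(t);
                if (t == 0) {
                    if (p.firstTouchesLeft) {
                        tok = carry + tok;               // join with the previous fragment
                    } else if (!carry.isEmpty()) {
                        words.add(carry);                // the previous fragment stood alone
                    }
                    carry = "";
                }
                boolean last = (t == p.tokens.size() - 1);
                if (last && p.lastTouchesRight) {
                    carry = tok;                         // may continue into the next slice
                } else {
                    words.add(tok);
                }
            }
        }
        if (!carry.isEmpty()) words.add(carry);
        return words;
    }
}
```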
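Finally, a sketch of one worker in the flexible divide. It skips a leading fragment that belongs to the previous slice and is allowed to read past its right border to finish the last word that started inside its slice - so there is no merge phase and nothing shared between workers. Again, the names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class FlexibleDivideWorker {

    // Split the words that *start* in the half-open slice [start, end).
    public static List<String> splitSection(String text, int start, int end) {
        List<String> words = new ArrayList<>();
        int i = start;

        // If the character just before the slice is a letter, the first characters
        // here are the tail of a word owned by the previous worker - skip them.
        if (start > 0 && Character.isLetter(text.charAt(start - 1))) {
            while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
        }

        while (i < end) {
            // Skip separators inside the slice.
            while (i < end && !Character.isLetter(text.charAt(i))) i++;
            if (i >= end) break;
            int wordStart = i;
            // The word started inside our slice, so we may run past 'end'
            // to finish it - this is what removes the merge phase.
            while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
            words.add(text.substring(wordStart, i));
        }
        return words;
    }
}
```

Each worker appends only to its own local list, which is why there is nothing to combine afterwards; the price is the redundant skipping described in conclusion no. 4 below.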
I ran those algorithms on a 16-core machine using 16 worker threads (for the parallel algorithms).
The text was Alice in Wonderland, but to test the algorithms I ran them several times, doubling the size of the text each time by concatenating another copy of the book.
And the results:
Conclusions:
- The “weak” proof stands (obviously) - Parallel algorithms are not always better. They carry a cost.
- Contention between threads is the big cost of parallel algorithms. Contention can be on either the data or resources. In this case, merging the results is what creates contention. The best way to tackle this is to find an algorithm that does not have it.
- Depending on the algorithm - if the cost of parallelism is constant, then there will be an input data set big enough for the benefit of parallelism to outweigh the cost. If the cost grows with the size of the input, then you're really in trouble.
- Many times, an algorithm that sacrifices performance in certain cases may gain you huge performance improvements in other cases. This is the case with the flexible divide. If all the text were one long word with no spaces at all, the flexible divide's performance would be awful: one thread would do work similar to BruteForce while all the others just kept skipping characters, i.e. wasting CPU cycles. However, if we ignore that edge case and agree that our input contains words whose average size is far smaller than the ~100K characters the whole book takes, then it out-performs the others.
- The final conclusion for me: there are no free lunches. You can't have a high-performing parallel algorithm without thinking about it; you can't always ignore parallel solutions; and you can't expect to follow simple rules blindly.