Don’t let the title mislead you. This article contains no information regarding sewing, knitting, or weaving. As interesting as textiles may be, this article instead focuses on computer parallelism and multithreading. As you may know, computing in parallel involves running multiple tasks simultaneously. You may have also heard that software developers use “threads” or “multithreading” to implement parallelism. For most people, that’s all they care to know about parallelism, but here at Pointwise we’re engineers. As engineers, we get excited about technology – especially when it helps us build first-class software. That’s where parallelism comes in. This article will focus on what parallelism is and how it influences you and me as Pointwise users.
What is Parallelism?
So, what is parallelism? A common perception is that parallelism is to computers as horsepower is to cars. That is, the more “parallelism” that you can shove into your central processing unit (CPU), graphics processing unit (GPU), or compute cluster, the more performant it will be. Current CPU trends certainly suggest this. Both Intel and AMD seem to cram more cores into each successive generation of microprocessor they produce. Furthermore, the marketing behind these chips promotes the notion that more cores equals more power. There is some truth to this, but let me introduce you to a programmer’s perspective.
Parallelism is better thought of as a specialized tool than as raw horsepower. A tool is a good metaphor for a few reasons. First, parallelism, like a tool, must be actively applied. Hammers don’t build houses – craftspeople do. Similarly, software developers must actively and intentionally design parallel code. Second, not all tools should be applied to all jobs. Chisels are nice when woodworking but not desirable for felling a tree. In the same way, not all computer algorithms can or should be implemented with parallelism. Finally, specialized tools require skill to use properly. Backhoes are fantastic, but you certainly hope the operator is trained. The same applies for those developing parallel algorithms.
If parallelism is a tool, then that implies it is used to build something. What is that something? Put simply, optimized software. More specifically, parallelism is used to help computers solve bigger problems in less time. However, parallelism is not the only technique used to optimize code.
In order to better understand how parallelism fits into the grand scheme of optimization, let’s first consider a real world analogy. Suppose you’re self-employed in the ditch digging business. Things are going swimmingly, but there’s a big problem. Ditch digging is not fun. You’ve found that you want to spend less time in the ditch and more time doing anything else. What are your options? You could quit, but a paycheck is desirable. Here are some options that don’t require moving back in with your parents: you could simply dig faster with the shovel you have, you could buy a bigger shovel that moves more dirt in each scoop, or you could hire people to help you. With luck, any of those options will get you from the ditch to the couch faster.
How does this scenario translate to software? When optimizing code, engineers face options similar to those of our imagined ditch digger. The option of digging faster is similar to running hardware at higher clock rates. CPUs can run at higher frequencies, which results in faster execution times. Unfortunately, space, heat, and power constraints limit how far clock rates can be pushed.
What about using a bigger shovel? In the computer realm, this is analogous to doing more work with the same number of clock cycles. Clock rates may be limited, but each instruction could be optimized to accomplish more in a single cycle. The most obvious way this is accomplished is by making our algorithms smarter and thus more efficient, but hardware also plays a role in the form of various micro-architecture techniques. These manifest in the form of instruction pipelining, speculative execution, and branch prediction to name a few.
Finally, what about hiring more people for the job? As you probably guessed, this situation is most analogous to parallelism. “Many hands make light work,” as the old adage goes, and it applies to machines as well. In the computing world, one name we give our workers is threads (short for threads of execution). When multiple threads are used in conjunction to solve a problem, this is called multithreading. If a problem can be subdivided so that multiple threads may execute concurrently to accomplish the same goal, then less time will be required to finish. However, subdividing problems is not always straightforward.
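To make the idea of subdividing work concrete, here is a minimal, hypothetical sketch in Python (not Pointwise’s actual code): a large sum is split into chunks, each chunk is handed to a thread in a pool, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each thread independently sums its own slice of the data."""
    return sum(chunk)

def parallel_sum(data, num_threads=4):
    """Subdivide the problem, hand each piece to a thread, combine results."""
    chunk_size = max(1, (len(data) + num_threads - 1) // num_threads)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() distributes the chunks across the worker threads
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1000))))  # 499500
```

One honest caveat: in CPython, the global interpreter lock limits the speedup of CPU-bound threads, so compiled languages (or process-based parallelism) are used when true simultaneous execution matters. The structure of the subdivision, however, is the same.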
To illustrate this, let’s revisit our example. Flash forward. You’ve now hired an elite ditch digging team, and you’re back on the job. Things are going swimmingly, but there’s a problem. You’ve been tasked with digging a very deep hole. Up to this point, you’ve been digging wide but shallow ditches. Now, the hole is only big enough to fit one digger at a time. Progress slows to a painful crawl because the work has again become sequential in nature.
The same problem plagues computer parallelism. When the problem domain is “wide and shallow”, parallelism applies naturally and speedup can scale linearly. These types of problems are often referred to as “embarrassingly parallel.” However, most computer algorithms do not fit the bill. Algorithms typically must be designed from the ground up in order to support parallelism properly, and that assumes that an efficient parallel algorithm can be devised at all.
One other notable obstacle that faces programmers is the operating system itself. It may come as a surprise, but the software we write lives in a strict dictatorship. When it comes to system resources, the operating system (OS) reigns supreme. If the OS decides that hardware cores are not to be used (say, due to a power savings policy), then our software cannot use them. The software may be the most brilliantly written parallel code, but if the OS gives us only one hardware core, then all our threads will run serially on that core.
So, what is parallelism? Parallelism is not a panacea. Software does not benefit from multiple cores unless the developer specifically designed it to do so. Thus, it’s just another tool in the programmer’s tool belt. When designing algorithms, the trained programmer knows when and where parallelism can be applied to create fast, scalable code.
How does Pointwise Use Parallelism?
Since you’re reading this article on the Pointwise Blog, you’re probably familiar with Pointwise and what we do. For the uninitiated, we write software to make meshes, and we strive to make our meshing software the best on the planet. Everyone at Pointwise is committed to this. One of the ways we accomplish this is by making our code performant, and one tool the product development team leverages is parallelism. You’re probably itching to see a real-world example that showcases parallelism at work, so here are a few practical ways that we use parallelism within Pointwise.
The Pointwise GUI
One of the subtler ways Pointwise employs parallelism is in our user interface. Ideally, the user interface of any application should remain responsive even if the program is hard at work. In a traditional single-threaded application, it is difficult to design user interface code that remains robust and responsive while the underlying program engages in resource-intensive processing. A common strategy to combat this is to use parallelism in the form of a graphical user interface (GUI) thread. The GUI thread’s sole purpose is to ensure that visual feedback can continue while the main thread churns away on the task at hand. The GUI thread is essentially a separate entity. It only communicates with the main thread via sanctioned methods – for example, a synchronized message queue. As the main thread chugs along, it will sporadically add messages to the queue and continue on its way. The GUI thread can then process these messages separately and react accordingly.
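The message-queue pattern described above can be sketched in a few lines of Python. This is an illustrative stand-in, not Pointwise’s implementation: a worker thread (playing the role of the busy main thread) posts progress messages to a synchronized queue, and a “GUI” loop drains them independently.

```python
import queue
import threading

messages = queue.Queue()  # the sanctioned communication channel

def worker():
    """Stand-in for the busy main thread: does the work, posts progress."""
    for step in range(1, 4):
        # ... expensive computation would happen here ...
        messages.put(f"step {step} complete")
    messages.put(None)  # sentinel: work is finished

def gui_loop(log):
    """Stand-in for the GUI thread: drains the queue and reacts to each message."""
    while True:
        msg = messages.get()  # blocks until a message arrives
        if msg is None:
            break
        log.append(msg)  # a real GUI would update a progress bar here

log = []
t = threading.Thread(target=worker)
t.start()
gui_loop(log)
t.join()
print(log)
```

Because the queue is the only shared state and it is internally synchronized, neither side ever blocks the other for long, which is exactly why the interface stays responsive.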
From the moment Pointwise is launched, a separate GUI thread is running. The sole responsibility of this thread is to handle GUI interaction. All the Pointwise meshing smarts occur in other threads. We call the main thread that synchronizes all meshing work the “Core”. Communication between the GUI thread and the Core is handled via our scripting language – Glyph. See the diagram below for an illustration.
For example, suppose you wish to initialize an unstructured volume mesh. To do this, you would open the Grid Solve task and press the initialize button. Under the covers, the GUI thread processes your request and constructs an appropriate Glyph command that triggers the generation of the mesh. The Core will intercept this Glyph command and begin processing. Mesh initialization is often a CPU-intensive process. Even so, the GUI remains responsive: the dual-level progress bar and Messages window remain active, and even the Display window still responds to view manipulations. This is all thanks to the GUI thread.
Of course, a GUI thread requires just one additional thread. If your machine was built any time after the Stone Age, then you probably have more than a few cores/threads to spare. If you really want Pointwise to push your system’s limits, then you’ll be interested in our next topic.
Examining Mesh Quality Metrics
Any computational fluid dynamics (CFD) expert knows that grid quality is integral to accurate solutions. If you need to identify problem areas within a mesh, then Pointwise’s examination tools are your best friend. Of course, meshes can be large – very large. Processing tens or hundreds of millions of elements can take a while. Fortunately, the algorithms used for many of the examine metrics in Pointwise fall into the “embarrassingly parallel” category.
So, what does it mean for an algorithm to be “embarrassingly parallel”? Simply put, it means that very little effort is required to divvy up a problem into parallel tasks. On top of this, little or no communication or synchronization among threads is required. Raw information is provided to each thread and each thread processes that data independently.
This is the case with many of the mesh metrics available in Pointwise’s Examine command. The raw input to these algorithms is typically 2-D or 3-D mesh elements. Since examination occurs after mesh generation, the elements have already been created. They are resident in memory and waiting to be processed. Furthermore, metric calculations can usually be computed in isolation. For example, the volume of any one element is calculated independently from any other element in the mesh. This process is illustrated below.
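As an illustration of this kind of embarrassingly parallel metric, here is a hypothetical Python sketch (not Pointwise’s actual code) that computes tetrahedron volumes across a thread pool. Each element’s volume depends only on its own four vertices, so no thread ever needs to coordinate with another.

```python
from concurrent.futures import ThreadPoolExecutor

def tet_volume(tet):
    """Volume of one tetrahedron: |det(B-A, C-A, D-A)| / 6.
    Depends only on this element's own vertices -- no shared state."""
    (ax, ay, az), (bx, by, bz), (cx, cy, cz), (dx, dy, dz) = tet
    u = (bx - ax, by - ay, bz - az)
    v = (cx - ax, cy - ay, cz - az)
    w = (dx - ax, dy - ay, dz - az)
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
         - u[1] * (v[0] * w[2] - v[2] * w[0])
         + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0

def examine_volumes(tets, num_threads=4):
    """Embarrassingly parallel: every element is scored independently."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(tet_volume, tets))

# A unit right tetrahedron has volume 1/6.
unit = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(examine_volumes([unit] * 3))  # three values of 1/6
```

Since the elements are read-only inputs and each result is written to its own slot, the work divides cleanly among however many threads the hardware offers.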
Since problems like these are “embarrassingly” natural to the parallelism paradigm, they allow software multitasking to really shine. However, as outlined above, there are several stipulations required for an algorithm to harness parallelism to this extent. Unfortunately, cases like this are the exception. Data is rarely in such an amenable format, and threads often need to be highly interdependent to arrive at a solution.
Grid quality is at the heart of the Pointwise meshing philosophy. One method Pointwise employs to improve the final quality of a generated grid is smoothing. This occurs as a post-processing step during volume initialization. Unlike the examine functionality discussed above, point smoothing is inherently interdependent. This doesn’t mean that a parallel solution is unachievable though. Several years ago, a parallel implementation of this algorithm was added to Pointwise, and this is how we did it.
At its core, the algorithm works by manipulating a point’s position within a hull of elements such that the element metric is improved for the local set of elements. This method is applied to all points in the mesh until a sufficient mesh quality is achieved. On the surface, this seems like an ideal problem for parallelization – a large dataset of points where each is locally modifying only its own data. However, closer inspection reveals that simply dividing the point data between threads for multithreaded execution will not provide sufficient isolation. This is because the algorithm requires that incident element data remain constant while a point is being positioned. If one thread modifies any point on a hull that another thread is manipulating, then results become inconsistent at best and corrupt at worst. The likelihood of such a scenario occurring increases dramatically as the number of threads increases. Obviously, multithreaded execution must control how point data is accessed. An example in 2-D is provided for illustration purposes below.
If the algorithm were read-only such that no point positions were altered, we could arbitrarily divide points among threads and expect performance to scale evenly. While this is enticing, the point smoothing algorithm is not read-only. It modifies point positions as it processes. In order to incorporate parallelism into this algorithm, we need to enforce a spatial policy.
Such organization can be achieved through the use of mesh partitioning. If the input mesh can be reliably and consistently partitioned into contiguous chunks, then each thread is free to operate on its own chunk without fear of interfering with a concurrent thread. Of course, mesh partitioning is a huge computer science topic in itself. Fortunately, partitioning is well-studied. Multilevel graph partitioning schemes have been developed that are fast and reliable.
Mesh partitioning allows us to isolate portions of the mesh, but the problem still remains on the partition boundaries. That is, multiple threads cannot be allowed to concurrently manipulate points on either side of a partition boundary. The solution we adopted for this problem is to create a new, non-contiguous partition that only includes boundary points. This partition is then processed serially after multithreaded execution. See the figure for a demonstration of how the boundary partition would look in 2-D.
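The two-phase scheme can be sketched with a toy one-dimensional “mesh” in Python. This is a hypothetical stand-in for the real algorithm, not Pointwise’s implementation: interior indices of each partition are smoothed in parallel, and the boundary indices are smoothed serially afterward. Because no thread ever writes a point that another thread reads, the phases cannot interfere.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy 1-D stand-in for mesh smoothing: each point is nudged toward
# the average of its two neighbors (a Laplacian-style smooth).

def smooth_indices(points, indices):
    """Smooth only the given indices. Interior indices of a partition
    never touch another partition's points, so threads stay isolated."""
    for i in indices:
        points[i] = 0.5 * points[i] + 0.25 * (points[i - 1] + points[i + 1])

def parallel_smooth(points, partitions, boundary):
    # Phase 1: each thread smooths the interior of its own partition.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [pool.submit(smooth_indices, points, part)
                   for part in partitions]
        for f in futures:
            f.result()  # wait for completion and surface any errors
    # Phase 2: the boundary partition is processed serially afterward.
    smooth_indices(points, boundary)

pts = [float(i % 3) for i in range(10)]  # a jagged point set
partitions = [[1, 2, 3], [6, 7, 8]]      # interior indices of two chunks
boundary = [4, 5]                        # points on the partition boundary
parallel_smooth(pts, partitions, boundary)
```

Note the isolation argument: the first thread writes indices {1, 2, 3} and reads only {0..4}, the second writes {6, 7, 8} and reads only {5..9}, and indices 4 and 5 are never written until the serial phase. That is precisely the guarantee the boundary partition exists to provide.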
With mesh partitions enforced like this, each thread is guaranteed to be able to work in isolation. Of course, mesh partitioning presents overhead that a serial implementation would not incur. The fruits of parallelizing the algorithm must outweigh the upfront costs. Fortunately, the partition only needs to be computed one time. Since the point smoothing algorithm is iterative in nature and may require several passes, the performance overhead is quickly recouped.
Finally, this discussion would not be complete without a pretty image of a 3-D partitioned mesh. If you are a patron of Another Fine Mesh, then you already know that meshing is as much a science as it is an art. To satisfy the art lover in us all, the diagram below shows some 3-D partitioned meshes.
Pointwise is heavily invested in crafting the best meshing software available. As such, we’re committed to researching state-of-the-art technology and incorporating it into our product. In case you weren’t aware, all the features mentioned in this article are currently integrated into our product. We’re also constantly improving our software. If you’re running an older version of the software, be sure to check out the latest version to see what’s new. Parallelism is just one exciting avenue that we look to when designing new algorithms and when retrofitting old ones.
If you are new to Pointwise and are curious to try it out for yourself, then why not give Pointwise a test drive? Request a free trial to get started.