As mentioned in the Temperature Computations section, OpenMP provides shared memory parallelism on a single computer, and MPI provides distributed memory parallelism that can execute over many computers. MPI parallelism is only used when xSYMMIC is preceded by mpiexec on the command line, as demonstrated in this section.
By default, OpenMP parallelism is fully utilized by the xSYMMIC command line, as follows.
> xSYMMIC FET.xml

In these examples, the > symbol is meant to indicate a Windows command prompt.
To add the MPI parallelism and distribute the computations across multiple computers, the xSYMMIC command line must be invoked through the mpiexec launcher from an MPI library. xSYMMIC is compiled and tested with the Intel MPI Library. For differences with other libraries, see the discussion towards the end of the Remote Run section. The Remote Run and Remote Jobs dialogs use the same command line for MPI computation as described here for execution on a local cluster.
Note: Both xSYMMIC and the Intel MPI Library should already be installed, and environment variables should be configured to allow these commands to be used at the command prompt without giving the full paths. Installing the Intel MPI Libraries for Windows should be sufficient to set up the environment variables, but some versions might require configuration at the command prompt by calling a batch file provided with the MPI library.
> call mpivars.bat

> mpiexec -n 1 xSYMMIC FET.xml

This is the minimal command line for testing MPI and xSYMMIC. The mpiexec command always comes first, then the mpiexec flags (-n, -ppn, -hostfile, etc.), then the xSYMMIC executable, the template file, and finally the xSYMMIC command line options, if any. This example should produce exactly the same execution and parallelism as the non-MPI command line above, because the -n 1 option specifies a single MPI process, in other words, no additional MPI parallelism.
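In schematic form, the ordering just described is as follows, where the bracketed parts are optional and template.xml stands for any device or layout file.

> mpiexec [mpiexec flags] xSYMMIC template.xml [xSYMMIC options]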
Since most CPUs are hyper-threaded, with two threads per core, the number of physical cores is usually half of the number of (logical) processors or CPUs shown in a task manager or system monitor application. If the Intel MPI library is installed on the system, the cpuinfo utility may be used at the command prompt to view the hyper-thread and physical core information.
> cpuinfo

=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R) E5-1620 0
Packages(sockets) : 1
Cores             : 4
Processors(CPUs)  : 8
Cores per package : 4
Threads per core  : 2

=====  Processor identification  =====
Processor    Thread Id.    Core Id.    Package Id.
0            0             0           0
1            1             0           0
2            0             1           0
3            1             1           0
4            0             2           0
5            1             2           0
6            0             3           0
7            1             3           0

=====  Placement on packages  =====
Package Id.    Core Id.    Processors
0              0,1,2,3     (0,1)(2,3)(4,5)(6,7)

=====  Cache sharing  =====
Cache    Size      Processors
L1       32 KB     (0,1)(2,3)(4,5)(6,7)
L2       256 KB    (0,1)(2,3)(4,5)(6,7)
L3       10 MB     (0,1,2,3,4,5,6,7)
To run using two MPI parallel processes, increase the -n flag to 2.
> mpiexec -n 2 xSYMMIC FET.xml

Since the command line did not list any hosts, the logical CPUs of the local host are divided up between the two MPI processes running on the local machine. To see the distribution of logical CPUs to processes on the local machine, use the -env flag to set the I_MPI_DEBUG environment variable during the run.
> mpiexec -n 2 -env I_MPI_DEBUG=4 xSYMMIC FET.xml

The last two lines of the debug information report that two processes (rank=0 and rank=1) are being used, both on the same machine (Cape29). The rank 0 process has access to four logical CPUs (0,1,2,3) while rank 1 has access to the other four logical CPUs (4,5,6,7). As reported by cpuinfo, logical CPUs 0 and 1 reside on physical core 0, logical CPUs 2 and 3 on core 1, logical CPUs 4 and 5 on core 2, and logical CPUs 6 and 7 on core 3. There are two logical CPUs per physical core on this machine because each Intel core has Hyper-Threading technology. xSYMMIC automatically chooses the total number of threads so that one and only one thread resides on each physical core (e.g., four in the example above). If one thread per logical CPU is prescribed instead, performance will be reduced.
To run on a cluster, the machines of the cluster all need to have xSYMMIC and the MPI library installed, and the template(s) to be solved should reside on a shared network file system. Define a host file or machine file that lists the names of the machines in the cluster. (On Linux the host names are defined in the /etc/hosts file.) For example, a hosts file containing two machines might look like:
$ more hosts
HPCL8
HPCL7

Here, the dollar sign ($) signifies the Linux command prompt. Use this hosts file as follows.
$ mpiexec -n 2 -hostfile hosts -env I_MPI_DEBUG=4 xSYMMIC FET.xml

Although the hosts file lists multiple machines, process pinning was left up to the MPI library which chose to put all processes on the first machine in the file. Rank 0 is assigned 8 of the 16 physical cores on HPCL8, while rank 1 is assigned the other 8 cores. The number of processes per node may be specified with the -ppn flag, as follows.
$ mpiexec -n 2 -ppn 1 -hostfile hosts -env I_MPI_DEBUG=4 xSYMMIC FET.xml

Now the two ranks are divided between two machines, with 16 physical cores (32 hyper-threads) per process. This could also be achieved through the use of a machine file in which the node names are augmented with the desired number of processes per node.
$ more machines
HPCL8:1
HPCL7:1

In this last example, the -print-rank-map flag (instead of the I_MPI_DEBUG environment variable) is used to display the process pinning. Rank 0 is assigned to machine HPCL8, while rank 1 is assigned to HPCL7. MPI will assign all of the available hyper-threads on the machine to the process.
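A representative invocation for this machine file case, assuming Intel MPI's -machinefile option is used to pass the file, would be the following sketch (not necessarily the exact command used to produce the pinning described above).

$ mpiexec -n 2 -machinefile machines -print-rank-map xSYMMIC FET.xml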
As described in the Parallel Computations section, Level 2 superposition solves a layout in parallel by giving each core a separate part to solve independently, whereas Level 1 superposition solves each part in sequence by dividing each solve up over all of the cores (i.e. all of the cores work together to solve each part of the layout).
Here's a simple test for level 2 superposition that can be performed on any desktop with at least four cores and 8 GB of RAM. Open the mesaResistor.xml template in SYMMIC and use Create layout... from the File menu to make an array of 16 mesa resistors.
In the Device Layout Table dialog that follows, set the Length and Width of the MMIC to 12 mm (12000). Save the layout to a file named mesaResistor_4x4layout.xml. Although the solution mesh for the layout will contain almost two million temperature points, the individual solutions of the superposition are much smaller and should each require less than 0.5 GB of RAM to solve. Thus, it is reasonable to run 4 MPI processes in parallel on a machine with 8 GB of RAM for level 2 superposition of this problem.
Level 2 superposition is requested by the -s flag on the command line with mpiexec and xSYMMIC. To run the layout over 4 cores on the local machine, the command would be as follows.
> mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=2

The solver being used is announced during the run as "Parallel Direct Solver", which indicates PARDISO. Level 1 superposition would have a total run time of about 369 seconds on the same machine, since it must perform all 32 solves one at a time, sequentially, instead of in parallel. Furthermore, the speed-up provided by the added MPI parallelism has to compete with the loss of OpenMP threads available to the PARDISO solver, because the machine has only four physical cores.
Moving to a Linux machine with 16 cores, we get the following result.
$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=2

Solving the same problem on the same cores with level 1 superposition takes longer using the built-in PCG solver distributed over 4 MPI processes.
$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=1

As predicted from the Choosing Parallel Computing Methods section, the best performance for level 1 superposition is realized by using all of the cores for the OpenMP parallelism in the direct solver rather than using the hybrid approach and the PCG solver.
$ xSYMMIC mesaResistor_4x4layout.xml

An iterative solution method may be substituted for the direct PARDISO solver for single computer simulations. For Linux, the iterative PETSc solver is available by adding the -usePETSc flag to the command line.
$ xSYMMIC GaNSi_FET5million.xml -usePETSc
Solving part 1 of 1...
Parallel Iterative PETSc Solver (ConjugateGradient-ILU) for Steady State
:
Total run time was 462.499 seconds.
On Windows, the use of the -usePETSc flag will not have any effect. Instead, a message will appear saying that PETSc is not available:
> xSYMMIC mesaResistor.xml -usePETSc

When PETSc is not available, the computation will revert to the default non-PETSc solver. This is usually the PARDISO solver except when the mpiexec command is used, in which case the built-in PCG solver will be used.
PETSc is most advantageous for cluster computing, where the mpiexec command is used to distribute the problem over multiple cores on multiple machines. For example, creating a machine file in which up to 8 parts of the problem are distributed to each host results in much better performance than using PETSc on just one process on a single host.
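As a sketch of such a machine file, using the same host:count notation as before, distributing 8 processes to each of four hosts could look like the lines below; HPCL3 is the only host name visible in the run output that follows, so the other names are placeholders.

node1:8
node2:8
node3:8
HPCL3:8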
$ more machines
:
(HPCL3:24,25,26,27,28,29,30,31)
Starting xSYMMIC with PETSc and 32 MPI processes and 2 OpenMP threads each...
:
Solving part 1 of 1...
Parallel Iterative PETSc Solver (ConjugateGradient-ILU) for Steady State
:
Total run time was 47.319 seconds.
On Linux, PETSc MPI parallelism can even be used with superposition. The above example is repeated to allow direct comparison.
$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=1 -usePETSc
Solving part 1 of 32...
Parallel Iterative PETSc Solver (ConjugateGradient-ILU) for Steady State
:
Total run time was 71.726 seconds.
Using PETSc with level 2 superposition is ideal for large problems, when memory is insufficient to use PARDISO.
$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=2 -usePETSc
© Copyright 2007-2024 CapeSym, Inc. | 6 Huron Dr. Suite 1B, Natick, MA 01760, USA | +1 (508) 653-7100