
Objective:
This project investigates the design and evaluation of an adaptive runtime management framework for the distributed adaptive grid hierarchies that underlie parallel adaptive mesh-refinement (AMR) techniques for the solution of partial differential equations. The framework uses application and system state information to select the appropriate partitioning scheme, distribution parameters (e.g. granularity per processor), communication mechanism, number and type of processors to be used, etc., at runtime. The overall objective of the framework is the design and development of policy-driven tools for automated configuration and runtime management of distributed adaptive applications on dynamic and heterogeneous networked computing environments.
Introduction:
Dynamically adaptive methods for the solution of partial differential equations that employ locally optimal approximations can yield highly advantageous cost/accuracy ratios when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions of high local solution error. Distributed implementations of these adaptive methods offer the potential for the accurate solution of realistic models of important physical systems. These implementations, however, lead to interesting challenges in dynamic resource allocation, data distribution and load balancing, communication and coordination, and resource management. The overall efficiency of the algorithms is limited by the ability to partition the underlying data structures at runtime so as to expose all inherent parallelism, minimize communication/synchronization overheads, and balance load. A critical requirement while partitioning adaptive grid hierarchies is the maintenance of logical locality, both across different levels of the hierarchy under expansion and contraction of the adaptive grid structure, and within partitions of grids at all levels when they are decomposed and mapped across processors. The former enables efficient computational access to the grids, while the latter minimizes the total communication and synchronization overheads. Furthermore, application adaptivity results in application grids being created, moved, and deleted on the fly, making it necessary to efficiently re-partition the hierarchy so that it continues to meet these goals.
Moving these applications to dynamic and heterogeneous networked computing environments introduces a new level of complexity. These environments require selecting and configuring application components based on available resources. However, the complexity and heterogeneity of the environment make the selection of a "best" match between system resources, application algorithms, problem decompositions, mappings and load distributions, communication mechanisms, etc., non-trivial. System dynamics coupled with application adaptation make application configuration and runtime management a significant challenge. Clearly, there is a need for "smart" tools that can automate the configuration and management process. This project is a step in that direction.
Application-sensitive adaptation uses the current state of the application to drive the runtime selection of:
1) Partitioning scheme and partitioner (SFC, Vampire, ParMetis, ITB, CGD, ...)
2) Partitioning granularity (patch size, aspect ratio, ...)
3) Number and type of processors used
Application-sensitive adaptation is based on:
1) application-centric characterization of dynamic partitioning/load-balancing schemes based on a 5-component quality metric
2) characterization of application execution state based on the octant approach
System-sensitive adaptation investigates the use of the current system state to drive runtime adaptation. System state (compute, network) information is obtained using SNMP MIBs. Adaptation includes the selection of:
1) number of processors to be used
2) work allocation per processor
3) distribution type (e.g. optimize for latency tolerance or load-balance)
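To make the flavor of this selection concrete, the following C++ sketch shows a runtime decision of this kind. All types, names, rules, and thresholds here are illustrative assumptions for this page, not the project's actual API; C++ is used for all code sketches below.

#include <string>

struct AppState { bool compute_bound; bool low_dynamics; bool scattered; };
struct SysState { int procs_available; double avg_cpu_load; };

struct Decision {
    std::string partitioner;   // e.g. "SFC", "pBD-ISP", "G-MISP+SP"
    int procs_to_use;
};

// Pick a partitioner from the application state (an octant-style rule of
// thumb, simplified from the characterization described later) and a
// processor count from the system state (the 0.8 load cutoff is an
// illustrative value, not a project parameter).
Decision configure(const AppState& a, const SysState& s) {
    Decision d;
    if (!a.compute_bound && !a.low_dynamics)
        d.partitioner = "pBD-ISP";        // communication-dominated, dynamic
    else if (a.scattered)
        d.partitioner = "SFC";            // compute-heavy, scattered
    else
        d.partitioner = "G-MISP+SP";      // compute-heavy, localized
    d.procs_to_use = (s.avg_cpu_load > 0.8) ? s.procs_available / 2
                                            : s.procs_available;
    return d;
}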
Foundations for Application-centric Characterization and Partitioning
AMR Overview : A brief overview on Adaptive Mesh Refinement
AMR Infrastructures and Taxonomy : AMR packages/libraries and classification of partitioning strategies
Partitioning Techniques for SAMR Grid Hierarchies : SFC, G-MISP, G-MISP+SP, pBD-ISP, SP-ISP, ParMetis
SAMR Applications : Scalarwave 2D & 3D, Buckley-Leverett 2D & 3D, RM 2D & 3D, EnoAMR 2D, Transport 2D
Partitioning Quality : 5-component metric for PAC evaluation
Octant Approach : Application-centric characterization of partitioning techniques
Adaptive Meta-Partitioner : Dynamic PAC tuple
Partitioner Evaluation : Results from evaluation of partitioning techniques
Application-centric Evaluation : Evaluation of partitioners for RM3D SAMR application
Application-sensitive Partitioning: Experimental Study
The choice of the ``best'' partitioning technique and associated partitioning parameters (e.g. granularity) depends not only on the nature of the application, but also on its runtime state. This is motivated by the observation that for parallel/distributed SAMR, no single partitioning scheme performs best for all types of applications and systems. Even for a single application, the most suitable partitioning technique depends on input parameters and the application runtime state. As a result, it becomes necessary to actively manage these dynamic applications at runtime. This includes using application and system runtime state to select and configure the partitioning and load-balancing strategy to be used, so as to maximize performance. The goal of the adaptive application-sensitive meta-partitioner is to reduce the overall runtime of parallel SAMR applications by dynamically selecting and configuring partitioners at runtime to match the current state of the application. The structure of the adaptive grid hierarchy is used to characterize its current state and define its current partitioning requirements. These requirements are then used by the adaptive partitioner to appropriately select and configure the partitioning technique that best satisfies them.
The objective of this feasibility study is to demonstrate the benefits of adaptive application-sensitive partitioning based on a manual analysis of the SAMR application. The adaptation policies are formulated manually using a SAMR simulator, and these policies are used to adaptively select and invoke the most suitable partitioner from among the available SFC, G-MISP+SP, and pBD-ISP domain-based partitioners. The SAMR application used in this evaluation is a rudimentary version of the RM3D compressible turbulence kernel (RM3Db) solving the Richtmyer-Meshkov instability in 3-D. The Richtmyer-Meshkov instability is a fingering instability that occurs at a material interface accelerated by a shock wave. It plays an important role in studies of supernovae and inertial confinement fusion.
Characterizing Application State
Application characterization consists of two steps. First, the adaptive behavior of the application is captured in an adaptation trace generated using a single-processor run. The adaptation trace contains snapshots of the SAMR grid hierarchy at each regrid step. Second, the trace is analyzed using the octant approach (in this case, manually) and the adaptive partitioning strategy is defined. The application is executed for 800 coarse-level time steps and the trace consists of over 200 snapshots. A selection of these snapshots is shown in the figure below and illustrates the dynamics of the RM3Db application. RM3Db starts out with a scattered adaptation, lower activity dynamics, and more computation, placing it initially in octant IV. Either G-MISP+SP or SFC is the most appropriate partitioner for this octant. As the application evolves, its adaptation patterns cause it to move between octants and may require different partitioning schemes. At regrid step 106, the RM3Db application has high communication requirements, high dynamics, and a scattered adaptation, placing it in octant VI. pBD-ISP is the partitioner of choice in this octant. At the end of the simulation, the application adaptations are localized, with increased computation and lower activity, placing the application in octant III, where G-MISP+SP is the appropriate partitioner.

Formulating Partitioner Adaptation Policy using SAMR Simulator
Once the RM3Db application trace containing snapshots of the SAMR grid hierarchy is obtained, a SAMR simulator is used to analyze the behavior of a partitioner for the application on 64 processors. In the discussion below, we only present the simulator analysis for the SFC and pBD-ISP partitioners; as the G-MISP+SP scheme has characteristics similar to the SFC partitioner, its analysis is not presented. The SAMR simulator uses the trace to determine the performance of the partitioners at each regrid step in terms of the primary components of the quality metric (communication, data movement, load imbalance). Note that the other two components, partitioning overhead and partitioning time, are intrinsic characteristics of a partitioner; the partitioning time is directly dependent on the three primary components that determine the runtime performance of the partitioner. The three figures below plot the maximum values of the total communication, data movement, and load imbalance components of the quality metric, respectively, for the SFC and pBD-ISP partitioners on 64 processors. Analyzing these graphs, five operational zones can be identified.



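As an aside, the load-imbalance component lends itself to a short computation. The sketch below assumes the component is measured as the relative excess of the most loaded processor over the average load, which is one common definition and not necessarily the exact one used by the simulator.

#include <algorithm>
#include <numeric>
#include <vector>

// Load imbalance for one regrid step, computed from per-processor
// workloads as the relative excess of the most loaded processor over
// the average (the precise definition in the quality metric is assumed).
double loadImbalance(const std::vector<double>& procLoads) {
    double maxLoad = *std::max_element(procLoads.begin(), procLoads.end());
    double avgLoad = std::accumulate(procLoads.begin(), procLoads.end(), 0.0)
                     / procLoads.size();
    return (maxLoad - avgLoad) / avgLoad;   // 0.0 means perfect balance
}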
An application-sensitive partitioner adaptation policy for RM3Db (denoted ``adaptive'' in this example) can be formulated from this simulator analysis, as shown in the table below. The regrid step is used as the switching point where a new partitioner is selected, configured, and invoked.
Experimental Evaluation of the Adaptive Partitioner
The adaptive partitioning policy, manually formulated above, is experimentally evaluated on the NPACI IBM SP2 ``Blue Horizon'' at the San Diego Supercomputer Center, using the RM3Db application with a base grid of size 128*32*32, 3 levels of factor 2 space-time refinements, and regridding performed every 4 time-steps. The experiments consisted of measuring application execution times for different processor configurations using the meta-partitioning strategies. The partitioner as well as the partitioning parameters were switched on-the-fly during execution, using the regrid step as the switching point. Run-times from single-partitioner runs were also measured. Run-times for experiments on 32 and 64 processors are shown in the following graphs. The basic RM3Db version serves as a ``feasibility study'' example and is significantly different from the original RM3D kernel.

The above results demonstrate that an adaptive application-sensitive partitioning policy that dynamically switches partitioners (and associated partitioning parameters) at runtime improves application performance and results in reduced execution times. For 64 processors, the improvement is 27.2% over the slowest partitioner. In the adaptive partitioner case, G-MISP+SP improves the load balance when the application is computationally dominated, while pBD-ISP reduces communication and data-migration overheads. This ``feasibility study'' example demonstrates the potential benefits of adaptive application-sensitive partitioning and motivates the development of the ARMaDA autonomic partitioning framework.
Autonomic Application Aware Runtime Management
ARMaDA is an autonomic partitioning framework for SAMR applications that dynamically selects and configures partitioning algorithms at runtime to optimize overall application performance. The ARMaDA framework has three components: an application state monitoring and characterization component, a partitioner selection component with a policy engine, and a runtime adaptation component. The partitioners used include a selection from software tools such as GrACE and Vampire. Note that ARMaDA does not require any a priori knowledge about the behavior of the application or its adaptation patterns.
Automated Application State Monitoring and Characterization
The current state of a SAMR application and its partitioning requirements are determined using the structure of the SAMR grid hierarchy, which is expressed as sets of bounding boxes assigned to the different processors at various levels of refinement. These boxes are used to inexpensively compute the computation/communication requirements, application dynamics, and the nature of the adaptation using simple geometric operations as follows.
Computation/Communication Requirements: The computation-to-communication ratio (``CCratio'') determines whether the current application state is computationally intensive or communication-dominated. The ratio of the total volume to the total surface area of the bounding boxes determines CCratio for the current state of the application.
Application Dynamics: The application dynamics (``Dynamics'') estimate the rate of change of the application refinement patterns and are determined by comparing the current application state with its previous state. To compute the Dynamics metric, the overlap between the current set of bounding boxes and the previous set of boxes at each level is computed.
Nature of Adaptation: The nature of the adaptation (``Adapt'') captures the adaptation pattern, i.e., whether the refinements are scattered or localized. A simple, approximate algorithm determines the nature of adaptation based on the number and geometric arrangement of refinement regions within the overall domain under consideration.
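The following sketch illustrates these geometric computations for 3-D boxes. The box representation and the exact normalizations are assumptions made for this sketch; the framework operates on the grid hierarchy's own bounding-box lists.

#include <algorithm>
#include <vector>

// Axis-aligned bounding box in index space (lo inclusive, hi exclusive).
struct Box { int lo[3], hi[3]; };

long volume(const Box& b) {
    long v = 1;
    for (int d = 0; d < 3; ++d) v *= b.hi[d] - b.lo[d];
    return v;
}

long surfaceArea(const Box& b) {
    long x = b.hi[0] - b.lo[0], y = b.hi[1] - b.lo[1], z = b.hi[2] - b.lo[2];
    return 2 * (x * y + y * z + z * x);
}

// CCratio: total volume over total surface area of the current boxes.
double ccRatio(const std::vector<Box>& boxes) {
    long v = 0, s = 0;
    for (const Box& b : boxes) { v += volume(b); s += surfaceArea(b); }
    return double(v) / double(s);
}

long overlapVolume(const Box& a, const Box& b) {
    long v = 1;
    for (int d = 0; d < 3; ++d) {
        int lo = std::max(a.lo[d], b.lo[d]), hi = std::min(a.hi[d], b.hi[d]);
        if (hi <= lo) return 0;
        v *= hi - lo;
    }
    return v;
}

// Dynamics: fraction of the current volume that did not overlap the
// previous regrid step's boxes (pairwise overlaps are summed, so this
// is an approximation when boxes within a set overlap each other).
double dynamics(const std::vector<Box>& cur, const std::vector<Box>& prev) {
    long shared = 0, total = 0;
    for (const Box& c : cur) {
        total += volume(c);
        for (const Box& p : prev) shared += overlapVolume(c, p);
    }
    return 1.0 - double(shared) / double(total);   // higher = more change
}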

The three metrics described above can now be used to effectively estimate the state of the SAMR application at runtime. To avoid thrashing due to very frequent application state changes, the ARMaDA framework maintains a history of the application state by storing the structure of the grid hierarchy for the two preceding regrid steps. Three normalized ratios are computed corresponding to the three metrics, viz., the computation/communication ratio ``Cratio'', the application dynamics ratio ``Dratio'', and the adaptation ratio ``Aratio''.
Policy-based Partitioner Selection
The three normalized ratios (Cratio, Dratio, and Aratio) computed by the state characterization component are combined to map the application state to a specific octant. First, user and system defined thresholds are used to assign a bit to each metric. Application-dependent weights fine-tune the sensitivity of the metric to closely match the needs of the dynamic application. The three bits are then used to map the application to the application state octant. The appropriate partitioner is then selected from the set of partitioners that are mapped to the octant and satisfy the partitioning requirements associated with that octant.
Let Mr denote any one of the three metrics computed from the state of the grid hierarchy at regrid step r. The normalized metric ratio is then computed as follows. Let Mratio and Mbit represent the normalized ratio and the octant-bit value for a particular metric M, and let lowMwt and highMwt denote the application metric weights for the low and high application state thresholds (LOW_THRESH and HIGH_THRESH) respectively. The metric ratio to octant bit mapping is given by

Three octant bits, ``Cbit'', ``Dbit'', and ``Abit'', corresponding to the three metric ratios are computed in this way. A low Cbit (Cbit = 0) represents more communication, while a high Cbit (Cbit = 1) indicates greater computation. A low Dbit (Dbit = 0) indicates greater activity, whereas a high Dbit (Dbit = 1) represents lesser application dynamics. A low Abit (Abit = 0) denotes more localized adaptation, while a high Abit (Abit = 1) indicates that the nature of the adaptation is scattered. These bits are then used to identify the ARMaDA characterized state, which is then mapped to an application state octant and associated partitioners using the policies shown in the table below.

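A sketch of the bit computation and octant mapping follows. The weighting and the hysteresis rule (retain the previous bit between the thresholds, to avoid thrashing) are assumptions consistent with the description above, not the framework's exact policy; the octant-to-partitioner pairs in the comments are the three cases discussed in the feasibility study.

// Map one normalized metric ratio to an octant bit, using the
// application weights and the low/high state thresholds.
int metricBit(double ratio, double lowWt, double highWt,
              double lowThresh, double highThresh, int prevBit) {
    if (ratio * lowWt < lowThresh)   return 0;
    if (ratio * highWt > highThresh) return 1;
    return prevBit;   // assumed hysteresis between the thresholds
}

// Combine Cbit, Dbit, and Abit into an octant index (0-7 here; the
// project numbers the octants I-VIII).
// Cases from the study: (Cbit,Dbit,Abit) = (0,0,1) is octant VI
// (communication-dominated, dynamic, scattered) -> pBD-ISP;
// (1,1,1) is octant IV -> G-MISP+SP or SFC;
// (1,1,0) is octant III -> G-MISP+SP.
int octant(int cbit, int dbit, int abit) {
    return (cbit << 2) | (dbit << 1) | abit;
}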
Adaptive Application-sensitive Partitioning
The ARMaDA adaptation component configures the selected partitioner with appropriate partitioning parameters (such as partitioning granularity) and invokes it to partition and load balance the SAMR grid hierarchy. A common interface is defined for the available partitioners, allowing them to be invoked uniformly. The granularity is based on the requirements of the octant, though it may be overridden by a user-defined value. A key concern in adaptive partitioning is ensuring that the overheads of characterizing the application state and switching partitioners (if necessary) at runtime do not offset the benefits of the adaptation. ARMaDA uses efficient and inexpensive mechanisms based on the union and intersection of the structured geometry of the grid components to characterize application state and compute the required metric ratios. Furthermore, ARMaDA provides optimizations and control parameters that can be used to customize the sensitivity, quality, and overheads of application monitoring and partitioner adaptation.
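A common interface of the kind described might look as follows; this is a hypothetical rendering for illustration, not ARMaDA's actual class hierarchy.

#include <vector>

// A composite grid unit: a region of the domain together with all of
// its refinement levels (payload details elided in this sketch).
struct GridUnit { int level; long work; };
using GridUnitList = std::vector<GridUnit>;

// Uniform invocation interface for the available partitioners.
class Partitioner {
public:
    virtual ~Partitioner() = default;
    virtual void setGranularity(int minBlockSize) = 0;
    // Distribute the composite grid unit list across nprocs processors,
    // returning one local GUL per processor.
    virtual std::vector<GridUnitList> partition(const GridUnitList& gul,
                                                int nprocs) = 0;
};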
Evaluation of ARMaDA Framework
The experimental evaluation of the ARMaDA framework is performed using various SAMR application kernels. The experiments consist of measuring overall application execution times for different partitioner configurations using the ARMaDA framework. The partitioners used in the evaluation include SFC, G-MISP+SP, pBD-ISP, and adaptive combinations of these partitioners determined at runtime based on application state. With the exception of the partitioning strategy and associated partitioning granularity, all application-specific and refinement-specific parameters are kept constant. In the adaptive partitioning experiments using ARMaDA, an initial partitioner is selected by the user along with other partitioning parameters and thresholds. The initial partitioner is used for the initial distribution of the SAMR grid hierarchy, while the partitioners used in subsequent distributions are autonomically determined by ARMaDA.
TportAMR-2D on ``Discover'': ``Discover'' is a 16-node Linux Beowulf cluster at Rutgers University. The Transport AMR 2D (TportAMR-2D) application is a benchmark kernel solving the transport equation and is primarily communication-dominated with high adaptation overheads. This evaluation is performed on 8 processors of Discover using an application base grid of size 64*64 and 3 levels of factor 2 space-time refinements. Regridding is performed every 4 time-steps at each level and the application executes for 40 coarse-level time steps. The application execution times are listed in the table below. The TportAMR-2D application used in this experiment is computationally inexpensive but requires considerable communication and data movement. Furthermore, the experiment uses a small domain and few iterations. Since the pBD-ISP partitioner is particularly suited to strongly communication-dominated application states and reduces communication and data migration costs, it is expected to perform better than the other partitioners. The results show that the ARMaDA framework with pBD-ISP as the initial partitioner performs best, with performance similar to the standalone pBD-ISP case. In this evaluation, the overheads associated with the ARMaDA framework range between 40 and 100 milliseconds for an entire run of approximately 350 seconds. Though thrashing is detected in the case of G-MISP+SP, with switching between the G-MISP+SP and pBD-ISP partitioners, its effects are not significant due to the ARMaDA optimizations described earlier.

VectorWave-2D on ``Frea'': ``Frea'' is a 64-node Linux Beowulf cluster at Rutgers University. The VectorWave-2D application is a coupled set of partial differential equations and forms a part of the Cactus 2-D numerical relativity toolkit, solving the Einstein and gravitational wave equations. This evaluation is performed on 32 processors of Frea using an application base grid of size 128*128 and 3 levels of factor 2 space-time refinements. Regridding is performed every 4 time-steps at each level and the application runs for 60 coarse-level time steps. The application execution times for the individual partitioners and for ARMaDA with SFC start are presented in the table above. The VectorWave-2D application is primarily computation-dominated, requiring good load balance and reduced communication and data migration costs. The SFC and pBD-ISP partitioners optimize the communication and data migration metrics while G-MISP+SP gives good load balance. In this evaluation, the ARMaDA partitioner with SFC start improves performance by 26.19% over the slowest partitioner. The ARMaDA framework overhead for state sensing and adaptation in this case is 0.0616% of the total execution time, which is negligible.
RM2D & RM3D on ``Blue Horizon'': The NPACI IBM SP2 ``Blue Horizon'' is a Teraflop-scale Power3-based clustered SMP system at the San Diego Supercomputer Center. RM2D and RM3D are 2-D and 3-D versions respectively of a compressible turbulence kernel solving the Richtmyer-Meshkov instability. This evaluation is performed on 64 processors of Blue Horizon. The RM2D evaluation uses a base grid of size 128*32 and 60 coarse-level time steps, while RM3D uses a base grid of size 128*32*32 and 100 coarse-level time steps. Both evaluations use 3 levels of factor 2 space-time refinements and regridding is performed every 4 time-steps at each level. The application execution times for the individual partitioners and ARMaDA for RM2D and RM3D are listed in the tables below. Both RM2D and RM3D start out computation-dominated with low activity dynamics and a more scattered adaptation, placing them in octant IV. As the applications evolve, their communication requirements become more significant. Moreover, the applications exhibit different dynamics and adaptations at different stages of their execution. The pBD-ISP scheme improves the communication and data migration metrics and is seen to perform better than the SFC and G-MISP+SP partitioners. The ARMaDA adaptive partitioner starts out with either the G-MISP+SP or SFC partitioner, as these partitioners are better suited for the initial stages of the application. The overheads associated with the ARMaDA framework are low. For the RM2D and RM3D evaluations, the overheads are 0.415 seconds and 1.85 seconds respectively, and are negligible as compared to the overall application execution times. These experimental results demonstrate that autonomic partitioning using the ARMaDA framework reduces execution time and improves application performance. In the case of the RM2D application, improvements due to ARMaDA are 4.66%, 11.32%, and 27.88% over the pBD-ISP, G-MISP+SP, and SFC partitioners respectively. In the case of the RM3D application, the improvements are 22.17% over the pBD-ISP partitioner and 38.28% over the SFC scheme.

Enabling Scalable SAMR Implementations
Objective
Parallel implementations of SAMR methods lead to interesting challenges in dynamic resource allocation, data distribution, load balancing, and runtime management. While balancing these conflicting requirements is necessary for achieving large-scale scalability of SAMR applications, identifying appropriate trade-offs is non-trivial. This is especially true for SAMR applications with highly dynamic and localized features that require high SAMR efficiencies, as they lead to deep hierarchies and small regions of refinement with unfavorable computation-to-communication ratios. The primary objective of this research is to address the scalability requirements of such applications on large numbers of processors (up to 1024 processors). We present a SAMR runtime engine consisting of two components: (1) a hierarchical partitioning framework consisting of a stack of partitioners that can address the space-time heterogeneity and dynamism of the SAMR grid hierarchy. The partitioners build on a locality-preserving space-filling-curve-based representation of the SAMR grid hierarchy and enhance it based on localized requirements to minimize synchronization costs within a level (level-based partitioning), balance load (bin-packing based partitioning), or reduce partitioning costs (greedy partitioner); and (2) a SAMR communication substrate that is sensitive to the implementations of the MPI non-blocking communication primitives and is optimized to reduce communication overheads.
The SAMR runtime engine, as well as its individual components, is experimentally evaluated on the IBM SP2 supercomputer using a 3-D Richtmyer-Meshkov (RM3D) compressible turbulence kernel. RM3D is characterized by highly localized solution features resulting in small patches and deep application grid hierarchies (i.e., a small region is very highly refined in space and time). As a result, the application has increasing computational workloads and greater communication/synchronization requirements at higher refinement levels, with unfavorable computation-to-communication ratios. The resulting characteristics significantly limit RM3D scalability on large numbers of processors. The SAMR runtime engine addresses these runtime challenges in synchronization, load balance, locality, and communication to realize scalable RM3D implementations.
SAMR Hierarchical Partitioning Framework
The hierarchical partitioning framework consists of a stack of partitioners that can manage the space-time heterogeneity and dynamism of the SAMR grid hierarchy. These dynamic partitioning algorithms are based on a core Composite Grid Distribution Strategy (CGDS) belonging to the GrACE (Grid Adaptive Computational Engine) SAMR infrastructure. CGDS uses space filling curves (SFCs) and partitions the entire SAMR domain into subdomains such that each subdomain keeps all refinement levels in the subdomain as a single composite grid unit. Thus, all inter-level communication is local to a subdomain and the inter-level communication time is greatly reduced. The resulting composite grid unit list (GUL) for the overall domain must now be partitioned and balanced across processors. Once the grid hierarchy index space is mapped using the SFC+CGDS scheme, higher-level partitioning techniques can be applied in a hierarchical manner using the hierarchical partitioning algorithm (HPA). In HPA, processor groups are defined based on the dynamic grid hierarchy structure and correspond to regions of the overall computational domain. The top processor group partitions the global GUL obtained initially and assigns portions to each processor subgroup in a hierarchical manner. In this way, HPA further localizes communication to subgroups, reduces global communication and synchronization costs, and enables concurrent communication. Within each processor subgroup, higher-level partitioning strategies are then applied based on the local requirements of the SAMR grid hierarchy subdomain. The objective could be to minimize synchronization costs within a level using the level-based partitioning algorithm (LPA), to efficiently balance load using the bin-packing based partitioning algorithm (BPA), to reduce partitioning costs using the greedy partitioning algorithm (GPA), or combinations of the above.
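The hierarchical splitting idea can be sketched as a recursive division of the SFC-ordered workload list among processor groups. Binary splitting and load-proportional targets are simplifying assumptions made here; in HPA, the group structure follows the dynamic grid hierarchy.

#include <vector>

// Recursively split an SFC-ordered workload list [lo, hi) between two
// processor subgroups, each receiving a share of the total load
// proportional to its size; owner[i] records the processor assigned to
// unit i. Illustrative sketch, not GrACE code.
void hpa(const std::vector<long>& work, int lo, int hi,
         int firstProc, int nprocs, std::vector<int>& owner) {
    if (nprocs == 1) {
        for (int i = lo; i < hi; ++i) owner[i] = firstProc;
        return;
    }
    long total = 0;
    for (int i = lo; i < hi; ++i) total += work[i];
    int leftProcs = nprocs / 2;
    long target = total * leftProcs / nprocs, acc = 0;
    int split = lo;
    while (split < hi && acc + work[split] <= target) acc += work[split++];
    hpa(work, lo, split, firstProc, leftProcs, owner);
    hpa(work, split, hi, firstProc + leftProcs, nprocs - leftProcs, owner);
}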
Greedy Partitioning Algorithm
GrACE uses a default greedy partitioning algorithm to partition the global GUL and produce a local GUL for each processor. The GPA scheme performs a rapid partitioning of the SAMR grid hierarchy, scanning the global GUL only once while attempting to distribute the load equally among all processors. GPA helps reduce partitioning costs and works quite well for relatively homogeneous computational domains with few levels of fairly uniform refinement, and for small-to-medium scale application runs. However, due to the greedy nature of the algorithm, GPA tends to result in heavy loading of later processors.
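A minimal sketch of such a single-scan greedy distribution follows (illustrative only; the actual GrACE implementation also honors patch constraints).

#include <cstddef>
#include <vector>

// One pass over the SFC-ordered global GUL: fill the current processor
// up to the average load, then move on. The last processors absorb any
// residue, which is the source of the heavy tail noted above.
std::vector<int> gpa(const std::vector<long>& work, int nprocs) {
    long total = 0;
    for (long w : work) total += w;
    long target = (total + nprocs - 1) / nprocs;   // per-processor goal
    std::vector<int> owner(work.size());
    long acc = 0;
    int p = 0;
    for (std::size_t i = 0; i < work.size(); ++i) {
        if (acc >= target && p < nprocs - 1) { acc = 0; ++p; }
        owner[i] = p;
        acc += work[i];
    }
    return owner;
}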
Level-based Partitioning Algorithm
The computational workload for a certain patch of the SAMR application is tightly coupled to the refinement level at which the patch exists. The computational workload at a finer level is considerably greater than that at coarser levels. The level-based partitioning algorithm (LPA) attempts to simultaneously balance load and minimize synchronization cost. LPA essentially preprocesses the global application computational units represented by a global GUL, disassembles them based on their refinement levels, and feeds the resulting homogeneous units at each refinement level to GPA (or any other partitioning/load-balancing scheme). The GPA scheme then partitions this list to balance the workload. Due to the preprocessing, the load on each refinement level is also balanced. LPA benefits from the SFC-based technique by maintaining parent-child relationships throughout the composite grid and localizing inter-level communications, while simultaneously balancing the load at each refinement level, which reduces the synchronization cost.
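A sketch of the level-based preprocessing, reusing the greedy pass above as the per-level balancer (the Unit type is illustrative):

#include <map>
#include <vector>

struct Unit { int level; long work; };

// Defined in the GPA sketch above.
std::vector<int> gpa(const std::vector<long>& work, int nprocs);

// Disassemble the global GUL by refinement level and balance each
// homogeneous per-level list independently, so load is balanced at
// every level and synchronization waits within a level are reduced.
std::map<int, std::vector<int>> lpa(const std::vector<Unit>& gul,
                                    int nprocs) {
    std::map<int, std::vector<long>> perLevel;
    for (const Unit& u : gul) perLevel[u.level].push_back(u.work);
    std::map<int, std::vector<int>> owners;
    for (const auto& kv : perLevel)
        owners[kv.first] = gpa(kv.second, nprocs);
    return owners;
}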
Bin-packing based Partitioning Algorithm
The bin-packing based partitioning algorithm (BPA) attempts to improve the load balance during the SAMR partitioning phase, where the computational workload associated with a GUL at different refinement levels of the SAMR hierarchy is distributed among available processors. The distribution is performed under constraints such as the minimum patch size and the aspect ratio. BPA distributes the workload among processors as long as the processor work threshold is not exceeded. Patches with a workload larger than the threshold limit are split into smaller patches. Unallocated work is first distributed using a ``best-fit'' approach. If this fails, the ``most-free'' approach is adopted until all the work is allocated. BPA allows the user to set a ``tolerance'' value that determines the acceptable workload imbalance for the SAMR application. In the case of BPA, the load imbalance, if any, is low, since it is bounded by the tolerance threshold. Due to the underlying bin-packing algorithm, the BPA technique provides overall better load balance for the grid hierarchy partitions among processors as compared to the GPA scheme. However, a large number of patches may be created as a result of multiple patch divisions. Also, the load distribution strategy in BPA can result in multiple scans of the grid unit list, which marginally increases the partitioning overheads.
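The bin-packing distribution can be sketched as follows. Patch splitting and the minimum patch size and aspect ratio constraints are elided, so this is an approximation of the scheme described above, not the actual BPA code.

#include <algorithm>
#include <cstddef>
#include <vector>

// Place each unit on the fullest processor that still fits under the
// work threshold ("best-fit"); if none fits, fall back to the
// least-loaded processor ("most-free").
std::vector<int> bpa(const std::vector<long>& work, int nprocs,
                     double tolerance) {
    long total = 0;
    for (long w : work) total += w;
    double threshold = (double(total) / nprocs) * (1.0 + tolerance);
    std::vector<long> load(nprocs, 0);
    std::vector<int> owner(work.size());
    for (std::size_t i = 0; i < work.size(); ++i) {
        int best = -1;
        for (int p = 0; p < nprocs; ++p)
            if (load[p] + work[i] <= threshold &&
                (best < 0 || load[p] > load[best]))
                best = p;                               // best-fit
        if (best < 0)
            best = int(std::min_element(load.begin(), load.end())
                       - load.begin());                 // most-free
        owner[i] = best;
        load[best] += work[i];
    }
    return owner;
}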
Optimizing SAMR Communication Substrate
Due to irregular load distributions and communication requirements across different levels of the grid hierarchy, parallel SAMR implementations make extensive use of MPI non-blocking primitives to reduce synchronization overheads. Analysis of the IBM SP2 MPI implementation demonstrates that the time spent waiting for processes to synchronize, rather than network latency, is the major source of communication overhead. This problem is particularly significant in applications where communications are not completely synchronized and there is some load imbalance, which is typical of SAMR applications. Increasing the MPI eager limit on the IBM SP2 is not a scalable solution, since increasing the value of this environment variable increases the MPI memory usage per processor, thus reducing the amount of overall memory available to the SAMR application. A more scalable strategy is to address this at the application level by appropriately positioning the MPI_Isend, MPI_Irecv, MPI_Wait on send side (Ws), and MPI_Wait on receive side (Wr) calls. The basic optimization strategy consists of delaying Ws until after Wr, as shown in the figures below.

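The reordering can be sketched for a single pairwise exchange as follows; buffer setup and the message matching of a full SAMR synchronization are elided, so this is a minimal illustration rather than the actual substrate code.

#include <mpi.h>

// Ghost-region exchange with the "delayed waits" ordering: the receive
// is posted first, and the wait on the send (Ws) is deferred until
// after the wait on the receive (Wr).
void exchange(double* sendbuf, double* recvbuf, int count, int peer,
              MPI_Comm comm) {
    MPI_Request sreq, rreq;
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, comm, &rreq);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, comm, &sreq);
    /* ... overlap with local computation ... */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);   // Wr: consume incoming data first
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);   // Ws: delayed until after Wr
}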
Evaluation: LPA and BPA techniques
The RM3D evaluation for LPA and BPA is performed on 64 processors of Blue Horizon using the GrACE infrastructure. RM3D uses a base grid of 128*32*32 and 3 levels of factor 2 space-time refinements with regridding performed every 8 time-steps at each level. The RM3D application executes for 100 iterations, and the total execution time, synchronization (Sync) time, recompose time, average maximum load imbalance, and number of boxes are measured for each of the following three configurations: (1) the default GrACE partitioner (GPA), (2) the LPA scheme using GrACE, and (3) the combined LPA+BPA technique using GrACE. The figure below shows the effect of the LPA and BPA partitioning optimizations on RM3D application performance. Note that the values plotted in the figure are normalized against the corresponding maximum value. The results demonstrate that the LPA scheme helps to reduce application synchronization time while the BPA technique provides better load balance. The combined approach reduces overall execution time and results in improved performance.

Evaluation: ``Delayed waits'' optimization
The evaluation for the ``delayed waits'' optimization consists of measuring the message passing and application execution times for the RM3D application with and without the optimization in the SAMR communication substrate. Except for the MPI non-blocking optimization, all other application-specific and refinement-specific parameters are kept constant. RM3D uses a base grid of size 256*64*64 and 3 levels of factor 2 space-time refinements with regridding performed every 8 time-steps at each level, and the application executes for 100 iterations. The figure below compares the execution times and communication times for the two configurations on 64, 128, and 256 processors. The ``delayed waits'' optimization helps to reduce overall execution time, primarily due to the decrease in message passing time. In this evaluation, the reduction in communication time is 44.37% on average and results in improved application performance.

Evaluation: Overall RM3D Scalability
The scalability of the RM3D application execution time, computation time, and synchronization time for 256, 512, and 1024 processors is illustrated in the figures below. Clearly, the LPA partitioning, bin-packing based load-balancing, and MPI non-blocking communication optimization techniques enable scalable RM3D runs with multiple refinement levels on large numbers of processors. Though the RM3D computation time scales quite well, the scalability of the synchronization time is limited by the highly communication-dominated application behavior and unfavorable computation-to-communication ratios. Also, note that a ``unigrid'' implementation of the RM3D application for the same experimental settings would use a domain of size 1024*256*256 with approximately 67 million computational grid points and would require an extremely large amount of memory. Thus, unigrid implementations are not even feasible for large-scale execution of the RM3D application.


To evaluate the benefits of using SAMR, another experiment compares the RM3D application performance on 512 processors of Blue Horizon by varying the coarse base grid size and the number of refinement levels; all other application-specific and refinement-specific parameters are kept constant. Note that for every step on the coarse level (level 0), two steps are taken at the first refinement level (level 1), four steps on level 2, and so on. The evaluation is performed for two configurations with the same resolution (8000 steps) at the finest level of the SAMR grid hierarchy. The RM3D domain size at the finest level is 1024*256*256 and the evaluation uses the LPA+BPA technique with a minimum block size of 16. The first configuration (Run 1) uses a coarse grid of size 128*32*32 with 4 levels of factor 2 space-time refinements and executes 1000 coarse-level iterations, which correspond to 8000 steps at the finest level. The second configuration (Run 2) uses a 64*16*16 base grid and 5 refinement levels, and runs for 500 coarse-level iterations to achieve the same resolution at the finest level. As shown in the table above, Run 2 provides better performance (reduced execution and compute times) than Run 1. The improvement in RM3D application performance for the 5-level hierarchy is due to the efficiency of the basic Berger-Oliger SAMR algorithm.
Publications:
1. Characterization of Domain-based Partitioners for Parallel SAMR Applications [PDF]
Johan Steensland, Sumir Chandra, Manish Parashar, and Michael Thune.
IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Las Vegas, USA, pp. 425-430, November 2000.
2. An Evaluation of Partitioners for Parallel SAMR Applications [PDF]
Sumir Chandra and Manish Parashar.
Accepted for publication in the proceedings of the European Conference on Parallel Computing (Euro-Par '01), Manchester, UK, August 2001.
3. An Application-centric Characterization of Domain-based Partitioners for Parallel Adaptive Mesh Refinement [PDF]
Johan Steensland, Sumir Chandra, and Manish Parashar.
Submitted to IEEE Transactions on Parallel and Distributed Systems, February 2001.
4. An Experimental Study of Adaptive Application Sensitive Partitioning Strategies for SAMR Applications [PDF]
Sumir Chandra, Johan Steensland, and Manish Parashar.
Submitted to IEEE/ACM Supercomputing, May 2001.
5. System Sensitive Runtime Management of Adaptive Applications [PDF]
Shweta Sinha and Manish Parashar.
Accepted at the Tenth IEEE Heterogeneous Computing Workshop (HCW), San Francisco, USA, 2001.
