
An adaptive computational engine infrastructure forms an effective basis for developing adaptive methods for the solution of systems of partial differential equations across varied, complex scientific application domains, including Grand Challenge problems. Parallel/distributed implementations of these adaptive methods offer the potential for the accurate solution of realistic models of important physical systems. This becomes increasingly important as scientific simulations play a critical role in all areas of science and engineering: accurate simulation of these expensive scientific problems is necessary to validate and justify the investment made in them.

The need for adaptive computational engines is motivated by two observations: adaptive methods will be used for the solution of almost all very large-scale science and engineering models, and these adaptive methods will be executed on very large-scale, heterogeneous parallel execution environments.
GrACE (Grid Adaptive Computational Engine) is an adaptive computational and data-management engine for enabling distributed adaptive mesh-refinement computations on structured grids. It builds on the Distributed Adaptive Grid Hierarchy (DAGH) infrastructure and extends it to provide a virtual, semantically specialized, distributed (and dynamic) shared memory infrastructure with multifaceted objects specialized to distributed adaptive grid hierarchies and grid functions. GrACE has been deployed to support applications in several domains, including DoE ASCI codes at the California Institute of Technology and the University of Chicago, and the NASA Neutron Star Grand Challenge.
GrACE has been developed at the Texas Institute for Computational and Applied Mathematics (TICAM) at the University of Texas at Austin and at The Applied Software Systems Laboratory (TASSL) in the Center for Advanced Information Processing (CAIP) at Rutgers University.
Overview
Dynamically adaptive methods for the solution of partial differential equations that employ locally optimal approximations can yield highly advantageous ratios for cost/accuracy when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions of high local solution error. Parallel/distributed implementations of these adaptive methods offer the potential for the accurate solution of realistic models of important physical systems. These implementations, however, lead to interesting challenges in dynamic resource allocation, data-distribution and load balancing, communications and coordination, and resource management.

Critical among these is the partitioning problem where load balance, communication, data migration, and other overhead costs such as grid aspect ratios need to be optimized simultaneously. Hence, the key requirements for a decomposition scheme for partitioning an adaptive grid hierarchy can be summarized as:
* expose available data parallelism,
* minimize communication overhead,
* balance overall load distribution,
* enable dynamic load redistribution with minimum overheads.
Another important requirement while partitioning adaptive grid hierarchies is the maintenance of logical locality, both across different levels of the hierarchy under expansion and contraction of the adaptive grid structure, and within partitions of grids at all levels when they are decomposed and mapped across processors. The former enables efficient computational access to the grids while the latter minimizes the total communication and synchronization overheads. Furthermore, application adaptation results in application grids being created, moved, and deleted on the fly, making it necessary to efficiently re-partition the hierarchy so that it continues to meet these goals.
Moving the adaptive applications to dynamic and heterogeneous networked computing environments introduces a new level of complexity. These environments require selecting and configuring application components based on available resources. However, the complexity and heterogeneity of the environment make the selection of a "best" match between system resources, application algorithms, problem decompositions, mappings and load distributions, communication mechanisms, etc., non-trivial. System dynamics coupled with application adaptation makes application configuration and run-time management a significant challenge.
GrACE addresses these issues efficiently using a separation-of-concerns approach. The Grid Adaptive Computational Engine consists of two components, described in the sections that follow.
Adaptive Mesh Refinement (AMR)
Dynamically adaptive numerical techniques for solving differential equations provide a means for concentrating computational effort on the most demanding regions. In the case of hierarchical AMR methods, this is achieved by tracking regions in the domain that require additional resolution and dynamically overlaying finer grids over these regions. Techniques based on AMR start with a base coarse grid with the lowest acceptable resolution that covers the entire computational domain. As the solution progresses, regions in the domain with high solution error that require additional resolution are flagged, and finer grids are overlaid on the flagged regions of the coarse grid. Refinement proceeds recursively, so that regions on the finer grid requiring higher resolution are similarly flagged and even finer grids are overlaid on these regions. The resulting grid structure is a dynamic adaptive grid hierarchy. The figure below shows the adaptive grid hierarchy for the classic Berger AMR formulation.


The Berger-Oliger Algorithm [(1984) Journal of Computational Physics, 53, 484-512]
AMR Terminology
Grid Structures
AMR 2-D Example
AMR 3-D Example

Integration Algorithm
The adaptive finite-difference (AFD) integration algorithm defines the order in which different levels of the grid hierarchy are integrated, the interactions between overlaying component grids at different levels, and the criterion and method for grid refinement. Correspondingly, it is composed of three key components: (1) Time Integration, (2) Error Estimation & Regridding, and (3) Intergrid Operations.
Time Integration: Time integration is performed on each component grid using a specified finite-difference operator. Each component grid may have its own operator and can be integrated independently (except for the determination of boundary values). The order of integration along the grid hierarchy is defined recursively: before advancing component grids at a particular level of refinement in time, all component grids at higher levels of refinement must be integrated to the current time of component grids at that level. That is, before stepping component grids at level l, i.e. {G_l^i}, from time T to T + Δt_l, all component grids at levels > l must be integrated to time T. (A sketch of this recursive ordering follows the intergrid operations below.)
Error Estimation & Regridding: The error estimation and regridding component of the integration algorithm performs three steps: (1) flagging regions needing refinement based on error estimation, (2) clustering flagged points, and (3) generating the refined grids. The result may be the creation of a new level of refinement or of component grids at existing levels, and/or the deletion of existing component grids. Grid generation in step 3 is performed so as to maintain proper nesting along the grid hierarchy.
Intergrid Operations: Intergrid operations are used to communicate solution values along the adaptive grid hierarchy. In the case of the Berger-Oliger AFD scheme, the following intergrid operations are defined:
Initialization of refined component grids: Refined component grid initialization may be performed using the interior values of an intersecting component grid at the same level, if one exists, or by prolongating values from the underlying coarser component grid.
Coarse grid update: Underlying coarse component grids are updated using the values on a nested finer component grid each time the two are integrated to the same time. This update or restriction intergrid operation may be performed by direct injection or using a defined averaging/interpolation scheme.
Averaging: Averaging is necessary when two component grids at the same level of refinement overlap, and is used to update the coarse component grids underlying the overlap region.
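The recursive ordering described under Time Integration can be summarized in a few lines of code. The sketch below is illustrative only; the two helper functions are hypothetical stand-ins for the operations an AMR engine such as GrACE performs (single-level integration and the coarse-grid restriction), and a uniform refinement factor is assumed.

```cpp
// Minimal sketch of the Berger-Oliger recursive time-stepping order.
// Hypothetical hooks: `integrateComponentGrids` advances all component grids
// at one level by dt; `restrictToCoarse` performs the coarse-grid update
// from level+1 onto level.
void integrateComponentGrids(int level, double t, double dt);
void restrictToCoarse(int level);

void advanceLevel(int level, double t, double dt, int maxLevel, int refineFactor) {
    integrateComponentGrids(level, t, dt);          // one step on this level
    if (level < maxLevel) {
        double dtFine = dt / refineFactor;          // finer level uses smaller steps
        for (int k = 0; k < refineFactor; ++k)      // take refineFactor sub-steps ...
            advanceLevel(level + 1, t + k * dtFine, dtFine, maxLevel, refineFactor);
        restrictToCoarse(level);                    // ... then update this level from level+1
    }
}
```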
Space-Filling Curves (SFC)
An efficient and effective parallel/distributed adaptive finite-difference (AFD) implementation must use a parallelization, decomposition, and distribution scheme that complements the problem itself. Correspondingly, the implementation of the identified data structures must complement the dynamic and hierarchical nature of the adaptive grid hierarchy. The fundamental requirement in realizing such a class of dynamic, distributed data structures is the generation of an extendable, global index space. A class of dimension-changing mappings called space-filling curves can be used to provide such an index space.
Space-filling curves are a class of locality-preserving mappings from d-dimensional space to 1-dimensional space, i.e. N^d → N^1, such that each point in N^d is mapped to a unique point or index in N^1. The mapping can thus be thought of as laying out a string within the d-dimensional space so that it completely fills the space. The 1-dimensional mapping generated by the space-filling curve serves as an ordered indexing into the multidimensional space. Mapping functions used to generate the space-filling index corresponding to a point in multidimensional space typically consist of interleaving operations and logical manipulations of the coordinates of the point, and are computationally inexpensive. Two such mappings, the Morton order and the Peano-Hilbert order, are shown in the figure below. Two properties of space-filling curves, viz. digital causality and self-similarity, make them particularly suited as a means of generating the required adaptive and hierarchical index space.



Digital Causality: Digital causality implies that points that are close together in the d-dimensional space are mapped to points that are close together in the 1-dimensional space, i.e. locality is preserved by the mapping.
Self-Similarity: The self-similarity property implies that, as a d-dimensional region is refined into smaller sub-regions, the refined sub-regions can be recursively filled by curves that have the same structure as the curve used to fill the original (unrefined) region, but possibly different orientations. The figure above illustrates this property for a 2-dimensional region with refinements by factors of 2 and 3.
In addition to providing a linear ordered index space that is hierarchical and extendable, each index generated by the space-filling mapping has information about the original multidimensional space embedded in it. As a result, given a key, it is possible to obtain its position in the original multidimensional space.
HDDA/DAGH Model
Hierarchical, Extendible Index Space
The hierarchical, extendible index space component of the Hierarchical Dynamic Distributed Array (HDDA) is derived directly from the application domain using space-filling mappings, which are computationally efficient, recursive mappings from N-dimensional space to 1-dimensional space. The solution space is first partitioned into segments. The space-filling curve (such as a 2-dimensional Peano-Hilbert curve) then passes through the midpoints of these segments. Space-filling mappings encode application-domain locality and maintain this locality through expansion and contraction. The self-similar or recursive nature of these mappings can be exploited to represent a hierarchical structure and to maintain locality across different levels of the hierarchy. Space-filling mappings allow information about the original multidimensional space to be encoded into each space-filling index. Given an index, it is possible to obtain its position in the original multidimensional space, the shape of the region in the multidimensional space associated with the index, and the space-filling indices that are adjacent to it. The index space is used as the basis for application domain partitioning, as a global namespace for name resolution, and for communication scheduling.
Mapping to Address Space
The mapping from the multidimensional index space to the one-dimensional physical address space is accomplished by mapping positions in the index space to the order in which they occur in a traversal of the space-filling curve. This mapping can be accomplished with simple bit-interleaving operations that construct a unique ordered key. The resulting key set is unique and defines a global address space. By coalescing segments of the linear key space into a single key, blocks of arbitrary granularity can be created.
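As an illustration of the bit-interleaving idea, the sketch below computes a Morton-order key for a 2-D index. It is not GrACE's actual key construction (which also encodes block extents and refinement information); it only shows how interleaving coordinate bits produces an ordered, locality-preserving key.

```cpp
#include <cstdint>

// Interleave the bits of a 2-D index (x, y) into a single Morton-order key:
// x bits occupy the even bit positions, y bits the odd ones. Nearby (x, y)
// points therefore tend to receive nearby keys (digital causality).
std::uint64_t mortonKey2D(std::uint32_t x, std::uint32_t y) {
    std::uint64_t key = 0;
    for (int bit = 0; bit < 32; ++bit) {
        key |= static_cast<std::uint64_t>((x >> bit) & 1u) << (2 * bit);
        key |= static_cast<std::uint64_t>((y >> bit) & 1u) << (2 * bit + 1);
    }
    return key;
}
```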
Storage and Access
Data storage is implemented using extendible hashing techniques to provide dynamically extendible, globally indexed storage. The keys for the extendible hash table are contractions of the unique keys. Entries in the HDDA correspond to Dynamic Adaptive Grid Hierarchy (DAGH) blocks. Expansion and contraction are local operations involving at most two buckets, and locality of data is preserved without copying. The HDDA data storage provides a means for efficient communication between DAGH blocks. To communicate data to another DAGH block, the data is copied to the appropriate locations in the HDDA; this information is then asynchronously shipped to the appropriate processor. Similarly, data needed from remote DAGH blocks is received on the fly and inserted into the appropriate location in the HDDA. Storage associated with the HDDA is maintained in ready-to-ship buckets, which alleviates the overheads associated with packing and unpacking: an incoming bucket is directly inserted into its location in the HDDA, and when data associated with a DAGH block entry is ready to ship, the associated bucket is shipped as-is. The overall HDDA/DAGH distributed dynamic storage scheme is shown in the figure.
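The storage idea can be pictured with the sketch below. It is not GrACE's implementation: a standard hash map stands in for the extendible hash table and the key contraction is omitted. It only illustrates keeping each block's data in a contiguous, ready-to-ship bucket indexed by its space-filling key, so buckets can be handed to the communication layer or inserted on receipt without packing or unpacking.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Key    = std::uint64_t;        // (contracted) space-filling key of a DAGH block
using Bucket = std::vector<char>;    // contiguous, wire-ready payload for that block

struct BlockStore {
    std::unordered_map<Key, Bucket> table;   // stand-in for the extendible hash table

    Bucket& bucketFor(Key k) { return table[k]; }            // local read/write access
    const Bucket* shipReady(Key k) const {                   // hand bucket to comm layer as-is
        auto it = table.find(k);
        return it == table.end() ? nullptr : &it->second;
    }
    void insertIncoming(Key k, Bucket b) { table[k] = std::move(b); }  // received bucket inserted as-is
};
```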

An instance of a DAGH is mapped to an instance of the HDDA. The granularity of the storage blocks is system dependent and attempts to balance the computation-communication ratio for each block. Each block in the list is assigned a cost corresponding to its computational load. In the case of an AMR scheme, computational load is determined by the number of grid elements contained in the block and the level of the block in the AMR grid hierarchy. The former defines the cost of an update operation on the block while the latter defines the frequency of updates relative to the base grid of the hierarchy. In the representation described above, space-filling mappings are applied to grid blocks instead of individual grid elements. The shape of a grid block and its location within the original grid are uniquely encoded into its space-filling index, thereby allowing the block to be completely described by a single index. Partitioning a DAGH across processing elements using this representation consists of appropriately partitioning the DAGH key list so as to balance the total cost at each processor. Since space-filling curve mappings preserve spatial locality, the resulting distribution is comparable to traditional block distributions in terms of communication overheads.
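A simple way to picture the partitioning step is the greedy sketch below, which splits an ordered, cost-weighted list of DAGH blocks into contiguous chunks whose summed costs are roughly equal. This is only an illustration of the idea, not GrACE's actual partitioner; because the list follows the space-filling order, contiguous chunks remain spatially local.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Return, for each processor, the index of the first block it owns.
// blockCost[i] is the computational load of the i-th block in space-filling order.
std::vector<std::size_t> partitionKeyList(const std::vector<double>& blockCost, int nProcs) {
    const double total  = std::accumulate(blockCost.begin(), blockCost.end(), 0.0);
    const double target = total / nProcs;                        // ideal per-processor load
    std::vector<std::size_t> firstBlock{0};
    double acc = 0.0;
    for (std::size_t i = 0; i < blockCost.size(); ++i) {
        if (static_cast<int>(firstBlock.size()) == nProcs) break; // last processor takes the rest
        acc += blockCost[i];
        if (acc >= target) {                                      // close this chunk, start the next
            firstBlock.push_back(i + 1);
            acc = 0.0;
        }
    }
    return firstBlock;
}
```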
The DAGH representation starts with a single HDDA corresponding to the base grid of the grid hierarchy, and appropriately incorporates newly created grids within this list as the base grid gets refined. The resulting structure is a composite key space of the entire adaptive grid hierarchy. Incorporation of refined component grids into the base grid key space is achieved by exploiting the recursive nature of space-filling mappings. For each refined region, the key list corresponding to the refined region is replaced by the child grid's key list. The costs associated with blocks of the new list are updated to reflect combined computational loads of the parent and child. The DAGH representation therefore is a composite ordered list of DAGH blocks where each DAGH block represents a block of the entire grid hierarchy and may contain more than one grid level; i.e. inter-level locality is maintained within each DAGH block. Each DAGH block in this representation is fully described by the combination of the space-filling index corresponding to the coarsest level it contains, a refinement factor, and the number of levels contained.
Features
Distributed Dynamic Data-structures for Parallel Hierarchical AMR
Transparent access to Scalable Distributed Dynamic Arrays/Grids/Grid-Hierarchies
Multi-grid/Line-Multi-grid support within AMR
Shadow grid hierarchy for on-the-fly error estimation
Automatic dynamic partitioning and load distribution
Locality, Locality, Locality !!!
Scalability, Portability, Performance
Programming Interface
Programming Abstraction
Supporting Parallel AMR
Checkpoint/Restart/Rollback
Integrate IO/Visualization
Multi-grid Support
Programming Abstractions for Hierarchical AMR
For GrACE, three fundamental programming abstractions have been developed, using the HDDA/DAGH data structures, that can be used to express parallel adaptive computations based on adaptive mesh refinement (AMR) and multigrid techniques, as shown in the figure below. Our objectives are twofold: first, to provide application developers with a set of primitives that are intuitive for implementing the problem, i.e. application objects ↔ abstract data types, and second, to separate data-management issues and implementations from application-specific computations.

Grid Geometry Abstraction
The purpose of the grid geometry abstractions is to provide an intuitive means for identifying and addressing regions in the computational domain. These abstractions can be used to direct computations to a particular region in the domain, to mask regions that should not be included in a given operation, or to specify regions that need more resolution or refinement. The grid geometry abstractions represent coordinates, bounding boxes, and doubly linked lists of bounding boxes.
Coordinates: The coordinate abstraction represents a point in the computational domain. Operations defined on this class include indexing and arithmetic/logical manipulations. These operations are independent of the dimensionality of the domain.
Bounding Boxes: Bounding boxes represent regions in the computational domain. Each bounding box comprises a triplet: a pair of Coords defining the lower and upper bounds of the box, and a step array that defines the granularity of the discretization in each dimension. In addition to regular indexing and arithmetic operations, scaling, translation, union, and intersection operations are also defined on bounding boxes. Bounding boxes are the primary means for specifying operations and for storing internal information (such as dependency and communication information) within DAGH.
Bounding Box Lists: Lists of bounding boxes represent a collection of regions in the computational domain. Such a list is typically used to specify regions that need refinement during the regridding phase of an adaptive application. In addition to linked-list addition, deletion, and stepping operations, reduction operations such as intersection and union are also defined on a BBoxList.
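The sketch below illustrates the kind of bounding-box intersection operation described above with a tiny fixed-dimension type. It is illustrative only; GrACE's Coords/BBox/BBoxList classes are dimension-independent and considerably richer, and the names and layout here are hypothetical.

```cpp
#include <algorithm>

// A minimal 2-D bounding box: lower bound, upper bound, and per-dimension step.
struct Box2D {
    int lo[2], up[2], step[2];
};

// Intersect two boxes; returns false if the intersection is empty.
// Assumes both boxes use the same discretization granularity.
bool intersect(const Box2D& a, const Box2D& b, Box2D& out) {
    for (int d = 0; d < 2; ++d) {
        out.lo[d]   = std::max(a.lo[d], b.lo[d]);
        out.up[d]   = std::min(a.up[d], b.up[d]);
        out.step[d] = a.step[d];
        if (out.lo[d] > out.up[d]) return false;
    }
    return true;
}
```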

Operations
Grid Hierarchy Abstraction
The grid hierarchy abstraction represents the distributed dynamic adaptive grid hierarchy that underlies parallel adaptive applications based on adaptive mesh refinement. This abstraction enables a user to define, maintain, and operate on a grid hierarchy as a first-class object. Grid hierarchy attributes include geometry specifications of the domain, such as the structure of the base grid, its extents, boundary information, and coordinate information, as well as refinement information such as the nature of refinement and the refinement factor to be used. When used in a parallel/distributed environment, the grid hierarchy is partitioned and distributed across the processors and serves as a template for all application variables or grid functions. The locality-preserving composite distribution based on recursive space-filling curves is used to partition the dynamic grid hierarchy. Operations defined on the grid hierarchy include indexing of individual component grids in the hierarchy, refinement, coarsening, recomposition of the hierarchy after regridding, and querying of the structure of the hierarchy at any instant. During regridding, the repartitioning of the new grid structure, dynamic load balancing, and the data movement required to initialize newly created grids are performed automatically and transparently.
Grid Function Abstraction
Grid Functions represent application variables defined on the grid hierarchy. Each grid function is associated with a grid hierarchy and uses the hierarchy as a template to define its structure and distribution. Attributes of a grid function include type information and dependency information in terms of space and time stencil radii. In addition, the user can assign special (Fortran) routines to a grid function to handle operations such as inter-grid transfers (prolongation and restriction), initialization, boundary updates, and input/output. These functions are then called internally when operating on the distributed grid function. In addition to standard arithmetic and logical manipulations, a number of reduction operations such as Min/Max, Sum/Product, and Norms are also defined on grid functions. Grid Function objects can be locally operated on as regular Fortran 90/77 arrays.
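The sketch below shows, in heavily simplified form, how the two abstractions compose: a hierarchy acts as a distributed template, a grid function attaches an application variable and a user-supplied kernel to it, and the engine sweeps the locally owned component grids. All names here are hypothetical stand-ins rather than GrACE's actual API, and communication and regridding are omitted.

```cpp
#include <vector>

struct ComponentGrid { std::vector<double> data; int level; };

// Distributed template: the component grids owned by this process.
struct GridHierarchy { std::vector<ComponentGrid> localGrids; };

// Application variable defined on a hierarchy, with a user-supplied local
// update kernel (in GrACE this can be a Fortran routine).
struct GridFunction {
    GridHierarchy* gh;
    void (*update)(ComponentGrid&);
};

// Sweep the locally owned grids; a real engine would also schedule ghost
// and intergrid communication around this loop.
void sweep(GridFunction& gf) {
    for (ComponentGrid& g : gf.gh->localGrids)
        gf.update(g);              // each block is operated on as a plain array
}
```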
Separation of Concerns => Hierarchical Abstractions

GrACE Class Structures
......
Applications
DAGH/GrACE Applications
| Application | Description |
| --- | --- |
| H3expresso AMR (J. Masso) | Numerical relativity code that solves the full Einstein equations in 3D using a conservative finite-differencing scheme and runs in parallel with AMR |
| Black-box Multigrid Solvers (S. Klasky) | Routines that solve equations of the form L_h u_h = f_h through a "black-box" calling interface, with several multigrid flavors available |
| Boson Star Application (R. Guenther) | Solves the self-gravitating, time-dependent Schrödinger equation in three spatial dimensions, with line-multigrid solvers used to update the Schrödinger field |
| Texas Black Hole Evolution Code (TCODE) (R. Correll & S. Klasky) | Black hole evolution code with leapfrog time updates and second-order methods at the outer boundary replaced by first-order methods |
| 2D Laser-Plasma Interaction Fluid Code (D. Fisher & S. Klasky) | Models the wakefields of a very short, high-intensity laser pulse interacting with a plasma in a moving-frame coordinate system, with Lax-Wendroff time updates and 2D multigrid using a point-wise smoothing method |

Structured Adaptive Mesh Refinement (SAMR) Applications
| Application | Description |
| --- | --- |
| Scalarwave 2D/ Scalarwave 3D | Numerical relativity application (aka NumRel 2D/ NumRel 3D) solving a coupled set of partial differential equations. The equations can be divided into two classes: elliptic (Laplace-like) constraint equations that must be satisfied at each time, and coupled hyperbolic (wave-like) equations describing the time evolution. This kernel is part of the Cactus numerical relativity toolkit. http://www.cactuscode.org |
| Buckley-Leverette 2D/ Buckley-Leverette 3D | The Buckley-Leverette model is used in the Oil-Water Flow Simulation (OWFS) software system for the simulation of hydrocarbon pollution in aquifers. OWFS provides layer-by-layer modeling of the oil-water mixture in confined aquifers with regard to discharge/recharge, infiltration, interaction with surface water bodies and drainage systems, discharge into springs, and leakage between layers. This kernel is taken from the IPARS reservoir simulation toolkit developed at the Center for Subsurface Modeling at the University of Texas at Austin. http://www.ticam.utexas.edu/CSM/ACTI/ipars.html |
| RM 2D/ RM3D | 2/3-dimensional compressible turbulence application solving the Richtmyer-Meshkov instability. This application kernel is part of the Virtual Test Facility developed at the ASCI/ASAP Center at the California Institute of Technology. The Richtmyer-Meshkov instability is a fingering instability that occurs at a material interface accelerated by a shock wave and plays an important role in studies of supernovae and inertial confinement fusion. http://www.cacr.caltech.edu/ASAP |
| EnoAMR 2D | 2-dimensional computational fluid dynamics application that addresses the forward-facing step problem, describing what happens when a step instantaneously rises in a supersonic flow. The simulation exhibits several features including a bow shock, a Mach stem, a contact discontinuity, and a numerical boundary. EnoAMR 2D is also part of the Virtual Test Facility developed at the ASCI/ASAP Center at Caltech. http://www.cacr.caltech.edu/ASAP |
GrACE Visualizations
Multi-Block Grid Structure

RM3D + GrACE (R. Samtaney)

Twin Mach Reflection Refraction (TMRR)
CO2-CH4 gas interface
Incident shock Mach number, M = 2.25
Experiment by Abd-El-Fattah and Henderson (JFM 1978)

Strong Shock Richtmyer-Meshkov Instability
Air-SF6 interface with strong single harmonic perturbation
Incident shock Mach number, M = 10
4 levels of refinement

Richtmyer-Meshkov Instability
Air-SF6 interface with strong single harmonic perturbation
Incident shock Mach number, M = 1.5
2 levels of refinement
512 Processor AMR with 3 levels of refinement
Early stages of refraction (5203 zones shown above)

128 Processor AMR with 3 levels of refinement
RM3D + GrACE

AMR for Eulerian Solvers in the VTF
Objective: To have adaptive mesh refinement and coarsening capability for Eulerian fluid engine (RM3D) within the VTF (Virtual Test Facility)
CFD engine in VTF: RM3D

Example: 2D Supersonic flow over a circle
M = 2 supersonic inflow boundary
Base mesh 40*40 with 3 AMR levels
Results agree with unigrid simulation
Shock stand-off distance agrees with experiment
Example: 2D Complex Inlet (base mesh 50*50, 5 AMR levels)

Example: Shock over a 2D Complex Contour (base mesh 50*50, 4 AMR levels)


VTF Algorithmic Integration

Example: Detonation Diffraction around a Cube




Related Infrastructures (MACE & PAGH)
The MACE (Multi-block Adaptive Computational Engine) infrastructure supports multi-block grids in which multiple distributed and adaptive grid blocks with heterogeneous discretizations are coupled together through lower-dimensional (also distributed and adaptive) mortar grids. MACE is being used extensively to support subsurface modeling and oil reservoir simulations.
PAGH (Parallel Adaptive Grid Hierarchy), developed by Erik Schnetter working with ASC (Astrophysics Simulation Collaboratory) project members Evans, Goodale, and Parashar, is a GrACE-based driver routine designed and developed for the ASC. It provides an interface for memory management and I/O for the grid functions used in a simulation. On parallel machines, the driver also manages the necessary communication between the individual processing nodes.
PAGH replaces the standard Cactus driver PUGH (Parallel Uniform Grid Hierarchy) which is limited to uniform structured grids. PAGH extends the driver to distributed adaptive grid hierarchies. It also provides facilities for adaptive mesh refinement (AMR) using the Berger-Oliger AMR formulation, where the refined regions are rectangular boxes that have to be completely inside a coarser box. PAGH restricts the refined regions to have an integer refinement factor and to be aligned with respect to the coarser regions. The number of refinement levels and the number of refined regions per level are not restricted. The location and the size of the refined regions can be changed at runtime. Based on a user-defined truncation error estimator, refined regions automatically track regions of high error and adapt to the computational needs. To facilitate the truncation error estimation, PAGH can provide a shadow hierarchy, i.e. a twin to the computational domain with a lower resolution. The truncation error can be estimated by comparing the results obtained on both domains. This capability uses the support for a Shadow hierarchy provided by GrACE. PAGH parameterizes all technical aspects of the refinement procedure and allows them to be changed by the user. These aspects include certain space/time tradeoffs, and the prolongation and restriction stencils used to transport data between the refinement levels. Reasonable defaults are provided for most applications.
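The text above does not spell out the estimator; one common Richardson-type form, stated here only as an illustration and assuming a scheme of order p, compares the solution u_h on the computational hierarchy with the solution u_2h obtained on the half-resolution shadow hierarchy:

```latex
% Hypothetical illustration, not necessarily PAGH's exact formula.
\[
  \tau \;\approx\; \frac{u_{2h} - u_h}{2^{p} - 1},
  \qquad \text{refine wherever } \lVert \tau \rVert > \epsilon_{\mathrm{tol}} .
\]
```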
PAGH is suited for all 3D applications that need to selectively enhance the spatial resolution to meet their numerical requirements within the bounds given by the available computational resources.
Characterization of Distributed Adaptive Grid Hierarchies
This work is a performance characterization of dynamic partitioning and load-balancing techniques for the distributed adaptive grid hierarchies that underlie parallel adaptive mesh-refinement (AMR) techniques for the solution of partial differential equations. The primary motivation for the characterization is the development of a policy-driven tool for automated configuration and run-time management of distributed adaptive applications on dynamic and heterogeneous networked computing environments.
Dynamically adaptive methods for the solution of partial differential equations that employ locally optimal approximations can yield highly advantageous ratios for cost/accuracy when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions of high local solution error.
Distributed implementations of these adaptive methods offer the potential for accurate solution of physically realistic models of important physical systems. These implementations, however, lead to interesting challenges in dynamic resource allocation, data distribution and load balancing, communications and coordination, and resource management. The overall efficiency of the algorithms is limited by the ability to partition the underlying data structures at run time so as to expose all inherent parallelism, minimize communication/synchronization overheads, and balance load. A critical requirement while partitioning adaptive grid hierarchies is the maintenance of logical locality, both across different levels of the hierarchy under expansion and contraction of the adaptive grid structure, and within partitions of grids at all levels when they are decomposed and mapped across processors. The former enables efficient computational access to the grids while the latter minimizes the total communication and synchronization overheads. Furthermore, application adaptivity results in application grids being created, moved, and deleted on the fly, making it necessary to efficiently re-partition the hierarchy on the fly so that it continues to meet these goals.
Moving these applications to dynamic and heterogeneous computational grid environments introduces a new level of complexity. These environments aim at improving the efficiency with which resources are used by matching individual simulations to the most appropriate available resources. However, the complexity and heterogeneity of the grid environment make the selection of a "best" match between system resources, application algorithms, problem decompositions, mappings and load distributions, communication mechanisms, etc., non-trivial. System dynamics coupled with application adaptivity motivates the need for "smart" tools that can automate the configuration and management process.
Our work is therefore an application-centric characterization of distribution mechanisms for AMR applications on heterogeneous (and dynamic) cluster computing environments.
Portable Profiling and Performance Analysis using TAU, PDT and VAMPIR
Objective: Profiling an application allows the user to identify bottlenecks and obtain function-level profiling information, including for templated functions. Generating a profile shows which modules are most time-intensive. The objective is therefore to collect and analyze performance data in order to isolate functions in GrACE that can be scheduled more efficiently to improve the overall execution time.
TAU: TAU (Tuning and Analysis Utilities) is a visual performance analysis environment for parallel C++ and HPF that uses Tcl/Tk for graphics. It is currently designed to instrument parallel, multi-threaded C and C++ code. TAU collects performance data during run-time execution of the program and then provides a post-mortem analysis and display of the performance information.
TAU can show the exclusive and inclusive time spent in each function. For templated entities, it shows the breakdown of time spent in each instantiation. Other data include how many times each function was called, how many profiled functions each function invoked, and the mean inclusive time per call. It shows the mean time spent in a function over all nodes, contexts, and threads. It can also show the exclusive and inclusive times spent in a function for each invocation of every function (and the aggregated sum over all invocations).
Instead of time, it can use hardware performance counters and show, for each function, the number of instructions issued, cycles, loads, stores, floating-point operations, primary and secondary data cache misses, TLB misses, etc.
It can also calculate statistics such as the standard deviation of the exclusive time (or counts) spent in each templated function.
Instead of profiling functions, the user can profile at a finer granularity using timers: all of the above quantities can be collected for multiple user-defined timers, allowing individual statements in the code to be profiled instead of whole functions.
Instrumenting the code using PDT: To profile a function, macros must be added to the source code to identify routine transitions. This can be done automatically using the TAU C++ instrumentor tau_instrumentor, or by instrumenting the code at runtime using the Dyninst API. We used PDT (Program Database Toolkit), developed at the University of Oregon, to instrument the GrACE source code. PDT inserts macros into the source code during compilation, and the object files are then created from the instrumented source files. The architecture is illustrated in the diagram below.

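For reference, a hand-instrumented sketch using standard TAU macros is shown below; the automatic route via PDT/tau_instrumentor inserts equivalent calls. Macro names follow the TAU documentation, but exact usage can vary between TAU versions, and the routine names here are placeholders.

```cpp
#include <TAU.h>
#include <mpi.h>

void smoothLevel() {
    // Times every invocation of this routine (exclusive and inclusive time).
    TAU_PROFILE("smoothLevel", "void ()", TAU_USER);
    // ... computation ...
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    TAU_PROFILE_INIT(argc, argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    TAU_PROFILE_SET_NODE(rank);                 // one profile per node/context/thread

    // Finer-grained, user-defined timer around a code region.
    TAU_PROFILE_TIMER(t, "time-step loop", "", TAU_USER);
    TAU_PROFILE_START(t);
    smoothLevel();
    TAU_PROFILE_STOP(t);

    MPI_Finalize();
    return 0;
}
```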
Visualizing traces using VAMPIR: Typically, profiling shows the distribution of execution time across routines. It can show the code locations associated with specific bottlenecks, but it does not show the temporal aspect of performance variations. Tracing the execution of a parallel program shows when and where an event occurred, in terms of the process that executed it and the location in the source code. In addition to PROFILE files, TAU also generates TRACE files for each node, thread, and context. These TRACE files can then be converted to .pv format to be viewed using VAMPIR (Visualization and Analysis of MPI Programs).

This shows exactly when a message is sent from one node to another, along with other parallelism statistics.
References :
TAU : http://www.acl.lanl.gov/tau/
VAMPIR : http://www.pallas.com/e/products/vampir/index.htm
PDT : http://www.cs.uoregon.edu/research/paracomp/pdtoolkit/
Multithreaded Computational Engine
Objective:
The objective of this project is to develop a multithreaded communication engine to support the GrACE infrastructure for AMR applications based on adaptive grid hierarchies. The overall goal is to improve performance by overlapping computations on individual grid blocks with associated (inter-level and intra-level) communications.
Motivation:
Dynamically adaptive methods for the solution of partial differential equations that employ locally optimal approximations can yield highly advantageous ratios for cost/accuracy when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions of high local solution error.
Distributed implementations of these adaptive methods offer the potential for the accurate solution of realistic models of important physical systems. These implementations however, lead to interesting challenges in dynamic resource allocation, data-distribution and load balancing, communications and coordination, and resource management. The overall efficiency of the algorithms is limited by the ability to partition the underlying data-structures at run-time to expose all inherent parallelism, minimize communication/synchronization overheads, and balance load. This motivates the need for an efficient communication engine to minimize communication and synchronization overheads.
Performance analysis of AMR applications showed that, in certain cases, up to 50% of the total execution time can be spent synchronizing ghost regions between grid blocks at different levels of the grid hierarchy, limiting the overall scalability of these applications. We concluded that, to improve performance, synchronization time has to be minimized by exploiting the parallelism inherent in the computation on blocks. This can be done by incorporating multithreading into the library.
Approach:
The multithreaded engine enables overlap between the computations and communications on grid blocks owned by a processor. As the processor cycles through its grid blocks sequentially, communication of the ghost regions of completed blocks is scheduled concurrently. Our multithreaded engine consists of two classes of threads: synchronization threads and computation threads. Synchronization threads are normally dormant and are activated only when communication is required. Computation threads are responsible for all management and computational tasks (i.e. setting up the grid hierarchy, data and storage management, and load balancing). The key motivation for such a simple model is that only one thread interacts with MPI at any time, so the implementation does not depend on a thread-safe MPI, which is not available on all platforms. The operation of the threaded engine is as follows (a minimal sketch of the coordination pattern follows the two items below):
Ghost Synchronizations: The multithreaded engine guarantees the semantics of a computational loop followed by a ghost synchronization in which all the ghost regions are updated on return from the synchronization call. When the computational loop is initiated (e.g. a GrACE forall loop), the send thread and receive thread are signaled. All communication tasks are then offloaded to these two threads. At the end of the computational loop, all block communications are guaranteed to be complete.
Redistribution: During redistribution, the grid data is communicated by the communication thread using the signaling mechanisms and mutexes available in the thread library. The computation thread signals the send and receive threads once computation has begun on the blocks. The send thread, receive thread, and computation threads synchronize using condition variables and send and receive queues.
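A minimal sketch of this coordination pattern is shown below, assuming POSIX threads: the computation thread enqueues communication tasks and signals a normally dormant send thread through a condition variable, so only that thread ever touches MPI. The types and function names are illustrative, not GrACE's implementation.

```cpp
#include <pthread.h>
#include <queue>

struct CommTask { int peer; /* ghost-region payload would go here */ };

static std::queue<CommTask> sendQueue;
static pthread_mutex_t      queueLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t       workReady = PTHREAD_COND_INITIALIZER;
static bool                 done      = false;

// Send thread: dormant until signaled; the only thread that would call MPI.
void* sendThread(void*) {
    for (;;) {
        pthread_mutex_lock(&queueLock);
        while (sendQueue.empty() && !done)
            pthread_cond_wait(&workReady, &queueLock);
        if (sendQueue.empty() && done) { pthread_mutex_unlock(&queueLock); break; }
        CommTask task = sendQueue.front();
        sendQueue.pop();
        pthread_mutex_unlock(&queueLock);
        // ... MPI_Isend of the ghost region for task.peer would go here ...
        (void)task;
    }
    return nullptr;
}

// Computation thread: after finishing a block, schedule its ghost sends and move on.
void scheduleGhostSend(int peer) {
    pthread_mutex_lock(&queueLock);
    sendQueue.push(CommTask{peer});
    pthread_cond_signal(&workReady);
    pthread_mutex_unlock(&queueLock);
}

// Tell the send thread to exit once all blocks are done.
void finishSends() {
    pthread_mutex_lock(&queueLock);
    done = true;
    pthread_cond_signal(&workReady);
    pthread_mutex_unlock(&queueLock);
}
```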

Status:
The multithreaded communication engine is being developed using the POSIX pthreads library to enable easy porting of the engine to different operating systems. Current implementation and experimentation is being done on Sun E10K machines running Sun HPC 5.0.
GrACE Class Structures
The following are the class structure listings of the important classes within GrACE, organized by functional group.
Classes
Publications
Grid Adaptive Computational Engine (GrACE)
"Hierarchical Partitioning Techniques for Structured Adaptive Mesh Refinement (SAMR) Applications," X. Li*, S. Ramanathan*, and M. Parashar, Proceedings of the 2002 International Conference on Parallel Processing (ICPP 2002), 4th Workshop on High Performance Scientific and Engineering Computing Applications (HPSECA-02), Vancouver, British Columbia, Canada, IEEE Computer Society Press, pp. 336 – 343, August 2002.
Multiblock Adaptive Computational Engine (MACE)
Autonomic/Adaptive Runtime Management for Dynamic Applications (ARMaDA)
Collaborators
Center for Subsurface Modeling, TICAM, University of Texas at Austin
DoE ASCI/ASAP Center for Simulation of Dynamic Response of Materials, California Institute of Technology
Mathematics and Computer Science Division, Argonne National Laboratory
Department of Electrical and Computer Engineering, University of Arizona
Numerical Relativity Group, Washington University at St Louis
Max-Planck Institute for Gravitational Physics, Golm, Germany
Combustion Research Facility, Sandia National Laboratories, Livermore CA
Institute for Geophysics, University of Texas at Austin
Applied Numerical Algorithms Group, Lawrence Berkeley National Laboratory
