
An adaptive computational engine infrastructure forms an effective basis for developing adaptive methods for the solution of systems of partial differential equations across varied, complex scientific application domains, including Grand Challenge problems. Parallel/distributed implementations of these adaptive methods offer the potential for the accurate solution of realistic models of important physical systems. This becomes increasingly important as scientific simulations play a critical role in all areas of science and engineering: accurate simulation of these expensive scientific problems is necessary to validate and justify the investment made in them.

The need for adaptive computational engines is motivated by two observations: adaptive methods will be used for the solution of almost all very large-scale science and engineering models, and these adaptive methods will be executed on very large-scale, heterogeneous parallel execution environments.
GrACE (Grid Adaptive Computational Engine) is an adaptive computational and data-management engine for enabling distributed adaptive mesh-refinement computations on structured grids. It builds on the Distributed Adaptive Grid Hierarchy (DAGH) infrastructure and extends it to provide a virtual, semantically specialized, distributed (and dynamic) shared memory infrastructure with multifaceted objects specialized to distributed adaptive grid hierarchies and grid functions. GrACE has been deployed to support applications in several domains, including DoE ASCI codes at the California Institute of Technology and the University of Chicago, and the NASA Neutron Star Grand Challenge.
GrACE has been developed at the Texas Institute for Computational and Applied Mathematics (TICAM) at the University of Texas at Austin and at The Applied Software Systems Laboratory (TASSL) in the Center for Advanced Information Processing (CAIP) at Rutgers University.
Overview
Dynamically adaptive methods for the solution of partial differential equations that employ locally optimal approximations can yield highly advantageous ratios for cost/accuracy when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions of high local solution error. Parallel/distributed implementations of these adaptive methods offer the potential for the accurate solution of realistic models of important physical systems. These implementations, however, lead to interesting challenges in dynamic resource allocation, data-distribution and load balancing, communications and coordination, and resource management.

Critical among these is the partitioning problem where load balance, communication, data migration, and other overhead costs such as grid aspect ratios need to be optimized simultaneously. Hence, the key requirements for a decomposition scheme for partitioning an adaptive grid hierarchy can be summarized as:
* expose available data parallelism,
* minimize communication overhead,
* balance overall load distribution,
* enable dynamic load redistribution with minimum overheads.
Another important requirement while partitioning adaptive grid hierarchies is the maintenance of logical locality, both across different levels of the hierarchy under expansion and contraction of the adaptive grid structure, and within partitions of grids at all levels when they are decomposed and mapped across processors. The former enables efficient computational access to the grids while the latter minimizes the total communication and synchronization overheads. Furthermore, application adaptation results in application grids being created, moved, and deleted on the fly, making it necessary to efficiently re-partition the hierarchy so that it continues to meet these goals.
Moving the adaptive applications to dynamic and heterogeneous networked computing environments introduces a new level of complexity. These environments require selecting and configuring application components based on available resources. However, the complexity and heterogeneity of the environment make the selection of a "best" match between system resources, application algorithms, problem decompositions, mappings and load distributions, communication mechanisms, etc., non-trivial. System dynamics coupled with application adaptation makes application configuration and run-time management a significant challenge.
GrACE addresses these issues efficiently using a separation-of-concerns approach. The Grid Adaptive Computational Engine consists of two components, described in the sections that follow.
Adaptive Mesh Refinement (AMR)
Dynamically adaptive numerical techniques for solving differential equations provide a means for concentrating computational effort on the most demanding regions. In the case of hierarchical AMR methods, this is achieved by tracking regions in the domain that require additional resolution and dynamically overlaying finer grids over these regions. Techniques based on AMR start with a base coarse grid with the lowest acceptable resolution that covers the entire computational domain. As the solution progresses, regions in the domain with high solution error that require additional resolution are flagged, and finer grids are overlaid on the flagged regions of the coarse grid. Refinement proceeds recursively, so that regions on the finer grid requiring higher resolution are similarly flagged and even finer grids are overlaid on these regions. The resulting grid structure is a dynamic adaptive grid hierarchy. The figure below shows the adaptive grid hierarchy for the classic Berger AMR formulation.


The Berger-Oliger Algorithm [(1984) Journal of Computational Physics, 53, 484-512]
AMR Terminology
Grid Structures
AMR 2-D Example
AMR 3-D Example

Integration Algorithm
The adaptive finite-difference (AFD) integration algorithm defines the order in which different levels of the grid hierarchy are integrated, the interactions between overlaying component grids at different levels, and the criterion and method for grid refinement. Correspondingly, it is composed of three key components: (1) Time Integration, (2) Error Estimation & Regridding, and (3) Intergrid Operations.
Time Integration: Time integration is performed on each component grid using a specified finite-difference operator. Each component grid may have its own operator and can be integrated independently (except for the determination of boundary values). The order of integration along the grid hierarchy is defined recursively: before advancing component grids at a particular level of refinement in time, all component grids at higher levels of refinement must be integrated to the current time of component grids at that level. That is, before stepping component grids at level l, i.e. {G_l^i}, from time T to T + Δt_l, all component grids at levels > l must be integrated to time T. (A sketch of this recursive ordering follows the intergrid operations below.)
Error Estimation & Regridding: The error estimation and regridding component of the integration algorithm performs three steps: (1) flagging regions needing refinement based on error estimation, (2) clustering flagged points, and (3) generating the refined grids. The result may be the creation of a new level of refinement or of component grids at existing levels, and/or the deletion of existing component grids. Grid generation in step 3 is performed so as to maintain proper nesting along the grid hierarchy.
Intergrid Operations: Intergrid operations are used to communicate solution values along the adaptive grid hierarchy. In the case of the Berger-Oliger AFD scheme, the following intergrid operations are defined:
Initialization of refined component grids: Refined component grid initialization may be performed using the interior values of an intersecting component grid at the same level, if one exists, or by prolongating values from the underlying coarser component grid.
Coarse grid update: Underlying coarse component grids are updated using the values on a nested finer component grid each time the two are integrated to the same time. This update or restriction intergrid operation may be performed by direct injection or using a defined averaging/interpolation scheme.
Averaging: Averaging is necessary when two component grids at the same level of refinement overlap, and is used to update the coarse component grids underlying the overlap region.
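The recursive ordering described under Time Integration can be summarized in a few lines of code. The sketch below is illustrative only; the two helper functions are hypothetical stand-ins for the operations an AMR engine such as GrACE performs (single-level integration and the coarse-grid restriction), and a uniform refinement factor is assumed.

```cpp
// Minimal sketch of the Berger-Oliger recursive time-stepping order.
// Hypothetical hooks: `integrateComponentGrids` advances all component grids
// at one level by dt; `restrictToCoarse` performs the coarse-grid update
// from level+1 onto level.
void integrateComponentGrids(int level, double t, double dt);
void restrictToCoarse(int level);

void advanceLevel(int level, double t, double dt, int maxLevel, int refineFactor) {
    integrateComponentGrids(level, t, dt);          // one step on this level
    if (level < maxLevel) {
        double dtFine = dt / refineFactor;          // finer level uses smaller steps
        for (int k = 0; k < refineFactor; ++k)      // take refineFactor sub-steps ...
            advanceLevel(level + 1, t + k * dtFine, dtFine, maxLevel, refineFactor);
        restrictToCoarse(level);                    // ... then update this level from level+1
    }
}
```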
Space-Filling Curves (SFC)
An efficient and effective parallel/distributed adaptive finite-difference (AFD) implementation must use a parallelization, decomposition, and distribution scheme that complements the problem itself. Correspondingly, the implementation of the identified data structures must complement the dynamic and hierarchical nature of the adaptive grid hierarchy. The fundamental requirement in realizing such a class of dynamic, distributed data structures is the generation of an extendable, global index space. A class of dimension-changing mappings called space-filling curves can be used to provide such an index space.
Space-filling curves are a class of locality-preserving mappings from d-dimensional space to 1-dimensional space, i.e. N^d → N^1, such that each point in N^d is mapped to a unique point or index in N^1. The mapping can thus be thought of as laying out a string within the d-dimensional space so that it completely fills the space. The 1-dimensional mapping generated by the space-filling curve serves as an ordered indexing into the multidimensional space. Mapping functions used to generate the space-filling index corresponding to a point in multidimensional space typically consist of interleaving operations and logical manipulations of the coordinates of the point, and are computationally inexpensive. Two such mappings, the Morton order and the Peano-Hilbert order, are shown in the figure below. Two properties of space-filling curves, viz. digital causality and self-similarity, make them particularly suited as a means of generating the required adaptive and hierarchical index space.



Digital Causality: Digital causality implies that points that are close together in the d-dimensional space are mapped to points that are close together in the 1-dimensional space, i.e. locality is preserved by the mapping.
Self-Similarity: The self-similarity property implies that, as a d-dimensional region is refined into smaller sub-regions, the refined sub-regions can be recursively filled by curves that have the same structure as the curve used to fill the original (unrefined) region, but possibly different orientations. The figure above illustrates this property for a 2-dimensional region with refinements by factors of 2 and 3.
In addition to providing a linear ordered index space that is hierarchical and extendable, each index generated by the space-filling mapping has information about the original multidimensional space embedded in it. As a result, given a key, it is possible to obtain its position in the original multidimensional space.
HDDA/DAGH Model
Hierarchical, Extendible Index Space
The hierarchical, extendible index space component of the Hierarchical Dynamic Distributed Array (HDDA) is derived directly from the application domain using space-filling mappings, which are computationally efficient, recursive mappings from N-dimensional space to 1-dimensional space. The solution space is first partitioned into segments. The space-filling curve (such as a 2-dimensional Peano-Hilbert curve) then passes through the midpoints of these segments. Space-filling mappings encode application-domain locality and maintain this locality through expansion and contraction. The self-similar or recursive nature of these mappings can be exploited to represent a hierarchical structure and to maintain locality across different levels of the hierarchy. Space-filling mappings allow information about the original multidimensional space to be encoded into each space-filling index. Given an index, it is possible to obtain its position in the original multidimensional space, the shape of the region in the multidimensional space associated with the index, and the space-filling indices that are adjacent to it. The index space is used as the basis for application domain partitioning, as a global namespace for name resolution, and for communication scheduling.
Mapping to Address Space
The mapping from the multidimensional index space to the one-dimensional physical address space is accomplished by mapping positions in the index space to the order in which they occur in a traversal of the space-filling curve. This mapping can be accomplished with simple bit-interleaving operations that construct a unique ordered key. The resulting key set is unique and defines a global address space. By coalescing segments of the linear key space into a single key, blocks of arbitrary granularity can be created.
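As an illustration of the bit-interleaving idea, the sketch below computes a Morton-order key for a 2-D index. It is not GrACE's actual key construction (which also encodes block extents and refinement information); it only shows how interleaving coordinate bits produces an ordered, locality-preserving key.

```cpp
#include <cstdint>

// Interleave the bits of a 2-D index (x, y) into a single Morton-order key:
// x bits occupy the even bit positions, y bits the odd ones. Nearby (x, y)
// points therefore tend to receive nearby keys (digital causality).
std::uint64_t mortonKey2D(std::uint32_t x, std::uint32_t y) {
    std::uint64_t key = 0;
    for (int bit = 0; bit < 32; ++bit) {
        key |= static_cast<std::uint64_t>((x >> bit) & 1u) << (2 * bit);
        key |= static_cast<std::uint64_t>((y >> bit) & 1u) << (2 * bit + 1);
    }
    return key;
}
```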
Storage and Access
Data storage is implemented using extendible hashing techniques to provide dynamically extendible, globally indexed storage. The keys for the extendible hash table are contractions of the unique keys. Entries in the HDDA correspond to Dynamic Adaptive Grid Hierarchy (DAGH) blocks. Expansion and contraction are local operations involving at most two buckets, and locality of data is preserved without copying. The HDDA data storage provides a means for efficient communication between DAGH blocks. To communicate data to another DAGH block, the data is copied to the appropriate locations in the HDDA; this information is then asynchronously shipped to the appropriate processor. Similarly, data needed from remote DAGH blocks is received on the fly and inserted into the appropriate location in the HDDA. Storage associated with the HDDA is maintained in ready-to-ship buckets, which alleviates the overheads associated with packing and unpacking: an incoming bucket is directly inserted into its location in the HDDA, and when data associated with a DAGH block entry is ready to ship, the associated bucket is shipped as-is. The overall HDDA/DAGH distributed dynamic storage scheme is shown in the figure.
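The storage idea can be pictured with the sketch below. It is not GrACE's implementation: a standard hash map stands in for the extendible hash table and the key contraction is omitted. It only illustrates keeping each block's data in a contiguous, ready-to-ship bucket indexed by its space-filling key, so buckets can be handed to the communication layer or inserted on receipt without packing or unpacking.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Key    = std::uint64_t;        // (contracted) space-filling key of a DAGH block
using Bucket = std::vector<char>;    // contiguous, wire-ready payload for that block

struct BlockStore {
    std::unordered_map<Key, Bucket> table;   // stand-in for the extendible hash table

    Bucket& bucketFor(Key k) { return table[k]; }            // local read/write access
    const Bucket* shipReady(Key k) const {                   // hand bucket to comm layer as-is
        auto it = table.find(k);
        return it == table.end() ? nullptr : &it->second;
    }
    void insertIncoming(Key k, Bucket b) { table[k] = std::move(b); }  // received bucket inserted as-is
};
```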

An instance of a DAGH is mapped to an instance of the HDDA. The granularity of the storage blocks is system dependent and attempts to balance the computation-communication ratio for each block. Each block in the list is assigned a cost corresponding to its computational load. In the case of an AMR scheme, computational load is determined by the number of grid elements contained in the block and the level of the block in the AMR grid hierarchy. The former defines the cost of an update operation on the block while the latter defines the frequency of updates relative to the base grid of the hierarchy. In the representation described above, space-filling mappings are applied to grid blocks instead of individual grid elements. The shape of a grid block and its location within the original grid are uniquely encoded into its space-filling index, thereby allowing the block to be completely described by a single index. Partitioning a DAGH across processing elements using this representation consists of appropriately partitioning the DAGH key list so as to balance the total cost at each processor. Since space-filling curve mappings preserve spatial locality, the resulting distribution is comparable to traditional block distributions in terms of communication overheads.
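A simple way to picture the partitioning step is the greedy sketch below, which splits an ordered, cost-weighted list of DAGH blocks into contiguous chunks whose summed costs are roughly equal. This is only an illustration of the idea, not GrACE's actual partitioner; because the list follows the space-filling order, contiguous chunks remain spatially local.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Return, for each processor, the index of the first block it owns.
// blockCost[i] is the computational load of the i-th block in space-filling order.
std::vector<std::size_t> partitionKeyList(const std::vector<double>& blockCost, int nProcs) {
    const double total  = std::accumulate(blockCost.begin(), blockCost.end(), 0.0);
    const double target = total / nProcs;                        // ideal per-processor load
    std::vector<std::size_t> firstBlock{0};
    double acc = 0.0;
    for (std::size_t i = 0; i < blockCost.size(); ++i) {
        if (static_cast<int>(firstBlock.size()) == nProcs) break; // last processor takes the rest
        acc += blockCost[i];
        if (acc >= target) {                                      // close this chunk, start the next
            firstBlock.push_back(i + 1);
            acc = 0.0;
        }
    }
    return firstBlock;
}
```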
The DAGH representation starts with a single HDDA corresponding to the base grid of the grid hierarchy, and appropriately incorporates newly created grids within this list as the base grid gets refined. The resulting structure is a composite key space of the entire adaptive grid hierarchy. Incorporation of refined component grids into the base grid key space is achieved by exploiting the recursive nature of space-filling mappings. For each refined region, the key list corresponding to the refined region is replaced by the child grid's key list. The costs associated with blocks of the new list are updated to reflect combined computational loads of the parent and child. The DAGH representation therefore is a composite ordered list of DAGH blocks where each DAGH block represents a block of the entire grid hierarchy and may contain more than one grid level; i.e. inter-level locality is maintained within each DAGH block. Each DAGH block in this representation is fully described by the combination of the space-filling index corresponding to the coarsest level it contains, a refinement factor, and the number of levels contained.
Features
Distributed Dynamic Data-structures for Parallel Hierarchical AMR
Transparent access to Scalable Distributed Dynamic Arrays/Grids/Grid-Hierarchies
Multi-grid/Line-Multi-grid support within AMR
Shadow grid hierarchy for on-the-fly error estimation
Automatic dynamic partitioning and load distribution
Locality, Locality, Locality !!!
Scalability, Portability, Performance
Programming Interface
Programming Abstraction
Supporting Parallel AMR
Checkpoint/Restart/Rollback
Integrate IO/Visualization
Multi-grid Support
Programming Abstractions for Hierarchical AMR
For GrACE, three fundamental programming abstractions have been developed, using the HDDA/DAGH data structures, that can be used to express parallel adaptive computations based on adaptive mesh refinement (AMR) and multigrid techniques, as shown in the figure below. Our objectives are twofold: first, to provide application developers with a set of primitives that are intuitive for implementing the problem, i.e. application objects ↔ abstract data types, and second, to separate data-management issues and implementations from application-specific computations.

Grid Geometry Abstraction
The purpose of the grid geometry abstractions is to provide an intuitive means for identifying and addressing regions in the computational domain. These abstractions can be used to direct computations to a particular region in the domain, to mask regions that should not be included in a given operation, or to specify regions that need more resolution or refinement. The grid geometry abstractions represent coordinates, bounding boxes, and doubly linked lists of bounding boxes.
Coordinates: The coordinate abstraction represents a point in the computational domain. Operations defined on this class include indexing and arithmetic/logical manipulations. These operations are independent of the dimensionality of the domain.
Bounding Boxes: Bounding boxes represent regions in the computational domain. Each bounding box comprises a triplet: a pair of Coords defining the lower and upper bounds of the box, and a step array that defines the granularity of the discretization in each dimension. In addition to regular indexing and arithmetic operations, scaling, translation, union, and intersection operations are also defined on bounding boxes. Bounding boxes are the primary means for specifying operations and for storing internal information (such as dependency and communication information) within DAGH.
Bounding Box Lists: Lists of bounding boxes represent a collection of regions in the computational domain. Such a list is typically used to specify regions that need refinement during the regridding phase of an adaptive application. In addition to linked-list addition, deletion, and stepping operations, reduction operations such as intersection and union are also defined on a BBoxList.
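The sketch below illustrates the kind of bounding-box intersection operation described above with a tiny fixed-dimension type. It is illustrative only; GrACE's Coords/BBox/BBoxList classes are dimension-independent and considerably richer, and the names and layout here are hypothetical.

```cpp
#include <algorithm>

// A minimal 2-D bounding box: lower bound, upper bound, and per-dimension step.
struct Box2D {
    int lo[2], up[2], step[2];
};

// Intersect two boxes; returns false if the intersection is empty.
// Assumes both boxes use the same discretization granularity.
bool intersect(const Box2D& a, const Box2D& b, Box2D& out) {
    for (int d = 0; d < 2; ++d) {
        out.lo[d]   = std::max(a.lo[d], b.lo[d]);
        out.up[d]   = std::min(a.up[d], b.up[d]);
        out.step[d] = a.step[d];
        if (out.lo[d] > out.up[d]) return false;
    }
    return true;
}
```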

Operations
Grid Hierarchy Abstraction
The grid hierarchy abstraction represents the distributed dynamic adaptive grid hierarchy that underlies parallel adaptive applications based on adaptive mesh refinement. This abstraction enables a user to define, maintain, and operate on a grid hierarchy as a first-class object. Grid hierarchy attributes include geometry specifications of the domain, such as the structure of the base grid, its extents, boundary information, and coordinate information, as well as refinement information such as the nature of refinement and the refinement factor to be used. When used in a parallel/distributed environment, the grid hierarchy is partitioned and distributed across the processors and serves as a template for all application variables or grid functions. The locality-preserving composite distribution based on recursive space-filling curves is used to partition the dynamic grid hierarchy. Operations defined on the grid hierarchy include indexing of individual component grids in the hierarchy, refinement, coarsening, recomposition of the hierarchy after regridding, and querying of the structure of the hierarchy at any instant. During regridding, the repartitioning of the new grid structure, dynamic load balancing, and the data movement required to initialize newly created grids are performed automatically and transparently.
Grid Function Abstraction
Grid Functions represent application variables defined on the grid hierarchy. Each grid function is associated with a grid hierarchy and uses the hierarchy as a template to define its structure and distribution. Attributes of a grid function include type information and dependency information in terms of space and time stencil radii. In addition, the user can assign special (Fortran) routines to a grid function to handle operations such as inter-grid transfers (prolongation and restriction), initialization, boundary updates, and input/output. These functions are then called internally when operating on the distributed grid function. In addition to standard arithmetic and logical manipulations, a number of reduction operations such as Min/Max, Sum/Product, and Norms are also defined on grid functions. Grid Function objects can be locally operated on as regular Fortran 90/77 arrays.
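The sketch below shows, in heavily simplified form, how the two abstractions compose: a hierarchy acts as a distributed template, a grid function attaches an application variable and a user-supplied kernel to it, and the engine sweeps the locally owned component grids. All names here are hypothetical stand-ins rather than GrACE's actual API, and communication and regridding are omitted.

```cpp
#include <vector>

struct ComponentGrid { std::vector<double> data; int level; };

// Distributed template: the component grids owned by this process.
struct GridHierarchy { std::vector<ComponentGrid> localGrids; };

// Application variable defined on a hierarchy, with a user-supplied local
// update kernel (in GrACE this can be a Fortran routine).
struct GridFunction {
    GridHierarchy* gh;
    void (*update)(ComponentGrid&);
};

// Sweep the locally owned grids; a real engine would also schedule ghost
// and intergrid communication around this loop.
void sweep(GridFunction& gf) {
    for (ComponentGrid& g : gf.gh->localGrids)
        gf.update(g);              // each block is operated on as a plain array
}
```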
Separation of Concerns => Hierarchical Abstractions

GrACE Class Structures
......
Applications
DAGH/GrACE Applications
| Application | Description |
| --- | --- |
| H3expresso AMR (J. Masso) | Numerical relativity code that solves the full Einstein equations in 3D using a conservative finite-differencing scheme and runs in parallel with AMR |
| Black-box Multigrid Solvers (S. Klasky) | Routines that solve equations of the form L_h u_h = f_h through a "black-box" calling interface, with several multigrid flavors available |
| Boson Star Application (R. Guenther) | Solves the self-gravitating, time-dependent Schrödinger equation in three spatial dimensions, with line-multigrid solvers used to update the Schrödinger field |
| Texas Black Hole Evolution Code (TCODE) (R. Correll & S. Klasky) | Black hole evolution code with leapfrog time updates and second-order methods at the outer boundary replaced by first-order methods |
| 2D Laser-Plasma Interaction Fluid Code (D. Fisher & S. Klasky) | Models the wakefields of a very short, high-intensity laser pulse interacting with a plasma in a moving-frame coordinate system, with Lax-Wendroff time updates and 2D multigrid using a point-wise smoothing method |

Structured Adaptive Mesh Refinement (SAMR) Applications
| Application | Description |
| --- | --- |
| Scalarwave 2D/ Scalarwave 3D | Numerical relativity application (aka NumRel 2D/ NumRel 3D) solving a coupled set of partial differential equations. The equations can be divided into two classes: elliptic (Laplace-like) constraint equations that must be satisfied at each time, and coupled hyperbolic (wave-like) equations describing the time evolution. This kernel is part of the Cactus numerical relativity toolkit. http://www.cactuscode.org |
| Buckley-Leverette 2D/ Buckley-Leverette 3D | The Buckley-Leverette model is used in the Oil-Water Flow Simulation (OWFS) software system for the simulation of hydrocarbon pollution in aquifers. OWFS provides layer-by-layer modeling of the oil-water mixture in confined aquifers with regard to discharge/recharge, infiltration, interaction with surface water bodies and drainage systems, discharge into springs, and leakage between layers. This kernel is taken from the IPARS reservoir simulation toolkit developed at the Center for Subsurface Modeling at the University of Texas at Austin. http://www.ticam.utexas.edu/CSM/ACTI/ipars.html |
| RM 2D/ RM3D | 2/3-dimensional compressible turbulence application solving the Richtmyer-Meshkov instability. This application kernel is part of the Virtual Test Facility developed at the ASCI/ASAP Center at the California Institute of Technology. The Richtmyer-Meshkov instability is a fingering instability that occurs at a material interface accelerated by a shock wave and plays an important role in studies of supernovae and inertial confinement fusion. http://www.cacr.caltech.edu/ASAP |
| EnoAMR 2D | 2-dimensional computational fluid dynamics application that addresses the forward-facing step problem, describing what happens when a step instantaneously rises in a supersonic flow. The simulation exhibits several features including a bow shock, a Mach stem, a contact discontinuity, and a numerical boundary. EnoAMR 2D is also part of the Virtual Test Facility developed at the ASCI/ASAP Center at Caltech. http://www.cacr.caltech.edu/ASAP |
GrACE Visualizations
Multi-Block Grid Structure

RM3D + GrACE (R. Samtaney)

Twin Mach Reflection Refraction (TMRR)
CO2-CH4 gas interface
Incident shock Mach number, M = 2.25
Experiment by Abd-El-Fattah and Henderson (JFM 1978)

Strong Shock Richtmyer-Meshkov Instability
Air-SF6 interface with strong single harmonic perturbation
Incident shock Mach number, M = 10
4 levels of refinement

Richtmyer-Meshkov Instability
Air-SF6 interface with strong single harmonic perturbation
Incident shock Mach number, M = 1.5
2 levels of refinement
512 Processor AMR with 3 levels of refinement
Early stages of refraction (5203 zones shown above)

128 Processor AMR with 3 levels of refinement
RM3D + GrACE

AMR for Eulerian Solvers in the VTF
Objective: To have adaptive mesh refinement and coarsening capability for Eulerian fluid engine (RM3D) within the VTF (Virtual Test Facility)
CFD engine in VTF: RM3D

Example: 2D Supersonic flow over a circle
M = 2 supersonic inflow boundary
Base mesh 40*40 with 3 AMR levels
Results agree with unigrid simulation
Shock stand-off distance agrees with experiment
Example: 2D Complex Inlet (base mesh 50*50, 5 AMR levels)

Example: Shock over a 2D Complex Contour (base mesh 50*50, 4 AMR levels)


VTF Algorithmic Integration

Example: Detonation Diffraction around a Cube




Related Infrastructures (MACE & PAGH)
The MACE (Multi-block Adaptive Computational Engine) infrastructure supports multi-block grids in which multiple distributed and adaptive grid blocks with heterogeneous discretizations are coupled together through lower-dimensional (also distributed and adaptive) mortar grids. MACE is being used extensively to support subsurface modeling and oil reservoir simulations.
PAGH (Parallel Adaptive Grid Hierarchy), developed by Erik Schnetter working with ASC (Astrophysics Simulation Collaboratory) project members Evans, Goodale, and Parashar, is a GrACE-based driver routine designed and developed for the ASC. It provides an interface for memory management and I/O for the grid functions used in a simulation. On parallel machines, the driver also manages the necessary communication between the individual processing nodes.
PAGH replaces the standard Cactus driver PUGH (Parallel Uniform Grid Hierarchy) which is limited to uniform structured grids. PAGH extends the driver to distributed adaptive grid hierarchies. It also provides facilities for adaptive mesh refinement (AMR) using the Berger-Oliger AMR formulation, where the refined regions are rectangular boxes that have to be completely inside a coarser box. PAGH restricts the refined regions to have an integer refinement factor and to be aligned with respect to the coarser regions. The number of refinement levels and the number of refined regions per level are not restricted. The location and the size of the refined regions can be changed at runtime. Based on a user-defined truncation error estimator, refined regions automatically track regions of high error and adapt to the computational needs. To facilitate the truncation error estimation, PAGH can provide a shadow hierarchy, i.e. a twin to the computational domain with a lower resolution. The truncation error can be estimated by comparing the results obtained on both domains. This capability uses the support for a Shadow hierarchy provided by GrACE. PAGH parameterizes all technical aspects of the refinement procedure and allows them to be changed by the user. These aspects include certain space/time tradeoffs, and the prolongation and restriction stencils used to transport data between the refinement levels. Reasonable defaults are provided for most applications.
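The text above does not spell out the estimator; one common Richardson-type form, stated here only as an illustration and assuming a scheme of order p, compares the solution u_h on the computational hierarchy with the solution u_2h obtained on the half-resolution shadow hierarchy:

```latex
% Hypothetical illustration, not necessarily PAGH's exact formula.
\[
  \tau \;\approx\; \frac{u_{2h} - u_h}{2^{p} - 1},
  \qquad \text{refine wherever } \lVert \tau \rVert > \epsilon_{\mathrm{tol}} .
\]
```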
PAGH is suited for all 3D applications that need to selectively enhance the spatial resolution to meet their numerical requirements within the bounds given by the available computational resources.
Characterization of Distributed Adaptive Grid Hierarchies
This work is a performance characterization of dynamic partitioning and load-balancing techniques for the distributed adaptive grid hierarchies that underlie parallel adaptive mesh-refinement (AMR) techniques for the solution of partial differential equations. The primary motivation for the characterization is the development of a policy-driven tool for automated configuration and run-time management of distributed adaptive applications on dynamic and heterogeneous networked computing environments.
Dynamically adaptive methods for the solution of partial differential equations that employ locally optimal approximations can yield highly advantageous ratios for cost/accuracy when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions of high local solution error.
Distributed implementations of these adaptive methods offer the potential for accurate solution of physically realistic models of important physical systems. These implementations, however, lead to interesting challenges in dynamic resource allocation, data distribution and load balancing, communications and coordination, and resource management. The overall efficiency of the algorithms is limited by the ability to partition the underlying data structures at run time so as to expose all inherent parallelism, minimize communication/synchronization overheads, and balance load. A critical requirement while partitioning adaptive grid hierarchies is the maintenance of logical locality, both across different levels of the hierarchy under expansion and contraction of the adaptive grid structure, and within partitions of grids at all levels when they are decomposed and mapped across processors. The former enables efficient computational access to the grids while the latter minimizes the total communication and synchronization overheads. Furthermore, application adaptivity results in application grids being created, moved, and deleted on the fly, making it necessary to efficiently re-partition the hierarchy on the fly so that it continues to meet these goals.
Moving these applications to dynamic and heterogeneous computational grid environments introduces a new level of complexity. These environments aim at improving the efficiency with which resources are used by matching individual simulations to the most appropriate available resources. However, the complexity and heterogeneity of the grid environment make the selection of a "best" match between system resources, application algorithms, problem decompositions, mappings and load distributions, communication mechanisms, etc., non-trivial. System dynamics coupled with application adaptivity motivates the need for "smart" tools that can automate the configuration and management process.
Our work is therefore an application-centric characterization of distribution mechanisms for AMR applications on heterogeneous (and dynamic) cluster computing environments.
Portable Profiling and Performance Analysis using TAU, PDT and VAMPIR
Objective: Profiling an application allows the user to identify bottlenecks and obtain function-level profiling information, including for templated functions. Generating a profile shows which modules are most time-intensive. The objective is therefore to collect and analyze performance data in order to isolate functions in GrACE that can be scheduled more efficiently to improve the overall execution time.
TAU: TAU (Tuning and Analysis Utilities) is a visual performance analysis environment for parallel C++ and HPF that uses Tcl/Tk for graphics. It is currently designed to instrument parallel, multi-threaded C and C++ code. TAU collects performance data during run-time execution of the program and then provides a post-mortem analysis and display of the performance information.
TAU can show the exclusive and inclusive time spent in each function. For templated entities, it shows the breakdown of time spent in each instantiation. Other data include how many times each function was called, how many profiled functions each function invoked, and the mean inclusive time per call. It shows the mean time spent in a function over all nodes, contexts, and threads. It can also show the exclusive and inclusive times spent in a function for each invocation of every function (and the aggregated sum over all invocations).
Instead of time, it can use hardware performance counters and show, for each function, the number of instructions issued, cycles, loads, stores, floating-point operations, primary and secondary data cache misses, TLB misses, etc.
It can also calculate statistics such as the standard deviation of the exclusive time (or counts) spent in each templated function.
Instead of profiling functions, the user can profile at a finer granularity using timers: all of the above quantities can be collected for multiple user-defined timers, allowing individual statements in the code to be profiled instead of whole functions.
Instrumenting the code using PDT: To profile a function, macros must be added to the source code to identify routine transitions. This can be done automatically using the TAU C++ instrumentor tau_instrumentor, or by instrumenting the code at runtime using the Dyninst API. We used PDT (Program Database Toolkit), developed at the University of Oregon, to instrument the GrACE source code. PDT inserts macros into the source code during compilation, and the object files are then created from the instrumented source files. The architecture is illustrated in the diagram below.

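For reference, a hand-instrumented sketch using standard TAU macros is shown below; the automatic route via PDT/tau_instrumentor inserts equivalent calls. Macro names follow the TAU documentation, but exact usage can vary between TAU versions, and the routine names here are placeholders.

```cpp
#include <TAU.h>
#include <mpi.h>

void smoothLevel() {
    // Times every invocation of this routine (exclusive and inclusive time).
    TAU_PROFILE("smoothLevel", "void ()", TAU_USER);
    // ... computation ...
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    TAU_PROFILE_INIT(argc, argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    TAU_PROFILE_SET_NODE(rank);                 // one profile per node/context/thread

    // Finer-grained, user-defined timer around a code region.
    TAU_PROFILE_TIMER(t, "time-step loop", "", TAU_USER);
    TAU_PROFILE_START(t);
    smoothLevel();
    TAU_PROFILE_STOP(t);

    MPI_Finalize();
    return 0;
}
```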
Visualizing traces using VAMPIR: Typically, profiling shows the distribution of execution time across routines. It can show the code locations associated with specific bottlenecks, but it does not show the temporal aspect of performance variations. Tracing the execution of a parallel program shows when and where an event occurred, in terms of the process that executed it and the location in the source code. In addition to PROFILE files, TAU also generates TRACE files for each node, thread, and context. These TRACE files can then be converted to .pv format to be viewed using VAMPIR (Visualization and Analysis of MPI Programs).

This shows exactly when a message is sent from one node to another, along with other parallelism statistics.
References :
TAU : http://www.acl.lanl.gov/tau/
VAMPIR : http://www.pallas.com/e/products/vampir/index.htm
PDT : http://www.cs.uoregon.edu/research/paracomp/pdtoolkit/
Multithreaded Computational Engine
Objective:
The objective of this project is to develop a multithreaded communication engine to support the GrACE infrastructure for AMR applications based on adaptive grid hierarchies. The overall goal is to improve performance by overlapping computations on individual grid blocks with associated (inter-level and intra-level) communications.
Motivation:
Dynamically adaptive methods for the solution of partial differential equations that employ locally optimal approximations can yield highly advantageous ratios for cost/accuracy when compared to methods based upon static uniform approximations. These techniques seek to improve the accuracy of the solution by dynamically refining the computational grid in regions of high local solution error.
Distributed implementations of these adaptive methods offer the potential for the accurate solution of realistic models of important physical systems. These implementations however, lead to interesting challenges in dynamic resource allocation, data-distribution and load balancing, communications and coordination, and resource management. The overall efficiency of the algorithms is limited by the ability to partition the underlying data-structures at run-time to expose all inherent parallelism, minimize communication/synchronization overheads, and balance load. This motivates the need for an efficient communication engine to minimize communication and synchronization overheads.
Performance analysis of AMR applications showed that, in certain cases, up to 50% of the total execution time can be spent synchronizing ghost regions between grid blocks at different levels of the grid hierarchy, limiting the overall scalability of these applications. We concluded that, to improve performance, synchronization time has to be minimized by exploiting the parallelism inherent in the computation on blocks. This can be done by incorporating multithreading into the library.
Approach:
The multithreaded engine enables overlap between the computations and communications on grid blocks owned by a processor. As the processor cycles through its grid blocks sequentially, communication of the ghost regions of completed blocks is scheduled concurrently. Our multithreaded engine consists of two classes of threads: synchronization threads and computation threads. Synchronization threads are normally dormant and are activated only when communication is required. Computation threads are responsible for all management and computational tasks (i.e. setting up the grid hierarchy, data and storage management, and load balancing). The key motivation for such a simple model is that only one thread interacts with MPI at any time, so the implementation does not depend on a thread-safe MPI, which is not available on all platforms. The operation of the threaded engine is as follows (a minimal sketch of the coordination pattern follows the two items below):
Ghost Synchronizations: The multithreaded engine guarantees the semantics of a computational loop followed by a ghost synchronization in which all the ghost regions are updated on return from the synchronization call. When the computational loop is initiated (e.g. a GrACE forall loop), the send thread and receive thread are signaled. All communication tasks are then offloaded to these two threads. At the end of the computational loop, all block communications are guaranteed to be complete.
Redistribution: During redistribution, the grid data is communicated by the communication thread using the signaling mechanisms and mutexes available in the thread library. The computation thread signals the send and receive threads once computation has begun on the blocks. The send thread, receive thread, and computation threads synchronize using condition variables and send and receive queues.
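A minimal sketch of this coordination pattern is shown below, assuming POSIX threads: the computation thread enqueues communication tasks and signals a normally dormant send thread through a condition variable, so only that thread ever touches MPI. The types and function names are illustrative, not GrACE's implementation.

```cpp
#include <pthread.h>
#include <queue>

struct CommTask { int peer; /* ghost-region payload would go here */ };

static std::queue<CommTask> sendQueue;
static pthread_mutex_t      queueLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t       workReady = PTHREAD_COND_INITIALIZER;
static bool                 done      = false;

// Send thread: dormant until signaled; the only thread that would call MPI.
void* sendThread(void*) {
    for (;;) {
        pthread_mutex_lock(&queueLock);
        while (sendQueue.empty() && !done)
            pthread_cond_wait(&workReady, &queueLock);
        if (sendQueue.empty() && done) { pthread_mutex_unlock(&queueLock); break; }
        CommTask task = sendQueue.front();
        sendQueue.pop();
        pthread_mutex_unlock(&queueLock);
        // ... MPI_Isend of the ghost region for task.peer would go here ...
        (void)task;
    }
    return nullptr;
}

// Computation thread: after finishing a block, schedule its ghost sends and move on.
void scheduleGhostSend(int peer) {
    pthread_mutex_lock(&queueLock);
    sendQueue.push(CommTask{peer});
    pthread_cond_signal(&workReady);
    pthread_mutex_unlock(&queueLock);
}

// Tell the send thread to exit once all blocks are done.
void finishSends() {
    pthread_mutex_lock(&queueLock);
    done = true;
    pthread_cond_signal(&workReady);
    pthread_mutex_unlock(&queueLock);
}
```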

Status:
The multithreaded communication engine is being developed using the POSIX pthreads library to enable easy porting of the engine to different operating systems. Current implementation and experimentation is being done on Sun E10K machines running Sun HPC 5.0.
GrACE Class Structures
The following are the class structure listings of the important classes within GrACE, organized by functional group.
Classes
Publications
Grid Adaptive Computational Engine (GrACE)
"Hierarchical Partitioning Techniques for Structured Adaptive Mesh Refinement (SAMR) Applications," X. Li*, S. Ramanathan*, and M. Parashar, Proceedings of the 2002 International Conference on Parallel Processing (ICPP 2002), 4th Workshop on High Performance Scientific and Engineering Computing Applications (HPSECA-02), Vancouver, British Columbia, Canada, IEEE Computer Society Press, pp. 336 – 343, August 2002.
Multiblock Adaptive Computational Engine (MACE)
Autonomic/Adaptive Runtime Management for Dynamic Applications (ARMaDA)
Collaborators
Center for Subsurface Modeling, TICAM, University of Texas at Austin
DoE ASCI/ASAP Center for Simulation of Dynamic Response of Materials, California Institute of Technology
Mathematics and Computer Science Division, Argonne National Laboratory
Department of Electrical and Computer Engineering, University of Arizona
Numerical Relativity Group, Washington University at St Louis
Max-Planck Institute for Gravitational Physics, Golm, Germany
Combustion Research Facility, Sandia National Laboratories, Livermore CA
Institute for Geophysics, University of Texas at Austin
Applied Numerical Algorithms Group, Lawrence Berkeley National Laboratory
