### Outline

- Introduction
  - » Space-Time Simulation
- Time Parallel Simulation
- Fix-up Computations
- Example: Parallel Cache Simulation

### Simulation & Modeling

### **Time Parallel Simulations**

Problem-Specific Approach to Create Massively Parallel



Maria Hybinette, UGA

# **Space-Time Framework**

A simulation computation can be viewed as computing the state of the physical processes in the system being modeled over simulated time.



1. Partition space-time region into non-overlapping regions
2. Assign each region to a logical process (LP)
3. Each LP computes state of physical system for its region, using inputs from other regions and producing new outputs to those regions
4. Repeat step 3 until a fixed point is reached



### **Time Parallel Simulation**

Observation: The simulation computation is a sample path through the set of possible states across simulated time.

[Figure: the simulated time axis divided into five intervals, assigned to processors 1 through 5]

Basic idea: Divide the simulated time axis into non-overlapping intervals. Each processor computes the sample path of the interval assigned to it.

Key question: What is the initial state of each interval (processor)?

### **Time Parallel Simulation: Relaxation Approach**

1. Guess initial state of each interval (processor)
2. Each processor computes sample path of its interval
3. Using final state of previous interval as initial state, "fix up" sample path
4. Repeat step 3 until a fixed point is reached



Benefit: Massively parallel execution (LPs are independent -- no synchronization required between them)

Liabilities: cost of "fix up" computation, convergence may be slow (worst case, N iterations for N processors), state may be complex
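The relaxation steps can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the names `run_interval`, `relax`, and the toy `clamp` model (a counter saturating at 0 and 3) are hypothetical, the "processors" are emulated by a loop, and the fix-up here simply recomputes a whole interval rather than stopping as soon as the states match.

```python
from typing import Callable, List, Sequence, Tuple


def run_interval(step: Callable[[int, int], int], state: int, inputs: Sequence[int]) -> int:
    """Compute the sample path of one interval and return its final state."""
    for x in inputs:
        state = step(state, x)
    return state


def relax(step: Callable[[int, int], int], intervals: List[Sequence[int]],
          start: int, guess: int) -> Tuple[List[int], int]:
    """Return (final state of every interval, number of rounds used)."""
    n = len(intervals)
    initial = [start] + [guess] * (n - 1)                  # step 1: guess initial states
    final = [run_interval(step, initial[p], intervals[p])  # step 2: independent per interval
             for p in range(n)]
    rounds = 1
    while True:
        # Step 3: each interval should start from its predecessor's final state.
        wanted = [start] + final[:-1]
        if wanted == initial:                              # step 4: fixed point reached
            return final, rounds
        final = [final[p] if wanted[p] == initial[p]
                 else run_interval(step, wanted[p], intervals[p])
                 for p in range(n)]
        initial = wanted
        rounds += 1


def clamp(q: int, d: int) -> int:
    """Toy state transition (an assumption for illustration): counter clamped to [0, 3]."""
    return min(3, max(0, q + d))


intervals = [[1, 1, -1], [-1, -1, -1, 1], [1, 1, 1, 1]]
print(relax(clamp, intervals, start=0, guess=0))  # -> ([1, 1, 3], 2)
```

Interval p is guaranteed correct after p rounds, which is where the worst case of N iterations for N processors comes from; the toy model converges in two rounds because the clamp quickly erases the effect of a wrong guess.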

## **Example: Cache Memory**

- Cache holds subset of entire memory
  - » Memory organized as blocks
  - » Hit: referenced block in cache
  - » Miss: referenced block not in cache
  - » Cache has multiple sets, where each set holds some number of blocks (e.g., 4); here, focus on cache references to a single set
- Replacement policy: determines which block (of a set) to delete to make room for a new block on a cache miss
  - » LRU: delete least recently used block (of set) from cache
- Implementation: Least Recently Used (LRU) stack
  - » Stack contains addresses of memory blocks (block numbers)
  - » For each memory reference in input (memory reference trace)
    - if referenced address is in stack (hit), move it to top of stack
    - if not in stack (miss), place address on top of stack, deleting address at bottom if the stack is full
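The LRU stack rules above translate almost directly into code. The sketch below is illustrative only: `simulate_lru` is a hypothetical helper name, it models a single set, and a plain Python list stands in for the stack with index 0 as the top.

```python
from typing import List, Optional, Tuple


def simulate_lru(trace: List[int], stack_size: int = 4,
                 initial: Optional[List[int]] = None) -> Tuple[int, int, List[int]]:
    """Simulate one cache set; return (hits, misses, final stack, top first)."""
    stack = list(initial) if initial else []
    hits = misses = 0
    for block in trace:
        if block in stack:
            hits += 1
            stack.remove(block)        # hit: take the block out of its old position
        else:
            misses += 1
            if len(stack) == stack_size:
                stack.pop()            # miss on a full set: delete the bottom (LRU) entry
        stack.insert(0, block)         # in both cases the block ends up on top
    return hits, misses, stack


print(simulate_lru([1, 2, 1, 3, 4, 3, 6, 7]))  # -> (2, 6, [7, 6, 3, 4])
```

The optional `initial` argument is there so the same routine can be restarted from a non-empty stack, which is what a time parallel fix-up pass needs.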


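### **Example: Trace-Driven Cache Simulation**

Given a sequence of references to blocks in memory, determine the number of hits and misses using LRU replacement.

First iteration: assume stack is initially empty:

| address            | 1 | 2 | 1 | 3 | 4 | 3 | 6 | 7 | 2 | 1 | 2 | 6 | 9 | 3 | 3 | 6 | 4 | 2 | 3 | 1 | 7 | 2 | 7 | 4 |
|--------------------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRU stack (top)    | 1 | 2 | 1 | 3 | 4 | 3 | 6 | 7 | 2 | 1 | 2 | 6 | 9 | 3 | 3 | 6 | 4 | 2 | 3 | 1 | 7 | 2 | 7 | 4 |
|                    | - | 1 | 2 | 1 | 3 | 4 | 3 | 6 | - | 2 | 1 | 2 | 6 | 9 | 9 | 3 | - | 4 | 2 | 3 | 1 | 7 | 2 | 7 |
|                    | - | - | - | 2 | 1 | 1 | 4 | 3 | - | - | - | 1 | 2 | 6 | 6 | 9 | - | - | 4 | 2 | 3 | 1 | 1 | 2 |
| LRU stack (bottom) | - | - | - | - | 2 | 2 | 1 | 4 | - | - | - | - | 1 | 2 | 2 | 2 | - | - | - | 4 | 2 | 3 | 3 | 1 |
| processor          | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |

Second iteration: processor i uses final state of processor i-1 as initial state:

| address            | 1 | 2 | 1 | 3 | 4 | 3 | 6 | 7 | 2 | 1 | 2 | 6 | 9 | 3 | 3 | 6 | 4 | 2 | 3 | 1 | 7 | 2 | 7 | 4 |
|--------------------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRU stack (top)    | (idle) |   |   |   |   |   |   |   | 2 | 1 | 2 | 6 | 9 |   |   |   | 4 | 2 | 3 | 1 |   |   |   |   |
|                    |   |   |   |   |   |   |   |   | 7 | 2 | 1 | 2 | 6 |   |   |   | 6 | 4 | 2 | 3 |   |   |   |   |
|                    |   |   |   |   |   |   |   |   | 6 | 7 | 7 | 1 | 2 |   |   |   | 3 | 6 | 4 | 2 |   |   |   |   |
| LRU stack (bottom) |   |   |   |   |   |   |   |   | 3 | 6 | 6 | 7 | 1 |   |   |   | 9 | 3 | 6 | 4 |   |   |   |   |
|                    |   |   |   |   |   |   |   |   |   |   |   |   | match! |   |   |   |   |   |   | match! |   |   |   |   |
| processor          | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |

Once a processor's stack matches the stack it had at the same reference in the first iteration, the rest of its first-iteration sample path is already correct. Done!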

# **Parallel Cache Simulation**

- Time parallel simulation works well because final state of cache for a time segment usually does not depend on the initial state of the cache at the start of the time segment
- LRU: once references have been made to four different blocks (for a set size of four), the state of the LRU stack is independent of its initial state; blocks referenced before that point are no longer retained in the stack
- If one assumes an empty cache at the start of each time segment, the first round simulation yields an upper bound on the number of misses during the entire simulation
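The two-iteration cache example and the upper-bound remark can be checked with a short program. It is a sketch under stated assumptions: the helper names (`lru_step`, `run_segment`, `time_parallel_lru`) are hypothetical, the trace length is assumed to divide evenly among the processors, and the per-processor work is done in an ordinary loop rather than on real parallel hardware.

```python
from typing import List, Tuple

Stack = Tuple[int, ...]  # index 0 is the top of the LRU stack


def lru_step(stack: Stack, block: int, size: int) -> Tuple[Stack, bool]:
    """Apply one reference; return (new stack, was it a hit)."""
    hit = block in stack
    rest = tuple(b for b in stack if b != block)
    return ((block,) + rest)[:size], hit   # truncation drops the bottom entry on a miss


def run_segment(segment: List[int], start: Stack, size: int) -> Tuple[List[Stack], List[bool]]:
    """Return the stack after every reference and the hit flag of every reference."""
    states, hits, stack = [], [], start
    for block in segment:
        stack, hit = lru_step(stack, block, size)
        states.append(stack)
        hits.append(hit)
    return states, hits


def time_parallel_lru(trace: List[int], n_proc: int, size: int = 4) -> Tuple[int, int, int]:
    """Return (first-round misses, corrected misses, number of rounds)."""
    seg_len = len(trace) // n_proc         # assumes an even split
    segs = [trace[p * seg_len:(p + 1) * seg_len] for p in range(n_proc)]
    starts: List[Stack] = [()] * n_proc    # round 1: every processor guesses an empty stack
    results = [run_segment(seg, (), size) for seg in segs]
    states = [r[0] for r in results]
    hits = [r[1] for r in results]
    first_round = sum(h.count(False) for h in hits)
    rounds = 1
    while True:
        new_starts = [()] + [s[-1] for s in states[:-1]]
        if new_starts == starts:           # fixed point: nothing left to fix up
            break
        rounds += 1
        for p in range(1, n_proc):
            if new_starts[p] == starts[p]:
                continue
            stack = new_starts[p]
            for j, block in enumerate(segs[p]):
                stack, hit = lru_step(stack, block, size)
                hits[p][j] = hit
                if stack == states[p][j]:
                    break                  # match: the rest of the segment is already correct
                states[p][j] = stack
        starts = new_starts
    return first_round, sum(h.count(False) for h in hits), rounds


TRACE = [1, 2, 1, 3, 4, 3, 6, 7, 2, 1, 2, 6, 9, 3, 3, 6, 4, 2, 3, 1, 7, 2, 7, 4]
print(time_parallel_lru(TRACE, n_proc=3))  # -> (17, 15, 2)
```

For this trace the empty-cache first round reports 17 misses, and the fix-up round lowers that to 15, the same count a sequential run over all 24 references produces, so the first-round figure is indeed an upper bound here.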

# **State Matching Problem Approaches**

- Fix-up computations
  - » Guess initial state and compute based on guess, then redo computations as needed
  - » Example: LRU cache simulations
- Precomputation of state at specific time division points
  - » Select time division points at places where the state of the system can be easily determined
  - » Example: ATM multiplexer
- Parallel prefix computation
  - » Example: G/G/1 queue (see textbook)


# ATM Networks

- Telecommunication technology to support integration of a wide variety of communication services
  - » voice, data, video and faxes
- Provides high bandwidth and reliable communication services
- ATM atomic units: ATM messages are divided into fixed-size cells

# **Example: ATM Multiplexer**





- Cell: fixed-size data packet (53 bytes)
- N sources of traffic: bursty, on/off sources (e.g., voice telephone)
  - » Stream of cells arrives if on
  - » 0 or 1 cell arrives on each input each time unit (cell time)
- Output link: capacity C cells per time unit
- Fixed-capacity FIFO queue: K cells
  - » Queue overflow results in dropped cells
  - » Estimate loss probability as function of queue size (design goal: drop 1 in 10<sup>9</sup>)
  - » Low loss probability (10<sup>-3</sup>) leads to long simulation runs!




### **Problem Statement**

- Multiplexer with N input links of unit capacity
- Output link with capacity C (output burst)
- FIFO queue with K buffers
- Determine average utilization and number of dropped cells

### Example



# **Simulation Algorithm**



## **Parallel Simulation Algorithm**

- Generate tuples: can be performed in parallel
- $Q_{i+1}$ depends on $Q_i$; appears sequential
- Observation:
  - » Some tuples are guaranteed to produce an overflow or an empty queue, independent of all other tuples and of $Q_i$ at the start of the tuple
  - » $Q_{i+1}$ is known for such tuples, independent of $Q_i$
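To make the dependence concrete, one common formulation can be written down. It is an assumption here, since the simulation algorithm itself is not reproduced in this text: take $A_i$ as the number of sources that are on during the $i$-th interval, $\delta_i$ as the length of that interval in cell times, and let the queue length be clamped between empty and full:

```latex
Q_{i+1} = \min\bigl(K,\ \max\bigl(0,\ Q_i + (A_i - C)\,\delta_i\bigr)\bigr),
\qquad 0 \le Q_i \le K
```

Under this assumed recurrence, a tuple whose net change $(A_i - C)\,\delta_i$ has magnitude at least $K$ saturates the clamp for every admissible $Q_i$, which is why $Q_{i+1}$ can be known without knowing $Q_i$.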



# Guaranteed Underflow / Overflow

- A tuple $\langle A_i, \delta_i \rangle$ is guaranteed to cause overflow
  - » if $(A_i - C)\,\delta_i \ge K$
  - » $Q_{i+1} = K$ for guaranteed overflow tuples
- A tuple $\langle A_i, \delta_i \rangle$ is guaranteed to cause underflow
  - » if $(C - A_i)\,\delta_i \ge K$
  - » $Q_{i+1} = 0$ for guaranteed underflow tuples

The simulation time line can be partitioned at guaranteed overflow/underflow tuples to create a time parallel execution. No fix-up computation is required.
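A small sketch shows how the overflow and underflow tests yield time division points. Several things in it are assumptions rather than material from the slides: the clamped queue update in `next_queue`, the reading of each tuple as (number of active sources, duration in cell times), an empty queue at time zero, and all function names. The segments are simulated in a plain loop, although each call is independent and could run on its own processor.

```python
from typing import List, Optional, Tuple

Burst = Tuple[int, int]  # (A_i, delta_i): active sources, duration in cell times


def next_queue(q: int, a: int, delta: int, c: int, k: int) -> int:
    """Assumed queue update: net inflow clamped to the range [0, K]."""
    return min(k, max(0, q + (a - c) * delta))


def known_next_queue(a: int, delta: int, c: int, k: int) -> Optional[int]:
    """Return Q_{i+1} if the tuple is a guaranteed overflow/underflow, else None."""
    if (a - c) * delta >= k:
        return k      # guaranteed overflow: queue ends full
    if (c - a) * delta >= k:
        return 0      # guaranteed underflow: queue ends empty
    return None


def partition(tuples: List[Burst], c: int, k: int) -> List[Tuple[int, List[Burst]]]:
    """Cut the time line after every guaranteed tuple; each segment knows its start state."""
    segments, start_q, current = [], 0, []   # assumes an empty queue at time 0
    for a, delta in tuples:
        current.append((a, delta))
        known = known_next_queue(a, delta, c, k)
        if known is not None:
            segments.append((start_q, current))
            start_q, current = known, []
    if current:
        segments.append((start_q, current))
    return segments


def simulate(q: int, tuples: List[Burst], c: int, k: int) -> List[int]:
    """Return the queue length after each tuple, starting from queue length q."""
    lengths = []
    for a, delta in tuples:
        q = next_queue(q, a, delta, c, k)
        lengths.append(q)
    return lengths


def time_parallel(tuples: List[Burst], c: int, k: int) -> List[int]:
    """Simulate every segment independently and concatenate the results."""
    return [q for start, seg in partition(tuples, c, k) for q in simulate(start, seg, c, k)]


bursts = [(3, 4), (1, 2), (5, 4), (2, 3), (0, 6), (4, 1)]
print(time_parallel(bursts, c=2, k=10))  # -> [4, 2, 10, 10, 0, 2]
```

With C = 2 and K = 10, the tuples (5, 4) and (0, 6) are guaranteed overflow and underflow respectively, giving three segments whose combined result equals a sequential run from an empty queue.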


## **Time Parallel Algorithm**

- Algorithm
  - » Generate tuples $\langle A_i, \delta_i \rangle$ in parallel
  - » Identify guaranteed overflow and underflow tuples to determine time division points
  - » Map tuples between time division points to different processors, simulate in parallel

## **Summary of Time Parallel Algorithms**

- The space-time abstraction provides another view of parallel simulation
- Time Parallel Simulation
  - » Potential for massively parallel computations
  - » Central issue is determining the initial state of each time segment
- Applications: simulation of LRU caches is well suited to time parallel simulation techniques
- Advantages:
  - » allows for massive parallelism
  - » often, little or no synchronization is required after spawning the parallel computations
  - » substantial speedups obtained for certain problems: queueing networks, caches, ATM multiplexers
- Liabilities:
  - » Only applicable to a very limited set of problems
