Parallel Computing Lesson
SIMD vs SPMD
SIMD: Single Instruction, Multiple Data
- Same instruction, applied to different data.
- What GPUs do.

SPMD: Single Program, Multiple Data
- What cores do. 
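Roughly: SIMD is one instruction (say, an "add") applied to a whole vector of data elements in lockstep; SPMD is launching the same program on every core, with each copy working on its own chunk of the data.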
Data Centers
Limitations
Why GPUs Can Suck (three reasons, listed at the end of these notes): operations required, branch divergence, language barrier.
Shared Memory
Everyone can access all of the memory: every processor on the board can read and write all of it.
Distributed Memory: multiple boards each have their own memory, so if they need the same information, the processors pass messages back and forth.
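Message passing between separate memories is usually done with a library such as MPI (MPI isn't named in these notes; this is just an illustrative sketch in C). Rank 0 sends a value to rank 1:

#include <mpi.h>
#include <stdio.h>

/* Illustrative sketch: two processes, each with its own memory, share a value
   by passing a message. Run with something like: mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* send to rank 1 */
    } else if (rank == 1) {
        double x;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got %f from rank 0\n", x);
    }

    MPI_Finalize();
    return 0;
}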
A Single Board

Two Types of Parallelism Here
1. Processors (a 4-core CPU)
Core
A core is basically a processor (the chip has four processing units, i.e., four cores):
- When you write an ordinary program, you are basically using one of the cores.
2. GPU
Has around 100 cores, but they are all really simple.
- Can do ~100 things at once when using a GPU.

Ex. The Kepler GPU from NVIDIA can do about 1.4 teraflops (trillion floating-point operations per second).
CPU vs. GPUs
CPU
One CPU has four cores, i.e., four processors.

Four cores: each grabs a program and starts running it. When a core is done with something, it goes and grabs the next program.

Each core:
- Grabs a program
- Executes the program (the code)
- Moves on to the next one
Embarrassingly Parallel
A problem is embarrassingly parallel when it splits into completely independent tasks that need no communication with each other.
Challenges in Parallel Computing
1) If processes need to communicate, then everything can stall because of a delay in one process.
- Dependencies are bad.
- Without dependencies, speedup vs. number of cores looks like the line y = x; with dependencies, the curve flattens out toward a horizontal asymptote.
Why Cores
If you have p cores, we want to be done in 1/p of the original time.
Speed Up
Suppose:
T1 = time if one processor did the work
Tp = time if p processors do the work

Speedup = T1 / Tp

If Tp = (1/p) T1, then ideally speedup = p.
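For example: if one processor takes 100 s and 4 processors take 30 s, the speedup is 100/30 ≈ 3.3, a bit short of the ideal 4.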
Amdahl's Law
Suppose 10% of the algorithm can't be made parallel. Then:

Tp = T1 (0.1 + 0.9/p)

This converges to 0.1 T1:
- even with infinite cores (p -> infinity), Tp approaches 0.1 T1.
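Quick worked example: with 10% serial work and p = 8, Tp = T1(0.1 + 0.9/8) ≈ 0.21 T1, so the speedup is about 4.7x instead of the ideal 8x; even as p -> infinity, the speedup can never exceed 1/0.1 = 10x.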
In the Code
Parallelize do loops / for loops whose iterations do NOT depend on each other.

For example, matrix multiplication: each element of the result is computed independently (see the sketch below).
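A minimal C sketch of that kind of loop (illustrative only, not code from the lesson): every (i, j) entry of the product reads only the inputs, never another entry of the result, so the iterations are independent of each other.

#include <stdio.h>

#define N 3

/* c = a * b: each c[i][j] depends only on row i of a and column j of b,
   so the (i, j) iterations can all run in parallel. */
void matmul(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}

int main(void)
{
    double a[N][N] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};   /* identity matrix */
    double b[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double c[N][N];

    matmul(a, b, c);
    printf("c[1][2] = %g\n", c[1][2]);   /* identity * b, so this prints 6 */
    return 0;
}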
OpenMP
Works with Fortran or C/C++.

It is a set of compiler directives.
In Fortran, you will have directives like:

!$OMP PARALLEL DO
do i = 1, n
   c(i) = a(i) + b(i)   ! example loop body
end do
!$OMP END PARALLEL DO

Here you are telling OpenMP to run the iterations of the loop in parallel.
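In C/C++ the same idea is written with a pragma; a minimal sketch (the loop body is just an example):

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* set up some input data */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Ask OpenMP to split the iterations of the next loop across threads.
       If the compiler doesn't know OpenMP, the pragma is simply ignored. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %g\n", c[10]);   /* 10 + 20 = 30 */
    return 0;
}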
Other compilers
Even if your compiler doesn't support OpenMP, it just ignores the directive lines (the program still runs, serially). If your compiler does recognize OpenMP, you get parallel execution.

GCC (a free compiler) supports it.
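That typically means compiling with GCC's -fopenmp flag, e.g. gcc -fopenmp myloop.c -o myloop (the file name here is just a placeholder); without the flag, the directives are ignored and the program runs serially.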
When to Use Parallel Computing
1) Parallelize at the BIGGEST chunk possible
- There is a little overhead associated with it, so the bigger the chunk, the better the gains.
- Bigger chunks also even things out more if one processor is slower or faster than another.

2) OpenMP does not check or validate your answer. WE have to make sure that everything is right. These are hard to debug!! (See the race-condition sketch below.)
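A classic example of point 2 (a hedged sketch, not from the lesson): a shared sum updated inside a parallel loop compiles and runs, but races and gives a silently wrong answer; declaring a reduction fixes it.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    /* WRONG: all threads update bad at the same time -> data race,
       and OpenMP will not warn you that the result is garbage. */
    double bad = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        bad += x[i];

    /* RIGHT: each thread keeps a private partial sum that OpenMP combines. */
    double good = 0.0;
    #pragma omp parallel for reduction(+:good)
    for (int i = 0; i < N; i++)
        good += x[i];

    printf("racy sum = %g, correct sum = %g (expected %d)\n", bad, good, N);
    return 0;
}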
Why GPUs Can Suck (the three limitations), counting down:
3. Language Barrier
You have to program the GPU in CUDA.
2. Branch Divergence
If there are a lot of if-then statements, the GPU ends up walking through every single possibility.
  • Threads whose condition is false sit waiting while the threads whose condition is true execute, and vice versa, which takes time.
  • So if you have code with lots of branching, the GPU is not going to help.
1. Operations Required
You need to use each piece of data for lots of operations for it to be worth it: you can only win if, every time you bring a value in, you do at least 40 operations on it, to even get up to the 1.4 Tflops of processing (otherwise the traffic in and out gets crowded).
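A rough back-of-the-envelope argument for where a number like "40 operations per value" comes from (not from the lecture): achievable flops ≈ min(peak flops, operations-per-value × values the memory system can deliver per second). Memory delivers values far more slowly than the GPU can compute on them, so you need tens of operations per value brought in before compute, rather than memory traffic, becomes the limit.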
Implementations