Parallel Computing Lesson
SIMD vs SPMD
SIMD: Single Instruction, Multiple Data
- Same instruction, applied to different data.
- What GPUs do.

SPMD: Single Program, Multiple Data
- What cores do. 
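Roughly: SIMD is one instruction (say, an "add") applied to a whole vector of data elements in lockstep; SPMD is launching the same program on every core, with each copy working on its own chunk of the data.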
Data Centers
Limitations
Why GPUs Can Suck (three reasons, listed at the end of these notes): operations required, branch divergence, language barrier.
Shared Memory
Everyone can access all of the memory: every processor on the board can read and write all of it.
Distributed Memory: multiple boards each have their own memory, so if they need the same information, the processors pass messages back and forth.
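Message passing between separate memories is usually done with a library such as MPI (MPI isn't named in these notes; this is just an illustrative sketch in C). Rank 0 sends a value to rank 1:

#include <mpi.h>
#include <stdio.h>

/* Illustrative sketch: two processes, each with its own memory, share a value
   by passing a message. Run with something like: mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* send to rank 1 */
    } else if (rank == 1) {
        double x;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got %f from rank 0\n", x);
    }

    MPI_Finalize();
    return 0;
}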
A Single Board

Two Types of Parallelism Here
1. Processors (a 4-core CPU)
Core
A core is basically a processor (the chip has four processing units, i.e., four cores):
- When you write an ordinary program, you are basically using one of the cores.
2. GPU
Has around 100 cores, but they are all really simple.
- Can do ~100 things at once when using a GPU.

Ex. The Kepler GPU from NVIDIA can do about 1.4 teraflops (trillion floating-point operations per second).
CPU vs. GPUs
CPU
One CPU has four cores, i.e., four processors.

Four cores: each grabs a program and starts running it. When a core is done with something, it goes and grabs the next program.

Each core:
- Grabs a program
- Executes the program (the code)
- Moves on to the next one
Embarrassingly Parallel
A problem is embarrassingly parallel when it splits into completely independent tasks that need no communication with each other.
Challenges in Parallel Computing
1) If processes need to communicate, then everything can stall because of a delay in one process.
- Dependencies are bad.
- Without dependencies, speedup vs. number of cores looks like the line y = x; with dependencies, the curve flattens out toward a horizontal asymptote.
Why Cores
If you have p cores, we want to be done in 1/p of the original time.
Speed Up
Suppose:
T1 = time if one processor did the work
Tp = time if p processors do the work

Speedup = T1 / Tp

If Tp = (1/p) T1, then ideally speedup = p.
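For example: if one processor takes 100 s and 4 processors take 30 s, the speedup is 100/30 ≈ 3.3, a bit short of the ideal 4.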
Amdahl's Law
Suppose 10% of the algorithm can't be made parallel. Then:

Tp = T1 (0.1 + 0.9/p)

This converges to 0.1 T1:
- even with infinite cores (p -> infinity), Tp approaches 0.1 T1.
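Quick worked example: with 10% serial work and p = 8, Tp = T1(0.1 + 0.9/8) ≈ 0.21 T1, so the speedup is about 4.7x instead of the ideal 8x; even as p -> infinity, the speedup can never exceed 1/0.1 = 10x.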
In the Code
Parallelize do loops / for loops whose iterations do NOT depend on each other.

For example, matrix multiplication: each element of the result is computed independently (see the sketch below).
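A minimal C sketch of that kind of loop (illustrative only, not code from the lesson): every (i, j) entry of the product reads only the inputs, never another entry of the result, so the iterations are independent of each other.

#include <stdio.h>

#define N 3

/* c = a * b: each c[i][j] depends only on row i of a and column j of b,
   so the (i, j) iterations can all run in parallel. */
void matmul(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}

int main(void)
{
    double a[N][N] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};   /* identity matrix */
    double b[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double c[N][N];

    matmul(a, b, c);
    printf("c[1][2] = %g\n", c[1][2]);   /* identity * b, so this prints 6 */
    return 0;
}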
OpenMP
Works with Fortran or C/C++.

It is a set of compiler directives.
In Fortran, you will have directives like:

!$OMP PARALLEL DO
do i = 1, n
   c(i) = a(i) + b(i)   ! example loop body
end do
!$OMP END PARALLEL DO

Here you are telling OpenMP to run the iterations of the loop in parallel.
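In C/C++ the same idea is written with a pragma; a minimal sketch (the loop body is just an example):

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* set up some input data */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Ask OpenMP to split the iterations of the next loop across threads.
       If the compiler doesn't know OpenMP, the pragma is simply ignored. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %g\n", c[10]);   /* 10 + 20 = 30 */
    return 0;
}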
Other compilers
Even if your compiler doesn't support OpenMP, it just ignores the directive lines (the program still runs, serially). If your compiler does recognize OpenMP, you get parallel execution.

GCC (a free compiler) supports it.
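That typically means compiling with GCC's -fopenmp flag, e.g. gcc -fopenmp myloop.c -o myloop (the file name here is just a placeholder); without the flag, the directives are ignored and the program runs serially.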
When to Use Parallel Computing
1) Parallelize at the BIGGEST chunk possible
- There is a little overhead associated with it, so the bigger the chunk, the better the gains.
- Bigger chunks also even things out more if one processor is slower or faster than another.

2) OpenMP does not check or validate your answer. WE have to make sure that everything is right. These are hard to debug!! (See the race-condition sketch below.)
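A classic example of point 2 (a hedged sketch, not from the lesson): a shared sum updated inside a parallel loop compiles and runs, but races and gives a silently wrong answer; declaring a reduction fixes it.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    /* WRONG: all threads update bad at the same time -> data race,
       and OpenMP will not warn you that the result is garbage. */
    double bad = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        bad += x[i];

    /* RIGHT: each thread keeps a private partial sum that OpenMP combines. */
    double good = 0.0;
    #pragma omp parallel for reduction(+:good)
    for (int i = 0; i < N; i++)
        good += x[i];

    printf("racy sum = %g, correct sum = %g (expected %d)\n", bad, good, N);
    return 0;
}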
Why GPUs Can Suck (the three limitations), counting down:
3. Language Barrier
You have to program the GPU in CUDA.
2. Branch Divergence
If there are a lot of if-then statements, the GPU ends up walking through every single possibility.
  • Threads whose condition is false sit waiting while the threads whose condition is true execute, and vice versa, which takes time.
  • So if you have code with lots of branching, the GPU is not going to help.
1. Operations Required
You need to use each piece of data for lots of operations for it to be worth it: you can only win if, every time you bring a value in, you do at least 40 operations on it, to even get up to the 1.4 Tflops of processing (otherwise the traffic in and out gets crowded).
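A rough back-of-the-envelope argument for where a number like "40 operations per value" comes from (not from the lecture): achievable flops ≈ min(peak flops, operations-per-value × values the memory system can deliver per second). Memory delivers values far more slowly than the GPU can compute on them, so you need tens of operations per value brought in before compute, rather than memory traffic, becomes the limit.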
Implementations