OpenMP is a framework for shared memory parallel computing
Raytracing: graphics rendering technique by tracing the path of light from the view camera
through the 2D viewing plane, out into the 3D scene, and back to the light sources
GPU provides much higher insturction throughput and memory bandwidth than CPU
GPU is designed for highly parallel computations and
therefore designed such that more transistors are devoted to data processing
rather than data control and flow caching
CPU: Optimized for sequential code execution
- sophisticated control logic
- large cache memory
- powerful ALU
GPU: Optimized for execution of multiple threads
- minimize control logic and cache memory
- energy efficient ALU
- cores are grouped
host code: the part of code which will be run on the CPU
device code: the part of code which will be run on the GPU
guided is a type of OpenMP schedule
that is similar to the dynamic schedule, but the chunk size decreases as the program runs
NVCC is a proprietary compiler by Nvidia intended for use with CUDA
Thrust is a C++ template library for CUDA based on STL: high-level interface
omp_get_thread_num()
omp_set_num_threads()
meaning of #pragma omp sections:
Each section runs by only one thread in parallel
similarity between private and firstprivte:
Each thread gets a copy of shared variable
difference between private and firstprivate:
private: uninitialized per-thread copy of shared variable
firstprivate: initialized per-thread copy of shared variable
CUDA로 stencil 구현 시, <<N+THREAD_NUM-1/THREAD_NUM, THREAD_NUM>> 사용하는 이유:
The vector size is an arbitrary number,
so we cannot ensure that N/THREAD_NUM blocks can cover all elements
Thus I used N+THREAD_NUM-1 to round up and ensure full coverage
Parallel algorithms
Tp = running time on p processors
T1 (work) = running time on 1 processorss
Tinf (span) = longest time to execute algorithm on inf # of processors
Lower bounds
Tp >= T1 / P
Tp >= Tinf
Speed up
T1 / Tp
Parallelism (max speed up)
T1 / Tinf
[Parallel Merge Sort]

Problem
Parallelism is T1 / Tinf is Theta(logn) and that is too small so Merge step needs to
be parallelized

spwan 부분은 두 스레드가 동시에 진행하는 것이기 때문에 2를 곱할 필요 없이
더 오래걸리는 하나의 함수 기준으로 시간 복잡도 측정하면 된다
Par-Merge
O(3n/4) <- largest possible size of function
T(n) = O(logn) + T(3n/4) => theta(log^2(n))
Par-Merge-Sort
T(n) = T(n/2) + theta(log^2(n)) => theta(log^3(n))
Parallelism = T1 / Tinf
이때 T1은 일반적인 merge sort O(nlgn)
Tinf는 theta(log^3(n))
따라서 Parallelism은 theta (n/log^2(n))
[Parallel Quick Sort]




In CUDA, warp is a group of 32 threads where multiprocessors execute the same instructions at each clock cycle.
__global__ runs on device, callable from host code, should have void return type
__host__ runs on host, callable from host
__device__ runs on device, callable fron device
Local, Global: Off-chip memory
Shared, Register: On-chip memory
In fact, shared memory latency is roughly 100x lower than uncached global memory latency
Any call to a __global__ function must specify execution configuration for that call
In GPU, a stream multiprocessor (SM) is basically SIMT processor that executes a warp simultaneously
(1) Static scheduling:
#pragma omp parallel for schedule(static, n)
→ Chunks of size n are assigned to threads in round-robin fashion before execution,
resulting in low overhead but potential load imbalance.
(2) Dynamic scheduling:
#pragma omp parallel for schedule(dynamic, n)
→ Chunks of size n are assigned to threads at runtime as they become available,
resulting in good load balance but high overhead.
double start_time = omp_get_wtime();
// parallel section
double end_time = omp_get_wtime();
printf("execution time = %.10lf second", end_time - start_time);
In OpenMp, values of private variables are undefined on entry and exit of parallel region
Characteristics of __global__ functions:
1) Must specify execution configuration
2) Exected on device
3) Callable from host code
4) Must return void
cudaMemcpy() parameters:
destination pointer, source pointer, size in byte, type of transfer
Prefix sum: si = x1⊕x2⊕⋯⊕xi. for 1 ≤ i ≤ n
__syncthreads():
__syncthreads() acts as a barrier,
ensuring that all threads have reached the same point in execution
before any can proceed.
Main advantages of PSRS (Parallel Sorting by Regular Sampling)
over parallel quicksort and hyper-quicksort algorithms are:
Better load balancing
Repeated communications of a same value are avoided
The number of processes does not have to be a power of two, which is required by the other two algorithms.
OpenMP에서 #pragma omp parallel for는 순서를 반드시 지켜야 하지만, 그 뒤에 오는 reduction, private, shared 같은 clause들은 순서를 자유롭게 바꿔도 된다.
PSRS algorithm
In phase 1, each process uses the sequential quicksort algorithm to sort its share of elements, Each process selects regular sample of sorted list
In phase 2, all samples are gathered, sorted and p-1 global pivots are selected and broadcasted
In phase 3, each process partitions its list into p pieces using pivot values and sends partitions to other processes
In phase 4, each process merges its partitions into a single sorted partition