[Multicore] Final

2025. 6. 11. 21:39·Multicore

 
OpenMP is a framework for shared memory parallel computing
 
Raytracing: graphics rendering technique by tracing the path of light from the view camera
through the 2D viewing plane, out into the 3D scene, and back to the light sources
 
GPU provides much higher insturction throughput and memory bandwidth than CPU
GPU is designed for highly parallel computations and
therefore designed such that more transistors are devoted to data processing
rather than data control and flow caching
 
 
CPU: Optimized for sequential code execution
 - sophisticated control logic
 - large cache memory
 - powerful ALU
GPU: Optimized for execution of multiple threads
 - minimize control logic and cache memory 
 - energy efficient ALU
 - cores are grouped
 
 
host code: the part of code which will be run on the CPU
device code: the part of code which will be run on the GPU
 
guided is a type of OpenMP schedule 
that is similar to the dynamic schedule, but the chunk size decreases as the program runs
 
NVCC is a proprietary compiler by Nvidia intended for use with CUDA
 
Thrust is a C++ template library for CUDA based on STL: high-level interface
 
omp_get_thread_num()
omp_set_num_threads()
 
meaning of #pragma omp sections:
Each section runs by only one thread in parallel 
 
similarity between private and firstprivte:
Each thread gets a copy of shared variable
 
difference between private and firstprivate:
private: uninitialized per-thread copy of shared variable
firstprivate: initialized per-thread copy of shared variable
 
 
CUDA로 stencil 구현 시, <<N+THREAD_NUM-1/THREAD_NUM, THREAD_NUM>> 사용하는 이유:
The vector size is an arbitrary number,
so we cannot ensure that N/THREAD_NUM blocks can cover all elements
Thus I used N+THREAD_NUM-1 to  round up and ensure full coverage
 
 
Parallel algorithms
Tp = running time on p processors
T1 (work) = running time on 1 processorss
Tinf (span) = longest time to execute algorithm on inf # of processors 
 
Lower bounds
Tp >= T1 / P
Tp >= Tinf
 
Speed up
T1 / Tp
 
Parallelism (max speed up)
T1 / Tinf
 
 
 
[Parallel Merge Sort]

 
Problem
Parallelism is T1 / Tinf is Theta(logn) and that is too small so Merge step needs to
be parallelized

 
spwan 부분은 두 스레드가 동시에 진행하는 것이기 때문에 2를 곱할 필요 없이 
더 오래걸리는 하나의 함수 기준으로 시간 복잡도 측정하면 된다
 
Par-Merge
O(3n/4) <- largest possible size of function
T(n) = O(logn) + T(3n/4) => theta(log^2(n))
 
Par-Merge-Sort
T(n) = T(n/2) + theta(log^2(n)) => theta(log^3(n))
 
Parallelism = T1 / Tinf
이때 T1은 일반적인 merge sort O(nlgn)
Tinf는 theta(log^3(n))
따라서 Parallelism은 theta (n/log^2(n))
 
 
[Parallel Quick Sort]

 

 

 
 

                                                                                                   
 
In CUDA, warp is a group of 32 threads where multiprocessors execute the same instructions at each clock cycle.
 
__global__ runs on device, callable from host code, should have void return type
__host__ runs on host, callable from host
__device__ runs on device, callable fron device
 
Local, Global: Off-chip memory
Shared, Register: On-chip memory
In fact, shared memory latency is roughly 100x lower than uncached global memory latency
 
 
Any call to a __global__ function must specify execution configuration for that call 
 
In GPU, a stream multiprocessor (SM) is basically SIMT processor that executes a warp simultaneously
 
 
 
(1) Static scheduling:
#pragma omp parallel for schedule(static, n)
→ Chunks of size n are assigned to threads in round-robin fashion before execution,
resulting in low overhead but potential load imbalance.
 
 
 
(2) Dynamic scheduling:
#pragma omp parallel for schedule(dynamic, n)
→ Chunks of size n are assigned to threads at runtime as they become available,
resulting in good load balance but high overhead.
 
 
double start_time = omp_get_wtime();
// parallel section
double end_time = omp_get_wtime();
printf("execution time = %.10lf second", end_time - start_time);
 
 
In OpenMp, values of private variables are undefined on entry and exit of parallel region
 
 
Characteristics of __global__ functions:
1) Must specify execution configuration
2) Exected on device
3) Callable from host code
4) Must return void
 
 
cudaMemcpy() parameters:
destination pointer, source pointer, size in byte, type of transfer
 
 
Prefix sum:  si = x1⊕x2⊕⋯⊕xi.  for 1 ≤ i ≤ n
 

 
__syncthreads():
__syncthreads() acts as a barrier,
ensuring that all threads have reached the same point in execution
before any can proceed.
 
 

 
Main advantages of PSRS (Parallel Sorting by Regular Sampling)
over parallel quicksort and hyper-quicksort algorithms are:
 

Better load balancing
Repeated communications of a same value are avoided
The number of processes does not have to be a power of two, which is required by the other two algorithms.
 
 
OpenMP에서 #pragma omp parallel for는 순서를 반드시 지켜야 하지만, 그 뒤에 오는 reduction, private, shared 같은 clause들은 순서를 자유롭게 바꿔도 된다.


PSRS algorithm
In phase 1, each process uses the sequential quicksort algorithm to sort its share of elements, Each process selects regular sample of sorted list
In phase 2, all samples are gathered, sorted and p-1 global pivots are selected and broadcasted
In phase 3, each process partitions its list into p pieces using pivot values and sends partitions to other processes
In phase 4, each process merges its partitions into a single sorted partition

'Multicore' 카테고리의 다른 글
  • [Multicore] Mid-term
vysryoo
vysryoo
  • vysryoo
    vysryoo
    vysryoo
  • 전체
    오늘
    어제
    • 분류 전체보기 (129)
      • Python (20)
      • Data structure (12)
      • Algorithm (14)
      • Operating system (18)
      • Programming language theory (12)
      • Computer architecture (6)
      • Softeware engineering (8)
      • Multicore (2)
      • Data Base (3)
      • Problem solving (24)
  • 블로그 메뉴

    • 홈
    • 태그
    • 방명록
  • 링크

  • 공지사항

  • 인기 글

  • 태그

  • 최근 댓글

  • 최근 글

  • hELLO· Designed By정상우.v4.10.3
vysryoo
[Multicore] Final
상단으로

티스토리툴바