Patterns

Multiple patterns may be used in combination.

How to find concurrency?

  1. Find a general solution and apply that pattern
  2. Perform a dependency analysis
  3. Either progress to the algorithm structure or return to finding a general solution

Finding Concurrency

  • Task Decomposition pattern
  • Dependency analysis pattern

Algorithm structure

  • Task parallelism pattern

Supporting Structure

  • Fork / Join pattern

Implementation mechanism

  • Processes & Threads

State Pattern

  • Allows an object to alter its behavior when its internal state changes

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/State_Design_Pattern_UML_Class_Diagram.svg/2880px-State_Design_Pattern_UML_Class_Diagram.svg.png
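
A minimal sketch of the idea in C, using a hypothetical door object: the current state is held as a function pointer that each transition swaps out, so the same call behaves differently depending on the internal state:

#include <stdio.h>

struct door;
typedef void (*push_fn)(struct door *);

typedef struct door {
    push_fn push;  // current behaviour of "push" - this is the state
} door_t;

static void push_when_open(struct door *d);
static void push_when_closed(struct door *d);

static void push_when_open(struct door *d) {
    printf("Door closes.\n");
    d->push = push_when_closed;  // transition: open -> closed
}

static void push_when_closed(struct door *d) {
    printf("Door opens.\n");
    d->push = push_when_open;    // transition: closed -> open
}

int main(void) {
    door_t d = { .push = push_when_closed };  // initial state: closed
    d.push(&d);  // Door opens.
    d.push(&d);  // Door closes.
    d.push(&d);  // Door opens.
    return 0;
}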

Task Decomposition Pattern

  • Decomposes a problem into concurrently executable tasks
  1. Define the solution space as a set of tasks that are as independent as possible
  2. Define the features of these tasks

Possible solutions

  • Functional Decomposition: splitting the program along its functions
  • Loop Splitting: distributing distinct iterations of a loop across tasks (see the sketch below)
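
A minimal sketch of loop splitting, assuming N iterations and T tasks: each task receives a contiguous half-open range of the iteration space (the fork and pthread examples later in these notes use the same scheme):

#include <stdio.h>

#define N 1000  // illustrative iteration count
#define T 4     // illustrative number of tasks

int main(void) {
    int chunk = N / T;  // iterations per task
    for (int t = 0; t < T; t++) {
        int start = t * chunk;
        // the last task also takes any remainder of N % T
        int end = (t == T - 1) ? N : start + chunk;
        printf("task %d handles iterations [%d, %d)\n", t, start, end);
    }
    return 0;
}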

Control Flow Graph

The control flow graph displays all possible execution paths of a program

[figure: control flow graph of a program]

-> Each bubble represents an identifiable task

Dependency analysis pattern

Group Task Pattern

  • How to group tasks that make up a problem to simplify analysing dependencies

Order Task Pattern

  • How must the groups of tasks be ordered to satisfy constraints?

Data Sharing Pattern

  • How is data shared amongst tasks

Solution

Shared Data can be:

  • read only
  • effectively-local
  • read write
    • accumulate
    • multiple-read single-write
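
A sketch of the read-write/accumulate case with pthreads (the workload is hypothetical): each thread accumulates into an effectively-local variable and merges it into the shared result only once, under a mutex:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

long counter = 0;                                  // shared, read-write, accumulate
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // protects counter

void *work(void *arg) {
    (void)arg;
    long local = 0;                // effectively-local: private to this thread
    for (int i = 0; i < 100000; i++) {
        local += 1;                // accumulate privately, no contention
    }
    pthread_mutex_lock(&lock);     // combine once, under the lock
    counter += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  // 4 * 100000
    return 0;
}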

Task Parallelism pattern

Dependencies

  • Order dependencies must be enforced
  • Data dependencies
    • removable dependencies - can be removed by code / loop transformations
    • separable dependencies - data structures can be replicated (replicated data)

Schedule

  • How are tasks assigned to UEs and how are they scheduled for execution?

Static Schedule

  • Defined a priori
  • Low management effort, but only suited to small, known numbers of tasks

Dynamic schedule

Higher management effort, but suited to unknown or unpredictable task loads

  • Define a task queue; when a UE completes its current task, it gets a new one from the queue

-> UEs that complete their tasks faster will take more tasks from the queue

Work stealing: if a UE has nothing to do, it looks at the queues of other UEs and takes a task from one of them (adds significant complexity)
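
A minimal dynamic-scheduling sketch with pthreads, where the "queue" is just a shared counter of task indices; each UE takes the next task as soon as it finishes its current one (work stealing is not shown):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_TASKS 16

int next_task = 0;                                 // head of the task queue
pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int task = next_task < NUM_TASKS ? next_task++ : -1;  // take next task
        pthread_mutex_unlock(&qlock);
        if (task < 0) break;       // queue empty: this UE is done
        printf("UE %ld runs task %d\n", id, task);  // stand-in for real work
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}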

Fork / Join Pattern

  • Implies cloning a section of code into another task that is structurally (but not necessarily in terms of data) independent
  • The fork-join pattern is NOT an abstraction of the POSIX fork; if anything, it is the other way around

Use-cases

  • Recursive structures, such as trees, lend themselves to fork-join
  • Irregular sets of connected tasks
  • Where different functions are mapped onto different tasks

The following example forks one process per chunk of an array; each child sends its partial sum back to the parent through a pipe (an exit status is only 8 bits wide, so it could not carry the sums):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_PROCESSES 4
#define ARRAY_SIZE 1000

int array[ARRAY_SIZE];

void sum_array(int start, int end, int* result) {
    for (int i = start; i < end; i++) {
        *result += array[i];
    }
}

int main() {
    // Initialize array with values 1 to ARRAY_SIZE
    for (int i = 0; i < ARRAY_SIZE; i++) {
        array[i] = i + 1;
    }

    // Pipe for sending the partial sums back to the parent: the
    // children's memory is copy-on-write, so the parent cannot see
    // their writes, and an 8-bit exit status would truncate the sums.
    int pipefd[2];
    if (pipe(pipefd) < 0) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    pid_t pids[NUM_PROCESSES];

    // Fork phase: one child per chunk of the array
    for (int i = 0; i < NUM_PROCESSES; i++) {
        pids[i] = fork();

        if (pids[i] < 0) {
            perror("fork");
            exit(EXIT_FAILURE);
        } else if (pids[i] == 0) {
            int start = i * (ARRAY_SIZE / NUM_PROCESSES);
            int end = (i + 1) * (ARRAY_SIZE / NUM_PROCESSES);
            int partial_sum = 0;
            sum_array(start, end, &partial_sum);
            write(pipefd[1], &partial_sum, sizeof partial_sum);
            exit(EXIT_SUCCESS);
        }
    }

    // Join phase: wait for all children, then collect their sums
    int total_sum = 0;
    for (int i = 0; i < NUM_PROCESSES; i++) {
        int partial_sum;
        waitpid(pids[i], NULL, 0);
        read(pipefd[0], &partial_sum, sizeof partial_sum);
        total_sum += partial_sum;
    }

    printf("Total sum using fork-join: %d\n", total_sum);

    return 0;
}

Geometric Decomposition

  • Sequence of operations on a core data structure

Solution

  • Assign chunks
  • Ensure data available for chunks
  • Define the updating process for chunks
  • Group and map chunks to UEs

Data sharing between chunks

[figure: data sharing between neighbouring chunks]

Exchange Options

  • (A) A priori - perform the exchange up front
    • copying or messaging
    • updating is straightforward
  • (B) Computation and communication overlap
    • compute on local data while boundary data is being transmitted
    • updating requires multi-threading, so communication and computation can proceed concurrently
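
A sequential sketch of option (A) on a 1-D domain with hypothetical sizes: each chunk carries one ghost cell per side, filled by copying from the neighbouring chunks before the purely local update (a 3-point average here; the domain boundary is treated as zero):

#include <stdio.h>
#include <string.h>

#define N 12        // global 1-D domain size (illustrative)
#define CHUNK 4     // cells per chunk
#define NCHUNKS (N / CHUNK)

int main(void) {
    // Each chunk stores its cells plus one ghost cell on each side.
    double local[NCHUNKS][CHUNK + 2];
    double global[N];
    for (int i = 0; i < N; i++) global[i] = i;

    // Distribute: copy each chunk's cells into its local array
    for (int c = 0; c < NCHUNKS; c++)
        memcpy(&local[c][1], &global[c * CHUNK], CHUNK * sizeof(double));

    // A-priori exchange: fill the ghost cells from the neighbours
    for (int c = 0; c < NCHUNKS; c++) {
        local[c][0] = (c > 0) ? local[c - 1][CHUNK] : 0.0;
        local[c][CHUNK + 1] = (c < NCHUNKS - 1) ? local[c + 1][1] : 0.0;
    }

    // Update: 3-point stencil, now purely local to each chunk
    for (int c = 0; c < NCHUNKS; c++)
        for (int i = 1; i <= CHUNK; i++)
            global[c * CHUNK + i - 1] =
                (local[c][i - 1] + local[c][i] + local[c][i + 1]) / 3.0;

    for (int i = 0; i < N; i++) printf("%.2f ", global[i]);
    printf("\n");
    return 0;
}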

Loop Parallelism Pattern

Loop parallelism is a technique to speed up programs by running loop iterations at the same time on multiple processors.

Problem

You have a loop in your code that takes a long time to run. How can you make it faster?

Solution

  1. Check Independence: Ensure each loop iteration can run without waiting for others.
  2. Parallelize: Use tools (like OpenMP) to split the loop so each part runs on a different processor.

Result: The loop runs faster because multiple processors handle different parts simultaneously.
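
A minimal sketch with OpenMP (compile with -fopenmp; the array contents are illustrative): the parallel for directive splits the iteration range across threads, and the reduction clause gives each thread a private copy of sum that is combined at the end, avoiding a race:

#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    for (int i = 0; i < N; i++) b[i] = i;

    double sum = 0.0;
    // Iterations are independent: each one writes a different a[i].
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
        sum += a[i];
    }
    printf("sum = %.0f\n", sum);
    return 0;
}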

Benefits

  • Speed: Reduces the time needed to run the loop.
  • Efficiency: Uses multiple processors effectively.

Considerations

  • Complexity: Managing parallel tasks and ensuring no conflicts (race conditions) can be tricky.
  • Debugging: Finding and fixing errors is harder in parallel programs.

Use it when: Your program has loops that take a long time and can run independently.

Distributed Array Pattern

Distributed array is a method for dividing arrays between multiple processors to improve efficiency and performance in parallel computing.

Problem

Large arrays need to be processed efficiently across multiple processors. How can you distribute the array elements effectively?

Solution

  1. Partition Arrays: Divide the array into blocks that can be distributed among processors.
  2. Map Blocks to Processors: Assign blocks to processors in a way that balances the load and minimizes communication overhead.

Result: The array elements are processed more efficiently as each processor works on a different part of the array.
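
A minimal sketch of a block distribution with illustrative sizes: the array is cut into blocks of ceil(N/P) elements, and a global index maps to an owning UE plus a local index:

#include <stdio.h>

#define N 10  // global array length (illustrative)
#define P 4   // number of UEs (illustrative)

int main(void) {
    int block = (N + P - 1) / P;  // ceil(N / P) elements per block
    for (int i = 0; i < N; i++) {
        int owner = i / block;    // which UE owns global index i
        int local = i % block;    // its index within that UE's block
        printf("global %d -> UE %d, local %d\n", i, owner, local);
    }
    return 0;
}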

Benefits

  • Load Balancing: Distributes work evenly among processors.
  • Memory Efficiency: Ensures efficient use of memory by aligning array distribution with the flow of computation.

Considerations

  • Communication Overhead: Managing data exchange between processors can introduce delays.
  • Index Mapping: Aligning local and global indices requires careful management to ensure correct data processing.

Use it when: Your program involves large arrays that need to be processed in parallel, and you want to optimize performance and memory usage.

Single Program Multiple Data Pattern (SPMD Pattern)

Single Program Multiple Data (SPMD) is a parallel computing model where multiple processors run the same program but on different data.

Problem

Managing interactions between multiple processors running the same program but on different data sets. How can this be organized efficiently?

Solution

  1. Initialize: Load the same program on all processors and establish communication.
  2. Unique Identifiers: Each processor gets a unique identifier to differentiate its tasks.
  3. Distribute Data: Each processor works on its own portion of the data.

Result: The program runs on multiple processors simultaneously, each handling different data, improving overall performance.
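
A minimal SPMD sketch using MPI (assuming an MPI installation; compile with mpicc, run with mpirun): every rank executes the same program, uses its rank as the unique identifier to pick its slice of the data, and a reduction combines the results:

#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // unique identifier per process
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank sums its own slice of 1..N
    int chunk = N / size;
    int start = rank * chunk + 1;
    int end = (rank == size - 1) ? N : start + chunk - 1;
    long local = 0;
    for (int i = start; i <= end; i++) local += i;

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %ld\n", total);  // 500500
    MPI_Finalize();
    return 0;
}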

Benefits

  • Scalability: High scalability as each processor can handle different data independently.
  • Maintainability: Easier to maintain a single code base.

Considerations

  • Data Distribution: Efficient data distribution and management are crucial.
  • Synchronization: Proper synchronization to manage data dependencies and interactions between processors.

Use it when: You need to run the same operations on different data sets across multiple processors for better performance and scalability.

Data Decomposition

-> Organise by data

Flexibility

Granularity knob

  • Defines chunks whose size and number are parameterizable -> used for fitting the solution onto different hardware
  • Impact of granularity
    • overhead of dependency management of chunks versus the computational effort: the time spent managing dependencies must be small with respect to the computational effort
    • boundary cells: cells at the boundary of a chunk -> their number grows with the chunk's surface area
    • cost = constant overhead + per-cell effort
    • computational effort -> scales with the chunk's volume
    • rough-tune dependency management against the computational effort
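
As a rough worked example (assuming cubic chunks of side length n in a 3-D domain): the computational effort scales with the volume n^3, while the boundary cells scale with the surface area, roughly 6n^2, so the dependency-management overhead relative to useful work shrinks like 6/n as the chunks grow - which is exactly the ratio the granularity knob tunes.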

Efficiency

  • Work versus overhead (see above)
  • How do chunks map onto UEs?

Geometric Decomposition

General Solutions

  • Array based computations
  • Recursive data structures

Tasks

Distinct

  • execute independently of each other
  • do not rely on each other's results & resources during their execution

Dependent

  • relies on the results of another task

Processes

  • Unit of resource ownership
  • Unit of scheduling
  • POSIX defined

Definition

  • has its own main
  • owns resources (memory, file handles, ...)
  • is assumed to have exclusive access to all computing and hardware resources 100% of the time
  • may be managed by an operating system
  • in a multi-process OS, the OS will schedule the processes according to some policy
  • the memory protection unit (HW) will protect the integrity of the process' memory from other processes

POSIX Processes

Portable Operating System Interface (POSIX)

  • Behavioural specification
  • defines how processes and threads are created and terminated
  • operating systems may (or may not) support POSIX

[figure: POSIX process creation via fork]

  • A process is created through a fork
  • fork clones (fully copies) the calling process into a child
  • the child inherits copies of the parent's resources
    • child/parent variables are handled separately (copy on write)
  • parent and child run independently of each other
    • both are scheduled separately by the OS

POSIX Threads

  • Under POSIX, threads are created within a process
  • A thread operates in the context of its process
  • Can access all process resources, including memory
  • Kernel-managed threads -> the kernel knows the threads exist and schedules them
    • default Linux behaviour
  • User threads -> threading in user space
    • the kernel does not know of their existence
    • the user-threads library, not the OS scheduler, schedules the threads

The following example computes the same array sum as the fork version, but with threads: since threads share the process' memory, the partial sums can be written directly into a shared array:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define ARRAY_SIZE 1000

int array[ARRAY_SIZE];
int partial_sum[NUM_THREADS] = {0};
pthread_t threads[NUM_THREADS];

// Each thread sums its own slice of the array into its private
// slot of partial_sum; no locking is needed, since every thread
// writes a different element.
void* sum_array(void* arg) {
    int thread_part = *((int*)arg);
    int start = thread_part * (ARRAY_SIZE / NUM_THREADS);
    int end = (thread_part + 1) * (ARRAY_SIZE / NUM_THREADS);

    for (int i = start; i < end; i++) {
        partial_sum[thread_part] += array[i];
    }

    pthread_exit(0);
}

int main() {
    // Initialize array with values 1 to ARRAY_SIZE
    for (int i = 0; i < ARRAY_SIZE; i++) {
        array[i] = i + 1;
    }

    // Fork phase: start one thread per slice of the array
    int thread_parts[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        thread_parts[i] = i;
        pthread_create(&threads[i], NULL, sum_array, &thread_parts[i]);
    }

    // Join phase: wait for all threads to finish
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    int total_sum = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        total_sum += partial_sum[i];
    }

    printf("Total sum using pthreads: %d\n", total_sum);

    return 0;
}
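
Both array-sum examples should build with a standard C compiler on a POSIX system; the pthread version additionally needs the -pthread flag (e.g. gcc -pthread).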

UE - Units of Execution

  • Creation and destruction of threads are costly
  • Reuse threads as much as possible (e.g. via a task queue, as in the dynamic-scheduling sketch above)