CPU Architectures

GP - CPU Architectures

alt text

Parallelizing CPU architectures

Multicore
Multiple FUs -> ALUs, Multipliers
- Data parallelism
Multiple pipelines
- Instruction parallelism
Multiple Heterogeneous PEs -> DSP, GPU ...

DSP Architectures (Digital Signal Processing)

Very application specific
Offers mac (.M)
Data access (.D)
GP-ALU (.L)
Shifter / ALU (.S)

alt text

Software Pipelining

Technique to reduce pipeline stalls from instruction
Together with VLIW get very tight loop kernels

alt text

Pipeline Architectures

CISC

Each construciton seperate
- move mem, reg
- move reg, reg
Microcode decodes instruction and performs operation

RISC

Store and load
- lots of instructions required
- hardware architecture, inc. pipelining makes execution fast
- compiler friendly

Superpipelining

increase the number of instructions in the pipeline
instructions can be issued faster

alt text

Superscalar

Increase the number of pipelines
Several instructions issued in a cycle
scheduling becomes an issue

alt text

Scheduling

alt text

Co Processors

alt text

Processor recognises instructions as its own
Processor does not recognise instruction it passes it on to connected Co-Pros
- Either recognised or not recognised
- If not then unknown instruction exception
- If then processed by Co-Pro
Co-pros usually have separate clocks
Often own channels to memory/cache (NEON)

-> Independent/parallel execution possible

ARM Cortex-A53 (v8 diagram)

alt text

NEON (v7 description)

Has instruction queue (16 deep) and data queue (8 entries)
A53 regards instruction as complete (performs checking and fetches)
NEON must decode and process instruction
If instruction/data queue full then A53 stalls

Co-processors face uncertain future

(Co-Pro) Instruction decoders expensive
(Co-Pro) Only really useful for generic operations
(Co-Pro) Video co-processors exis(ted) but standards advance so quickly
(CPU) Tight integration also costs G-CPU silicon
(Toolset) Added compiler maintenance because of architecture specific instruction sets

(SMT) Simultaneous Multi Threading

Intel calls this hyperthreading
Supporting cores execute two threads -> 4 cores -> 8 threads
Superscalar architecture necessary
Assumption is that CPU resources (ALU ...) are not always being used
- With two threads in parallel better chance of available resources being used
BIOS and OS must support this feature
- Core looks like two (thread capable) cores to OS
- Need to lock process on core to avoid hyperthreading when it is turned on
Speedup for some applications, reported by Intel, ~30%
- Improvement more likely dependent on externals like cache stalls than on FU availability
Security issues abound

(AMP) Asymmetric Multiprocessing

AMP - Scenarios

AMP with different Instruction Set Architecture (ISA)
- Typically specialisation – PRU, GPU, DSP ... • AMP with same ISA
- Task driven, one processor for comms, one for I/O ...
- Master-slave (typical) or peer-to-peer driven
AMP with same ISA clocked at different rates
AMP with same ISA, different architectures (64/32-bit)
- F.i. big.LITTLE
AMP with different or no OS

openAMP

rpmsg message passing service
Every device is a communication channel with a remote processor
- Devices are called channels
Each has a source/destination address

// send
int rpmsg_send(struct rpmsg_channel *rpdev, void *data, int len);

// receive
struct rpmsg_endpoint *rpmsg_create_ept(struct rpmsg_device *rpdev, rpmsg_rx_cb_t cb, void *priv, struct rpmsg_channel_info chinfo);

alt text

My ZHAW Notes