CPU Architectures
GP - CPU Architectures
Parallelizing CPU architectures
- Multicore
- Multiple FUs -> ALUs, Multipliers
- Data parallelism
- Multiple pipelines
- Instruction parallelism
- Multiple Heterogeneous PEs -> DSP, GPU ...
DSP Architectures (Digital Signal Processing)
- Very application specific
- Offers mac (.M)
- Data access (.D)
- GP-ALU (.L)
- Shifter / ALU (.S)
Software Pipelining
- Technique to reduce pipeline stalls from instruction dependencies
- Together with VLIW get very tight loop kernels
Pipeline Architectures
CISC
- Each instruction is a separate complex operation
- move mem, reg
- move reg, reg
- Microcode decodes instruction and performs operation
RISC
- Load/store architecture: only loads and stores access memory
- More instructions required per task
- Hardware architecture, incl. pipelining, makes execution fast
- compiler friendly
Superpipelining
- Increase the number of pipeline stages -> more instructions in flight
- instructions can be issued faster
Superscalar
- Increase the number of pipelines
- Several instructions issued in a cycle
- scheduling becomes an issue
Scheduling
Co Processors
- If the processor recognises an instruction as its own, it executes it
- If not, it passes the instruction on to the connected Co-Pros
- Each Co-Pro either recognises it or not
- If no Co-Pro recognises it -> unknown instruction exception
- If recognised -> processed by that Co-Pro
- Co-pros usually have separate clocks
- Often own channels to memory/cache (NEON)
-> Independent/parallel execution possible
ARM Cortex-A53 (v8 diagram)
NEON (v7 description)
- Has an instruction queue (16 deep) and a data queue (8 entries)
- Once queued, the A53 regards the instruction as complete (it only performs checking and fetching)
- NEON must then decode and process the instruction itself
- If the instruction/data queue is full, the A53 stalls
Co-processors face uncertain future
- (Co-Pro) Instruction decoders expensive
- (Co-Pro) Only really useful for generic operations
- (Co-Pro) Video co-processors existed, but standards advance too quickly for fixed hardware
- (CPU) Tight integration also costs G-CPU silicon
- (Toolset) Added compiler maintenance because of architecture specific instruction sets
(SMT) Simultaneous Multi Threading
- Intel calls this hyperthreading
- Supporting cores execute two threads -> 4 cores -> 8 threads
- Superscalar architecture necessary
- Assumption is that CPU resources (ALU ...) are not always being used
- With two threads in parallel better chance of available resources being used
- BIOS and OS must support this feature
- Core looks like two (thread capable) cores to OS
- To avoid sharing a core, a process must be pinned to a specific core when hyperthreading is enabled
- Speedup for some applications, reported by Intel, ~30%
- Improvement more likely dependent on externals like cache stalls than on FU availability
- Security issues abound
(AMP) Asymmetric Multiprocessing
AMP - Scenarios
- AMP with different Instruction Set Architecture (ISA)
- Typically specialisation – PRU, GPU, DSP ...
- AMP with same ISA
- Task driven, one processor for comms, one for I/O ...
- Master-slave (typical) or peer-to-peer driven
- AMP with same ISA clocked at different rates
- AMP with same ISA, different architectures (64/32-bit)
- E.g. big.LITTLE
- AMP with different or no OS
openAMP
- rpmsg message passing service
- Every device is a communication channel with a remote processor
- Devices are called channels
- Each has a source/destination address
// send on an endpoint
int rpmsg_send(struct rpmsg_endpoint *ept, void *data, int len);
// receive: create an endpoint and register a receive callback
struct rpmsg_endpoint *rpmsg_create_ept(struct rpmsg_device *rpdev, rpmsg_rx_cb_t cb, void *priv, struct rpmsg_channel_info chinfo);