ILP: Instruction-Level Parallelism

How many instructions can execute at once?

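It differs from program to program.
Average ILP = total number of instructions in the program / number of steps (cycles) needed assuming infinitely many PEs (processing elements)
The more a program can be executed in parallel, the larger this value.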
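To make the Average ILP definition concrete, here is a minimal sketch. It assumes every instruction takes one step and that dependencies are given as lists of earlier instruction indices; the function name `average_ilp` and the example program are made up for illustration.

```python
def average_ilp(deps):
    """deps[i] lists the earlier instructions that instruction i depends on.

    Assumptions of this sketch: unit latency, unlimited processing elements.
    """
    level = []  # level[i] = earliest step at which instruction i can run
    for i, producers in enumerate(deps):
        if any(p >= i for p in producers):
            raise ValueError("dependencies must point to earlier instructions")
        level.append(1 + max((level[p] for p in producers), default=0))
    steps = max(level, default=0)  # length of the longest dependency chain
    return len(deps) / steps if steps else 0.0


# Hypothetical 6-instruction program:
# i0 and i1 are independent, i2 needs both, i3 needs i2, i4 and i5 are independent.
example = [[], [], [0, 1], [2], [], []]
print(average_ilp(example))  # 6 instructions / 3 steps -> 2.0
```

A fully serial chain gives 1.0, and a program with no dependencies at all gives its instruction count.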

So how can ILP be implemented inside a CPU?

Superpipelined machine

Deepen the pipeline: split it into more (n) stages, so each stage, and therefore the clock cycle, gets shorter.

Superscalar machine

Execute multiple instructions at the same time (issue several per clock cycle).

Operation latency

The time one instruction takes from start to result, i.e. the number of cycles before a dependent instruction can use its result

Issue rate (= theoretical IPC)

How many instructions can be issued per clock cycle

ex. 4-way superscalar processor: 4 instructions per clock cycle

Superscalar/Superpipelined degree

Superscalar degree: N machines (pipelines) running in parallel

Superpipelined degree: the pipeline is M times deeper (each baseline stage is split into M stages)
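A rough way to compare the two degrees is to count ideal cycles for k independent instructions. The sketch below assumes no hazards at all, takes N as the number of instructions issued together, and takes M as the number of minor cycles per baseline cycle; `ideal_cycles` and its parameters are illustrative names, not from the lecture.

```python
import math


def ideal_cycles(k, stages, n=1, m=1):
    """Baseline-machine cycles to run k >= 1 independent instructions.

    stages: pipeline depth of the baseline machine, in baseline cycles
    n: superscalar degree (instructions issued together)
    m: superpipelined degree (minor cycles per baseline cycle)
    Assumes no dependencies, no stalls, no cache misses.
    """
    groups = math.ceil(k / n)         # issue groups of up to n instructions
    return stages + (groups - 1) / m  # first group fills the pipe, the rest follow every 1/m cycle


print(ideal_cycles(8, 5))       # baseline: 12.0
print(ideal_cycles(8, 5, n=4))  # superscalar, N=4: 6.0
print(ideal_cycles(8, 5, m=4))  # superpipelined, M=4: 6.75
```

Under these idealized assumptions both approaches give a similar speedup; real programs fall short because of the dependencies discussed next.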

Out-of-Order execution

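Problems with the in-order pipeline

Because of dependencies between instructions, once parallelism goes beyond a certain level, the performance gain drops off sharply. (Forwarding alone cannot fix this.)

ex. Suppose we run N instructions at once (N-way superscalar) and pipeline that M deep. What if the number of instructions in flight (roughly N × M) becomes larger than the distance between dependent instructions? Then the dependent instruction has to stall.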

Register renaming

If there were infinitely many registers, WAR (anti-dependency) and WAW (output dependency) hazards would not occur.
That is because they come from reusing register names, not from a true dependency on the data itself.
A renaming table maps architectural registers to physical registers. (The physical registers are not visible from outside the CPU.)
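The renaming table can be sketched as a plain dictionary. This is a simplification that assumes an unlimited supply of physical registers and never frees one; the instruction format `(dest, sources)` and the function name `rename` are made up for the example.

```python
from itertools import count


def rename(program, arch_regs):
    """Rewrite architectural register names into physical ones.

    program: list of (dest, (src, ...)) tuples using architectural names.
    Assumption: physical registers are unlimited and never de-allocated.
    """
    table = {r: f"p{i}" for i, r in enumerate(arch_regs)}  # initial mapping
    fresh = count(len(arch_regs))                          # next unused physical register
    renamed = []
    for dest, srcs in program:
        phys_srcs = tuple(table[s] for s in srcs)  # read current mappings first
        table[dest] = f"p{next(fresh)}"            # then give dest a brand-new register
        renamed.append((table[dest], phys_srcs))
    return renamed


arch = ["r1", "r2", "r3", "r4", "r5", "r6"]
program = [
    ("r1", ("r2", "r3")),  # r1 = r2 op r3
    ("r2", ("r1", "r4")),  # r2 = r1 op r4  (true dependency on r1, WAR on r2)
    ("r1", ("r5", "r6")),  # r1 = r5 op r6  (WAW on r1 with the first instruction)
]
for dest, srcs in rename(program, arch):
    print(dest, "<-", srcs)
```

After renaming, the three destinations are p6, p7 and p8, so the WAR and WAW conflicts are gone, while the second instruction still reads p6 (the true dependency is preserved).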

When is it safe to remove a binding (i.e., de-allocate a physical register)?

OoO example

Independent instructions are executed in a different order (out of order).
The order changes only internally; the reordering must not be visible from outside the CPU.

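Another benefit of OoO: hiding memory latency

On a cache miss, especially one that misses all the way out to the LLC (last-level cache), many cycles would be wasted stalling; OoO can hide much of that stall by running independent instructions in the meantime.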
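A toy model shows the effect. It assumes a single-issue machine, an unbounded instruction window, and dependencies that only point to earlier instructions; the latencies and the name `total_cycles` are invented for the illustration.

```python
def total_cycles(program, out_of_order):
    """program: list of (latency, deps); at most one instruction issues per cycle.

    In-order: only the oldest unissued instruction may issue.
    Out-of-order: the oldest *ready* unissued instruction may issue.
    Assumes deps refer only to earlier instructions (otherwise this never ends).
    """
    done_at = {}  # instruction index -> cycle at which its result is available
    pending = list(range(len(program)))
    cycle = 0

    def is_ready(i):
        return all(d in done_at and done_at[d] <= cycle for d in program[i][1])

    while pending:
        candidates = pending if out_of_order else pending[:1]
        pick = next((i for i in candidates if is_ready(i)), None)
        if pick is not None:
            done_at[pick] = cycle + program[pick][0]
            pending.remove(pick)
        cycle += 1
    return max(done_at.values(), default=0)


# A load that misses (10 cycles), one instruction that uses it, four independent ones.
program = [(10, []), (1, [0]), (1, []), (1, []), (1, []), (1, [])]
print(total_cycles(program, out_of_order=False))  # 15
print(total_cycles(program, out_of_order=True))   # 11
```

In the out-of-order run the four independent instructions execute underneath the miss, so only the dependent one waits.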

Typical Out-of-Order CPU structure

Fetch → Decode (→ Register renaming) is all done in order.
Instructions are then issued to multiple Execution Units.

So how do we get instructions to execute out of order?

Issue queue (ISQ) or Issue buffer

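An instruction is issued once all of its operands are ready.
For example, if p4←result has just completed, instructions in the queue that were waiting on p4 can now be marked ready.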
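The wakeup-and-select idea can be sketched as follows. This is a simplified model, not a description of any particular CPU: the class `IssueQueue`, its method names, and the oldest-first selection policy are all assumptions for illustration.

```python
class IssueQueue:
    """Toy issue queue: entries wait until all source registers are ready."""

    def __init__(self, ready_regs):
        self.ready = set(ready_regs)  # physical registers whose values exist
        self.entries = []             # waiting instructions, oldest first

    def dispatch(self, name, dest, srcs):
        self.entries.append((name, dest, tuple(srcs)))

    def wakeup(self, phys_reg):
        """Broadcast that phys_reg has just been written."""
        self.ready.add(phys_reg)

    def select(self):
        """Remove and return the oldest entry whose sources are all ready, else None."""
        for idx, (_name, _dest, srcs) in enumerate(self.entries):
            if all(s in self.ready for s in srcs):
                return self.entries.pop(idx)
        return None


q = IssueQueue(ready_regs={"p1", "p2"})
q.dispatch("add", "p5", ["p4", "p1"])  # older, but still waiting on p4
q.dispatch("sub", "p6", ["p1", "p2"])  # younger, already ready
print(q.select())  # the sub issues first (out of order)
print(q.select())  # None: the add is still waiting
q.wakeup("p4")     # p4 <- result has completed
print(q.select())  # now the add can issue
```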

ROB (register) / LSQ (load-store)

ROB (ReOrder Buffer)

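From the programmer's view, instructions must appear to execute in order, so the ROB stores that order (program order) and commits instructions in that order, so that register values change only as the program expects.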
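In-order commit can be sketched with a queue. This is a minimal model that only tracks register results; the class `ReorderBuffer` and its methods are hypothetical names, and real ROBs also track exceptions, branches, and stores.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results may complete in any order but commit in program order."""

    def __init__(self):
        self.entries = deque()  # program order, oldest on the left
        self.arch_regs = {}     # committed (programmer-visible) register state

    def allocate(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head only; stop at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed


rob = ReorderBuffer()
first = rob.allocate("r1")
second = rob.allocate("r2")
rob.complete(second, 20)  # the younger instruction finishes first
print(rob.commit())       # []: r2 must wait behind r1
rob.complete(first, 10)
print(rob.commit())       # ['r1', 'r2']
```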

LSQ (load-store queue)

Makes memory loads and stores commit in program order

Store
Load

What if a branch is mispredicted under OoO? (Speculative Execution)

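(As has come up many times already) control-flow instructions account for roughly 14% of all instructions. With OoO, the penalty for a wrong branch prediction becomes very large, because more speculatively executed instructions have to be thrown away.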

In-order/speculative state, commit/rewind

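Keep the in-order state and the speculative state separately, then:
Commit: if the prediction was right, the speculative state is committed.
Rewind: the prediction was wrong. Roll back as if the speculative state had never existed.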
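Commit and rewind can be sketched with two dictionaries. This is a deliberately small model assuming a single unresolved branch at a time; the class `SpeculativeRegs` and its method names are invented for the example.

```python
class SpeculativeRegs:
    """Toy register state split into in-order (committed) and speculative parts."""

    def __init__(self, initial):
        self.committed = dict(initial)  # in-order state
        self.speculative = {}           # writes made past an unresolved branch

    def read(self, reg):
        # Speculative instructions see the newest value, committed or not.
        return self.speculative.get(reg, self.committed[reg])

    def write(self, reg, value):
        self.speculative[reg] = value

    def commit(self):
        """Prediction was right: speculative writes become in-order state."""
        self.committed.update(self.speculative)
        self.speculative.clear()

    def rewind(self):
        """Prediction was wrong: drop speculative writes as if they never happened."""
        self.speculative.clear()


regs = SpeculativeRegs({"r1": 1, "r2": 2})
regs.write("r1", 100)   # executed speculatively after a predicted branch
print(regs.read("r1"))  # 100
regs.rewind()           # misprediction
print(regs.read("r1"))  # 1
```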

Modern OoO mechanisms

(Not sure I need to go this far, but)

P6-style execution vs R10k-style execution

P6-style (Intel Pentium Pro): the ROB also holds the physical register values

R10k-style (MIPS R10000): the ROB stores only program order, and there is a separate physical register file

Branch prediction in superscalar CPUs

In a superscalar machine, if two branch instructions are running in parallel and both are taken,

Instruction vs Thread level parallelism (ILP vs TLP)

As an analogy, ILP: multitasking on my own; TLP: hiring part-timers