光阴冢

We make choices and life has a way of making us pay for them.

Notes on AES HWIP Technical Specification

Jul 7, 2020  

Overview

https://docs.opentitan.org/hw/ip/aes/doc/

This text will include some thoughts and notes about this spec.

Theory of Operations

Introduction

  • encryption and decryption for AES-128/192/256 in ECB/CBC/CTR modes, using a single shared data path
    • performs either encryption or decryption, but not both at the same time
  • a key expanding mechanism to generate the required round keys on-the-fly from a single initial key provided through the register interface.
    • only the master key needs to be provided via the register interface
    • benefits:
      • reduced storage requirements and smaller circuit area (15 * 128 b for all round keys vs. 3 * 256 b for three key registers)
        • one to write the initial key
        • one to hold the current full key
        • one to hold the full key of the last encryption round (the start key for decryption)
      • faster re-configuration and key switching
    • price:
      • an initial delay whenever the key is changed, before ECB/CBC decryption can run with the new key
        • 12/14/16 cycles for AES-128/192/256, spent computing the start key for decryption
      • in CTR mode, there is no such initial delay after a key change
  • a status register that tells the processor when the unit is ready to receive the next input data block via the register interface.
    • indicates when it is safe for the processor to provide the next input data block
    • encryption/decryption starts automatically once the previous operation has finished and the next input block is ready
      • this is the default setting
      • every input register must be written at least once for the AES unit to start automatically
    • with the MANUAL_OPERATION bit set to 1, the unit only starts when the START bit in TRIGGER is 1
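
To make the automatic vs. manual start rule above concrete, here is a minimal software sketch of the start condition. It is not the hardware logic: the class, its field names, and the assumption of four 32-bit input registers per 128-bit block are all hypothetical and only mirror the notes above.

```python
from dataclasses import dataclass, field

NUM_DATA_IN_REGS = 4  # assumption: four 32-bit input registers per 128-bit block


@dataclass
class AesStartModel:
    """Toy model of the start condition; every name here is hypothetical."""
    manual_operation: bool = False   # models the MANUAL_OPERATION bit
    idle: bool = True                # True once the previous operation has finished
    start_bit: bool = False          # models the START bit in TRIGGER
    written: set = field(default_factory=set)

    def write_data_in(self, index: int) -> None:
        """Record a write to input register `index`."""
        if not 0 <= index < NUM_DATA_IN_REGS:
            raise IndexError(index)
        self.written.add(index)

    def should_start(self) -> bool:
        """Return True when the modelled unit would start a new block."""
        if not self.idle:
            return False
        if self.manual_operation:
            # Manual mode: only the START bit matters.
            return self.start_bit
        # Automatic mode: every input register written at least once.
        return len(self.written) == NUM_DATA_IN_REGS
```

In automatic mode the model starts as soon as the fourth register has been written, in any order; in manual mode it ignores the writes and waits for `start_bit`.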

Block Diagram

  • medium performance (~1 cycle per round)
  • uses an iterative cipher core architecture
    • allows for a smaller circuit area at the cost of throughput
  • 128 bit data path
    • achieves the latency requirement of 12/14/16 clock cycles per 16B data block in AES-128/192/256 mode
  • cipher core
    • data path is shared between encryption and decryption in
      • actual cipher (left)
      • round key generation (right)
    • each block shown in the diagram implements both the forward and the backward (inverse) version of its operation
  • a set of control and status registers
    • via TL-UL bus interface
  • a counter module
    • in CTR mode only
  • the IV registers used in CBC and CTR modes
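
The latency figures above translate directly into a peak throughput number. The sketch below only does that arithmetic; the cycle counts come from the notes, while the function name and the 100 MHz clock in the usage note are assumptions chosen for illustration, not values from the spec.

```python
# Cycles per 16-byte block, as listed in the notes above.
CYCLES_PER_BLOCK = {128: 12, 192: 14, 256: 16}
BLOCK_BITS = 128


def throughput_mbit_per_s(key_bits: int, clock_mhz: float) -> float:
    """Peak throughput in Mbit/s for back-to-back blocks.

    Hypothetical helper: ignores register-interface overhead and
    any delay after a key change.
    """
    cycles = CYCLES_PER_BLOCK[key_bits]
    return BLOCK_BITS * clock_mhz / cycles
```

For example, at an assumed 100 MHz clock, `throughput_mbit_per_s(256, 100.0)` gives 800.0 Mbit/s, and AES-128 comes out at roughly 1067 Mbit/s.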


Hardware Interfaces

  • Primary Clock: clk_i
  • Other Clocks: none
  • Bus Device Interface: tlul
  • Bus Host Interface: none
  • Peripheral Pins for Chip IO: none
  • Interrupts: none
  • Security Alerts: none

Design Details

Datapath Architecture and Operation

  • An Equivalent Inverse Cipher

    • allows for more efficient cipher data path sharing
    • operations are applied in the same order (less muxes, simpler control)
    • requires the round key during decryption to be transformed using an inverse MixColumns in all rounds except for the first and the last one
  • Considerations

    • If only CTR mode is used, the inverse cipher is not needed at all.
    • If the key is changed extremely rarely, it may pay off to store all round keys instead of generating them on the fly.
    • Future versions of the AES unit might offer compile-time parameters to selectively instantiate the forward/inverse cipher part only to allow for dedicated encryption/decryption-only units.
  • Submodules in the data path are purely combinational.

  • The only sequential logic in the cipher and round key generation are the State, Full Key and Decryption Key registers.

  • the initial key and configuration

    • provided via a set of control and status registers (CSRs)
    • via TL-UL bus interface
    • each key register must be written at least once
    • the order does not matter
  • initialization vector (IV) or initial counter value

    • to the four IV registers
    • via TL-UL bus interface
    • in CBC or CTR mode
      • each IV must be written at least once
      • the order does not matter
      • the IV is updated after the current value has been consumed
    • in ECB mode, IVs are ignored
  • input data

    • provided via CSRs
    • each input data register must be written at least once
  • if new input data is available

    • loads the initial state into State register
      • depending on the cipher mode, it can be a combination of input data as well as IV
      • if in CBC decryption or in CTR, the input data is also registered (Data)
    • the initial key is loaded into the Full Key register
      • if ECB/CBC decryption is performed, the Full Key register is instead loaded with the value in the Decryption Key register
    • to start in CBC/CTR mode
      • the IV must be ready, i.e., each IV register written at least once
  • once the State and Full Key registers have been loaded

    • the AES cipher core starts the encryption/decryption
    • by adding the first round key to the initial state
    • the result is stored back in the State register
  • then performs 9/11/13 rounds (AES-128/192/256)

    • each round applies the four operations: SubBytes, ShiftRows, MixColumns and AddRoundKey
    • in parallel, the full key used for the next round is computed on the fly by the key expand module
    • in CTR mode, the counter module iteratively updates the IV in parallel
      • internally, it uses one 16-bit counter
        • so it needs 8 cycles to increment the 128-bit counter value
        • the counter value is used only in the first round
        • so this does not affect the throughput
  • the final encryption/decryption round, in which MixColumns is skipped

    • the output is forwarded to the output registers in the CSRs
    • the State register is cleared with pseudo-random data
    • depending on the cipher mode, the output of the final round can be XORed with the IV (CBC decryption) or the previous input data (CTR) before it is forwarded
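
The 8-cycle counter update mentioned for CTR mode can be sketched in software as a slice-by-slice increment with carry. This is only an illustration of the idea: the function name is hypothetical, the IV is treated as a plain 128-bit integer, and the real byte ordering of the IV registers is not modelled.

```python
from typing import Tuple

SLICE_BITS = 16
NUM_SLICES = 128 // SLICE_BITS  # 8 slices, one handled per step


def increment_iv(iv: int) -> Tuple[int, int]:
    """Increment a 128-bit counter one 16-bit slice per step.

    Returns (new_iv, steps). The step count stands in for clock cycles.
    """
    mask = (1 << SLICE_BITS) - 1
    carry = 1          # the "+1" enters at the least significant slice
    result = 0
    steps = 0
    for i in range(NUM_SLICES):
        slice_val = (iv >> (i * SLICE_BITS)) & mask
        total = slice_val + carry
        result |= (total & mask) << (i * SLICE_BITS)
        carry = total >> SLICE_BITS   # propagate overflow to the next slice
        steps += 1
    return result, steps
```

Since a block takes 12/14/16 cycles and the counter value is only needed in the first round, the 8 steps fit comfortably inside one block's processing time.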

SubBytes / S-Box

The design of this S-Box and its inverse can have a big impact on circuit area, timing critical path, robustness and power leakage, and is itself its own research topic.
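
As a reference point for what the S-Box computes (not how the hardware builds it), the sketch below derives the table from its textbook definition: a multiplicative inverse in GF(2^8) followed by an affine transformation. Helper names are my own, and real designs use far more compact or leakage-aware circuits than this brute-force version.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse via a^254; 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x: int) -> int:
    """Forward S-Box: field inverse, then the affine transformation."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [sbox(x) for x in range(256)]

# The inverse S-Box is simply the inverse permutation.
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x
```

A quick sanity check is `SBOX[0x53] == 0xED`, the example used in FIPS 197.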

ShiftRows

Can be implemented using 3*4 32-bit 2-input muxes (encryption/decryption).
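
A small software model shows why muxes are enough here: ShiftRows is a fixed byte permutation, and the only thing that changes between encryption and decryption is the rotation direction. The sketch assumes the usual column-major state layout (byte index = row + 4 * column); the function name and the `decrypt` flag, which plays the role of the mux select, are my own.

```python
def shift_rows(state: bytes, decrypt: bool = False) -> bytes:
    """Rotate row r of the 4x4 state left by r (or right by r when decrypting)."""
    if len(state) != 16:
        raise ValueError("state must be 16 bytes")
    out = bytearray(16)
    for r in range(4):
        shift = -r if decrypt else r   # direction select, like a 2-input mux
        for c in range(4):
            out[r + 4 * c] = state[r + 4 * ((c + shift) % 4)]
    return bytes(out)
```

Applying `shift_rows` and then `shift_rows(..., decrypt=True)` returns the original state.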

MixColumns

Can be implemented using 36 2-input XORs and 16 4-input XORs (all 8-bit), 8 2-input muxes (8-bit), as well as 78 2-input and 24 3-input XOR gates.
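
The Equivalent Inverse Cipher noted earlier needs the round keys passed through inverse MixColumns. That works because MixColumns is linear over XOR, which the sketch below checks in software. It is a plain matrix-multiply model with my own helper names, not the XOR/mux structure counted above.

```python
def xtime(a: int) -> int:
    """Multiply a byte by x in GF(2^8)."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


FWD = (2, 3, 1, 1)      # first row of the MixColumns matrix
INV = (14, 11, 13, 9)   # first row of the inverse matrix


def mix_columns(state: bytes, inverse: bool = False) -> bytes:
    """Apply (inverse) MixColumns to a column-major 16-byte state."""
    coeffs = INV if inverse else FWD
    out = bytearray(16)
    for c in range(4):
        col = state[4 * c:4 * c + 4]
        for r in range(4):
            acc = 0
            for k in range(4):
                # The matrix is circulant: row r is row 0 rotated right by r.
                acc ^= gf_mul(coeffs[(k - r) % 4], col[k])
            out[4 * c + r] = acc
    return bytes(out)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def linearity_holds(state: bytes, key: bytes) -> bool:
    """InvMixColumns(state ^ key) == InvMixColumns(state) ^ InvMixColumns(key)."""
    lhs = mix_columns(xor_bytes(state, key), inverse=True)
    rhs = xor_bytes(mix_columns(state, inverse=True),
                    mix_columns(key, inverse=True))
    return lhs == rhs
```

Because of this property, AddRoundKey can be moved past inverse MixColumns as long as the round key itself is transformed, which is what keeps the operation order the same for both directions.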

KeyExpand (KEM)

  • on-the-fly
    • lower storage requirements
    • smaller circuit area
    • initial delay
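
To illustrate the on-the-fly idea, the sketch below derives each round key from the previous one, so only the current full key has to be stored. It covers AES-128 only (the unit also supports 192/256, which use a different schedule step), computes the S-Box by brute force to stay self-contained, and all function names are my own.

```python
def xtime(a: int) -> int:
    """Multiply a byte by x in GF(2^8)."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gf_mul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def sbox(x: int) -> int:
    """Forward S-Box from its definition (inverse + affine transform)."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [sbox(x) for x in range(256)]


def next_round_key(key: bytes, rcon: int) -> bytes:
    """AES-128: compute round key i+1 from round key i."""
    w = [key[4 * i:4 * i + 4] for i in range(4)]
    # RotWord and SubWord on the last word, then add the round constant.
    t = bytes(SBOX[b] for b in w[3][1:] + w[3][:1])
    prev = bytes([t[0] ^ rcon]) + t[1:]
    out = []
    for i in range(4):
        prev = bytes(x ^ y for x, y in zip(w[i], prev))
        out.append(prev)
    return b"".join(out)


def expand_on_the_fly(key: bytes):
    """Yield round keys 1..10, keeping only the current key in memory."""
    rcon = 1
    for _ in range(10):
        key = next_round_key(key, rcon)
        rcon = xtime(rcon)
        yield key
```

The last key this generator yields corresponds to the "full key of the last encryption round" above, i.e. the start key for decryption. Running all ten steps once to reach it is the software analogue of the initial delay after a key change.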

Timing Diagram