# Roadmap
These are the planned features for the next releases of Sheaf.
## Language

- `def` for global constants: Immutable top-level bindings, as in Clojure. While not strictly required, it eliminates the need to pass configuration dictionaries everywhere.
- `loop`/`recur`: Explicit tail-recursive loops. Sheaf uses `repeat`, but `loop` and `recur` are more natural for anyone coming from Clojure.
- `reverse`/`flip`: Reverse a tensor along an axis. Currently this requires manual index construction.
- `stack`: Combine multiple tensors along a new dimension.
- `inc`/`dec`: Small conveniences for incrementing and decrementing, as in Clojure. Currently `(+ var 1)` and `(- var 1)`.
- `argmax` returns integers: `argmax` and `argmin` currently return floats. They will return integer tensors for direct use as indices.
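Since the feature is unreleased, no Sheaf snippet can be shown; the target behavior is the one NumPy already has, where `argmax` returns an integer dtype usable directly for indexing. A Python sketch, not Sheaf code:

```python
import numpy as np

logits = np.array([[0.1, 2.5, 0.3],
                   [1.7, 0.2, 0.9]])

# np.argmax returns an integer tensor, so the result can be used
# directly as indices -- the behavior Sheaf's argmax is moving to.
ids = np.argmax(logits, axis=1)                  # array([1, 0])

# Direct use as indices: pick the winning logit from each row.
best = logits[np.arange(logits.shape[0]), ids]   # array([2.5, 1.7])
```

With float results, the same lookup would first need an explicit cast to an integer type.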
## Macros

- Enriched `defmacro`: `range` and `reduce` available at compile time, enabling macros that generate architecture variants from a single template.
## Operations

- Convolution primitives: `conv1d` and `conv2d` via `stablehlo.convolution`, exposed through the standard library.
- `vmap` on dictionaries: `vmap` currently only accepts tensor arguments. PyTree support (automatic flattening/unflattening of dicts) should be added to match the behavior of `value-and-grad`.
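To make the PyTree idea concrete, here is a minimal pure-Python sketch of what "automatic flattening/unflattening of dicts" means. The names `tree_flatten`, `tree_unflatten`, and `vmap_dict` are illustrative, not Sheaf (or JAX) API:

```python
import numpy as np

def tree_flatten(params):
    """Flatten a (possibly nested) dict of arrays into a list of
    leaves plus the key structure needed to rebuild it."""
    leaves, treedef = [], []
    for k in sorted(params):
        v = params[k]
        if isinstance(v, dict):
            sub_leaves, sub_def = tree_flatten(v)
            leaves += sub_leaves
            treedef.append((k, sub_def))
        else:
            leaves.append(v)
            treedef.append((k, None))
    return leaves, treedef

def tree_unflatten(treedef, leaves):
    """Inverse of tree_flatten: rebuild the dict from the leaves."""
    it = iter(leaves)
    def build(defs):
        return {k: (build(sub) if sub is not None else next(it))
                for k, sub in defs}
    return build(treedef)

def vmap_dict(fn):
    """Sketch of vmap over a dict argument: flatten, slice every leaf
    along axis 0, apply fn per example, stack the results."""
    def batched(params):
        leaves, treedef = tree_flatten(params)
        n = leaves[0].shape[0]
        return np.stack(
            [fn(tree_unflatten(treedef, [leaf[i] for leaf in leaves]))
             for i in range(n)])
    return batched

# The per-example function sees unbatched arrays; the wrapper
# handles the batch dimension.
params = {"w": np.ones((3, 2, 2)), "x": np.ones((3, 2))}
out = vmap_dict(lambda p: p["w"] @ p["x"])(params)   # shape (3, 2)
```

A real `vmap` traces the function once instead of looping in Python, but the flatten/slice/unflatten contract is the same.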
## Autodiff

- Gradient checkpointing: Recompute forward activations during the backward pass instead of storing them all. This will reduce memory usage for deep models (the GPT-2 124M training run currently uses 13 GB).
- Scalar parameters in `value-and-grad`: Float scalars in parameter dictionaries (e.g., `{:w 5.0}`) will produce correct gradients. Currently they must be wrapped in a 1-element tensor.
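The checkpointing trade-off can be sketched in plain Python on a toy chain of `tanh` layers: store only every `segment`-th activation in the forward pass and recompute the rest per segment during the backward pass. A conceptual sketch, not Sheaf's implementation:

```python
import math

def layer(x):        return math.tanh(x)
def layer_grad(x):   return 1.0 - math.tanh(x) ** 2   # d tanh(x) / dx

def grad_full(x0, n_layers):
    """Baseline backprop: store every activation."""
    xs = [x0]
    for _ in range(n_layers):
        xs.append(layer(xs[-1]))
    g = 1.0
    for x_in in reversed(xs[:-1]):
        g *= layer_grad(x_in)
    return g

def grad_checkpointed(x0, n_layers, segment):
    """Backprop storing only every `segment`-th activation; the rest
    are recomputed segment by segment. Assumes segment divides n_layers."""
    ckpts, x = {0: x0}, x0
    for i in range(n_layers):                 # forward: keep checkpoints only
        x = layer(x)
        if (i + 1) % segment == 0:
            ckpts[i + 1] = x
    g = 1.0
    for end in range(n_layers, 0, -segment):  # backward, segment by segment
        start = end - segment
        xs = [ckpts[start]]                   # recompute this segment
        for _ in range(segment - 1):
            xs.append(layer(xs[-1]))
        for x_in in reversed(xs):             # then backprop through it
            g *= layer_grad(x_in)
    return g

# Same gradient, but only n_layers/segment + 1 activations stored.
g_ck, g_base = grad_checkpointed(0.5, 8, 4), grad_full(0.5, 8)
```

Peak activation memory drops from O(n_layers) to O(n_layers/segment + segment), at the cost of one extra forward computation per segment.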
## Performance

- KV cache helpers: A `cached-attention` stdlib function to simplify implementing a KV cache in transformer models.
- Batch generation mode: Compile the full autoregressive generation loop into a single dispatch, returning all tokens at once.
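As an illustration of what such a helper does, here is a single-head decode step with a KV cache in NumPy. This `cached_attention` is a hypothetical Python sketch, not the planned Sheaf API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cached_attention(q, k_new, v_new, cache):
    """One decode step of single-head attention with a KV cache.

    q, k_new, v_new: (d,) vectors for the current token.
    cache: dict holding "k" and "v" arrays of shape (t, d).
    Each new key/value is appended once and reused by later steps,
    so a step attends over the cache instead of recomputing history.
    """
    cache["k"] = np.vstack([cache["k"], k_new[None, :]])
    cache["v"] = np.vstack([cache["v"], v_new[None, :]])
    scores = cache["k"] @ q / np.sqrt(q.shape[-1])   # (t+1,)
    return softmax(scores) @ cache["v"]              # (d,)

# Decode three tokens; the cache grows by one row per step.
d = 4
cache = {"k": np.zeros((0, d)), "v": np.zeros((0, d))}
for t in range(3):
    out = cached_attention(np.ones(d), np.full(d, float(t)),
                           np.full(d, float(t)), cache)
```

Without the cache, step t would recompute keys and values for all t previous tokens, making generation quadratic in sequence length.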
## Developer experience

- Error call stack propagation: When an error occurs inside a stdlib function, the error message will show the user's call site, not the stdlib internals.
- Jupyter integration: A Sheaf kernel for Jupyter, allowing interactive notebook workflows with inline tensor visualization and training loops.
- `:trace` and `:blame` in the REPL: These observability modes were tied to V1 semantics and were temporarily removed. They will be reintroduced with behavior adapted to the V2 execution model.
## Distribution

- NCCL all-reduce: Multi-GPU training via NCCL collective operations, for data-parallel training across multiple devices.
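The data-parallel pattern this targets can be sketched on the host in plain Python: each rank computes a gradient on its shard of the batch, and an all-reduce leaves every rank holding the same averaged gradient. Names here are illustrative; real NCCL collectives run on-device:

```python
import numpy as np

def all_reduce_mean(per_rank_grads):
    """Host-side stand-in for an NCCL all-reduce: every rank contributes
    its local gradient and every rank receives the same mean."""
    mean = np.mean(np.asarray(per_rank_grads), axis=0)
    return [mean] * len(per_rank_grads)

def local_grad(w, x, y):
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return np.mean(2.0 * (w * x - y) * x)

w = 0.0
x = np.arange(8.0)                    # global batch
y = 3.0 * x                           # targets (true w = 3)
shards = np.split(np.arange(8), 4)    # 4 equal shards, one per rank
grads = [local_grad(w, x[s], y[s]) for s in shards]
synced = all_reduce_mean(grads)
# With equal shard sizes, the mean of per-shard gradients equals the
# gradient on the full batch, so every rank applies the same update.
```

The averaging step is the only cross-device communication per iteration, which is why a fast collective implementation like NCCL dominates data-parallel scaling.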