Architecture
The pipeline is a six-stage LangGraph state machine. Three stages (extractor, openacc, cython_wrapper) call the LLM; the other three (parser, pure_elemental, validation) are deterministic.
    kernel.f90 (monolithic Fortran)
     │
     ▼  parser           Loki AST → detects INTENT, SAVE, COMMON, loops, I/O
     │                   Deterministic, no LLM
     │
     ▼  extractor        LLM (1 call) → extracts 2-D loops as MODULE
     │                   subroutines, removes COMMON, exposes SAVE
     │                   as INTENT(INOUT)
     │                   → module_kernels.f90 + driver.f90
     │
     ▼  pure_elemental   AST rules → annotate PURE/ELEMENTAL
     │                   Validates: no I/O, no SAVE, INTENT explicit
     │
     ▼  openacc          LLM (1 driver call) → !$acc parallel loop
     │                   collapse(2), !$acc data copyin/copy around
     │                   the time loop
     │
     ▼  cython_wrapper   LLM (2 calls) → .pyx + kernel_c.h (iso_c_binding),
     │                   NumPy typed memoryviews, np.asfortranarray()
     │
     ▼  validation       gfortran × 2 flavors → nvfortran -acc (GPU)
     │                   Deterministic: compilation
     │
    output/fortran_gpu/module_kernels_gpu.f90
    output/cython/module.pyx
LLM budget per run
Four LLM calls maximum:
| Stage | Calls | Role |
|---|---|---|
| extractor | 1 | Refactor a monolithic […] |
| openacc | 1 | Insert OpenACC parallel-loop and data-region pragmas around the driver |
| cython_wrapper | 2 | Generate the […] |
At Mistral-Large tariffs this is roughly 0.06 USD per kernel and about two minutes wall-clock. Loki carries the deterministic AST work; the LLM only intervenes where semantic understanding is required.
State shape
The pipeline state is a typed dict carried through the LangGraph nodes:
- fortran_filepath, fortran_code: inputs.
- ast_info: Loki AST summary.
- module_fortran, driver_fortran, kernel_names: after extraction.
- pure_elemental_fortran, openacc_fortran: after purity and pragma passes.
- cython_pyx, cython_header, cython_setup: wrapper artifacts.
- validation_passed, validation_log: final compiler outcome.
Each node reads what it needs and writes only its own keys, so individual stages can be replayed without rerunning the full pipeline.
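As an illustration of the "reads what it needs, writes only its own keys" contract, the state can be sketched as a `TypedDict`. The key names come from the list above; the value types, the `pure_elemental_node` body (a toy string replacement standing in for the real AST rules), and the `apply_update` helper are assumptions made for this sketch, not the project's actual code.

```python
from typing import TypedDict


class PipelineState(TypedDict, total=False):
    """Keys mirror the documented state; the value types are assumptions."""
    fortran_filepath: str
    fortran_code: str
    ast_info: dict
    module_fortran: str
    driver_fortran: str
    kernel_names: list[str]
    pure_elemental_fortran: str
    openacc_fortran: str
    cython_pyx: str
    cython_header: str
    cython_setup: str
    validation_passed: bool
    validation_log: str


def pure_elemental_node(state: PipelineState) -> PipelineState:
    """Hypothetical node: reads module_fortran, returns only its own key."""
    # Toy placeholder for the AST-based annotation pass.
    annotated = state["module_fortran"].replace("subroutine", "pure subroutine", 1)
    return {"pure_elemental_fortran": annotated}


def apply_update(state: PipelineState, update: PipelineState) -> PipelineState:
    """Merge a node's partial update into a copy of the state."""
    return {**state, **update}
```

Because a node returns only a partial update, replaying one stage amounts to calling its function on a saved state and merging the result, which is the same pattern LangGraph applies when a node returns a partial dict.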
Human-in-the-loop
Every intermediate artifact is written to disk before the next stage runs, so a reviewer can inspect (or hand-edit) the extracted module, the OpenACC driver, or the Cython wrapper between stages. Re-running the pipeline from an existing intermediate file skips the LLM call for that stage.
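The skip-if-present behaviour can be sketched as follows. This is a minimal illustration, not the project's implementation: `run_stage` and its `generate` callback (which stands in for the LLM call) are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def run_stage(artifact: Path, generate: Callable[[], str]) -> str:
    """Return the stage artifact, calling `generate` only if the file is absent."""
    if artifact.exists():
        # A reviewer may have hand-edited this file; reuse it as-is.
        return artifact.read_text(encoding="utf-8")
    text = generate()  # stands in for the LLM call of this stage
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text(text, encoding="utf-8")
    return text
```

With this shape, deleting an intermediate file is enough to force the corresponding stage to regenerate it on the next run.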
Where the LLM intervenes (and where it does not)
Fortran 90/2003 is a structured domain: the grammar is closed and fully
parseable, OpenACC is a versioned standard with a finite directive
vocabulary, the iso_c_binding mapping between Fortran types and C is
mechanical, and validation reduces to compiling with two reference
toolchains. Most of the transformation work is therefore performed by
deterministic AST and template rules; five of the eight pipeline stages
(the validation node is split into its three steps here) are LLM-free:
| Stage | Tool | LLM call? | What it does |
|---|---|---|---|
| 1. Parse | Loki | No | AST extraction, […] |
| 2. Extract | LLM | Yes | Lift kernels from monolithic […] |
| 3. Purity | AST rules | No | Annotate […] |
| 4. OpenACC | LLM | Yes | Insert […] |
| 5. Cython | LLM (×2) | Yes | Generate […] |
| 6. CPU validation | gfortran | No | Compile original and OpenACC variants; assert syntactic correctness |
| 7. GPU validation | nvfortran -acc | No | Compile the OpenACC variant for the target architecture |
| 8. Equivalence | Test harness | No | Run both binaries on a deterministic input, assert […] |
LLM intervention is bounded to the three semantic edges where
deterministic rules cannot infer programmer intent. Stage 2 must
decide what constitutes a kernel inside a five-thousand-line
PROGRAM and how to expose its hidden state; stage 4 must place the
!$acc data region at the temporal granularity that minimises
host-device traffic without breaking the time-step dependency;
stage 5 must map Fortran OPTIONAL arguments and array descriptors
to a Cython interface that preserves the column-major NumPy view. A
structured rule-based system either fails on these tasks (no
closed-form rule covers the diversity of production codes) or
requires so many special cases that the rule base becomes a
maintenance liability.
The pipeline distinguishes two model roles. The reasoning role
(kernel extraction, stage 2) defaults to Mistral Large 2. The
code-generation role (stages 4 and 5: OpenACC pragma insertion, Cython
wrapping, docstring synthesis) defaults to Codestral, a smaller
code-specialised model with fill-in-the-middle training. The role
assignment is overridable through the MISTRAL_MODEL_REASONING and
MISTRAL_MODEL_CODE environment variables.
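A minimal sketch of how such an override could be resolved. The two environment variable names come from the text; the default model identifiers and the `model_for` helper are assumptions made for illustration, and the project's actual defaults may be spelled differently.

```python
import os

# Assumed default model identifiers; the project's real defaults may differ.
DEFAULT_MODELS = {
    "reasoning": "mistral-large-latest",
    "code": "codestral-latest",
}

# Environment variables named in the text, one per role.
ENV_VARS = {
    "reasoning": "MISTRAL_MODEL_REASONING",
    "code": "MISTRAL_MODEL_CODE",
}


def model_for(role: str) -> str:
    """Return the model for a role, preferring the environment override."""
    return os.environ.get(ENV_VARS[role]) or DEFAULT_MODELS[role]
```

For example, exporting `MISTRAL_MODEL_CODE` before a run would change the model used by stages 4 and 5 while leaving the extraction stage on its default.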
Why an agent, not a one-shot prompt
A single-shot LLM prompt is insufficient because the three LLM stages
are not statistically independent. A wrong kernel boundary at stage 2
propagates into wrong !$acc data clauses at stage 4 and an
incompatible Cython signature at stage 5; a wrong OpenACC layout at
stage 4 causes an nvfortran -acc failure at stage 7 that the LLM
cannot diagnose unless the compiler log is fed back into its context.
fortranspire therefore orchestrates the eight stages with LangGraph:
each stage reads from and writes to a typed state dictionary, and
validation stages 6–8 can route the pipeline back to stage 4 (or
stage 2) with the compiler log appended to the LLM context, capped at
three retries to bound token spend.
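The feedback loop can be sketched in plain Python, without the LangGraph API. The `generate` and `validate` callbacks are hypothetical stand-ins for an LLM stage and a compiler run; only the cap of three retries and the idea of feeding the compiler log back come from the text.

```python
from typing import Callable, Optional

MAX_RETRIES = 3  # retry cap stated in the text


def generate_with_feedback(
    generate: Callable[[Optional[str]], str],
    validate: Callable[[str], tuple[bool, str]],
) -> tuple[str, bool, int]:
    """Generate code, validate it, and retry with the compiler log on failure.

    Returns (candidate, validation_passed, retries_used).
    """
    log: Optional[str] = None
    candidate = generate(log)  # initial attempt, no compiler log yet
    for retries in range(MAX_RETRIES + 1):
        ok, log = validate(candidate)
        if ok or retries == MAX_RETRIES:
            return candidate, ok, retries
        candidate = generate(log)  # retry with the log appended to the context
    raise AssertionError("unreachable")
```

In this sketch a stage is generated at most four times (one initial attempt plus three retries), and the last candidate is returned with `validation_passed` set to false if every attempt fails.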
Why this sequence?
The pipeline follows an activation order: each stage makes the next one possible. The order is not arbitrary.
Stage 1 - Extraction: monolithic → modular
Codes like seismic_CPML_2D are monolithic PROGRAMs with inline FD
loops and no explicit INTENT.
- Without explicit INTENT → impossible to decide copyin (read-only) vs copy (modified in-place) for OpenACC.
- Without separate subroutines → OpenACC can't target the right loops.
- Without a MODULE → Cython can't emit a clean cdef extern.
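The first point can be made concrete with a small sketch that derives an `!$acc data` directive from declared intents. The `in` → `copyin` and `inout` → `copy` pairs follow the text; mapping `out` to `copyout` and the helper name `acc_data_directive` are assumptions added for illustration. A real pass would also have to aggregate intents across every kernel called inside the data region.

```python
from typing import Optional

# "in" and "inout" follow the text; "out" -> "copyout" is an added assumption.
INTENT_TO_CLAUSE = {"in": "copyin", "inout": "copy", "out": "copyout"}


def acc_data_directive(intents: dict[str, Optional[str]]) -> str:
    """Build an `!$acc data` directive from {array_name: intent}."""
    groups: dict[str, list[str]] = {}
    for name, intent in intents.items():
        if intent is None:
            # Without an explicit INTENT the clause cannot be chosen.
            raise ValueError(f"{name}: no explicit INTENT, cannot choose a data clause")
        clause = INTENT_TO_CLAUSE[intent.lower()]
        groups.setdefault(clause, []).append(name)
    clauses = [f"{clause}({','.join(names)})" for clause, names in groups.items()]
    return "!$acc data " + " ".join(clauses)
```

Calling it with `{"rho": "in", "vx": "inout"}` yields `!$acc data copyin(rho) copy(vx)`, while an array whose intent is `None` raises an error instead of guessing.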
Stage 2 - PURE/ELEMENTAL: side-effects → pure functions
| Property | GPU relevance | JAX relevance |
|---|---|---|
| No I/O | I/O doesn't execute on device | Same |
| No SAVE | No hidden state → independent threads | Required for […] |
| Explicit INTENT | Determines […] | Determines JAX arguments |
| Determinism | Result identical regardless of thread order | Required for […] |
Stage 3 - OpenACC: complete pattern for a 2D FD stencil

    ! Kernel: !$acc parallel loop collapse(2), PURE removed
    ! (dp is assumed to be defined by the enclosing module)
    subroutine update_velocity_x(vx, sigma_xx, sigma_xy, rho, DELTAX, DELTAY, DELTAT, NX, NY)
      integer, intent(in) :: NX, NY
      real(dp), intent(in) :: sigma_xx(NX,NY), sigma_xy(NX,NY), rho(NX,NY)
      real(dp), intent(inout) :: vx(NX,NY)
      real(dp), intent(in) :: DELTAX, DELTAY, DELTAT
      integer :: i, j  ! loop indices, private by default in a parallel loop
      real(dp) :: value_dsigma_xx_dx, value_dsigma_xy_dy  ! scalars → private()
      !$acc parallel loop collapse(2) private(value_dsigma_xx_dx, value_dsigma_xy_dy)
      do j = 2, NY
        do i = 2, NX
          value_dsigma_xx_dx = (sigma_xx(i,j) - sigma_xx(i-1,j)) / DELTAX
          value_dsigma_xy_dy = (sigma_xy(i,j) - sigma_xy(i,j-1)) / DELTAY
          vx(i,j) = vx(i,j) + (value_dsigma_xx_dx + value_dsigma_xy_dy) * DELTAT / rho(i,j)
        enddo
      enddo
      !$acc end parallel loop
    end subroutine update_velocity_x

    ! Driver: !$acc data ONCE around the 2000 time steps
    !$acc data copyin(lambda,mu,rho,b_x,a_x,K_x,...) &
    !$acc copy(vx,vy,sigma_xx,sigma_yy,sigma_xy,memory_dvx_dx,...)
    do it = 1, NSTEP
      call update_stress_xx_yy(...)
      call update_velocity_x(...)
      if (mod(it, IT_DISPLAY) == 0) then
        !$acc update host(vx, vy)  ! pull back to CPU for display only
        print *, 'velocnorm =', maxval(sqrt(vx**2 + vy**2))
      endif
    enddo
    !$acc end data

Expected gain: NX=101, NY=641, NSTEP=2000 → ~10 s CPU → ~0.1 s A100 (×100).
Stage 4 - Cython → Python without copy

    import numpy as np
    import seismic_cpml_2d_gpu as gpu_module

    NX, NY = 101, 641  # grid size from the example above
    vx = np.asfortranarray(np.zeros((NX, NY)))  # column-major = Fortran layout
    gpu_module.update_velocity_x(vx, sigma_xx, sigma_xy, rho, ...)
    # Typed memoryviews = direct NumPy buffer access, zero copy

Phase 2 - JAX: PURE subroutines become JAX functions directly
| Fortran | JAX equivalent |
|---|---|
| […] | […] |
| […] | JAX argument (immutable) |
| […] | Returned value |
| Independent […] | […] |
| […] | […] |
| […] | […] |
The translation from Fortran PURE to JAX is mechanical: PURE
subroutines are pure mathematical functions, which is exactly what JAX
compiles to XLA.