Fortran patterns & transformation rulesΒΆ
fortranspire combines deterministic Loki AST analysis with targeted LLM
calls to handle the recurring patterns in legacy scientific Fortran 90.
This page documents each pattern with before/after code, and the
transformation rule applied to OpenACC (Phase 1) and JAX (Phase 2)
targets.
Summary tableΒΆ
# |
Pattern |
What the pipeline does |
|---|---|---|
1 |
Missing |
Inferred from data-flow; `INTENT(IN |
2 |
|
Promoted to explicit |
3 |
|
Replaced by explicit argument list; driver passes through |
4 |
|
Detected; flagged when they prevent |
5 |
Implicit typing |
|
6 |
AoS β SoA + |
Derived-type arrays split; loop nests fused |
7 |
Stencil vs recurrence |
Stencil β |
8 |
|
Pointwise functions made callable from GPU loops |
9 |
Module-private state |
Surfaced as |
10 |
|
Converted to |
11 |
MPI halo exchange |
Phase 3 target β GHEX GPU-to-GPU (planned) |
12 |
Fortran I/O |
Phase 4 target β xarray/zarr + DLPack (planned) |
1. INTENT β the key to everything elseΒΆ
INTENT defines the contract of each argument. Without explicit
INTENT, neither OpenACC nor JAX can work correctly.
! Before β implicit INTENT, total ambiguity
subroutine update_stress(vx, sigma_xx, NX)
double precision vx(NX), sigma_xx(NX) ! IN or INOUT?
! After β explicit INTENT, clear contract
subroutine update_stress(vx, sigma_xx, NX)
integer, intent(in) :: NX
double precision, intent(in) :: vx(NX)
double precision, intent(inout) :: sigma_xx(NX)
Transformation rules:
INTENT |
OpenACC |
JAX |
|---|---|---|
|
|
Immutable argument |
|
|
Returned by the function |
|
|
Return value |
Not declared |
β οΈ Loki infers from reads/writes |
β οΈ Blocking β must be resolved |
2. COMMON blocks β the global state to eliminateΒΆ
COMMON is a memory block shared between routines β the enemy of both
GPU and JAX.
! Before β COMMON block, implicit global state
COMMON /grid/ dx, dy, NX, NY
COMMON /fields/ vx(1000,1000), sigma_xx(1000,1000)
subroutine update_stress()
! vx and sigma_xx are accessible implicitly
sigma_xx(i,j) = sigma_xx(i,j) + vx(i,j) * dx
end subroutine
! After β explicit arguments inside a MODULE
MODULE seismic_kernels
contains
subroutine update_stress(vx, sigma_xx, dx, NX, NY)
integer, intent(in) :: NX, NY
double precision, intent(in) :: dx, vx(NX,NY)
double precision, intent(inout) :: sigma_xx(NX,NY)
sigma_xx(i,j) = sigma_xx(i,j) + vx(i,j) * dx
end subroutine
END MODULE
Transformation rules:
OpenACC: impossible to annotate
copyin/copyonCOMMONβ extract as explicit arguments.JAX: no notion of mutable global state β everything must be an argument or return.
Action: the
extractoragent replaces everyCOMMONwithINTENT(IN|INOUT)arguments in the generatedMODULE.
4. POINTER β dangerous aliasing for GPUΒΆ
Fortran pointers can reference arbitrary memory β incompatible with
!$acc data clauses that require concrete arrays of known size.
! Before β pointer with potential aliasing
real, pointer :: field(:,:)
field => vx ! or sigma_xx depending on context
! After option A β replace with allocatable (if ownership is clear)
real, allocatable :: field(:,:)
allocate(field(NX, NY))
! After option B β pass the target directly as argument INTENT(IN)
subroutine process(field, NX, NY)
real, intent(inout) :: field(NX, NY)
Transformation rules:
OpenACC: pointers work if the target is known and unique, but
!$acc datarequires a concrete sized array β preferallocatable.JAX: no pointers β replace with array slices (
jnp.array[i:j]).Action: Loki detects pointer-target associations; if the target is static, the agent replaces with
allocatableor direct argument.
5. Array of Structures β Structure of Arrays (AoS β SoA + collapse)ΒΆ
Derived-type arrays in Fortran are stored as Array of Structures (AoS): the fields of one element are contiguous in memory. On GPU, all threads of a warp access the same field on different elements β AoS forces non-coalesced accesses.
! AoS β bad for GPU (non-coalesced access)
type :: point_t
real :: x, y, vx, vy
end type
type(point_t) :: particles(N)
do i = 1, N
particles(i)%vx = particles(i)%vx + particles(i)%x * dt ! thread i jumps 4 reals at a time
end do
! SoA β optimal GPU (coalesced access, column-major Fortran)
real :: x(N), y(N), vx(N), vy(N)
!$acc parallel loop
do i = 1, N
vx(i) = vx(i) + x(i) * dt ! contiguous threads β one coalesced memory access
end do
For 2D loops β collapse(2):
Without collapse, only the outer j is parallelised (NY threads).
With collapse(2), both loops fuse into NXΓNY independent threads β
full GPU utilisation.
! Without collapse β only NY threads
!$acc parallel loop
do j = 2, NY
do i = 2, NX ! inner loop stays sequential in each thread
sigma_xx(i,j) = sigma_xx(i,j) + ...
end do
end do
! With collapse(2) β NXΓNY threads, all cells in parallel
!$acc parallel loop collapse(2) private(tmp_dx, tmp_dy)
do j = 2, NY
do i = 2, NX
sigma_xx(i,j) = sigma_xx(i,j) + ...
end do
end do
Transformation rules:
Source |
OpenACC |
JAX |
|---|---|---|
|
Split into scalar SoA arrays |
|
Independent 2D |
|
|
2D loop with stencil |
|
Same β JAX accesses |
β οΈ Fortran is column-major. Dimension i varies fastest in memory. For coalesced GPU access, the inner loop must iterate over i (dimension 1) β typical of FD stencils.
6. Nested dependencies β non-parallelisable loopsΒΆ
A loop is parallelisable only if each iteration is independent.
Dependencies on i-1 in the same dimension break this.
! Case 1 β FD stencil (dependency on i-1 of ANOTHER array) β parallelisable
!$acc parallel loop collapse(2)
do j = 2, NY
do i = 2, NX
vx(i,j) = vx(i,j) + (sigma_xx(i,j) - sigma_xx(i-1,j)) / dx ! sigma_xx is read-only
end do
end do
! Case 2 β recurrence on same array β NOT parallelisable
do i = 2, N
a(i) = coeff * a(i-1) + source(i) ! a(i) depends on a(i-1) from previous step
! Case 3 β time loop (temporal dependency) β sequential on host
do it = 1, NSTEP
call update_stress(...) ! state it+1 depends on state it
call update_velocity(...)
end do
Transformation strategies:
Dependency type |
OpenACC |
JAX |
|---|---|---|
FD stencil |
|
|
Recurrence |
β Not parallelisable β keep sequential or reformulate |
|
Time loop |
|
|
Reduction |
|
|
Worked example β time loop to JAX:
! Fortran β sequential time loop on host, GPU kernels in parallel
!$acc data copyin(lambda,rho) copy(vx,vy,sigma_xx)
do it = 1, NSTEP ! sequential host β temporal dependency
call update_stress(vx, sigma_xx, ...) ! GPU kernel (2D collapse)
call update_velocity(sigma_xx, vx, ...) ! GPU kernel (2D collapse)
end do
!$acc end data
# JAX β time loop β jax.lax.scan (jit-compiled, differentiable)
def time_step(carry, _):
vx, vy, sigma_xx, sigma_yy, sigma_xy = carry
sigma_xx, sigma_yy = update_stress(vx, vy, sigma_xx, sigma_yy, ...)
vx, vy = update_velocity(sigma_xx, sigma_yy, sigma_xy, vx, vy, ...)
return (vx, vy, sigma_xx, sigma_yy, sigma_xy), None
# Launch NSTEP iterations in one XLA-compiled call
(vx_f, vy_f, *_), _ = jax.lax.scan(time_step, init_carry, xs=None, length=NSTEP)
JAX advantage: jax.lax.scan is differentiable β
jax.grad(loss)(params) backpropagates through all NSTEP iterations.
Useful for seismic inversion (FWI) or surrogate training.
7. ELEMENTAL + OpenACC β the right patternΒΆ
An ELEMENTAL procedure cannot contain an OpenACC compute
directive (!$acc parallel, !$acc kernels) β same constraint as
PURE. But itβs perfect for !$acc routine seq: it runs
sequentially in each GPU thread, called from a parent !$acc parallel loop.
! Correct pattern β ELEMENTAL + !$acc routine seq
ELEMENTAL function pml_update(psi, field_deriv, b, a, K) result(corrected)
!$acc routine seq ! β allowed inside ELEMENTAL (not a compute construct)
real(dp), intent(in) :: psi, field_deriv, b, a, K
real(dp) :: psi_new, corrected
psi_new = b * psi + a * field_deriv
corrected = field_deriv / K + psi_new
end function
! The parallel loop is in the PARENT routine β not in the ELEMENTAL
subroutine update_velocity_x(vx, sigma_xx, psi_dvx, b_x, a_x, K_x, ...)
!$acc parallel loop collapse(2) private(dvx_dx)
do j = 2, NY
do i = 2, NX
dvx_dx = (sigma_xx(i,j) - sigma_xx(i-1,j)) / dx
dvx_dx = pml_update(psi_dvx(i,j), dvx_dx, b_x(i), a_x(i), K_x(i)) ! β GPU call
vx(i,j) = vx(i,j) + dvx_dx * dt / rho(i,j)
end do
end do
!$acc end parallel
end subroutine
Allowed combinations:
Procedure |
|
|
Callable from GPU |
|---|---|---|---|
Standard |
β |
β |
With |
|
β (standard) |
β |
With |
|
β |
β β correct usage |
β from parallel loop |
|
β |
β |
β from parallel loop |
π‘ Rule:
ELEMENTALis the right candidate for pointwise computations (one stencil point, a PML correction, a source term). The!$acc parallel loop collapse(2)stays in the parent routine iterating over all points.
8. Explicit types β no compiler-inferred mixed precisionΒΆ
Fortran allows implicit declarations and silent promotions. On GPU, this ambiguity translates to mixed fp32/fp64 instructions and costly inter-register conversions.
Absolute rule before translation: IMPLICIT NONE + explicit
KIND-tagged types.
! Before β precision left to the compiler
REAL dx, dy ! 32 or 64 bits depending on -r8 / -fdefault-real-8?
DOUBLE PRECISION vx(NX, NY) ! portable but stylistically inconsistent
REAL*8 sigma_xx(NX, NY) ! non-standard extension (GCC/Intel only)
INTEGER NX ! OK β 32-bit integers by default
! After β precision declared explicitly via KIND parameter
integer, parameter :: dp = selected_real_kind(15, 307) ! IEEE 754 double (64-bit)
integer, parameter :: sp = selected_real_kind(6, 37) ! IEEE 754 single (32-bit)
real(dp) :: dx, dy ! 64-bit everywhere β consistent with nvfortran -acc
real(dp) :: vx(NX, NY)
real(dp) :: sigma_xx(NX, NY)
integer :: NX, NY ! 32-bit integer β correct
Mixed precision β when itβs intentional:
On A100, fp32 ops are 2Γ faster than fp64. Hybrid codes may use sp for working arrays and dp for accumulation:
! Explicit mixed precision β compiler infers nothing
real(sp), intent(in) :: source_term(NX, NY) ! low-precision input (sensors)
real(dp), intent(inout) :: accumulated(NX, NY) ! high-precision accumulation
! Explicit conversion required (never let the compiler silently promote)
accumulated(i,j) = accumulated(i,j) + real(source_term(i,j), dp)
Transformation rules:
Source pattern |
Transformation |
Note |
|---|---|---|
|
|
Assume dp unless told otherwise |
|
|
Normalise the style |
|
|
Non-standard extension β portable |
|
|
Explicit if intended |
|
|
Visco-elastic, complex acoustics |
Implicit promotion |
|
Never leave |
Literals |
|
Consistent with KIND parameter |
For JAX: jnp.float64 by default; force with
jax.config.update("jax_enable_x64", True). Mixed precision possible
with explicit x.astype(jnp.float32).
9. Logical flags USE_xx β compile-time directivesΒΆ
Scientific Fortran codes often use LOGICAL PARAMETER as feature
switches:
LOGICAL, PARAMETER :: USE_PML = .TRUE.
LOGICAL, PARAMETER :: USE_ATTENUATION = .FALSE.
LOGICAL, PARAMETER :: SAVE_SNAPSHOTS = .TRUE.
GPU problem: even if these constants are compile-time, if (USE_PML) branches inside a !$acc parallel loop generate dead code
that some compilers donβt eliminate cleanly β potential warp
divergence.
Recommended transformation β CPP preprocessor:
! kernel.F90 (extension .F90 = automatic preprocessing with nvfortran/gfortran)
#ifdef USE_PML
! PML memory correction β compiled only if -DUSE_PML
memory_dvx_dx(i,j) = b_x(i) * memory_dvx_dx(i,j) + a_x(i) * dvx_dx
dvx_dx = dvx_dx / K_x(i) + memory_dvx_dx(i,j)
#endif
#ifdef USE_ATTENUATION
sigma_xx(i,j) = sigma_xx(i,j) - tau_sigma * memory_sigma(i,j)
#endif
# Compile with features enabled
nvfortran -acc -gpu=cc80 -cpp \
-DUSE_PML \
-o seismic_gpu kernel.F90
# Variant without PML (comparative benchmark)
nvfortran -acc -gpu=cc80 -cpp \
-o seismic_gpu_nopml kernel.F90
Multi-target equivalences:
Fortran source |
OpenACC / nvfortran |
JAX |
|---|---|---|
|
|
|
|
|
|
|
|
|
Multi-valued flag |
|
|
# JAX β Python flags are evaluated once at jit-trace, not on each iteration
USE_PML = True
@jax.jit
def update_velocity(vx, sigma_xx, psi_dvx, ...):
dvx_dx = (sigma_xx[i,j] - sigma_xx[i-1,j]) / dx
if USE_PML: # β evaluated ONCE at jit, not per iteration
psi_dvx = b_x * psi_dvx + a_x * dvx_dx
dvx_dx = dvx_dx / K_x + psi_dvx
return vx + dvx_dx * dt / rho[i,j], psi_dvx
# If USE_PML must be differentiable β jax.lax.cond
dvx_dx, psi = jax.lax.cond(
use_pml_flag,
lambda args: pml_correction(*args),
lambda args: (args[0], args[1]),
(dvx_dx, psi_dvx),
)
β οΈ Agent action: Loki detects
LOGICAL PARAMETERwith patternUSE_*orAPPLY_*. Theextractorconverts them into#ifdefblocks in the generated.F90and documents active flags in a header.
10. MPI halo exchange β GHEX (GPU-to-GPU)ΒΆ
β οΈ Phase 3 scope β not yet implemented. Documented for planning.
Multi-domain MPI codes exchange halos (ghost bands) between processes at each time step. In the classical scheme, these exchanges go through CPU memory β even if the arrays are on GPU:
GPU (proc 0) CPU GPU (proc 1)
vx_local ββacc update hostβββΊ vx_host ββMPI_SendβββΊ vx_host ββacc update deviceβββΊ vx_local
(device) β β
CPU roundtrip CPU roundtrip
Cost: 2Γ PCIe transfers + MPI latency per time step β cancels most GPU gain on multi-node clusters.
Solution β GHEX (GridTools, ETH ZΓΌrich): direct GPU-to-GPU exchanges via RDMA (NVLink or InfiniBand + CUDA-aware MPI), without CPU roundtrip.
! Current pattern β CPU halo exchange (expensive roundtrip)
!$acc update host(vx, vy) ! GPU β CPU
call MPI_Sendrecv(vx_send, ..., vx_recv, ..., MPI_COMM_WORLD, ...)
!$acc update device(vx, vy) ! CPU β GPU
! GHEX pattern β GPU-to-GPU halo exchange (Phase 3)
! GHEX handles the exchange on device directly
call ghex_exchange(vx_field, vy_field, context) ! GPU-to-GPU RDMA
! No CPU roundtrip β next kernels see up-to-date halos on device
# Python/Cython side β GHEX interface (Phase 3)
import ghex
ctx = ghex.context(MPI.COMM_WORLD, thread_safe=False)
pattern = ghex.structured_pattern(ctx, domain, halo_width=1)
# In the time loop β transparent GPU-to-GPU exchange
pattern.exchange(vx_field, vy_field).wait()
update_stress(vx, vy, sigma_xx, ...)
Transformation rules:
Source pattern |
OpenACC + MPI (Phase 1) |
OpenACC + GHEX (Phase 3) |
|---|---|---|
|
|
|
Shared halo arrays |
|
|
Compute/comm overlap |
β Sequential |
β
async |
Expected gain: 3β10Γ communication reduction on multi-GPU InfiniBand clusters.
11. Fortran I/O β xarray / zarr + DLPackΒΆ
β οΈ Phase 4 scope β not yet implemented. Documented for planning.
Classical Fortran I/O (WRITE, OPEN, PostScript) produces
proprietary binary files or text incompatible with the modern data
science ecosystem.
Problem: seismic codes write .pnm images and .dat seismograms β
unreadable directly by Pandas, xarray, or cloud visualisation tools.
A β DLPack: zero-copy between Fortran GPU and PythonΒΆ
DLPack is a tensor-sharing protocol between frameworks (CUDA, JAX, PyTorch, CuPy) without memory copy. The Cython wrapper exposes GPU arrays directly via DLPack:
# Current Phase 1 β CPU copy required
vx_np = np.asfortranarray(vx) # GPU β CPU β NumPy copy
# Phase 4 target β zero-copy via DLPack
from __dlpack__ import from_dlpack
import cupy as cp
vx_gpu = from_dlpack(seismic_module.vx_dlpack()) # direct DLPack view on GPU memory
vx_jax = jax.dlpack.from_dlpack(vx_gpu) # JAX array without copy
vx_cp = cp.from_dlpack(vx_gpu) # CuPy array without copy
! Fortran side β device pointer exposure via iso_c_binding (Phase 4)
function vx_device_ptr(vx) result(ptr) bind(C, name="vx_device_ptr")
use iso_c_binding
real(dp), device, intent(in) :: vx(:,:) ! device attribute (nvfortran)
type(c_ptr) :: ptr
ptr = c_loc(vx)
end function
B β xarray / zarr outputs (replaces WRITE / PostScript)ΒΆ
# Current β Fortran binary + PostScript files
! WRITE(unit=27,...) image_data_2D β .pnm files
! WRITE(unit=11,...) sisvx(it, irec) β .dat files
# Phase 4 β cloud-native xarray/zarr outputs
import xarray as xr, zarr, numpy as np
# Build a geophysical Dataset with coordinates
ds = xr.Dataset(
{
"vx": (["x", "z", "time"], vx_history), # velocity field
"sigma_xx": (["x", "z", "time"], stress_history), # normal stress
"seismo_x": (["receiver", "time"], sisvx), # seismograms
},
coords={
"x": np.arange(NX) * DELTAX,
"z": np.arange(NY) * DELTAY,
"time": np.arange(NSTEP) * DELTAT,
},
attrs={"source_x": ISOURCE * DELTAX, "source_z": JSOURCE * DELTAY},
)
# Zarr write β compatible with S3-API object storage (Pangeo, Dask, OpenStack Swift)
ds.to_zarr("s3://seismic-results/run_001.zarr", mode="w")
# Read and visualise directly without conversion
import hvplot.xarray
ds["vx"].isel(time=100).hvplot(x="x", y="z", cmap="seismic")
Transformation rules:
Fortran source pattern |
Phase 1 (Cython) |
Phase 4 (xarray/zarr) |
|---|---|---|
|
In-memory NumPy array |
|
|
Python text file |
Zarr dataset on S3-API object storage (e.g. OpenStack Swift) |
|
|
Interactive hvPlot / GeoViews |
Per-receiver seismogram |
NumPy array |
|
Snapshot every N steps |
Accumulated 3D array |
Streaming Zarr append |
See alsoΒΆ
Architecture β the activation-order rationale behind the multi-stage pipeline.
LLM endpoints β sovereign-EU and self-hosted inference options.
Open an issue if you hit a Fortran pattern not on this page β adding a new pattern is usually a one-stage change plus a fixture.