Fortran patterns & transformation rulesΒΆ

fortranspire combines deterministic Loki AST analysis with targeted LLM calls to handle the recurring patterns in legacy scientific Fortran 90. This page documents each pattern with before/after code, and the transformation rule applied to OpenACC (Phase 1) and JAX (Phase 2) targets.

Summary tableΒΆ

#

Pattern

What the pipeline does

1

Missing INTENT

Inferred from data-flow; `INTENT(IN

2

SAVE (implicit or explicit)

Promoted to explicit INTENT(INOUT) argument

3

COMMON blocks

Replaced by explicit argument list; driver passes through

4

POINTER / TARGET

Detected; flagged when they prevent PURE/ELEMENTAL

5

Implicit typing

IMPLICIT NONE enforced; explicit KIND for all reals

6

AoS β†’ SoA + collapse

Derived-type arrays split; loop nests fused

7

Stencil vs recurrence

Stencil β†’ parallel loop collapse(2); recurrence β†’ lax.scan

8

ELEMENTAL + !$acc routine seq

Pointwise functions made callable from GPU loops

9

Module-private state

Surfaced as INTENT(INOUT) at the kernel boundary

10

LOGICAL PARAMETER flags

Converted to #ifdef blocks for compile-time elimination

11

MPI halo exchange

Phase 3 target β€” GHEX GPU-to-GPU (planned)

12

Fortran I/O

Phase 4 target β€” xarray/zarr + DLPack (planned)


1. INTENT β€” the key to everything elseΒΆ

INTENT defines the contract of each argument. Without explicit INTENT, neither OpenACC nor JAX can work correctly.

! Before β€” implicit INTENT, total ambiguity
subroutine update_stress(vx, sigma_xx, NX)
  double precision vx(NX), sigma_xx(NX)   ! IN or INOUT?

! After β€” explicit INTENT, clear contract
subroutine update_stress(vx, sigma_xx, NX)
  integer,          intent(in)    :: NX
  double precision, intent(in)    :: vx(NX)
  double precision, intent(inout) :: sigma_xx(NX)

Transformation rules:

INTENT

OpenACC

JAX

IN

copyin(arr) β€” copied once to GPU before the loop

Immutable argument

INOUT

copy(arr) β€” bidirectional sync

Returned by the function

OUT

copyout(arr) β€” pulled back after compute

Return value

Not declared

⚠️ Loki infers from reads/writes

⚠️ Blocking β€” must be resolved


2. COMMON blocks β€” the global state to eliminateΒΆ

COMMON is a memory block shared between routines β€” the enemy of both GPU and JAX.

! Before β€” COMMON block, implicit global state
COMMON /grid/ dx, dy, NX, NY
COMMON /fields/ vx(1000,1000), sigma_xx(1000,1000)

subroutine update_stress()
  ! vx and sigma_xx are accessible implicitly
  sigma_xx(i,j) = sigma_xx(i,j) + vx(i,j) * dx
end subroutine

! After β€” explicit arguments inside a MODULE
MODULE seismic_kernels
contains
  subroutine update_stress(vx, sigma_xx, dx, NX, NY)
    integer,          intent(in)    :: NX, NY
    double precision, intent(in)    :: dx, vx(NX,NY)
    double precision, intent(inout) :: sigma_xx(NX,NY)
    sigma_xx(i,j) = sigma_xx(i,j) + vx(i,j) * dx
  end subroutine
END MODULE

Transformation rules:

  • OpenACC: impossible to annotate copyin/copy on COMMON β†’ extract as explicit arguments.

  • JAX: no notion of mutable global state β†’ everything must be an argument or return.

  • Action: the extractor agent replaces every COMMON with INTENT(IN|INOUT) arguments in the generated MODULE.


3. SAVE β€” hidden state between callsΒΆ

SAVE preserves a local variable’s value across calls β€” hidden state.

! Before β€” SAVE variable, hidden state between calls
subroutine update_memory(dvx_dx)
  real, save :: psi_vx = 0.0    ! initialised once, persists
  psi_vx = b_x * psi_vx + a_x * dvx_dx
  dvx_dx = dvx_dx / K_x + psi_vx
end subroutine

! After β€” state passed explicitly
subroutine update_memory(dvx_dx, psi_vx, b_x, a_x, K_x)
  real, intent(inout) :: psi_vx   ! state exposed, caller-managed
  real, intent(inout) :: dvx_dx
  real, intent(in)    :: b_x, a_x, K_x
  psi_vx = b_x * psi_vx + a_x * dvx_dx
  dvx_dx = dvx_dx / K_x + psi_vx
end subroutine

Transformation rules:

  • OpenACC: a SAVE variable per GPU thread β†’ race condition. Must become INTENT(INOUT) or a thread-indexed array.

  • JAX: SAVE breaks functional purity β†’ jax.lax.scan handles state across iterations.

  • Action: the extractor lifts SAVE variables to INTENT(INOUT) arguments; subroutines containing SAVE cannot be annotated PURE.


4. POINTER β€” dangerous aliasing for GPUΒΆ

Fortran pointers can reference arbitrary memory β€” incompatible with !$acc data clauses that require concrete arrays of known size.

! Before β€” pointer with potential aliasing
real, pointer :: field(:,:)
field => vx    ! or sigma_xx depending on context

! After option A β€” replace with allocatable (if ownership is clear)
real, allocatable :: field(:,:)
allocate(field(NX, NY))

! After option B β€” pass the target directly as argument INTENT(IN)
subroutine process(field, NX, NY)
  real, intent(inout) :: field(NX, NY)

Transformation rules:

  • OpenACC: pointers work if the target is known and unique, but !$acc data requires a concrete sized array β†’ prefer allocatable.

  • JAX: no pointers β€” replace with array slices (jnp.array[i:j]).

  • Action: Loki detects pointer-target associations; if the target is static, the agent replaces with allocatable or direct argument.


5. Array of Structures β†’ Structure of Arrays (AoS β†’ SoA + collapse)ΒΆ

Derived-type arrays in Fortran are stored as Array of Structures (AoS): the fields of one element are contiguous in memory. On GPU, all threads of a warp access the same field on different elements β€” AoS forces non-coalesced accesses.

! AoS β€” bad for GPU (non-coalesced access)
type :: point_t
  real :: x, y, vx, vy
end type
type(point_t) :: particles(N)

do i = 1, N
  particles(i)%vx = particles(i)%vx + particles(i)%x * dt   ! thread i jumps 4 reals at a time
end do

! SoA β€” optimal GPU (coalesced access, column-major Fortran)
real :: x(N), y(N), vx(N), vy(N)

!$acc parallel loop
do i = 1, N
  vx(i) = vx(i) + x(i) * dt   ! contiguous threads β†’ one coalesced memory access
end do

For 2D loops β€” collapse(2):

Without collapse, only the outer j is parallelised (NY threads). With collapse(2), both loops fuse into NXΓ—NY independent threads β€” full GPU utilisation.

! Without collapse β€” only NY threads
!$acc parallel loop
do j = 2, NY
  do i = 2, NX                  ! inner loop stays sequential in each thread
    sigma_xx(i,j) = sigma_xx(i,j) + ...
  end do
end do

! With collapse(2) β€” NXΓ—NY threads, all cells in parallel
!$acc parallel loop collapse(2) private(tmp_dx, tmp_dy)
do j = 2, NY
  do i = 2, NX
    sigma_xx(i,j) = sigma_xx(i,j) + ...
  end do
end do

Transformation rules:

Source

OpenACC

JAX

type(t) :: arr(N) (AoS)

Split into scalar SoA arrays

pytree or separate jnp.array fields

Independent 2D do j; do i

!$acc parallel loop collapse(2)

jax.vmap on two axes or implicit vectorisation

2D loop with stencil (i-1,j)

collapse(2) OK if i-1 is from an array already on GPU

Same β€” JAX accesses a[i-1] as slice

⚠️ Fortran is column-major. Dimension i varies fastest in memory. For coalesced GPU access, the inner loop must iterate over i (dimension 1) β€” typical of FD stencils.


6. Nested dependencies β€” non-parallelisable loopsΒΆ

A loop is parallelisable only if each iteration is independent. Dependencies on i-1 in the same dimension break this.

! Case 1 β€” FD stencil (dependency on i-1 of ANOTHER array) β†’ parallelisable
!$acc parallel loop collapse(2)
do j = 2, NY
  do i = 2, NX
    vx(i,j) = vx(i,j) + (sigma_xx(i,j) - sigma_xx(i-1,j)) / dx  ! sigma_xx is read-only
  end do
end do

! Case 2 β€” recurrence on same array β†’ NOT parallelisable
do i = 2, N
  a(i) = coeff * a(i-1) + source(i)   ! a(i) depends on a(i-1) from previous step

! Case 3 β€” time loop (temporal dependency) β†’ sequential on host
do it = 1, NSTEP
  call update_stress(...)    ! state it+1 depends on state it
  call update_velocity(...)
end do

Transformation strategies:

Dependency type

OpenACC

JAX

FD stencil a(i,j) ← b(i-1,j) (different arrays)

!$acc parallel loop collapse(2) βœ…

jax.vmap or implicit vectorisation βœ…

Recurrence a(i) = f(a(i-1)) (same array)

❌ Not parallelisable β€” keep sequential or reformulate

jax.lax.scan βœ…

Time loop u(t+1) = f(u(t))

!$acc data around the loop (GPU kernels, time loop on host)

jax.lax.scan with carry βœ…

Reduction sum += a(i)

!$acc loop reduction(+:sum) βœ…

jnp.sum(a) βœ…

Worked example β€” time loop to JAX:

! Fortran β€” sequential time loop on host, GPU kernels in parallel
!$acc data copyin(lambda,rho) copy(vx,vy,sigma_xx)
do it = 1, NSTEP                                ! sequential host β€” temporal dependency
  call update_stress(vx, sigma_xx, ...)         ! GPU kernel (2D collapse)
  call update_velocity(sigma_xx, vx, ...)       ! GPU kernel (2D collapse)
end do
!$acc end data
# JAX β€” time loop β†’ jax.lax.scan (jit-compiled, differentiable)
def time_step(carry, _):
    vx, vy, sigma_xx, sigma_yy, sigma_xy = carry
    sigma_xx, sigma_yy = update_stress(vx, vy, sigma_xx, sigma_yy, ...)
    vx, vy = update_velocity(sigma_xx, sigma_yy, sigma_xy, vx, vy, ...)
    return (vx, vy, sigma_xx, sigma_yy, sigma_xy), None

# Launch NSTEP iterations in one XLA-compiled call
(vx_f, vy_f, *_), _ = jax.lax.scan(time_step, init_carry, xs=None, length=NSTEP)

JAX advantage: jax.lax.scan is differentiable β€” jax.grad(loss)(params) backpropagates through all NSTEP iterations. Useful for seismic inversion (FWI) or surrogate training.


7. ELEMENTAL + OpenACC β€” the right patternΒΆ

An ELEMENTAL procedure cannot contain an OpenACC compute directive (!$acc parallel, !$acc kernels) β€” same constraint as PURE. But it’s perfect for !$acc routine seq: it runs sequentially in each GPU thread, called from a parent !$acc parallel loop.

! Correct pattern β€” ELEMENTAL + !$acc routine seq
ELEMENTAL function pml_update(psi, field_deriv, b, a, K) result(corrected)
  !$acc routine seq           ! ← allowed inside ELEMENTAL (not a compute construct)
  real(dp), intent(in) :: psi, field_deriv, b, a, K
  real(dp) :: psi_new, corrected
  psi_new   = b * psi + a * field_deriv
  corrected = field_deriv / K + psi_new
end function

! The parallel loop is in the PARENT routine β€” not in the ELEMENTAL
subroutine update_velocity_x(vx, sigma_xx, psi_dvx, b_x, a_x, K_x, ...)
  !$acc parallel loop collapse(2) private(dvx_dx)
  do j = 2, NY
    do i = 2, NX
      dvx_dx   = (sigma_xx(i,j) - sigma_xx(i-1,j)) / dx
      dvx_dx   = pml_update(psi_dvx(i,j), dvx_dx, b_x(i), a_x(i), K_x(i))  ! ← GPU call
      vx(i,j)  = vx(i,j) + dvx_dx * dt / rho(i,j)
    end do
  end do
  !$acc end parallel
end subroutine

Allowed combinations:

Procedure

!$acc parallel loop inside

!$acc routine seq

Callable from GPU

Standard SUBROUTINE

βœ…

βœ…

With !$acc routine

PURE SUBROUTINE

❌ (standard)

βœ…

With !$acc routine seq

ELEMENTAL FUNCTION

❌

βœ… ← correct usage

βœ… from parallel loop

ELEMENTAL SUBROUTINE

❌

βœ…

βœ… from parallel loop

πŸ’‘ Rule: ELEMENTAL is the right candidate for pointwise computations (one stencil point, a PML correction, a source term). The !$acc parallel loop collapse(2) stays in the parent routine iterating over all points.


8. Explicit types β€” no compiler-inferred mixed precisionΒΆ

Fortran allows implicit declarations and silent promotions. On GPU, this ambiguity translates to mixed fp32/fp64 instructions and costly inter-register conversions.

Absolute rule before translation: IMPLICIT NONE + explicit KIND-tagged types.

! Before β€” precision left to the compiler
REAL dx, dy                      ! 32 or 64 bits depending on -r8 / -fdefault-real-8?
DOUBLE PRECISION vx(NX, NY)      ! portable but stylistically inconsistent
REAL*8 sigma_xx(NX, NY)          ! non-standard extension (GCC/Intel only)
INTEGER NX                       ! OK β€” 32-bit integers by default

! After β€” precision declared explicitly via KIND parameter
integer, parameter :: dp = selected_real_kind(15, 307)   ! IEEE 754 double (64-bit)
integer, parameter :: sp = selected_real_kind(6,  37)    ! IEEE 754 single (32-bit)

real(dp) :: dx, dy               ! 64-bit everywhere β€” consistent with nvfortran -acc
real(dp) :: vx(NX, NY)
real(dp) :: sigma_xx(NX, NY)
integer  :: NX, NY               ! 32-bit integer β€” correct

Mixed precision β€” when it’s intentional:

On A100, fp32 ops are 2Γ— faster than fp64. Hybrid codes may use sp for working arrays and dp for accumulation:

! Explicit mixed precision β€” compiler infers nothing
real(sp), intent(in)    :: source_term(NX, NY)  ! low-precision input (sensors)
real(dp), intent(inout) :: accumulated(NX, NY)  ! high-precision accumulation

! Explicit conversion required (never let the compiler silently promote)
accumulated(i,j) = accumulated(i,j) + real(source_term(i,j), dp)

Transformation rules:

Source pattern

Transformation

Note

REAL x

real(dp) :: x

Assume dp unless told otherwise

DOUBLE PRECISION x

real(dp) :: x

Normalise the style

REAL*8 x

real(dp) :: x

Non-standard extension β†’ portable

REAL*4 x

real(sp) :: x

Explicit if intended

COMPLEX x

complex(dp) :: x

Visco-elastic, complex acoustics

Implicit promotion

real(x, dp) explicit

Never leave x + 1.0 if x is dp

Literals

1.0_dp instead of 1.0d0

Consistent with KIND parameter

For JAX: jnp.float64 by default; force with jax.config.update("jax_enable_x64", True). Mixed precision possible with explicit x.astype(jnp.float32).


9. Logical flags USE_xx β†’ compile-time directivesΒΆ

Scientific Fortran codes often use LOGICAL PARAMETER as feature switches:

LOGICAL, PARAMETER :: USE_PML        = .TRUE.
LOGICAL, PARAMETER :: USE_ATTENUATION = .FALSE.
LOGICAL, PARAMETER :: SAVE_SNAPSHOTS  = .TRUE.

GPU problem: even if these constants are compile-time, if (USE_PML) branches inside a !$acc parallel loop generate dead code that some compilers don’t eliminate cleanly β†’ potential warp divergence.

Recommended transformation β€” CPP preprocessor:

! kernel.F90 (extension .F90 = automatic preprocessing with nvfortran/gfortran)

#ifdef USE_PML
  ! PML memory correction β€” compiled only if -DUSE_PML
  memory_dvx_dx(i,j) = b_x(i) * memory_dvx_dx(i,j) + a_x(i) * dvx_dx
  dvx_dx = dvx_dx / K_x(i) + memory_dvx_dx(i,j)
#endif
#ifdef USE_ATTENUATION
  sigma_xx(i,j) = sigma_xx(i,j) - tau_sigma * memory_sigma(i,j)
#endif
# Compile with features enabled
nvfortran -acc -gpu=cc80 -cpp \
  -DUSE_PML \
  -o seismic_gpu kernel.F90

# Variant without PML (comparative benchmark)
nvfortran -acc -gpu=cc80 -cpp \
  -o seismic_gpu_nopml kernel.F90

Multi-target equivalences:

Fortran source

OpenACC / nvfortran

JAX

LOGICAL, PARAMETER :: USE_PML = .TRUE.

#define USE_PML β†’ -DUSE_PML

USE_PML = True (Python constant)

if (USE_PML) then ... end if

#ifdef USE_PML ... #endif

if USE_PML: ... (jit-trace time)

if (USE_PML) inside parallel loop

#ifdef β†’ dead-code eliminated

jax.lax.cond(USE_PML, f_pml, f_nopml, args) if differentiable

Multi-valued flag INTEGER, PARAMETER :: SCHEME = 2

#if SCHEME == 2 ... #endif

if SCHEME == 2: ... at trace time

# JAX β€” Python flags are evaluated once at jit-trace, not on each iteration
USE_PML = True

@jax.jit
def update_velocity(vx, sigma_xx, psi_dvx, ...):
    dvx_dx = (sigma_xx[i,j] - sigma_xx[i-1,j]) / dx
    if USE_PML:               # ← evaluated ONCE at jit, not per iteration
        psi_dvx = b_x * psi_dvx + a_x * dvx_dx
        dvx_dx  = dvx_dx / K_x + psi_dvx
    return vx + dvx_dx * dt / rho[i,j], psi_dvx

# If USE_PML must be differentiable β†’ jax.lax.cond
dvx_dx, psi = jax.lax.cond(
    use_pml_flag,
    lambda args: pml_correction(*args),
    lambda args: (args[0], args[1]),
    (dvx_dx, psi_dvx),
)

⚠️ Agent action: Loki detects LOGICAL PARAMETER with pattern USE_* or APPLY_*. The extractor converts them into #ifdef blocks in the generated .F90 and documents active flags in a header.


10. MPI halo exchange β†’ GHEX (GPU-to-GPU)ΒΆ

⚠️ Phase 3 scope β€” not yet implemented. Documented for planning.

Multi-domain MPI codes exchange halos (ghost bands) between processes at each time step. In the classical scheme, these exchanges go through CPU memory β€” even if the arrays are on GPU:

GPU (proc 0)          CPU                GPU (proc 1)
   vx_local  ──acc update host──►  vx_host  ──MPI_Send──►  vx_host  ──acc update device──►  vx_local
   (device)          ↑                                                        ↓
                CPU roundtrip                                          CPU roundtrip

Cost: 2Γ— PCIe transfers + MPI latency per time step β†’ cancels most GPU gain on multi-node clusters.

Solution β€” GHEX (GridTools, ETH ZΓΌrich): direct GPU-to-GPU exchanges via RDMA (NVLink or InfiniBand + CUDA-aware MPI), without CPU roundtrip.

! Current pattern β€” CPU halo exchange (expensive roundtrip)
!$acc update host(vx, vy)                              ! GPU β†’ CPU
call MPI_Sendrecv(vx_send, ..., vx_recv, ..., MPI_COMM_WORLD, ...)
!$acc update device(vx, vy)                            ! CPU β†’ GPU

! GHEX pattern β€” GPU-to-GPU halo exchange (Phase 3)
! GHEX handles the exchange on device directly
call ghex_exchange(vx_field, vy_field, context)        ! GPU-to-GPU RDMA
! No CPU roundtrip β€” next kernels see up-to-date halos on device
# Python/Cython side β€” GHEX interface (Phase 3)
import ghex

ctx     = ghex.context(MPI.COMM_WORLD, thread_safe=False)
pattern = ghex.structured_pattern(ctx, domain, halo_width=1)

# In the time loop β€” transparent GPU-to-GPU exchange
pattern.exchange(vx_field, vy_field).wait()
update_stress(vx, vy, sigma_xx, ...)

Transformation rules:

Source pattern

OpenACC + MPI (Phase 1)

OpenACC + GHEX (Phase 3)

MPI_Sendrecv after update_stress

!$acc update host + MPI + !$acc update device

ghex.exchange().wait()

Shared halo arrays

INTENT(INOUT) + CPU sync

INTENT(INOUT) + GPU sync

Compute/comm overlap

❌ Sequential

βœ… async exchange()

Expected gain: 3–10Γ— communication reduction on multi-GPU InfiniBand clusters.


11. Fortran I/O β†’ xarray / zarr + DLPackΒΆ

⚠️ Phase 4 scope β€” not yet implemented. Documented for planning.

Classical Fortran I/O (WRITE, OPEN, PostScript) produces proprietary binary files or text incompatible with the modern data science ecosystem.

Problem: seismic codes write .pnm images and .dat seismograms β€” unreadable directly by Pandas, xarray, or cloud visualisation tools.

A β€” DLPack: zero-copy between Fortran GPU and PythonΒΆ

DLPack is a tensor-sharing protocol between frameworks (CUDA, JAX, PyTorch, CuPy) without memory copy. The Cython wrapper exposes GPU arrays directly via DLPack:

# Current Phase 1 β€” CPU copy required
vx_np = np.asfortranarray(vx)              # GPU β†’ CPU β†’ NumPy copy

# Phase 4 target β€” zero-copy via DLPack
from __dlpack__ import from_dlpack
import cupy as cp

vx_gpu = from_dlpack(seismic_module.vx_dlpack())   # direct DLPack view on GPU memory
vx_jax = jax.dlpack.from_dlpack(vx_gpu)            # JAX array without copy
vx_cp  = cp.from_dlpack(vx_gpu)                    # CuPy array without copy
! Fortran side β€” device pointer exposure via iso_c_binding (Phase 4)
function vx_device_ptr(vx) result(ptr) bind(C, name="vx_device_ptr")
  use iso_c_binding
  real(dp), device, intent(in) :: vx(:,:)    ! device attribute (nvfortran)
  type(c_ptr) :: ptr
  ptr = c_loc(vx)
end function

B β€” xarray / zarr outputs (replaces WRITE / PostScript)ΒΆ

# Current β€” Fortran binary + PostScript files
! WRITE(unit=27,...) image_data_2D    β†’ .pnm files
! WRITE(unit=11,...) sisvx(it, irec)  β†’ .dat files

# Phase 4 β€” cloud-native xarray/zarr outputs
import xarray as xr, zarr, numpy as np

# Build a geophysical Dataset with coordinates
ds = xr.Dataset(
    {
        "vx":       (["x", "z", "time"], vx_history),      # velocity field
        "sigma_xx": (["x", "z", "time"], stress_history),  # normal stress
        "seismo_x": (["receiver", "time"], sisvx),          # seismograms
    },
    coords={
        "x":    np.arange(NX) * DELTAX,
        "z":    np.arange(NY) * DELTAY,
        "time": np.arange(NSTEP) * DELTAT,
    },
    attrs={"source_x": ISOURCE * DELTAX, "source_z": JSOURCE * DELTAY},
)

# Zarr write β€” compatible with S3-API object storage (Pangeo, Dask, OpenStack Swift)
ds.to_zarr("s3://seismic-results/run_001.zarr", mode="w")

# Read and visualise directly without conversion
import hvplot.xarray
ds["vx"].isel(time=100).hvplot(x="x", y="z", cmap="seismic")

Transformation rules:

Fortran source pattern

Phase 1 (Cython)

Phase 4 (xarray/zarr)

WRITE(unit,...) field(NX,NY)

In-memory NumPy array

xr.DataArray with geo coords

OPEN / WRITE / CLOSE .dat file

Python text file

Zarr dataset on S3-API object storage (e.g. OpenStack Swift)

.pnm image file (PostScript)

matplotlib.imshow

Interactive hvPlot / GeoViews

Per-receiver seismogram .dat

NumPy array

xr.DataArray indexed by receiver

Snapshot every N steps

Accumulated 3D array

Streaming Zarr append


See alsoΒΆ

  • Architecture β€” the activation-order rationale behind the multi-stage pipeline.

  • LLM endpoints β€” sovereign-EU and self-hosted inference options.

  • Open an issue if you hit a Fortran pattern not on this page β€” adding a new pattern is usually a one-stage change plus a fixture.