Seeing Earth in HD

Why Indonesia needs affordable satellite imagery — and how AI makes it possible

Explanation Reference

The Resolution Problem

Indonesia's Spatial Planning Law (UU 26/2007) requires monitoring of building coverage across 500+ cities. But there's a problem: seeing buildings from space is expensive.

Commercial satellite imagery costs $5–25 per km²

Monitoring one medium city (400 km²) costs $2,000–10,000 per pass. With quarterly updates, that's $8,000–40,000/year per city.

Free Sentinel-2 imagery is too blurry

Each pixel covers 10×10 meters (100 m²). Most individual buildings are invisible because a typical house is smaller than a single pixel.

The impossible choice: affordable but blurry, or sharp but expensive

What if we could make the free imagery sharper using AI? That's exactly what super-resolution does.

Two Satellites, One Solution

The project pairs two satellite systems. One provides the training target, the other provides the free input.

Sentinel-2 (the input)

Free, 10m resolution, revisits every 5 days. Run by ESA (European Space Agency). Captures 13 spectral bands including visible light and infrared.

PlanetScope (the target)

Commercial (~3m resolution), daily revisit. Run by Planet Labs. Captures 8 bands. Costs $5–10 per km². Sharp enough to see individual buildings and roads.

💡

The 3× Gap

PlanetScope's 3m pixels are 3× smaller than Sentinel-2's 10m pixels. That means the AI needs to learn to fill in 9× more detail (3×3 = 9 sub-pixels for every original pixel). This 3× ratio is the scale factor of the entire pipeline.

Reference

Sensor Specifications

Key parameters for the two satellite systems used in this pipeline:

Parameter	Sentinel-2 (Input)	PlanetScope SuperDove (Target)
Operator	ESA (European Space Agency)	Planet Labs
Cost	Free & open	$5–10 / km²
Spatial Resolution	10m / 20m / 60m (band-dependent)	~3m (all bands)
Spectral Bands	13 bands (443–2190 nm)	8 bands (431–885 nm)
Bands Used (pipeline)	`B01` `B02` `B03` `B04` `B05` `B8A`	Coastal Blue, Blue, Green, Red, Red Edge, NIR
Swath Width	290 km	~24 km
Revisit Time	5 days	Daily
Radiometric Depth	12-bit	12-bit
Data Format	`.SAFE` (JPEG2000 bands)	GeoTIFF
Scale Factor	3× (10m → ~3.3m)

The Super-Resolution Idea

The concept: show a neural network thousands of paired examples — “here's the blurry version, here's the sharp version” — until it learns to predict the sharp details from blurry input alone.

Sentinel-2 Patch

64×64 pixels at 10m

→

Neural Network

Learns 10m → 3m mapping

→

Enhanced Output

192×192 pixels at ~3m

After training, the network can enhance any Sentinel-2 scene — it doesn't need PlanetScope anymore. The deep learning model has internalized the patterns that distinguish 10m imagery from 3m imagery.

Train Once Per City

Here's the economic breakthrough that makes this practical:

1×

One Purchase

Buy a single PlanetScope scene for your city (~$2–4K for 400 km²). This is your only commercial data cost.

⚙

Train a Model

Use the paired S2/PS data to train a city-specific SR model. Takes 2–22 GPU-hours depending on architecture.

∞

Enhance Forever

Apply the trained model to every future free Sentinel-2 pass. S2 revisits every 5 days, so that's up to 73 enhanced images per year (fewer once cloudy passes are discarded).

💰

From $40K to $50 Per Year

Traditional approach: 4 commercial images/year × up to $10K each = up to $40K per year. This approach: a one-time ~$3K purchase plus ~$50/year in cloud compute for inference. Once the model is trained, that's a 99.9% reduction in recurring cost. For a country with 500+ cities, this is the difference between monitoring a handful of cities and monitoring all of them.
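The arithmetic behind these figures is easy to reproduce. The sketch below uses only the numbers quoted in this module; the function names and the assumption of a flat price per km² are illustrative, not part of the project.

```python
def commercial_cost_per_year(area_km2, price_per_km2, passes_per_year):
    """Recurring cost of buying commercial imagery for every update."""
    return area_km2 * price_per_km2 * passes_per_year


def sr_cost(years, one_time_purchase=3_000, compute_per_year=50):
    """Cumulative cost of the train-once approach over a number of years."""
    return one_time_purchase + compute_per_year * years


# Figures quoted in the text: 400 km2 city, $5-25 per km2, quarterly updates.
low = commercial_cost_per_year(400, 5, 4)
high = commercial_cost_per_year(400, 25, 4)

# Recurring saving once the model is trained (one-time purchase excluded).
recurring_saving = 1 - 50 / high

# Five-year totals: train-once approach vs. buying imagery every quarter.
five_year_sr = sr_cost(5)
five_year_commercial = high * 5
```

Note that the 99.9% figure compares recurring costs only; over five years the one-time purchase still leaves the train-once total far below the commercial total.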

Check Your Understanding

Module 1 Quiz

A city planner wants to monitor construction quarterly. After the SR model is trained, what data do they need for each quarterly update?

Why does each city need its own training instead of one universal model?

If a Sentinel-2 input patch is 64×64 pixels and the scale factor is 3×, how large is the output patch?

Meet the Pipeline

The cast of scripts that turn raw satellite data into enhanced imagery

Explanation Reference How-To

The Assembly Line

Think of a semiconductor fabrication plant. Raw silicon enters one end, and through a series of precisely controlled stations — cleaning, etching, layering, testing — a finished chip emerges at the other end. Each station does one job extremely well.

This pipeline works the same way. Raw satellite files enter, and through 7 stages, enhanced imagery comes out.

Band Harmonization

Unify all Sentinel-2 bands to the same 10m resolution

Scene Matching

Crop the large S2 scene to match the smaller PlanetScope extent

Co-Registration

Align the two images to sub-pixel accuracy

Patch Extraction

Cut both images into thousands of small, aligned training tiles

Normalization

Scale raw sensor values to a 0–1 range the neural network can digest

Training

Feed patch pairs to the neural network until it learns the 10m → 3m mapping

Inference

Apply the trained model to full Sentinel-2 scenes, tile by tile

The Cast of Characters

Each stage is handled by one or more Python scripts. Here's who does what:

Data Preparation

dsen2_harmonize.py Sharpens 20m/60m S2 bands to 10m using a separate neural network

crop_pairs.py Crops S2 scenes to match PlanetScope geographic extent

preprocess.py Co-registers images and extracts aligned patch pairs

dataset.py Loads patches, normalizes values, applies augmentation

Model & Training

s2ps_archs.py Defines the neural network architectures (EDSR, SwinIR, BandAdapter)

s2ps_losses.py Custom loss functions (SAM, ERGAS) for spectral fidelity

s2ps_model.py Training logic: forward pass, loss computation, validation

launch.py Entry point — registers all components and starts training

train_example.py Standalone minimal training script (no framework needed)

Evaluation & Output

inference.py Applies trained model to full scenes with tiled processing

s2ps_metrics.py Quality metrics: PSNR, SSIM, SAM, ERGAS

How They Connect

Data flows top to bottom.

Raw Input

Sentinel-2 .SAFE

PlanetScope GeoTIFF

↓

Data Preparation

dsen2_harmonize.py

crop_pairs.py

preprocess.py

↓

Training

launch.py

s2ps_archs.py

s2ps_losses.py

↓

Output

inference.py

s2ps_metrics.py


The Framework Connection

Rather than building everything from scratch, this project plugs into BasicSR, an existing super-resolution framework. The key file is launch.py — it registers all custom pieces:

CODE

# CRITICAL: Import our package FIRST to register all
# custom datasets, archs, losses, metrics with BasicSR.
import s2ps_dataset   # registers datasets
import s2ps_archs     # registers architectures
import s2ps_losses    # registers losses
import s2ps_metrics   # registers metrics
import s2ps_model     # registers models

PLAIN ENGLISH

These lines must run before anything else, so BasicSR knows about our custom satellite-specific components.

Load our dataset loaders — they know how to read paired S2/PS patches and normalize them.

Load our neural network architectures — EDSR, SwinIR, and BandAdapter, all designed for 6-band satellite input.

Load our grading rubrics — SAM loss keeps the spectral colors honest.

Load our measurement tools — PSNR, SSIM, SAM, ERGAS for quality evaluation.

Load our training orchestrator — it knows how to handle 6 bands where BasicSR expects 3.

💡

The Registry Pattern

This “register then use” approach is one of the most powerful patterns in software. Each import statement tells BasicSR: “Hey, I exist — add me to your menu of options.” Then a YAML config file picks which registered components to actually use for a given experiment. Change the config, not the code.
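To make the pattern concrete, here is a minimal, framework-free sketch of "register then use". BasicSR has its own registry objects; the names below (`LOSS_REGISTRY` as a plain dict, `register_loss`, `build_loss`) are simplified stand-ins invented for illustration, and the config dict plays the role of a parsed YAML fragment.

```python
# Hypothetical, simplified registry; BasicSR's real registry differs in detail.
LOSS_REGISTRY = {}


def register_loss(cls):
    """Class decorator: add the class to the registry under its own name."""
    LOSS_REGISTRY[cls.__name__] = cls
    return cls


@register_loss
class L1Loss:
    def __init__(self, loss_weight=1.0):
        self.loss_weight = loss_weight

    def __call__(self, pred, target):
        diffs = [abs(p - t) for p, t in zip(pred, target)]
        return self.loss_weight * sum(diffs) / len(diffs)


def build_loss(opt):
    """Instantiate whichever registered class the config names in 'type'."""
    opt = dict(opt)  # copy so the caller's config is left untouched
    cls = LOSS_REGISTRY[opt.pop("type")]
    return cls(**opt)


# Equivalent of the YAML fragment {type: L1Loss, loss_weight: 0.5}
loss_fn = build_loss({"type": "L1Loss", "loss_weight": 0.5})
value = loss_fn([1.0, 2.0], [1.5, 2.0])
```

Adding a new loss then means writing one decorated class and naming it in the config; the code that builds and calls losses does not change.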

How-To

Quick-Start Commands

Each stage of the pipeline has a corresponding command. Here are the essential recipes:

⚙ Band Harmonization

python dsen2_harmonize.py \
  --safe_dir data/sentinel2_raw/S2A_MSIL2A_20230615.SAFE \
  --output data/sentinel2_10m/S2A_20230615_10m.tif \
  --dsen2_dir ../DSen2

Requires DSen2 cloned alongside this repo. Add --no_dsen2 to fall back to bilinear resampling.

✂ Scene Matching & Patch Extraction

python preprocess.py \
  --s2_dir data/paired_scenes/sentinel2 \
  --ps_dir data/paired_scenes/planetscope \
  --output_dir data/patches \
  --patch_size 64 --scale_factor 3 --stride 32 \
  --bands B02,B03,B04,B08

Use --stride 32 for 50% overlap (recommended). Use --stride 64 for no overlap (fewer patches, faster).

🚀 Training

# Single GPU
python launch.py -opt swinir_semarang.yml

# Multi-GPU (4 GPUs)
python -m torch.distributed.launch --nproc_per_node=4 \
  launch.py -opt swinir_semarang.yml --launcher pytorch

Checkpoints saved every 5,000 iterations in experiments/<name>/models/

🌎 Inference on Full Scene

python inference.py \
  --model_path experiments/SwinIR_Semarang/models/net_g_500000.pth \
  --arch MultiBandSwinIR \
  --input scene_10m.tif --output scene_3m_sr.tif \
  --scale 3 --tile_size 64 --tile_overlap 8

For GPU memory issues, reduce --tile_size to 48 or 32. Increase --tile_overlap to 16 for smoother seams.

Check Your Understanding

Module 2 Quiz

You discover that band B01 (Coastal Aerosol) in Sentinel-2 is captured at 60m resolution while the other bands are at 10m. Which script would you use to fix this mismatch?

You want to add a new loss function for your experiment. Given the registry pattern used in this codebase, what's the minimal set of changes?

Sentinel-2 covers ~110km per tile but PlanetScope only covers ~24km. Before you can make training pairs, you need to handle this size mismatch. Which script and why?

Preparing the Data

From raw satellite downloads to perfectly aligned training patches

Explanation Tutorial Reference

Band Harmonization

Sentinel-2 captures light at different wavelengths — but not all bands have the same resolution. It's like having a camera where the red channel is sharp but the blue channel is blurry.

10m

Native 10m Bands

B02 (Blue), B03 (Green), B04 (Red) — these are already sharp. They pass through unchanged.

20m

20m Bands → 10m

B05 (Red Edge), B8A (NIR) — a 2× sharpening step. Critical for vegetation analysis.

60m

60m Band → 10m

B01 (Coastal Aerosol) — a heroic 6× sharpening. This band always has the lowest quality in the final output.

CODE

S2_L2A_BAND_INFO = {
    'B01': ('*B01_60m.jp2', 60),
    'B02': ('*B02_10m.jp2', 10),
    'B03': ('*B03_10m.jp2', 10),
    'B04': ('*B04_10m.jp2', 10),
    'B05': ('*B05_20m.jp2', 20),
    'B8A': ('*B8A_20m.jp2', 20),
}
OUTPUT_BANDS = ['B01','B02','B03','B04','B05','B8A']
NEEDS_DSEN2 = {'B01', 'B05', 'B8A'}

PLAIN ENGLISH

A lookup table mapping each band name to its file pattern and native resolution in meters.

B01 arrives as a JPEG2000 file at 60m — very coarse.

B02, B03, B04 are already at 10m — the sharp ones (Blue, Green, Red).

B05 and B8A arrive at 20m — moderately coarse.

The final output stacks all 6 bands into one file, in this specific order.

Only these three bands need the DSen2 neural network to sharpen them. The rest are already at 10m.
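The "bring every band onto the 10m grid" step can be sketched without any neural network. The example below is a deliberately crude stand-in: it uses nearest-neighbour pixel repetition, which is neither DSen2 nor the bilinear fallback mentioned for `--no_dsen2`. The helper names and the toy scene are assumptions for illustration; only the band list and native resolutions come from the table above.

```python
import numpy as np

# Native resolution in meters for each band, as in S2_L2A_BAND_INFO above.
NATIVE_RES = {"B01": 60, "B02": 10, "B03": 10, "B04": 10, "B05": 20, "B8A": 20}
OUTPUT_BANDS = ["B01", "B02", "B03", "B04", "B05", "B8A"]


def upsample_nearest(band, factor):
    """Repeat each pixel factor x factor times (nearest-neighbour upsampling)."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)


def stack_to_10m(bands):
    """bands: dict of band name -> 2D array at its native resolution."""
    layers = []
    for name in OUTPUT_BANDS:
        factor = NATIVE_RES[name] // 10  # 1, 2 or 6
        layers.append(upsample_nearest(bands[name], factor))
    return np.stack(layers)  # shape (6, H, W) on the 10m grid


# Toy scene covering 120 m x 120 m: 12x12 px at 10m, 6x6 at 20m, 2x2 at 60m.
toy = {
    name: np.ones((120 // res, 120 // res), dtype=np.uint16)
    for name, res in NATIVE_RES.items()
}
stacked = stack_to_10m(toy)
```

The point of the sketch is the bookkeeping: every band ends up with the same height and width, stacked in `OUTPUT_BANDS` order. DSen2 replaces the pixel repetition with a learned sharpening step.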

Aligning Two Different Cameras

Imagine taking a photo of a building from a drone at 100m altitude, then asking a friend on a different drone at 300m to photograph the same building. The photos will show the same place but from slightly different angles, at different times, with different zoom levels. Before you can compare them pixel-by-pixel, you need to align them precisely.

That's the co-registration challenge. The pipeline handles it in two steps:

Geographic Cropping (crop_pairs.py)

S2 covers ~110km, PS covers ~24km. Crop the S2 scene to the PS footprint using georeference coordinates. After this, both images cover the same area.

Sub-pixel Alignment (preprocess.py)

Even after cropping, the images may be off by a few meters. Cross-correlation finds the optimal shift (typically 2–5 meters) and slides the PlanetScope image into exact alignment.

🔎

Why Sub-pixel Matters

A misalignment of even 1 pixel (10m) means the neural network is comparing a building in one image to a garden in the other. It would learn noise instead of the real 10m→3m relationship. The pipeline achieves alignment better than 0.5 pixels — less than 5 meters of error across a 24km scene.
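Cross-correlation alignment can be illustrated on a toy image. The sketch below estimates a whole-pixel shift with an FFT-based circular cross-correlation; it assumes both images are already on the same grid and does not show the sub-pixel refinement the pipeline relies on. `estimate_shift` is a hypothetical helper, not a function from `preprocess.py`.

```python
import numpy as np


def estimate_shift(reference, moving):
    """Integer (dy, dx) to apply to `moving` with np.roll so it matches `reference`."""
    ref = reference - reference.mean()
    mov = moving - moving.mean()
    # Circular cross-correlation computed in the frequency domain.
    corr = np.fft.ifft2(np.fft.fft2(ref) * np.conj(np.fft.fft2(mov))).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak indices past the midpoint correspond to negative shifts.
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return int(dy), int(dx)


rng = np.random.default_rng(0)
reference = rng.random((64, 64))
# Simulate a misregistration of 3 pixels down and 2 pixels left.
moving = np.roll(reference, shift=(3, -2), axis=(0, 1))

dy, dx = estimate_shift(reference, moving)
aligned = np.roll(moving, shift=(dy, dx), axis=(0, 1))
```

The recovered shift is the inverse of the simulated one, so rolling `moving` by it restores the reference. Real scenes need windowing, nodata handling, and interpolation around the correlation peak to get below one pixel.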

Cutting Into Bite-Sized Pieces

A full satellite scene is massive — thousands of pixels wide. Patches are small, fixed-size crops that the neural network can actually process. Think of it like slicing a pizza into squares — you feed the network one square at a time.

64×64 px

Sentinel-2 patch
(640m × 640m)

↔

192×192 px

PlanetScope patch
(same 640m area)

A sliding window with 50% overlap extracts patches across the entire scene. For Semarang: 3,298 training pairs + 366 validation pairs.

CODE

for y in range(0, h_lr - self.patch_size_lr + 1, self.stride):
    for x in range(0, w_lr - self.patch_size_lr + 1, self.stride):
        lr_patch = lr_image[:, y:y+self.patch_size_lr,
                               x:x+self.patch_size_lr]
        y_hr = y * self.scale_factor
        x_hr = x * self.scale_factor
        hr_patch = hr_image[:, y_hr:y_hr+self.patch_size_hr,
                               x_hr:x_hr+self.patch_size_hr]

PLAIN ENGLISH

Slide a 64-pixel window across the image, moving by 'stride' pixels each step (32 = 50% overlap).

Also slide horizontally at each row position.

Cut out the low-resolution patch from Sentinel-2 at this window position. The [:, ...] reads all 6 spectral bands at once.

Calculate where the same spot is in the 3× larger PlanetScope image. Multiply coordinates by 3.

Cut out the matching high-resolution patch — 192×192 pixels covering the exact same ground area.
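The loop above is an excerpt from a class. A self-contained version, with the `self.` attributes turned into function arguments, might look like the sketch below. The function name and the toy image sizes are assumptions; the indexing follows the excerpt, with the HR patch size taken as LR patch size × scale factor.

```python
import numpy as np


def extract_pairs(lr_image, hr_image, patch_size_lr=64, stride=32, scale_factor=3):
    """Return a list of (lr_patch, hr_patch) pairs from co-registered images."""
    _, h_lr, w_lr = lr_image.shape
    patch_size_hr = patch_size_lr * scale_factor
    pairs = []
    for y in range(0, h_lr - patch_size_lr + 1, stride):
        for x in range(0, w_lr - patch_size_lr + 1, stride):
            lr_patch = lr_image[:, y:y + patch_size_lr, x:x + patch_size_lr]
            # Same ground location in the 3x larger HR image.
            y_hr, x_hr = y * scale_factor, x * scale_factor
            hr_patch = hr_image[:, y_hr:y_hr + patch_size_hr,
                                   x_hr:x_hr + patch_size_hr]
            pairs.append((lr_patch, hr_patch))
    return pairs


# Toy scene: 128 x 160 LR pixels and the matching 384 x 480 HR pixels.
lr = np.zeros((6, 128, 160), dtype=np.float32)
hr = np.zeros((6, 384, 480), dtype=np.float32)
pairs = extract_pairs(lr, hr)
```

With stride 32 the toy scene yields 3 window positions vertically and 4 horizontally, so 12 pairs; with stride 64 it would yield only 2 × 2 = 4.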

Speaking the Network's Language

Raw satellite values range from 0 to 10,000+ (surface reflectance scaled by 10,000). Neural networks work best with values between 0 and 1. Normalization bridges this gap.

CODE

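def _percentile(self, image):
    # Scale each band independently to [0, 1] using its own percentile range.
    result = np.zeros_like(image, dtype=np.float32)
    for c in range(image.shape[0]):
        band = image[c]
        valid = band[band > 0]  # zeros are nodata
        if valid.size == 0:
            continue  # band is entirely nodata; leave it as zeros
        p_low = np.percentile(valid, self.percentile_low)
        p_high = np.percentile(valid, self.percentile_high)
        if p_high - p_low > 0:
            result[c] = np.clip(
                (band - p_low) / (p_high - p_low), 0.0, 1.0
            )
    return result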

PLAIN ENGLISH

Define the percentile normalization method — this is how raw sensor values get squeezed into 0–1.

Start with an empty result image of the same size.

Process each spectral band independently (6 bands = 6 loops).

Grab the current band's pixel values. Ignore zero-valued pixels (those are "no data" areas such as scene edges or masked-out pixels).

Find the 2nd percentile value — the "dark floor." Anything darker is probably noise.

Find the 98th percentile — the "bright ceiling." Anything brighter is probably glare.

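Squeeze the useful range (2nd to 98th percentile) into 0–1 and clip any outliers. This preserves relative differences within each band. Because every band gets its own scale, band-to-band ratios change, so keep the percentile limits and map values back to reflectance before computing indices like NDVI.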

💡

Why Per-Band Matters

Each band captures a different slice of the electromagnetic spectrum. Infrared is naturally brighter than blue. If you normalized all bands together, you'd crush the faint bands into invisibility. Per-band normalization preserves the relative brightness within each band while making them all network-friendly.
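A standalone version of per-band percentile normalization is sketched below, together with the inverse mapping. Returning the per-band limits and the `denormalize` helper are additions for illustration; they are not shown in the project's code. The synthetic scene is random data.

```python
import numpy as np


def percentile_normalize(image, low=2, high=98):
    """Scale each band of a (C, H, W) array to [0, 1]; also return per-band limits."""
    result = np.zeros_like(image, dtype=np.float32)
    limits = []
    for c in range(image.shape[0]):
        band = image[c].astype(np.float32)
        valid = band[band > 0]  # zeros are treated as nodata
        if valid.size == 0:
            limits.append((0.0, 0.0))
            continue
        p_low, p_high = np.percentile(valid, [low, high])
        limits.append((float(p_low), float(p_high)))
        if p_high > p_low:
            result[c] = np.clip((band - p_low) / (p_high - p_low), 0.0, 1.0)
    return result, limits


def denormalize(normalized, limits):
    """Map [0, 1] values back to the original units using the stored limits."""
    out = np.empty_like(normalized, dtype=np.float32)
    for c, (p_low, p_high) in enumerate(limits):
        out[c] = normalized[c] * (p_high - p_low) + p_low
    return out


rng = np.random.default_rng(0)
scene = rng.integers(1, 10_000, size=(6, 32, 32)).astype(np.uint16)
norm, limits = percentile_normalize(scene)
restored = denormalize(norm, limits)
```

Values that were clipped cannot be recovered exactly, but everything between the two percentiles round-trips, which is what matters when the enhanced output has to be turned back into reflectance.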

Tutorial

Hands-On: Trace a Patch Through the Pipeline

Follow these steps to understand what happens to a single patch pair:

Raw Sentinel-2 Band Values

A pixel in B04 (Red) reads 1,247, which is surface reflectance × 10,000 as stored in the product (a reflectance of about 0.125).

After Percentile Normalization

The 2nd percentile is 342, 98th is 2,891. Normalized value: (1247 - 342) / (2891 - 342) = 0.355. Now in [0, 1] range.

Patch Extraction

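This pixel is at position (row=142, col=87) in the scene. On a non-overlapping grid (stride 64) it lands in the patch at grid position (row=2, col=1), the 64×64 crop starting at (128, 64), at local position (14, 23) within the patch. With the default stride of 32 the same crop is grid position (row=4, col=2), and the pixel also appears in the overlapping crops that start at row 96 or column 32.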

Matching HR Patch

The corresponding PlanetScope patch starts at (128×3, 64×3) = (384, 192). It's 192×192 pixels covering the exact same 640m × 640m ground area — but with 9× more pixels.

Saved as Training Pair

Both patches are saved as .npy files (NumPy arrays): train/sentinel2/patch_0142.npy shape (6, 64, 64) and train/planetscope/patch_0142.npy shape (6, 192, 192).

Reference

Preprocessing Parameters

Parameter	Default	Description	Impact
`--patch_size`	`64`	LR patch side length (pixels)	Larger = more context but fewer patches. 64 is standard for 3× SR.
`--stride`	`32`	Sliding window step	32 = 50% overlap (recommended). 64 = no overlap. 16 = 75% overlap (4× more patches).
`--scale_factor`	`3`	SR magnification	Must match S2/PS resolution ratio. HR patch = LR patch × scale.
`--bands`	`B01,..,B8A`	Which S2 bands to use	6-band: B01,B02,B03,B04,B05,B8A. 4-band: B02,B03,B04,B08.
`--norm_method`	`percentile`	Normalization strategy	percentile (robust), minmax (simple), zscore (zero-mean).
`--percentile_low`	`2`	Dark floor percentile	Clips dark outliers. Higher = more aggressive clipping.
`--percentile_high`	`98`	Bright ceiling percentile	Clips bright outliers. Lower = more aggressive clipping.
`--min_valid`	`0.8`	Minimum non-nodata fraction	Rejects patches with >20% nodata (clouds, edges).

Check Your Understanding

Module 3 Quiz

The patch extractor uses 50% overlap (stride=32 for patch_size=64). Why not use 0% overlap (stride=64) to avoid redundancy?

What would happen if co-registration failed and the S2/PS images were misaligned by 10 meters?

The Neural Networks

Two architectures that see satellite imagery in fundamentally different ways

Explanation Reference

Two Ways to See

Think of two detectives examining a crime scene photo. One works methodically with a magnifying glass, scanning small areas one at a time. The other steps back, studying how distant clues relate to each other. Both find important details — but the second one catches patterns the first misses.

EDSR (the Magnifying Glass)

A CNN architecture with 16 residual blocks. Fast to train (2 hours), 1.56 million parameters. Sees local patterns within a small window.

SwinIR (the Bird's-Eye View)

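A Transformer architecture with 6 residual Swin blocks of 6 attention layers each (`depths: [6,6,6,6,6,6]`). Slower to train (22 hours), 11.9 million parameters. Shifted-window attention lets it relate pixels well beyond the reach of a single convolution kernel.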

💡

Why Transformers Excel on Satellites

On regular photos, Transformers beat CNNs by ~0.3 dB. On multi-spectral satellite data, SwinIR beats EDSR by 2.91 dB (26.49 vs 23.58 dB), nearly 10× the usual gap. A plausible reason: self-attention compares each pixel's full feature vector with the others in its window, so it can pick up relationships such as “this Red Edge brightness implies this NIR level implies vegetation.” A CNN also mixes all bands, but only through fixed filters over a small local neighborhood.

EDSR: The Workhorse CNN

EDSR's architecture is elegantly simple: lift the input into a feature space, refine it through a series of residual blocks, then upsample it to the target resolution.

Head

6 bands → 64 features

→

×16

Residual Blocks

Learn fine details

→

↑

PixelShuffle

3× upscale

→

Out

Tail

64 features → 6 bands

CODE

def forward(self, x):
    head = self.head(x)
    body = self.body(head)
    body = body + head  # Global residual
    up = self.upsample(body)
    out = self.tail(up)
    return out

PLAIN ENGLISH

This is the main processing function — data enters as x (a 6-band, 64×64 patch).

Expand 6 input bands into 64 internal feature channels.

Pass through 16 residual blocks — each refines the features a little more.

The key trick: add the original features back. The network only needs to learn the difference between blurry and sharp — like editing a first draft instead of writing from scratch.

Expand from 64×64 to 192×192 using PixelShuffle — the 3× magnification happens here.

Compress 64 features back to 6 output bands. Done! One sharp patch emerges.
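PixelShuffle is the least intuitive step, so here is a NumPy sketch of the rearrangement it performs: channels are traded for spatial resolution. This follows the usual PixelShuffle convention (output pixel (h·r + i, w·r + j) of channel c comes from input channel c·r² + i·r + j). The assumption that a convolution first expands 64 features to 64 × 9 channels is the standard EDSR upsampler layout, not something shown in this project's code.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r), PixelShuffle-style."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)  # interleave i with h and j with w


# Tiny example: 9 channels of 2x2 become 1 channel of 6x6.
features = np.arange(9 * 2 * 2, dtype=np.float32).reshape(9, 2, 2)
up = pixel_shuffle(features, 3)

# Pipeline-sized example: 64 * 9 channels at 64x64 become 64 channels at 192x192.
full = pixel_shuffle(np.zeros((64 * 9, 64, 64), dtype=np.float32), 3)
```

No values are invented by the shuffle itself: each output pixel is copied from one input channel, so all the "learning" sits in the convolution that produces those channels.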

The Band Adapter Trick

Most pretrained super-resolution models expect 3-channel RGB input. Our satellite data has 6 bands. The BandAdapter pattern solves this with learnable 1×1 convolutions that wrap any pretrained model:

Adapter In

6 bands → 3 channels

→

🔒

Pretrained Model

RGB weights (frozen or fine-tuned)

→

Adapter Out

3 channels → 6 bands

CODE

def _init_adapters(self):
    with torch.no_grad():
        w_in = self.adapter_in[0].weight
        nn.init.zeros_(w_in)
        for i in range(min(3, self.num_bands)):
            w_in[i, i, 0, 0] = 1.0

PLAIN ENGLISH

Initialize the adapter weights so the first three input bands pass through unchanged at the start.

"no_grad" means we're manually setting values, not learning them (yet).

Grab the input adapter's weight matrix (a 3×6 transformation).

Zero everything out first, giving a clean slate.

Set the first three input bands (indices 0–2) to pass straight through to the pretrained RGB model. With the 6-band order used here those are B01, B02, B03 (Coastal Aerosol, Blue, Green); they are Blue, Green, Red only if the stack starts at B02. The adapter starts as an identity for the first 3 bands and zero for the rest, then gradually learns how to use all 6 bands during fine-tuning.
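The effect of this initialization can be checked numerically. The sketch below models the input adapter as a bias-free 1×1 convolution in NumPy; `init_adapter_in` and `conv1x1` are illustrative helpers rather than the project's classes, and the random patch stands in for real data.

```python
import numpy as np


def init_adapter_in(num_bands, num_out=3):
    """Weight of shape (num_out, num_bands): identity on the first bands, zero elsewhere."""
    w = np.zeros((num_out, num_bands), dtype=np.float32)
    for i in range(min(num_out, num_bands)):
        w[i, i] = 1.0
    return w


def conv1x1(x, w):
    """Apply a 1x1 convolution without bias to a (C_in, H, W) array."""
    return np.einsum("oc,chw->ohw", w, x)


rng = np.random.default_rng(0)
patch = rng.random((6, 64, 64)).astype(np.float32)

w_in = init_adapter_in(num_bands=6)
rgb_like = conv1x1(patch, w_in)  # what the pretrained 3-channel model would receive
```

At initialization the backbone sees exactly the first three bands of the stack and nothing from the other three, which is why the band order of the input matters for this trick.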

Reference

Architecture Specifications

Specification	EDSR (MultiBandEDSR)	SwinIR (MultiBandSwinIR)	BandAdapterNet
Type	CNN (Convolutional)	Transformer (Self-Attention)	Adapter wrapper
Parameters	1.56 M	11.9 M	Backbone + ~0.1 M
Input Channels	`num_in_ch: 6`	`num_in_ch: 6`	Any N → 3 → N
Key Config	`num_feat: 64` `num_block: 16`	`embed_dim: 180` `depths: [6,6,6,6,6,6]` `window_size: 8`	`freeze_backbone_epochs`
Upsampler	PixelShuffle (3×)	PixelShuffle (3×)	Inherits from backbone
Training Time	2h 7min (H100)	22h 28min (H100)	Varies (backbone-dependent)
PSNR (Semarang)	23.58 dB	26.49 dB	Depends on backbone
Best For	Fast prototyping, limited GPU	Production quality	Reusing pretrained RGB weights
Config Files	`esrgan_semarang.yml`	`swinir_semarang.yml`	`bandadapter_esrgan_s2ps.yml`

Check Your Understanding

Module 4 Quiz

You're setting up SR for a new city and have limited GPU budget. SwinIR gives +2.91 dB over EDSR but takes 10× longer to train. What's your strategy?

In EDSR's forward pass, what does the line `body = body + head` actually accomplish?

Teaching Machines to See

How loss functions, training loops, and 500,000 iterations shape a model

Explanation How-To Reference

The Grading Rubric

Imagine grading a student's painting of a landscape. You could score it on two axes: (1) “Does each brushstroke land in the right spot?” — pixel accuracy; and (2) “Do the colors look natural together?” — spectral fidelity. This pipeline uses two loss functions that capture exactly these two qualities:

L1 Loss (pixel accuracy) — weight: 1.0

The absolute difference between each predicted pixel and the real pixel, averaged across the entire patch. Penalizes every pixel that's off, proportional to how far off it is. Simple and effective.

∠

SAM Loss (spectral fidelity) — weight: 0.1

Spectral Angle Mapper: measures the angle between the predicted and true spectral signature at each pixel. Ensures band ratios stay correct — critical for NDVI and other indices.

CODE

p = pred.reshape(b, c, -1)
t = target.reshape(b, c, -1)

dot = (p * t).sum(dim=1)
norm_p = p.norm(dim=1).clamp(min=self.eps)
norm_t = t.norm(dim=1).clamp(min=self.eps)

cos_angle = (dot / (norm_p * norm_t)).clamp(-1+self.eps, 1-self.eps)
sam = torch.acos(cos_angle)

PLAIN ENGLISH

Reshape both images so each pixel becomes a 6-number vector (one value per band). Think of each pixel as a point in 6-dimensional color space.

Compute the dot product between predicted and target spectral vectors — this measures alignment.

Calculate the length of each vector. The clamp prevents division by zero (pixels that are completely black).

Divide dot product by the vector lengths to get the cosine of the angle between them.

Take the arc-cosine to convert from cosine to actual angle in radians. Result: 0 = perfectly matching spectra, π/2 = completely different spectra.
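The same computation can be written in NumPy for a single image, which makes it easy to see what SAM does and does not penalize. The sketch mirrors the steps above; combining it with L1 at weight 0.1 follows the weights quoted in this module, while the function names and test arrays are assumptions.

```python
import numpy as np


def sam_per_pixel(pred, target, eps=1e-7):
    """Spectral angle in radians for each pixel of two (C, H, W) arrays."""
    c = pred.shape[0]
    p = pred.reshape(c, -1)
    t = target.reshape(c, -1)
    dot = (p * t).sum(axis=0)
    norm_p = np.maximum(np.linalg.norm(p, axis=0), eps)
    norm_t = np.maximum(np.linalg.norm(t, axis=0), eps)
    cos_angle = np.clip(dot / (norm_p * norm_t), -1 + eps, 1 - eps)
    return np.arccos(cos_angle).reshape(pred.shape[1:])


def combined_loss(pred, target, sam_weight=0.1):
    """L1 + sam_weight * mean SAM."""
    l1 = np.abs(pred - target).mean()
    return l1 + sam_weight * sam_per_pixel(pred, target).mean()


rng = np.random.default_rng(0)
target = rng.random((6, 8, 8)) + 0.1
brighter = 1.5 * target  # same spectral shape, different brightness
angle_scaled = sam_per_pixel(brighter, target).mean()

# Two spectra with no band in common are 90 degrees apart.
only_band0 = np.zeros((6, 1, 1))
only_band0[0] = 1.0
only_band1 = np.zeros((6, 1, 1))
only_band1[1] = 1.0
angle_orth = sam_per_pixel(only_band0, only_band1).item()
```

A uniformly brighter prediction has an angle of almost zero, so SAM alone would not notice it. That is why it is paired with L1, which does penalize the brightness error.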

The Training Loop

Training is a conversation between the model and its grading rubric, repeated 500,000 times. Each iteration follows the same pattern:

Forward Pass

Feed a batch of 16 LR patches through the model → get 16 predicted HR patches.

Loss Computation

Compare predictions to real HR patches using L1 + 0.1×SAM. Get a single number: how wrong is the model?

Backward Pass

Backpropagation computes gradients — which parameters contributed most to the error.

Parameter Update

The Adam optimizer nudges each parameter slightly in the direction that reduces the loss.

Repeat ×500,000

Each iteration processes 16 patches. Over 500K iterations, the model sees ~8 million patch examples (with augmentation and repetition).
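As a quick check on the ~8 million figure, the sketch below multiplies out the numbers quoted in this course and converts them to passes over the Semarang training set. It assumes a single GPU and that every batch is full.

```python
iterations = 500_000
batch_size = 16
train_pairs = 3_298  # Semarang training set size quoted in the data module

samples_seen = iterations * batch_size
epochs = samples_seen / train_pairs
```

So each training pair is revisited roughly 2,400 times, which is why augmentation matters: without it the network would see identical inputs over and over.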

Training by Recipe Card

Every experiment is defined by a YAML configuration file. Change the recipe, change the experiment — no code modifications needed.

📊

Dataset Settings

Which patches to use, normalization method, augmentation on/off, train/val split paths.

🧠

Architecture Settings

Which model (EDSR/SwinIR), number of blocks/layers, embedding dimension, scale factor.

📈

Training Settings

Loss weights, learning rate, optimizer, scheduler, total iterations, checkpoint frequency.

🎯

Validation Settings

Which metrics to compute, how often to validate, early stopping criteria.

💡

Ablation Studies

The project includes 21 different YAML configs — each testing one variable at a time. Change SAM weight from 0.1 to 0.5? New config file. Try 32 residual blocks instead of 16? New config file. This systematic approach, called an ablation study, is how researchers prove which design choices actually improve results.

A Day in Training

Here's what the training process looks like as a conversation between the components:

Training Log — Semarang SwinIR


How-To

Tuning Your Experiment

Common adjustments you can make by editing the YAML config — no code changes needed:

🎯 Increase Spectral Fidelity

# In your .yml config, increase SAM loss weight:
train:
  sam_opt:
    type: SAMLoss
    loss_weight: 0.2  # default is 0.1

Higher SAM weight = better spectral fidelity but slightly lower PSNR. Good for NDVI-critical applications. The project tested λ = {0.01, 0.05, 0.1, 0.2, 0.5}.

⚡ Speed Up Training

# Use EDSR instead of SwinIR (10× faster):
network_g:
  type: MultiBandEDSR  # instead of MultiBandSwinIR
  num_feat: 64
  num_block: 16  # try 8 for even faster

EDSR with 8 blocks trains in ~1 hour. Ablation showed 8-block gets within 0.5 dB of 16-block.

🔎 Early Stopping Check

# Set validation frequency in .yml:
val:
  val_freq: 5000  # validate every 5K iterations
  metrics:
    psnr_per_band: ~
    ssim_per_band: ~
    sam: ~

If PSNR plateaus for 50K+ iterations, training has converged. For SwinIR, most gains happen by 250K iterations.
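The plateau rule can be turned into a small check on the validation history. This is a sketch under stated assumptions: `has_plateaued` and the `min_delta` threshold are hypothetical, and the PSNR lists are made-up numbers, not project results.

```python
def has_plateaued(psnr_history, val_freq=5_000, patience_iters=50_000, min_delta=0.01):
    """True if the best PSNR has not improved by min_delta over the last patience_iters."""
    window = patience_iters // val_freq  # number of validations to look back
    if len(psnr_history) <= window:
        return False
    best_before = max(psnr_history[:-window])
    best_recent = max(psnr_history[-window:])
    return best_recent < best_before + min_delta


# Synthetic histories, one value per validation run (every 5K iterations).
rising = [20.0 + 0.1 * i for i in range(30)]
flat = rising + [rising[-1]] * 10  # 50K iterations with no improvement

still_improving = has_plateaued(rising)
converged = has_plateaued(flat)
```

With validation every 5,000 iterations, a 50K-iteration patience means looking back over the last 10 validation results.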

Reference

Training Configuration Reference

Parameter	Value	Notes
Optimizer	`Adam`	β1=0.9, β2=0.99
Learning Rate	`2×10^-4`	Warm-up for SwinIR (5K iters), none for EDSR
Scheduler	`CosineAnnealingRestartLR`	4 periods of 250K configured, restart weights [1, 0.5, 0.5, 0.5]; a 500K-iteration run covers the first two
Total Iterations	`500,000`	Both EDSR and SwinIR
Batch Size	`16`	Per GPU. Scale with GPU count.
L1 Loss Weight	`1.0`	Primary loss — pixel accuracy
SAM Loss Weight	`0.1`	10% contribution — spectral fidelity
Checkpoint Frequency	Every `5,000` iters	Saved to `experiments/<name>/models/`
Validation Frequency	Every `5,000` iters	Metrics: PSNR, SSIM, SAM, ERGAS per-band
GT Size (HR crop)	`192`	Random crop of HR patch during training for augmentation
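The scheduler row is easier to read as a curve. The sketch below computes a cosine-annealing learning rate with warm restarts in plain Python. It uses two 250K periods so the schedule spans the 500,000 training iterations, takes the base rate and restart weights from the table, and assumes a minimum rate (`eta_min`) of 0; it is an illustration of the schedule's shape, not BasicSR's implementation.

```python
import math


def cosine_restart_lr(it, base_lr=2e-4, periods=(250_000, 250_000),
                      restart_weights=(1.0, 0.5), eta_min=0.0):
    """Learning rate at iteration `it` for cosine annealing with warm restarts."""
    start = 0
    for period, weight in zip(periods, restart_weights):
        if it < start + period:
            progress = (it - start) / period  # 0 at a restart, 1 at the end of the cycle
            cosine = 0.5 * (1 + math.cos(math.pi * progress))
            return eta_min + weight * (base_lr - eta_min) * cosine
        start += period
    return eta_min  # past the last configured period


lr_start = cosine_restart_lr(0)
lr_mid_first = cosine_restart_lr(125_000)
lr_restart = cosine_restart_lr(250_000)
lr_end = cosine_restart_lr(499_999)
```

The rate starts at 2×10^-4, decays to nearly zero by iteration 250K, then jumps back to half the base rate (restart weight 0.5) and decays again.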

Check Your Understanding

Module 5 Quiz

A colleague suggests removing SAM loss to simplify training (just use L1). What downstream task would be most affected?

Based on the training conversation, at what point does SwinIR match EDSR's final quality (23.6 dB)?

Assembling the Big Picture

How a trained model processes an entire city — tile by seamless tile

Explanation Tutorial How-To Reference

Why Tiles?

A full Sentinel-2 scene is roughly 10,000 × 10,000 pixels. The neural network was trained on 64×64 patches. You can't feed the whole scene at once: the intermediate activations would need far more memory than a single GPU provides.

The solution: slide a window across the scene, process each tile independently, then stitch the results together. Like mowing a lawn in overlapping rows — but the overlap is where the magic happens.

▦

Tile Grid

64×64 tiles
8px overlap

→

Per-Tile SR

64×64 → 192×192

→

⊜

Blend

Cosine window
smooth edges

→

🌎

GeoTIFF

Georeferenced
3m output

The Seamless Stitch

Without blending, you'd see visible grid lines where tiles meet — each tile's edge pixels would abruptly jump to the next tile's values. The solution: a raised-cosine blend window that tapers each tile's contribution from full strength at the center to zero at the edges.

CODE

def _make_blend_window(size, overlap):
    w = np.ones(size, dtype=np.float32)
    ramp = np.linspace(0, 1, overlap, dtype=np.float32)
    cos_ramp = 0.5 * (1 - np.cos(np.pi * ramp))
    w[:overlap] *= cos_ramp
    w[-overlap:] *= cos_ramp[::-1]
    window = w[np.newaxis, :] * w[:, np.newaxis]
    return window[np.newaxis]

PLAIN ENGLISH

Create a blending window for a tile of the given size with the given overlap.

Start with all 1s — full weight everywhere.

Create a smooth ramp from 0 to 1 over the overlap zone (8 pixels).

Shape the ramp into a cosine curve — starts slow, accelerates in the middle, ends slow. Smoother than a linear fade.

Apply the ramp to the left edge (fades in) and the reversed ramp to the right edge (fades out). The center stays at full weight.

Make it 2D by multiplying horizontal and vertical ramps. Corners get very low weight (fade × fade), edges get moderate weight, center gets full weight.

Add a band dimension so it can be broadcast across all 6 spectral bands at once.

The Core Loop

For every tile: extract, predict, accumulate with blending weights. Then divide by total weight to normalize.

CODE

with torch.no_grad():
  for row in range(n_rows):
    for col in range(n_cols):
      y0 = min(row * stride, h - tile_size)
      x0 = min(col * stride, w - tile_size)
      tile_lr = img_norm[:, y0:y1, x0:x1]
      tile_sr = model(tile_t)
      output[:, oy0:oy1, ox0:ox1] += tile_sr * blend
      weight[:, oy0:oy1, ox0:ox1] += blend

PLAIN ENGLISH

Disable gradient tracking — we're just predicting, not learning. This saves ~50% memory.

Loop through every tile position, row by row, column by column.

Calculate the top-left corner of this tile. The min() ensures the last tile in each row/column doesn't run off the edge.

Extract the 64×64 LR tile from the normalized image (all 6 bands).

Run the neural network — 64×64 in, 192×192 out. This is where the magic happens, one tile at a time.

Add this tile's output to the accumulator, weighted by the cosine blend window. Where tiles overlap, both contribute — but with fading weights so the transition is invisible.

Track the total weight at each pixel. After all tiles, divide output by weight to normalize.
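The excerpt above omits several variables, so here is a self-contained NumPy sketch of the whole accumulate-and-normalize loop. The "model" is a stand-in that simply repeats pixels, which makes the result easy to verify. The assumptions are loud ones: the blend window is built at output resolution (tile size × scale), and a small `eps` guards the final division, because the window as defined is exactly zero on its outermost ring, so pixels on the scene border can end up with zero total weight. The project's `inference.py` may handle both points differently.

```python
import math
import numpy as np


def make_blend_window(size, overlap):
    """Raised-cosine window with shape (1, size, size)."""
    w = np.ones(size, dtype=np.float32)
    ramp = 0.5 * (1 - np.cos(np.pi * np.linspace(0, 1, overlap, dtype=np.float32)))
    w[:overlap] *= ramp
    w[-overlap:] *= ramp[::-1]
    return (w[np.newaxis, :] * w[:, np.newaxis])[np.newaxis]


def fake_model(tile, scale):
    """Stand-in for the network: nearest-neighbour upsampling of a (C, H, W) tile."""
    return np.repeat(np.repeat(tile, scale, axis=1), scale, axis=2)


def tiled_upscale(img, tile_size=64, overlap=8, scale=3, eps=1e-6):
    c, h, w = img.shape
    stride = tile_size - overlap
    n_rows = math.ceil((h - overlap) / stride)
    n_cols = math.ceil((w - overlap) / stride)
    out_tile = tile_size * scale
    blend = make_blend_window(out_tile, overlap * scale)
    output = np.zeros((c, h * scale, w * scale), dtype=np.float32)
    weight = np.zeros((1, h * scale, w * scale), dtype=np.float32)
    for row in range(n_rows):
        for col in range(n_cols):
            # Clamp so the last tile in each row/column stays inside the scene.
            y0 = min(row * stride, h - tile_size)
            x0 = min(col * stride, w - tile_size)
            tile_sr = fake_model(img[:, y0:y0 + tile_size, x0:x0 + tile_size], scale)
            oy0, ox0 = y0 * scale, x0 * scale
            output[:, oy0:oy0 + out_tile, ox0:ox0 + out_tile] += tile_sr * blend
            weight[:, oy0:oy0 + out_tile, ox0:ox0 + out_tile] += blend
    # eps avoids dividing by zero where the total weight is zero (scene border).
    return output / np.maximum(weight, eps), weight


rng = np.random.default_rng(0)
img = rng.random((6, 100, 120)).astype(np.float32)
result, weight = tiled_upscale(img)
```

Because every tile predicts the same value for a given output pixel here, the weighted average reproduces the untiled result wherever the total weight is non-zero. With a real network the tiles disagree slightly near their edges, and the fading weights are what hide that disagreement.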

🌎

Preserving Geography

The output GeoTIFF inherits the input's coordinate system but with 3× smaller pixel size. The geotransform is updated: pixel_size / 3, origin unchanged. The enhanced image slots directly into any GIS workflow.
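In GDAL-style terms the update is a one-liner on the six-element geotransform. The sketch below assumes the common (origin_x, pixel_width, row_rotation, origin_y, column_rotation, pixel_height) layout; the helper name and the example coordinates are hypothetical.

```python
def scale_geotransform(gt, scale):
    """Return a GDAL-style geotransform for an image upscaled by `scale`.

    gt = (origin_x, pixel_width, row_rotation, origin_y, col_rotation, pixel_height)
    """
    origin_x, px_w, rot_x, origin_y, rot_y, px_h = gt
    # Origin stays put; every per-pixel step shrinks by the scale factor.
    return (origin_x, px_w / scale, rot_x / scale, origin_y, rot_y / scale, px_h / scale)


# Hypothetical north-up geotransform with 10 m pixels (pixel_height is negative).
gt_10m = (430_000.0, 10.0, 0.0, 9_240_000.0, 0.0, -10.0)
gt_sr = scale_geotransform(gt_10m, 3)
```

The top-left corner of the image is unchanged, and each output pixel covers 10/3 ≈ 3.33 m, so the enhanced raster overlays the original exactly.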

Tutorial

Hands-On: Trace a Tile Through Inference

Input Scene: 500 × 400 pixels (6 bands)

A Sentinel-2 crop covering roughly 5km × 4km at 10m resolution.

Tile Grid: stride = 56 (tile_size 64 - overlap 8)

Columns: ceil((500 - 8) / 56) = 9 tiles. Rows: ceil((400 - 8) / 56) = 7 tiles. Total: 63 tiles to process.

Tile (row=3, col=4): position (168, 224)

A 64×64 crop extracted from the normalized image. Passed through the model in ~5ms on GPU.

Output tile: 192 × 192 pixels at positions (504, 672)

Multiplied by the cosine blend window: center pixels get weight 1.0, and weights fall toward 0 at the tile's edges and corners. Added to the accumulator.

Final output: 1500 × 1200 pixels (6 bands)

After all 63 tiles: divide accumulator by weight map. Write as GeoTIFF with pixel_size = 10m/3 = 3.33m. Done!
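The grid arithmetic in this walkthrough can be reproduced with two small helpers. They follow the formulas used above (stride = tile size − overlap, tile count = ceil((extent − overlap) / stride), last tile clamped to the scene); the function names are illustrative.

```python
import math


def tile_grid(width, height, tile_size=64, overlap=8):
    """Number of tile columns and rows needed to cover a scene."""
    stride = tile_size - overlap
    n_cols = math.ceil((width - overlap) / stride)
    n_rows = math.ceil((height - overlap) / stride)
    return n_cols, n_rows


def tile_origin(row, col, width, height, tile_size=64, overlap=8):
    """Top-left (y0, x0) of a tile, clamped so the last tile stays inside the scene."""
    stride = tile_size - overlap
    return min(row * stride, height - tile_size), min(col * stride, width - tile_size)


n_cols, n_rows = tile_grid(500, 400)  # the 500 x 400 example scene
y0, x0 = tile_origin(3, 4, 500, 400)  # tile (row=3, col=4)
out_y0, out_x0 = y0 * 3, x0 * 3       # position in the 3x output
```

For the example scene this gives 9 columns × 7 rows = 63 tiles, with tile (3, 4) starting at (168, 224) in the input and (504, 672) in the output.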

How-To

Troubleshooting Inference

🚧 Out of GPU Memory

# Reduce tile size (less memory per forward pass):
python inference.py --tile_size 32 --tile_overlap 8 ...

# Or run on CPU (slower, but not limited by GPU memory):
python inference.py --device cpu --tile_size 64 ...

Tile size 32 uses ~4× less GPU memory than 64. Processing time increases proportionally (more tiles).

📈 Visible Tile Seams

# Increase overlap for smoother blending:
python inference.py --tile_overlap 16 ...   # default is 8
# Or use 50% overlap (slower but seamless):
python inference.py --tile_overlap 32 --tile_size 64 ...

More overlap = wider blend zone = smoother transitions. 16 pixels is usually sufficient.

🌎 Batch Processing Multiple Scenes

# Process all .tif files in a directory:
for f in data/sentinel2_scenes/*.tif; do
  python inference.py \
    --model_path models/net_g_500000.pth \
    --arch MultiBandSwinIR \
    --input "$f" \
    --output "sr_output/$(basename "$f")" \
    --scale 3 --tile_size 64 --tile_overlap 8
done

For Docker deployment, mount volumes: -v ./data:/data -v ./models:/models

Reference

Inference Parameters

Parameter	Default	Range	Effect
`--tile_size`	`64`	16–128	Larger = faster (fewer tiles) but more GPU memory. Must be ≥ model's training patch size.
`--tile_overlap`	`8`	0–tile_size/2	Larger = smoother seams but slower. 0 = no blending (visible grid). 8 is a good default.
`--scale`	`3`	2, 3, 4, 6, 8	Must match the trained model's scale factor.
`--arch`	—	`MultiBandSwinIR` `MultiBandEDSR` `MultiBandRRDBNet`	Must match the architecture used during training.
`--device`	`cuda`	`cuda`, `cpu`	CPU is ~50× slower but is limited by system RAM rather than GPU memory.

📊

Memory vs Speed Tradeoff

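A 10,000×10,000 scene at tile_size=64, overlap=8 has a stride of 56, so it needs ceil((10,000 - 8) / 56) = 179 tiles per axis, or ~32,000 tiles in total. At ~5ms/tile on an H100, that's roughly 160 seconds. Reducing tile_size to 32 at the same overlap shrinks the stride to 24 and raises the count to 417 tiles per axis (~174,000 tiles, about 5.4× more), in exchange for a smaller memory footprint per forward pass. Choose based on your GPU's available memory.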

Check Your Understanding

Module 6 Quiz

You notice visible grid lines in an SR output. The blend window was accidentally disabled (all weights = 1.0). Why do the seams appear?

Why does the inference loop use `torch.no_grad()`?

Measuring Success

Four metrics, real results, and what they mean for Indonesian cities

Explanation Reference How-To

The Report Card

How do you know if enhanced imagery is actually good? You need metrics that capture different aspects of quality — like grading an essay on both grammar and content, not just word count.

PSNR

Peak Signal-to-Noise Ratio. How close is each pixel to the ground truth? Measured in decibels; higher is better. Think of it as the “pixel-accuracy score.”

SSIM

Structural Similarity. Do edges, textures, and contrasts look right? A blurry image might have decent PSNR but terrible SSIM because structures are smoothed away.

∠

SAM

Spectral Angle Mapper. Are the band ratios correct? Measured in degrees — lower is better. Below 5° means spectral indices (NDVI, NDBI) will be reliable.

ERGAS

Erreur Relative Globale Adimensionnelle de Synthèse (relative dimensionless global error in synthesis). The “overall grade”: it combines per-band errors normalized by brightness. Lower is better. A standard fusion quality metric in remote sensing.
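For readers who want to see the definitions, here is a minimal pure-Python sketch of three of the four metrics using their textbook formulas. It is not the project's evaluation code: the function names, the flat-list image layout, and the `max_val=1.0` default are assumptions. SSIM is omitted because it needs sliding-window statistics that do not fit a short example.

```python
import math

def psnr(pred, target, max_val=1.0):
    """PSNR in dB for two flat lists of pixel values on the same scale."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

def sam_degrees(pred_bands, target_bands):
    """Mean spectral angle in degrees; each argument is a list of per-band flat lists."""
    angles = []
    # zip(*bands) turns band-major data into one spectrum (vector) per pixel.
    for p_vec, t_vec in zip(zip(*pred_bands), zip(*target_bands)):
        dot = sum(p * t for p, t in zip(p_vec, t_vec))
        norm = math.sqrt(sum(p * p for p in p_vec)) * math.sqrt(sum(t * t for t in t_vec))
        if norm == 0:
            continue  # the angle is undefined for an all-zero spectrum
        cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
        angles.append(math.degrees(math.acos(cos_angle)))
    return sum(angles) / len(angles) if angles else 0.0

def ergas(pred_bands, target_bands, scale=3):
    """ERGAS = 100 * (1/scale) * sqrt(mean over bands of MSE_k / mean_k^2)."""
    total = 0.0
    for pred, target in zip(pred_bands, target_bands):
        mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
        mean = sum(target) / len(target)
        total += mse / mean ** 2
    return 100.0 / scale * math.sqrt(total / len(target_bands))

# Tiny checks on hand-made data.
psnr_20 = psnr([0.4, 0.6], [0.5, 0.5])                       # constant error of 0.1
angle_same = sam_degrees([[2.0, 4.0], [2.0, 6.0]], [[1.0, 2.0], [1.0, 3.0]])
angle_orth = sam_degrees([[0.0], [1.0]], [[1.0], [0.0]])
ergas_small = ergas([[2.2, 1.8]], [[2.0, 2.0]], scale=3)
```

Note that SAM is unchanged when a whole spectrum is scaled by a constant (`angle_same` is essentially zero), which is why it isolates band ratios from overall brightness.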

Semarang Results

Here's how each method performed on the Semarang test set:

Method	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓
Bicubic (baseline)	16.32	0.431	15.96°	17.33
EDSR (500K iters)	23.58	0.780	7.25°	7.31
SwinIR (500K iters)	26.49	0.828	6.12°	5.46

💡

The +10 dB Leap

SwinIR achieves 26.49 dB vs bicubic's 16.32 dB, a +10.17 dB improvement. In PSNR terms, every +3 dB halves the mean squared error (MSE), so +10 dB means the MSE is roughly 10× smaller (about 3.2× smaller in RMSE terms). Measured by squared error, the enhanced imagery is an order of magnitude closer to the real PlanetScope data than simple upscaling.
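The conversion from a PSNR gain to an error ratio follows directly from the definition PSNR = 10·log10(MAX² / MSE). A short sketch (the function name `error_ratios` is illustrative):

```python
import math

def error_ratios(psnr_gain_db):
    """Convert a PSNR gain in dB into MSE and RMSE reduction factors."""
    mse_ratio = 10 ** (psnr_gain_db / 10)
    return mse_ratio, math.sqrt(mse_ratio)

gain = 26.49 - 16.32                     # SwinIR vs bicubic, from the table
mse_x, rmse_x = error_ratios(gain)
print(f"+{gain:.2f} dB -> MSE {mse_x:.1f}x smaller, RMSE {rmse_x:.1f}x smaller")

three_db, _ = error_ratios(3.0)          # the "+3 dB halves the error" rule
```

For the Semarang numbers this works out to an MSE about 10.4× smaller and an RMSE about 3.2× smaller; a 3 dB gain corresponds to a factor of about 2 in MSE.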

Beyond Pixels: Real-World Impact

Sharp pixels are nice, but do they help with actual urban monitoring? The research tested this with land cover classification:

67%

Sentinel-2 (10m) classification accuracy

Agreement with PlanetScope ground truth. Too coarse — pixels mix buildings with surrounding vegetation.

73%

EDSR-enhanced classification accuracy

+5.9 percentage points over raw S2. The CNN captures enough detail to separate building from non-building in most cases.

75%

SwinIR-enhanced classification accuracy

+7.8 percentage points over raw S2. The Transformer's spectral awareness further improves vegetation vs. built-up discrimination.

⚠️

Transfer Learning Caveat

Models don't travel well between cities. A Semarang-trained SwinIR applied to Surabaya scored only 19.16 dB — barely better than bicubic (18.14 dB). On one test area, it actually performed worse than no SR at all (17.50 vs 18.89 dB). Each city's urban texture, vegetation patterns, and building density require location-specific training. The model learns “what Semarang looks like at 3m,” not “what any city looks like at 3m.”

The Band-by-Band Story

Not all bands improve equally. The original resolution of each band matters:

B01

B01 Coastal Aerosol — 26.39 dB (lowest)

Originally 60m → DSen2 sharpened to 10m → then SR to 3m. With two upscaling stages stacked on top of each other, this band carries the most uncertainty. It's asking the AI to hallucinate detail 20× finer (per axis) than the sensor actually captured.

B04

B04 Red — 26.88 dB (typical)

Natively 10m, only 3× upscaling. The model has rich spatial detail to work with. Gains are largest for Red Edge (+3.36 dB) and NIR (+3.42 dB).

B8A

B8A NIR — 26.99 dB (highest)

Originally 20m → DSen2 to 10m → SR to 3m. Despite the extra upscaling step, NIR bands are structurally simpler (strong vegetation/non-vegetation contrast) so the model predicts them well.

Reference

Metric Quality Thresholds

Use these thresholds to interpret your model's output quality:

Metric	Poor	Acceptable	Good	Excellent	Direction
PSNR	< 20 dB	20–25 dB	25–30 dB	> 30 dB	Higher ↑
SSIM	< 0.6	0.6–0.8	0.8–0.9	> 0.9	Higher ↑
SAM	> 10°	7–10°	5–7°	< 5°	Lower ↓
ERGAS	> 10	5–10	3–5	< 3	Lower ↓

Context	What It Tells You
PSNR high but SSIM low	Pixel values are close but structures (edges, textures) are blurred or distorted
PSNR high but SAM high	Image looks sharp but spectral signatures are distorted — NDVI/NDBI will be unreliable
All metrics good but downstream accuracy low	Sub-pixel alignment may be off, or training data distribution doesn't match test area
One band significantly worse	Check if that band had a different native resolution (B01 at 60m is the usual culprit)
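The threshold table can be turned into a small lookup for scripting. This is a sketch only: the `grade` helper is hypothetical, and how it treats values that fall exactly on a boundary (for example a PSNR of exactly 25 dB) is an assumption, since the table does not say which side they belong to.

```python
# metric: (higher_is_better, poor/acceptable cut, acceptable/good cut, good/excellent cut)
THRESHOLDS = {
    "PSNR": (True, 20.0, 25.0, 30.0),
    "SSIM": (True, 0.6, 0.8, 0.9),
    "SAM": (False, 10.0, 7.0, 5.0),
    "ERGAS": (False, 10.0, 5.0, 3.0),
}

def grade(metric, value):
    """Map a metric value to Poor / Acceptable / Good / Excellent."""
    higher_is_better, cut1, cut2, cut3 = THRESHOLDS[metric]
    if not higher_is_better:
        # Flip signs so that "larger is better" holds for every metric.
        value, cut1, cut2, cut3 = -value, -cut1, -cut2, -cut3
    if value < cut1:
        return "Poor"
    if value < cut2:
        return "Acceptable"
    if value < cut3:
        return "Good"
    return "Excellent"

# SwinIR row from the Semarang results table.
swinir = {"PSNR": 26.49, "SSIM": 0.828, "SAM": 6.12, "ERGAS": 5.46}
report = {name: grade(name, value) for name, value in swinir.items()}
print(report)
```

Applied to the SwinIR row, PSNR, SSIM, and SAM land in "Good" while ERGAS (5.46) lands in "Acceptable".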

How-To

Running Evaluation

📊 Evaluate with BasicSR Test Pipeline

# Run test metrics using the same config:
python launch.py -opt swinir_semarang.yml --test_only \
  --model_path experiments/SwinIR_Semarang/models/net_g_500000.pth

Results are logged to experiments/<name>/results/ with per-band PSNR, SSIM, SAM, and ERGAS.

🌱 Compute NDVI from SR Output

# Using GDAL (bands: 1=B01, 2=B02, 3=B03, 4=B04, 5=B05, 6=B8A)
# NDVI = (NIR - Red) / (NIR + Red) = (B8A - B04) / (B8A + B04)
gdal_calc.py -A sr_output.tif --A_band=6 \
  -B sr_output.tif --B_band=4 \
  --calc="(A-B)/(A+B+0.0001)" \
  --outfile=ndvi_sr.tif --type=Float32

Add 0.0001 to denominator to avoid division by zero. Compare with NDVI from raw S2 to assess spectral improvement.

🌏 Multi-Area Validation

# Evaluate across multiple subregions:
python eval_multiarea.py \
  --model_path models/net_g_500000.pth \
  --arch MultiBandSwinIR \
  --areas "semarang_north,semarang_south,semarang_east" \
  --output results/multiarea_eval.csv

Tests generalization within a city. Expect consistent performance across subregions if training data was representative.

Final Assessment

Module 7 Quiz

A city government wants to pilot KDB monitoring in 3 months with a limited cloud GPU budget. Based on the results, what's the most practical approach?

Why does B01 (Coastal Aerosol) consistently have the lowest PSNR across all SR methods?

You're expanding to Surabaya and can either (A) train a lightweight EDSR on 6,300 local patches or (B) use the pre-trained Semarang SwinIR. What does the research suggest?

🎉

Course Complete

You now understand the full pipeline — from raw satellite downloads to enhanced 3-meter imagery. The key takeaways:

✓

The “Train Once Per City” paradigm

One commercial purchase enables unlimited future monitoring from free Sentinel-2 data.

✓

SwinIR > EDSR for satellite SR

Self-attention captures cross-band spectral correlations — a +2.91 dB advantage on multi-spectral data.

✓

Data quality > model size

A small model with relevant local data outperforms a large model trained on a different city.

✓

Spectral fidelity matters

SAM loss preserves band ratios, enabling reliable NDVI and land cover analysis from SR output.

