How AI enhances free 10-meter satellite imagery into 3-meter resolution — making city-scale urban monitoring affordable for Indonesia's 500+ cities.
Why Indonesia needs affordable satellite imagery — and how AI makes it possible
Indonesia's Spatial Planning Law (UU 26/2007) requires monitoring of building coverage across 500+ cities. But there's a problem: seeing buildings from space is expensive.
Commercial imagery is sharp but expensive: monitoring one medium city (400 km²) costs $2,000–10,000 per pass. With quarterly updates, that's $8,000–40,000/year per city.
Free imagery is affordable but coarse: each Sentinel-2 pixel covers 10×10 meters (100 m²), so most individual buildings are invisible because they're smaller than a single pixel.
What if we could make the free imagery sharper using AI? That's exactly what super-resolution does.
The project pairs two satellite systems. One provides the training target, the other provides the free input.
Sentinel-2: free, 10m resolution, revisits every 5 days. Run by ESA (European Space Agency). Captures 13 spectral bands including visible light and infrared.
PlanetScope: commercial (~3m resolution), daily revisit. Run by Planet Labs. Captures 8 bands. Costs $5–10 per km². Sharp enough to see individual buildings and roads.
PlanetScope's 3m pixels are 3× smaller than Sentinel-2's 10m pixels. That means the AI needs to learn to fill in 9× more detail (3×3 = 9 sub-pixels for every original pixel). This 3× ratio is the scale factor of the entire pipeline.
Key parameters for the two satellite systems used in this pipeline:
| Parameter | Sentinel-2 (Input) | PlanetScope SuperDove (Target) |
|---|---|---|
| Operator | ESA (European Space Agency) | Planet Labs |
| Cost | Free & open | $5–10 / km² |
| Spatial Resolution | 10m / 20m / 60m (band-dependent) | ~3m (all bands) |
| Spectral Bands | 13 bands (443–2190 nm) | 8 bands (431–885 nm) |
| Bands Used (pipeline) | B01 B02 B03 B04 B05 B8A | Coastal Blue, Blue, Green, Red, Red Edge, NIR |
| Swath Width | 290 km | ~24 km |
| Revisit Time | 5 days | Daily |
| Radiometric Depth | 12-bit | 12-bit |
| Data Format | .SAFE (JPEG2000 bands) | GeoTIFF |
| Scale Factor | 3× (10m → ~3.3m) | |
The concept: show a neural network thousands of paired examples — “here's the blurry version, here's the sharp version” — until it learns to predict the sharp details from blurry input alone.
Input: 64×64 pixels at 10m → Neural network: learns the 10m → 3m mapping → Output: 192×192 pixels at ~3m
After training, the network can enhance any Sentinel-2 scene — it doesn't need PlanetScope anymore. The deep learning model has internalized the patterns that distinguish 10m imagery from 3m imagery.
Here's the economic breakthrough that makes this practical:
Buy a single PlanetScope scene for your city (~$2–4K for 400 km²). This is your only commercial data cost.
Use the paired S2/PS data to train a city-specific SR model. Takes 2–22 GPU-hours depending on architecture.
Apply the trained model to every future free Sentinel-2 pass. S2 revisits every 5 days = 73 enhanced images per year.
Traditional approach: buying 4 commercial images/year × $10K = $40K+. This approach: one-time $3K investment + ~$50/year in cloud compute for inference. Once the model is trained, recurring costs drop by about 99.9% ($50 vs $40K per year); even in the first year, counting the $3K purchase, the saving is roughly 92%. For a country with 500+ cities, this is the difference between monitoring a handful of cities and monitoring all of them.
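The arithmetic behind this comparison can be sketched in a few lines. The dollar figures are the ones quoted above; the function and key names are illustrative only.

```python
def annual_costs(commercial_per_image=10_000, images_per_year=4,
                 one_time_purchase=3_000, inference_per_year=50):
    """Compare yearly spend per city (USD) for the two approaches."""
    traditional = commercial_per_image * images_per_year
    first_year = one_time_purchase + inference_per_year
    later_years = inference_per_year
    return {
        "traditional": traditional,
        "sr_first_year": first_year,
        "sr_later_years": later_years,
        # Fractional savings relative to buying commercial imagery every year.
        "saving_first_year": 1 - first_year / traditional,
        "saving_later_years": 1 - later_years / traditional,
    }


costs = annual_costs()
print(f"Traditional:    ${costs['traditional']:,}/year")
print(f"SR, first year: ${costs['sr_first_year']:,} "
      f"({costs['saving_first_year']:.1%} saved)")
print(f"SR, later:      ${costs['sr_later_years']:,}/year "
      f"({costs['saving_later_years']:.1%} saved)")
```

Separating the first year from later years makes it clear that the headline saving applies to recurring costs, after the one-time purchase has been paid for.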
The cast of scripts that turn raw satellite data into enhanced imagery
Think of a semiconductor fabrication plant. Raw silicon enters one end, and through a series of precisely controlled stations — cleaning, etching, layering, testing — a finished chip emerges at the other end. Each station does one job extremely well.
This pipeline works the same way. Raw satellite files enter, and through 7 stages, enhanced imagery comes out.
Unify all Sentinel-2 bands to the same 10m resolution
Crop the large S2 scene to match the smaller PlanetScope extent
Align the two images to sub-pixel accuracy
Cut both images into thousands of small, aligned training tiles
Scale raw sensor values to a 0–1 range the neural network can digest
Feed patch pairs to the neural network until it learns the 10m → 3m mapping
Apply the trained model to full Sentinel-2 scenes, tile by tile
Each stage is handled by one or more Python scripts. Here's who does what:
Rather than building everything from scratch, this project plugs into BasicSR, an existing super-resolution framework. The key file is launch.py — it registers all custom pieces:
# CRITICAL: Import our package FIRST to register all
# custom datasets, archs, losses, metrics with BasicSR.
import s2ps_dataset # registers datasets
import s2ps_archs # registers architectures
import s2ps_losses # registers losses
import s2ps_metrics # registers metrics
import s2ps_model # registers models
These lines must run before anything else, so BasicSR knows about our custom satellite-specific components.
Load our dataset loaders — they know how to read paired S2/PS patches and normalize them.
Load our neural network architectures — EDSR, SwinIR, and BandAdapter, all designed for 6-band satellite input.
Load our grading rubrics — SAM loss keeps the spectral colors honest.
Load our measurement tools — PSNR, SSIM, SAM, ERGAS for quality evaluation.
Load our training orchestrator — it knows how to handle 6 bands where BasicSR expects 3.
This “register then use” approach is one of the most powerful patterns in software. Each import statement tells BasicSR: “Hey, I exist — add me to your menu of options.” Then a YAML config file picks which registered components to actually use for a given experiment. Change the config, not the code.
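A minimal sketch of the register-then-use idea follows. It is a generic illustration, not BasicSR's own registry class: `Registry`, `ARCHS`, `TinyEDSR`, and `build_from_config` are hypothetical names chosen for this example.

```python
class Registry:
    """Minimal name -> class lookup table (illustrative, not BasicSR's class)."""

    def __init__(self, name):
        self.name = name
        self._items = {}

    def register(self, cls):
        # Used as a decorator: importing the module that defines `cls`
        # is enough to add it to the menu of options.
        if cls.__name__ in self._items:
            raise KeyError(f"{cls.__name__} already registered in {self.name}")
        self._items[cls.__name__] = cls
        return cls

    def get(self, key):
        return self._items[key]


ARCHS = Registry("arch")  # hypothetical registry instance


@ARCHS.register
class TinyEDSR:  # hypothetical stand-in for a real architecture
    def __init__(self, num_in_ch=6, num_feat=64):
        self.num_in_ch = num_in_ch
        self.num_feat = num_feat


def build_from_config(cfg):
    """Instantiate whichever class the config's 'type' key names."""
    cfg = dict(cfg)  # copy so the caller's config is left untouched
    cls = ARCHS.get(cfg.pop("type"))
    return cls(**cfg)


# In a real project this dict would be parsed from the YAML file.
net = build_from_config({"type": "TinyEDSR", "num_in_ch": 6, "num_feat": 32})
print(type(net).__name__, net.num_feat)
```

Switching experiments then means changing the `type` string and its arguments in the config, while the code that builds the network stays the same.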
Each stage of the pipeline has a corresponding command. Here are the essential recipes:
Requires DSen2 cloned alongside this repo. Add --no_dsen2 to fall back to bilinear resampling.
Use --stride 32 for 50% overlap (recommended). Use --stride 64 for no overlap (fewer patches, faster).
Checkpoints saved every 5,000 iterations in experiments/<name>/models/
For GPU memory issues, reduce --tile_size to 48 or 32. Increase --tile_overlap to 16 for smoother seams.
From raw satellite downloads to perfectly aligned training patches
Sentinel-2 captures light at different wavelengths — but not all bands have the same resolution. It's like having a camera where the red channel is sharp but the blue channel is blurry.
10m bands: B02 (Blue), B03 (Green), B04 (Red). These are already sharp and pass through unchanged.
20m bands: B05 (Red Edge), B8A (NIR). These need a 2× sharpening step. Critical for vegetation analysis.
60m band: B01 (Coastal Aerosol). This needs a heroic 6× sharpening, and it always has the lowest quality in the final output.
S2_L2A_BAND_INFO = {
    'B01': ('*B01_60m.jp2', 60),
    'B02': ('*B02_10m.jp2', 10),
    'B03': ('*B03_10m.jp2', 10),
    'B04': ('*B04_10m.jp2', 10),
    'B05': ('*B05_20m.jp2', 20),
    'B8A': ('*B8A_20m.jp2', 20),
}
OUTPUT_BANDS = ['B01', 'B02', 'B03', 'B04', 'B05', 'B8A']
NEEDS_DSEN2 = {'B01', 'B05', 'B8A'}
A lookup table mapping each band name to its file pattern and native resolution in meters.
B01 arrives as a JPEG2000 file at 60m — very coarse.
B02, B03, B04 are already at 10m — the sharp ones (Blue, Green, Red).
B05 and B8A arrive at 20m — moderately coarse.
The final output stacks all 6 bands into one file, in this specific order.
Only these three bands need the DSen2 neural network to sharpen them. The rest are already at 10m.
Imagine taking a photo of a building from a drone at 100m altitude, then asking a friend on a different drone at 300m to photograph the same building. The photos will show the same place but from slightly different angles, at different times, with different zoom levels. Before you can compare them pixel-by-pixel, you need to align them precisely.
That's the co-registration challenge. The pipeline handles it in two steps:
A Sentinel-2 tile spans ~110km, while a PlanetScope scene spans ~24km. Crop the S2 scene to the PS footprint using georeference coordinates. After this, both images cover the same area.
Even after cropping, the images may be off by a few meters. Cross-correlation finds the optimal shift (typically 2–5 meters) and slides the PlanetScope image into exact alignment.
A misalignment of even 1 pixel (10m) means the neural network is comparing a building in one image to a garden in the other. It would learn noise instead of the real 10m→3m relationship. The pipeline achieves alignment better than 0.5 pixels — less than 5 meters of error across a 24km scene.
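The shift-finding step can be illustrated with FFT-based cross-correlation. This is a simplified sketch, not the pipeline's own code: it assumes both images are already on the same grid, handles a single band, and recovers whole-pixel shifts only, whereas sub-pixel accuracy would additionally need peak interpolation or an upsampled correlation. The function name `estimate_shift` is hypothetical.

```python
import numpy as np


def estimate_shift(reference, moving):
    """Return the integer (dy, dx) roll that best aligns `moving` to `reference`."""
    ref = reference - reference.mean()
    mov = moving - moving.mean()
    # Circular cross-correlation computed in the frequency domain.
    corr = np.fft.ifft2(np.fft.fft2(ref) * np.conj(np.fft.fft2(mov))).real
    peak_y, peak_x = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    # Peaks past the midpoint wrap around and represent negative shifts.
    dy = peak_y - h if peak_y > h // 2 else peak_y
    dx = peak_x - w if peak_x > w // 2 else peak_x
    return int(dy), int(dx)


rng = np.random.default_rng(0)
reference = rng.random((64, 64))
# Simulate a misregistered copy: 3 pixels up, 5 pixels right.
moving = np.roll(reference, shift=(-3, 5), axis=(0, 1))

dy, dx = estimate_shift(reference, moving)
aligned = np.roll(moving, shift=(dy, dx), axis=(0, 1))
print(dy, dx, np.allclose(aligned, reference))
```

Real scenes differ in sensor, time, and illumination, so the correlation peak is broader than in this synthetic case; the principle of sliding one image until the correlation is maximal is the same.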
A full satellite scene is massive — thousands of pixels wide. Patches are small, fixed-size crops that the neural network can actually process. Think of it like slicing a pizza into squares — you feed the network one square at a time.
Sentinel-2 patch (640m × 640m) ↔ PlanetScope patch (same 640m area)
A sliding window with 50% overlap extracts patches across the entire scene. For Semarang: 3,298 training pairs + 366 validation pairs.
for y in range(0, h_lr - self.patch_size_lr + 1, self.stride):
    for x in range(0, w_lr - self.patch_size_lr + 1, self.stride):
        lr_patch = lr_image[:, y:y+self.patch_size_lr,
                            x:x+self.patch_size_lr]
        y_hr = y * self.scale_factor
        x_hr = x * self.scale_factor
        hr_patch = hr_image[:, y_hr:y_hr+self.patch_size_hr,
                            x_hr:x_hr+self.patch_size_hr]
Slide a 64-pixel window across the image, moving by 'stride' pixels each step (32 = 50% overlap).
Also slide horizontally at each row position.
Cut out the low-resolution patch from Sentinel-2 at this window position. The [:, ...] reads all 6 spectral bands at once.
Calculate where the same spot is in the 3× larger PlanetScope image. Multiply coordinates by 3.
Cut out the matching high-resolution patch — 192×192 pixels covering the exact same ground area.
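The loop above is an excerpt from a class. A self-contained version of the same idea, written as a generator, might look like this. The function name and defaults are assumptions for illustration, and the arrays are dummy data.

```python
import numpy as np


def extract_patch_pairs(lr_image, hr_image, patch_size_lr=64, stride=32, scale=3):
    """Yield aligned (lr_patch, hr_patch) pairs from co-registered (C, H, W) arrays."""
    _, h_lr, w_lr = lr_image.shape
    patch_size_hr = patch_size_lr * scale
    for y in range(0, h_lr - patch_size_lr + 1, stride):
        for x in range(0, w_lr - patch_size_lr + 1, stride):
            lr_patch = lr_image[:, y:y + patch_size_lr, x:x + patch_size_lr]
            # The same ground location sits at scale-times-larger coordinates in HR.
            y_hr, x_hr = y * scale, x * scale
            hr_patch = hr_image[:, y_hr:y_hr + patch_size_hr,
                                x_hr:x_hr + patch_size_hr]
            yield lr_patch, hr_patch


# Dummy 6-band scene: 128x160 LR pixels and the matching 3x larger HR grid.
lr = np.zeros((6, 128, 160), dtype=np.float32)
hr = np.zeros((6, 384, 480), dtype=np.float32)
pairs = list(extract_patch_pairs(lr, hr))
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

With a 64-pixel window and a stride of 32, this small scene yields 3 window positions vertically and 4 horizontally.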
Raw satellite values range from 0 to 10,000+ (surface reflectance scaled by 10,000). Neural networks work best with values between 0 and 1. Normalization bridges this gap.
def _percentile(self, image):
    # Output has the same shape as the input, stored as float32.
    result = np.zeros_like(image, dtype=np.float32)
    for c in range(image.shape[0]):
        band = image[c]
        valid = band[band > 0]  # zero marks nodata
        p_low = np.percentile(valid, self.percentile_low)
        p_high = np.percentile(valid, self.percentile_high)
        if p_high - p_low > 0:
            result[c] = np.clip(
                (band - p_low) / (p_high - p_low), 0.0, 1.0
            )
    return result
Define the percentile normalization method — this is how raw sensor values get squeezed into 0–1.
Start with an empty result image of the same size.
Process each spectral band independently (6 bands = 6 loops).
Grab the current band's pixel values. Ignore zero-valued pixels (those are "no data" areas like clouds or ocean).
Find the 2nd percentile value — the "dark floor." Anything darker is probably noise.
Find the 98th percentile — the "bright ceiling." Anything brighter is probably glare.
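Squeeze the useful range (2nd to 98th percentile) into 0–1 and clip any outliers. This preserves relative differences within each band. Because every band is scaled with its own limits, ratios between bands are not preserved in normalized units, so indices like NDVI should be computed after mapping values back to reflectance.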
Each band captures a different slice of the electromagnetic spectrum. Infrared is naturally brighter than blue. If you normalized all bands together, you'd crush the faint bands into invisibility. Per-band normalization preserves the relative brightness within each band while making them all network-friendly.
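A standalone sketch of per-band percentile normalization is shown below, extended with two things that are assumptions of this example rather than part of the excerpt: it returns the per-band limits, and it includes a `denormalize` helper that maps values back to the original units.

```python
import numpy as np


def percentile_normalize(image, low=2.0, high=98.0):
    """Scale each band of a (C, H, W) array to [0, 1]; also return per-band limits."""
    out = np.zeros(image.shape, dtype=np.float32)
    limits = []
    for c, band in enumerate(image):
        valid = band[band > 0]  # zero is treated as nodata
        if valid.size == 0:  # band with no valid pixels stays all-zero
            limits.append((0.0, 0.0))
            continue
        p_low, p_high = np.percentile(valid, [low, high])
        limits.append((float(p_low), float(p_high)))
        if p_high > p_low:
            out[c] = np.clip((band - p_low) / (p_high - p_low), 0.0, 1.0)
    return out, limits


def denormalize(normalized, limits):
    """Map normalized values back to original units (clipped outliers stay clipped)."""
    out = np.empty_like(normalized, dtype=np.float32)
    for c, (p_low, p_high) in enumerate(limits):
        out[c] = normalized[c] * (p_high - p_low) + p_low
    return out


rng = np.random.default_rng(0)
img = rng.integers(1, 10_000, size=(6, 32, 32)).astype(np.float32)
norm, limits = percentile_normalize(img)
restored = denormalize(norm, limits)
print(norm.min(), norm.max(), limits[0])
```

Keeping the limits alongside the normalized array is what makes it possible to return to reflectance units later, for example before computing a band ratio.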
Follow these steps to understand what happens to a single patch pair:
A pixel in B04 (Red) reads 1,247 — surface reflectance × 10,000. This is raw sensor output.
The 2nd percentile is 342, 98th is 2,891. Normalized value: (1247 - 342) / (2891 - 342) = 0.355. Now in [0, 1] range.
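This pixel is at position (row=142, col=87) in the scene. On the non-overlapping grid (stride 64), it lands in the patch at grid position (row=2, col=1), the 64×64 crop starting at (128, 64). Our pixel is at local position (14, 23) within the patch. With the default stride of 32, the same pixel also appears in the neighboring overlapping patches.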
The corresponding PlanetScope patch starts at (128×3, 64×3) = (384, 192). It's 192×192 pixels covering the exact same 640m × 640m ground area — but with 9× more pixels.
Both patches are saved as .npy files (NumPy arrays): train/sentinel2/patch_0142.npy shape (6, 64, 64) and train/planetscope/patch_0142.npy shape (6, 192, 192).
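The numbers in this walkthrough can be reproduced with plain arithmetic. The helper names below are hypothetical, and the grid lookup assumes non-overlapping patches (stride equal to the patch size).

```python
def normalize_value(value, p_low, p_high):
    """Percentile-normalize a single value and clip it to [0, 1]."""
    return min(max((value - p_low) / (p_high - p_low), 0.0), 1.0)


def locate_pixel(row, col, patch_size=64, scale=3):
    """Locate a pixel in a non-overlapping patch grid (stride == patch_size)."""
    grid = (row // patch_size, col // patch_size)
    lr_origin = (grid[0] * patch_size, grid[1] * patch_size)
    local = (row - lr_origin[0], col - lr_origin[1])
    hr_origin = (lr_origin[0] * scale, lr_origin[1] * scale)
    return grid, lr_origin, local, hr_origin


print(round(normalize_value(1247, 342, 2891), 3))  # 0.355
print(locate_pixel(142, 87))  # ((2, 1), (128, 64), (14, 23), (384, 192))
```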
| Parameter | Default | Description | Impact |
|---|---|---|---|
| --patch_size | 64 | LR patch side length (pixels) | Larger = more context but fewer patches. 64 is standard for 3× SR. |
| --stride | 32 | Sliding window step | 32 = 50% overlap (recommended). 64 = no overlap. 16 = 75% overlap (4× more patches). |
| --scale_factor | 3 | SR magnification | Must match S2/PS resolution ratio. HR patch = LR patch × scale. |
| --bands | B01,..,B8A | Which S2 bands to use | 6-band: B01,B02,B03,B04,B05,B8A. 4-band: B02,B03,B04,B08. |
| --norm_method | percentile | Normalization strategy | percentile (robust), minmax (simple), zscore (zero-mean). |
| --percentile_low | 2 | Dark floor percentile | Clips dark outliers. Higher = more aggressive clipping. |
| --percentile_high | 98 | Bright ceiling percentile | Clips bright outliers. Lower = more aggressive clipping. |
| --min_valid | 0.8 | Minimum non-nodata fraction | Rejects patches with >20% nodata (clouds, edges). |
Two architectures that see satellite imagery in fundamentally different ways
Think of two detectives examining a crime scene photo. One works methodically with a magnifying glass, scanning small areas one at a time. The other steps back, studying how distant clues relate to each other. Both find important details — but the second one catches patterns the first misses.
EDSR: a CNN architecture with 16 residual blocks. Fast to train (about 2 hours), 1.56 million parameters. Sees local patterns within a small window.
SwinIR: a Transformer architecture with 6 stages of 6 attention layers each. Slower to train (about 22 hours), 11.9 million parameters. Can relate distant pixels across the entire patch.
On regular photos, Transformers beat CNNs by ~0.3 dB. On multi-spectral satellite data, SwinIR beats EDSR by 2.91 dB, nearly 10× the usual gap. Why? A likely explanation is that self-attention can relate spectral patterns across the whole patch: “this Red Edge brightness implies this NIR level implies vegetation.” A CNN does mix bands inside each convolution, but only within a small local neighborhood at a time.
EDSR's architecture is elegantly simple: lift the input bands into a wider feature space, refine those features through a series of residual blocks, then upsample to the target resolution and project back to the output bands.
Head: 6 bands → 64 features
Body: learn fine details
Upsample: 3× upscale
Tail: 64 features → 6 bands
def forward(self, x):
    head = self.head(x)
    body = self.body(head)
    body = body + head  # Global residual
    up = self.upsample(body)
    out = self.tail(up)
    return out
This is the main processing function — data enters as x (a 6-band, 64×64 patch).
Expand 6 input bands into 64 internal feature channels.
Pass through 16 residual blocks — each refines the features a little more.
The key trick: add the original features back. The network only needs to learn the difference between blurry and sharp — like editing a first draft instead of writing from scratch.
Expand from 64×64 to 192×192 using PixelShuffle — the 3× magnification happens here.
Compress 64 features back to 6 output bands. Done! One sharp patch emerges.
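The 3× magnification in the upsample step comes from a pixel-shuffle rearrangement: channels are traded for spatial resolution. The NumPy sketch below mimics that rearrangement (using the channel ordering of PyTorch's `nn.PixelShuffle`) on dummy data. The channel counts are chosen for illustration and are not the model's actual layer widths.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) features into a (C, H*r, W*r) image."""
    c_in, h, w = x.shape
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)  # split channels into an r x r sub-pixel grid
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (C, H, r, W, r)
    return x.reshape(c_out, h * r, w * r)


# 6 output channels at scale 3 need 6 * 3 * 3 = 54 input channels.
features = np.arange(54 * 64 * 64, dtype=np.float32).reshape(54, 64, 64)
upscaled = pixel_shuffle(features, r=3)
print(upscaled.shape)  # (6, 192, 192)
```

Each group of nine input channels supplies the nine sub-pixels of one output channel, so no values are invented at this step; the convolution before it is what learns the content of those sub-pixels.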
Most pretrained super-resolution models expect 3-channel RGB input. Our satellite data has 6 bands. The BandAdapter pattern solves this with learnable 1×1 convolutions that wrap any pretrained model:
Input adapter: 6 bands → 3 channels
Pretrained backbone: RGB weights (frozen or fine-tuned)
Output adapter: 3 channels → 6 bands
def _init_adapters(self):
    with torch.no_grad():
        w_in = self.adapter_in[0].weight
        nn.init.zeros_(w_in)
        for i in range(min(3, self.num_bands)):
            w_in[i, i, 0, 0] = 1.0
Initialize the adapter weights so RGB bands pass through unchanged at the start.
"no_grad" means we're manually setting values, not learning them (yet).
Grab the input adapter's weight matrix (a 3×6 transformation).
Zero everything out first — a clean slate.
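Wire the first three input bands straight through to the pretrained RGB model's three channels. Note that the code selects indices 0–2; with the band order B01, B02, B03, B04, B05, B8A those are Coastal Aerosol, Blue, and Green, so feeding Blue, Green, Red (B02–B04) would require reordering the bands first. The adapter starts as an identity for those 3 bands and zero for the rest, then gradually learns how to use all 6 bands during fine-tuning.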
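To see what this initialization does, the sketch below applies an identity-initialized 1×1 adapter to a dummy patch in NumPy. A 1×1 convolution is just a per-pixel matrix multiply across the band axis, so `einsum` stands in for the convolution layer here; the variable names are illustrative.

```python
import numpy as np

num_bands, num_rgb = 6, 3

# 1x1 convolution weights with layout (out_channels, in_channels).
w_in = np.zeros((num_rgb, num_bands), dtype=np.float32)
for i in range(min(num_rgb, num_bands)):
    w_in[i, i] = 1.0  # output channel i copies input band i

patch = np.random.default_rng(0).random((num_bands, 64, 64)).astype(np.float32)

# Per-pixel matrix multiply across the band axis.
adapted = np.einsum("oc,chw->ohw", w_in, patch)
print(adapted.shape, np.allclose(adapted, patch[:3]))
```

At initialization the backbone therefore sees exactly the first three bands of the stack, and the remaining bands contribute nothing until training moves their weights away from zero.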
| Specification | EDSR (MultiBandEDSR) | SwinIR (MultiBandSwinIR) | BandAdapterNet |
|---|---|---|---|
| Type | CNN (Convolutional) | Transformer (Self-Attention) | Adapter wrapper |
| Parameters | 1.56 M | 11.9 M | Backbone + ~0.1 M |
| Input Channels | num_in_ch: 6 | num_in_ch: 6 | Any N → 3 → N |
| Key Config | num_feat: 64, num_block: 16 | embed_dim: 180, depths: [6,6,6,6,6,6], window_size: 8 | freeze_backbone_epochs |
| Upsampler | PixelShuffle (3×) | PixelShuffle (3×) | Inherits from backbone |
| Training Time | 2h 7min (H100) | 22h 28min (H100) | Varies (backbone-dependent) |
| PSNR (Semarang) | 23.58 dB | 26.49 dB | Depends on backbone |
| Best For | Fast prototyping, limited GPU | Production quality | Reusing pretrained RGB weights |
| Config Files | esrgan_semarang.yml | swinir_semarang.yml | bandadapter_esrgan_s2ps.yml |
How loss functions, training loops, and 500,000 iterations shape a model
Imagine grading a student's painting of a landscape. You could score it on two axes: (1) “Does each brushstroke land in the right spot?” — pixel accuracy; and (2) “Do the colors look natural together?” — spectral fidelity. This pipeline uses two loss functions that capture exactly these two qualities:
L1 loss: the absolute difference between each predicted pixel and the real pixel, averaged across the entire patch. Penalizes every pixel that's off, proportional to how far off it is. Simple and effective.
Spectral Angle Mapper: measures the angle between the predicted and true spectral signature at each pixel. Ensures band ratios stay correct — critical for NDVI and other indices.
p = pred.reshape(b, c, -1)
t = target.reshape(b, c, -1)
dot = (p * t).sum(dim=1)
norm_p = p.norm(dim=1).clamp(min=self.eps)
norm_t = t.norm(dim=1).clamp(min=self.eps)
cos_angle = (dot / (norm_p * norm_t)).clamp(-1+self.eps, 1-self.eps)
sam = torch.acos(cos_angle)
Reshape both images so each pixel becomes a 6-number vector (one value per band). Think of each pixel as a point in 6-dimensional color space.
Compute the dot product between predicted and target spectral vectors — this measures alignment.
Calculate the length of each vector. The clamp prevents division by zero (pixels that are completely black).
Divide dot product by the vector lengths to get the cosine of the angle between them.
Take the arc-cosine to convert from cosine to actual angle in radians. Result: 0 = perfectly matching spectra, π/2 = completely different spectra.
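For readers who want to try the spectral angle outside of PyTorch, here is a NumPy sketch of the same computation, plus the L1 + 0.1×SAM combination used as the training objective. Function names are hypothetical. Since nothing is differentiated here, the cosine is clipped to [-1, 1] instead of being kept strictly inside that range as in the excerpt.

```python
import numpy as np


def spectral_angle(pred, target, eps=1e-8):
    """Mean spectral angle in radians between two (C, H, W) arrays."""
    p = pred.reshape(pred.shape[0], -1)      # (bands, pixels)
    t = target.reshape(target.shape[0], -1)
    dot = (p * t).sum(axis=0)
    norm_p = np.maximum(np.linalg.norm(p, axis=0), eps)
    norm_t = np.maximum(np.linalg.norm(t, axis=0), eps)
    cos_angle = np.clip(dot / (norm_p * norm_t), -1.0, 1.0)
    return float(np.arccos(cos_angle).mean())


def combined_loss(pred, target, sam_weight=0.1):
    """L1 pixel loss plus a weighted spectral-angle term."""
    l1 = float(np.abs(pred - target).mean())
    return l1 + sam_weight * spectral_angle(pred, target)


rng = np.random.default_rng(0)
target = rng.random((6, 8, 8)) + 0.1
brighter = 2.0 * target  # same spectral shape, twice the brightness

print(spectral_angle(brighter, target))  # close to 0: the angle ignores brightness
print(combined_loss(brighter, target))   # dominated by the L1 term
```

The example shows why the two losses complement each other: doubling the brightness leaves the spectral angle near zero, so only the L1 term notices the error.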
Training is a conversation between the model and its grading rubric, repeated 500,000 times. Each iteration follows the same pattern:
Feed a batch of 16 LR patches through the model → get 16 predicted HR patches.
Compare predictions to real HR patches using L1 + 0.1×SAM. Get a single number: how wrong is the model?
Backpropagation computes gradients — which parameters contributed most to the error.
The Adam optimizer nudges each parameter slightly in the direction that reduces the loss.
Each iteration processes 16 patches. Over 500K iterations, the model sees ~8 million patch examples (with augmentation and repetition).
Every experiment is defined by a YAML configuration file. Change the recipe, change the experiment — no code modifications needed.
Which patches to use, normalization method, augmentation on/off, train/val split paths.
Which model (EDSR/SwinIR), number of blocks/layers, embedding dimension, scale factor.
Loss weights, learning rate, optimizer, scheduler, total iterations, checkpoint frequency.
Which metrics to compute, how often to validate, early stopping criteria.
The project includes 21 different YAML configs — each testing one variable at a time. Change SAM weight from 0.1 to 0.5? New config file. Try 32 residual blocks instead of 16? New config file. This systematic approach, called an ablation study, is how researchers prove which design choices actually improve results.
Common adjustments you can make by editing the YAML config — no code changes needed:
Higher SAM weight = better spectral fidelity but slightly lower PSNR. Good for NDVI-critical applications. The project tested λ = {0.01, 0.05, 0.1, 0.2, 0.5}.
EDSR with 8 blocks trains in ~1 hour. Ablation showed 8-block gets within 0.5 dB of 16-block.
If PSNR plateaus for 50K+ iterations, training has converged. For SwinIR, most gains happen by 250K iterations.
| Parameter | Value | Notes |
|---|---|---|
| Optimizer | Adam | β1=0.9, β2=0.99 |
| Learning Rate | 2×10⁻⁴ | Warm-up for SwinIR (5K iters), none for EDSR |
| Scheduler | CosineAnnealingRestartLR | 4 periods of 250K, restart weights [1, 0.5, 0.5, 0.5] |
| Total Iterations | 500,000 | Both EDSR and SwinIR |
| Batch Size | 16 | Per GPU. Scale with GPU count. |
| L1 Loss Weight | 1.0 | Primary loss — pixel accuracy |
| SAM Loss Weight | 0.1 | 10% contribution — spectral fidelity |
| Checkpoint Frequency | Every 5,000 iters | Saved to experiments/<name>/models/ |
| Validation Frequency | Every 5,000 iters | Metrics: PSNR, SSIM, SAM, ERGAS per-band |
| GT Size (HR crop) | 192 | Random crop of HR patch during training for augmentation |
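The learning-rate curve implied by the scheduler row can be sketched in pure Python. This follows the usual cosine-annealing-with-restarts rule; the minimum learning rate `eta_min=0` is an assumption, since the table does not list one, and warm-up is ignored.

```python
import math


def cosine_restart_lr(iteration, base_lr=2e-4, periods=(250_000,) * 4,
                      restart_weights=(1.0, 0.5, 0.5, 0.5), eta_min=0.0):
    """Learning rate at a given iteration under cosine annealing with restarts."""
    start = 0
    for period, weight in zip(periods, restart_weights):
        if iteration <= start + period:
            progress = (iteration - start) / period  # 0 at restart, 1 at period end
            return eta_min + weight * 0.5 * (base_lr - eta_min) * (
                1 + math.cos(math.pi * progress))
        start += period
    return eta_min  # past the last period


for it in (0, 125_000, 250_000, 250_001, 375_000, 500_000):
    print(f"{it:>7d}  {cosine_restart_lr(it):.2e}")
```

With a 500,000-iteration run, only the first two of the four listed periods are reached: the rate decays from its peak over the first 250K iterations, restarts at half the peak, and decays again.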
How a trained model processes an entire city — tile by seamless tile
A full Sentinel-2 scene is roughly 10,000 × 10,000 pixels. The neural network was trained on 64×64 patches. You can't feed the whole scene at once: the intermediate feature maps for 100 million pixels would far exceed the memory of a single GPU.
The solution: slide a window across the scene, process each tile independently, then stitch the results together. Like mowing a lawn in overlapping rows — but the overlap is where the magic happens.
64×64 tiles with 8px overlap → model (64×64 → 192×192) → cosine window for smooth edges → georeferenced 3m output
Without blending, you'd see visible grid lines where tiles meet — each tile's edge pixels would abruptly jump to the next tile's values. The solution: a raised-cosine blend window that tapers each tile's contribution from full strength at the center to zero at the edges.
def _make_blend_window(size, overlap):
    w = np.ones(size, dtype=np.float32)
    if overlap > 0:  # with overlap == 0, w[-0:] would select the whole array
        ramp = np.linspace(0, 1, overlap, dtype=np.float32)
        cos_ramp = 0.5 * (1 - np.cos(np.pi * ramp))
        w[:overlap] *= cos_ramp
        w[-overlap:] *= cos_ramp[::-1]
    window = w[np.newaxis, :] * w[:, np.newaxis]
    return window[np.newaxis]
Create a blending window for a tile of the given size with the given overlap.
Start with all 1s — full weight everywhere.
Create a smooth ramp from 0 to 1 over the overlap zone (8 pixels).
Shape the ramp into a cosine curve — starts slow, accelerates in the middle, ends slow. Smoother than a linear fade.
Apply the ramp to the left edge (fades in) and the reversed ramp to the right edge (fades out). The center stays at full weight.
Make it 2D by multiplying horizontal and vertical ramps. Corners get very low weight (fade × fade), edges get moderate weight, center gets full weight.
Add a band dimension so it can be broadcast across all 6 spectral bands at once.
For every tile: extract, predict, accumulate with blending weights. Then divide by total weight to normalize.
with torch.no_grad():
    for row in range(n_rows):
        for col in range(n_cols):
            y0 = min(row * stride, h - tile_size)
            x0 = min(col * stride, w - tile_size)
            tile_lr = img_norm[:, y0:y1, x0:x1]
            tile_sr = model(tile_t)
            output[:, oy0:oy1, ox0:ox1] += tile_sr * blend
            weight[:, oy0:oy1, ox0:ox1] += blend
Disable gradient tracking — we're just predicting, not learning. This saves ~50% memory.
Loop through every tile position, row by row, column by column.
Calculate the top-left corner of this tile. The min() ensures the last tile in each row/column doesn't run off the edge.
Extract the 64×64 LR tile from the normalized image (all 6 bands).
Run the neural network — 64×64 in, 192×192 out. This is where the magic happens, one tile at a time.
Add this tile's output to the accumulator, weighted by the cosine blend window. Where tiles overlap, both contribute — but with fading weights so the transition is invisible.
Track the total weight at each pixel. After all tiles, divide output by weight to normalize.
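The whole tile-predict-blend-normalize cycle can be exercised end to end without a GPU by swapping the network for a trivial stand-in. The sketch below is a NumPy illustration under several assumptions: `nearest_upsample` replaces the trained model, the scene is at least one tile in each dimension, and the blend ramp skips exact zero (unlike the excerpt above) so that border pixels never end up with zero total weight.

```python
import math
import numpy as np


def blend_window(size, overlap):
    """2D raised-cosine window whose ramp never reaches exactly 0."""
    w = np.ones(size, dtype=np.float64)
    if overlap > 0:
        ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        cos_ramp = 0.5 * (1.0 - np.cos(np.pi * ramp))
        w[:overlap] *= cos_ramp
        w[-overlap:] *= cos_ramp[::-1]
    return w[np.newaxis, :] * w[:, np.newaxis]


def nearest_upsample(tile, scale=3):
    """Stand-in for the trained network: repeats each pixel scale x scale times."""
    return tile.repeat(scale, axis=1).repeat(scale, axis=2)


def tiled_upscale(image, model, tile_size=64, overlap=8, scale=3):
    """Run `model` tile by tile over a (C, H, W) array and blend the results."""
    c, h, w = image.shape
    stride = tile_size - overlap
    n_rows = max(1, math.ceil((h - overlap) / stride))
    n_cols = max(1, math.ceil((w - overlap) / stride))
    # The window lives on the HR grid, so its size and overlap are scaled too.
    blend = blend_window(tile_size * scale, overlap * scale)[np.newaxis]
    output = np.zeros((c, h * scale, w * scale))
    weight = np.zeros((1, h * scale, w * scale))
    for row in range(n_rows):
        for col in range(n_cols):
            y0 = min(row * stride, h - tile_size)  # clamp the last tile to the edge
            x0 = min(col * stride, w - tile_size)
            tile_sr = model(image[:, y0:y0 + tile_size, x0:x0 + tile_size])
            oy0, ox0 = y0 * scale, x0 * scale
            oy1, ox1 = oy0 + tile_size * scale, ox0 + tile_size * scale
            output[:, oy0:oy1, ox0:ox1] += tile_sr * blend
            weight[:, oy0:oy1, ox0:ox1] += blend
    return output / weight


scene = np.random.default_rng(0).random((6, 150, 200))
result = tiled_upscale(scene, nearest_upsample)
print(result.shape, np.allclose(result, nearest_upsample(scene)))
```

Because the stand-in model treats every pixel independently, the blended result must match upsampling the whole scene in one go, which makes this a convenient sanity check for the tiling and normalization logic before plugging in a real network.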
The output GeoTIFF inherits the input's coordinate system but with 3× smaller pixel size. The geotransform is updated: pixel_size / 3, origin unchanged. The enhanced image slots directly into any GIS workflow.
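As an illustration of that update, the sketch below rescales a GDAL-style six-element geotransform (origin x, pixel width, row rotation, origin y, column rotation, pixel height). The function name and the coordinates are made up for the example; only the tuple layout follows the GDAL convention.

```python
def upscale_geotransform(geotransform, scale=3):
    """Divide the per-pixel terms of a GDAL-style geotransform by `scale`."""
    x0, dx, rot_x, y0, rot_y, dy = geotransform
    # The origin (upper-left corner) stays put; only per-pixel steps shrink.
    return (x0, dx / scale, rot_x / scale, y0, rot_y / scale, dy / scale)


# Made-up UTM origin with 10 m north-up pixels (pixel height is negative).
s2_transform = (430000.0, 10.0, 0.0, 9230000.0, 0.0, -10.0)
sr_transform = upscale_geotransform(s2_transform)
print(sr_transform)

# The ground extent is unchanged: 500 px at 10 m equals 1500 px at 10/3 m.
print(500 * s2_transform[1], 1500 * sr_transform[1])
```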
Input: a 500×400-pixel Sentinel-2 crop covering roughly 5km × 4km at 10m resolution.
Columns: ceil((500 - 8) / 56) = 9 tiles. Rows: ceil((400 - 8) / 56) = 7 tiles. Total: 63 tiles to process.
A 64×64 crop extracted from the normalized image. Passed through the model in ~5ms on GPU.
Multiplied by the cosine blend window: center pixels get weight 1.0, and weights taper toward 0 at the tile edges and corners. Added to the accumulator.
After all 63 tiles: divide accumulator by weight map. Write as GeoTIFF with pixel_size = 10m/3 = 3.33m. Done!
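The tile count in this walkthrough follows from the stride (tile size minus overlap). A small helper, with a hypothetical name, reproduces it:

```python
import math


def tile_grid(height, width, tile_size=64, overlap=8):
    """Return (rows, cols, total) tiles needed to cover a height x width image."""
    stride = tile_size - overlap
    n_rows = max(1, math.ceil((height - overlap) / stride))
    n_cols = max(1, math.ceil((width - overlap) / stride))
    return n_rows, n_cols, n_rows * n_cols


print(tile_grid(400, 500))  # (7, 9, 63)
```

The same function is handy for estimating run time before launching inference: multiply the total by the per-tile latency measured on your hardware.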
Tile size 32 uses ~4× less GPU memory than 64. Processing time increases proportionally (more tiles).
More overlap = wider blend zone = smoother transitions. 16 pixels is usually sufficient.
For Docker deployment, mount volumes: -v ./data:/data -v ./models:/models
| Parameter | Default | Range | Effect |
|---|---|---|---|
| --tile_size | 64 | 16–128 | Larger = faster (fewer tiles) but more GPU memory. Must be ≥ model's training patch size. |
| --tile_overlap | 8 | 0–tile_size/2 | Larger = smoother seams but slower. 0 = no blending (visible grid). 8 is a good default. |
| --scale | 3 | 2, 3, 4, 6, 8 | Must match the trained model's scale factor. |
| --arch | none | MultiBandSwinIR, MultiBandEDSR, MultiBandRRDBNet | Must match the architecture used during training. |
| --device | cuda | cuda, cpu | CPU is ~50× slower but is limited only by system RAM rather than GPU memory. |
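A 10,000×10,000 scene at tile_size=64, overlap=8 has a stride of 56, so it requires ceil((10,000 − 8) / 56) = 179 tiles per side, about 32,000 tiles in total. At ~5ms/tile on an H100, that's roughly 160 seconds. Reducing tile_size to 32 with the same overlap shrinks the stride to 24 and raises the count to about 174,000 tiles, while each tile needs roughly a quarter of the memory. Choose based on your GPU's available memory.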
Four metrics, real results, and what they mean for Indonesian cities
How do you know if enhanced imagery is actually good? You need metrics that capture different aspects of quality — like grading an essay on both grammar and content, not just word count.
PSNR (Peak Signal-to-Noise Ratio). How close is each pixel to the ground truth? Measured in decibels; higher is better. Think of it as the “pixel accuracy score.”
SSIM (Structural Similarity). Do edges, textures, and contrasts look right? A blurry image might have decent PSNR but terrible SSIM because structures are smoothed away.
SAM (Spectral Angle Mapper). Are the band ratios correct? Measured in degrees; lower is better. Below 5° means spectral indices (NDVI, NDBI) will be reliable.
ERGAS. The “overall grade” — combines per-band errors normalized by brightness. Lower is better. The standard fusion quality metric in remote sensing.
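Two of these metrics are short enough to write out. The sketch below gives common textbook forms of PSNR and ERGAS in NumPy; it assumes images normalized to [0, 1] and a (C, H, W) layout, and the project's own implementations may differ in details such as data range or per-band averaging.

```python
import numpy as np


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, data_range]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def ergas(pred, target, scale=3):
    """ERGAS for (C, H, W) arrays; `scale` is the LR-to-HR pixel size ratio."""
    rmse = np.sqrt(np.mean((pred - target) ** 2, axis=(1, 2)))  # per band
    mean = np.mean(target, axis=(1, 2))                          # per band
    return float(100.0 / scale * np.sqrt(np.mean((rmse / mean) ** 2)))


target = np.full((6, 16, 16), 0.5)
pred = target + 0.1  # constant error of 0.1 in every band

print(round(psnr(pred, target), 2))   # 20.0
print(round(ergas(pred, target), 2))  # 6.67
```

Dividing each band's error by that band's mean is what lets ERGAS compare dark and bright bands on equal terms.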
Here's how each method performed on the Semarang test set:
SwinIR achieves 26.49 dB vs bicubic's 16.32 dB, a +10.17 dB improvement. PSNR is logarithmic in mean squared error: every +3 dB halves the MSE, so +10 dB means the MSE is roughly 10× smaller (about 3× smaller in RMSE terms). By that measure, the enhanced imagery is an order of magnitude closer to the real PlanetScope data than simple upscaling.
Sharp pixels are nice, but do they help with actual urban monitoring? The research tested this with land cover classification:
Raw Sentinel-2: the baseline agreement with PlanetScope ground truth. Too coarse: pixels mix buildings with surrounding vegetation.
EDSR: +5.9 percentage points. The CNN captures enough detail to separate building from non-building in most cases.
SwinIR: +7.8 percentage points over raw S2. The Transformer's spectral awareness further improves vegetation vs. built-up discrimination.
Models don't travel well between cities. A Semarang-trained SwinIR applied to Surabaya scored only 19.16 dB — barely better than bicubic (18.14 dB). On one test area, it actually performed worse than no SR at all (17.50 vs 18.89 dB). Each city's urban texture, vegetation patterns, and building density require location-specific training. The model learns “what Semarang looks like at 3m,” not “what any city looks like at 3m.”
Not all bands improve equally. The original resolution of each band matters:
B01 (Coastal Aerosol): originally 60m → DSen2 sharpened to 10m → then SR to 3m. Two stacked upscaling stages mean this band carries the most uncertainty. It's asking the AI to hallucinate detail 20× finer per axis than the sensor actually captured.
B02, B03, B04 (Blue, Green, Red): natively 10m, only 3× upscaling. The model has rich spatial detail to work with.
B05, B8A (Red Edge, NIR): originally 20m → DSen2 to 10m → SR to 3m. Gains are largest for Red Edge (+3.36 dB) and NIR (+3.42 dB): despite the extra upscaling step, these bands are structurally simpler (strong vegetation/non-vegetation contrast) so the model predicts them well.
Use these thresholds to interpret your model's output quality:
| Metric | Poor | Acceptable | Good | Excellent | Direction |
|---|---|---|---|---|---|
| PSNR | < 20 dB | 20–25 dB | 25–30 dB | > 30 dB | Higher ↑ |
| SSIM | < 0.6 | 0.6–0.8 | 0.8–0.9 | > 0.9 | Higher ↑ |
| SAM | > 10° | 7–10° | 5–7° | < 5° | Lower ↓ |
| ERGAS | > 10 | 5–10 | 3–5 | < 3 | Lower ↓ |
| Context | What It Tells You |
|---|---|
| PSNR high but SSIM low | Pixel values are close but structures (edges, textures) are blurred or distorted |
| PSNR high but SAM high | Image looks sharp but spectral signatures are distorted — NDVI/NDBI will be unreliable |
| All metrics good but downstream accuracy low | Sub-pixel alignment may be off, or training data distribution doesn't match test area |
| One band significantly worse | Check if that band had a different native resolution (B01 at 60m is the usual culprit) |
Results are logged to experiments/<name>/results/ with per-band PSNR, SSIM, SAM, and ERGAS.
When computing NDVI from the SR output, add 0.0001 to the denominator to avoid division by zero. Compare with NDVI from raw S2 to assess spectral improvement.
Tests generalization within a city. Expect consistent performance across subregions if training data was representative.
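The NDVI check mentioned above can be sketched as follows. The band indices assume the six-band stack order B01, B02, B03, B04, B05, B8A, and the input is assumed to be in reflectance units rather than per-band normalized values; the function name is illustrative.

```python
import numpy as np

RED, NIR = 3, 5  # positions of B04 and B8A in the six-band stack


def ndvi(image, eps=1e-4):
    """NDVI from a (C, H, W) reflectance stack, with a guarded denominator."""
    red = image[RED].astype(np.float32)
    nir = image[NIR].astype(np.float32)
    return (nir - red) / (nir + red + eps)


stack = np.zeros((6, 4, 4), dtype=np.float32)
stack[RED, :2] = 0.2  # top half: vegetation-like pixels
stack[NIR, :2] = 0.6  # bottom half stays all-zero (nodata)

values = ndvi(stack)
print(values[0, 0], values[3, 3])
```

The small constant keeps nodata pixels at 0 instead of producing NaN, at the cost of a negligible bias for valid pixels.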
You now understand the full pipeline — from raw satellite downloads to enhanced 3-meter imagery. The key takeaways:
One commercial purchase enables unlimited future monitoring from free Sentinel-2 data.
Self-attention captures cross-band spectral correlations — a +2.91 dB advantage on multi-spectral data.
A small model with relevant local data outperforms a large model trained on a different city.
SAM loss preserves band ratios, enabling reliable NDVI and land cover analysis from SR output.