Custom-trained CNN · SALICON dataset · 6.6M parameters

See what your users
see before they do

Upload any image and get an instant AI-generated heatmap showing where human eyes will look first, powered by a CNN trained on 10,000 images with real human fixation data.


How it works

01

Upload

Drop any image · website screenshot, ad, poster, or UI design. JPEG, PNG, or WebP up to 10 MB.

02

AI Analysis

A MobileNetV2 encoder-decoder CNN · trained on 10,000 images with human fixation data · predicts where people look.

03

Attention Map

Get a heatmap showing primary, secondary, and tertiary focus zones plus rule-based UX insights.


About the model

Trained from scratch on SALICON · 10,000 natural images annotated with crowd-sourced human fixation maps.

Evaluation · SALICON validation set (5,000 images)

AUC-Judd · 0.9613 · higher is better
CC · 0.8756 · higher is better
NSS · 2.163 · higher is better
SIM · 0.7649 · higher is better
KL-Div · 0.2383 · lower is better

What do these metrics mean?

AUC-Judd

0.9613

Can the model rank fixated pixels above non-fixated ones?

0.5 = random, 1.0 = perfect. 0.96 means the model almost always assigns higher saliency to pixels humans actually looked at.

CC

0.8756

How closely does the predicted heatmap match the ground truth?

Pearson correlation: -1 = perfectly wrong, 0 = no relationship, 1 = perfect. 0.88 is a strong match.

NSS

2.163

How much above average is predicted saliency at fixation points?

Map normalised to mean 0, std 1. NSS is the average value at human fixation locations. 2.16 standard deviations above average is strong.

SIM

0.7649

How much do the predicted and ground truth distributions overlap?

Histogram intersection: 0 = no overlap, 1 = identical. 0.76 means 76% of the probability mass is shared.

KL-Div

0.2383

How different is the predicted distribution from the real one?

Lower is better. 0 = perfect. 0.24 is low, meaning the model closely matches where humans looked.
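The distribution-based metrics above have compact definitions. Here is a minimal NumPy sketch (function names are my own; the project's actual evaluation code is not shown here) of CC, NSS, SIM, and KL divergence as described:

```python
import numpy as np

def cc(pred, gt):
    """Pearson correlation between two saliency maps."""
    return np.corrcoef(pred.ravel(), gt.ravel())[0, 1]

def nss(pred, fixations):
    """Mean of the normalised (zero-mean, unit-std) prediction at fixated pixels."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return p[fixations.astype(bool)].mean()

def sim(pred, gt):
    """Histogram intersection of the two maps treated as probability distributions."""
    p = pred / (pred.sum() + 1e-8)
    g = gt / (gt.sum() + 1e-8)
    return np.minimum(p, g).sum()

def kl_div(pred, gt):
    """KL divergence from the ground-truth distribution to the prediction."""
    eps = 1e-8
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return (g * np.log(eps + g / (p + eps))).sum()
```

The small epsilon terms guard against division by zero and log(0); exact epsilon placement varies between benchmark implementations.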

Architecture

MobileNetV2 encoder with U-Net-style skip connections and an upsampling decoder. 6.6M parameters.
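The key constraint in this kind of design is that each decoder upsampling step must land on the resolution of a matching encoder feature map so the skip connection can be concatenated. A small shape-tracing sketch (the five stride-2 stages follow standard MobileNetV2; the exact layers used by this model are an assumption):

```python
def trace_shapes(size=256):
    """Trace spatial resolution through an assumed encoder-decoder.

    A standard MobileNetV2 downsamples by 2 at five points, so a
    256 px input yields feature maps at 128, 64, 32, 16, and 8 px.
    A U-Net-style decoder upsamples back, concatenating the encoder
    map of matching size (the skip connection) at each step.
    """
    encoder = []
    s = size
    for _ in range(5):               # five stride-2 stages
        s //= 2
        encoder.append(s)            # e.g. [128, 64, 32, 16, 8]

    decoder = []
    s = encoder[-1]                  # decode from the deepest map
    for skip in reversed(encoder[:-1]):
        s *= 2                       # bilinear upsample x2
        assert s == skip             # skip connection sizes must align
        decoder.append(s)
    return encoder, decoder
```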

Training data

SALICON: 10,000 train + 5,000 val images from MS COCO 2014 with crowd-sourced human fixation maps.

Loss function

KL divergence (weight 1.0) + correlation coefficient (weight 0.5) + BCE (weight 0.1) · a standard combination in saliency research.
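The combined loss can be sketched in NumPy as follows. Note one assumption: the correlation term is written as (1 − CC), a common convention that makes perfect correlation contribute zero loss; the model's actual code is not shown here.

```python
import numpy as np

def saliency_loss(pred, gt, eps=1e-8):
    """Weighted sum of KL divergence, (1 - CC), and pixelwise BCE.

    pred and gt are saliency maps with values in [0, 1]. The weights
    (1.0, 0.5, 0.1) follow the stated formulation; using (1 - CC) as
    the correlation term is an assumption.
    """
    # KL divergence between the maps treated as probability distributions
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    kl = (g * np.log(eps + g / (p + eps))).sum()

    # Pearson correlation coefficient between the raw maps
    cc = np.corrcoef(pred.ravel(), gt.ravel())[0, 1]

    # Binary cross-entropy, averaged over pixels
    pc = np.clip(pred, eps, 1 - eps)
    bce = -(gt * np.log(pc) + (1 - gt) * np.log(1 - pc)).mean()

    return 1.0 * kl + 0.5 * (1 - cc) + 0.1 * bce
```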

Training

13 epochs on a GTX 1650 (4 GB), ~113 min total. Early stopping; mixed precision (AMP); encoder frozen for the first 5 epochs.
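The early-stopping logic is framework-agnostic and can be sketched in a few lines (the patience value of 3 is an illustrative assumption; the value actually used is not stated):

```python
class EarlyStopping:
    """Stop training once validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta    # minimum improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```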

Inference

Under 500 ms on GPU, under 2 s on CPU. MobileNetV2 keeps it fast enough for real-time interactive use.

Output

Single-channel saliency map in [0,1], post-processed into a jet-coloured heatmap with attention region analysis.
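The post-processing step can be sketched as follows. Both the piecewise-linear jet approximation and the zone thresholds (0.75 / 0.5 / 0.25) are illustrative assumptions, not the tool's actual values:

```python
import numpy as np

def jet(x):
    """Map values in [0, 1] to RGB with a simple jet-like colormap
    (piecewise-linear approximation of the classic jet ramp)."""
    r = np.clip(1.5 - np.abs(4 * x - 3), 0, 1)
    g = np.clip(1.5 - np.abs(4 * x - 2), 0, 1)
    b = np.clip(1.5 - np.abs(4 * x - 1), 0, 1)
    return np.stack([r, g, b], axis=-1)

def attention_zones(saliency):
    """Split a [0, 1] saliency map into focus tiers.

    Returns an int map: 3 = primary, 2 = secondary, 1 = tertiary,
    0 = everything else. Thresholds here are assumed for illustration.
    """
    zones = np.zeros(saliency.shape, dtype=int)
    zones[saliency >= 0.25] = 1
    zones[saliency >= 0.50] = 2
    zones[saliency >= 0.75] = 3
    return zones
```

The RGB output would typically be alpha-blended over the original image to produce the final overlay.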