Full of Stars

hackthebox

Task: HTB ML challenge gives 256 labeled 'star' cores and 131527 unlabeled 3D points; each point's class index is a byte of a progressive JPEG. Solution: semi-supervised label propagation via multi-source Dijkstra (geodesic) on an anisotropic kNN graph, tracing each curved filament from its single labeled core to rebuild the flag image.

numpy dijkstra ml semi_supervised label_propagation geodesic knn_graph manifold clustering progressive_jpeg 3d_point_cloud

anisotropic_knn_graphimage_from_cluster_labelsmanifold_geodesic_classificationmulti_source_dijkstra

$ ls tags/ techniques/

numpy dijkstra ml semi_supervised label_propagation geodesic knn_graph manifold clustering progressive_jpeg 3d_point_cloud

anisotropic_knn_graphimage_from_cluster_labelsmanifold_geodesic_classificationmulti_source_dijkstra

Full of Stars — HackTheBox

Description

You are assigned the important mission of locating and identifying the infamous space hacker. (HackTheBox ML challenge, medium difficulty.)

Files: challenge.zip (password: hackthebox) →

known_samples.npy.gz — 256 × 3 float64
data.npy.gz — 131527 × 3 float64
challenge.ipynb

The notebook defines the entire task. There are 256 classes; known_samples provides exactly ONE labeled "star" (core) per class. The solver must assign each of the 131527 data points a class label in [0, 256). The labels of the data points, in their given order, ARE the bytes of flag.jpg — a 512×512 progressive JPEG whose decoded image carries the flag as overlaid text.

data       = np.loadtxt("data.npy.gz")           # 131527 points in 3D
core_data  = np.loadtxt("known_samples.npy.gz")  # 256 points in 3D
all_data   = np.vstack((core_data, data))
class_count = len(core_data)                      # 256

out_labels = np.full((all_data.shape[0],), -1)    # <-- the solver must fill this
# ...
assert(np.all(np.logical_and(0 <= out_labels, out_labels < 256)))
file_data = bytearray(out_labels[256:].shape[0])
for ix, val in enumerate(out_labels[256:]):
    file_data[ix] = val
with open("flag.jpg", "wb") as outfile:
    outfile.write(bytes(file_data))

So the mapping is fixed by convention: core row i has label i, and a data point's label equals the JPEG byte value at its position. Classify every data point correctly → write the bytes → decode the JPEG → read the flag.

Analysis

File-format gotcha

Despite the .npy.gz extension, both files are plain text, not gzip-wrapped NumPy arrays. They must be read with np.loadtxt, NOT np.load:

data = np.loadtxt("data.npy.gz")          # correct
# data = np.load("data.npy.gz")           # fails / garbage

Geometry of the point cloud (verified empirically)

The cloud is a thin slab: X ∈ [-285, 289], Y ∈ [-1709, 1842] (very elongated), Z ∈ [-127, 117].
Each of the 256 classes is a globally curved filament that is only locally flat. Fitting each cluster's plane from the 30 nearest points to its core, only 17 % of all points lie within perp < 0.2 of ANY such plane (80 % are off every plane, median perpendicular distance 0.65). A single plane/Gaussian per cluster is therefore globally wrong.
Filaments are elongated along the Y axis; their local plane normals point along Z or X, never Y. They are stacked/overlapping in the Y slab and separated mostly in X–Z. (~96 coplanar pairs + 64 singletons among near-core planes.)
Data points are spatially shuffled (consecutive points ~700 apart), so their order is purely the JPEG byte order — no ordering shortcut exists.
Cluster sizes equal the JPEG byte histogram (byte 0x00 is the largest cluster, 1581 points), confirming the class index = byte value mapping.

Why the obvious methods fail

Each core is a single sample sitting somewhere on a long curved filament, not the centroid of a density blob. This breaks every distance/density method:

Method	Accuracy	Why it fails
Nearest-core (Euclidean 1-NN)	~60 %	the far end of a filament is closer to a neighbor's core than to its own
Anisotropic nearest-core (down-weight Y)	~76 %	helps, but still ignores the curve
Iterative QDA / GaussianMixture	~78 %	Gaussian blobs ≠ curved manifolds; renders an all-black JPEG
Ward / KMeans / SpectralClustering / DBSCAN	60–78 %	density/centroid assumptions wrong

A decisive negative result: seeding QDA from a 97 % geodesic solution and using exact rank-2 (flat-plane) covariances still flipped ~8 % of bytes and broke rendering. Lesson: the clusters are connected manifolds, not density blobs — parametric clustering fundamentally fails; connectivity is the correct inductive bias.

Solution

Treat the union of cores + data as a k-nearest-neighbour graph and propagate the 256 core labels along the manifold by multi-source Dijkstra (geodesic distance). Each filament is then labeled by following its own curve outward from its single labeled core, so a point is assigned to the core it can reach by the shortest path along the data, not by straight-line distance.

Two tweaks improve connectivity and cleanliness:

Anisotropic scaling S = [1.0, 0.4, 1.0] — down-weight the Y axis, because the filaments are elongated in Y but separated in X–Z. This keeps the kNN edges running along each filament instead of jumping between stacked ones.
A symmetric kNN graph with k = 12 connects sparse intra-filament gaps without bridging touching neighbours.

Reference solver:

#!/usr/bin/env python3
import numpy as np, heapq
from scipy.spatial import cKDTree

core = np.loadtxt("known_samples.npy.gz")   # 256 x 3 (one labeled star per class)
data = np.loadtxt("data.npy.gz")            # 131527 x 3 (unlabeled)
N, K = data.shape[0], 256
X = np.vstack((core, data)); M = N + K

# Clusters are elongated along Y -> down-weight Y so the kNN graph follows filaments.
S = np.array([1.0, 0.4, 1.0]); Xs = X * S
tree = cKDTree(Xs); kk = 12
dd, nb = tree.query(Xs, k=kk + 1)

# Multi-source Dijkstra (geodesic) seeded from the 256 labeled cores.
lab = np.full(M, -1); dist = np.full(M, np.inf); heap = []
for c in range(K):
    lab[c] = c; dist[c] = 0.0
    for j in range(1, kk + 1):
        w = dd[c, j]; nbr = int(nb[c, j])
        if w < dist[nbr]:
            dist[nbr] = w; heapq.heappush(heap, (w, nbr, c))
while heap:
    d, node, l = heapq.heappop(heap)
    if lab[node] != -1:
        continue
    lab[node] = l
    for j in range(1, kk + 1):
        x = int(nb[node, j]); nd = d + dd[node, j]
        if lab[x] == -1 and nd < dist[x]:
            dist[x] = nd; heapq.heappush(heap, (nd, x, l))

# Fill any unreached points by nearest already-labeled point.
while (lab == -1).any():
    rem = np.where(lab == -1)[0]
    lt = cKDTree(Xs[lab != -1]); lv = lab[lab != -1]
    _, m = lt.query(Xs[rem]); lab[rem] = lv[m]

open("flag.jpg", "wb").write(bytes(bytearray((lab[256:] % 256).astype(np.uint8))))

Decoding

The written bytes form a 512×512 progressive JPEG (SOF2). The header parses cleanly:

FFD8                 SOI
FFE0  JFIF v1.01     APP0
FFDB x2              two quantization tables (DQT)
FFC2  512x512 3-comp SOF2 (progressive)
FFC4 x2              two Huffman tables (DHT)
FFDD                 DRI (restart interval)
FFDA @ offset 241    first SOS (the DC scan, ~6 KB)
... 7 scans total ...

Decoded, the image is a blue seascape at dusk with the flag text overlaid across the middle band (y ≈ 300–360).

Practical detail — progressive JPEG is fragile. A few-percent byte-error rate (the geodesic reaches ~97–98 %) still renders a coherent but speckled picture, and the small overlaid text needs near-perfect bytes to be legible. Useful diagnostics during solving:

Truncating before the second SOS marker keeps only the DC scan and renders a clean low-resolution version — enough to confirm the seascape scene.
The flag text lives in the higher-frequency AC scans (the mid band), which need the cleanest bytes.
Residual errors concentrate exactly where two curved filaments cross or touch — those points are geometrically ambiguous and are the proven dominant error source. The anisotropic, symmetric geodesic above minimises them enough for the text to resolve.