$ cat writeup.md…
$ cat writeup.md…
hackthebox
Task: HTB ML challenge gives 256 labeled 'star' cores and 131527 unlabeled 3D points; each point's class index is a byte of a progressive JPEG. Solution: semi-supervised label propagation via multi-source Dijkstra (geodesic) on an anisotropic kNN graph, tracing each curved filament from its single labeled core to rebuild the flag image.
You are assigned the important mission of locating and identifying the infamous space hacker. (HackTheBox ML challenge, medium difficulty.)
Files: challenge.zip (password: hackthebox) →
known_samples.npy.gz — 256 × 3 float64data.npy.gz — 131527 × 3 float64challenge.ipynbThe notebook defines the entire task. There are 256 classes; known_samples
provides exactly ONE labeled "star" (core) per class. The solver must assign
each of the 131527 data points a class label in [0, 256). The labels of the
data points, in their given order, ARE the bytes of flag.jpg — a 512×512
progressive JPEG whose decoded image carries the flag as overlaid text.
data = np.loadtxt("data.npy.gz") # 131527 points in 3D core_data = np.loadtxt("known_samples.npy.gz") # 256 points in 3D all_data = np.vstack((core_data, data)) class_count = len(core_data) # 256 out_labels = np.full((all_data.shape[0],), -1) # <-- the solver must fill this # ... assert(np.all(np.logical_and(0 <= out_labels, out_labels < 256))) file_data = bytearray(out_labels[256:].shape[0]) for ix, val in enumerate(out_labels[256:]): file_data[ix] = val with open("flag.jpg", "wb") as outfile: outfile.write(bytes(file_data))
So the mapping is fixed by convention: core row i has label i, and a data point's label equals the JPEG byte value at its position. Classify every data point correctly → write the bytes → decode the JPEG → read the flag.
Despite the .npy.gz extension, both files are plain text, not gzip-wrapped
NumPy arrays. They must be read with np.loadtxt, NOT np.load:
data = np.loadtxt("data.npy.gz") # correct # data = np.load("data.npy.gz") # fails / garbage
X ∈ [-285, 289], Y ∈ [-1709, 1842] (very
elongated), Z ∈ [-127, 117].perp < 0.2 of ANY such plane (80 %
are off every plane, median perpendicular distance 0.65). A single
plane/Gaussian per cluster is therefore globally wrong.0x00 is the largest
cluster, 1581 points), confirming the class index = byte value mapping.Each core is a single sample sitting somewhere on a long curved filament, not the centroid of a density blob. This breaks every distance/density method:
| Method | Accuracy | Why it fails |
|---|---|---|
| Nearest-core (Euclidean 1-NN) | ~60 % | the far end of a filament is closer to a neighbor's core than to its own |
| Anisotropic nearest-core (down-weight Y) | ~76 % | helps, but still ignores the curve |
| Iterative QDA / GaussianMixture | ~78 % | Gaussian blobs ≠ curved manifolds; renders an all-black JPEG |
| Ward / KMeans / SpectralClustering / DBSCAN | 60–78 % | density/centroid assumptions wrong |
A decisive negative result: seeding QDA from a 97 % geodesic solution and using exact rank-2 (flat-plane) covariances still flipped ~8 % of bytes and broke rendering. Lesson: the clusters are connected manifolds, not density blobs — parametric clustering fundamentally fails; connectivity is the correct inductive bias.
Treat the union of cores + data as a k-nearest-neighbour graph and propagate the 256 core labels along the manifold by multi-source Dijkstra (geodesic distance). Each filament is then labeled by following its own curve outward from its single labeled core, so a point is assigned to the core it can reach by the shortest path along the data, not by straight-line distance.
Two tweaks improve connectivity and cleanliness:
S = [1.0, 0.4, 1.0] — down-weight the Y axis,
because the filaments are elongated in Y but separated in X–Z. This keeps the
kNN edges running along each filament instead of jumping between stacked ones.k = 12 connects sparse intra-filament gaps
without bridging touching neighbours.Reference solver:
#!/usr/bin/env python3 import numpy as np, heapq from scipy.spatial import cKDTree core = np.loadtxt("known_samples.npy.gz") # 256 x 3 (one labeled star per class) data = np.loadtxt("data.npy.gz") # 131527 x 3 (unlabeled) N, K = data.shape[0], 256 X = np.vstack((core, data)); M = N + K # Clusters are elongated along Y -> down-weight Y so the kNN graph follows filaments. S = np.array([1.0, 0.4, 1.0]); Xs = X * S tree = cKDTree(Xs); kk = 12 dd, nb = tree.query(Xs, k=kk + 1) # Multi-source Dijkstra (geodesic) seeded from the 256 labeled cores. lab = np.full(M, -1); dist = np.full(M, np.inf); heap = [] for c in range(K): lab[c] = c; dist[c] = 0.0 for j in range(1, kk + 1): w = dd[c, j]; nbr = int(nb[c, j]) if w < dist[nbr]: dist[nbr] = w; heapq.heappush(heap, (w, nbr, c)) while heap: d, node, l = heapq.heappop(heap) if lab[node] != -1: continue lab[node] = l for j in range(1, kk + 1): x = int(nb[node, j]); nd = d + dd[node, j] if lab[x] == -1 and nd < dist[x]: dist[x] = nd; heapq.heappush(heap, (nd, x, l)) # Fill any unreached points by nearest already-labeled point. while (lab == -1).any(): rem = np.where(lab == -1)[0] lt = cKDTree(Xs[lab != -1]); lv = lab[lab != -1] _, m = lt.query(Xs[rem]); lab[rem] = lv[m] open("flag.jpg", "wb").write(bytes(bytearray((lab[256:] % 256).astype(np.uint8))))
The written bytes form a 512×512 progressive JPEG (SOF2). The header parses cleanly:
FFD8 SOI
FFE0 JFIF v1.01 APP0
FFDB x2 two quantization tables (DQT)
FFC2 512x512 3-comp SOF2 (progressive)
FFC4 x2 two Huffman tables (DHT)
FFDD DRI (restart interval)
FFDA @ offset 241 first SOS (the DC scan, ~6 KB)
... 7 scans total ...
Decoded, the image is a blue seascape at dusk with the flag text overlaid across the middle band (y ≈ 300–360).
Practical detail — progressive JPEG is fragile. A few-percent byte-error rate (the geodesic reaches ~97–98 %) still renders a coherent but speckled picture, and the small overlaid text needs near-perfect bytes to be legible. Useful diagnostics during solving:
$ cat /etc/motd
Liked this one?
Pro unlocks every writeup, every flag, and API access. $9/mo.
$ cat pricing.md$ grep --similar