K-Means for AutoCAD Solar Homerun Routing: When It Works, When It Fails, and What to Use Instead

"K-Means would inherently assume that the clusters are convex in shape... the regions where each centroid is the closest form convex shapes." That sentence is the whole story. K-means partitions space by drawing Voronoi cells — the polygonal regions a clustering algorithm carves the plane into — and those cells are convex by construction. If your homerun groups need to wrap around an HVAC unit, k-means literally cannot draw the right boundary. It's a math constraint, not a tuning issue. You can run it a thousand times with different seeds and it will return confident-looking nonsense every time.

Where k-means earns its keep

On a clean rectangular array with no obstructions and no string size cap, k-means will give you something visually plausible. That's why it shows up in tutorials. Two lines of scikit-learn, five colored clusters on a scatter plot, looks right. The trap is that k-means gives you something visually plausible every single time, even when the underlying geometry makes the answer physically impossible. There's no error, no warning, no asterisk. It just draws clean-looking groups and hands them back with a straight face.

We've watched solar engineers reach for k-means and hit the same wall every time. The failure isn't always visible on the first project — it shows up on the second one, with the parapet, or the L-shaped roof.

K-means on a clean array

Here's what k-means looks like when the geometry cooperates. Sixty panels, flat 6×10 grid, five MPPT inputs.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate a clean 6-row × 10-column panel grid
# Each panel is 1 unit wide, 2 units tall, 0.1 unit gap
cols, rows = 10, 6
panel_w, panel_h, gap = 1.0, 2.0, 0.1

panels = []
for r in range(rows):
    for c in range(cols):
        x = c * (panel_w + gap) + panel_w / 2
        y = r * (panel_h + gap) + panel_h / 2
        panels.append([x, y])

panels = np.array(panels)  # shape (60, 2)

# Group into 5 MPPT clusters — one per inverter input
k = 5
km = KMeans(n_clusters=k, random_state=42, n_init="auto")
labels = km.fit_predict(panels)

# Plot
colors = ["#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4"]
fig, ax = plt.subplots(figsize=(10, 5))
for i, (x, y) in enumerate(panels):
    ax.add_patch(plt.Rectangle(
        (x - panel_w/2, y - panel_h/2), panel_w, panel_h,
        color=colors[labels[i]], alpha=0.7
    ))
for c_idx, center in enumerate(km.cluster_centers_):
    ax.plot(*center, "k*", markersize=12)

ax.set_xlim(-0.5, cols * (panel_w + gap))
ax.set_ylim(-0.5, rows * (panel_h + gap))
ax.set_aspect("equal")
ax.set_title("K-Means Homerun Groups — Clean 6×10 Array, k=5")
plt.tight_layout()
plt.savefig("kmeans_clean_array.png", dpi=150)

This works. The five groups look roughly equal, the boundaries make visual sense, and if you drew the homeruns by hand you'd probably come up with something similar.

The trouble starts the moment that array has a skylight gap in the middle, or you need each cluster to be exactly 12 panels because that's your string size.

The four ways k-means quietly lies to you

Scikit-learn's own documentation has a dedicated demo page for k-means failure modes. Four cases, side by side. Here's each one translated for a roof.

Failure 1: You have to pick `k`, and there's no right answer from the data

K-means needs you to tell it how many clusters to create. The standard advice is to use an elbow plot: run k-means for k=1 through k=10, plot the inertia (total within-cluster distance), and look for an elbow where the curve flattens. It sounds reasonable.

Here's the problem with that advice: "As long as we keep increasing the number of clusters k, the inertia will always decrease because the points will be closer to their centroids." The inertia never stops decreasing. There is no mathematical mechanism that makes it flatten at the "right" k. People stare at noisy curves and see an elbow because they want to see one.

Scikit-learn puts it more carefully: "in a real setting there is no uniquely defined true number of clusters. An appropriate number of clusters has to be decided from data-based criteria and knowledge of the intended goal."

For solar, none of this matters anyway. The right k isn't a clustering question at all. It's the number of MPPT inputs on your inverter — a constraint you bring to the problem from the equipment schedule. The data has no opinion. The inverter has one MPPT or four, and that's your k. The elbow method applied here isn't just noisy; it's answering a question that was never up for debate.

Failure 2: Anisotropic blobs (the long building edge problem)

Anisotropic means non-uniform in direction — stretched, not round. Think strings arranged along the long edge of a commercial rooftop, 80 panels wide and 4 panels deep. Every cluster in that array is a wide horizontal bar, not a circular blob.

K-means assumes the opposite. From the docs: "k-means is more appropriate for clusters that are isotropic and normally distributed (i.e. spherical gaussians)." Isotropic means the same in every direction — like a perfect blob, not stretched. A 4×80 array is about as far from that as solar gets.

When clusters are stretched along an axis, k-means still produces k groups — they just won't follow the array geometry. Diagonal boundaries cut across rows instead of vertical divisions that an electrician would actually run conduit along.

Failure 3: Unequal variance (the dense corner problem)

When one region of your array is densely packed and another is sparse — say, a full field on one side of the roof and a partial fill on the other — k-means biases toward equal-area clusters. The dense region gets overcrowded into too few groups; the sparse region gets more groups than it warrants.

There's no setting to fix this. The Euclidean distance objective that k-means minimizes doesn't know about panel count per cluster. It only knows about centroid distance.

Failure 4: Non-convex shapes (the L-shaped roof problem)

This is the ceiling. An L-shaped roof has two wings. The logical homerun grouping follows the wings — all panels in the east wing on strings running to one combiner, all panels in the west wing on strings running to another. K-means cannot learn that partition. Its Voronoi cells are convex by construction. An L-shape requires a non-convex boundary, and Voronoi boundaries are always straight-edged.

Here's a minimal demonstration:

import numpy as np
from sklearn.cluster import KMeans

# L-shaped roof: east wing + south wing
east_wing = np.column_stack([
    np.random.uniform(10, 20, 30),  # x: columns 10–20
    np.random.uniform(0, 10, 30),   # y: rows 0–10
])
south_wing = np.column_stack([
    np.random.uniform(0, 10, 30),   # x: columns 0–10
    np.random.uniform(0, 5, 30),    # y: rows 0–5
])
panels = np.vstack([east_wing, south_wing])
true_labels = np.array([0] * 30 + [1] * 30)

km = KMeans(n_clusters=2, random_state=0, n_init="auto")
predicted = km.fit_predict(panels)

# Check: how many panels got misclassified?
# (flip labels if needed — k-means label order is arbitrary)
match_a = np.mean(predicted == true_labels)
match_b = np.mean(predicted == (1 - true_labels))
accuracy = max(match_a, match_b)
print(f"Panels correctly grouped: {accuracy:.0%}")
# Output on a typical run: 60–75% — a quarter to a third of panels
# end up on the wrong homerun before you touch the wire

That 60–75% accuracy isn't a fluke from a bad seed. It's the algorithm doing the only thing it can: drawing a straight line through your L. Panels from the east wing end up in the south wing's cluster, every run.

The 20% bogus rate — on data k-means is good at

A 2016 PLOS ONE paper measured how often k-means produces visibly wrong clusterings on textbook Gaussian blob data — the kind it's designed for. The result: "With experimental datasets, the R implementation of kmeans generates bogus clusters about 20% of the time." One in five runs, on clean spherical blobs with no constraints. For solar arrays, where every k-means assumption is violated by default, that fraction is a floor, not a ceiling.

The first upgrade: k-means-constrained

When solar engineers hit the "equal cluster size" wall, the natural next move is to constrain k-means to produce clusters of at most max_panels points. That library exists. It's called k-means-constrained and it's on PyPI.

from k_means_constrained import KMeansConstrained
import numpy as np

# 60 panels, 5 MPPT inputs, each string must be exactly 12 panels
panels = np.array([...])  # your (60, 2) array of panel centroids

clf = KMeansConstrained(
    n_clusters=5,
    size_min=10,   # at least 10 panels per MPPT group
    size_max=14,   # at most 14 panels per MPPT group
    random_state=42,
)
labels = clf.fit_predict(panels)

The README states: "K-means clustering implementation whereby a minimum and/or maximum size for each cluster can be specified." That's exactly the solar string sizing constraint.

Here's the part that doesn't appear on page one of any clustering tutorial: this library is not k-means with extra rules. Under the hood, it's a completely different algorithm. From the README: the assignment step is formulated as a "Minimum Cost Flow (MCF) linear network optimisation problem" solved via "Google's Operations Research tools's SimpleMinCostFlow".

Min-cost flow is a way to push "units" through a network at minimum total cost — used for assignment problems. It has nothing to do with centroid geometry. The k-means-shaped API is window dressing. The moment you said "each string must have exactly N panels," you stopped doing clustering and started doing combinatorial optimization.

That shift has a cost. From the README: "k-means-constrained is a more complex algorithm than vanilla k-means and therefore will take longer to execute and has worse scaling characteristics." Time complexity jumps from O(nc) for vanilla k-means to O((n³c + n²c² + nc³) log(n+c)). For 50–500 panels it's still fast. For a 2,000-panel ground-mount it matters.

k-means-constrained fixes the capacity constraint but not the convexity problem. The L-shaped roof still breaks it — you're adding string size limits on top of an algorithm that still cannot draw non-convex regions.

What the problem is actually called

In 2023, a developer posted NetworkX Discussion #7491 asking for help with exactly this:

"a minimum spanning tree with the following conditions: all branches connect to a root vertex, each branch contains (at most) x number of vertices, no edges overlap/cross"

That's the solar homerun problem in three lines. Root vertex = inverter. Each branch = one MPPT run. At most x vertices = string size cap. No crossing edges = no conduit crossing conduit on a roof.

The NetworkX maintainer's response: there's no built-in function for this. They pointed at the Cable Trench Problem, an open research area in combinatorial optimization. The formal academic name is the Capacitated Minimum Spanning Tree (CMST) — a tree of cables where each branch can hold at most N panels. Both terms are NP-hard (no known algorithm finishes fast as the input grows large enough). Neither has a production-ready Python library that ships today.

Search terms for anyone who wants to follow this into the academic literature: "Cable Trench Problem," "Capacitated Minimum Spanning Tree," "Steiner tree problem in graphs." The stringing problem has its own version of this combinatorial explosion — once grouping is solved, ordering within each group is a separate fight.

The iceberg below the clustering step

Even if you solve homerun grouping — correctly, with all four k-means failure modes handled — you're looking at the surface of the production problem. What's underneath: cable schedule generation, conduit fill rules (NEC 310 and 358), max current per fuse size, pull box capacity at junction points, minimum bend radius for your conduit diameter, ampacity derating for ambient temperature and conduit bundling, NEC 690.31 compliance for PV source and output circuits, jumper handling between module rows, and voltage drop limits across long horizontal runs.

Each feeds back into the routing decision. A homerun that looks short on the roof can exceed voltage drop limits if it traverses four rows of series-wired modules. A grouping that respects string size can fail conduit fill if the homerun path runs through a 1" EMT already carrying three other circuits.

If your roof is rectangular and your inverter has one MPPT, the k-means-constrained code above will get you there. If you're routing 200kW commercial systems with multiple inverters, obstructed rooftops, and a permit reviewer who checks conduit fill on every run, see how Branch does it — or see how it compares to other tools that try to automate this step in the Branch vs. PVCAD comparison.

Related reading: how solar stringing automation works covers the full pipeline from module layout to completed CDs. For the AutoCAD side of this — once you have the routing solved and need to draw it — the AutoCAD solar plugin guide picks up where this post ends. The best AutoCAD solar stringing tools of 2026 benchmarks the commercial options if you'd rather not build it at all.

K-Means for AutoCAD Solar Homerun Routing: When It Works, When It Fails, and What to Use Instead

K-Means for AutoCAD Solar Homerun Routing: When It Works, When It Fails, and What to Use Instead

Where k-means earns its keep

K-means on a clean array

The four ways k-means quietly lies to you

Failure 1: You have to pick `k`, and there's no right answer from the data

Failure 2: Anisotropic blobs (the long building edge problem)

Failure 3: Unequal variance (the dense corner problem)

Failure 4: Non-convex shapes (the L-shaped roof problem)

The 20% bogus rate — on data k-means is good at

The first upgrade: k-means-constrained

What the problem is actually called

The iceberg below the clustering step

Related Posts

Stringing Solar Panels for AutoCAD with Python: A 200-Line Solver You Can Run Today

Parsing SolarEdge Designer PDFs for AutoCAD with Python and pdfplumber

Shipping Python Scripts for Solar: From VS Code to a Coworker's Desktop

K-Means for AutoCAD Solar Homerun Routing: When It Works, When It Fails, and What to Use Instead

Where k-means earns its keep

K-means on a clean array

The four ways k-means quietly lies to you

Failure 1: You have to pick k, and there's no right answer from the data

Failure 2: Anisotropic blobs (the long building edge problem)

Failure 3: Unequal variance (the dense corner problem)

Failure 4: Non-convex shapes (the L-shaped roof problem)

The 20% bogus rate — on data k-means is good at

The first upgrade: k-means-constrained

What the problem is actually called

The iceberg below the clustering step

Related Posts

Stringing Solar Panels for AutoCAD with Python: A 200-Line Solver You Can Run Today

Parsing SolarEdge Designer PDFs for AutoCAD with Python and pdfplumber

Shipping Python Scripts for Solar: From VS Code to a Coworker's Desktop

Failure 1: You have to pick `k`, and there's no right answer from the data