K-Means Clustering Explained for Solar Engineers: How It Groups Panels (and Where It Fails)
A no-math explanation of how k-means clusters solar panels into groups for AutoCAD homerun routing — and why it produces wrong answers on real rooftops with obstructions.
You searched "how to group solar panels into homerun clusters" and the first result was a tutorial recommending k-means. Two lines of Python, scikit-learn, done. You ran it on a sample rooftop and the output looked clean. Then you ran it on a real project — the one with the parapet wall down the middle and the HVAC unit on the south end — and the result was nonsense. Some panels in the wrong group, boundaries that no electrician would actually wire, and no error message telling you anything went wrong. K-means did exactly what it was designed to do. The problem is what it was designed to do isn't what you needed.
By the end of this post you'll understand what k-means is doing, what kinds of arrays it gets right, and exactly why it falls apart on real rooftops. No math. Plan for 25 minutes.
What clustering even is
Clustering is the general name for any algorithm that takes a pile of points and divides them into groups. The points might be customers by purchase history, images by visual similarity, or solar panels by position on a roof. The grouping criterion is always the same: things in the same group should be similar to each other, and different from things in other groups.
For solar panels, "similar" almost always means "physically close on the roof." Closer panels mean shorter wire runs, which means less copper and a cleaner install. The clustering algorithm doesn't know anything about electricity — it just groups points in space. The connection to wire length is your interpretation.
Clustering algorithms come in several flavors. K-means is the best-known and the simplest, which is why every introductory tutorial reaches for it. There are dozens of others — DBSCAN, hierarchical clustering, spectral clustering, mean shift — each with different assumptions about what "good" grouping means. They produce different results on the same data, and each one has geometries where it excels and geometries where it quietly fails.
The reason k-means dominates beginner tutorials is the API. In Python with scikit-learn:
```python
from sklearn.cluster import KMeans
labels = KMeans(n_clusters=5).fit_predict(panel_coordinates)
```
Two lines. You hand it a list of panel positions and the number of groups you want, and it tells you which group each panel belongs to. Nothing is more accessible.
The problem is that the algorithm has hidden assumptions baked into those two lines. When those assumptions match your data, k-means produces results that look correct. When they don't, k-means still produces results — they just don't make sense for your roof. And it never tells you which case you're in.
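To make that concrete, here is the same two-liner as a self-contained sketch. The grid, the spacings, and k=5 are all invented for illustration — in a real workflow `panel_coordinates` would come from your AutoCAD layout.

```python
# Self-contained version of the two-liner above. The 6x10 grid and
# 2.0 m x 1.2 m spacings are illustrative stand-ins for real layout data.
import numpy as np
from sklearn.cluster import KMeans

panel_coordinates = np.array(
    [(col * 2.0, row * 1.2) for row in range(6) for col in range(10)]
)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(panel_coordinates)
# labels holds one integer (0-4) per panel; panels sharing a label
# share a homerun group.
```

`n_init=10` and `random_state=0` just make repeated runs reproducible; they don't change what the algorithm does.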
The "k" in k-means: you have to pick a number
The "k" in k-means is the number of groups you want. You have to tell the algorithm before it runs. K-means cannot figure out the right number on its own.
For solar stringing, this is usually straightforward. The number of groups is determined by your inverter's MPPT input count. If you have a 4-MPPT inverter, k = 4. The inverter makes the decision, not the data.
For other problems — customer segmentation, image compression — picking k is its own research project. There's a method called the elbow method that's supposed to help you find the right number. It involves running k-means multiple times with different k values and plotting how "tight" the groups get as k increases. In practice, the plot is usually noisy and the "elbow" is wherever you wanted it to be.
For solar stringing, none of that matters. K equals the number of MPPT inputs on your inverter. Done.
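For completeness, here is what an elbow check looks like in code — a sketch on a made-up panel grid, using scikit-learn's `inertia_` (the within-cluster sum of squared distances) as the "tightness" measure:

```python
# Elbow-method sketch: run k-means at several k values and record the
# inertia (within-cluster sum of squared distances). The grid is made up.
import numpy as np
from sklearn.cluster import KMeans

panels = np.array([(c * 2.0, r * 1.2) for r in range(6) for c in range(10)])

inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(panels).inertia_
    for k in range(2, 9)
}
# Inertia keeps shrinking as k grows; the "elbow" is wherever the drop
# flattens out -- which, as noted above, is often in the eye of the beholder.
```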
How k-means actually runs
Imagine 60 panels on a commercial rooftop — a clean 6-row by 10-column array — and you want to split them into 5 groups, one per MPPT input. Here's exactly what k-means does, step by step.
Step 1: Drop 5 random "centroids" on the roof. A centroid is a single point that represents the center of a group. K-means starts by picking 5 random spots somewhere on the roof. They don't have to land on any actual panel. Just 5 random X/Y coordinates inside the bounding box of the array.
Picture a scatter plot of 60 colored dots (your panels) with 5 black X marks dropped randomly across them. That's the starting state.
Step 2: Assign every panel to its nearest centroid. For each of the 60 panels, find which of the 5 black X marks is closest and assign that panel to that group. You now have 5 groups, each containing some panels. The sizes will be uneven — this is just the first pass.
Step 3: Move each centroid to the average position of its group. Take all the panels in group 1, average their X coordinates, average their Y coordinates, and move centroid 1 to that average position. Do the same for groups 2 through 5. Each centroid has now "drifted" toward the middle of its group.
Step 4: Reassign panels based on the new centroid positions. Because the centroids moved, some panels are now closer to a different centroid than they were before. Reassign them to the nearest centroid.
Step 5: Repeat steps 3 and 4 until nothing changes. Keep averaging and reassigning. After several iterations, the centroids stop moving and the panel assignments stop changing. K-means has "converged" — it found a stable configuration.
That's the whole algorithm. No neural networks, no training data, no machine learning in the technical sense. Just repeated averaging until things settle.
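The five steps translate almost line-for-line into code. This is a toy NumPy sketch of the loop, not scikit-learn's implementation (which adds smarter k-means++ seeding and multiple restarts), and the panel grid at the bottom is invented:

```python
import numpy as np

def toy_kmeans(points, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Step 1: drop k random centroids inside the array's bounding box.
    centroids = rng.uniform(lo, hi, size=(k, points.shape[1]))
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every panel to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: converged once no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the average position of its group.
        for j in range(k):
            if np.any(labels == j):  # a random centroid can end up owning nothing
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

panels = np.array([(c * 2.0, r * 1.2) for r in range(6) for c in range(10)])
labels, centroids = toy_kmeans(panels, k=5)
```

Repeated averaging and reassignment is the entire mechanism — which is also why the final assignment is always "nearest centroid wins."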
What this process actually implies. Each centroid ends up "owning" a region of the roof — the area where it's the closest centroid out of the five. Those regions always form polygons with straight edges. In geometry, this kind of partition is called a Voronoi diagram: space divided by straight-line boundaries, each region containing exactly the points closest to one centroid. The boundaries between your homerun groups are always straight lines.
That last sentence is the source of every k-means failure mode on real rooftops.
Where k-means works
On a clean rectangular array with no obstructions, k-means produces a result that looks right. Picture that 6×10 array split into 5 vertical strips of 12 panels each. That's what k-means produces on clean rectangular data, and that's roughly what an experienced electrician would lay out by hand. The groups are equal-sized, the boundaries are sensible, the wire runs are short.
If your array fits all of these descriptions:
- Rectangular grid, single orientation
- No obstructions cutting through the middle
- Single inverter, fixed MPPT count
- No L-shape, U-shape, or T-shape
...then k-means will give you something close to optimal. On that narrow set of conditions, two lines of scikit-learn actually get you where you need to go.
The trap is how rarely that description fits a real commercial project. A rooftop with an HVAC curb in the middle: broken from the start. An L-shaped building footprint: broken from the start. A ground-mount with cleared maintenance lanes between row groups: mostly fine, but k-means doesn't understand the row structure and will sometimes cut across rows instead of following them.
Here's the pattern you'll notice: the arrays where k-means works cleanly are the arrays where you barely needed an algorithm. The layouts that are genuinely hard to partition — the ones with obstructions, irregular shapes, or physical constraints on where conduit can run — are exactly the layouts where k-means fails quietly and confidently.
Where k-means falls apart
There are four ways k-means produces wrong answers on real solar arrays. Each one is the algorithm doing exactly what it was designed to do, on data that doesn't fit its assumptions.
Failure 1: You have to pick k, and the algorithm has no opinion. Already covered above. For solar this is solvable — your inverter spec tells you the answer. But it's worth naming: there is no k-means setting that says "figure out the natural groupings." The algorithm needs a number to start. If you hand it the wrong number, it will produce k groups regardless, and the result will look plausible anyway.
Failure 2: Long, narrow arrays. K-means assumes groups are roughly round — close to circular, equally spread in every direction. A commercial rooftop array that's 80 panels wide and 4 panels deep violates that assumption completely. K-means tries to draw round-ish groups, which means it cuts diagonally across the long dimension instead of making vertical divisions. The result is boundaries that cross row lines — not where an electrician would run conduit.
Picture a heat map of that 4×80 array with k-means groups drawn over it. Instead of 5 vertical columns of 64 panels each, you get 5 diagonal wedges that each contain panels from all 4 rows. The boundaries are straight lines, just oriented the wrong way.
Failure 3: Dense and sparse regions. K-means biases toward equal-area groups, not equal-panel-count groups. If one section of your roof has densely packed panels and another has a sparse partial fill — common on rooftops with setback requirements or around dormers — the algorithm over-groups the sparse region and under-groups the dense one. The dense side ends up with too few homerun groups (overcrowded strings); the sparse side ends up with too many. There's no parameter you can adjust to fix this. The distance objective k-means minimizes doesn't know what a panel count is.
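This failure is easy to reproduce. The geometry below is invented — a dense block next to a sparse partial fill — but the skew in group sizes is exactly the behavior described above:

```python
# Density-failure sketch: 96 tightly packed panels beside a sparse
# partial fill of 24, clustered into k=6 groups. A balanced split
# would be 20 panels per group. All coordinates are invented.
import numpy as np
from sklearn.cluster import KMeans

dense = [(c * 1.0, r * 1.0) for r in range(8) for c in range(12)]         # 96 panels, 1 m pitch
sparse = [(30.0 + c * 4.0, r * 4.0) for r in range(4) for c in range(6)]  # 24 panels, 4 m pitch
panels = np.array(dense + sparse)

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(panels)
counts = np.bincount(labels, minlength=6)
# The spread-out region soaks up centroids, so group sizes skew hard
# away from the balanced 20-per-group split.
print(sorted(counts.tolist()))
```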
Failure 4: L-shapes, U-shapes, and obstructions. This is the failure mode that makes k-means the wrong tool for commercial solar. Picture an L-shaped roof — an east wing and a south wing meeting at the corner. The correct homerun partition follows the physical geometry: panels on the east wing routed to one combiner location, panels on the south wing routed to another. The boundary between the two groups wraps around the inside corner of the L.
K-means cannot draw that boundary. Its Voronoi partitions are convex by construction, meaning the boundary between any two groups is always a straight line. An L-shape requires a boundary that bends 90 degrees. No k-means configuration — no seed, no initialization strategy, no number of restarts — can represent that answer. It will always draw a diagonal through your L, putting east-wing panels in the south-wing group and vice versa.
The same problem appears with any non-convex array shape: rooftops with cutouts for skylights or equipment, buildings with recessed sections, ground-mounts with excluded zones. The moment the correct grouping boundary needs to do anything other than go in a straight line, k-means cannot find it. That's not a bug in scikit-learn. It's a property of the geometry.
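You can verify the straight-boundary property yourself. The L-shaped layout below is invented; the point is that scikit-learn's final assignment is exactly "nearest centroid wins," which is what forces every boundary between two groups to be a straight perpendicular bisector:

```python
# L-shape sketch: a 4-panel-wide east wing meeting a 4-panel-deep
# south wing at the corner. All dimensions are invented.
import numpy as np
from sklearn.cluster import KMeans

east = [(c * 1.0, r * 1.0) for r in range(12) for c in range(4)]      # 48 panels
south = [(c * 1.0, r * 1.0) for r in range(4) for c in range(4, 14)]  # 40 panels
panels = np.array(east + south)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(panels)

# Re-derive the labels from raw point-to-centroid distances: they match
# exactly. Assignment is purely "nearest centroid", so the boundary
# between the two groups is a straight line -- it cannot bend around
# the inside corner of the L.
dists = np.linalg.norm(panels[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
assert (km.labels_ == dists.argmin(axis=1)).all()
```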
What to use instead
When k-means fails on a real project, the upgrade path for solar stringing looks like this:
- k-means-constrained — Adds minimum and maximum cluster sizes to k-means. Fixes the "overcrowded string" problem by enforcing that each group stays within a panel count range. Doesn't fix the L-shape problem — you're still drawing straight-line boundaries, just with a size cap. Available as a pip-installable Python library.
- Capacitated minimum spanning tree algorithms — A formal approach to the problem of grouping points into fixed-size branches from a root (your inverter location). Handles non-convex shapes better than k-means because it reasons about connectivity along a tree structure, not distance from a centroid. Slower on large arrays, and there's no easy one-liner install.
- Mixed integer linear programming — The exact solution. You express the grouping as an optimization problem with hard constraints — panel count limits, conduit path costs, physical adjacency requirements — and a solver finds the best answer. Slow on arrays with more than a few hundred panels. Used by commercial tools for a final-pass refinement after faster heuristics narrow the search space.
- Reinforcement learning — A newer approach where the algorithm learns from examples of human-drafted projects instead of optimizing a mathematical objective directly. Branch uses this. It handles irregular shapes and physical constraints because it learned from real engineers who already dealt with them.
The deep-dive post covers each of these in detail with working code: K-Means for AutoCAD Solar Homerun Routing: When It Works, When It Fails, and What to Use Instead.
Where to go next
- K-Means for AutoCAD Solar Homerun Routing: When It Works, When It Fails, and What to Use Instead — the production deep-dive. Real code demonstrating each failure mode, the k-means-constrained upgrade, and the full picture of what "solving the homerun problem" actually requires. (~30 minutes)
- What is an Auto-Stringer? The Algorithm Behind Every Solar Stringing Plugin — broader context on the full stringing problem, from module layout to completed string plan. Clustering is one step in a longer pipeline. (~20 minutes)
- Stringing Solar Panels for AutoCAD with Python — a working 200-line implementation. The clustering step is there, and so is the scaffolding around it that makes it usable on real projects. (~45 minutes)
Branch handles all of this in AutoCAD without requiring you to evaluate clustering algorithms or maintain Python dependencies. It works on the arrays k-means breaks — L-shapes, obstructed rooftops, multi-inverter layouts — and outputs completed stringing directly into your drawing. See how it works at /product.