Parsing SolarEdge Designer PDFs for AutoCAD with Python and pdfplumber

The maintainer of pdfplumber admitted in a public GitHub issue that rotated pages "basically don't work." Those are jsvine's own words in issue #848. SolarEdge Designer report PDFs contain rotated elements. So does that mean you're stuck?

Not quite. But it means the naive approach will waste your afternoon.

We've been parsing SolarEdge Designer PDFs in production for two years. We've hit every wall in this post. Here's the cheat sheet we wish someone had handed us.

What you're actually trying to extract

A SolarEdge Designer report has five things worth extracting programmatically:

Panel positions — X/Y coordinates on the layout page, which map to real-world module locations
String assignments — which panel belongs to which string number, on which MPPT input
Inverter mapping — which strings connect to which inverter (critical for multi-inverter projects)
Optimizer placements — optimizer serial ranges are in the equipment table
Cable lengths — circuit-level distances from the string summary table

You won't get all five from a single extraction pass. Layout data (panel positions, string assignments) lives on the visual layout page. Cable and equipment data lives in text tables on separate pages. Plan for two extraction strategies from the start.
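
One way to keep the two passes honest is to give each its own result container and check at the end which of the five targets came back empty. This is only a sketch: the class and field names (`LayoutPass`, `TablePass`, `ReportData`) are hypothetical and not part of pdfplumber or any SolarEdge format.

```python
from dataclasses import dataclass, field


@dataclass
class LayoutPass:
    # Filled from the visual layout page (hypothetical field names)
    panel_positions: list = field(default_factory=list)
    string_assignments: dict = field(default_factory=dict)


@dataclass
class TablePass:
    # Filled from the text tables on separate pages (hypothetical field names)
    inverter_mapping: dict = field(default_factory=dict)
    optimizer_placements: list = field(default_factory=list)
    cable_lengths: dict = field(default_factory=dict)


@dataclass
class ReportData:
    layout: LayoutPass = field(default_factory=LayoutPass)
    tables: TablePass = field(default_factory=TablePass)

    def missing(self) -> list:
        """Return the names of the extraction targets that are still empty."""
        targets = {
            "panel_positions": self.layout.panel_positions,
            "string_assignments": self.layout.string_assignments,
            "inverter_mapping": self.tables.inverter_mapping,
            "optimizer_placements": self.tables.optimizer_placements,
            "cable_lengths": self.tables.cable_lengths,
        }
        return [name for name, value in targets.items() if not value]
```

Calling `missing()` after both passes gives you a quick list of what still needs attention before anything reaches a drawing.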

Setup

pip install "pdfplumber>=0.10.0"

Use 0.9.0 or later; the install line above sets the floor at 0.10.0. Earlier versions have a ligature bug: "Office" comes back as "Oﬃce" (with a single U+FB03 glyph in place of "ffi"). GitHub issue #598 has the full story. It has been fixed since 0.9.0, but CI pipelines pinned to old versions still get burned.
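
If you can't control the installed version everywhere, a defensive cleanup step is cheap. The sketch below uses Unicode NFKC normalization from the standard library to expand ligature glyphs back into plain letters; `normalize_ligatures` is our own hypothetical helper name. Be aware that NFKC also folds other compatibility characters (superscripts, non-breaking spaces), so apply it only where that is acceptable.

```python
import unicodedata


def normalize_ligatures(text: str) -> str:
    """Expand ligature glyphs such as U+FB03 into their plain-letter form."""
    return unicodedata.normalize("NFKC", text)
```

Run it on every extracted string before comparing against expected labels such as inverter model names.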

import pdfplumber

with pdfplumber.open("solaredge_report.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        print(f"Page {i}: {page.width:.1f} x {page.height:.1f} pts, rotation={page.rotation}")

Run this first. A 792 x 612 pt page is landscape letter; 612 x 792 is portrait. If the report has been through a print driver, the dimensions can drift. page.width and page.height are your ground truth.
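
Because print drivers can nudge the dimensions, an exact comparison against 792 and 612 is brittle. Here is a minimal sketch of a tolerant check; the function name and the 2 pt tolerance are assumptions to tune against your own reports.

```python
def classify_letter_page(width: float, height: float, tol: float = 2.0) -> str:
    """Classify a page as landscape or portrait US letter, within tol points."""

    def close(a: float, b: float) -> bool:
        return abs(a - b) <= tol

    if close(width, 792) and close(height, 612):
        return "landscape-letter"
    if close(width, 612) and close(height, 792):
        return "portrait-letter"
    return "other"
```

Feed it `page.width` and `page.height` from the loop above and log anything that comes back as `"other"`.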

Pitfall 1: The Y-axis flip will bite you on day one

In the native PDF spec, (0, 0) is at the bottom-left of the page and Y increases upward. pdfplumber silently flips this: origin is top-left, Y increases downward.

The consequence: the bottom field on every word record is the distance of that element's bottom edge from the top of the page, not from the bottom. GitHub issue #1181 has the full confusion documented.

# pdfplumber fields:
# x0   = distance from LEFT edge of page
# top  = distance from TOP of page (NOT from bottom)
# bottom = also from TOP (just further down)

with pdfplumber.open("solaredge_report.pdf") as pdf:
    layout_page = pdf.pages[1]  # adjust for your report
    for w in layout_page.extract_words()[:5]:
        print(f"'{w['text']}' x0={w['x0']:.1f}, top={w['top']:.1f}")

To convert to AutoCAD Y (bottom-left origin):

pdf_native_y = layout_page.height - w["top"]  # top edge of the word, measured up from the page bottom

Get this wrong and your entire array is mirrored vertically in the drawing.
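
It helps to do the flip in one place instead of scattering `height - top` through the code. The helper below is a sketch that works on any dict shaped like a pdfplumber word record (`x0`, `x1`, `top`, `bottom`); the function name is ours.

```python
def word_to_bottom_left(word: dict, page_height: float) -> tuple:
    """Return (x0, y0, x1, y1) with Y measured up from the bottom of the page."""
    y0 = page_height - word["bottom"]  # lower edge in bottom-left coordinates
    y1 = page_height - word["top"]     # upper edge in bottom-left coordinates
    return (word["x0"], y0, word["x1"], y1)
```

Note that the word's `bottom` becomes the smaller Y value after the flip, which is the part that trips people up.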

Pitfall 2: `extract_table()` returns rows full of `None` — invisible blue rectangles

Before you call extract_table() on the string assignment or equipment table pages, know this: pdfplumber returns None for cells whose column positions overlap an invisible non-stroking rectangle. The rectangles don't render visually, but pdfminer parses them as objects and the table-finder uses them as phantom cell boundaries.

GitHub discussion #719 has the exact failure: rows like ['102', None, None, None, None, ...] when the first column was clearly populated. jsvine's diagnosis: blue rectangles with non_stroking_color == (0, 0, 1).

Filter them out before extracting:

with pdfplumber.open("solaredge_report.pdf") as pdf:
    table_page = pdf.pages[2]  # adjust for your report

    filtered_page = table_page.filter(
        lambda obj: not (
            obj.get("non_stroking_color") == (0, 0, 1)
            and obj.get("object_type") == "rect"
        )
    )

    tables = filtered_page.dedupe_chars().find_tables()
    for table in tables:
        rows = table.extract()
        for row in rows:
            print(row)
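
Since the offending fill color may differ between reports, it can be worth building the filter predicate from a list of colors instead of hardcoding one. The sketch below is an assumption-laden variant of the lambda above: `make_rect_filter` is a hypothetical helper, and it converts list-valued colors to tuples in case the color does not arrive as a tuple.

```python
def make_rect_filter(fill_colors):
    """Build a predicate that drops rects whose fill color is in fill_colors."""
    blocked = {tuple(color) for color in fill_colors}

    def keep(obj: dict) -> bool:
        if obj.get("object_type") != "rect":
            return True  # only rectangles are candidates for removal
        color = obj.get("non_stroking_color")
        if isinstance(color, (list, tuple)):
            color = tuple(color)
        return color not in blocked

    return keep
```

Usage mirrors the original: `table_page.filter(make_rect_filter([(0, 0, 1)]))`.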

Still getting None columns? Turn on visual debugging:

im = filtered_page.to_image(resolution=150)
im.debug_tablefinder()
im.save("debug_tablefinder.png")

That PNG shows every line and cell boundary pdfplumber found. Any horizontal line crossing a column in the wrong place is a phantom rectangle you haven't filtered. Visual debugging is not optional for this class of problem.
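
To decide when the debug image is worth generating, you can flag rows that are mostly empty. This is a rough sketch; the function name and the 50% threshold are assumptions.

```python
def mostly_empty_rows(rows, max_none_ratio: float = 0.5) -> list:
    """Return indexes of rows where more than max_none_ratio of cells are empty."""
    flagged = []
    for index, row in enumerate(rows):
        if not row:
            flagged.append(index)
            continue
        empty = sum(1 for cell in row if cell is None or cell == "")
        if empty / len(row) > max_none_ratio:
            flagged.append(index)
    return flagged
```

If this returns anything for a table you expected to be dense, render the debug image for that page before trusting the data.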

Pitfall 3: Rotated pages don't extract cleanly

SolarEdge Designer reports often store a landscape-oriented layout page as a portrait page with a /Rotate 90 flag. pdfplumber's extract_text() on that page returns garbled output. Issue #848 shows text coming back reversed: "OHW / campaign in Yemen :otohP" (:otohP = "Photo:" reversed).

pdfplumber exposes page.rotation (0, 90, 180, or 270) but has no .rotate() method. Normalize rotation with PyMuPDF first:

import fitz  # pip install pymupdf
import pdfplumber

doc = fitz.open("solaredge_report.pdf")
for page in doc:
    if page.rotation != 0:
        page.set_rotation(0)
doc.save("solaredge_normalized.pdf")

with pdfplumber.open("solaredge_normalized.pdf") as pdf:
    words = pdf.pages[1].extract_words()

For rotated label text within a page (string IDs printed vertically next to panel rows), use the character transformation matrix:

from pdfplumber.ctm import CTM

with pdfplumber.open("solaredge_normalized.pdf") as pdf:
    for char in pdf.pages[1].chars:
        ctm = CTM(*char["matrix"])
        if abs(ctm.skew_x) > 0.1:
            print(f"Rotated: '{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f})")

This identifies rotated string labels to handle separately from horizontal text.
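
If you'd rather not depend on the `CTM` helper, the rotation angle can be read from the matrix directly. The sketch below assumes `char["matrix"]` is the usual six-number PDF matrix `(a, b, c, d, e, f)` and computes the angle as `atan2(b, a)`. The function names and the 1 degree threshold are ours, and the sign of the angle depends on convention, so check it against a label whose orientation you know.

```python
import math


def char_rotation_degrees(matrix) -> float:
    """Rotation angle in degrees from a 6-number PDF matrix (a, b, c, d, e, f)."""
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a))


def split_by_rotation(chars, threshold_deg: float = 1.0):
    """Partition char dicts into (horizontal, rotated) lists by matrix angle."""
    horizontal, rotated = [], []
    for ch in chars:
        angle = char_rotation_degrees(ch["matrix"])
        if abs(angle) > threshold_deg:
            rotated.append(ch)
        else:
            horizontal.append(ch)
    return horizontal, rotated
```

Pass `page.chars` to `split_by_rotation` and run normal word extraction only on the horizontal group.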

Pitfall 4: Clipping paths and overlapping columns

If extracted text has characters doubled up or merged in ways that don't match what you see on screen, you're hitting PDF clipping paths. GitHub issue #912 has the exact failure: PDFs use W* operators to mask overflow text, and pdfplumber's character extractor can't see the mask. Text clipped out of the visible area still gets extracted.

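dedupe_chars() removes characters that duplicate another character's text, font, and position within a small tolerance (1 pt by default). It won't help when clipped text lands outside that tolerance or isn't a duplicate at all. Crop each column explicitly: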

w, h = page.width, page.height
# Adjust X bounds for your report version
string_col  = page.crop((0,           100, w * 0.25, h - 50))
panel_col   = page.crop((w * 0.25,    100, w * 0.55, h - 50))
mppt_col    = page.crop((w * 0.55,    100, w,        h - 50))

Inspect the X boundaries once with visual debugging, hardcode them, and note the Designer version in a comment. When SolarEdge updates the template, you'll know exactly where to look.
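
To keep those boundaries in one place, you can generate the crop boxes from a list of split fractions. This is a sketch with the same numbers as above as defaults; `column_bboxes` is a hypothetical helper, and each tuple it returns is in the `(x0, top, x1, bottom)` order that `page.crop()` expects.

```python
def column_bboxes(width, height, splits=(0.25, 0.55), top=100.0, bottom_margin=50.0):
    """Return one (x0, top, x1, bottom) box per column for the given splits."""
    # Designer version: record it here when you hardcode the splits
    edges = [0.0, *[width * s for s in splits], width]
    return [
        (x0, top, x1, height - bottom_margin)
        for x0, x1 in zip(edges, edges[1:])
    ]
```

Then `[page.crop(box) for box in column_bboxes(page.width, page.height)]` gives you the three columns, and a template change means editing one tuple.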

The Ghostscript pre-repair trick

If extraction fails in ways that don't fit the above patterns (empty tables, disappearing lines), run the PDF through Ghostscript first:

gswin64.exe -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress original.pdf

On Mac/Linux, replace gswin64.exe with gs. This is folklore from pdfplumber discussion #647. Ghostscript re-emits the PDF content stream, normalizing objects that pdfminer's parser trips on. The output looks identical but pdfplumber can read the structure.
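
If you want the repair step inside your Python pipeline, a thin `subprocess` wrapper is enough. The sketch below uses the same flags as the command above. The function names are ours, and the list of executable names to try (`gs`, plus `gswin64c`, the console build on Windows) is an assumption about what is on your PATH.

```python
import shutil
import subprocess

# Executable names to try, in order (assumption: one of these is on PATH)
GS_CANDIDATES = ("gs", "gswin64c", "gswin64")


def build_gs_command(src, dst, executable: str = "gs") -> list:
    """Build the Ghostscript argument list for a pdfwrite round trip."""
    return [
        executable,
        "-o", str(dst),
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        str(src),
    ]


def repair_pdf(src, dst):
    """Re-emit src through Ghostscript into dst; raise if gs is unavailable."""
    executable = next((name for name in GS_CANDIDATES if shutil.which(name)), None)
    if executable is None:
        raise FileNotFoundError("Ghostscript executable not found on PATH")
    subprocess.run(build_gs_command(src, dst, executable), check=True)
    return dst
```

Keeping the command builder separate from the runner makes it easy to log the exact command when a repair fails.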

PyMuPDF, Camelot, tabula

PyMuPDF (fitz) is 30–45x faster than pdfminer. Use it for scanning batches to pull specific fields (panel count, inverter model, project name). Use pdfplumber when you need to debug why extraction is returning garbage.

Camelot (lattice mode) wins on ruled tables. SolarEdge Designer tables are mixed — some have borders, some don't. For ruled ones, camelot.read_pdf("report.pdf", flavor="lattice") requires zero tuning.

tabula-py wraps Java and struggles with non-standard column spacing. Skip it.

Per the Camelot maintainers' comparison wiki: no single tool wins on all document types. Start with pdfplumber and visual debugging. Once you understand your report's structure, switch to PyMuPDF for speed.

Assembling panel positions and string assignments

With the pitfalls handled, the extraction loop for the layout page looks roughly like:

import pdfplumber
from collections import defaultdict

def extract_string_assignments(pdf_path: str) -> dict:
    # Returns: string_id -> [(pdf_x, pdf_native_y), ...]
    # pdf_native_y is bottom-left origin (matches AutoCAD UCS)
    # Assumes page rotation has been normalized (see Pitfall 3)
    string_map = defaultdict(list)

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[1]  # adjust per report
        filtered = page.filter(
            lambda obj: not (
                obj.get("non_stroking_color") == (0, 0, 1)
                and obj.get("object_type") == "rect"
            )
        )
        words = filtered.dedupe_chars().extract_words()
        page_height = page.height

        # Bucket words into 3-pt bands by `top` (a rough row grouping;
        # words that straddle a band edge can land in different rows)
        rows = defaultdict(list)
        for w in words:
            rows[round(w["top"] / 3) * 3].append(w)

        for row_y in sorted(rows):
            row = sorted(rows[row_y], key=lambda w: w["x0"])
            if len(row) >= 2:
                string_id = row[0]["text"]
                for w in row[1:]:
                    string_map[string_id].append((
                        (w["x0"] + w["x1"]) / 2,
                        page_height - w["top"],  # flip Y
                    ))

    return dict(string_map)

This is a starting point. Exact row structure depends on your Designer version and MPPT count.
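
One weakness of rounding `top` into fixed bands is that two words on the same visual row can fall on either side of a band edge. A gap-based grouping avoids that. The sketch below is an alternative, not what the function above does: it sorts words by `top` and starts a new row whenever the jump from the previous word exceeds the tolerance. The function name and the 3 pt default are assumptions, and a long run of slightly staggered words can chain into one row.

```python
def cluster_rows(words, tolerance: float = 3.0) -> list:
    """Group word dicts into rows by vertical gap, each row sorted left to right."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        # Compare against the most recently added word in the current row
        if rows and word["top"] - rows[-1][-1]["top"] <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w["x0"]) for row in rows]
```

With tops of 4.4 and 4.6, fixed 3-pt rounding puts the two words in different bands, while this version keeps them together.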

The iceberg

The code above handles a single-inverter project with standard column alignment and no rotated labels. Everything past that gets harder.

Multi-inverter projects have one layout section per inverter, each with its own coordinate space. String IDs like "S1-1" and "S2-1" appear in the same column but belong to different inverters. The page structure distinguishing them is visual spacing, not any logical marker in the content stream.

SolarEdge updates Designer. When they do, non_stroking_color == (0, 0, 1) might change, column X-boundaries shift, and the page that was index 1 becomes index 2. Your parser silently returns empty data and you won't know until a drawing ships with no strings.
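
Two cheap guards take the "silently" out of that failure mode: locate the page by its text instead of a fixed index, and refuse to return an empty result. The sketch below is illustrative only. The function names are hypothetical, and the anchor words you search for are something you would pick by looking at your own reports.

```python
def find_page_index(page_texts, anchors) -> int:
    """Return the index of the first page whose text contains every anchor."""
    wanted = [anchor.lower() for anchor in anchors]
    for index, text in enumerate(page_texts):
        lowered = (text or "").lower()  # tolerate pages with no extractable text
        if all(anchor in lowered for anchor in wanted):
            return index
    raise LookupError(f"No page contains all anchors: {anchors!r}")


def require_strings(string_map: dict, minimum: int = 1) -> dict:
    """Raise instead of passing along a suspiciously small result."""
    if len(string_map) < minimum:
        raise ValueError(
            f"Expected at least {minimum} strings, got {len(string_map)}; "
            "the report template may have changed"
        )
    return string_map
```

Build `page_texts` with `[p.extract_text() for p in pdf.pages]` and wrap the final result in `require_strings(...)` so a template change fails loudly at parse time.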

Converting PDF coordinates to real-world AutoCAD coordinates requires finding two reference points (a scale bar, a dimensioned edge), computing a scale factor and rotation, and applying it to every panel position. There's no API for this. It's custom geometry on top of the coordinate dump you extracted.
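
The geometry itself is small once you have the two reference points. The sketch below fits a similarity transform (uniform scale, rotation, translation) from two PDF points and their known CAD counterparts, using complex numbers so that scale and rotation come from a single division. It assumes the PDF points are already in bottom-left-origin coordinates and that no mirroring is involved; the function name is ours.

```python
import math


def fit_similarity(pdf_a, pdf_b, cad_a, cad_b):
    """Fit z' = a*z + b from two point pairs; return (transform, scale, rotation_deg)."""
    p1, p2 = complex(*pdf_a), complex(*pdf_b)
    q1, q2 = complex(*cad_a), complex(*cad_b)
    if p1 == p2:
        raise ValueError("Reference points must be distinct")
    a = (q2 - q1) / (p2 - p1)  # encodes scale (magnitude) and rotation (angle)
    b = q1 - a * p1            # translation

    def transform(point):
        z = a * complex(*point) + b
        return (z.real, z.imag)

    return transform, abs(a), math.degrees(math.atan2(a.imag, a.real))
```

For example, mapping PDF points (0, 0) and (1, 0) to CAD points (10, 10) and (10, 12) yields a scale of 2 and a rotation of 90 degrees. Apply the returned `transform` to every panel position from the extraction step.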

We built Branch because doing all of this reliably, across every variant of SolarEdge Designer report we've encountered over two years, is a thousand-hour project. Not a weekend project.

If your project is a clean single-inverter layout and you control the Designer version, the code in this post will get you most of the way there. If you're running a design firm processing 30 projects a month across multiple inverter configurations and Designer keeps changing under you, see how Branch handles the import.

One note on alternatives: PVCAD claims SolarEdge support but imports the DXF export, not the PDF. That gives you geometry with no string data attached. See our Branch vs. PVCAD comparison for the full breakdown.
