Reading SolarEdge PDFs in Python: A Beginner's Guide to pdfplumber
How to install pdfplumber, open a SolarEdge Designer PDF, and extract text and tables for AutoCAD import — a friendly introduction before tackling the production pitfalls.
SolarEdge Designer reports are PDFs. Every project drops one in your inbox: panel positions, string assignments, optimizer placements, inverter mappings — all locked inside a file that AutoCAD can't import. Manually retyping that data is a 90-minute job per project that introduces typos. pdfplumber is the Python library that gets the data out.
By the end of this post you'll have pdfplumber installed and a script that opens a SolarEdge Designer report, reads its pages, and dumps the text. This is the gentle introduction. The production extractor — with its four pitfalls — is the deep-dive after this. Plan for 30 minutes.
What pdfplumber actually does
A PDF is not a spreadsheet. It's a list of drawing commands: "put this character at this X/Y position on the page, draw this rectangle, draw this line." There's no concept of "row" or "column" anywhere in the file format. Every PDF library has to reconstruct rows and columns from the positions of individual characters.
pdfplumber does that reconstruction for you. You give it a PDF file; it gives you words, tables, and text positions. It is read-only — it cannot fill in fields or modify the PDF in any way.
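To make the idea concrete, here is a toy version of that reconstruction. It is not how pdfplumber works internally; it is a minimal sketch that assumes each character arrives as a dictionary with a `text`, an `x0`, and a `top` value (the sample data and the `group_into_rows` helper are made up for illustration).

```python
from collections import defaultdict

# Hypothetical character records, deliberately out of order, positions in points.
chars = [
    {"text": "B", "x0": 20.0, "top": 100.2},
    {"text": "A", "x0": 10.0, "top": 100.0},
    {"text": "D", "x0": 20.0, "top": 120.1},
    {"text": "C", "x0": 10.0, "top": 120.0},
]

def group_into_rows(chars, tolerance=2.0):
    """Bucket characters with similar 'top' values into rows, then sort each row by x0."""
    rows = defaultdict(list)
    for ch in chars:
        # Characters whose 'top' rounds to the same band are treated as one row.
        rows[round(ch["top"] / tolerance)].append(ch)
    return [
        "".join(c["text"] for c in sorted(rows[key], key=lambda c: c["x0"]))
        for key in sorted(rows)  # smaller 'top' means higher on the page
    ]

print(group_into_rows(chars))  # ['AB', 'CD']
```

A real library has to cope with fonts of different sizes, characters that sit right on a band boundary, and rotated text, which is why you let pdfplumber do this rather than rolling your own.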
For SolarEdge Designer reports specifically, pdfplumber gets you:
- The text labels on the layout page (string IDs, panel IDs, dimension callouts)
- The X/Y position of every text label, which gives you each panel's location on the page
- The tables on the equipment summary and string summary pages
That's the data you want for AutoCAD. The four pitfalls that can break each of those steps — rotated pages, phantom column boundaries, the Y-axis flip, and clipping paths — are covered in the deep-dive after this post. Here, you're learning the fundamentals with a working script.
Step 1: Install pdfplumber
Open PowerShell. You should still have Python from the first post in this series. Verify it:
python --version
You should see something like Python 3.12.3. Now install pdfplumber:
pip install "pdfplumber>=0.10.0"
pip is Python's package manager. It connects to the Python Package Index — the central repository where tens of thousands of Python libraries are hosted — downloads pdfplumber, and installs it. This is how you add any third-party library to Python.
The version pin (>=0.10.0) matters. Earlier versions have a known bug where ligatures like the "ffi" in "Office" come back as a single odd character (ﬃ, Unicode U+FB03) instead of three separate letters. The deep-dive post explains why. For now, just pin it.
You'll see a flurry of "Collecting" and "Installing" lines, ending with Successfully installed pdfplumber-0.X.X. If you get 'pip' is not recognized, the PATH issue from the install post is back — reinstall Python with the "Add to PATH" box checked.
Confirm the install worked:
python -c "import pdfplumber; print(pdfplumber.__version__)"
You should see a version number. You're ready.
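If you want a script to check the pin for you, compare version numbers as tuples of integers, not as strings. The sketch below assumes a plain dotted version string like the one `pdfplumber.__version__` prints; `version_tuple` and `meets_minimum` are hypothetical helper names.

```python
import re

def version_tuple(version):
    """Turn '0.10.0' into (0, 10, 0) so the parts compare as numbers."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])

def meets_minimum(version, minimum="0.10.0"):
    """True when 'version' is at least 'minimum'."""
    return version_tuple(version) >= version_tuple(minimum)

print("0.9.0" >= "0.10.0")      # True: plain string comparison gets this wrong
print(meets_minimum("0.9.0"))   # False
print(meets_minimum("0.11.4"))  # True
```

In your own script you would pass in `pdfplumber.__version__` instead of a literal string.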
Step 2: Get a sample PDF
You need a SolarEdge Designer report PDF. If you have a real one from a recent project, use it. If not, any multi-page PDF will work for the basic operations in this post — the SolarEdge-specific details come up when you extract words by position in Step 5.
For this walkthrough, save your PDF to C:\temp\solar_report.pdf. If C:\temp\ doesn't exist, open File Explorer and create the folder. We'll reference that path throughout.
Step 3: Open the PDF and inspect its pages
Create a new file in VS Code called read_pdf.py in C:\temp\. Type this in:
import pdfplumber

pdf_path = r"C:\temp\solar_report.pdf"

with pdfplumber.open(pdf_path) as pdf:
    print(f"Number of pages: {len(pdf.pages)}")
    for i, page in enumerate(pdf.pages):
        print(f"Page {i}: {page.width:.1f} x {page.height:.1f} pts, rotation={page.rotation}")
Run it — press F5 in VS Code, then pick "Python File" if it asks. You'll see something like:
Number of pages: 8
Page 0: 612.0 x 792.0 pts, rotation=0
Page 1: 792.0 x 612.0 pts, rotation=0
Page 2: 612.0 x 792.0 pts, rotation=0
Page 3: 612.0 x 792.0 pts, rotation=0
...
What you're seeing:
- Dimensions are in PostScript points. One point is 1/72 of an inch.
- 612 × 792 is letter portrait (8.5" × 11"). 792 × 612 is letter landscape.
- rotation is a flag stored in the PDF itself. A page with rotation=90 looks landscape on screen but is stored as portrait with a rotation applied. Some SolarEdge Designer reports use this. If you see it, the deep-dive post covers how to handle it (it's Pitfall 3 over there).
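The point arithmetic is easy to check yourself. Here's a small sketch that converts a page size from points to inches and labels the orientation; `describe_page` is a made-up helper, and it only looks at width and height, not at the rotation flag.

```python
POINTS_PER_INCH = 72  # one PostScript point is 1/72 of an inch

def describe_page(width_pts, height_pts):
    """Return (width in inches, height in inches, orientation) for a page size in points."""
    orientation = "landscape" if width_pts > height_pts else "portrait"
    return width_pts / POINTS_PER_INCH, height_pts / POINTS_PER_INCH, orientation

print(describe_page(612, 792))  # (8.5, 11.0, 'portrait')
print(describe_page(792, 612))  # (11.0, 8.5, 'landscape')
```

In the script above you would call it as `describe_page(page.width, page.height)`.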
The with pdfplumber.open(...) as pdf: block is the same pattern as with open(...) from the first post. pdfplumber closes the file automatically when the block exits, even if something goes wrong inside it.
Step 4: Extract all text from a page
The simplest extraction is extract_text(). It pulls every text character on a page and reassembles them into one string, reading left-to-right, top-to-bottom.
Replace the loop in your script with this:
with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[0]  # the first page, index 0
    text = page.extract_text()
    print(text)
Run it. The entire text content of the first page dumps to your terminal. For a SolarEdge cover page, you'll see the project name, address, system size, designer name, and report date.
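Once one page works, dumping the whole report is a loop. The sketch below assumes only that each page object has an `extract_text()` method, which is true of pdfplumber pages. `collect_page_text` is a hypothetical helper, and the `StandInPage` class exists purely so the example runs without a PDF on disk.

```python
def collect_page_text(pages):
    """Return (page_index, text) pairs, using '' when a page yields no text."""
    results = []
    for i, page in enumerate(pages):
        # 'or ""' guards against a None or empty result from a blank page.
        results.append((i, page.extract_text() or ""))
    return results

class StandInPage:
    """Stand-in for a pdfplumber page, for demonstration only."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

for index, text in collect_page_text([StandInPage("Project: Demo"), StandInPage(None)]):
    print(f"--- page {index} ---")
    print(text)

# With a real file (requires pdfplumber):
#   import pdfplumber
#   with pdfplumber.open(r"C:\temp\solar_report.pdf") as pdf:
#       pages_text = collect_page_text(pdf.pages)
```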
This works well on most cover pages and summary pages. It struggles when:
- The page has rotation != 0 (you'll get garbled or reversed text)
- There are overlapping text layers, like a watermark behind the content
- The font uses unusual encoding
For the SolarEdge layout page — the one with hundreds of small panel labels — you don't want one big blob of text anyway. You want each label with its position. That's the next step.
Step 5: Extract words with their positions
pdfplumber can return every word on a page as a dictionary that includes its text and its exact location. This is what you need for panel position extraction.
with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[1]  # adjust to the layout page in your report
    words = page.extract_words()
    for w in words[:10]:
        print(f"'{w['text']}' x={w['x0']:.1f} top={w['top']:.1f}")
Each word dictionary has these keys:
- text: the word itself
- x0: distance from the left edge of the page, in points
- x1: the right edge of the word
- top: distance from the top of the page, in points
- bottom: also measured from the top of the page, just further down
That last point trips up almost everyone the first time: top and bottom are both distances from the top of the page, not from the bottom. So a word near the bottom of the page has a large top value, not a small one. This is the opposite of the AutoCAD UCS, where Y increases upward. The conversion formula and why it matters are covered in detail in the deep-dive (search for "Pitfall 1" over there).
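As a preview of the arithmetic involved, here is the simplest possible flip: subtract from the page height. This is a sketch, not necessarily the exact formula the deep-dive uses. It leaves the values in points, applies no scale or origin offset, and `flip_y` is a hypothetical helper name.

```python
def flip_y(top, bottom, page_height):
    """Convert top-down 'top'/'bottom' values into Y values measured up from the page bottom."""
    y_upper = page_height - top     # upper edge of the word, measured from the bottom
    y_lower = page_height - bottom  # lower edge of the word, measured from the bottom
    return y_lower, y_upper

# A word near the bottom of a letter-portrait page (792 points tall):
print(flip_y(top=700.0, bottom=710.0, page_height=792.0))  # (82.0, 92.0)
```

A large `top` (700) becomes a small Y (92), which is what a Y-up coordinate system expects for something near the bottom of the sheet.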
For a SolarEdge layout page, extract_words() is how you get panel positions: every word whose text matches a panel ID format has an x0 and top you can record.
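Here's roughly what that filtering looks like. The panel ID pattern is an assumption: this sketch pretends IDs look like `1.1.12` (three dot-separated numbers), so check the labels in your own report and adjust the regular expression. The sample words are made up, shaped like the dictionaries `extract_words()` returns.

```python
import re

# ASSUMPTION: panel IDs are three dot-separated numbers such as "1.1.12".
# Replace this pattern with whatever your report actually uses.
PANEL_ID = re.compile(r"^\d+\.\d+\.\d+$")

def panel_positions(words):
    """Keep only words that look like panel IDs, as (id, x0, top) tuples."""
    return [
        (w["text"], w["x0"], w["top"])
        for w in words
        if PANEL_ID.match(w["text"])
    ]

# Made-up sample in the shape extract_words() produces.
sample_words = [
    {"text": "1.1.1", "x0": 100.0, "top": 200.0},
    {"text": "Roof", "x0": 50.0, "top": 40.0},
    {"text": "1.1.2", "x0": 140.0, "top": 200.0},
]

print(panel_positions(sample_words))
# [('1.1.1', 100.0, 200.0), ('1.1.2', 140.0, 200.0)]
```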
Step 6: Extract a table
The string summary and equipment pages in a SolarEdge Designer report are structured as grids. pdfplumber can extract those as proper rows and columns:
with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[2]  # adjust to a table page in your report
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
extract_tables() returns a list of tables found on the page. Each table is a list of rows. Each row is a list of cell values (strings, or None for empty cells).
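A list of lists is fine for printing, but dictionaries keyed by column name are easier to work with. The sketch below assumes the first row of the table is the header, which you should confirm for your report. The column names and values in the sample are invented, and `table_to_records` is a hypothetical helper.

```python
def table_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header text."""
    if not table:
        return []
    header, *rows = table
    # Fall back to col0, col1, ... when a header cell is None or empty.
    keys = [(name or f"col{i}").strip() for i, name in enumerate(header)]
    return [dict(zip(keys, row)) for row in rows]

# Made-up table in the shape extract_tables() returns.
sample_table = [
    ["String", "Panels", None],
    ["1.1", "12", "34.5"],
    ["1.2", "11", None],
]

for record in table_to_records(sample_table):
    print(record)
# {'String': '1.1', 'Panels': '12', 'col2': '34.5'}
# {'String': '1.2', 'Panels': '11', 'col2': None}
```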
Try this on a table page from a real SolarEdge report and you'll probably see some rows with None in several columns where there's clearly text on the page. This is one of the four pitfalls in the deep-dive — pdfplumber gets confused by invisible blue rectangles in the PDF and treats them as phantom column boundaries. There's a clean fix using pdfplumber's .filter() method, but it belongs in the production post, not here.
What you get from extract_tables() right now is what pdfplumber sees with no preprocessing. If you can read the data you care about in those rows — string assignments, panel counts, cable lengths — the basic extraction is working. The production pass in the deep-dive just cleans it up.
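If you'd like a number instead of eyeballing the rows, you can measure how much of a table came back empty. This is a rough diagnostic sketch, not a fix: `none_fraction` is a hypothetical helper and the sample table is invented. What counts as "too many" empty cells is a judgment call for your own reports.

```python
def none_fraction(table):
    """Fraction of cells in a table that are None (0.0 for an empty table)."""
    cells = [cell for row in table for cell in row]
    if not cells:
        return 0.0
    return sum(cell is None for cell in cells) / len(cells)

# Made-up table where every other column came back empty.
suspicious = [
    ["1.1", None, "12", None],
    ["1.2", None, "11", None],
]

print(none_fraction(suspicious))  # 0.5
```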
The errors you're going to hit
ModuleNotFoundError: No module named 'pdfplumber' — You didn't run pip install pdfplumber, or you installed it under a different Python version than the one VS Code is using. Check the interpreter: press Ctrl+Shift+P → "Python: Select Interpreter" — make sure VS Code is pointed at the same Python that pip installed into.
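A quick way to see which Python is actually running your script, and whether it can find pdfplumber, is to ask it. This sketch uses only the standard library; `environment_report` is a made-up helper name.

```python
import importlib.util
import sys

def environment_report(module_name="pdfplumber"):
    """Report which interpreter is running and whether it can find the given module."""
    return {
        "interpreter": sys.executable,
        "python_version": sys.version.split()[0],
        "module_found": importlib.util.find_spec(module_name) is not None,
    }

print(environment_report())
```

Run it from VS Code and again from PowerShell. If the interpreter paths differ, or module_found is False in one of them, that's your mismatch.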
FileNotFoundError — The path is wrong or the file isn't where you think. Open File Explorer and confirm solar_report.pdf is in C:\temp\.
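You can also have the script check the path before handing it to pdfplumber, so the message tells you exactly what's wrong. A minimal sketch using the standard library's pathlib; `check_pdf_path` is a hypothetical helper.

```python
from pathlib import Path

def check_pdf_path(path_str):
    """Return 'ok' or a short description of what is wrong with the path."""
    path = Path(path_str)
    if not path.exists():
        return f"Not found: {path}"
    if path.suffix.lower() != ".pdf":
        return f"Not a PDF file: {path}"
    return "ok"

print(check_pdf_path(r"C:\temp\solar_report.pdf"))
```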
extract_text() returns garbled or backwards text — The page has rotation != 0. Covered in the deep-dive (search for "Pitfall 3").
extract_tables() returns rows full of None — Phantom blue rectangles. Covered in the deep-dive (search for "Pitfall 2").
extract_words() returns an empty list — The page has no text layer. It's a scanned image — the PDF contains a photograph of a page, not actual characters. Getting text from a scanned PDF requires OCR (Tesseract, typically), which is a separate tool and a different problem.
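To find out which pages in a report have this problem, check every page at once. The sketch assumes only that each page has an `extract_words()` method, as pdfplumber pages do. `pages_without_text` is a hypothetical helper, and `StandInPage` is there just so the example runs without a PDF.

```python
def pages_without_text(pages):
    """Return the indexes of pages where extract_words() finds nothing."""
    return [i for i, page in enumerate(pages) if not page.extract_words()]

class StandInPage:
    """Stand-in for a pdfplumber page, for demonstration only."""
    def __init__(self, words):
        self._words = words

    def extract_words(self):
        return self._words

demo_pages = [
    StandInPage([{"text": "Cover"}]),
    StandInPage([]),  # behaves like a scanned page: no text layer
    StandInPage([{"text": "1.1.1"}]),
]

print(pages_without_text(demo_pages))  # [1]

# With a real file (requires pdfplumber):
#   import pdfplumber
#   with pdfplumber.open(r"C:\temp\solar_report.pdf") as pdf:
#       print(pages_without_text(pdf.pages))
```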
Everything else — copy the entire error message and paste it into Google with the word "pdfplumber." The library has an active GitHub Issues page and most problems have been documented.
Where to go next
- Parsing SolarEdge Designer PDFs for AutoCAD with Python and pdfplumber: the production deep-dive. All four pitfalls covered with working fixes: the Y-axis flip, phantom blue rectangles, rotated pages, and clipping paths. Ends with a full extraction loop for panel positions and string assignments. (~45 minutes)
- Stringing Solar Panels for AutoCAD with Python: once you have panel positions out of the PDF, the next problem is grouping them into strings. This post covers the algorithm behind it and a working 200-line solver. (~45 minutes)
If you'd rather not maintain a custom PDF parser that needs updating every time SolarEdge ships a new Designer template, Branch handles the import natively — panel positions, string assignments, and inverter mappings come straight into AutoCAD without you writing a line of code.