# Spreadsheet Integrity Pipeline

Every xlsx deliverable is built and verified through a role-based workflow. Three roles collaborate in sequence: **Blueprint Architect**, **Builder**, and **Inspector**. Each role has explicit responsibilities and handoff criteria.

---

## Tool Reference: xlsx.py

All commands: `python3 "$XLSX_SKILL_DIR/xlsx.py" <command> [arguments]`

| Command | Purpose | Called By |
|---------|---------|-----------|
| `recalc <file>` | Recalculate formulas via LibreOffice, scan for errors | Builder (self-check) |
| `audit <file>` | Deep formula error scan + zero-value + implicit array detection | Builder (self-check) |
| `scan <file>` | Detect out-of-range, header-included, small-aggregate, inconsistent patterns | Builder (self-check) |
| `inspect <file> --pretty` | Get sheet structure, data ranges, headers (JSON) | Blueprint Architect |
| `pivot <in> <out> --source --values [--rows --cols --filters --style --chart]` | Create PivotTable | Builder (final step only) |
| `chart-verify <file>` | Verify embedded charts have data | Builder (self-check) |
| `validate <file>` | Structural validation (release gate) | Inspector |

---

## Role 1: Blueprint Architect

Before any code runs, the Architect produces a build plan:

- **Decompose the request**: separate explicit requirements from implicit business context
- **Map every sheet**: name, column structure, formula dependencies, cross-references
- **Identify data flow**: which sheets feed into which (source → derived → summary)
- **Flag ambiguity**: if the request is unclear, ask — don't guess

The Architect's output is a mental blueprint. No files are created yet.

---

## Role 2: Builder

The Builder writes code and produces the workbook. The Builder operates under a strict **single-sheet discipline**: complete one sheet fully, verify it, then move on.

### Build Cycle (per sheet)

```
┌─────────────────────────────────────────────┐
│  Write sheet (data, formulas, styling, charts)  │
│                    ↓                            │
│  Save workbook to disk                          │
│                    ↓                            │
│  Self-check chain:                              │
│    recalc → audit → scan                        │
│    + chart-verify (if sheet has charts)          │
│                    ↓                            │
│  All clear? ──Yes──→ Proceed to next sheet      │
│       │                                         │
│      No                                         │
│       ↓                                         │
│  Fix errors → re-save → re-run self-check       │
│  (loop until clean)                             │
└─────────────────────────────────────────────┘
```

### Builder Constraints

- **No batch-then-check**: you cannot create all sheets first and verify later. Errors in early sheets propagate silently into later sheets.
- **No error forwarding**: a sheet with unresolved errors blocks all subsequent work.
- **No silent delivery**: a file that hasn't passed self-check is not a deliverable — it's a draft.

### Pivot Tables — Special Sequencing

PivotTables depend on finalized source data. They are always the **last data operation**:

```bash
python3 "$XLSX_SKILL_DIR/xlsx.py" inspect input.xlsx --pretty   # understand structure
python3 "$XLSX_SKILL_DIR/xlsx.py" pivot input.xlsx output.xlsx \
    --source "Sheet!A1:F100" \
    --values "Revenue:sum,Units:count" \
    --rows "Product,Region" \
    --cols "Quarter" \
    --filters "Year" \
    --location "Summary!A3" \
    --style "finance" \
    --chart "bar"
```

Aggregations: `sum`, `count`, `average`/`avg`, `max`, `min`
Chart types: `bar` (default), `line`, `pie`
Styles: `monochrome` (default), `finance`

**Never modify pivot output with openpyxl afterward** — it corrupts the pivotCache.

---

## Role 3: Inspector

The Inspector runs after all sheets are built. Two levels of inspection: **Semantic** and **Structural**.

### Semantic Inspection (for edit/transform tasks)

When the task involves transforming existing data (not creating from scratch), verify the transformation didn't corrupt meaning:

| Check | Method |
|-------|--------|
| **Row count** | Does output have the expected number of rows? (e.g., grouping 15 rows by 5 keys → expect 5 rows) |
| **Column totals** | Do numeric sums in output match source? (or expected transformation) |
| **Spot-check** | Compare 2-3 specific rows between source and output |
| **Formula evaluability** | Can formulas be verified in Python? If self-referencing or cross-sheet, verify computed values instead |

```python
# Semantic verification template
source_total = sum(normalize_cell_value(ws_src.cell(row=r, column=c).value) or 0
                   for r in range(start, end + 1))
output_total = sum(normalize_cell_value(ws_out.cell(row=r, column=c).value) or 0
                   for r in range(out_start, out_end + 1))
assert abs(source_total - output_total) < 0.01, f"Total mismatch: {source_total} vs {output_total}"
```

### Structural Inspection (release gate)

```bash
python3 "$XLSX_SKILL_DIR/xlsx.py" validate output.xlsx
```

- Exit 0 → file is releasable
- Non-zero → Builder must regenerate from scratch with corrected code

---

## Known Traps & Countermeasures

These are recurring failure modes. The Builder must internalize them.

| Trap | What Goes Wrong | Countermeasure |
|------|----------------|----------------|
| `data_only=True` then save | Formulas permanently replaced with cached values | Never save after opening with `data_only=True` |
| Column index miscalculation | col 64 ≠ "BK" | Always use `openpyxl.utils.get_column_letter()` |
| Row offset confusion | DataFrame row 5 = Excel row 6 | Excel is 1-indexed, pandas is 0-indexed |
| NaN leaks into formulas | `=A1+nan` → broken formula string | Check `pd.notna()` before referencing |
| Cross-sheet reference typo | `Sheet1!A1` vs `'Sheet 1'!A1` | Quote sheet names containing spaces |
| Division by zero | `#DIV/0!` in Excel | Wrap with `IFERROR()` or `IF(denom=0,...)` |
| Text starting with `=` | `#NAME?` error | Prefix descriptive text with `'` |
| Implicit array formula | `#N/A` in Excel | Avoid `MATCH(TRUE(),range>0,0)`, use `SUMPRODUCT` |
| Chart renders blank | Formula cells have no cached values | Run `recalc` before creating charts |
| Hidden rows → empty chart | Chart skips hidden data | Set `chart.plot_visible_only = False` |
| Overlapping charts | Multiple charts stacked on same cells | Calculate anchor: ~15 rows per chart + 2 rows gap |
| Verify newly-written formulas with `data_only=True` → get `None` | openpyxl doesn't evaluate formulas; `data_only=True` only reads Excel's cached values which don't exist for new formulas | Compute expected values in Python and compare directly. For TOTAL rows needing verification, write computed values (see SKILL.md Design Principle #1 Exception) |
| Manual row sort breaks references | Value-swap sorting doesn't update formula references | After sorting by swapping data, regenerate all formula strings with updated row numbers |
| NBSP (`\xa0`) treated as non-empty | Cells containing `\xa0` or `\u200b` look blank but fail `is None` | Normalize: `\xa0`, `\u200b`, whitespace-only → `None` before comparison or aggregation |

---

## Cross-Validation Review Sheet

For analysis-heavy deliverables, embed a self-checking Review sheet in the workbook.

### When Required

- Deliverables with computed metrics or aggregated data
- Financial models with cross-sheet references
- Data sourced from external APIs or web searches

### Structure

```python
review_ws = wb.create_sheet("Review")
review_ws.sheet_properties.tabColor = "FFC000"  # amber tab

checks = [
    ["Check", "Expected", "Actual", "Status"],
    ["Total Revenue", "=SUM(Data!B2:B100)", "=Summary!B10", '=IF(B2=C2,"✓ PASS","✗ FAIL")'],
    ["Row Count", "=COUNTA(Data!A:A)-1", "=Summary!B3", '=IF(B3=C3,"✓ PASS","✗ FAIL")'],
    ["Grand Total Match", "=Detail!F50", "=Dashboard!C5", '=IF(B4=C4,"✓ PASS","✗ FAIL")'],
]
for i, row in enumerate(checks, 1):
    for j, val in enumerate(row, 1):
        review_ws.cell(row=i, column=j, value=val)
```

### Rules

- Every Summary/Dashboard metric must have a cross-check formula back to source data
- Status column uses live formulas — green if correct, red if mismatch
- Review is the **last sheet** in the workbook (before Sources, if present)

---

## Release Checklist

Before handing the file to the user:

- [ ] Every sheet passed the Builder's self-check chain
- [ ] Semantic inspection passed (if applicable)
- [ ] `validate` returned exit code 0
- [ ] All temp files, drafts, and retry artifacts removed
- [ ] If multiple versions exist from retries, only the latest correct version remains
- [ ] Every remaining file in the output directory is an expected deliverable
- [ ] **VBA check** (if `.xlsm`): VBA modules preserved, no unintended macro removal
- [ ] **VBA security** (if VBA generated): passes security checklist in `scenes/vba.md`
