---
name: invoice-extractor
description: >
  Processes one or multiple invoices (PDF, image, or pasted text) and performs
  structured data extraction, normalization, duplicate detection, and anomaly
  flagging — returning a clean JSON result plus a rich interactive HTML dashboard.
  Use this skill whenever the user uploads or pastes any invoice
  or batch of invoices and asks to extract data, find duplicates, check for
  anomalies, or audit invoices. Also triggers for phrases like "process this
  invoice", "extract invoice data", "check these invoices for duplicates",
  "flag any issues in these invoices", "batch process invoices", "invoice audit",
  "parse this bill", "dedup these invoices", or any request combining invoice/bill
  content with extraction, analysis, or validation. Always use this skill — even
  for a single invoice — to ensure consistent structured output and anomaly checks.
---

# Invoice Extraction, Deduplication & Anomaly Detection

## Overview

This skill processes one or many invoices and returns:
1. **Structured JSON** — normalised, field-consistent invoice records
2. **Duplicate report** — suspected duplicates with reasoning
3. **Anomaly report** — flagged issues per invoice
4. **Interactive HTML dashboard** — rendered output for human review

---

## Step-by-Step Workflow

### Step 1 — Ingest Inputs

Accept any combination of:
- **PDF files** → read via `bash_tool` (pdftotext or python pdfplumber)
- **Images** (JPG/PNG) → pass directly to Claude vision in the API call
- **Pasted text** → use directly

If files are present, check `/mnt/user-data/uploads/` first.

```bash
# For PDFs, extract text:
pip install pdfplumber --break-system-packages -q
python3 - <<'EOF'
import pdfplumber, sys, json
results = {}
for path in sys.argv[1:]:
    with pdfplumber.open(path) as pdf:
        text = "\n".join(p.extract_text() or "" for p in pdf.pages)
    results[path] = text
print(json.dumps(results))
EOF /path/to/invoice.pdf
```

For images, encode as base64 and pass as `image` content blocks to the API (see Step 2).

---

### Step 2 — Extract & Normalise via Claude API

Call the Claude API with **all invoice content** in a single prompt (batch processing). Use `claude-sonnet-4-20250514`.

**System prompt to use:**

```
You are a precise invoice data extractor. For each invoice provided, extract all fields below. Return ONLY valid JSON — no markdown fences, no commentary.

EXTRACTION FIELDS (per invoice):
- vendor_name, vendor_address, vendor_email
- invoice_number, invoice_date (YYYY-MM-DD), due_date (YYYY-MM-DD)
- bill_to (name/company of the recipient)
- line_items: array of { description, quantity, unit_price, amount }
- subtotal, tax_rate (as number, e.g. 10 for 10%), tax_amount, total_due
- currency (ISO 4217, e.g. "USD", "GBP", "THB")
- notes

NORMALISATION RULES:
- All dates → YYYY-MM-DD
- All monetary values → plain decimal number (no symbols, no commas), e.g. 1500.00
- Null for any field not found — do NOT invent values
- Consistent snake_case field names

DUPLICATE DETECTION:
Compare all invoices. Flag duplicates where:
- Same vendor_name AND same invoice_number
- OR same total_due AND same invoice_date (within same vendor)

ANOMALY DETECTION — flag any of:
- missing_required_field: invoice_number, total_due, invoice_date, vendor_name absent
- total_mismatch: sum(line_items.amount) ≠ subtotal, or subtotal + tax_amount ≠ total_due (allow ±0.02 rounding)
- duplicate_line_item: two identical line items (same description + amount) within one invoice
- extreme_tax_rate: tax_rate < 0 or tax_rate > 30
- missing_vendor_contact: vendor_email is null
- unusual_pricing: a unit_price that is >3× or <1/3 of the median unit_price across all invoices in the batch (only flag if batch ≥ 3 invoices)
- suspicious_vendor: vendor details inconsistent or incomplete (no address + no email)

OUTPUT SCHEMA:
{
  "invoices": [
    {
      "invoice_id": "inv_001",
      "source": "<filename or 'pasted_text_1'>",
      "vendor_name": "...",
      "vendor_address": "...",
      "vendor_email": "...",
      "invoice_number": "...",
      "invoice_date": "YYYY-MM-DD",
      "due_date": "YYYY-MM-DD",
      "bill_to": "...",
      "line_items": [
        { "description": "...", "quantity": 1, "unit_price": 0.00, "amount": 0.00 }
      ],
      "subtotal": 0.00,
      "tax_rate": 0,
      "tax_amount": 0.00,
      "total_due": 0.00,
      "currency": "USD",
      "notes": null
    }
  ],
  "duplicates": [
    {
      "invoice_ids": ["inv_001", "inv_002"],
      "reason": "Same vendor_name and invoice_number"
    }
  ],
  "anomalies": [
    {
      "invoice_id": "inv_001",
      "type": "total_mismatch",
      "explanation": "Line items sum to 1400.00 but subtotal is 1500.00"
    }
  ]
}
```

**User message:** Include all invoice text/images, labelled by source filename.

---

### Step 3 — Validate the JSON

After receiving the API response:

```python
import json

def validate_invoice_json(raw):
    data = json.loads(raw)
    assert "invoices" in data
    assert "duplicates" in data
    assert "anomalies" in data
    for inv in data["invoices"]:
        assert "invoice_id" in inv
        assert "total_due" in inv
    return data
```

If parsing fails, retry once with an explicit "return ONLY valid JSON" instruction.

---

### Step 4 — Build the HTML Dashboard

Generate a single-file interactive HTML dashboard. Default palette (replace with your own brand colours):
- **Primary:** `#f3af00` (yellow)
- **Dark:** `#201600`
- **Card bg:** `#ffffff`
- **Warning:** `#ff6b35`
- **Success:** `#27ae60`

**Dashboard sections:**

1. **Summary Bar** — total invoices processed, total value, duplicate count, anomaly count
2. **Invoice Cards** — one card per invoice showing all fields; line items in a mini table; badges for anomalies
3. **Duplicates Panel** — grouped duplicate sets with reason
4. **Anomalies Panel** — table of all anomalies with type, invoice ID, explanation, severity icon
5. **Raw JSON Tab** — collapsible pre-formatted JSON export (full output)
6. **Export Button** — downloads the JSON as `invoice_report_<date>.json`

Use tab navigation between sections. Mark invoices with anomalies with a ⚠️ badge on their card. Mark duplicates with a 🔁 badge.

---

### Step 5 — Save & Present

```bash
cp /home/claude/invoice_dashboard.html /mnt/user-data/outputs/invoice_dashboard.html
```

Then call `present_files` with the output path. Also paste the raw JSON in a code block below the file link for easy copying.

---

## Field Reference

| Field | Type | Required | Notes |
|---|---|---|---|
| vendor_name | string | ✅ | |
| vendor_address | string | ⚪ | |
| vendor_email | string | ⚪ | Flag if missing |
| invoice_number | string | ✅ | |
| invoice_date | string | ✅ | YYYY-MM-DD |
| due_date | string | ⚪ | |
| bill_to | string | ⚪ | |
| line_items | array | ✅ | |
| subtotal | number | ✅ | |
| tax_rate | number | ⚪ | % as number |
| tax_amount | number | ⚪ | |
| total_due | number | ✅ | |
| currency | string | ⚪ | ISO 4217 |
| notes | string | ⚪ | |

---

## Anomaly Severity Guide

| Type | Severity | Action |
|---|---|---|
| total_mismatch | 🔴 High | Review immediately |
| duplicate_invoice | 🔴 High | Do not pay twice |
| missing_required_field | 🟠 Medium | Request from vendor |
| extreme_tax_rate | 🟠 Medium | Verify with vendor |
| duplicate_line_item | 🟡 Low | May be intentional |
| unusual_pricing | 🟡 Low | Spot-check |
| missing_vendor_contact | ⚪ Info | Nice to have |
| suspicious_vendor | 🟠 Medium | Verify identity |

---

## Edge Cases

- **Scanned/blurry PDFs**: Extract what's visible; null out unreadable fields; add anomaly `type: "low_confidence_extraction"` 
- **Multi-page invoices**: Treat as one invoice; concatenate pages before extraction
- **Foreign currencies**: Preserve original currency code; do NOT convert
- **No line items**: Set `line_items: []`; flag as `missing_required_field` for line_items
- **Single invoice**: Full workflow still applies — skip unusual_pricing anomaly (needs ≥3 for comparison)
