# `PhoenixKitCatalogue.Workers.PdfExtractor`
[🔗](https://github.com/BeamLabEU/phoenix_kit_catalogue/blob/0.8.0/lib/phoenix_kit_catalogue/workers/pdf_extractor.ex#L1)

Oban worker that extracts text page-by-page from a PDF using
`pdfinfo` (page count) + `pdftotext` (per-page text).

Keyed by `file_uuid` (core's `phoenix_kit_files.uuid`), not the
per-upload `phoenix_kit_cat_pdfs.uuid` — so two uploads of identical
content share one extraction job.

## Lifecycle

1. Look up the extraction row by `file_uuid`. If its status is terminal
   (`extracted` / `scanned_no_text` / `failed`), do nothing: the job is
   either a retry of an already-finished extraction or a duplicate
   enqueue from a content-dedup upload.
2. Resolve the binary via `Storage.retrieve_file/1` — returns a
   temp path. Works whether the file lives on local disk, S3, or
   anything core supports.
3. Mark `"extracting"`.
4. `pdfinfo` for page count. Treat parse failures as fatal.
5. For each page, `pdftotext -layout`, normalize, hash, upsert into
   the per-page content cache, insert a `pdf_pages` row.
6. Transition to `extracted` (or `scanned_no_text` if all pages
   came back empty). Failures mid-loop transition to `failed`.
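
A few of the steps above can be sketched as pure functions. This is a minimal illustration, not the worker's actual code: the module name, function names, and error atom are hypothetical, and treating whitespace-only text as "empty" is an assumption. The `pdftotext` flags used (`-f`, `-l`, `-layout`, and `-` for stdout) are standard Poppler options.

```elixir
defmodule PdfExtractorSketch do
  @moduledoc false
  # Hypothetical helper module for illustration only.

  @terminal ~w(extracted scanned_no_text failed)

  # Step 1: a terminal status turns the job into a no-op.
  def terminal?(status), do: status in @terminal

  # Step 4: read the "Pages:" line from `pdfinfo` output.
  # Anything unparseable is reported as an error (fatal for the job).
  def parse_page_count(pdfinfo_output) do
    case Regex.run(~r/^Pages:\s+(\d+)\s*$/m, pdfinfo_output) do
      [_, n] -> {:ok, String.to_integer(n)}
      _ -> {:error, :unparseable_pdfinfo}
    end
  end

  # Step 5: argv that extracts exactly one page to stdout.
  def pdftotext_args(path, page) do
    p = Integer.to_string(page)
    ["-f", p, "-l", p, "-layout", path, "-"]
  end

  # Step 6: all pages empty means a scanned PDF with no text layer.
  # (Assumption: whitespace-only text counts as empty.)
  def final_status(page_texts) do
    if Enum.all?(page_texts, &(String.trim(&1) == "")) do
      "scanned_no_text"
    else
      "extracted"
    end
  end
end
```

The argument list is meant to be passed to `System.cmd("pdftotext", args)`, which returns `{output, exit_status}`; a non-zero exit status mid-loop would correspond to the `failed` transition.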

## Concurrency

Concurrency is set in the host app's Oban queue config. We recommend a
dedicated `:catalogue_pdf` queue with a limit of 2, so that a 1000-page
PDF doesn't pin the CPU or block other queues.
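
As a rough sketch, the host app's config might look like the following. The `:my_app` OTP app name and `MyApp.Repo` are placeholders; only the queue name and limit come from the recommendation above.

```elixir
# config/config.exs in the host app.
# `:my_app` and `MyApp.Repo` are placeholders for the host's own names.
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  # At most 2 PDF extraction jobs run concurrently on this node.
  queues: [catalogue_pdf: 2]
```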

## Deduplication

Re-enqueueing the same content (duplicate-content upload, the self-heal
`requeue_stuck_extractions/1`, or the per-PDF Retry button) is deduped
*application-side* in `PdfLibrary.insert_extraction_job/1` — it skips
the insert when a non-terminal `PdfExtractor` job already exists for the
`file_uuid`. We deliberately do **not** use Oban's built-in `unique:`
option. Satisfying its compile-time check requires listing every
incomplete state, including `:suspended`, but that value is absent
from the `oban_job_state` enum on hosts that upgraded the Oban *library*
without running its latest *migration*. On those hosts the uniqueness
query raises Postgres error `22P02` (invalid text representation) and
kills every enqueue. The app-side guard queries only the four states
(`available` / `scheduled` / `executing` / `retryable`) present in every
Oban version. Races are harmless: this worker short-circuits on a
terminal status and page inserts are upserts.
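
The guard's decision can be modelled in memory as below. This is a sketch under stated assumptions, not the code in `PdfLibrary.insert_extraction_job/1`: the module and function names are hypothetical, the job list stands in for rows of the `oban_jobs` table (the real guard would query it), and the `"file_uuid"` args key is assumed.

```elixir
defmodule EnqueueGuardSketch do
  @moduledoc false
  # Hypothetical model of the app-side dedup check.

  @worker "PhoenixKitCatalogue.Workers.PdfExtractor"

  # Only the four incomplete states named in the text above.
  @incomplete_states ~w(available scheduled executing retryable)

  # `jobs` is a list of maps with :worker, :state, and :args keys,
  # standing in for rows of the oban_jobs table.
  def skip_insert?(jobs, file_uuid) do
    Enum.any?(jobs, fn job ->
      job.worker == @worker and
        job.state in @incomplete_states and
        job.args["file_uuid"] == file_uuid
    end)
  end
end
```

Because the check and the insert are not atomic, two callers can both see no matching job and both enqueue; that is the harmless race described above.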

