# `PhoenixKitCatalogue.Catalogue.PdfLibrary`
[🔗](https://github.com/BeamLabEU/phoenix_kit_catalogue/blob/0.8.0/lib/phoenix_kit_catalogue/catalogue/pdf_library.ex#L1)

PDF library — upload, extract, search.

Layered on top of core's `phoenix_kit_files` system. The catalogue
owns only:

  * `phoenix_kit_cat_pdfs` — per-upload row (the user-facing
    "this name in the library"). Soft-delete via
    `status` (`active` / `trashed`).
  * `phoenix_kit_cat_pdf_extractions` — per unique file content
    (one row per `file_uuid`). Holds the worker state machine.
  * `phoenix_kit_cat_pdf_pages` — per-page join.
  * `phoenix_kit_cat_pdf_page_contents` — content-addressed
    page text dedup cache.

Core handles binary storage, content checksum dedup, multi-bucket
redundancy, on-disk lifecycle (`Storage.trash_file/1`,
`PruneTrashJob`).

Public surface re-exported from `PhoenixKitCatalogue.Catalogue`.
Activity logging follows the catalogue convention — success-only on
the context layer; the LV layer's `Web.Helpers.log_operation_error/3`
writes the `db_pending: true` audit row on failure.

## Authorization

The mutating context functions accept `:actor_uuid` for activity
attribution but **do not enforce role checks** — authorization is
the LV mount layer's job (admin `live_session` + `on_mount` hook).
Same convention as the rest of the catalogue context. New non-LV
callers (background jobs, RPC, extension modules) MUST verify the
caller is allowed before invoking these functions.

`create_pdf_from_upload/3` does require a non-nil `:actor_uuid` —
not as authorization, but because core's `phoenix_kit_files.user_uuid`
is NOT NULL and we'd otherwise crash mid-flow after writing bytes
to disk. Returns `{:error, :missing_actor}` cleanly when missing.
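
As a rough illustration of the `:missing_actor` guard, the sketch below checks `:actor_uuid` before any side effect runs. `ActorGuardSketch`, `fetch_actor/1`, and `with_actor/2` are hypothetical names used only for this example; the sketch models the documented return shape and is not the library's implementation.

```elixir
defmodule ActorGuardSketch do
  # Hypothetical helper: returns the documented error tuple when the
  # caller did not supply an actor.
  def fetch_actor(opts) when is_list(opts) do
    case Keyword.get(opts, :actor_uuid) do
      nil -> {:error, :missing_actor}
      uuid -> {:ok, uuid}
    end
  end

  # Runs `store_fun` only after the actor check passes, so nothing is
  # written to disk when the actor is missing.
  def with_actor(opts, store_fun) when is_function(store_fun, 1) do
    with {:ok, actor_uuid} <- fetch_actor(opts) do
      store_fun.(actor_uuid)
    end
  end
end
```

Checking the actor first keeps the failure cheap: the error is returned before any bytes are stored.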

# `group`

```elixir
@type group() :: %{
  pdf: PhoenixKitCatalogue.Schemas.Pdf.t(),
  total_matches: non_neg_integer(),
  hits: [hit()]
}
```

Per-PDF group returned by `search_pdfs_for_item/2`.

# `hit`

```elixir
@type hit() :: %{
  pdf: PhoenixKitCatalogue.Schemas.Pdf.t(),
  page_number: pos_integer(),
  snippet: String.t(),
  score: float()
}
```

One PDF search hit returned to the UI.

# `count_pdfs`

```elixir
@spec count_pdfs(keyword()) :: non_neg_integer()
```

Returns the number of PDFs that match the optional status filter.

# `create_pdf_from_upload`

```elixir
@spec create_pdf_from_upload(String.t(), String.t(), keyword()) ::
  {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, term()}
```

Stores an uploaded PDF.

`tmp_path` is the local file from `consume_uploaded_entry`'s callback.
`original_filename` is the user's chosen name. The third argument is a
keyword list of options and must include a non-nil `:actor_uuid`.

Flow:

  1. `Storage.store_file/2` (core) — handles SHA-256 dedup, on-disk
     placement, multi-bucket redundancy. Same content uploaded
     twice (any name) returns the same `file_uuid`.
  2. Upsert the per-file extraction row. If newly created, enqueue
     the worker — otherwise the previous extraction is reused.
  3. Always insert a fresh `phoenix_kit_cat_pdfs` row so each
     upload gets its own per-name entry in the library.
  4. Activity action: `pdf.uploaded`. Metadata flags
     `content_dedup: true` when the stored content already existed
     (a dedup hit on the file row).

Returns `{:ok, pdf}` on success.

The persisted `byte_size` is read from the file on disk via
`File.stat!/1` — never from a browser-supplied value — so the
recorded size always matches the actual stored bytes.
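
To make the dedup and size rules concrete, here is a minimal sketch using only the standard library. It assumes nothing about how core's `Storage.store_file/2` is implemented; `UploadDedupSketch` and its functions are hypothetical. It shows that two files with identical bytes but different names share a SHA-256 checksum, and that the size is read from disk.

```elixir
defmodule UploadDedupSketch do
  # Hex-encoded SHA-256 of a file's bytes. Reads the whole file into
  # memory, which is fine for a small illustration.
  def checksum(path) do
    path
    |> File.read!()
    |> then(&:crypto.hash(:sha256, &1))
    |> Base.encode16(case: :lower)
  end

  # Size taken from the file on disk, never from a client-supplied value.
  def byte_size_on_disk(path), do: File.stat!(path).size
end

dir = Path.join(System.tmp_dir!(), "pdf_dedup_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(dir)
a = Path.join(dir, "catalogue.pdf")
b = Path.join(dir, "renamed-copy.pdf")
File.write!(a, "%PDF-1.7 sample bytes")
File.write!(b, "%PDF-1.7 sample bytes")

# Same bytes under two names produce the same checksum.
same_content? = UploadDedupSketch.checksum(a) == UploadDedupSketch.checksum(b)
size = UploadDedupSketch.byte_size_on_disk(a)
File.rm_rf!(dir)
```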

# `get_extraction`

```elixir
@spec get_extraction(PhoenixKitCatalogue.Schemas.Pdf.t() | Ecto.UUID.t()) ::
  PhoenixKitCatalogue.Schemas.PdfExtraction.t() | nil
```

Returns the extraction state for a PDF (or its `file_uuid`), or
`nil` if the file has no extraction row yet.

# `get_pdf`

```elixir
@spec get_pdf(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t() | nil
```

Fetches a PDF by UUID. Returns `nil` if not found.

# `get_pdf!`

```elixir
@spec get_pdf!(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t()
```

Fetches a PDF by UUID. Raises `Ecto.NoResultsError` if not found.

# `list_pdfs`

```elixir
@spec list_pdfs(keyword()) :: [PhoenixKitCatalogue.Schemas.Pdf.t()]
```

Lists PDFs in the library, newest first.

## Options

  * `:status` — filter to a status string (`"active"` / `"trashed"`).
    Pass `nil` to include all. Defaults to `"active"`.
  * `:limit` (default 100), `:offset` (default 0)
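
The difference between omitting `:status` and passing `status: nil` can be sketched as follows. `ListOptsSketch.normalize/1` is a hypothetical helper that mirrors the documented defaults; whether the library reads its options this way is an assumption.

```elixir
defmodule ListOptsSketch do
  # Hypothetical normalizer mirroring the documented defaults.
  def normalize(opts \\ []) do
    %{
      # Keyword.get/3 falls back only when the key is absent, so an
      # explicit `status: nil` stays nil ("include all").
      status: Keyword.get(opts, :status, "active"),
      limit: Keyword.get(opts, :limit, 100),
      offset: Keyword.get(opts, :offset, 0)
    }
  end
end
```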

# `more_pdf_matches_for_item`

```elixir
@spec more_pdf_matches_for_item(
  PhoenixKitCatalogue.Schemas.Item.t(),
  Ecto.UUID.t(),
  keyword()
) :: [
  hit()
]
```

Loads additional hits for one PDF beyond what the initial grouped
search returned. Used by the modal's per-PDF "Show more matches"
expand action.

Returns a flat list of `hit()` ordered by `page_number ASC` (literal
search) or `similarity DESC` (when a `:trigram_query` opt is given).

## Options

  * `:offset` (default 0)
  * `:limit` (default 50)
  * `:trigram_query` — when set, score by `pg_trgm` similarity
    against this string (matches the trigram fallback's ordering).

# `permanently_delete_pdf`

```elixir
@spec permanently_delete_pdf(
  PhoenixKitCatalogue.Schemas.Pdf.t(),
  keyword()
) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
```

Permanently removes a `phoenix_kit_cat_pdfs` row.

When this is the last (active OR trashed) row referencing the
underlying `file_uuid`, hands the file off to `Storage.trash_file/1`
so core's daily `PruneTrashJob` deletes the binary, cascading to
the extraction and page rows.
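
The "last reference" rule can be sketched over plain maps. `LastReferenceSketch.delete/2` is hypothetical; it only models the decision of when the file would be handed to `Storage.trash_file/1`, and it assumes the given `uuid` matches exactly one row.

```elixir
defmodule LastReferenceSketch do
  # `rows` models phoenix_kit_cat_pdfs as maps with :uuid, :file_uuid
  # and :status. Returns the remaining rows plus the file decision.
  def delete(rows, uuid) do
    {[deleted], remaining} = Enum.split_with(rows, &(&1.uuid == uuid))

    # Trashed rows count as references too, so status is not checked.
    still_referenced? =
      Enum.any?(remaining, &(&1.file_uuid == deleted.file_uuid))

    {remaining, if(still_referenced?, do: :keep_file, else: :trash_file)}
  end
end
```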

# `prune_orphan_page_contents`

```elixir
@spec prune_orphan_page_contents() :: non_neg_integer()
```

Removes `phoenix_kit_cat_pdf_page_contents` rows that no
`phoenix_kit_cat_pdf_pages` row references anymore. Safe to call
any time.

Returns the number of rows removed. Suitable for wiring to a daily
Oban cron once the corpus is large enough to care.
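
A small in-memory model of the pruning rule, assuming pages point at content rows through a `content_hash` key (that key name is an assumption made for this sketch, as is the `OrphanPruneSketch` module):

```elixir
defmodule OrphanPruneSketch do
  # `contents` maps a content hash to page text; `pages` is a list of
  # maps that each reference one content hash.
  def prune(contents, pages) do
    referenced = pages |> Enum.map(& &1.content_hash) |> Enum.uniq()
    kept = Map.take(contents, referenced)

    # Second element mirrors the documented return: rows removed.
    {kept, map_size(contents) - map_size(kept)}
  end
end
```

Running it a second time on the pruned result removes nothing, which matches the note that the call is safe at any time.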

# `requeue_stuck_extractions`

```elixir
@spec requeue_stuck_extractions(keyword()) ::
  {:ok,
   %{
     requeued: non_neg_integer(),
     skipped: non_neg_integer(),
     failed: non_neg_integer()
   }}
```

Re-enqueues extraction for every PDF stuck in a non-terminal state.

The heal path for PDFs uploaded while the `:catalogue_pdf` queue was
unavailable (their jobs never ran) or orphaned `extracting` rows whose
worker died mid-run. The per-upload `enqueue_extraction/1` guard only
fires at upload time, so without this nothing ever re-drives those rows.

`pending` rows are always candidates, regardless of age (the live-job
de-dup described below still applies to them).
`extracting` rows are re-enqueued only when older than
`:stale_after_seconds` (default `900`) so an actively-running
extraction isn't double-processed.

Returns `{:ok, %{requeued: n, skipped: s, failed: m}}`:

  * `requeued` — rows whose extraction job was actually (re-)enqueued.
  * `skipped` — rows a live job already covers, so there was nothing to
    do (the app-level dedup). Reported separately so `requeued` can't
    claim credit for rows we didn't touch.
  * `failed` — rows whose enqueue was refused (e.g. the `:catalogue_pdf`
    queue is still not running, so they were marked `failed` with the
    actionable message instead).

The split keeps "re-queued N" honest when every enqueue actually failed
or was a no-op. Safe to call repeatedly (the worker is idempotent).

The whole selection is de-duped against live jobs in a single query and
enqueued with one `Oban.insert_all/1`, so a full `1000`-row
click is a handful of statements rather than ~2k per-row round-trips.

Capped at `1000` rows per call; re-run to process more.

## Options

  * `:stale_after_seconds` (default `900`) — minimum age of an
    `extracting` row before it's considered orphaned.
  * `:limit` (default `1000`) — max rows touched per call.
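
The selection and counting rules can be sketched in memory. Everything here is a model under stated assumptions: `RequeueSelectionSketch` is hypothetical, the age of an `extracting` row is read from an assumed `updated_at` field, the `:now` option exists only to make the example deterministic, and checking for a live job before checking the queue is a choice made for this sketch.

```elixir
defmodule RequeueSelectionSketch do
  # rows: maps with :file_uuid, :status and :updated_at (DateTime)
  # live: MapSet of file_uuids that already have a live job
  # queue_running?: whether the queue currently accepts jobs
  def run(rows, live, queue_running?, opts \\ []) do
    stale_after = Keyword.get(opts, :stale_after_seconds, 900)
    limit = Keyword.get(opts, :limit, 1000)
    now = Keyword.get(opts, :now, DateTime.utc_now())

    rows
    |> Enum.filter(&candidate?(&1, now, stale_after))
    |> Enum.take(limit)
    |> Enum.reduce(%{requeued: 0, skipped: 0, failed: 0}, fn row, acc ->
      key =
        cond do
          MapSet.member?(live, row.file_uuid) -> :skipped
          queue_running? -> :requeued
          true -> :failed
        end

      Map.update!(acc, key, &(&1 + 1))
    end)
    |> then(&{:ok, &1})
  end

  # Pending rows are candidates whatever their age.
  defp candidate?(%{status: "pending"}, _now, _stale_after), do: true

  # Extracting rows qualify only once they are older than the threshold.
  defp candidate?(%{status: "extracting", updated_at: at}, now, stale_after),
    do: DateTime.diff(now, at) > stale_after

  # Terminal rows are never touched.
  defp candidate?(_row, _now, _stale_after), do: false
end
```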

# `restore_pdf`

```elixir
@spec restore_pdf(
  PhoenixKitCatalogue.Schemas.Pdf.t(),
  keyword()
) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
```

Restores a trashed PDF back to active.

# `retry_extraction`

```elixir
@spec retry_extraction(
  PhoenixKitCatalogue.Schemas.Pdf.t() | Ecto.UUID.t(),
  keyword()
) :: {:ok, PhoenixKitCatalogue.Schemas.PdfExtraction.t()} | {:error, term()}
```

Retries text extraction for a single PDF.

Resets the extraction row to `pending` (clearing any prior
`error_message`) and re-enqueues the worker. Use for a `failed` row
(transient failure: queue was down, `pdftotext` hiccup) or one that
looks stuck in `pending` / `extracting`.

This is a **retry**, not a full re-extract: it does not delete existing
`pdf_pages` rows or clear `page_count` / `extracted_at`. The worker's
page inserts are upserts and `mark_extracted/2` overwrites `page_count`
on success, so a re-run self-heals. The admin UI only offers Retry on
`failed` rows (which carry no successful page data), so the distinction
rarely matters in practice.

The worker no-ops on a terminal status, so resetting to `pending`
first is what lets a `failed` row run again.

Returns:

  * `{:ok, extraction}` — reset + enqueued.
  * `{:error, :no_extraction}` — the file has no extraction row.
  * `{:error, :already_extracted}` — the row is already in a SUCCESS
    terminal (`extracted` / `scanned_no_text`). Refused so a stray
    caller can't reset a good extraction back to `pending` and drop the
    PDF out of search mid-run. Pass `force: true` to override (e.g. a
    deliberate re-extract after a normalizer change). The admin UI only
    offers Retry on `failed` rows, so this only bites a programmatic
    caller.
  * `{:error, reason}` — the enqueue guard refused (e.g.
    `:extraction_queue_unavailable` when the `:catalogue_pdf` queue
    still isn't running). The row is left `failed` with the
    actionable message in that case, exactly as on upload.

Accepts a `%Pdf{}` (the LV path) or a bare `file_uuid`.

## Options

  * `:force` (default `false`) — re-run even a success-terminal row.
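
The return cases above can be summarised as a small decision function. This is a sketch over plain maps, not the library's code: `RetryDecisionSketch` is hypothetical, and the enqueue step (with its `{:error, reason}` branch) is left out.

```elixir
defmodule RetryDecisionSketch do
  @success_terminal ["extracted", "scanned_no_text"]

  def retry(extraction, opts \\ [])

  # No extraction row exists for the file.
  def retry(nil, _opts), do: {:error, :no_extraction}

  def retry(%{status: status} = extraction, opts) do
    force? = Keyword.get(opts, :force, false)

    if status in @success_terminal and not force? do
      {:error, :already_extracted}
    else
      # Reset to pending and clear the prior error. page_count and
      # extracted_at stay as they are: a retry, not a re-extract.
      {:ok, %{extraction | status: "pending", error_message: nil}}
    end
  end
end
```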

# `search_pdfs_for_item`

```elixir
@spec search_pdfs_for_item(
  PhoenixKitCatalogue.Schemas.Item.t(),
  keyword()
) :: [group()]
```

Searches the PDF library for any active PDF whose pages match one of
the item's translated names.

Returns groups keyed by PDF, each with the **total match count for
that PDF** plus the first `:per_pdf` hits (default 5). Use
`more_pdf_matches_for_item/3` to load additional hits within one PDF
on demand (the "Show more matches" expand action).

Strategy:

  1. Build the title list from the item's primary name + every
     enabled language's translated name. Drop blanks and duplicates.
  2. Literal `ILIKE ANY` against the deduped page-content table —
     fast and precise. Joined to active `phoenix_kit_cat_pdfs` rows
     via `file_uuid`. Rows are window-ranked per PDF and
     window-counted per PDF in a single SQL pass; the outer query
     caps at `rn <= per_pdf` so the result is bounded by
     `per_pdf × distinct PDFs that match`.
  3. If literal returns nothing, fall back to a `pg_trgm` similarity
     search using the longest title (default threshold 0.4) — same
     grouping shape, best similarity first within each PDF.

Trashed PDFs are excluded. Groups are ordered newest-PDF-first.

## Options

  * `:per_pdf` (default 5) — preview hits returned per PDF.
  * `:similarity_threshold` (default 0.4) — trigram fallback threshold.
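
Step 1 and the grouping shape can be modelled in memory. The sketch below is hypothetical (`SearchGroupingSketch` is not part of the library) and makes a few assumptions: titles are trimmed before the blank check, de-duplication is case-sensitive, and "newest" is read from an assumed `inserted_at` field. The real search does the ranking and counting in a single SQL pass.

```elixir
defmodule SearchGroupingSketch do
  # Primary name plus translated names, with blanks and duplicates dropped.
  def titles(primary, translated) do
    [primary | translated]
    |> Enum.map(&String.trim(&1 || ""))
    |> Enum.reject(&(&1 == ""))
    |> Enum.uniq()
  end

  # `hits` is a flat list of maps with :pdf and :page_number.
  # Returns maps shaped like group(), newest PDF first.
  def group(hits, per_pdf \\ 5) do
    hits
    |> Enum.group_by(& &1.pdf)
    |> Enum.map(fn {pdf, pdf_hits} ->
      %{
        pdf: pdf,
        # Counted before the per-PDF cap is applied.
        total_matches: length(pdf_hits),
        hits: pdf_hits |> Enum.sort_by(& &1.page_number) |> Enum.take(per_pdf)
      }
    end)
    |> Enum.sort_by(& &1.pdf.inserted_at, {:desc, DateTime})
  end
end
```

Because `total_matches` is counted before the cap, a caller can tell when `more_pdf_matches_for_item/3` has further hits to load for a given PDF.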

# `trash_pdf`

```elixir
@spec trash_pdf(
  PhoenixKitCatalogue.Schemas.Pdf.t(),
  keyword()
) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
```

Soft-deletes a PDF: flips status to `"trashed"` and records
`trashed_at`. Underlying file + extraction + page rows untouched
(other live PDF entries may still reference them).

---

*Consult [api-reference.md](api-reference.md) for complete listing*