Wizard Extract

WizardExtract Banner
PyPI - Version PyPI - Downloads/month License

WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With Azure prebuilt-layout it can also return tables and key-value pairs.

Installation

Requires Python 3.9+.

pip install wizardextract

Optional extras:

# Azure OCR
pip install "wizardextract[azure]"

Note

For OCR, install Tesseract.

Quick start

import wizardextract as we

text = we.extract_text("example.pdf")
print(text)

API overview

Method

Purpose

extract_text

Local text extraction with optional Tesseract OCR

extract_text_azure

Cloud extraction via Azure (text, tables, key-value)

Text extraction

Parameters

  • input_data: str | bytes | Path

  • extension: Required only if input_data is bytes.

  • pages: Page/sheet selection.

    • Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]

    • Excel (XLSX/XLS): sheet index (int), name (str), or mixed list

  • ocr: Enable Tesseract OCR for images and scanned PDFs/DOCX.

  • language_ocr: OCR language, default "eng".

Examples

Basic:

import wizardextract as we
txt = we.extract_text("docs/report.pdf")
print(txt)

From bytes:

from pathlib import Path
import wizardextract as we

raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")
print(txt_img)

Paged selection and OCR:

import wizardextract as we

sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")
print(sel); print(ocr_txt)

Supported Formats

Format

OCR

PDF

Optional

DOC

No

DOCX

Optional

XLSX

No

XLS

No

TXT

No

CSV

No

JSON

No

HTML

No

HTM

No

TIF

Default

TIFF

Default

JPG

Default

JPEG

Default

PNG

Default

GIF

Default

Azure OCR

Parameters

  • input_data: str | bytes | Path

  • extension: File extension when bytes are passed.

  • language_ocr: OCR language code (ISO-639).

  • pages: Page selection (int, "1,3,5-7", or list).

  • azure_endpoint: Azure Document Intelligence endpoint URL.

  • azure_key: Azure API key.

  • azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).

  • hybrid: If True, for PDFs: native text for text pages and OCR for raster pages.

Example

import wizardextract as we

res = we.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)

print(res.text)
print(res.pretty_tables)
print(res.key_value)

License

AGPL-3.0-or-later.

Resources

Contact & Author

Author:

Mattia Rubino

Email:

textwizard.dev@gmail.com