Wizard Extract
WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With Azure prebuilt-layout it can also return tables and key-value pairs.
Installation
Requires Python 3.9+.
pip install wizardextract
Optional extras:
# Azure OCR
pip install "wizardextract[azure]"
Note
Local OCR requires Tesseract to be installed on the system.
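Before enabling OCR it can help to confirm that Tesseract is actually reachable. The following is a minimal sketch that assumes WizardExtract finds the `tesseract` executable through `PATH` (the documentation above does not say how it is located); `tesseract_available` is a hypothetical helper, not part of the library.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on PATH.

    Assumption: the OCR backend is looked up via PATH. If your setup
    points at Tesseract some other way, this check may be misleading.
    """
    return shutil.which(binary) is not None


# Example use (not executed here):
#   if not tesseract_available():
#       raise SystemExit("Install Tesseract before calling extract_text(..., ocr=True)")
```

Failing early with a clear message is usually easier to debug than an OCR error raised deep inside an extraction call.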
Quick start
import wizardextract as we
text = we.extract_text("example.pdf")
print(text)
API overview
| Method | Purpose |
|---|---|
| `extract_text` | Local text extraction with optional Tesseract OCR |
| `extract_text_azure` | Cloud extraction via Azure (text, tables, key-value) |
Text extraction
Parameters
- `input_data`: `str | bytes | Path`
- `extension`: Required only if `input_data` is `bytes`.
- `pages`: Page/sheet selection.
  - Paged (PDF, DOCX, TIFF): `1`, `"1-3"`, `[1, 3, "5-8"]`
  - Excel (XLSX/XLS): sheet index (`int`), name (`str`), or mixed list
- `ocr`: Enable Tesseract OCR for images and scanned PDFs/DOCX.
- `language_ocr`: OCR language, default `"eng"`.
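To make the `pages` forms concrete, here is a rough sketch of how a selector such as `1`, `"1-3"`, or `[1, 3, "5-8"]` could be expanded into explicit page numbers. `expand_pages` is a hypothetical helper written for illustration; it is not WizardExtract's parser, and it assumes pages are 1-based and ranges are inclusive. It also accepts comma-separated strings like `"1,3,5-7"`, the form listed for the Azure function.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand a page selector into a sorted list of unique page numbers."""
    # A bare int or str is treated as a one-element selection.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # Strings may hold comma-separated parts: single pages or ranges.
        for part in item.split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                start, end = (int(x) for x in part.split("-", 1))
                if start > end:
                    raise ValueError(f"invalid range: {part!r}")
                pages.update(range(start, end + 1))  # inclusive range
            else:
                pages.add(int(part))
    # Assumption: page numbering starts at 1.
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are assumed to be 1-based")
    return sorted(pages)
```

For example, `expand_pages([1, 3, "5-8"])` yields `[1, 3, 5, 6, 7, 8]`. Sheet names for Excel files are outside the scope of this sketch.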
Examples
Basic:
import wizardextract as we
txt = we.extract_text("docs/report.pdf")
print(txt)
From bytes:
from pathlib import Path
import wizardextract as we
raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")
print(txt_img)
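When the bytes come from an upload or a database rather than a known file, the `extension` value often has to be derived from a stored filename. The sketch below shows one way to do that with the standard library; `extension_from_name` is a hypothetical helper, and it assumes the extension is passed without a leading dot, as in the `"png"` example above.

```python
from pathlib import Path


def extension_from_name(filename: str) -> str:
    """Return the lowercase extension of a filename, without the dot."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if not ext:
        # Without an extension the format of raw bytes cannot be inferred here.
        raise ValueError(f"cannot infer extension from {filename!r}")
    return ext


# Example use (not executed here):
#   txt = we.extract_text(raw, extension=extension_from_name("Scan.PNG"))
```

Only the last suffix is used, so a name like `archive.tar.gz` gives `gz`.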
Paged selection and OCR:
import wizardextract as we
sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")
print(sel)
print(ocr_txt)
Supported Formats
| Format | OCR |
|---|---|
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG | Default |
| JPEG | Default |
| PNG | Default |
| GIF | Default |
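The table can be mirrored in a small lookup so that calling code knows up front whether OCR applies to a file. This is only a sketch: `OCR_SUPPORT` and `ocr_mode` are hypothetical names, the `pdf` entry is inferred from the `ocr` parameter description (scanned PDFs), and reading "Default" as "OCR is applied without asking" is an interpretation rather than documented behavior.

```python
from pathlib import Path

# Extension -> OCR behavior, transcribed from the Supported Formats table.
OCR_SUPPORT = {
    "pdf": "optional",  # inferred from the `ocr` parameter description
    "doc": "no",
    "docx": "optional",
    "xlsx": "no",
    "xls": "no",
    "txt": "no",
    "csv": "no",
    "json": "no",
    "html": "no",
    "htm": "no",
    "tif": "default",
    "tiff": "default",
    "jpg": "default",
    "jpeg": "default",
    "png": "default",
    "gif": "default",
}


def ocr_mode(filename: str) -> str:
    """Return 'optional', 'default', or 'no' for a file, based on its extension."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in OCR_SUPPORT:
        raise ValueError(f"unsupported format: {ext!r}")
    return OCR_SUPPORT[ext]
```

A caller might pass `ocr=True` only when `ocr_mode(path) == "optional"` and the document is known to be scanned.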
Azure OCR
Parameters
- `input_data`: `str | bytes | Path`
- `extension`: File extension when `bytes` are passed.
- `language_ocr`: OCR language code (ISO-639).
- `pages`: Page selection (`int`, `"1,3,5-7"`, or list).
- `azure_endpoint`: Azure Document Intelligence endpoint URL.
- `azure_key`: Azure API key.
- `azure_model_id`: `"prebuilt-read"` (text only) or `"prebuilt-layout"` (text + tables + key-value).
- `hybrid`: If `True`, for PDFs: native text for text pages and OCR for raster pages.
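The `hybrid` option describes a per-page decision: use the embedded text where a page has it, and fall back to OCR where the page is only an image. How WizardExtract makes that decision is not documented here, so the following is purely an illustrative sketch. `plan_hybrid` and its `min_chars` threshold are hypothetical, and the heuristic (count of non-whitespace characters in the native text layer) is an assumption.

```python
from typing import List


def plan_hybrid(native_text_per_page: List[str], min_chars: int = 20) -> List[str]:
    """Label each page 'native' or 'ocr' from the amount of embedded text.

    native_text_per_page holds the text layer already read from each page;
    pages with fewer than min_chars non-whitespace characters are treated
    as raster pages that need OCR.
    """
    plan = []
    for text in native_text_per_page:
        visible = "".join(text.split())  # drop all whitespace
        plan.append("native" if len(visible) >= min_chars else "ocr")
    return plan
```

With this kind of routing, only the raster pages would need to be sent to the OCR service, which is presumably the point of the hybrid mode.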
Example
import wizardextract as we
res = we.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)
print(res.text)
print(res.pretty_tables)
print(res.key_value)
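In practice the endpoint and key are usually kept out of source code. The sketch below reads them from environment variables and returns keyword arguments matching the `azure_endpoint` and `azure_key` parameters above. The variable names `AZURE_DI_ENDPOINT` and `AZURE_DI_KEY` and the helper `azure_settings` are hypothetical choices for this example, not names the library defines.

```python
import os
from typing import Dict, Mapping


def azure_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build azure_endpoint/azure_key keyword arguments from the environment."""
    try:
        endpoint = env["AZURE_DI_ENDPOINT"]  # hypothetical variable name
        key = env["AZURE_DI_KEY"]            # hypothetical variable name
    except KeyError as missing:
        raise RuntimeError(
            f"missing environment variable: {missing.args[0]}"
        ) from None
    return {"azure_endpoint": endpoint, "azure_key": key}


# Example use (not executed here):
#   res = we.extract_text_azure(
#       "invoice.pdf",
#       azure_model_id="prebuilt-layout",
#       **azure_settings(),
#   )
```

Passing a plain dictionary as `env` makes the helper easy to exercise in tests without touching the real environment.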