Parser - pdf
Overview
The PDF document parser is an implementation of the Document Parser interface that extracts the content of a PDF file as plain text. It follows the Eino: Document Parser Interface Guide and is used for:
- converting PDF documents into plain text
- splitting PDF content by pages
Features
The PDF parser provides:
- basic text extraction
- optional page‑based splitting
- automatic handling of fonts and encodings
- support for multi‑page PDFs
Notes:
- may not fully support all PDF formats
- does not preserve whitespace/newline formatting
- complex layouts may affect extraction quality
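Because whitespace and newlines are not preserved, downstream code may want to normalize spacing in the extracted text before indexing or displaying it. The following is a minimal sketch using only the Go standard library. The helper name `normalizeWhitespace` and the sample string are hypothetical and not part of the parser package.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeWhitespace collapses every run of whitespace (spaces, tabs,
// newlines) into a single space and trims both ends.
// Hypothetical helper, not part of the pdf parser package.
func normalizeWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	// Hypothetical extracted text with irregular spacing.
	raw := "Chapter 1\n\n  Introduction   to\tPDF parsing \n"
	fmt.Println(normalizeWhitespace(raw))
}
```

The same helper could be applied to `doc.Content` for each document returned by the parser, if uniform spacing matters for the use case.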
Usage
Initialization
Initialize via NewPDFParser with configuration:
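import (
	"context"

	"github.com/cloudwego/eino-ext/components/document/parser/pdf"
)

func main() {
	ctx := context.Background()

	parser, err := pdf.NewPDFParser(ctx, &pdf.Config{
		ToPages: true, // split by pages
	})
	if err != nil {
		panic(err)
	}
	_ = parser // ready to call parser.Parse
}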
Config:
- ToPages: whether to split the PDF into multiple documents, one per page. Default: false.
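The effect of ToPages can be pictured with a small standalone model. This is a sketch of the behavior described above, not the library's implementation: the `document` type, the `splitPages` function, and the choice to join pages with a newline are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// document is a hypothetical stand-in for the parser's output type; only the
// Content field mirrors what the example in this guide reads.
type document struct {
	Content string
}

// splitPages models the ToPages switch: one document per page when toPages is
// true, otherwise a single document holding the text of all pages.
// Joining pages with "\n" is an assumption of this sketch.
func splitPages(pages []string, toPages bool) []document {
	if !toPages {
		return []document{{Content: strings.Join(pages, "\n")}}
	}
	docs := make([]document, 0, len(pages))
	for _, p := range pages {
		docs = append(docs, document{Content: p})
	}
	return docs
}

func main() {
	pages := []string{"page one text", "page two text"}
	fmt.Println(len(splitPages(pages, false))) // single combined document
	fmt.Println(len(splitPages(pages, true)))  // one document per page
}
```

Splitting by page tends to suit retrieval workloads where each result should point at a specific page, while a single document is simpler when the whole text is processed at once.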
Parse Documents
docs, err := parser.Parse(ctx, reader, opts...)
Options:
- parser.WithURI: sets the document URI
- parser.WithExtraMeta: adds extra metadata
Complete Example
Basic Usage
package main

import (
	"context"
	"os"

	"github.com/cloudwego/eino-ext/components/document/parser/pdf"
	"github.com/cloudwego/eino/components/document/parser"
)

func main() {
	ctx := context.Background()

	// init parser
	p, err := pdf.NewPDFParser(ctx, &pdf.Config{ToPages: false})
	if err != nil {
		panic(err)
	}

	// open file
	file, err := os.Open("document.pdf")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// parse
	docs, err := p.Parse(ctx, file,
		parser.WithURI("document.pdf"),
		parser.WithExtraMeta(map[string]any{"source": "./document.pdf"}),
	)
	if err != nil {
		panic(err)
	}

	// print the extracted text of each document
	for _, doc := range docs {
		println(doc.Content)
	}
}
In loader
See the examples in the Eino: Document Parser Interface Guide.
Last modified
December 16, 2025