Parser - html
Overview
The HTML document parser is an implementation of the Document Parser interface that parses HTML page content into plain text. It follows the Eino: Document Parser Interface Guide and is used for:
- extracting plain text from web pages
- retrieving page metadata (title, description, etc.)
Features
The HTML parser provides:
- selective content extraction with flexible selectors (goquery selector syntax)
- automatic metadata extraction
- safe HTML parsing
Usage
Initialization
Initialize via NewParser with configuration:
import (
"github.com/cloudwego/eino-ext/components/document/parser/html"
)
selector := "body" // goquery selector for the region to extract
parser, err := html.NewParser(ctx, &html.Config{
Selector: &selector, // optional: content selector, defaults to body
})
Config:
Selector: optional, the region to extract, in goquery selector syntax, e.g.:
- body extracts the <body> content
- #content extracts the element with id "content"
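Selector is a pointer, so an unset value can be told apart from an explicit one. The sketch below shows that optional-with-default pattern in isolation. selectorOrDefault is a hypothetical helper written for this page, not part of the eino-ext API, and the library's internal handling may differ.

```go
package main

import "fmt"

// selectorOrDefault is a hypothetical helper (not part of eino-ext) that
// mirrors the documented behavior: a nil Selector falls back to "body".
func selectorOrDefault(selector *string) string {
	if selector == nil {
		return "body"
	}
	return *selector
}

func main() {
	custom := "#content"
	fmt.Println(selectorOrDefault(nil))     // body
	fmt.Println(selectorOrDefault(&custom)) // #content
}
```

Passing nil as the whole config, as the Basic Usage example does, likewise relies on the default.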
Metadata Keys
Parser auto‑extracts:
- html.MetaKeyTitle ("_title"): page title
- html.MetaKeyDesc ("_description"): page description
- html.MetaKeyLang ("_language"): page language
- html.MetaKeyCharset ("_charset"): charset
- html.MetaKeySource ("_source"): document source URI
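Because metadata is carried in a map of untyped values, a page without a title or description may simply have no entry for that key. The following is a minimal sketch of reading such values defensively. It assumes the values are stored as strings; metaString and the local constants are hypothetical stand-ins so the sketch runs without importing the parser package.

```go
package main

import "fmt"

// Local copies of the documented key values, used only so this sketch is
// self-contained. In real code, use html.MetaKeyTitle, html.MetaKeySource, etc.
const (
	metaKeyTitle  = "_title"
	metaKeyDesc   = "_description"
	metaKeySource = "_source"
)

// metaString is a hypothetical helper: it returns the value stored under key
// if it is present and is a string, and fallback otherwise.
func metaString(meta map[string]any, key, fallback string) string {
	if v, ok := meta[key].(string); ok {
		return v
	}
	return fallback
}

func main() {
	// Stand-in for doc.MetaData as returned by the parser (assumed shape).
	meta := map[string]any{
		metaKeyTitle:  "Sample Page",
		metaKeySource: "https://example.com",
	}
	fmt.Println(metaString(meta, metaKeyTitle, "(untitled)"))
	fmt.Println(metaString(meta, metaKeyDesc, "(no description)"))
}
```

The comma-ok type assertion avoids a panic when a key is missing or holds a non-string value.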
Complete Example
Basic Usage
package main
import (
"context"
"strings"
"github.com/cloudwego/eino-ext/components/document/parser/html"
"github.com/cloudwego/eino/components/document/parser"
)
func main() {
ctx := context.Background()
// init parser
p, err := html.NewParser(ctx, nil) // default config
if err != nil {
panic(err)
}
// HTML content
htmlContent := `
<html lang="en">
<head>
<title>Sample Page</title>
<meta name="description" content="This is a sample page">
<meta charset="UTF-8">
</head>
<body>
<div id="content">
<h1>Welcome</h1>
<p>Main body.</p>
</div>
</body>
</html>
`
// parse
docs, err := p.Parse(ctx, strings.NewReader(htmlContent),
parser.WithURI("https://example.com"),
parser.WithExtraMeta(map[string]any{
"custom": "value",
}),
)
if err != nil {
panic(err)
}
doc := docs[0]
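println("content:", doc.Content)
// MetaData values have type any; assert to string before printing,
// because the builtin println does not format interface values as text.
title, _ := doc.MetaData[html.MetaKeyTitle].(string)
desc, _ := doc.MetaData[html.MetaKeyDesc].(string)
lang, _ := doc.MetaData[html.MetaKeyLang].(string)
println("title:", title)
println("desc:", desc)
println("lang:", lang)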
}
Using Selectors
package main
import (
"context"
"github.com/cloudwego/eino-ext/components/document/parser/html"
)
func main() {
ctx := context.Background()
// only extract element with id content
selector := "#content"
p, err := html.NewParser(ctx, &html.Config{ Selector: &selector })
if err != nil {
panic(err)
}
// ... parsing code ...
}
In loader
See examples in Eino: Document Parser Interface Guide
Last modified: December 16, 2025