html

v1.0.5 · Published: Jan 14, 2026 · License: MIT · Imports: 15 · Imported by: 0

README

HTML Library


A Go library for intelligent HTML content extraction. Compatible with golang.org/x/net/html: use it as a drop-in replacement and get enhanced content-extraction features on top.

📖 中文文档 (Chinese documentation) - User guide

✨ Core Features

🎯 Content Extraction
  • Article Detection: Identifies main content using scoring algorithms (text density, link density, semantic tags)
  • Smart Text Extraction: Preserves structure, handles newlines, calculates word count and reading time
  • Media Extraction: Images, videos, audio with metadata (URL, dimensions, alt text, type detection)
  • Link Analysis: External/internal detection, nofollow attributes, anchor text extraction
⚡ Performance
  • Content-Addressable Caching: SHA256-based keys with TTL and LRU eviction
  • Batch Processing: Parallel extraction with configurable worker pools
  • Thread-Safe: Concurrent use without external synchronization
  • Resource Limits: Configurable input size, nesting depth, and timeout protection
📖 Use Cases
  • 📰 News Aggregators: Extract article content from news sites
  • 🤖 Web Scrapers: Get structured data from HTML pages
  • 📝 Content Management: Convert HTML to Markdown or other formats
  • 🔍 Search Engines: Index main content without navigation/ads
  • 📊 Data Analysis: Extract and analyze web content at scale
  • 📱 RSS/Feed Generators: Create feeds from HTML content
  • 🎓 Documentation Tools: Convert HTML docs to other formats
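The scoring approach behind Article Detection can be pictured with a small stdlib-only sketch. The `candidate` struct and the formula below are illustrative assumptions about how text-density/link-density scoring generally works, not this library's internals:

```go
package main

import "fmt"

// candidate summarizes one DOM subtree considered as the main article.
type candidate struct {
	TextLen     int // characters of visible text in the subtree
	LinkTextLen int // characters of visible text inside <a> elements
}

// score rewards long text and penalizes link-heavy regions; navigation
// bars and footers have link density near 1.0 and score close to zero.
func score(c candidate) float64 {
	if c.TextLen == 0 {
		return 0
	}
	linkDensity := float64(c.LinkTextLen) / float64(c.TextLen)
	return float64(c.TextLen) * (1 - linkDensity)
}

func main() {
	article := candidate{TextLen: 1200, LinkTextLen: 60}
	navbar := candidate{TextLen: 80, LinkTextLen: 76}
	fmt.Println(score(article) > score(navbar)) // true: the article wins
}
```

Real detectors also weight semantic tags such as <article> and <main>, as the feature list above notes.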

📦 Installation

go get github.com/cybergodev/html

⚡ 5-Minute Quick Start

import "github.com/cybergodev/html"

// Extract clean text from HTML
text, _ := html.ExtractText(`
    <html>
        <nav>Navigation</nav>
        <article><h1>Hello World</h1><p>Content here...</p></article>
        <footer>Footer</footer>
    </html>
`)
fmt.Println(text) // "Hello World\nContent here..."

That's it! The library automatically:

  • Removes navigation, footers, ads
  • Extracts main content
  • Cleans up whitespace

🚀 Quick Guide

One-Liner Functions

Just want to get something done? Use these package-level functions:

// Extract text only
text, _ := html.ExtractText(htmlContent)

// Extract everything
result, _ := html.Extract(htmlContent)
fmt.Println(result.Title)     // Hello World
fmt.Println(result.Text)      // Content here...
fmt.Println(result.WordCount) // 4

// Extract only specific elements
title, err := html.ExtractTitle(htmlContent)
images, err := html.ExtractImages(htmlContent)
links, err := html.ExtractLinks(htmlContent)

// Convert formats
markdown, err := html.ExtractToMarkdown(htmlContent)
jsonData, err := html.ExtractToJSON(htmlContent)

// Content analysis
wordCount, err := html.GetWordCount(htmlContent)
readingTime, err := html.GetReadingTime(htmlContent)
summary, err := html.Summarize(htmlContent, 50) // max 50 words

When to use: Simple scripts, one-off tasks, quick prototyping
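GetWordCount and GetReadingTime boil down to simple arithmetic on the extracted text. A stdlib-only sketch of that arithmetic (the 200 words-per-minute constant is a common convention and an assumption here, not the library's documented value):

```go
package main

import (
	"fmt"
	"strings"
)

// wordsPerMinute is a conventional silent-reading speed.
const wordsPerMinute = 200.0

// readingStats counts whitespace-separated words and converts the
// count into an estimated reading time in minutes.
func readingStats(text string) (words int, minutes float64) {
	words = len(strings.Fields(text))
	return words, float64(words) / wordsPerMinute
}

func main() {
	words, minutes := readingStats("Hello World Content here...")
	fmt.Printf("%d words, %.3f min\n", words, minutes) // 4 words, 0.020 min
}
```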


Basic Processor Usage

Need more control? Create a processor:

processor := html.NewWithDefaults()
defer processor.Close()

// Extract with defaults
result, err := processor.ExtractWithDefaults(htmlContent)

// Extract from file
result, err = processor.ExtractFromFile("page.html", html.DefaultExtractConfig())

// Batch processing
htmlContents := []string{html1, html2, html3}
results, err := processor.ExtractBatch(htmlContents, html.DefaultExtractConfig())

When to use: Multiple extractions, processing many files, web scrapers


Custom Configuration

Fine-tune what gets extracted:

config := html.ExtractConfig{
    ExtractArticle:    true,   // Auto-detect main content
    PreserveImages:    true,   // Extract image metadata
    PreserveLinks:     true,   // Extract link metadata
    PreserveVideos:    false,  // Skip videos
    PreserveAudios:    false,  // Skip audio
    InlineImageFormat: "none", // Options: "none", "placeholder", "markdown", "html"
}

processor := html.NewWithDefaults()
defer processor.Close()

result, err := processor.Extract(htmlContent, config)

When to use: Specific extraction needs, format conversion, custom output


Advanced Features
Custom Processor Configuration
config := html.Config{
    MaxInputSize:       10 * 1024 * 1024, // 10MB limit
    ProcessingTimeout:  30 * time.Second,
    MaxCacheEntries:    500,
    CacheTTL:           30 * time.Minute,
    WorkerPoolSize:     8,
    EnableSanitization: true,  // Remove <script>, <style> tags
    MaxDepth:           50,    // Prevent deep nesting attacks
}

processor, err := html.New(config)
defer processor.Close()
Link Extraction
// Extract all resource links
links, err := html.ExtractAllLinks(htmlContent)

// Group by type
byType := html.GroupLinksByType(links)
cssLinks := byType["css"]
jsLinks := byType["js"]
images := byType["image"]

// Advanced configuration
processor := html.NewWithDefaults()
linkConfig := html.LinkExtractionConfig{
    BaseURL:              "https://example.com",
    ResolveRelativeURLs:  true,
    IncludeImages:        true,
    IncludeVideos:        true,
    IncludeCSS:           true,
    IncludeJS:            true,
}
links, err = processor.ExtractAllLinks(htmlContent, linkConfig)
Caching & Statistics
processor := html.NewWithDefaults()
defer processor.Close()

// Automatic caching enabled
result1, err := processor.ExtractWithDefaults(htmlContent)
result2, err := processor.ExtractWithDefaults(htmlContent) // Cache hit!

// Check performance
stats := processor.GetStatistics()
fmt.Printf("Cache hits: %d/%d\n", stats.CacheHits, stats.TotalProcessed)

// Clear cache if needed
processor.ClearCache()
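Content-addressable caching means the cache key is derived from the content itself. A sketch of SHA256-based key derivation using only the standard library (the exact key layout, including how the extraction config is mixed in, is an assumption):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes the extraction options together with the HTML, so
// identical input always maps to the same cache entry.
func cacheKey(htmlContent, configFingerprint string) string {
	h := sha256.Sum256([]byte(configFingerprint + "\x00" + htmlContent))
	return hex.EncodeToString(h[:])
}

func main() {
	k1 := cacheKey("<p>hi</p>", "defaults")
	k2 := cacheKey("<p>hi</p>", "defaults")
	fmt.Println(len(k1), k1 == k2) // 64 true
}
```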
Configuration Presets
processor := html.NewWithDefaults()
defer processor.Close()

// RSS feed generation
result, err := processor.Extract(htmlContent, html.ConfigForRSS())

// Summary generation (text only)
result, err = processor.Extract(htmlContent, html.ConfigForSummary())

// Search indexing (all metadata)
result, err = processor.Extract(htmlContent, html.ConfigForSearchIndex())

// Markdown output
result, err = processor.Extract(htmlContent, html.ConfigForMarkdown())

When to use: Production applications, performance optimization, specific use cases


📖 Common Recipes

Copy-paste solutions for common tasks:

Extract Article Text (Clean)
text, err := html.ExtractText(htmlContent)
// Returns clean text without navigation/ads
Extract with Images
result, err := html.Extract(htmlContent)
for _, img := range result.Images {
    fmt.Printf("Image: %s (alt: %s)\n", img.URL, img.Alt)
}
Convert to Markdown
markdown, err := html.ExtractToMarkdown(htmlContent)
// Images become: ![alt](url)
Extract All Links
links, err := html.ExtractAllLinks(htmlContent)
for _, link := range links {
    fmt.Printf("%s: %s\n", link.Type, link.URL)
}
Get Reading Time
minutes, err := html.GetReadingTime(htmlContent)
fmt.Printf("Reading time: %.1f min", minutes)
Batch Process Files
processor := html.NewWithDefaults()
defer processor.Close()

files := []string{"page1.html", "page2.html", "page3.html"}
results, err := processor.ExtractBatchFiles(files, html.DefaultExtractConfig())
Create RSS Feed Content
processor := html.NewWithDefaults()
defer processor.Close()

result, err := processor.Extract(htmlContent, html.ConfigForRSS())
// Optimized for RSS: fast, includes images/links, no article detection

🔧 API Quick Reference

Package-Level Functions
// Extraction
Extract(htmlContent string) (*Result, error)
ExtractText(htmlContent string) (string, error)
ExtractFromFile(path string) (*Result, error)

// Format Conversion
ExtractToMarkdown(htmlContent string) (string, error)
ExtractToJSON(htmlContent string) ([]byte, error)

// Specific Elements
ExtractTitle(htmlContent string) (string, error)
ExtractImages(htmlContent string) ([]ImageInfo, error)
ExtractVideos(htmlContent string) ([]VideoInfo, error)
ExtractAudios(htmlContent string) ([]AudioInfo, error)
ExtractLinks(htmlContent string) ([]LinkInfo, error)
ExtractWithTitle(htmlContent string) (string, string, error)

// Analysis
GetWordCount(htmlContent string) (int, error)
GetReadingTime(htmlContent string) (float64, error)
Summarize(htmlContent string, maxWords int) (string, error)
ExtractAndClean(htmlContent string) (string, error)

// Links
ExtractAllLinks(htmlContent string, configs ...LinkExtractionConfig) ([]LinkResource, error)
GroupLinksByType(links []LinkResource) map[string][]LinkResource
Processor Methods
// Creation
NewWithDefaults() *Processor
New(config Config) (*Processor, error)
processor.Close()

// Extraction
processor.Extract(htmlContent string, config ExtractConfig) (*Result, error)
processor.ExtractWithDefaults(htmlContent string) (*Result, error)
processor.ExtractFromFile(path string, config ExtractConfig) (*Result, error)

// Batch
processor.ExtractBatch(contents []string, config ExtractConfig) ([]*Result, error)
processor.ExtractBatchFiles(paths []string, config ExtractConfig) ([]*Result, error)

// Links
processor.ExtractAllLinks(htmlContent string, config LinkExtractionConfig) ([]LinkResource, error)

// Monitoring
processor.GetStatistics() Statistics
processor.ClearCache()
Configuration Presets
DefaultExtractConfig()        ExtractConfig
ConfigForRSS()                ExtractConfig
ConfigForSummary()            ExtractConfig
ConfigForSearchIndex()        ExtractConfig
ConfigForMarkdown()           ExtractConfig
DefaultLinkExtractionConfig() LinkExtractionConfig

Result Structure

type Result struct {
    Text           string        // Clean text content
    Title          string        // Page/article title
    Images         []ImageInfo   // Image metadata
    Links          []LinkInfo    // Link metadata
    Videos         []VideoInfo   // Video metadata
    Audios         []AudioInfo   // Audio metadata
    WordCount      int           // Total words
    ReadingTime    time.Duration // Estimated reading time
    ProcessingTime time.Duration // Time taken
}

type ImageInfo struct {
    URL          string  // Image URL
    Alt          string  // Alt text
    Title        string  // Title attribute
    Width        string  // Width attribute
    Height       string  // Height attribute
    IsDecorative bool    // No alt text
}

type LinkInfo struct {
    URL        string  // Link URL
    Text       string  // Anchor text
    IsExternal bool    // External domain
    IsNoFollow bool    // rel="nofollow"
}
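As a small example of consuming these structures, here is a stdlib-only helper that keeps external links search engines may follow. LinkInfo is redeclared locally so the sketch compiles standalone; in real code you would use the slice returned by html.ExtractLinks:

```go
package main

import "fmt"

// LinkInfo mirrors the struct documented above.
type LinkInfo struct {
	URL        string
	Text       string
	IsExternal bool
	IsNoFollow bool
}

// followableExternal keeps links that point off-site and are not
// marked rel="nofollow".
func followableExternal(links []LinkInfo) []LinkInfo {
	var out []LinkInfo
	for _, l := range links {
		if l.IsExternal && !l.IsNoFollow {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	links := []LinkInfo{
		{URL: "/about"},
		{URL: "https://example.org", IsExternal: true},
		{URL: "https://ads.example.net", IsExternal: true, IsNoFollow: true},
	}
	for _, l := range followableExternal(links) {
		fmt.Println(l.URL) // https://example.org
	}
}
```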

Examples

See examples/ directory for complete, runnable code:

Example                      Description
01_quick_start.go            Quick start with one-liners
02_content_extraction.go     Content extraction basics
03_link_extraction.go        Link extraction patterns
04_media_extraction.go       Media (images/videos/audio)
04_advanced_features.go      Advanced features & compatibility
05_advanced_usage.go         Batch processing & performance
06_compatibility.go          golang.org/x/net/html compatibility
07_convenience_api.go        Package-level convenience API

Compatibility

This library is a drop-in replacement for golang.org/x/net/html:

// Just change the import
- import "golang.org/x/net/html"
+ import "github.com/cybergodev/html"

// All existing code works
doc, err := html.Parse(reader)
html.Render(writer, doc)
escaped := html.EscapeString("<script>")

See COMPATIBILITY.md for details.


Thread Safety

The Processor is safe for concurrent use:

processor := html.NewWithDefaults()
defer processor.Close()

// Safe to use from multiple goroutines
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        processor.ExtractWithDefaults(htmlContent)
    }()
}
wg.Wait()

🤝 Contributing

Contributions, issue reports, and suggestions are welcome!

📄 License

MIT License - See LICENSE file for details.


Crafted with care for the Go community ❤️ | If this project helps you, please give it a ⭐️ Star!

Documentation

Index

Constants

const (
	ErrorNode    = html.ErrorNode
	TextNode     = html.TextNode
	DocumentNode = html.DocumentNode
	ElementNode  = html.ElementNode
	CommentNode  = html.CommentNode
	DoctypeNode  = html.DoctypeNode

	ErrorToken          = html.ErrorToken
	TextToken           = html.TextToken
	StartTagToken       = html.StartTagToken
	EndTagToken         = html.EndTagToken
	SelfClosingTagToken = html.SelfClosingTagToken
	CommentToken        = html.CommentToken
	DoctypeToken        = html.DoctypeToken
)
const (
	DefaultMaxInputSize      = 50 * 1024 * 1024
	DefaultMaxCacheEntries   = 1000
	DefaultWorkerPoolSize    = 4
	DefaultCacheTTL          = time.Hour
	DefaultMaxDepth          = 100
	DefaultProcessingTimeout = 30 * time.Second
)
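Spelled out as a Config literal, these defaults plausibly correspond to what DefaultConfig() returns (a sketch; EnableSanitization's default is not stated in the constants above):

```go
config := html.Config{
	MaxInputSize:      html.DefaultMaxInputSize,      // 50 MB
	MaxCacheEntries:   html.DefaultMaxCacheEntries,   // 1000
	CacheTTL:          html.DefaultCacheTTL,          // 1 hour
	WorkerPoolSize:    html.DefaultWorkerPoolSize,    // 4
	MaxDepth:          html.DefaultMaxDepth,          // 100
	ProcessingTimeout: html.DefaultProcessingTimeout, // 30 seconds
}
```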

Variables

var (
	// ErrInputTooLarge is returned when input exceeds MaxInputSize.
	ErrInputTooLarge = errors.New("html: input size exceeds maximum")

	// ErrInvalidHTML is returned when HTML parsing fails.
	ErrInvalidHTML = errors.New("html: invalid HTML")

	// ErrProcessorClosed is returned when operations are attempted on a closed processor.
	ErrProcessorClosed = errors.New("html: processor closed")

	// ErrMaxDepthExceeded is returned when HTML nesting exceeds MaxDepth.
	ErrMaxDepthExceeded = errors.New("html: max depth exceeded")

	// ErrInvalidConfig is returned when configuration validation fails.
	ErrInvalidConfig = errors.New("html: invalid config")

	// ErrProcessingTimeout is returned when processing exceeds ProcessingTimeout.
	ErrProcessingTimeout = errors.New("html: processing timeout exceeded")

	// ErrFileNotFound is returned when specified file cannot be read.
	ErrFileNotFound = errors.New("html: file not found")
)

Error definitions for the `cybergodev/html` package.

var (
	Parse          = html.Parse
	ParseFragment  = html.ParseFragment
	Render         = html.Render
	EscapeString   = html.EscapeString
	UnescapeString = html.UnescapeString
	NewTokenizer   = html.NewTokenizer
)

Functions

func ExtractAndClean (added in v1.0.4)

func ExtractAndClean(htmlContent string) (string, error)

func ExtractText (added in v1.0.2)

func ExtractText(htmlContent string) (string, error)

ExtractText extracts only text content without metadata.

func ExtractTitle (added in v1.0.4)

func ExtractTitle(htmlContent string) (string, error)

func ExtractToJSON (added in v1.0.4)

func ExtractToJSON(htmlContent string) ([]byte, error)

func ExtractToMarkdown (added in v1.0.4)

func ExtractToMarkdown(htmlContent string) (string, error)

func ExtractWithTitle (added in v1.0.4)

func ExtractWithTitle(htmlContent string) (string, string, error)

func GetReadingTime (added in v1.0.4)

func GetReadingTime(htmlContent string) (float64, error)

func GetWordCount (added in v1.0.4)

func GetWordCount(htmlContent string) (int, error)

func GroupLinksByType (added in v1.0.2)

func GroupLinksByType(links []LinkResource) map[string][]LinkResource

GroupLinksByType groups a LinkResource slice by its Type field.

func Summarize (added in v1.0.4)

func Summarize(htmlContent string, maxWords int) (string, error)

Types

type Attribute

type Attribute = html.Attribute

type AudioInfo

type AudioInfo struct {
	URL      string
	Type     string
	Duration string
}

func ExtractAudios (added in v1.0.4)

func ExtractAudios(htmlContent string) ([]AudioInfo, error)

type Config

type Config struct {
	MaxInputSize       int
	MaxCacheEntries    int
	CacheTTL           time.Duration
	WorkerPoolSize     int
	EnableSanitization bool
	MaxDepth           int
	ProcessingTimeout  time.Duration
}

func DefaultConfig

func DefaultConfig() Config

type ExtractConfig

type ExtractConfig struct {
	ExtractArticle    bool
	PreserveImages    bool
	PreserveLinks     bool
	PreserveVideos    bool
	PreserveAudios    bool
	InlineImageFormat string
}

func ConfigForMarkdown (added in v1.0.4)

func ConfigForMarkdown() ExtractConfig

func ConfigForRSS (added in v1.0.4)

func ConfigForRSS() ExtractConfig

func ConfigForSearchIndex (added in v1.0.4)

func ConfigForSearchIndex() ExtractConfig

func ConfigForSummary (added in v1.0.4)

func ConfigForSummary() ExtractConfig

func DefaultExtractConfig

func DefaultExtractConfig() ExtractConfig

type ImageInfo

type ImageInfo struct {
	URL          string
	Alt          string
	Title        string
	Width        string
	Height       string
	IsDecorative bool
	Position     int
}

func ExtractImages (added in v1.0.4)

func ExtractImages(htmlContent string) ([]ImageInfo, error)

type LinkExtractionConfig (added in v1.0.2)

type LinkExtractionConfig struct {
	ResolveRelativeURLs  bool
	BaseURL              string
	IncludeImages        bool
	IncludeVideos        bool
	IncludeAudios        bool
	IncludeCSS           bool
	IncludeJS            bool
	IncludeContentLinks  bool
	IncludeExternalLinks bool
	IncludeIcons         bool
}

func DefaultLinkExtractionConfig (added in v1.0.2)

func DefaultLinkExtractionConfig() LinkExtractionConfig

type LinkInfo

type LinkInfo struct {
	URL        string
	Text       string
	Title      string
	IsExternal bool
	IsNoFollow bool
}
func ExtractLinks(htmlContent string) ([]LinkInfo, error)

type LinkResource (added in v1.0.2)

type LinkResource struct {
	URL   string
	Title string
	Type  string
}
func ExtractAllLinks(htmlContent string, configs ...LinkExtractionConfig) ([]LinkResource, error)

type Node

type Node = html.Node

type NodeType

type NodeType = html.NodeType

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

func New

func New(config Config) (*Processor, error)

func NewWithDefaults

func NewWithDefaults() *Processor

func (*Processor) ClearCache

func (p *Processor) ClearCache()

func (*Processor) Close

func (p *Processor) Close() error

func (*Processor) Extract

func (p *Processor) Extract(htmlContent string, configs ...ExtractConfig) (*Result, error)
func (p *Processor) ExtractAllLinks(htmlContent string, configs ...LinkExtractionConfig) ([]LinkResource, error)

func (*Processor) ExtractBatch

func (p *Processor) ExtractBatch(htmlContents []string, configs ...ExtractConfig) ([]*Result, error)

func (*Processor) ExtractBatchFiles

func (p *Processor) ExtractBatchFiles(filePaths []string, configs ...ExtractConfig) ([]*Result, error)

func (*Processor) ExtractFromFile

func (p *Processor) ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

func (*Processor) ExtractWithDefaults

func (p *Processor) ExtractWithDefaults(htmlContent string) (*Result, error)

func (*Processor) GetStatistics

func (p *Processor) GetStatistics() Statistics

type Result

type Result struct {
	Text           string
	Title          string
	Images         []ImageInfo
	Links          []LinkInfo
	Videos         []VideoInfo
	Audios         []AudioInfo
	ProcessingTime time.Duration
	WordCount      int
	ReadingTime    time.Duration
}

func Extract (added in v1.0.2)

func Extract(htmlContent string, configs ...ExtractConfig) (*Result, error)

func ExtractFromFile (added in v1.0.2)

func ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

type Statistics

type Statistics struct {
	TotalProcessed     int64
	CacheHits          int64
	CacheMisses        int64
	ErrorCount         int64
	AverageProcessTime time.Duration
}

type Token

type Token = html.Token

type TokenType

type TokenType = html.TokenType

type Tokenizer

type Tokenizer = html.Tokenizer

type VideoInfo

type VideoInfo struct {
	URL      string
	Type     string
	Poster   string
	Width    string
	Height   string
	Duration string
}

func ExtractVideos (added in v1.0.4)

func ExtractVideos(htmlContent string) ([]VideoInfo, error)

Directories

Path Synopsis
Package main demonstrates the enhanced convenience API of the html library.
