html

v1.0.5 · Published: Jan 14, 2026 · License: MIT · Imports: 15 · Imported by: 0

README

HTML Library


A Go library for intelligent HTML content extraction. Compatible with golang.org/x/net/html: use it as a drop-in replacement and get enhanced content-extraction features on top.

📖 中文文档 (Chinese documentation) - User guide

✨ Core Features

🎯 Content Extraction
  • Article Detection: Identifies main content using scoring algorithms (text density, link density, semantic tags)
  • Smart Text Extraction: Preserves structure, handles newlines, calculates word count and reading time
  • Media Extraction: Images, videos, audio with metadata (URL, dimensions, alt text, type detection)
  • Link Analysis: External/internal detection, nofollow attributes, anchor text extraction
⚡ Performance
  • Content-Addressable Caching: SHA256-based keys with TTL and LRU eviction
  • Batch Processing: Parallel extraction with configurable worker pools
  • Thread-Safe: Concurrent use without external synchronization
  • Resource Limits: Configurable input size, nesting depth, and timeout protection
📖 Use Cases
  • 📰 News Aggregators: Extract article content from news sites
  • 🤖 Web Scrapers: Get structured data from HTML pages
  • 📝 Content Management: Convert HTML to Markdown or other formats
  • 🔍 Search Engines: Index main content without navigation/ads
  • 📊 Data Analysis: Extract and analyze web content at scale
  • 📱 RSS/Feed Generators: Create feeds from HTML content
  • 🎓 Documentation Tools: Convert HTML docs to other formats
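The scoring approach behind Article Detection can be pictured with a small stdlib-only sketch. The `candidate` struct and the formula below are illustrative assumptions about how text-density/link-density scoring generally works, not this library's internals:

```go
package main

import "fmt"

// candidate summarizes one DOM subtree considered as the main article.
type candidate struct {
	TextLen     int // characters of visible text in the subtree
	LinkTextLen int // characters of visible text inside <a> elements
}

// score rewards long text and penalizes link-heavy regions; navigation
// bars and footers have link density near 1.0 and score close to zero.
func score(c candidate) float64 {
	if c.TextLen == 0 {
		return 0
	}
	linkDensity := float64(c.LinkTextLen) / float64(c.TextLen)
	return float64(c.TextLen) * (1 - linkDensity)
}

func main() {
	article := candidate{TextLen: 1200, LinkTextLen: 60}
	navbar := candidate{TextLen: 80, LinkTextLen: 76}
	fmt.Println(score(article) > score(navbar)) // true: the article wins
}
```

Real detectors also weight semantic tags such as <article> and <main>, as the feature list above notes.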

📦 Installation

go get github.com/cybergodev/html

⚡ 5-Minute Quick Start

import "github.com/cybergodev/html"

// Extract clean text from HTML
text, _ := html.ExtractText(`
    <html>
        <nav>Navigation</nav>
        <article><h1>Hello World</h1><p>Content here...</p></article>
        <footer>Footer</footer>
    </html>
`)
fmt.Println(text) // "Hello World\nContent here..."

That's it! The library automatically:

  • Removes navigation, footers, ads
  • Extracts main content
  • Cleans up whitespace

🚀 Quick Guide

One-Liner Functions

Just want to get something done? Use these package-level functions:

// Extract text only
text, _ := html.ExtractText(htmlContent)

// Extract everything
result, _ := html.Extract(htmlContent)
fmt.Println(result.Title)     // Hello World
fmt.Println(result.Text)      // Content here...
fmt.Println(result.WordCount) // 4

// Extract only specific elements
title, err := html.ExtractTitle(htmlContent)
images, err := html.ExtractImages(htmlContent)
links, err := html.ExtractLinks(htmlContent)

// Convert formats
markdown, err := html.ExtractToMarkdown(htmlContent)
jsonData, err := html.ExtractToJSON(htmlContent)

// Content analysis
wordCount, err := html.GetWordCount(htmlContent)
readingTime, err := html.GetReadingTime(htmlContent)
summary, err := html.Summarize(htmlContent, 50) // max 50 words

When to use: Simple scripts, one-off tasks, quick prototyping
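GetWordCount and GetReadingTime boil down to simple arithmetic on the extracted text. A stdlib-only sketch of that arithmetic (the 200 words-per-minute constant is a common convention and an assumption here, not the library's documented value):

```go
package main

import (
	"fmt"
	"strings"
)

// wordsPerMinute is a conventional silent-reading speed.
const wordsPerMinute = 200.0

// readingStats counts whitespace-separated words and converts the
// count into an estimated reading time in minutes.
func readingStats(text string) (words int, minutes float64) {
	words = len(strings.Fields(text))
	return words, float64(words) / wordsPerMinute
}

func main() {
	words, minutes := readingStats("Hello World Content here...")
	fmt.Printf("%d words, %.3f min\n", words, minutes) // 4 words, 0.020 min
}
```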


Basic Processor Usage

Need more control? Create a processor:

processor := html.NewWithDefaults()
defer processor.Close()

// Extract with defaults
result, err := processor.ExtractWithDefaults(htmlContent)

// Extract from file
result, err = processor.ExtractFromFile("page.html", html.DefaultExtractConfig())

// Batch processing
htmlContents := []string{html1, html2, html3}
results, err := processor.ExtractBatch(htmlContents, html.DefaultExtractConfig())

When to use: Multiple extractions, processing many files, web scrapers


Custom Configuration

Fine-tune what gets extracted:

config := html.ExtractConfig{
    ExtractArticle:    true,   // Auto-detect main content
    PreserveImages:    true,   // Extract image metadata
    PreserveLinks:     true,   // Extract link metadata
    PreserveVideos:    false,  // Skip videos
    PreserveAudios:    false,  // Skip audio
    InlineImageFormat: "none", // Options: "none", "placeholder", "markdown", "html"
}

processor := html.NewWithDefaults()
defer processor.Close()

result, err := processor.Extract(htmlContent, config)

When to use: Specific extraction needs, format conversion, custom output


Advanced Features
Custom Processor Configuration
config := html.Config{
    MaxInputSize:       10 * 1024 * 1024, // 10MB limit
    ProcessingTimeout:  30 * time.Second,
    MaxCacheEntries:    500,
    CacheTTL:           30 * time.Minute,
    WorkerPoolSize:     8,
    EnableSanitization: true,  // Remove <script>, <style> tags
    MaxDepth:           50,    // Prevent deep nesting attacks
}

processor, err := html.New(config)
defer processor.Close()
Link Extraction
// Extract all resource links
links, err := html.ExtractAllLinks(htmlContent)

// Group by type
byType := html.GroupLinksByType(links)
cssLinks := byType["css"]
jsLinks := byType["js"]
images := byType["image"]

// Advanced configuration
processor := html.NewWithDefaults()
linkConfig := html.LinkExtractionConfig{
    BaseURL:              "https://example.com",
    ResolveRelativeURLs:  true,
    IncludeImages:        true,
    IncludeVideos:        true,
    IncludeCSS:           true,
    IncludeJS:            true,
}
links, err = processor.ExtractAllLinks(htmlContent, linkConfig)
Caching & Statistics
processor := html.NewWithDefaults()
defer processor.Close()

// Automatic caching enabled
result1, err := processor.ExtractWithDefaults(htmlContent)
result2, err := processor.ExtractWithDefaults(htmlContent) // Cache hit!

// Check performance
stats := processor.GetStatistics()
fmt.Printf("Cache hits: %d/%d\n", stats.CacheHits, stats.TotalProcessed)

// Clear cache if needed
processor.ClearCache()
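Content-addressable caching means the cache key is derived from the content itself. A sketch of SHA256-based key derivation using only the standard library (the exact key layout, including how the extraction config is mixed in, is an assumption):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes the extraction options together with the HTML, so
// identical input always maps to the same cache entry.
func cacheKey(htmlContent, configFingerprint string) string {
	h := sha256.Sum256([]byte(configFingerprint + "\x00" + htmlContent))
	return hex.EncodeToString(h[:])
}

func main() {
	k1 := cacheKey("<p>hi</p>", "defaults")
	k2 := cacheKey("<p>hi</p>", "defaults")
	fmt.Println(len(k1), k1 == k2) // 64 true
}
```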
Configuration Presets
processor := html.NewWithDefaults()
defer processor.Close()

// RSS feed generation
result, err := processor.Extract(htmlContent, html.ConfigForRSS())

// Summary generation (text only)
result, err = processor.Extract(htmlContent, html.ConfigForSummary())

// Search indexing (all metadata)
result, err = processor.Extract(htmlContent, html.ConfigForSearchIndex())

// Markdown output
result, err = processor.Extract(htmlContent, html.ConfigForMarkdown())

When to use: Production applications, performance optimization, specific use cases


📖 Common Recipes

Copy-paste solutions for common tasks:

Extract Article Text (Clean)
text, err := html.ExtractText(htmlContent)
// Returns clean text without navigation/ads
Extract with Images
result, err := html.Extract(htmlContent)
for _, img := range result.Images {
    fmt.Printf("Image: %s (alt: %s)\n", img.URL, img.Alt)
}
Convert to Markdown
markdown, err := html.ExtractToMarkdown(htmlContent)
// Images become: ![alt](url)
Extract All Links
links, err := html.ExtractAllLinks(htmlContent)
for _, link := range links {
    fmt.Printf("%s: %s\n", link.Type, link.URL)
}
Get Reading Time
minutes, err := html.GetReadingTime(htmlContent)
fmt.Printf("Reading time: %.1f min", minutes)
Batch Process Files
processor := html.NewWithDefaults()
defer processor.Close()

files := []string{"page1.html", "page2.html", "page3.html"}
results, err := processor.ExtractBatchFiles(files, html.DefaultExtractConfig())
Create RSS Feed Content
processor := html.NewWithDefaults()
defer processor.Close()

result, err := processor.Extract(htmlContent, html.ConfigForRSS())
// Optimized for RSS: fast, includes images/links, no article detection

🔧 API Quick Reference

Package-Level Functions
// Extraction
Extract(htmlContent string) (*Result, error)
ExtractText(htmlContent string) (string, error)
ExtractFromFile(path string) (*Result, error)

// Format Conversion
ExtractToMarkdown(htmlContent string) (string, error)
ExtractToJSON(htmlContent string) ([]byte, error)

// Specific Elements
ExtractTitle(htmlContent string) (string, error)
ExtractImages(htmlContent string) ([]ImageInfo, error)
ExtractVideos(htmlContent string) ([]VideoInfo, error)
ExtractAudios(htmlContent string) ([]AudioInfo, error)
ExtractLinks(htmlContent string) ([]LinkInfo, error)
ExtractWithTitle(htmlContent string) (string, string, error)

// Analysis
GetWordCount(htmlContent string) (int, error)
GetReadingTime(htmlContent string) (float64, error)
Summarize(htmlContent string, maxWords int) (string, error)
ExtractAndClean(htmlContent string) (string, error)

// Links
ExtractAllLinks(htmlContent string, configs ...LinkExtractionConfig) ([]LinkResource, error)
GroupLinksByType(links []LinkResource) map[string][]LinkResource
Processor Methods
// Creation
NewWithDefaults() *Processor
New(config Config) (*Processor, error)
processor.Close()

// Extraction
processor.Extract(htmlContent string, config ExtractConfig) (*Result, error)
processor.ExtractWithDefaults(htmlContent string) (*Result, error)
processor.ExtractFromFile(path string, config ExtractConfig) (*Result, error)

// Batch
processor.ExtractBatch(contents []string, config ExtractConfig) ([]*Result, error)
processor.ExtractBatchFiles(paths []string, config ExtractConfig) ([]*Result, error)

// Links
processor.ExtractAllLinks(htmlContent string, config LinkExtractionConfig) ([]LinkResource, error)

// Monitoring
processor.GetStatistics() Statistics
processor.ClearCache()
Configuration Presets
DefaultExtractConfig()        ExtractConfig
ConfigForRSS()                ExtractConfig
ConfigForSummary()            ExtractConfig
ConfigForSearchIndex()        ExtractConfig
ConfigForMarkdown()           ExtractConfig
DefaultLinkExtractionConfig() LinkExtractionConfig

Result Structure

type Result struct {
    Text           string        // Clean text content
    Title          string        // Page/article title
    Images         []ImageInfo   // Image metadata
    Links          []LinkInfo    // Link metadata
    Videos         []VideoInfo   // Video metadata
    Audios         []AudioInfo   // Audio metadata
    WordCount      int           // Total words
    ReadingTime    time.Duration // Estimated reading time
    ProcessingTime time.Duration // Time taken
}

type ImageInfo struct {
    URL          string  // Image URL
    Alt          string  // Alt text
    Title        string  // Title attribute
    Width        string  // Width attribute
    Height       string  // Height attribute
    IsDecorative bool    // No alt text
}

type LinkInfo struct {
    URL        string  // Link URL
    Text       string  // Anchor text
    IsExternal bool    // External domain
    IsNoFollow bool    // rel="nofollow"
}
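As a small example of consuming these structures, here is a stdlib-only helper that keeps external links search engines may follow. LinkInfo is redeclared locally so the sketch compiles standalone; in real code you would use the slice returned by html.ExtractLinks:

```go
package main

import "fmt"

// LinkInfo mirrors the struct documented above.
type LinkInfo struct {
	URL        string
	Text       string
	IsExternal bool
	IsNoFollow bool
}

// followableExternal keeps links that point off-site and are not
// marked rel="nofollow".
func followableExternal(links []LinkInfo) []LinkInfo {
	var out []LinkInfo
	for _, l := range links {
		if l.IsExternal && !l.IsNoFollow {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	links := []LinkInfo{
		{URL: "/about"},
		{URL: "https://example.org", IsExternal: true},
		{URL: "https://ads.example.net", IsExternal: true, IsNoFollow: true},
	}
	for _, l := range followableExternal(links) {
		fmt.Println(l.URL) // https://example.org
	}
}
```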

Examples

See examples/ directory for complete, runnable code:

Example                      Description
01_quick_start.go            Quick start with one-liners
02_content_extraction.go     Content extraction basics
03_link_extraction.go        Link extraction patterns
04_media_extraction.go       Media (images/videos/audio)
04_advanced_features.go      Advanced features & compatibility
05_advanced_usage.go         Batch processing & performance
06_compatibility.go          golang.org/x/net/html compatibility
07_convenience_api.go        Package-level convenience API

Compatibility

This library is a drop-in replacement for golang.org/x/net/html:

// Just change the import
- import "golang.org/x/net/html"
+ import "github.com/cybergodev/html"

// All existing code works
doc, err := html.Parse(reader)
html.Render(writer, doc)
escaped := html.EscapeString("<script>")

See COMPATIBILITY.md for details.


Thread Safety

The Processor is safe for concurrent use:

processor := html.NewWithDefaults()
defer processor.Close()

// Safe to use from multiple goroutines
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        processor.ExtractWithDefaults(htmlContent)
    }()
}
wg.Wait()

🤝 Contributing

Contributions, issue reports, and suggestions are welcome!

📄 License

MIT License - See LICENSE file for details.


Crafted with care for the Go community ❤️ | If this project helps you, please give it a ⭐️ Star!

Documentation

Index

Constants

const (
	ErrorNode    = html.ErrorNode
	TextNode     = html.TextNode
	DocumentNode = html.DocumentNode
	ElementNode  = html.ElementNode
	CommentNode  = html.CommentNode
	DoctypeNode  = html.DoctypeNode

	ErrorToken          = html.ErrorToken
	TextToken           = html.TextToken
	StartTagToken       = html.StartTagToken
	EndTagToken         = html.EndTagToken
	SelfClosingTagToken = html.SelfClosingTagToken
	CommentToken        = html.CommentToken
	DoctypeToken        = html.DoctypeToken
)
const (
	DefaultMaxInputSize      = 50 * 1024 * 1024
	DefaultMaxCacheEntries   = 1000
	DefaultWorkerPoolSize    = 4
	DefaultCacheTTL          = time.Hour
	DefaultMaxDepth          = 100
	DefaultProcessingTimeout = 30 * time.Second
)
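Spelled out as a Config literal, these defaults plausibly correspond to what DefaultConfig() returns (a sketch; EnableSanitization's default is not stated in the constants above):

```go
config := html.Config{
	MaxInputSize:      html.DefaultMaxInputSize,      // 50 MB
	MaxCacheEntries:   html.DefaultMaxCacheEntries,   // 1000
	CacheTTL:          html.DefaultCacheTTL,          // 1 hour
	WorkerPoolSize:    html.DefaultWorkerPoolSize,    // 4
	MaxDepth:          html.DefaultMaxDepth,          // 100
	ProcessingTimeout: html.DefaultProcessingTimeout, // 30 seconds
}
```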

Variables

var (
	// ErrInputTooLarge is returned when input exceeds MaxInputSize.
	ErrInputTooLarge = errors.New("html: input size exceeds maximum")

	// ErrInvalidHTML is returned when HTML parsing fails.
	ErrInvalidHTML = errors.New("html: invalid HTML")

	// ErrProcessorClosed is returned when operations are attempted on a closed processor.
	ErrProcessorClosed = errors.New("html: processor closed")

	// ErrMaxDepthExceeded is returned when HTML nesting exceeds MaxDepth.
	ErrMaxDepthExceeded = errors.New("html: max depth exceeded")

	// ErrInvalidConfig is returned when configuration validation fails.
	ErrInvalidConfig = errors.New("html: invalid config")

	// ErrProcessingTimeout is returned when processing exceeds ProcessingTimeout.
	ErrProcessingTimeout = errors.New("html: processing timeout exceeded")

	// ErrFileNotFound is returned when specified file cannot be read.
	ErrFileNotFound = errors.New("html: file not found")
)

Error definitions for the `cybergodev/html` package.

var (
	Parse          = html.Parse
	ParseFragment  = html.ParseFragment
	Render         = html.Render
	EscapeString   = html.EscapeString
	UnescapeString = html.UnescapeString
	NewTokenizer   = html.NewTokenizer
)

Functions

func ExtractAndClean (added in v1.0.4)

func ExtractAndClean(htmlContent string) (string, error)

func ExtractText (added in v1.0.2)

func ExtractText(htmlContent string) (string, error)

ExtractText extracts only text content without metadata.

func ExtractTitle (added in v1.0.4)

func ExtractTitle(htmlContent string) (string, error)

func ExtractToJSON (added in v1.0.4)

func ExtractToJSON(htmlContent string) ([]byte, error)

func ExtractToMarkdown (added in v1.0.4)

func ExtractToMarkdown(htmlContent string) (string, error)

func ExtractWithTitle (added in v1.0.4)

func ExtractWithTitle(htmlContent string) (string, string, error)

func GetReadingTime (added in v1.0.4)

func GetReadingTime(htmlContent string) (float64, error)

func GetWordCount (added in v1.0.4)

func GetWordCount(htmlContent string) (int, error)

func GroupLinksByType (added in v1.0.2)

func GroupLinksByType(links []LinkResource) map[string][]LinkResource

GroupLinksByType groups a LinkResource slice by its Type field.

func Summarize (added in v1.0.4)

func Summarize(htmlContent string, maxWords int) (string, error)

Types

type Attribute

type Attribute = html.Attribute

type AudioInfo

type AudioInfo struct {
	URL      string
	Type     string
	Duration string
}

func ExtractAudios (added in v1.0.4)

func ExtractAudios(htmlContent string) ([]AudioInfo, error)

type Config

type Config struct {
	MaxInputSize       int
	MaxCacheEntries    int
	CacheTTL           time.Duration
	WorkerPoolSize     int
	EnableSanitization bool
	MaxDepth           int
	ProcessingTimeout  time.Duration
}

func DefaultConfig

func DefaultConfig() Config

type ExtractConfig

type ExtractConfig struct {
	ExtractArticle    bool
	PreserveImages    bool
	PreserveLinks     bool
	PreserveVideos    bool
	PreserveAudios    bool
	InlineImageFormat string
}

func ConfigForMarkdown (added in v1.0.4)

func ConfigForMarkdown() ExtractConfig

func ConfigForRSS (added in v1.0.4)

func ConfigForRSS() ExtractConfig

func ConfigForSearchIndex (added in v1.0.4)

func ConfigForSearchIndex() ExtractConfig

func ConfigForSummary (added in v1.0.4)

func ConfigForSummary() ExtractConfig

func DefaultExtractConfig

func DefaultExtractConfig() ExtractConfig

type ImageInfo

type ImageInfo struct {
	URL          string
	Alt          string
	Title        string
	Width        string
	Height       string
	IsDecorative bool
	Position     int
}

func ExtractImages (added in v1.0.4)

func ExtractImages(htmlContent string) ([]ImageInfo, error)

type LinkExtractionConfig (added in v1.0.2)

type LinkExtractionConfig struct {
	ResolveRelativeURLs  bool
	BaseURL              string
	IncludeImages        bool
	IncludeVideos        bool
	IncludeAudios        bool
	IncludeCSS           bool
	IncludeJS            bool
	IncludeContentLinks  bool
	IncludeExternalLinks bool
	IncludeIcons         bool
}

func DefaultLinkExtractionConfig (added in v1.0.2)

func DefaultLinkExtractionConfig() LinkExtractionConfig

type LinkInfo

type LinkInfo struct {
	URL        string
	Text       string
	Title      string
	IsExternal bool
	IsNoFollow bool
}
func ExtractLinks(htmlContent string) ([]LinkInfo, error)

type LinkResource (added in v1.0.2)

type LinkResource struct {
	URL   string
	Title string
	Type  string
}
func ExtractAllLinks(htmlContent string, configs ...LinkExtractionConfig) ([]LinkResource, error)

type Node

type Node = html.Node

type NodeType

type NodeType = html.NodeType

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

func New

func New(config Config) (*Processor, error)

func NewWithDefaults

func NewWithDefaults() *Processor

func (*Processor) ClearCache

func (p *Processor) ClearCache()

func (*Processor) Close

func (p *Processor) Close() error

func (*Processor) Extract

func (p *Processor) Extract(htmlContent string, configs ...ExtractConfig) (*Result, error)
func (p *Processor) ExtractAllLinks(htmlContent string, configs ...LinkExtractionConfig) ([]LinkResource, error)

func (*Processor) ExtractBatch

func (p *Processor) ExtractBatch(htmlContents []string, configs ...ExtractConfig) ([]*Result, error)

func (*Processor) ExtractBatchFiles

func (p *Processor) ExtractBatchFiles(filePaths []string, configs ...ExtractConfig) ([]*Result, error)

func (*Processor) ExtractFromFile

func (p *Processor) ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

func (*Processor) ExtractWithDefaults

func (p *Processor) ExtractWithDefaults(htmlContent string) (*Result, error)

func (*Processor) GetStatistics

func (p *Processor) GetStatistics() Statistics

type Result

type Result struct {
	Text           string
	Title          string
	Images         []ImageInfo
	Links          []LinkInfo
	Videos         []VideoInfo
	Audios         []AudioInfo
	ProcessingTime time.Duration
	WordCount      int
	ReadingTime    time.Duration
}

func Extract (added in v1.0.2)

func Extract(htmlContent string, configs ...ExtractConfig) (*Result, error)

func ExtractFromFile (added in v1.0.2)

func ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

type Statistics

type Statistics struct {
	TotalProcessed     int64
	CacheHits          int64
	CacheMisses        int64
	ErrorCount         int64
	AverageProcessTime time.Duration
}

type Token

type Token = html.Token

type TokenType

type TokenType = html.TokenType

type Tokenizer

type Tokenizer = html.Tokenizer

type VideoInfo

type VideoInfo struct {
	URL      string
	Type     string
	Poster   string
	Width    string
	Height   string
	Duration string
}

func ExtractVideos (added in v1.0.4)

func ExtractVideos(htmlContent string) ([]VideoInfo, error)

Directories

Path Synopsis
Package main demonstrates the enhanced convenience API of the html library.
