Documentation
Overview
Package htmlutil provides HTML form and field extraction utilities.
Index
- func FindLabel(form *goquery.Selection, elem *goquery.Selection) *goquery.Selection
- func GetAllFormText(form *goquery.Selection) string
- func GetBodyText(doc *goquery.Document, maxLen int) string
- func GetErrorIndicators(doc *goquery.Document) map[string]any
- func GetFieldsToAnnotate(form *goquery.Selection) []*goquery.Selection
- func GetFormAction(form *goquery.Selection) string
- func GetFormCSS(form *goquery.Selection) string
- func GetFormMethod(form *goquery.Selection) string
- func GetForms(doc *goquery.Document) []*goquery.Selection
- func GetH1Text(doc *goquery.Document) string
- func GetHeadings(doc *goquery.Document) string
- func GetInputCSS(form *goquery.Selection) string
- func GetInputCount(form *goquery.Selection) int
- func GetInputNames(form *goquery.Selection) string
- func GetInputTitles(form *goquery.Selection) string
- func GetLabelText(form *goquery.Selection) string
- func GetLinksText(form *goquery.Selection) string
- func GetMetaDescription(doc *goquery.Document) string
- func GetMetaKeywords(doc *goquery.Document) string
- func GetMetaRobots(doc *goquery.Document) string
- func GetNavText(doc *goquery.Document) string
- func GetPageCSS(doc *goquery.Document) string
- func GetPageLinkTexts(doc *goquery.Document) string
- func GetPageStructure(doc *goquery.Document) map[string]any
- func GetPageTitle(doc *goquery.Document) string
- func GetSubmitTexts(form *goquery.Selection) string
- func GetTypeCounts(form *goquery.Selection) map[string]int
- func GetVisibleFields(form *goquery.Selection) []*goquery.Selection
- func LoadHTML(r io.Reader) (*goquery.Document, error)
- func LoadHTMLString(htmlStr string) (*goquery.Document, error)
- type TextAround
Constants
This section is empty.
Variables
This section is empty.
Functions
func FindLabel
FindLabel finds the <label> element associated with a form field. It checks for label[for=id] or ancestor <label>.
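The two lookups described for FindLabel can be sketched without goquery. This is a minimal illustration, not the package's implementation: the `node` type, `findLabel`, and `findLabelFor` are hypothetical names, and trying the `label[for=id]` match before the ancestor match is an assumption about ordering.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for a parsed HTML element.
type node struct {
	tag      string
	attrs    map[string]string
	parent   *node
	children []*node
}

// add appends child to n, sets its parent pointer, and returns the child.
func (n *node) add(child *node) *node {
	child.parent = n
	n.children = append(n.children, child)
	return child
}

// findLabel first looks inside form for a <label> whose "for" attribute
// equals the field's id, then falls back to the nearest ancestor <label>.
// It returns nil when neither lookup succeeds.
func findLabel(form, elem *node) *node {
	if id := elem.attrs["id"]; id != "" {
		if l := findLabelFor(form, id); l != nil {
			return l
		}
	}
	for p := elem.parent; p != nil; p = p.parent {
		if p.tag == "label" {
			return p
		}
	}
	return nil
}

// findLabelFor does a depth-first search below n for label[for=id].
func findLabelFor(n *node, id string) *node {
	for _, c := range n.children {
		if c.tag == "label" && c.attrs["for"] == id {
			return c
		}
		if l := findLabelFor(c, id); l != nil {
			return l
		}
	}
	return nil
}

func main() {
	form := &node{tag: "form"}
	explicit := form.add(&node{tag: "label", attrs: map[string]string{"for": "email"}})
	email := form.add(&node{tag: "input", attrs: map[string]string{"id": "email"}})
	wrapper := form.add(&node{tag: "label"})
	pass := wrapper.add(&node{tag: "input", attrs: map[string]string{"name": "password"}})

	fmt.Println(findLabel(form, email) == explicit) // matched via label[for=id]
	fmt.Println(findLabel(form, pass) == wrapper)   // matched via ancestor <label>
}
```

With goquery itself, the same two steps would typically map onto a `Find` with an attribute selector and a `Closest("label")` call.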
func GetAllFormText
GetAllFormText returns all text content inside the form.
func GetBodyText (added in v0.0.3)
GetBodyText returns visible text from the body, truncated for feature extraction.
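The source does not say whether maxLen counts bytes or characters, or how whitespace is handled. The sketch below shows one safe way to truncate text for feature extraction, assuming rune-based counting and whitespace collapsing; `truncateText` is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateText collapses runs of whitespace and keeps at most maxLen runes.
// Counting runes rather than bytes is an assumption of this sketch; it avoids
// cutting a multi-byte UTF-8 character in half.
func truncateText(s string, maxLen int) string {
	s = strings.Join(strings.Fields(s), " ")
	if maxLen < 0 {
		maxLen = 0
	}
	r := []rune(s)
	if len(r) <= maxLen {
		return s
	}
	return string(r[:maxLen])
}

func main() {
	fmt.Printf("%q\n", truncateText("  Page   not\n found  ", 8)) // "Page not"
}
```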
func GetErrorIndicators (added in v0.0.3)
GetErrorIndicators returns features for detecting error, soft-404, and other special pages.
func GetFieldsToAnnotate
GetFieldsToAnnotate returns the visible fields that have a non-empty name attribute.
func GetFormAction
GetFormAction returns the form's action attribute.
func GetFormCSS
GetFormCSS returns the form's class and id attributes.
func GetFormMethod
GetFormMethod returns the form's method attribute, lowercased.
func GetHeadings (added in v0.0.3)
GetHeadings returns the concatenated text of all h1-h6 elements.
func GetInputCSS
GetInputCSS returns the CSS classes and IDs of non-hidden input elements.
func GetInputCount
GetInputCount returns the number of named input elements (matching lxml form.inputs.keys()).
func GetInputNames
GetInputNames returns the names of all non-hidden <input> elements, cleaned up.
func GetInputTitles
GetInputTitles returns the title attributes of non-hidden input elements.
func GetLabelText
GetLabelText returns the text of all <label> elements in the form.
func GetLinksText
GetLinksText returns the text of all links inside the form.
func GetMetaDescription (added in v0.0.3)
GetMetaDescription returns the content of <meta name="description">.
func GetMetaKeywords (added in v0.0.3)
GetMetaKeywords returns the content of <meta name="keywords">.
func GetMetaRobots (added in v0.0.3)
GetMetaRobots returns the content of <meta name="robots">.
func GetNavText (added in v0.0.3)
GetNavText returns the concatenated text of all <nav> elements.
func GetPageCSS (added in v0.0.3)
GetPageCSS returns the class and id attributes of the <body> and <main> elements.
func GetPageLinkTexts (added in v0.0.3)
GetPageLinkTexts returns the concatenated text of all <a> elements.
func GetPageStructure (added in v0.0.3)
GetPageStructure returns structural boolean features and counts about the page.
func GetPageTitle (added in v0.0.3)
GetPageTitle returns the <title> text content.
func GetSubmitTexts
GetSubmitTexts returns the values of all <input type="submit"> elements.
func GetTypeCounts
GetTypeCounts returns counts of different input types in a form.
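A count of this kind might be built as in the sketch below, shown over plain strings rather than a goquery selection. `countTypes` is a hypothetical helper; lowercasing the value and treating a missing type as "text" (the HTML default for <input>) are assumptions, not documented behavior of GetTypeCounts.

```go
package main

import (
	"fmt"
	"strings"
)

// countTypes tallies input type values. An empty type is counted as "text",
// the HTML default for <input>; whether the package does the same is an
// assumption of this sketch.
func countTypes(types []string) map[string]int {
	counts := make(map[string]int)
	for _, t := range types {
		t = strings.ToLower(strings.TrimSpace(t))
		if t == "" {
			t = "text"
		}
		counts[t]++
	}
	return counts
}

func main() {
	counts := countTypes([]string{"text", "", "PASSWORD", "submit"})
	fmt.Println(counts["text"], counts["password"], counts["submit"]) // 2 1 1
}
```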
func GetVisibleFields
GetVisibleFields returns visible form fields (textarea, select, button, non-hidden inputs).
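The visibility rule here, and the extra name filter described for GetFieldsToAnnotate, can be sketched over a simplified field record. The `field` struct, `isVisible`, and `fieldsToAnnotate` are hypothetical names for illustration; the real functions operate on goquery selections and may differ in detail, for example in case handling.

```go
package main

import (
	"fmt"
	"strings"
)

// field is a hypothetical, flattened view of a form control.
type field struct {
	Tag, Type, Name string
}

// isVisible reports whether f is a textarea, select, button,
// or an input whose type is not "hidden".
func isVisible(f field) bool {
	switch strings.ToLower(f.Tag) {
	case "textarea", "select", "button":
		return true
	case "input":
		return !strings.EqualFold(f.Type, "hidden")
	}
	return false
}

// fieldsToAnnotate keeps visible fields that have a non-empty name.
func fieldsToAnnotate(fields []field) []field {
	var out []field
	for _, f := range fields {
		if isVisible(f) && f.Name != "" {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	fields := []field{
		{Tag: "input", Type: "text", Name: "q"},
		{Tag: "input", Type: "hidden", Name: "csrf"},
		{Tag: "select", Name: "lang"},
		{Tag: "button"},
	}
	for _, f := range fieldsToAnnotate(fields) {
		fmt.Println(f.Tag, f.Name)
	}
}
```

In this example the hidden input is dropped by the visibility check, and the unnamed button is visible but dropped by the name filter.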
Types
type TextAround
TextAround holds text before and after each element.
func GetTextAroundElems
func GetTextAroundElems(root *goquery.Selection, elems []*goquery.Selection) TextAround
GetTextAroundElems returns text before and after each specified element, matching lxml's text/tail walk behavior from Formasaurus.
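In lxml's model, each element carries `text` (the text before its first child) and `tail` (the text after its closing tag). The sketch below illustrates a text/tail walk that splits the document's text stream at each target element. It is an illustration only: the `node` and `around` types and the `textAround` function are hypothetical, the fields of the real TextAround type are not shown in the source, and skipping the text inside a target element is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// node mimics lxml's element model: text is the text before the first
// child, tail is the text after the element's closing tag.
type node struct {
	tag, text, tail string
	children        []*node
}

// around is a hypothetical result type: text before and after each target.
type around struct {
	before, after map[*node]string
}

// textAround walks root in document order and cuts the text stream at each
// target. The chunk between two consecutive targets becomes the "after" text
// of the first and the "before" text of the second.
func textAround(root *node, targets []*node) around {
	isTarget := make(map[*node]bool, len(targets))
	for _, t := range targets {
		isTarget[t] = true
	}
	res := around{before: map[*node]string{}, after: map[*node]string{}}
	var buf []string
	var prev *node

	// flush assigns the buffered text to the previous and next targets.
	flush := func(next *node) {
		chunk := strings.Join(strings.Fields(strings.Join(buf, " ")), " ")
		if prev != nil {
			res.after[prev] = chunk
		}
		if next != nil {
			res.before[next] = chunk
		}
		buf = buf[:0]
	}

	var walk func(n *node)
	walk = func(n *node) {
		if isTarget[n] {
			flush(n)
			prev = n
			// Assumption: text inside a target (e.g. <option> labels) is skipped.
			buf = append(buf, n.tail)
			return
		}
		buf = append(buf, n.text)
		for _, c := range n.children {
			walk(c)
		}
		buf = append(buf, n.tail)
	}
	walk(root)
	flush(nil)
	return res
}

func main() {
	// <form>Name: <input> and email <input> thanks</form>
	a := &node{tag: "input", tail: " and email "}
	b := &node{tag: "input", tail: " thanks"}
	root := &node{tag: "form", text: "Name: ", children: []*node{a, b}}

	res := textAround(root, []*node{a, b})
	fmt.Printf("%q %q\n", res.before[a], res.after[a]) // "Name:" "and email"
	fmt.Printf("%q %q\n", res.before[b], res.after[b]) // "and email" "thanks"
}
```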