parsekit

package module
v2.0.0-beta1
Published: Jan 13, 2025 License: BSD-2-Clause Imports: 8 Imported by: 0

README

ParseKit is a simple, no-surprise library to build parsers and lexers.

It provides the starting blocks needed (and most often forgotten) to make a good parser:

  • solid error reporting and parser synchronization
  • efficient buffering while scanning
  • (more to come?)

Choices made in the package

There are many, many techniques to write a parser (LALR generators, PEG, parser combinators, …).

The authors do not claim to have invented anything new, or even smart, but have instead chosen a few boring techniques that work well together:

  • the program is in control, not driven by callbacks – this leads to a better debugging experience, and code that looks more like regular Go
  • the tokenizer is split between a state management coroutine (handled in the library) and a lexer implementing the actual lexeme recognition (one at a time).
  • the parser is recursive descent, using panics for stack unwinding and synchronisation – the resulting code is also fairly straightforward, with little verbosity
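The panic-based unwinding in the last point can be sketched independently of this package. This is a minimal stand-alone illustration of the technique, not parsekit's implementation: a sentinel value is panicked on error, and a deferred recovery at the statement level records it so parsing can resume.

```go
package main

import "fmt"

// parseError is a sentinel carried by panic, so a parse failure
// unwinds the recursive-descent call stack in a single step.
type parseError struct{ msg string }

// synchronize recovers from a parseError and records it; anything
// else is re-panicked. In parsekit this role is played by
// Parser.Synchronize together with SynchronizeAt.
func synchronize(errs *[]string) {
	if r := recover(); r != nil {
		pe, ok := r.(parseError)
		if !ok {
			panic(r)
		}
		*errs = append(*errs, pe.msg)
	}
}

// parseStmt parses one "statement"; an unexpected token aborts it
// via panic instead of threading errors through every return value.
func parseStmt(tok string, errs *[]string) {
	defer synchronize(errs)
	if tok != "lease" {
		panic(parseError{msg: "expected 'lease', got " + tok})
	}
}

// parse runs parseStmt over every token, collecting recovered errors.
func parse(tokens []string) []string {
	var errs []string
	for _, tok := range tokens {
		parseStmt(tok, &errs)
	}
	return errs
}

func main() {
	fmt.Println(parse([]string{"lease", "oops", "lease"}))
	// the bad token is reported, the good ones still parse
}
```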

These choices work well, in practice, to read the kind of files the authors are most often confronted with (configuration files, DHCP leases, SNORT rules, …).

What next

The test suite is wholly inadequate at the moment, so please feel free to submit bugs and repro cases. We also need to make the transition table more convenient to use with go generate, instead of a little AWK script.

Documentation

Overview

Package parsekit implements a simple, reusable parser for simple grammars.

Example
package main

import (
	"fmt"
	"net/netip"
	"time"
	"unicode/utf8"

	"github.com/TroutSoftware/parsekit/v2"
)

func main() {
	p := parsekit.Init[Lease](
		parsekit.ReadFile("testdata/example_dhcp1"),
		parsekit.WithLexer(scantk),
		parsekit.SynchronizeAt("lease"),
	)

	ParseLease(p)
	lease, err := p.Finish()
	if err != nil {
		fmt.Printf("cannot parse lease file: %s", err)
		return
	}

	fmt.Println(lease)
}

type Lease struct {
	Interface    string
	FixedAddress netip.Addr
	Expire       time.Time
}

func ParseLease(p *parsekit.Parser[Lease]) {
	defer p.Synchronize()

	p.Expect(IdentToken, "lease")
	p.Expect('{', "opening bracket")
	for p.More() {
		if p.Match('}') {
			return
		}

		p.Expect(IdentToken, "option")
		switch p.Lit() {
		case "interface":
			p.Expect(StringToken, "interface")
			p.Value.Interface = p.Val().(string)
			p.Expect(';', ";")
		case "fixed-address":
			p.Expect(IPToken, "IP address")
			p.Value.FixedAddress = p.Val().(netip.Addr)
			p.Expect(';', ";")
		case "expire":
			p.Expect(NumberToken, "number")
			p.Expect(DateTimeToken, "date and time of expiration")
			p.Value.Expire = time.Time(p.Val().(LTime))
			p.Expect(';', ";")
		default:
			for !p.Match(';') {
				p.Skip()
			}
		}
	}
}

type LTime time.Time

func (t *LTime) UnmarshalText(dt []byte) error {
	u, err := time.Parse("2006/01/02 15:04:05", string(dt))
	if err != nil {
		return err
	}
	*t = (LTime)(u)
	return nil
}

const (
	NumberToken rune = -1 - iota
	IPToken
	DateTimeToken
	IdentToken
	StringToken
	InvalidType
)

func scantk(sc *parsekit.Scanner) parsekit.Token {
	switch tk := sc.Advance(); {
	case tk == ' ':
		return parsekit.Ignore // empty space

	case tk == '{', tk == '}', tk == ';':
		return parsekit.Const(tk)

	case tk == '"':
		for sc.Peek() != '"' && sc.Peek() != utf8.RuneError {
			sc.Advance()
		}
		if sc.Peek() == utf8.RuneError {
			return parsekit.EOF
		}
		sc.Advance() // terminating '"'
		return parsekit.Auto[string](StringToken, sc)

	case '0' <= tk && tk <= '9':
		guess := NumberToken
		for {
			if sc.Peek() >= '0' && sc.Peek() <= '9' {
				sc.Advance()
			} else if sc.Peek() == '/' {
				guess = DateTimeToken
				sc.Advance()
			} else if sc.Peek() == '.' {
				guess = IPToken
				sc.Advance()
			} else if (sc.Peek() == ' ' || sc.Peek() == ':') && guess == DateTimeToken {
				sc.Advance()
			} else {
				break
			}
		}
		switch guess {
		case DateTimeToken:
			return parsekit.Auto[LTime](guess, sc)
		case IPToken:
			return parsekit.Auto[netip.Addr](guess, sc)
		default:
			return parsekit.Auto[int](guess, sc)
		}

	case 'a' <= tk && tk <= 'z' || tk == '-':
		for 'a' <= sc.Peek() && sc.Peek() <= 'z' || sc.Peek() == '-' {
			sc.Advance()
		}
		return parsekit.Const(IdentToken)
	}

	return parsekit.EOF
}
Output:

{eth0 10.67.21.85 2023-11-03 11:27:26 +0000 UTC}

Index

Examples

Constants

View Source
const ErrLit = "<error>"

ErrLit is the literal value set after a failed call to Parser.Expect.

Variables

This section is empty.

Functions

This section is empty.

Types

type Identifier

type Identifier string

type Lexer

type Lexer func(s *Scanner) Token

Lexer is a stateful function to read tokens from the scanner. Each time the function returns, a new token is created, and the scanner advances.

type Parser

type Parser[T any] struct {
	Value T
	// contains filtered or unexported fields
}

Parser implements a recursive descent parser. It provides facilities for error reporting, peeking, …

func Init

func Init[T any](opts ...ParserOptions) *Parser[T]

Init creates a new parser. At least two options must be provided: (1) a reader, and (2) a lexer function. Further options (e.g. SynchronizeAt) can be added to refine the parser's behavior.

func (*Parser[T]) Errf

func (p *Parser[T]) Errf(format string, args ...any)

Errf triggers a panic mode with the given formatted error. The position is correctly attached to the error.

func (*Parser[T]) Expect

func (p *Parser[T]) Expect(tk rune, msg string)

Expect advances the parser to the next input, making sure it matches the token tk.

func (*Parser[T]) Finish

func (p *Parser[T]) Finish() (T, error)

Finish returns the parsed value and any error from the parsing. This makes it convenient to use at the bottom of a function:

func ReadConfigFiles() (MyStruct, error) {
	p := Init[MyStruct](ReadFile(xxx), WithLexer(yyy))
	parseConfig(p)
	return p.Finish()
}

func (*Parser[T]) Lit

func (p *Parser[T]) Lit() string

func (*Parser[T]) Match

func (p *Parser[T]) Match(tk ...rune) bool

Match returns true if tk is found at the current parsing point. It does not consume any input on failure, so can be used in a test.

func (*Parser[T]) More

func (p *Parser[T]) More() bool

More returns true if input is left in the stream. More does not advance the parser state, so use Parser.Skip or Parser.Expect to consume a value.

func (*Parser[T]) Skip

func (p *Parser[T]) Skip()

Skip throws away the current token

func (*Parser[T]) Synchronize

func (p *Parser[T]) Synchronize()

Synchronize handles error recovery in the parsing process: when an error occurs, the parser panics all the way to the Parser.Synchronize function. All tokens are thrown away until the first of the literals given to SynchronizeAt is found.

Run this in a top-level `defer` statement, at the level of the synchronisation elements.

func (*Parser[T]) Val

func (p *Parser[T]) Val() any

type ParserOptions

type ParserOptions func(*emb)

ParserOptions specialize the behavior of the parser.

func ReadFile

func ReadFile(name string) ParserOptions

ReadFile reads the content of file name, and passes it to the scanner.

func ReadString

func ReadString(src string) ParserOptions

ReadString creates a scanner on src.

func SynchronizeAt

func SynchronizeAt(lits ...string) ParserOptions

SynchronizeAt sets the synchronisation literals for error recovery. See Parser.Synchronize for full documentation.

func Verbose

func Verbose() ParserOptions

func WithLexer

func WithLexer(lx Lexer) ParserOptions

WithLexer sets the lexer used by the parser.

type Position

type Position struct {
	Filename string // filename, if any
	Offset   int    // byte offset, starting at 0
	Line     int    // line number, starting at 1
	Column   int    // column number, starting at 1 (character count per line)
}

Position is a value that represents a source position. A position is valid if Line > 0.

func (*Position) IsValid

func (pos *Position) IsValid() bool

IsValid reports whether the position is valid.

func (Position) String

func (pos Position) String() string

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner reads lexemes from a source

func (*Scanner) Advance

func (s *Scanner) Advance() rune

Advance returns the next character in the stream, and increments the read counter.

func (*Scanner) Cursor

func (s *Scanner) Cursor() string

Cursor returns the string currently being scanned

func (*Scanner) Peek

func (s *Scanner) Peek() rune

Peek returns the next character in the stream, without incrementing the read counter.

func (*Scanner) Tokens

func (s *Scanner) Tokens(lx Lexer) iter.Seq[Token]

Tokens returns a stream of Tokens from the underlying scanner. The lexer is called repeatedly on the yet-unread content, and its tokens are returned for consumption in the parser.

type Token

type Token struct {
	Type  rune
	Value any

	Lexeme string
	Pos    Position
}
var EOF Token

EOF is a marker token. The Lexer should return it when Scanner.Advance returns an invalid rune.

var Ignore Token

Ignore is a marker token. The Lexer should return it when the current token is to be ignored by the scanner, and not passed to the parser. This is useful to skip over comments, or empty lines.

func Auto

func Auto[T any](r rune, sc *Scanner) Token

Auto returns a new token with value of type T. The value is read from the current lexeme, and converted with:

  • strconv.Unquote for strings if the first character is a quote
  • the lexeme directly for strings
  • strconv.ParseInt
  • unix and iso times for times
  • calling Unmarshaler otherwise

If the value cannot be parsed, an error token is returned to the parser.

func Const

func Const(r rune) Token

Const returns a constant token

func (Token) Error

func (t Token) Error() error
