Documentation ¶
Index ¶
- func DefaultTokenizer(text string) iter.Seq[string]
- func IsNoMatchVector(vector []float32) bool
- func VectorSubCount(vector []float32) (count int)
- type Corpus
- func (c *Corpus) CreatePaddedVector(text string) []float32
- func (c *Corpus) CreateVector(text string) []float32
- func (c *Corpus) GetDocumentCount() int
- func (c *Corpus) GetTermFrequency() map[string]int
- func (c *Corpus) GetUsedCapacity() (percent int)
- func (c *Corpus) IndexDocument(text string)
- func (c *Corpus) Prune()
- func (c *Corpus) Reset()
- type Option
- type PruneHook
- type TermFilter
- type Tokenizer
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func IsNoMatchVector ¶
IsNoMatchVector returns true if the vector didn't match any terms.
func VectorSubCount ¶
VectorSubCount returns the number of non-zero values in the vector.
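A brief sketch of how these two helpers might be used together to validate a query vector before handing it to a similarity search; the package alias tfidf and the surrounding function are illustrative, not part of this documentation:

// querySearchable reports whether a query produced a vector worth searching with.
// Sketch only: assumes this package is imported as tfidf.
func querySearchable(c *tfidf.Corpus, query string) bool {
	vec := c.CreateVector(query)
	if tfidf.IsNoMatchVector(vec) {
		return false // no query term occurs in the corpus
	}
	// Require at least two matching terms before running a search.
	return tfidf.VectorSubCount(vec) >= 2
}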
Types ¶
type Corpus ¶
type Corpus struct {
// contains filtered or unexported fields
}
Corpus stores term frequencies across all documents.
func (*Corpus) CreatePaddedVector ¶
CreatePaddedVector creates a vector with the maximum potential vector size, padding with zeros if the vector is smaller. This is only needed if the graph you use to compare vectors does not support sparse vectors; padded vectors use more memory.
This is concurrent-safe.
func (*Corpus) CreateVector ¶
CreateVector creates a TF-IDF vector for the given text. Note that ALL documents must be indexed before generating a vector and adding it to a graph. The returned vector will not be padded; see [CreatePaddedVector] if you need a constant-sized vector.
This will automatically call Corpus.Prune if there are any new documents that have been indexed since the last prune.
This is concurrent-safe.
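A minimal end-to-end sketch of this ordering: index every document first, then create vectors. The import path and the NewCorpus constructor name are placeholders, since this page does not show the package's constructor:

package main

import (
	"fmt"

	tfidf "example.com/tfidf" // placeholder import path
)

func main() {
	docs := []string{
		"the quick brown fox",
		"the lazy dog sleeps",
		"a fox chases a dog",
	}

	// NewCorpus is a placeholder name for whatever constructor the package provides.
	c := tfidf.NewCorpus()

	// Index ALL documents before creating any vectors.
	for _, d := range docs {
		c.IndexDocument(d)
	}

	// The first CreateVector call after indexing also runs Corpus.Prune.
	for _, d := range docs {
		vec := c.CreateVector(d)
		fmt.Println(tfidf.VectorSubCount(vec))
	}

	// Query vectors are created the same way, only after all documents are indexed.
	query := c.CreateVector("brown fox")
	fmt.Println(tfidf.IsNoMatchVector(query))
}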
func (*Corpus) GetDocumentCount ¶
GetDocumentCount returns the number of documents that have been indexed.
func (*Corpus) GetTermFrequency ¶
GetTermFrequency returns a snapshot of the term frequencies. Note that because [CreateVector] calls Corpus.Prune before creating vectors, invoking this before [CreateVector] may return terms that have not yet been pruned by [PruneHook]s. In that case, call Corpus.Prune manually before this function.
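A short sketch of that pattern, pruning manually before taking the snapshot (tfidf is an assumed import alias):

// termCounts returns the term frequencies after prune hooks have run.
// Sketch only: call this after all documents have been indexed.
func termCounts(c *tfidf.Corpus) map[string]int {
	c.Prune() // prune first so the snapshot reflects the final term set
	return c.GetTermFrequency()
}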
func (*Corpus) GetUsedCapacity ¶
GetUsedCapacity returns the percentage of the corpus capacity that is used. You can use this to determine whether you are getting close to the max vector size. If you go above capacity, all vectors will be calculated from only the first X terms (sorted), where X is the max vector size, and you will lose corpus information. Make sure to call Corpus.Prune before checking this.
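A hedged sketch of a capacity check after indexing, pruning first as advised; the 90% threshold is an arbitrary example:

// warnNearCapacity logs when the corpus approaches the max vector size.
// Sketch only: run after all documents have been indexed.
func warnNearCapacity(c *tfidf.Corpus) {
	c.Prune() // prune first so the figure reflects the final term set
	if used := c.GetUsedCapacity(); used > 90 {
		log.Printf("corpus at %d%% of max vector size; raise the size or prune more aggressively", used)
	}
}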
func (*Corpus) IndexDocument ¶
IndexDocument indexes a document, calculating occurrences of each term. Note that you should call this for ALL documents before creating vectors for your documents (or search queries).
This is concurrent-safe.
func (*Corpus) Prune ¶
func (c *Corpus) Prune()
Prune runs all prune hooks, removing terms of less importance from the corpus. It is run automatically by Corpus.CreateVector if any new documents have been indexed since the last prune. Run it manually if you don't plan to invoke Corpus.CreateVector immediately after indexing all documents. Do not run this until you have indexed all documents.
This is concurrent-safe.
type Option ¶
type Option func(*Corpus)
func WithMaxVectorSize ¶
WithMaxVectorSize sets the maximum potential vector size.
func WithPruneHooks ¶
WithPruneHooks allows adding hooks, run before vectorization, that remove terms from the corpus. This can be used to remove terms that appear in either too few or too many documents, reducing the size of the corpus.
func WithTermFilters ¶
func WithTermFilters(filters ...TermFilter) Option
WithTermFilters allows adding filters to the tokenizer iterator, for example to add:
- stopword removal
- lemmatization
- stemming
Order of operations: tokenizer -> filter (1st call) -> filter (2nd call) -> ... -> filter (n-th call)
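A sketch of wiring term filters in that order; the NewCorpus constructor name and the stop-word list are illustrative:

// Sketch only: NewCorpus stands in for the package's actual constructor.
c := tfidf.NewCorpus(
	tfidf.WithTermFilters(
		// 1st: drop common stop words.
		tfidf.StopTermFilter([]string{"the", "a", "an", "and", "of"}),
		// 2nd: lowercase every remaining term.
		tfidf.TermFilterFunc(strings.ToLower),
		// 3rd: drop single-character terms.
		tfidf.WithMinLenTermFilter(2),
	),
)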
func WithTokenizer ¶
type PruneHook ¶
func PruneLessThan ¶
PruneLessThan is a PruneHook that removes terms that appear in less than the given number of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.
func PruneLessThanPercent ¶
PruneLessThanPercent is a PruneHook that removes terms that appear in less than the given percentage of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.
func PruneMoreThan ¶
PruneMoreThan is a PruneHook that removes terms that appear in more than the given number of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.
func PruneMoreThanPercent ¶
PruneMoreThanPercent is a PruneHook that removes terms that appear in more than the given percentage of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.
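A sketch combining prune hooks with a max vector size. The NewCorpus name, the WithMaxVectorSize and WithPruneHooks argument types, and the idea that PruneLessThan takes a document count while PruneMoreThanPercent takes a percentage are assumptions based on the descriptions above:

// Sketch only: the constructor name and exact signatures are assumptions.
c := tfidf.NewCorpus(
	tfidf.WithMaxVectorSize(4096),
	tfidf.WithPruneHooks(
		tfidf.PruneLessThan(2),         // drop terms that appear in fewer than 2 documents
		tfidf.PruneMoreThanPercent(80), // drop terms that appear in more than 80% of documents
	),
)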
type TermFilter ¶
func StopTermFilter ¶
func StopTermFilter(words []string) TermFilter
StopTermFilter removes stop words from the tokenizer iterator (i.e. ignores them).
func TermFilterFunc ¶
func TermFilterFunc(filter func(string) string) TermFilter
TermFilterFunc is a helper function that creates a TermFilter from a function that transforms a single term. If the filter returns an empty string, the term is skipped.
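For example, a sketch of a custom filter that lowercases terms and skips purely numeric tokens by returning an empty string:

// lowercaseSkipNumbers lowercases terms and skips all-digit tokens.
// Sketch only: assumes this package is imported as tfidf.
var lowercaseSkipNumbers = tfidf.TermFilterFunc(func(term string) string {
	for _, r := range term {
		if r < '0' || r > '9' {
			return strings.ToLower(term)
		}
	}
	return "" // all digits: skip the term
})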
func WithMaxLenTermFilter ¶
func WithMaxLenTermFilter(maxLen int) TermFilter
WithMaxLenTermFilter removes terms that are longer than the given length.
func WithMinLenTermFilter ¶
func WithMinLenTermFilter(minLen int) TermFilter
WithMinLenTermFilter removes terms that are shorter than the given length.
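A short sketch combining the two length filters so that only mid-length terms are kept (the bounds are arbitrary examples):

// Keep terms between 3 and 24 characters; shorter or longer terms are skipped.
// Sketch only: assumes this package is imported as tfidf.
var lengthFilters = tfidf.WithTermFilters(
	tfidf.WithMinLenTermFilter(3),
	tfidf.WithMaxLenTermFilter(24),
)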