bloom

package module
v0.0.0-...-6e045c4 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 9, 2025 License: BSD-2-Clause, MIT Imports: 9 Imported by: 0

README

Atomic Bloom Filters

This library is a fork of Bits and Blooms that uses an alternative backing bitset based on Go's sync/atomic.Int64 rather than a bare slice of integers. This allows for concurrent addition and testing of filters without creating memory safety issues or race conditions by leveraging hardware support for atomic Load and Or operations on Int64s.

A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether an item is a member of a set. A Bloom filter will always correctly report the presence of an element in the set when the element is indeed present. A Bloom filter can use much less storage than the original set, but it allows for some 'false positives': it may sometimes report that an element is in the set whereas it is not.

When you construct, you need to know how many elements you have (the desired capacity), and what is the desired false positive rate you are willing to tolerate. A common false-positive rate is 1%. The lower the false-positive rate, the more memory you are going to require. Similarly, the higher the capacity, the more memory you will use. You may construct the Bloom filter capable of receiving 1 million elements with a false-positive rate of 1% in the following manner.

    filter := bloom.NewWithEstimates(1000000, 0.01) 

You should call NewWithEstimates conservatively: if you specify a number of elements that it is too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure: you must know ahead of time what your desired capacity is.

Our implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

    filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

    if filter.Test([]byte("Love"))

For numerical data, we recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

    i := uint32(100)
    n1 := make([]byte, 4)
    binary.BigEndian.PutUint32(n1, i)
    filter.Add(n1)

Godoc documentation: https://pkg.go.dev/github.com/ericvolp12/atomic-bloom

Installation

go get -u github.com/ericvolp12/atomic-bloom

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by an Atomic BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Goroutine safety

This implementation of Bloom Filters is safe to call from concurrent goroutines so long as your CPU supports Atomic Int64 (Most mainstream x86_64 and ARM systems do).

You can Test from and Add to the same filter to your heart's content.

Documentation

Overview

Package bloom provides data structures and methods for creating Bloom filters.

A Bloom filter is a representation of a set of _n_ items, where the main requirement is to make membership queries; _i.e._, whether an item is a member of a set.

This implementation uses an atomic bitset for thread-safety.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func EstimateFalsePositiveRate

func EstimateFalsePositiveRate(m, k, n uint) (fpRate float64)

EstimateFalsePositiveRate estimates the empirical false positive rate. Uses a temporary filter.

func EstimateParameters

func EstimateParameters(n uint, p float64) (m uint, k uint)

EstimateParameters estimates requirements for m and k.

func Locations

func Locations(data []byte, k uint) []uint64

Locations returns a list of hash locations representing a data item. This function remains independent of the bitset implementation.

Types

type BloomFilter

type BloomFilter struct {
	// contains filtered or unexported fields
}

A BloomFilter is a representation of a set of _n_ items, where the main requirement is to make membership queries; _i.e._, whether an item is a member of a set.

func From

func From(data []int64, k uint) *BloomFilter

From creates a new Bloom filter with len(_data_) * 64 bits and _k_ hashing functions, initialized with the provided data.

func FromWithM

func FromWithM(data []int64, m, k uint) *BloomFilter

FromWithM creates a new Bloom filter with _m_ length, _k_ hashing functions, initialized with the provided data.

func New

func New(m uint, k uint) *BloomFilter

New creates a new Bloom filter with _m_ bits and _k_ hashing functions We force _m_ and _k_ to be at least one to avoid panics.

func NewWithEstimates

func NewWithEstimates(n uint, fp float64) *BloomFilter

NewWithEstimates creates a new Bloom filter for about n items with fp false positive rate

func (*BloomFilter) Add

func (f *BloomFilter) Add(data []byte) *BloomFilter

Add data to the Bloom Filter. Returns the filter (allows chaining)

func (*BloomFilter) AddHash

func (f *BloomFilter) AddHash(h [4]uint64) *BloomFilter

Add precomputed hash values to the Bloom Filter. Returns the filter (allows chaining)

func (*BloomFilter) AddString

func (f *BloomFilter) AddString(data string) *BloomFilter

AddString adds a string to the Bloom Filter.

func (*BloomFilter) ApproximatedSize

func (f *BloomFilter) ApproximatedSize() int64

ApproximatedSize estimates the number of items added to the filter.

func (*BloomFilter) BitSet

func (f *BloomFilter) BitSet() *atomicBitSet

BitSet returns the underlying atomic bitset for this filter.

func (*BloomFilter) Cap

func (f *BloomFilter) Cap() uint

Cap returns the capacity, _m_, of a Bloom filter

func (*BloomFilter) ClearAll

func (f *BloomFilter) ClearAll() *BloomFilter

ClearAll clears all the data in a Bloom filter.

func (*BloomFilter) Copy

func (f *BloomFilter) Copy() *BloomFilter

Copy creates a copy of a Bloom filter.

func (*BloomFilter) Equal

func (f *BloomFilter) Equal(g *BloomFilter) bool

Equal tests for the equality of two Bloom filters

func (*BloomFilter) GobDecode

func (f *BloomFilter) GobDecode(data []byte) error

GobDecode implements gob.GobDecoder interface.

func (*BloomFilter) GobEncode

func (f *BloomFilter) GobEncode() ([]byte, error)

GobEncode implements gob.GobEncoder interface.

func (*BloomFilter) K

func (f *BloomFilter) K() uint

K returns the number of hash functions used in the BloomFilter

func (*BloomFilter) MarshalBinary

func (f *BloomFilter) MarshalBinary() ([]byte, error)

MarshalBinary implements encoding.BinaryMarshaler interface.

func (BloomFilter) MarshalJSON

func (f BloomFilter) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler interface.

func (*BloomFilter) Merge

func (f *BloomFilter) Merge(g *BloomFilter) error

Merge the data from another Bloom Filter. Returns error if parameters don't match.

func (*BloomFilter) ReadFrom

func (f *BloomFilter) ReadFrom(stream io.Reader) (int64, error)

ReadFrom reads a binary representation of the BloomFilter from an i/o stream.

func (*BloomFilter) Test

func (f *BloomFilter) Test(data []byte) bool

Test returns true if the data is *probably* in the BloomFilter, false otherwise.

func (*BloomFilter) TestAndAdd

func (f *BloomFilter) TestAndAdd(data []byte) bool

TestAndAdd checks membership and adds the data unconditionally. Returns true if the element was *probably* present before adding.

func (*BloomFilter) TestAndAddString

func (f *BloomFilter) TestAndAddString(data string) bool

TestAndAddString is the string version of TestAndAdd.

func (*BloomFilter) TestHash

func (f *BloomFilter) TestHash(h [4]uint64) bool

TestHash returns true if the hash is *probably* in the BloomFilter.

func (*BloomFilter) TestLocations

func (f *BloomFilter) TestLocations(locs []uint64) bool

TestLocations returns true if all locations are set in the BloomFilter.

func (*BloomFilter) TestOrAdd

func (f *BloomFilter) TestOrAdd(data []byte) bool

TestOrAdd checks membership and adds the data only if not present. Returns true if the element was *probably* present before adding. Note: Due to the nature of atomics, this isn't truly conditional on *all* bits being present beforehand if run concurrently. It ensures each bit is set if it wasn't already.

func (*BloomFilter) TestOrAddString

func (f *BloomFilter) TestOrAddString(data string) bool

TestOrAddString is the string version of TestOrAdd.

func (*BloomFilter) TestString

func (f *BloomFilter) TestString(data string) bool

TestString returns true if the string is *probably* in the BloomFilter.

func (*BloomFilter) UnmarshalBinary

func (f *BloomFilter) UnmarshalBinary(data []byte) error

UnmarshalBinary implements encoding.BinaryUnmarshaler interface.

func (*BloomFilter) UnmarshalJSON

func (f *BloomFilter) UnmarshalJSON(data []byte) error

UnmarshalJSON implements json.Unmarshaler interface.

func (*BloomFilter) WriteTo

func (f *BloomFilter) WriteTo(stream io.Writer) (int64, error)

WriteTo writes a binary representation of the BloomFilter to an i/o stream.

type Digest128

type Digest128 struct {
	// contains filtered or unexported fields
}

Digest128 represents a partial evaluation of a 128 bites hash.

func (*Digest128) Sum128

func (d *Digest128) Sum128(pad_tail bool, length uint, tail []byte) (h1, h2 uint64)

Sum128 computers two 64-bit hash value. It is assumed that bmix was first called on the data to process complete blocks of 16 bytes. The 'tail' is a slice representing the 'tail' (leftover elements, fewer than 16). If pad_tail is true, we make it seem like there is an extra element with value 1 appended to the tail. The length parameter represents the full length of the data (including the blocks of 16 bytes, and, if pad_tail is true, an extra byte).

func (*Digest128) Sum256

func (d *Digest128) Sum256(data []byte) (hash1, hash2, hash3, hash4 uint64)

Sum256 will compute 4 64-bit hash values from the input. It is designed to never allocate memory on the heap. So it works without any byte buffer whatsoever. It is designed to be strictly equivalent to

			a1 := []byte{1}
         hasher := murmur3.New128()
         hasher.Write(data) // #nosec
         v1, v2 := hasher.Sum128()
         hasher.Write(a1) // #nosec
         v3, v4 := hasher.Sum128()

See TestHashRandom.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL