It is a data structure that allows you to verify belonging of an element to the set, without storing this element itself. There is a possibility of false-positive response, but a negative response is always reliable. In the filter with 1% accuracy, only a few bits per cell are required.
This structure is often used to limit the number of requests to the data store, cutting off the requests to the not available elements. In addition, it can be used for approximate calculation of the number of unique events, users, views, etc.
However, there are some drawbacks that can restrain the web developers from the use of Bloom filter.
Storage
All web applications on all servers must have access to a common set of data with the ability to quickly change or check out a few bits in the big bit vector. In addition, the data must be stored securely on the disk.
Hashing
Most implementations use 32-bit murmur hash. There is no evidence that such an approach does not increase the asymptotic probability of false positives.
In addition, if you are using a 32-bit hash, it defines the upper boundary of the filter size of 512 megabytes. This is not so little, but it is also not so much, especially if you want a very low rate of false positives and / or an impressive number of elements in the filter.
There are not many ways to get hash functions with values in the desired range so that they would use only the information entropy of the original message:
- Cut output of one wide enough functions into equal parts. This function must have a parametric width of output in bits.
- Have only two independent hash functions of variable width h1 and h2 and form derivatives of gi function according to the formula h1 + h2 * i.
Performance
To find use in containing requests to the storage, this construction must reply faster than operated storage.
When implementing it on interpreted languages that web developers often operate on, it can be difficult, and even unnecessary.
The proposed solution
There is a solution that allows you to conveniently integrate Bloom filters in the infrastructure of the web application.
This is a server container which stores Bloom filter in memory and on disk. Access to the operations of the filter is provided through HTTP.
Persistence of storage is provided with snapshots, not blocking the daemon. Snapshot is written into a temporary file with a non-trivial name and replaces the old one only after recording has been finished.
HTTP-interface is selected for ease of integration on the development side: most development platforms have developed tools for the implementation of synchronous, asynchronous and parallel (curl-multi) HTTP-requests. Despite the simplicity of the interface, it has adopted all the advantages of HTTP standard and allows pipelining of requests within a single connection.
Alternative solutions
There are libraries that work on the application side and contain general data in Redis. They manipulate bits in the bit maps of Redis by SETBIT and GETBIT operations.
Their pros:
- Calculation of the hash happens on the side of the web application, and it does not load the server computing work.
- They use Redis, which has a rich set of features, the possibility of scripting and clustering.
Their cons:
- Redis limits the length of the bit field with 512 megabytes. This limits the size of the filter.
- These solutions are not suitable for multilingual projects: implementation of hashes on different languages happens in its own way, and such filters are not compatible with each other.
- Performance. These solutions use two strategies to work with Redis: bits for each key are set through the pipeline, or a stored in Redis procedure, which calculates the hash and gives the command is used so the speed of operations significantly slows down.
However, the situation may be improved with the appearance of BITFIELD command in Redis 3.2.0, which was released in the May 2016 year. This command allows making several operations on the bits in the same command.
Follow me, to learn more about popular science, math, and technologies
With Love,
Kate
This post has been ranked within the top 10 most undervalued posts in the first half of Nov 25. We estimate that this post is undervalued by $10.42 as compared to a scenario in which every voter had an equal say.
See the full rankings and details in The Daily Tribune: Nov 25 - Part I. You can also read about some of our methodology, data analysis and technical details in our initial post.
If you are the author and would prefer not to receive these comments, simply reply "Stop" to this comment.