Background

A similar requirement came up in recent work: a CSV file A serves as the data source and contains roughly 300 million dictionary entries. For any word M entered by a user, we need to quickly determine whether M exists in A.

(File A is about 3 GB in size and contains roughly 300 million lines.)

Given this requirement, what is your first thought?

The usual idea might be:

  1. Import CSV file A into a relational database.
  2. Run a SQL query to look up M.

The obvious drawback of this approach: it is slow!

With more than 300 million rows of data, even with an index built for retrieval, a lookup takes a lot of time (I have not tried it myself; interested readers can test it on their own, but in theory the performance just can't keep up).

At present, the best trade-off between time complexity and space complexity is probably the Bloom filter; see the wiki: [Bloom Filter](https://en.wikipedia.org/wiki/Bloom_filter)

The following is a short explanation for readers who are not familiar with Bloom filters; those who already know them can skip straight to the next section.

Explained in terms of this article's scenario, the idea behind the Bloom filter is (a small in-memory sketch follows the list):

  1. Suppose you have allocated a large bit array (each element is a single bit, either 1 or 0; all elements default to 0).
  2. Hash each word in CSV file A with multiple hash functions to obtain multiple subscript positions in the large array.
  3. Set the bits at the subscript positions obtained in step 2 to 1.
  4. For any word M entered by the user, compute the subscript positions as in step 2. If the bits at all of those positions are 1, M probably exists in A; if any of them is 0, M definitely does not exist.
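
To make the idea concrete, here is a tiny in-memory sketch using java.util.BitSet with two toy hash functions standing in for "multiple hash functions" (purely illustrative; the real implementation later in this article stores the bits in Redis):

```java
import java.util.BitSet;

public class TinyBloom {
    private static final int SIZE = 1 << 20;       // 1M bits, all 0 by default
    private final BitSet bits = new BitSet(SIZE);

    // Two toy hash functions; a real filter would use more and better ones.
    private int h1(String s) { return Math.floorMod(s.hashCode(), SIZE); }
    private int h2(String s) { return Math.floorMod(31 * s.hashCode() + 7, SIZE); }

    // Steps 2 and 3: hash the word and set the corresponding bits to 1.
    public void add(String word) {
        bits.set(h1(word));
        bits.set(h2(word));
    }

    // Step 4: only if every bit is 1 can the word exist; a single 0 bit means it definitely does not.
    public boolean mightContain(String word) {
        return bits.get(h1(word)) && bits.get(h2(word));
    }
}
```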

Scheme selection

There are many existing Bloom filter implementations in all kinds of languages. Here, in order to really feel the charm of the algorithm, I decided to hand-write it in Java!

On the other hand, considering the needs of distributed applications, it is clearly not appropriate to keep the Bloom filter's storage in single-machine memory, so Redis is chosen here.

Redis provides bit operations on strings, namely setbit (and getbit for reading), which can be used to implement the Bloom filter's bit array.

For details, please refer to: [redis setbit operation](http://redisdoc.com/string/setbit.html)
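
For orientation, this is roughly how those commands look from Java with the Jedis client (the key name "demo" is made up):

```java
import redis.clients.jedis.Jedis;

public class SetBitDemo {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        // Set bit 7 of key "demo" to 1; the return value is the bit's previous value.
        boolean old = jedis.setbit("demo", 7L, true);
        // Read bit 7 back; true means the bit is 1.
        boolean bit = jedis.getbit("demo", 7L);
        System.out.println(old + " -> " + bit);
        jedis.close();
    }
}
```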

Implementation details

The key to implementing a Bloom filter is the hash function. Generally, multiple hash functions are used in order to reduce the false positive rate and lessen the impact of hash collisions.

So how do you write a hash function?

Well, the hash we want takes a String as input and produces an int as output. Doesn't the String class in the JDK have exactly such a hashCode method? Let's take a look at it!
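
For reference, this is the heart of that method as it appears in JDK 8 (lightly abridged):

```java
// java.lang.String#hashCode (JDK 8, abridged): value is the char[] backing the string
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char[] val = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
```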

Looking at the line `h = 31 * h + val[i];`, the underlying principle is actually very simple: the character code of each character is accumulated through a formula. Notice the coefficient 31 here; with a small change to it, can't we get more than one hash function?

The following is a slightly modified hash function:
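
The original snippet is not reproduced here; below is a minimal sketch of what such a seeded variant might look like, assuming the bit-array capacity `cap` and the multiplier `seed` are passed in (the class and field names are my own):

```java
// Sketch of a seeded JDK-style hash: the constant 31 becomes a configurable seed.
public class SimpleHash {
    private final int cap;   // size of the bit array, assumed to be a power of two
    private final int seed;  // multiplier that replaces 31

    public SimpleHash(int cap, int seed) {
        this.cap = cap;
        this.seed = seed;
    }

    // Maps a word to a bit position in [0, cap).
    public int hash(String value) {
        int result = 0;
        for (int i = 0; i < value.length(); i++) {
            result = seed * result + value.charAt(i);
        }
        return (cap - 1) & result;
    }
}
```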

The rest is very simple. For each word in dictionary A, each of the hash functions built from the seeds is called in turn (there are 8 in total here), and the resulting subscript is set to 1 via Redis's setbit operation.

Redis code (wrapped in a pipeline here):
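
The original snippet is not shown; here is a minimal sketch with the Jedis client, assuming the whole bit array lives under a single key (the key name and helper are my own):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class RedisBitSetter {
    private static final String KEY = "bloom:dict"; // assumed key name

    // Set all of the given bit offsets to 1 in a single round trip.
    public static void setBits(Jedis jedis, long[] offsets) {
        Pipeline pipeline = jedis.pipelined();
        for (long offset : offsets) {
            pipeline.setbit(KEY, offset, true);
        }
        pipeline.sync(); // flush the pipelined commands
    }
}
```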

Putting it together, the code looks like this:
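
Again, this is a sketch rather than the original code, assuming eight SimpleHash instances built from an array of seeds (the seed values and sizes below are placeholders to illustrate the structure):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class RedisBloomFilter {
    // 2^28 bits as an example; size the array according to your false-positive target.
    private static final int CAP = 1 << 28;
    // Eight different multipliers give eight hash functions (values are illustrative).
    private static final int[] SEEDS = {3, 5, 7, 11, 13, 31, 37, 61};
    private static final String KEY = "bloom:dict";

    private final SimpleHash[] hashes = new SimpleHash[SEEDS.length];

    public RedisBloomFilter() {
        for (int i = 0; i < SEEDS.length; i++) {
            hashes[i] = new SimpleHash(CAP, SEEDS[i]);
        }
    }

    // Add one word: set the bit at each of the eight hashed positions.
    public void add(Jedis jedis, String word) {
        Pipeline pipeline = jedis.pipelined();
        for (SimpleHash h : hashes) {
            pipeline.setbit(KEY, h.hash(word), true);
        }
        pipeline.sync();
    }
}
```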

The check works the same way:
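
A matching sketch for the lookup, again pipelining all eight getbit calls; only if every bit is 1 can the word exist, and a single 0 bit means it definitely does not:

```java
// Continuing the RedisBloomFilter sketch above (additionally needs
// java.util.ArrayList, java.util.List and redis.clients.jedis.Response).
public boolean mightContain(Jedis jedis, String word) {
    Pipeline pipeline = jedis.pipelined();
    List<Response<Boolean>> bits = new ArrayList<>();
    for (SimpleHash h : hashes) {
        bits.add(pipeline.getbit(KEY, h.hash(word)));
    }
    pipeline.sync(); // send all eight GETBITs in one round trip
    for (Response<Boolean> bit : bits) {
        if (!bit.get()) {
            return false; // definitely not in A
        }
    }
    return true; // probably in A (false positives are possible)
}
```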

Precautions

Setbit's offset is limited in range: it must lie between 0 and 2^32 (using at most 512 MB of memory per key), i.e. within 0~4294967295. Offsets beyond this will wrap back around to 0, so be careful when using it.

Last modified: March 29, 2019
