background

A similar need in recent work is: There are about 300 million data dictionaries in a csv file A as a data source. For any word M entered by the user, it is necessary to quickly match whether the M word exists in A.

(A file is about 3G in size, and the total number of lines is 300 million)

Get this demand, what is your first thought?

The normal idea might be:

  1. Import csv file A into a relational database.
  2. sql query matches by M.

The obvious disadvantage of the above method is: slow!

More than 300 million rows of data, even if the index is built for retrieval, it will take a lot of time to match (I have not tried it myself, and interested friends can test and test by themselves, theoretically can’t get up).

At present, the best solution for time complexity and space complexity is probably Bloom Filter, wiki address: [Bloom Filter] (https://en.wikipedia.org/wiki/Bloom_filter)

Here for readers who don’t know much about Bloom Filter, familiar friends look directly at the next section.

This article scenario Bloom Filter uses the idea to explain:

  1. Suppose you have applied a large array of bit bits (that is, the elements in the array can only be one bit, 1 or 0, and the default element values ​​are 0)
  2. Each word in the csv file A is hashed by multiple hash functions to get multiple subscript positions corresponding to the large array.
  3. Set the bit bits of the multiple subscript positions obtained in step 2 to 1.
  4. For any word M input by the user, follow the steps of 2 to get multiple subscript positions, which are all corresponding to the value in the large array, otherwise it does not exist.

Scheme selection

There are many ways to implement Bloom Filter. There are various language versions. Here, in order to really feel the charm of the algorithm, I decided to use the java code to slap it!

On the other hand, considering the needs of distributed applications, it is obviously not appropriate to build Bloom Filter storage on stand-alone memory. Choose redis here.

Redis has the following operations, which can be used to implement bloomfilter:

Redis> SETBIT bit 10086 1
(integer) 0

Redis> GETBIT bit 10086
(integer) 1

Redis> GETBIT bit 100 # bit is initialized to 0 by default
(integer) 0

For details, please refer to: [redis setbit operation] (http://redisdoc.com/string/setbit.html)

Implementation details

The key to implementing the bloom filter is the hash function. Generally, in order to reduce the false positive rate and reduce the impact of the hash collision, multiple hash functions are selected.

So how do you write a hash function?

No, the hash we want is input: String –> output: int , does the String class in jdk not exactly have a hashCode method? Take a look at it!


    Public int hashCode() {
        Int h = hash;
        If (h == 0 && value.length > 0) {
            Char val[] = value;

            For (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        Return h;
    }

Seeing this line h = 31 * h + val[i];, the seeming principle is actually very simple, the ascii code corresponding to each character is added up by a formula. There is a coefficient 31 here, a little change, can you have more than one hash function?

The following is a slightly modified hash function:


    //Total bitmap size 64M
    Private static final int cap = 1 << 29;
    /*
     * Seeds of different hash functions, generally taking prime numbers
     * The seed array has a total of 8 values, which means that 8 different hash functions are used.
     */
    Private int[] seeds = new int[]{3, 5, 7, 11, 13, 31, 37, 61};

    Private int hash(String value, int seed) {
        Int result = 0;
        Int length = value.length();
        For (int i = 0; i < length; i++) {
            Result = seed * result + value.charAt(i);
        }

        Return (cap - 1) & result;
    }

The rest of the things are very simple. For each word in the dictionary A, the corresponding hash function in seeds is sequentially adjusted (there are a total of 8 here), and the subscript value is set to 1. by the setbit operation of redis.

Redis code (here wrapped in a pipeline.)

@Service
Public class RedisService {

@Autowired
    Private StringRedisTemplate template;

Public void multiSetBit(String name, boolean value, long... offsets) {
        template.executePipelined((RedisCallback) connection -> {

            For (long offset : offsets) {
                connection.setBit(name.getBytes(), offset, value);
            }
            Return null;
        });

    }

    Public List multiGetBit(String name, long... offsets) {

        List results = template.executePipelined((RedisCallback) connection -> {

            For (long offset : offsets) {
                connection.getBit(name.getBytes(), offset);
            }
            Return null;
        });

        List list = new ArrayList<>();

        results.forEach(obj -> {
            List.add((Boolean) obj);
        });

        Return list;
    }
}

Finally, the code string looks like this:

        FileInputStream inputStream = new FileInputStream("/XXXX.csv");
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));

        HashSet totalSet=new HashSet<>();

        String word=null;
        While((word = bufferedReader.readLine()) != null){
            For (int seed : seeds) {
                Int hash = hash(word, seed);
                totalSet.add((long) hash);
            }

            Long[] offsets = new long[totalSet.size()];

            Int i=0;
            For(Long l:totalSet){
                Offsets[i++]=l;
            }

            redisService.multiSetBit("BLOOM_FILTER_WORDS_DICTIONARY", true, offsets);

        }

The same is true when checking:

        String word = "XXXX"; //actual input

        Long[] offsets = new long[seeds.length];

        For (int i = 0; i < seeds.length; i++) {
            Int hash = hash(mobile, seeds[i]);
           
            Offsets[i] = hash;
        }

        List results = redisService.multiGetBit("BLOOM_FILTER_WORDS_DICTIONARY", offsets);

        / / Determine whether all are true (there is)

        Boolean isExisted=true;
        For(Boolean result:results){
            If(!result){
                isExisted=false;
                Break;
            }
        }

Precautions

Setbit's offset is limited by size. Between 0 and 232 (maximum use of 512M memory), that is, before 0~4294967296, more than this number will automatically convert offset to 0, so be careful when using it.

Last modified: 2019年3月29日

Author

Comments

Write a Reply or Comment

Your email address will not be published.