
Redis Cluster-Based Distributed Caching, Part 4: Eviction Algorithms

The algorithms discussed here are memory eviction algorithms. When Redis is used as an application-level cache and its memory usage exceeds the configured limit, it evicts key-value pairs according to the configured policy, keeping enough memory free to store new data.

Algorithm Overview

Redis provides six eviction policies:

  • noeviction: the default; never evict
  • volatile-lru: evict the least recently used key from the set of keys with an expire set (server.db[i].expires)
  • volatile-ttl: evict the key closest to expiring from the set of keys with an expire set (server.db[i].expires)
  • volatile-random: evict a random key from the set of keys with an expire set (server.db[i].expires)
  • allkeys-lru: evict the least recently used key from the whole keyspace (server.db[i].dict)
  • allkeys-random: evict a random key from the whole keyspace (server.db[i].dict)

These six policies cover most application scenarios. Which one to choose depends on the actual access pattern of your business.
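The policy is selected through configuration. A minimal redis.conf fragment (the values shown are illustrative, not recommendations):

```conf
# upper bound on keyspace memory; leave headroom below physical RAM
# for slave output buffers and the AOF buffer (see the source analysis below)
maxmemory 100mb

# one of the six policies listed above
maxmemory-policy allkeys-lru

# number of keys sampled per eviction attempt
maxmemory-samples 3
```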

Source Code Analysis

Eviction is implemented by the freeMemoryIfNeeded function in redis.c. When maxmemory is set, this function is called before every command execution to check whether enough memory is available, freeing memory or returning an error as needed. If it cannot free enough memory, the main command path rejects any command flagged with REDIS_CMD_DENYOOM, replying with the error "command not allowed when used memory > 'maxmemory'". The source is as follows:

int freeMemoryIfNeeded(void) {
    size_t mem_used, mem_tofree, mem_freed;
    int slaves = listLength(server.slaves);

    /* Remove the size of slaves output buffers and AOF buffer from the
     * count of used memory. */
    // Slave output buffers and the AOF buffer are excluded from the
    // used-memory count, so maxmemory should be set below the actual
    // available memory to leave enough room for these two buffers.
    mem_used = zmalloc_used_memory();
    if (slaves) {
        listIter li;
        listNode *ln;

        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            redisClient *slave = listNodeValue(ln);
            unsigned long obuf_bytes = getClientOutputBufferMemoryUsage(slave);
            if (obuf_bytes > mem_used)
                mem_used = 0;
            else
                mem_used -= obuf_bytes;
        }
    }
    if (server.appendonly) {
        mem_used -= sdslen(server.aofbuf);
        mem_used -= sdslen(server.bgrewritebuf);
    }

    /* Check if we are over the memory limit. */
    if (mem_used <= server.maxmemory) return REDIS_OK;

    if (server.maxmemory_policy == REDIS_MAXMEMORY_NO_EVICTION)
        return REDIS_ERR; /* We need to free memory, but policy forbids. */

    /* Compute how much memory we need to free. */
    mem_tofree = mem_used - server.maxmemory;
    mem_freed = 0;
    while (mem_freed < mem_tofree) {
        int j, k, keys_freed = 0;

        for (j = 0; j < server.dbnum; j++) {
            long bestval = 0; /* just to prevent warning */
            sds bestkey = NULL;
            struct dictEntry *de;
            redisDb *db = server.db+j;
            dict *dict;

            if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
                server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM)
            {
                dict = server.db[j].dict;
            } else {
                dict = server.db[j].expires;
            }
            if (dictSize(dict) == 0) continue;

            /* volatile-random and allkeys-random policy */
            if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM ||
                server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_RANDOM)
            {
                de = dictGetRandomKey(dict);
                bestkey = dictGetEntryKey(de);
            } // for the random policies, simply pick one random key from the dict

            /* volatile-lru and allkeys-lru policy */
            else if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
                server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
            {
                for (k = 0; k < server.maxmemory_samples; k++) {
                    sds thiskey;
                    long thisval;
                    robj *o;

                    de = dictGetRandomKey(dict);
                    thiskey = dictGetEntryKey(de);
                    /* When policy is volatile-lru we need an additional lookup
                     * to locate the real key, as dict is set to db->expires. */
                    if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
                        de = dictFind(db->dict, thiskey); // db->expires does not record a key's last access time
                    o = dictGetEntryVal(de);
                    thisval = estimateObjectIdleTime(o);

                    /* Higher idle time is better candidate for deletion */
                    if (bestkey == NULL || thisval > bestval) {
                        bestkey = thiskey;
                        bestval = thisval;
                    }
                } // To keep the cost low, Redis's LRU (like its expire cycle) is
                  // approximate rather than exact: it samples maxmemory_samples
                  // keys (default 3) from the dict and evicts the least recently
                  // used among them.
            }

            /* volatile-ttl */
            else if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_TTL) {
                for (k = 0; k < server.maxmemory_samples; k++) {
                    sds thiskey;
                    long thisval;

                    de = dictGetRandomKey(dict);
                    thiskey = dictGetEntryKey(de);
                    thisval = (long) dictGetEntryVal(de);

                    /* Expire sooner (minor expire unix timestamp) is better
                     * candidate for deletion */
                    if (bestkey == NULL || thisval < bestval) {
                        bestkey = thiskey;
                        bestval = thisval;
                    }
                } // note that volatile-ttl works the same way: it also picks the
                  // best candidate out of maxmemory_samples sampled keys
            }

            /* Finally remove the selected key. */
            if (bestkey) {
                long long delta;

                robj *keyobj = createStringObject(bestkey,sdslen(bestkey));
                propagateExpire(db,keyobj); // propagate the DEL to the slaves
                /* We compute the amount of memory freed by dbDelete() alone.
                 * It is possible that actually the memory needed to propagate
                 * the DEL in AOF and replication link is greater than the one
                 * we are freeing removing the key, but we can't account for
                 * that otherwise we would never exit the loop.
                 *
                 * AOF and Output buffer memory will be freed eventually so
                 * we only care about memory used by the key space. */
                delta = (long long) zmalloc_used_memory();
                dbDelete(db,keyobj);
                delta -= (long long) zmalloc_used_memory();
                mem_freed += delta;
                server.stat_evictedkeys++;
                decrRefCount(keyobj);
                keys_freed++;

                /* When the memory to free starts to be big enough, we may
                 * start spending so much time here that is impossible to
                 * deliver data to the slaves fast enough, so we force the
                 * transmission here inside the loop. */
                if (slaves) flushSlavesOutputBuffers();
            }
        } // iterate over every db once, then check whether the freed space is enough
        if (!keys_freed) return REDIS_ERR; /* nothing to free... */
    }
    return REDIS_OK;
}

Eviction Algorithm Internals

Definitions related to the eviction algorithm in redis.h:

/* The actual Redis Object */
#define REDIS_LRU_BITS 24
#define REDIS_LRU_CLOCK_MAX ((1<<REDIS_LRU_BITS)-1) /* Max value of obj->lru */
#define REDIS_LRU_CLOCK_RESOLUTION 1000 /* LRU clock resolution in ms */
typedef struct redisObject {
    unsigned type:4;             // type of the stored object
    unsigned encoding:4;         // encoding of the object's contents
    // stored as an offset relative to server.lruclock
    unsigned lru:REDIS_LRU_BITS; /* lru time (relative to server.lruclock) */
    int refcount;                // reference counter for memory management
    void *ptr;                   // pointer to the actual data
} robj;

/* Macro used to obtain the current LRU clock.
 * If the current resolution is lower than the frequency we refresh the
 * LRU clock (as it should be in production servers) we return the
 * precomputed value, otherwise we need to resort to a function call. */
#define LRU_CLOCK() ((1000/server.hz <= REDIS_LRU_CLOCK_RESOLUTION) ? server.lruclock : getLRUClock())
unsigned int getLRUClock(void) {
    return (mstime()/REDIS_LRU_CLOCK_RESOLUTION) & REDIS_LRU_CLOCK_MAX;
}


Take the simplest case, the allkeys-lru policy. In this newer version of the code, eviction maintains a small pool of candidate keys (REDIS_EVICTION_POOL_SIZE entries, 16 by default): sampled keys enter the pool ranked by idle time, and the entry with the longest idle time is evicted first.

/* volatile-lru and allkeys-lru policy */
else if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
    server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
{
    struct evictionPoolEntry *pool = db->eviction_pool;

    while(bestkey == NULL) {
        evictionPoolPopulate(dict, db->dict, db->eviction_pool);
        /* Go backward from best to worst element to evict. */
        for (k = REDIS_EVICTION_POOL_SIZE-1; k >= 0; k--) {
            if (pool[k].key == NULL) continue;
            de = dictFind(dict,pool[k].key);

            /* Remove the entry from the pool. */
            sdsfree(pool[k].key);
            /* Shift all elements on its right to left. */
            memmove(pool+k,pool+k+1,
                sizeof(pool[0])*(REDIS_EVICTION_POOL_SIZE-k-1));
            /* Clear the element on the right which is empty
             * since we shifted one position to the left. */
            pool[REDIS_EVICTION_POOL_SIZE-1].key = NULL;
            pool[REDIS_EVICTION_POOL_SIZE-1].idle = 0;

            /* If the key exists, is our pick. Otherwise it is
             * a ghost and we need to try the next element. */
            if (de) {
                bestkey = dictGetKey(de);
                break;
            } else {
                /* Ghost... */
                continue;
            }
        }
    }
}

estimateObjectIdleTime returns an object's approximate idle time, computed by comparing the object's lru field against the current LRU clock:

/* Given an object returns the min number of milliseconds the object was never
 * requested, using an approximated LRU algorithm. */
unsigned long long estimateObjectIdleTime(robj *o) {
    unsigned long long lruclock = LRU_CLOCK();
    if (lruclock >= o->lru) {
        return (lruclock - o->lru) * REDIS_LRU_CLOCK_RESOLUTION;
    } else {
        return (lruclock + (REDIS_LRU_CLOCK_MAX - o->lru)) *
               REDIS_LRU_CLOCK_RESOLUTION;
    }
}

Testing and Verification

Before putting Redis into production, stability and reliability testing matter even more than raw speed. The tests below exercise the eviction policies; performance testing is covered separately in the upcoming performance article.

Setup

As in "Redis Cluster-Based Distributed Cache Deployment, Part 3", maxmemory is set to 100M so the service does not crash under high concurrency. Then set maxmemory-policy to the eviction policy under test.

Test Code

The test program is written in Java; please excuse the rough code structure.

package jcache;

import jcache.clients.jcachecluster.base.JCacheClient;
import jcache.clients.jcachecluster.common.PropertiesConst;
import jcache.clients.jcachecluster.factory.CacheFactorySingle;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class JcacheClusterTPS {
    private static String HostAndPort =
            "10.128.31.104:7000;" +
            "10.128.31.104:7001;" +
            "10.128.31.105:7000;" +
            "10.128.31.105:7001;" +
            "10.128.31.109:7000;" +
            "10.128.31.109:7001";

    public static void main(String[] args) {
        final JCacheClient cluster = getCluster();
        final int nThreads = args.length >= 1 ? Integer.parseInt(args[0]) : 8;
        final int sendNumOnceTime = args.length >= 2 ? Integer.parseInt(args[1]) : 100;
        final int keySize = args.length >= 3 ? Integer.parseInt(args[2]) : 10;
        final int messageSize = args.length >= 4 ? Integer.parseInt(args[3]) : 10;
        final int times = args.length >= 5 ? Integer.parseInt(args[4]) : 1;
        final int setOrGet = args.length >= 6 ? Integer.parseInt(args[5]) : 0;
        final String keyParam = args.length >= 7 ? args[6] : "0";
        final AtomicLong atomicSuccessNums = new AtomicLong(0);
        final List<Long> tpsList = new ArrayList<Long>();
        final String msg = buildMessage(messageSize);
        final List<String> keys = getKeys(keySize, nThreads * sendNumOnceTime, times);
        doRun(cluster, tpsList, atomicSuccessNums, nThreads, times, sendNumOnceTime,
                keys, msg, keySize, messageSize, setOrGet, keyParam);
    }

    private static JCacheClient getCluster() {
        Properties properties = new Properties();
        properties.put(PropertiesConst.Keys.HOST_AND_PORT, HostAndPort);
        properties.put(PropertiesConst.Keys.AUTH_KEY, "13F455A8E9DC2BBEBE1BD906C82B3C0A1");
        properties.put(PropertiesConst.Keys.NAMESPACE, "weidian-1");
        return CacheFactorySingle.createJCacheClient(properties);
    }

    private static void doRun(final JCacheClient cluster,
            final List<Long> tpsList, final AtomicLong atomicSuccessNums,
            final int nThreads, final int times, final int sendNumOnceTime,
            final List<String> keys, final String msg, final int keySize, final int messageSize,
            final int setOrGet, final String keyParam) {
        final AtomicLong atomicFailNum = new AtomicLong(0);
        for (int time = 0; time < times; time++) {
            final Object object = new Object();
            synchronized (object) {
                final int t = time + 1;
                final ExecutorService exec = Executors.newCachedThreadPool();
                final long startCurrentTimeMillis = System.currentTimeMillis();
                // The barrier groups the worker threads: once all of them have
                // finished, the barrier action below runs and wakes the main thread.
                final CyclicBarrier barrier = new CyclicBarrier(nThreads, new Runnable() {
                    public void run() {
                        synchronized (object) {
                            long endCurrentTimeMillis = System.currentTimeMillis();
                            long sendNums = nThreads * sendNumOnceTime;
                            long escapedTimeMillis = endCurrentTimeMillis - startCurrentTimeMillis;
                            long tps = sendNums * 1000 / escapedTimeMillis;
                            String type = "set";
                            if (setOrGet != 0) {
                                type = "get";
                            }
                            tpsList.add(tps);
                            System.out.printf("Round %d done in %d ms, threads: %d, key size: %d, msg size: %d, sent: %d, succeeded: %d, failed: %d, op: %s, TPS: %d !!!",
                                    t, escapedTimeMillis, nThreads, keySize, messageSize, sendNums,
                                    atomicSuccessNums.intValue(), atomicFailNum.intValue(), type, tps);
                            exec.shutdown();
                            object.notify();
                            System.out.println();
                        }
                    }
                });
                for (int i = 0; i < nThreads; i++) {
                    final String finalI = "i" + i;
                    exec.execute(new Runnable() {
                        public void run() {
                            try {
                                for (int j = 0; j < sendNumOnceTime; j++) {
                                    if (setOrGet == 0) {
                                        String key = "ke" + keyParam + finalI + "j" + j;
                                        try {
                                            String resp = cluster.set(key, msg);
                                            if (!"OK".equals(resp)) {
                                                atomicFailNum.incrementAndGet();
                                            } else {
                                                atomicSuccessNums.incrementAndGet();
                                            }
                                        } catch (Exception e) {
                                            atomicFailNum.incrementAndGet();
                                        }
                                    } else {
                                        cluster.get(keys.get(atomicSuccessNums.intValue()));
                                    }
                                }
                                barrier.await();
                            } catch (Exception e) {
                                try {
                                    barrier.await();
                                } catch (InterruptedException e1) {
                                    e1.printStackTrace();
                                } catch (BrokenBarrierException e1) {
                                    e1.printStackTrace();
                                }
                                e.printStackTrace();
                            }
                        }
                    });
                }
                try {
                    object.wait();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
        long sum = 0;
        for (Long tps : tpsList) {
            sum += tps;
        }
        System.out.printf("All rounds done, average TPS: %d !!!", sum / tpsList.size());
    }

    private static String buildMessage(final int messageSize) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < messageSize; i += 8) {
            sb.append("hello baby");
        }
        return sb.toString();
    }

    private static List<String> getKeys(int keySize, int keys, int times) {
        System.out.println(">>>>>>>>>>>>>> generating keys >>>>>>>>>>>>>>");
        List<String> keysList = new ArrayList<String>();
        // for (int i = 0; i < keys * times; i++) {
        //     keysList.add(getUId(keySize, i));
        // }
        System.out.println(">>>>>>>>>>>>>> keys generated >>>>>>>>>>>>>>");
        return keysList;
    }

    private static String getUId(int keySize, int i) {
        return String.format("%0" + keySize + "d", i);
    }

    private static String getUUId() {
        return UUID.randomUUID().toString();
    }

}

Test Command

java -cp jedis-2.9.0.jar:commons-pool2-2.3.jar: RedisClusterDataCorrect 100 10000 10 10 1 0

Parameters

  • 1st argument: number of threads
  • 2nd argument: number of operations per thread
  • 3rd argument: key size in bytes
  • 4th argument: value size in bytes
  • 5th argument: number of rounds
  • 6th argument: operation type, 0 for set, 1 for get

noeviction policy

Test case

Each entry is 1 KB; 200 MB of data is written, i.e. 204800 entries are written and read.

java  -cp  jedis-2.9.0.jar:commons-pool2-2.3.jar: RedisClusterStrategy  1024  204800

Test result

Once the cluster's memory usage exceeds the configured maximum, the test program throws an exception; the noeviction policy evicts nothing. (screenshot omitted)

Redis key count after the run: (screenshot omitted)

allkeys-lru policy

Test case

Each entry is 1 KB; 200 MB of data is written, i.e. 204800 entries are written and read.

java  -cp  jedis-2.9.0.jar:commons-pool2-2.3.jar: RedisClusterStrategy  1024  204800

Test result

Once memory usage reaches the maximum, older data is evicted to make room, so the whole test run completes.

Redis key count after the run: (screenshot omitted)

volatile-lru policy

Test case

Each entry is 1 KB; 200 MB of data is written, i.e. 204800 entries are written and read.

java  -cp  jedis-2.9.0.jar:commons-pool2-2.3.jar: RedisClusterStrategy  1024  204800

Test result
  • Once memory usage reaches the maximum, if keys without an expire time keep being added, the program throws an exception. (screenshot omitted)
  • After adjusting the test case so that only the first 100 entries have no expire time and all later entries do, the run completes normally.

Redis key count after the run: (screenshot omitted)

allkeys-random policy

Test case

Each entry is 1 KB; 200 MB of data is written, i.e. 204800 entries are written and read.

java  -cp  jedis-2.9.0.jar:commons-pool2-2.3.jar: RedisClusterStrategy  1024  204800

Test result

Once memory usage reaches the maximum, existing data is evicted at random to make room, so the whole test run completes.

Redis key count after the run: (screenshot omitted)

volatile-random policy

Test case

Each entry is 1 KB; 200 MB of data is written, i.e. 204800 entries are written and read.

java  -cp  jedis-2.9.0.jar:commons-pool2-2.3.jar: RedisClusterStrategy  1024  204800

Test result
  • Once memory usage reaches the maximum, if keys without an expire time keep being added, the program throws an exception. (screenshot omitted)
  • After adjusting the test case so that only the first 100 entries have no expire time and all later entries do, the run completes normally.

Redis key count after the run: (screenshot omitted)

volatile-ttl policy

Test case

Each entry is 1 KB; 200 MB of data is written, i.e. 204800 entries are written and read.

java  -cp  jedis-2.9.0.jar:commons-pool2-2.3.jar: RedisClusterStrategy  1024  204800

Test result

The test case was adjusted so that expire times start at 3600 seconds and increase gradually. Once memory usage reaches the maximum, the keys closest to expiring are removed first, so the whole run completes.

Redis key count after the run: (screenshot omitted)

Summary

  • Before enabling an eviction policy, set the maxmemory option, i.e. the node's maximum memory usage.
  • Choose an eviction policy based on your caching scenario. Where cached data backs persistent data and must not be dropped silently, noeviction is generally the recommended setting.
  • Depending on the workload, lower the AOF auto-aof-rewrite-percentage appropriately so the rewrite triggers at a smaller size; this prevents a rewrite from consuming a large amount of extra memory and hanging the process.