peeping into memcached

Memcached is generally treated as a black box. But what if you really need to know what’s in there? Not for runtime purposes, but for optimization and capacity planning?

demo

$ sudo peep --pretty 2479

time | exptime | nbytes | clsid |                 key | exprd | flushd
8658 |  613458 |    272 |     5 |      "c3RhdH:171:5" | false |  false
8658 |       0 |      6 |     1 |  "current_c3RhdH:3" | false |  false
8658 |  613458 |    281 |     5 |     "c3RhdH:171:26" | false |  false
8678 |   95078 |      6 |     1 |  "User:1:auth:m4Uq" | false |  false
8658 |       0 |      8 |     2 |   "user_dGltZWxp:4" | false |  false
8686 |  613486 |   1278 |     9 |          "User:1:6" | false |  false
8658 |  613458 |   1286 |     9 |          "User:1:4" | false |  false
...
8658 |  613458 |    283 |     5 |     "c3RhdH:171:28" | false |  false
8658 |  613458 |    277 |     5 |     "c3RhdH:171:30" | false |  false

Peep uses ptrace to freeze a running memcached server, dump the internal key metadata, and return the server to a running state. If you have a good host ejection mechanism in your client, such as in the Twitter libmemcached builds, you won’t even have to change the production server pool. The instance is not restarted, and no data is lost.

Installation instructions are here. Requires Linux.

a note about heap management

Memcached has two separate memory management strategies:

  • On read, if a key is past its expiry time, return NOT FOUND.
  • On write, choose an appropriate slab class for the value; if it’s full, replace the oldest-used (read or written) key with the new one.

Note that the second strategy, LRU eviction, does not check the expiry time at all. This keeps eviction O(1), because the tail of the LRU linked list is immediately available; searching that list for expired keys first would raise the cost to O(N). This optimization has subtle implications.
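To make the interaction between the two strategies concrete, here is a toy model of one slab class (names and structure invented; the real memcached is C and far more involved). Reads honor expiry lazily; writes evict the LRU tail without looking at its expiry time:

```python
from collections import OrderedDict

class TinySlab:
    """Toy sketch of one memcached slab class. Reads lazily expire items;
    writes evict the LRU tail without checking whether it has expired."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # key -> (value, expires_at); end = most recent

    def get(self, key, now):
        item = self.items.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at and now >= expires_at:
            del self.items[key]         # lazy expiry: only discovered on read
            return None
        self.items.move_to_end(key)     # mark as most recently used
        return value

    def set(self, key, value, expires_at=0):
        if key not in self.items and len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict LRU tail, expired or not
        self.items[key] = (value, expires_at)
        self.items.move_to_end(key)
```

Note that a fresh-but-cold key can be evicted while an expired-but-warmer key survives, which is exactly the pathology described below.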

the problem

In the bad old days, we added a page cache to Twitter via memcached. We used a generational key scheme: a piece of user metadata was incremented on every change, and used to construct dependent keys. This eliminated direct invalidation concerns.
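The scheme looks roughly like this sketch (all names here are invented for illustration): a per-user generation counter is bumped on every change, and dependent keys embed the current generation, so stale entries are never read again and simply age out rather than being explicitly invalidated.

```python
# Stands in for a counter kept in memcached itself.
generations = {}  # user_id -> generation counter

def bump(user_id):
    """Increment the user's generation on every change, orphaning
    all dependent keys built from the old generation."""
    generations[user_id] = generations.get(user_id, 0) + 1

def page_key(user_id, page):
    """Build a dependent cache key that embeds the current generation."""
    gen = generations.get(user_id, 0)
    return "timeline:%d:gen%d:page%d" % (user_id, gen, page)
```

The orphaned keys from old generations are never deleted; they linger in the cache until they expire or are evicted, which is where the trouble starts.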

However, because the expiry and LRU strategies are orthogonal, we suspected that the generational keys were affecting evictions of our direct keys. The continual massive onslaught of new generational keys would sweep out older direct keys, even when the generational keys had expired and the direct keys were still fresh.

But how could we prove it was happening? Eventually we ran peep on a representative server. Aggregate results showed that 75% of our cache pool was spent on expired page caches, and our cache horizon was only 5 hours. Moving the generational keys to a separate pool raised our cache horizon to two days, a 10x improvement.

aggregating results

The best way to aggregate peep results is to use MySQL’s excellent grouping abilities. Run peep with the --ugly (unformatted) switch and pipe the output to a file. Then load it into an appropriate MySQL schema:

CREATE TABLE entries (
  lru_time int(11) default NULL,
  expires_at_time int(11) default NULL,
  value_size int(11) default NULL,
  suffix_size int(11) default NULL,
  it_flag varchar(255) default NULL,
  class_id int(11) default NULL,
  key_size int(11) default NULL,
  key_name varchar(255) default NULL,
  is_expired varchar(255) default NULL,
  is_flushed varchar(255) default NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

LOAD DATA LOCAL INFILE '/tmp/peep.txt' INTO TABLE entries
  FIELDS TERMINATED BY ' | ' LINES TERMINATED BY '\n';

This will load very fast, even for millions of keys on a laptop; LOAD DATA LOCAL INFILE is an optimized pipeline.

Here are some handy queries for revealing cache properties:

fresh vs. expired values, by count and size

SELECT
  is_expired,
  COUNT(*),
  SUM(value_size)
FROM entries
GROUP BY is_expired;

+------------+----------+-----------------+
| is_expired | COUNT(*) | SUM(value_size) |
+------------+----------+-----------------+
|  false     |   688883 |       759401296 |
|  true      |   472755 |      2430092660 |
+------------+----------+-----------------+

histogram of values by size

SELECT
  ROUND(ROUND(LOG10(value_size) * 5, 0) / 5, 1) AS log,
  ROUND(AVG(value_size), 0) AS size,
  COUNT(*),
  RPAD('', COUNT(*) / 6500, '*') AS bar
FROM entries
GROUP BY log
ORDER BY log DESC;

+------+--------+----------+----------------------------------------+
| log  | size   | COUNT(*) | bar                                    |
+------+--------+----------+----------------------------------------+
|  5.6 | 422238 |        3 |                                        |
...
|  1.6 |     39 |     3652 | *                                      |
|  1.4 |     23 |   103434 | ************************************** |
|  1.2 |     14 |     6369 | *                                      |
|  1.0 |     11 |    36588 | ******                                 |
|  0.8 |      6 |    55795 | *********                              |
|  0.6 |      4 |    90332 | **************                         |
|  0.4 |      2 |      239 |                                        |
+------+--------+----------+----------------------------------------+
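The log column in this and the following histogram queries buckets values by rounding log10 to the nearest 0.2, which is what the nested ROUND(ROUND(... * 5, 0) / 5, 1) expression computes. A sketch of the same bucketing in Python, for checking the SQL against:

```python
import math

def log_bucket(value_size):
    """Mirror the SQL expression ROUND(ROUND(LOG10(x) * 5, 0) / 5, 1):
    round log10(x) to the nearest multiple of 0.2."""
    return round(round(math.log10(value_size) * 5) / 5, 1)
```

So each bar in the histogram covers values spanning a factor of about 10^0.2 ≈ 1.58 in size.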

histogram of values by LRU age

SELECT
  ROUND(ROUND(LOG10(
    (SELECT MAX(lru_time) FROM entries)
    - lru_time) * 5, 0) / 5, 1) AS log,
  ROUND(AVG(
    (SELECT MAX(lru_time) FROM entries)
    - lru_time), -2) AS lru_age,
  COUNT(*),
  RPAD('', COUNT(*) / 6500, '*') AS bar
FROM entries
GROUP BY log
ORDER BY log DESC;

+------+-----------+----------+-------------------------------------+
| log  |   lru_age | COUNT(*) | bar                                 |
+------+-----------+----------+-------------------------------------+
|  4.6 |     34800 |    18064 | ***                                 |
|  4.4 |     24200 |    96739 | ***************                     |
|  4.2 |     15700 |   212865 | *********************************   |
|  4.0 |     10200 |   224703 | *********************************** |
|  3.8 |      6500 |   158067 | ************************            |
|  3.6 |      4100 |   108034 | *****************                   |
...
|  0.4 |         0 |      672 |                                     |
|  0.0 |         0 |      319 |                                     |
| NULL |         0 |    13400 | **                                  |
+------+-----------+----------+-------------------------------------+

histogram of values by size for a namespace

SELECT
  key_name,
  ROUND(ROUND(LOG10(value_size) * 5, 0) / 5, 1) AS log,
  ROUND(AVG(value_size), 0) AS size,
  COUNT(*),
  RPAD('', COUNT(*) / 10000, '*') AS bar
FROM entries
WHERE key_name LIKE '%chunk:%'
GROUP BY log
ORDER BY log DESC;

+----------------------+------+-------+----------+----------------------+
| key_name             | log  | size  | COUNT(*) | bar                  |
+----------------------+------+-------+----------+----------------------+
|  reply_chunk:69913   |  3.4 |  2511 |     1032 | **                   |
|  user_chunk:1868495  |  3.2 |  1530 |      972 | **                   |
...
|  user_chunk:1405137  |  1.6 |    40 |     2822 | ******               |
|  reply_chunk:2084579 |  1.4 |    24 |     4361 | *********            |
|  user_chunk:2455162  |  1.2 |    14 |     6141 | ************         |
|  user_chunk:5989656  |  1.0 |    12 |     2477 | *****                |
|  user_chunk:2268781  |  0.8 |     6 |    16527 | ******************** |
+----------------------+------+-------+----------+----------------------+

slab allocation counts by class

SELECT
  class_id AS slab_class,
  MAX(value_size) AS slab_size,
  CEIL(COUNT(*) / ((1024 * 1024) / MAX(value_size))) AS allocated,
  (
    (SELECT COUNT(*) FROM entries
     WHERE class_id = slab_class AND is_expired = "true")
   * 100
   /
    (SELECT COUNT(*)
     FROM entries
     WHERE class_id = slab_class)
  ) AS `%_expired`
FROM entries
GROUP BY class_id
ORDER BY class_id;

+------------+-----------+-----------+-----------+
| slab_class | slab_size | allocated | %_expired |
+------------+-----------+-----------+-----------+
|          1 |         8 |         1 |   99.8938 |
|          2 |        32 |         5 |    6.0320 |
|          3 |        71 |         9 |   36.4859 |
...
|         34 |    187783 |        28 |   61.2903 |
|         35 |    234205 |       130 |    1.8998 |
|         36 |    286110 |         7 |    0.0000 |
|         37 |    309832 |         2 |   75.0000 |
|         38 |    397500 |         1 |  100.0000 |
|         39 |    496014 |         1 |  100.0000 |
+------------+-----------+-----------+-----------+

This is comparable to the memcached stats slabs command, but lets you investigate finer-grained properties, such as the expired ratio within each slab class.
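The allocated column in the query above relies on memcached handing out slab memory in 1MB pages: divide the page size by the largest item in the class to get items per page, then divide the item count by that and round up. The same arithmetic as a sketch (using the observed MAX(value_size) as a stand-in for the true chunk size, as the query does):

```python
import math

PAGE_SIZE = 1024 * 1024  # memcached allocates slab memory in 1 MB pages

def estimated_pages(item_count, max_item_size):
    """Estimate pages allocated to a slab class, mirroring the SQL
    CEIL(COUNT(*) / ((1024 * 1024) / MAX(value_size))) expression."""
    items_per_page = PAGE_SIZE // max_item_size
    return math.ceil(item_count / items_per_page)
```

This is only an estimate: the observed maximum value size understates the real chunk size when no item in the class is full-sized.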

Play around and see what else you can come up with.

conclusion

Generational keys are often presented as the easy way to cache, but consider their implications. How else are you abusing your cache? Some suggested things to check:

  • Keys without expiry times (which will bite you when you alter the server pool)
  • Excessive use of very large slabs
  • Uneven LRU age by slab (indicating poor slab allocation)
  • Unexpected namespaces (revealing invalidation bugs)

The only unfortunate thing about peep right now is that it blocks the memcached server. I could do an incremental version, but it would be quite slow. The current version is already CPU-intensive, taking 15 minutes to dump a 4GB memcached on modern hardware (turn off your client-facing processes if you run mixed front-ends!). A version in C would be much quicker, if someone is up for it, but for now the Ruby version is sufficient.

Happy peeping.

10 responses

  1. Nice! Thanks for sharing. I was considering writing an ugly hack in the form of a wrapper to keep tabs. Clearly, this is a much better way.

  2. By doing this, you have won many beers, bought by me.

    Out of curiosity: how long does it take to dump, with say, a 16GB memcached instance?

  3. I assume it’s linear. To be safe, you could remove the server from your cache pool ahead of time. The queries will still be correct, since they use the server’s most-recently-used time, not the wall clock, for comparisons.

    Here’s what happens when you peep in production:

    No big deal. :-)

  4. Great post; please keep them coming.

    It’s interesting to see how few tools like peep we have today for memcached, the most popular cache solution in today’s big sites.

  5. Why does memcached evict non-expired LRU keys before expired keys? That surprises me.

    Why is checking two properties (expiry and timestamp) an expensive operation? I assume that (if memcached tracks keys with linked lists) checking for expired keys (as well as LRU keys) would be O(1).

  6. Chris: I was wrong on this issue. The problem is that the free (expired) linked list is passive—elements are only inserted if they happen to be expired on an attempted read. This avoids a GC thread.

    However, it means a guaranteed search for an expired key would require walking the entire LRU linked list, an O(N) operation. Instead, if the free list is empty, the LRU tail is used; that process finds a freeable item in O(1).

    I updated the post.

  7. Can you explain what the feature in your libmemcached builds that makes this better is? Are the changes you’ve made documented somewhere?

  8. The big change is AUTO_EJECT_HOSTS support, which lets libmemcached itself manage temporary server failure (useful in this case because peep induces such a failure).

    This work is getting integrated into the next version of libmemcached, finally.

    If you have further issues, please put them in the peep tracker.