As part of a distributed crawler, we store all URLs in a Redis sorted set, which serves as the crawl queue, and in a Redis hash (to de-duplicate and mark visited URLs).
We have a file containing about 11M URLs, across various domains, that we wish to visit; the file occupies 506 MB of space on disk.
However, the same set of URLs, when put into the Redis sorted set with decreasing integer priorities from 11M down to 0, takes 1.759 GB of RAM, and the Redis hash mapping key: URL -> value: same URL takes 2.048 GB of RAM.
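For reference, the loading step looks roughly like this (a minimal sketch using a recent redis-py client; the key names "queue" and "visited", the batch size, and the connection details are illustrative assumptions, not our exact code):

import redis

# Assumed setup: local Redis instance, keys "queue" (sorted set) and "visited" (hash).
r = redis.Redis(host="localhost", port=6379, db=0)

TOTAL = 11_000_000  # ~11M URLs in the input file

with open("urls.txt") as f:
    pipe = r.pipeline(transaction=False)
    for i, line in enumerate(f):
        url = line.strip()
        priority = TOTAL - i                 # decreasing priority: 11M down to 0
        pipe.zadd("queue", {url: priority})  # crawl queue (sorted set)
        pipe.hset("visited", url, url)       # de-dup / visited marker (hash)
        if i % 10_000 == 0:
            pipe.execute()                   # flush in batches
    pipe.execute()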
The Redis server is hosted on a High-Memory Extra Large (17 GB) EC2 instance in AWS.
I want to figure out what's causing this memory bloat in Redis. Is it due to an inefficient way of storing the URLs, or should we optimize for memory in some specific way to avoid it? Any suggestion toward improving the memory usage would be great. Thanks in advance for any help!
This is the Redis INFO dump:
redis_version:2.4.14
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:739
uptime_in_seconds:329647
uptime_in_days:3
lru_clock:1603627
used_cpu_sys:9521.58
used_cpu_user:3165.06
used_cpu_sys_children:19535.11
used_cpu_user_children:126500.32
connected_clients:76
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:12794713864
used_memory_human:11.92G
used_memory_rss:13586632704
used_memory_peak:16575849280
used_memory_peak_human:15.44G
mem_fragmentation_ratio:1.06
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:46321
bgsave_in_progress:1
last_save_time:1358213403
bgrewriteaof_in_progress:0
total_connections_received:1702
total_commands_processed:95112145
expired_keys:3488037
evicted_keys:0
keyspace_hits:43443780
keyspace_misses:38945
pubsub_channels:2
pubsub_patterns:0
latest_fork_usec:3820832
vm_enabled:0
role:master
db0:keys=116,expires=25