Hadoop SAN Storage Reuse

We have 600TB of EMC SAN storage, which is currently used by Oracle RAC. We are replacing Oracle RAC with a Hadoop stack (YARN, Spark, Hive, Shark) for scalability reasons, though we are compromising on performance a bit.

For Hadoop, local storage is recommended over SAN storage. But our management is not willing to waste the SAN storage. They want to protect the investment in it.

How best can we use the SAN for Hadoop? Would an Ethernet upgrade help? What are the options for making maximum use of the SAN storage as Hadoop storage?

There are 2 best solutions below

0
On

You can use a SAN for Hadoop, but it is not advisable. There will be contention in the SAN controllers, which degrades performance.

The best ways to use a SAN for Hadoop are:

1. Create the LUNs with RAID-0.

2. A LUN should not be shared; dedicate each LUN to a single DataNode server.

3. If a DataNode needs 10GB, create 2 LUNs (or another even number) and balance them across the SAN's two controllers.
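The layout rules above can be sketched as a small planning script. This is only an illustration, assuming a two-controller array: the DataNode names, the LUN naming scheme and the controller labels `SP-A`/`SP-B` are hypothetical, and the script only computes a plan; it does not talk to any array.

```python
from collections import Counter
from typing import Dict, List, Tuple


def plan_luns(
    datanodes: List[str],
    luns_per_node: int = 2,
    controllers: Tuple[str, str] = ("SP-A", "SP-B"),  # hypothetical labels
) -> Dict[str, List[Tuple[str, str]]]:
    """Return {datanode: [(lun_name, controller), ...]}.

    Each LUN belongs to exactly one DataNode, and every DataNode gets an
    even number of LUNs alternating between the two controllers.
    """
    if luns_per_node <= 0 or luns_per_node % 2 != 0:
        raise ValueError("luns_per_node must be a positive even number")
    plan = {}
    for node in datanodes:
        plan[node] = [
            (f"{node}-lun{i}", controllers[i % 2])  # hypothetical naming
            for i in range(luns_per_node)
        ]
    return plan


def controller_load(plan: Dict[str, List[Tuple[str, str]]]) -> Counter:
    """Count how many LUNs each controller owns across the whole plan."""
    return Counter(ctrl for luns in plan.values() for _, ctrl in luns)


if __name__ == "__main__":
    plan = plan_luns(["dn01", "dn02", "dn03"], luns_per_node=2)
    for node, luns in plan.items():
        print(node, luns)
    print(dict(controller_load(plan)))
```

Because every node gets an even number of LUNs and they alternate between controllers, the per-controller LUN count comes out equal regardless of how many DataNodes you have.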

You can also use the SAN for the NameNode, with an appropriate RAID level (one with redundancy, i.e. not RAID-0).

0
On

Assuming we're using the same terminology - specifically that SAN is block devices accessed across a Fibre Channel network - then there's not much difference between 'local storage' and 'SAN storage'.

The performance you get out of it is limited by the same factors - number of controllers, number of spindles, contention ratios etc. The reason you buy a storage array/SAN in the first place is because then you can consolidate your workload and get a higher burst performance with the same (or lower) average.

However there's one additional factor - a SAN will typically include a fabric, which is a network used for carrying your disk storage traffic. The switches you use for it are typically high performance/low latency - but they can also be bottlenecks and points of contention.

Hadoop... is effectively doing the same thing by using HDFS - using its multiple local disks to get big 'bursts'. That will inherently cause contention on your SAN, so you don't get much consolidation benefit any more - and you might well end up worse off, because contention means bottlenecks and latency.

You might find you're better off if your storage array has good peak throughput, good dedupe mechanisms and large caches. Just make sure you've got plenty of end-to-end peak throughput and IOPS capacity. You'll probably still find you're worse off than you would be with local disks - but whether you should reuse something at a lower cost, rather than pay a premium to do it right, is more an IT policy sort of decision than a technical one.
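To put a rough number on "plenty of end-to-end peak throughput", a back-of-envelope check along these lines may help. It is a sketch only: the function name and every figure in the example (node count, LUNs per node, MB/s per LUN, array peak) are made-up placeholders, not measurements from any real array, so substitute your own.

```python
from typing import Tuple


def san_headroom(
    datanodes: int,
    luns_per_node: int,
    mb_s_per_lun: float,
    san_peak_mb_s: float,
) -> Tuple[float, float]:
    """Compare worst-case HDFS burst demand with the SAN's peak throughput.

    Returns (demand_mb_s, ratio), where ratio = san_peak_mb_s / demand_mb_s.
    A ratio below 1.0 means that if every DataNode bursts at once, the
    array or fabric becomes the bottleneck.
    """
    demand = datanodes * luns_per_node * mb_s_per_lun
    if demand <= 0:
        raise ValueError("all inputs must be positive")
    return demand, san_peak_mb_s / demand


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    demand, ratio = san_headroom(
        datanodes=20, luns_per_node=2, mb_s_per_lun=150.0, san_peak_mb_s=3200.0
    )
    print(f"burst demand: {demand:.0f} MB/s, SAN covers {ratio:.0%} of it")
```

The same shape of calculation applies to IOPS; only the units change. It deliberately assumes the worst case where all nodes burst together, which is what a large MapReduce or Spark scan tends to look like.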