It is rare that a new technology transforms my datacenter. Virtualization, VDI, and 10Gb converged networking
all worked to simplify my infrastructure, slash costs, and reduce downtime over
the past decade. The most recent transformation came in the form of all-flash
storage. My hope with this post is to walk through the process my team and I went through to arrive at
storage nirvana, better known as the Pure Storage FlashArray, while detailing
how the arrays have performed in my environment over the last two years.
I have spent years forklifting out old spinning disk only to
replace it with a new array that delivered few (if any) performance gains. My team and I faced the daily frustration
that comes with constantly tuning storage: delivering the performance required,
only to have a new application arrive and kill performance, starting the
process all over again. The exponential growth of virtualization was placing a
heavy reliance on our centralized storage infrastructure, and it just couldn't
keep up.
At the start of 2011, several arrays from NetApp and EMC were
failing under the load from databases (SQL Server and Oracle), Citrix VDI, Exchange, and
other virtual machines. For several
hours every evening, data audits or loads would bring
all our virtualized systems to a grinding halt as storage latency averaged over
300ms, with peaks over 1000ms. Nearly a year was wasted battling this disk
performance Phantom Menace with the vendors, yet they still could not isolate
the root of our performance pains.
I considered adding technologies like FlashCache to our
controllers, purpose-built flash accelerator cards to our servers, or more
disks and bigger controllers to our existing arrays. However, none of these
solutions offered the IO performance required without locking us into
inflexible server-based solutions or costing more than the business could
afford. I started the search for a new storage
solution that could deliver the performance and flexibility I wanted. Hybrid arrays didn't deliver the performance
needed for our write-intensive, high-IO environment, so I turned to all-flash
arrays as my savior.
In February 2012, after several false starts with various
all-flash vendors, I ran across a Tech Field Day presentation on YouTube. A startup
named Pure Storage was presenting the design of their all-flash array in
deep technical detail. What I was
seeing looked like an easy-to-manage array that could dedupe and compress my
data inline while delivering 200K IOPS. Other vendors had strongly recommended
against turning on their dedupe due to performance impacts, or their arrays performed
these tasks post-process. My team and I were energized by the prospect and decided
to put their claims to the test; within a couple of weeks of reaching out to
Pure Storage, I had one of their beta units running in my datacenter.
My first impressions after receiving the FlashArray were
beyond my expectations. After spending years digging through layer upon
layer of settings on traditional disk-based arrays, the simple interface and
lack of arbitrary knobs to configure were a significant
change. Initial setup and provisioning
of storage from both the GUI and the command line were easy to learn, and within
a few minutes I was up and running. It
only took a few clicks to provision LUNs to hosts: no wasted time configuring RAID groups, aggregates,
volumes, and the like, or hunting for which advanced options to set for my configuration.
The simplified experience opened my eyes to how storage should be built and
managed.
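To give a flavor of that simplicity, provisioning from the Purity command line boils down to a handful of commands. The sketch below is illustrative only: the volume name, host name, and WWNs are made up, and exact syntax can vary between Purity releases, so check the CLI reference for your code level.

```shell
# Create a 1 TiB volume (name "esx-datastore01" is a made-up example)
purevol create --size 1T esx-datastore01

# Register a host by its FC WWNs (example WWNs, not real ones)
purehost create --wwnlist 21:00:00:24:ff:4c:a1:b0,21:00:00:24:ff:4c:a1:b1 esx01

# Connect the volume to the host -- no RAID groups or aggregates to configure
purehost connect --vol esx-datastore01 esx01
```

Three commands and the LUN is presented; compare that with the RAID/aggregate/volume layering on a traditional array.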
I presented LUNs to a number of test systems running
Windows, Linux, and VMware. I proceeded
to throw every test I could think of at the array using SQL Server, storage
vMotions, VM cloning, and whatever other testing scripts or tools I could
dig up. The array performed exactly as
we were led to believe it would; if anything, it exceeded our
expectations in most tests. IOPS scaled predictably with the IO size I was running
against the array, and it easily handled the sustained random writes from
my test applications. Dedupe with over 100 test VMs cloned from our Staging
environment ranged from 4:1 to upwards of 11:1, depending on the type of data. At one point during the evaluation I ran into
issues with our Brocade Fibre Channel switches, where some paths were dropping. I was
pleasantly surprised when Pure Storage support jumped in to help troubleshoot
and resolve the issue, even though the switches were covered under NetApp
support (which wasn't being very helpful up to that point). After several
months of testing, my team and I were having a hard time resisting the urge to
put production data on the array, so we moved forward with purchasing our first
production FlashArray.
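The kind of sustained random-write load described above can be reproduced with a tool like fio. This job file is a sketch, not the exact tests I ran; the device path is a placeholder, and writing to a raw device will destroy its data, so point it at a dedicated test LUN.

```ini
; sustained 4K random-write test (sketch); /dev/sdX is a placeholder
[global]
ioengine=libaio
direct=1
time_based
runtime=300

[randwrite-4k]
rw=randwrite
bs=4k
iodepth=32
numjobs=4
filename=/dev/sdX
```

Varying `bs` (4k, 8k, 64k) is how you would observe the predictable IOPS-versus-IO-size scaling mentioned above.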
In August 2012 we installed our FA320 11TB array running
early GA (2.0.x) code. Within a few days
I had moved our entire Dev, QA, and Staging environments to the array and started
performance and reliability testing. I
generated as much IO load as I could, then proceeded to pull power on
controllers, pull SSDs, pull NVRAM modules, disconnect fiber, and disconnect InfiniBand
cables, all in an attempt to find any weakness in the reliability of the
array. Again, as with the beta array,
everything worked flawlessly. It was time to start moving production data onto
the array so we could see if it would finally end the Phantom Menace that
was growing worse by the day.
The first project was to virtualize two small SQL
Server instances that were causing some of the problems on our old disk
storage. The very first day after cutting these two SQL Servers over to
Pure Storage, their audit/load times plummeted from over 4 hours
to just over 1 hour. Beyond the time
reduction, our greatest success was the total lack of impact on other VMs on the
FlashArray during this window. Latency remained at ~1ms on the virtual machines
and <1ms on the array; no longer did we have to deal with 300+ms latency
across all our systems. With the
confidence that the array could handle our entire load without any issues, I worked
over the next week to migrate all production VMs (Web, App, Infrastructure, and
Citrix) onto the array. Within 3 months my team and I had decommissioned an
old EMC CLARiiON used for our Oracle hosts, and all VMware storage was off our
NetApp filer. Throughout the migration,
as we added more data and increased the IO load, the array continued to
perform flawlessly, with latency below the 1ms mark for the most part and dedupe
ratios above 6:1.
Over time the array was upgraded every couple of months as new
features or bug fixes came out. During
one code upgrade that should have been non-disruptive, a Linux server running
Oracle hung due to all paths going down. Again Pure Storage support stepped up
and was extremely helpful, reviewing every aspect of the Linux host's storage
configuration and assisting with troubleshooting. In the end it turned out to
be a bug in the MPIO driver on the Linux host.
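For Linux hosts, how gracefully path drops are handled during an upgrade comes down to the device-mapper multipath configuration in /etc/multipath.conf. As a hedged illustration only (the recommended values change across Purity and distribution releases, so verify against the vendor's current best-practices guide), a FlashArray device stanza might look like:

```
device {
    vendor                "PURE"
    product               "FlashArray"
    path_selector         "round-robin 0"
    path_grouping_policy  multibus
    fast_io_fail_tmo      10
    dev_loss_tmo          60
}
```

With all paths in one round-robin group, a single controller going away during an upgrade should only shrink the path set rather than hang IO, assuming the host's MPIO stack behaves.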
In 2013 we were ready to add a second Pure Storage array for
eventual use in our DR site, and as part of the purchase we also had the
opportunity to upgrade the controllers in our production array from FA320
to FA420 controllers. In
my past experience, controller upgrades meant a major outage;
that is, of course, unless the storage vendor simply forced you to buy a whole new
array to deliver a little more performance.
Our Pure Storage sales engineer (James) came on site and
installed the new DR array first, and had it up and running in short
order. Since I wanted to see just how
much load two arrays sitting side by side could handle, I quickly
added storage from the new array to my test ESXi cluster while James was
unpacking the new heads for the primary array. I set up tasks to clone and
migrate VMs back and forth between the two arrays to generate a ton of IO load
(and to see how fast I could move data between them). The timeframe for this upgrade was mid-week,
with a fairly short window where the array would be moderately quiet before our
nightly database tasks started up. If the production array went down at any
time, day or night, it would bring down all my production systems, so running
mid-week wasn't any bigger a risk. From
all our previous upgrades and fault testing I had a very high level of confidence
that the head upgrade would go off without a hitch, and true to form, the head
upgrades completed without any production impact, even with me throwing >2GB/s
and >100K IOPS at the array throughout the entire upgrade process.
As with any new product, there have been minor issues with
the array on occasion. However, support has always been very responsive and
quick to find a resolution. They have
reached out on several occasions to notify me of proactive fixes for issues
they were starting to see on our array, allowing us to keep our production systems
running smoothly. The result was
~35 minutes of total downtime in the first 6 months on the array (non-disruptive
upgrades took a few revisions of the early 2.x code to get right). In the past 18 months I have had zero
production outages caused by the FlashArray and zero storage-related
performance issues.
More than 2 years after I first installed Pure Storage
in our datacenter, we have gone through countless code updates on our arrays,
and we have even changed controller generations on our primary array. We have never lost an SSD in the entire
time we have been running FlashArrays.
Any time we have had an issue with the array or the systems connected to it,
support has been very responsive and a pleasure to work with. For the first time since I started working
with storage arrays from various vendors, I don't have to worry about how my
storage is performing, whether the next upgrade will take my datacenter down,
what will happen if I lose a drive or controller, or how much my next
maintenance bill will go up. My team and
I get pretty excited when talking about Pure Storage, and for good reason: it
has saved countless hours managing storage in our datacenter, eliminated
storage as a performance bottleneck, lowered power requirements, reduced rack
space, and cut costs. Above all, Pure Storage is a great company, and they
are successfully changing the storage industry for the better.