It is rare that a new technology transforms my datacenter. Virtualization, VDI, and 10Gb converged networking
all worked to simplify my infrastructure, slash costs, and reduce downtime over
the past decade. The most recent transformation came in the form of all-flash
storage. My hope with this post is to walk through the process my team and I went through to arrive at
storage nirvana, better known as the Pure Storage FlashArray, while detailing
how the arrays have performed in my environment over the last two years.
I have spent years forklifting out old spinning disk only to
replace it with a new array that delivered few (if any) performance gains. My team and I faced the daily frustration
that comes with constantly tuning storage: delivering the performance required,
only to have a new application arrive and kill performance, starting the
process all over again. The exponential growth of virtualization was placing a
heavy reliance on our centralized storage infrastructure, and it just couldn't
keep up.
At the start of 2011, several arrays from NetApp and EMC were
failing under the load from databases (SQL Server and Oracle), Citrix VDI, Exchange, and
other virtual machines. For several
hours every evening, data audits or loads would bring
all our virtualized systems to a grinding halt as storage latency averaged over
300ms, with peaks over 1000ms. Nearly a year was wasted battling this disk
performance Phantom Menace with the vendors, yet they still could not isolate
the root of our performance pains.
I considered adding technologies like FlashCache to our
controllers, purpose-built flash accelerator cards to our servers, or more
disks and bigger controllers to our existing arrays. However, none of these
solutions offered the IO performance required without locking us into
inflexible server-based solutions or costing more than the business could
afford. I started the search for a new storage
solution that could deliver the performance and flexibility I wanted. Hybrid arrays didn't deliver the performance
needed for our write-intensive, high-IO environment, so I turned to all-flash
arrays as my savior.
In February 2012, after several false starts with various
all-flash vendors, I ran across a Tech Field Day presentation on YouTube. A startup
named Pure Storage was presenting the design of their all-flash array in
deep technical detail. What I was
seeing looked like an easy-to-manage array that could dedupe and compress my
data inline while delivering 200K IOPS. Other vendors had strongly recommended
against turning on their dedupe due to performance impacts, or their arrays performed
these tasks post-process. My team and I were energized by the prospect and decided
to put their claims to the test; within a couple of weeks of reaching out to
Pure Storage, I had one of their beta units running in my datacenter.
My first impressions after receiving the FlashArray were
beyond my expectations. After spending years digging through layer upon
layer of settings on traditional disk-based arrays, the simple interface and
lack of arbitrary knobs to configure were a significant
change. Initial setup and provisioning
of storage from both the GUI and the command line were easy to learn, and within
a few minutes I was up and running. It
only took a few clicks to provision LUNs to hosts: no wasted time configuring RAID groups, aggregates,
volumes, and the like, or hunting for which advanced options to set for my configuration.
The simplified experience opened my eyes to how storage should be built and
managed.
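To give a flavor of that simplicity, provisioning from the Purity command line boils down to a handful of commands. The sketch below is illustrative only: the volume name, host name, and WWNs are made up, and exact syntax can vary between Purity releases, so check the CLI reference for your code level.

```shell
# Create a 1 TiB volume (name "esx-datastore01" is a made-up example)
purevol create --size 1T esx-datastore01

# Register a host by its FC WWNs (example WWNs, not real ones)
purehost create --wwnlist 21:00:00:24:ff:4c:a1:b0,21:00:00:24:ff:4c:a1:b1 esx01

# Connect the volume to the host -- no RAID groups or aggregates to configure
purehost connect --vol esx-datastore01 esx01
```

Three commands and the LUN is presented; compare that with the RAID/aggregate/volume layering on a traditional array.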
I presented LUNs to a number of test systems running
Windows, Linux, and VMware. I proceeded
to throw every test I could think of at the array using SQL Server, storage
vMotions, VM cloning, and whatever other testing scripts or tools I could
dig up. The array performed exactly as
we were led to believe it would; if anything, it exceeded our
expectations in most tests. IOPS scaled predictably with the IO size I was running
against the array, and it easily handled the sustained random writes from
my test applications. Dedupe with over 100 test VMs cloned from our Staging
environment ranged from 4:1 to upwards of 11:1, depending on the type of data. At one point during the evaluation I ran into
issues with our Brocade Fibre Channel switches, where some paths were dropping. I was
pleasantly surprised when Pure Storage support jumped in to help troubleshoot
and resolve the issue, even though the switches were covered under NetApp
support (which wasn't being very helpful up to that point). After several
months of testing, my team and I were having a hard time resisting the urge to
put production data on the array, so we moved forward with purchasing our first
production FlashArray.
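The kind of sustained random-write load described above can be reproduced with a tool like fio. This job file is a sketch, not the exact tests I ran; the device path is a placeholder, and writing to a raw device will destroy its data, so point it at a dedicated test LUN.

```ini
; sustained 4K random-write test (sketch); /dev/sdX is a placeholder
[global]
ioengine=libaio
direct=1
time_based
runtime=300

[randwrite-4k]
rw=randwrite
bs=4k
iodepth=32
numjobs=4
filename=/dev/sdX
```

Varying `bs` (4k, 8k, 64k) is how you would observe the predictable IOPS-versus-IO-size scaling mentioned above.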
In August 2012 we installed our FA320 11TB array running
early GA (2.0.x) code. Within a few days
I had moved our entire Dev, QA, and Staging environments to the array and started
performance and reliability testing. I
generated as much IO load as I could, then proceeded to pull power on
controllers, pull SSDs, pull NVRAM modules, disconnect fiber, and disconnect InfiniBand
cables, all in an attempt to find any weakness in the reliability of the
array. Again, as with the beta array,
everything worked flawlessly. It was time to start moving production data onto
the array so we could see if it would finally end the Phantom Menace that
was growing worse by the day.
The first project was to virtualize two small SQL
Server instances that were causing some of the problems on our old disk
storage. The very first day after cutting these two SQL Servers over to
Pure Storage, their audit/load times plummeted from over 4 hours
to just over 1 hour. Beyond the time
reduction, our greatest success was the total lack of impact on other VMs on the
FlashArray during this window. Latency remained at ~1ms on the virtual machines
and <1ms on the array; no longer did we have to deal with 300+ms latency
across all our systems. With the
confidence that the array could handle our entire load without any issues, I worked
over the next week to migrate all production VMs (Web, App, Infrastructure, and
Citrix) onto the array. Within 3 months my team and I had decommissioned an
old EMC CLARiiON used for our Oracle hosts, and all VMware storage was off our
NetApp filer. Throughout the migration,
as we added more data and increased the IO load, the array continued to
perform flawlessly, with latency below the 1ms mark for the most part and dedupe
ratios above 6:1.
Over time the array was upgraded every couple of months as new
features or bug fixes came out. During
one code upgrade that should have been non-disruptive, a Linux server running
Oracle hung due to all paths going down. Again Pure Storage support stepped up
and was extremely helpful, reviewing every aspect of the Linux host's storage
configuration and assisting with troubleshooting. In the end it turned out to
be a bug in the MPIO driver on the Linux host.
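For Linux hosts, how gracefully path drops are handled during an upgrade comes down to the device-mapper multipath configuration in /etc/multipath.conf. As a hedged illustration only (the recommended values change across Purity and distribution releases, so verify against the vendor's current best-practices guide), a FlashArray device stanza might look like:

```
device {
    vendor                "PURE"
    product               "FlashArray"
    path_selector         "round-robin 0"
    path_grouping_policy  multibus
    fast_io_fail_tmo      10
    dev_loss_tmo          60
}
```

With all paths in one round-robin group, a single controller going away during an upgrade should only shrink the path set rather than hang IO, assuming the host's MPIO stack behaves.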
In 2013 we were ready to add a second Pure Storage array for
eventual use in our DR site, and as part of the purchase we also had the
opportunity to upgrade the controllers in our production array from FA320
to FA420 controllers. In
my past experience, controller upgrades meant a major outage;
that is, of course, unless the storage vendor simply forced you to buy a whole new
array to deliver a little more performance.
Our Pure Storage sales engineer (James) came on site and
installed the new DR array first, and had it up and running in short
order. Since I wanted to see just how
much load two arrays sitting side by side could handle, I quickly
added storage from the new array to my test ESXi cluster while James was
unpacking the new heads for the primary array. I set up tasks to clone and
migrate VMs back and forth between the two arrays to generate a ton of IO load
(and to see how fast I could move data between them). The timeframe for this upgrade was mid-week,
with a fairly short window where the array would be moderately quiet before our
nightly database tasks started up. If the production array went down at any
time, day or night, it would bring down all my production systems, so running
mid-week wasn't any bigger a risk. From
all our previous upgrades and fault testing I had a very high level of confidence
that the head upgrade would go off without a hitch, and true to form, the head
upgrades completed without any production impact, even with me throwing >2GB/s
and >100K IOPS at the array throughout the entire upgrade process.
As with any new product, there have been minor issues with
the array on occasion. However, support has always been very responsive and
quick to find a resolution. They have
reached out on several occasions to notify me of proactive fixes for issues
they were starting to see on our array, allowing us to keep our production systems
running smoothly. The result was
~35 minutes of total downtime in the first 6 months on the array (non-disruptive
upgrades took a few revisions of the early 2.x code to get right). In the past 18 months I have had zero
production outages caused by the FlashArray and zero storage-related
performance issues.
More than 2 years after I first installed Pure Storage
in our datacenter, we have gone through countless code updates on our arrays,
and we have even changed controller generations on our primary array. We have never lost an SSD in the entire
time we have been running FlashArrays.
Any time we have had an issue with the array or the systems connected to it,
support has been very responsive and a pleasure to work with. For the first time since I started working
with storage arrays from various vendors, I don't have to worry about how my
storage is performing, whether the next upgrade will take my datacenter down,
what will happen if I lose a drive or controller, or how much my next
maintenance bill will go up. My team and
I get pretty excited when talking about Pure Storage, and for good reason: it
has saved countless hours managing storage in our datacenter, eliminated
storage as a performance bottleneck, lowered power requirements, reduced rack
space, and cut costs. Above all, Pure Storage is a great company, and they
are successfully changing the storage industry for the better.