Philip Williams
on 22 December 2021
It’s that time of year when we start to look ahead and think about the trends shaping our various industries. One thing is certain in the storage industry: capacity demand remains high, and the industry continues to see exponential growth.
Growth, growth, growth
More and more data is being created every day. It truly is non-stop. In 2021 alone, enterprise storage vendors were predicted to ship almost 150 exabytes of capacity, and this number is only expected to increase again in 2022!
We now see 20TB hard drives on the market to help meet these needs, but we have to remain vigilant when building storage clusters, as the access speed of these drives hasn’t really changed over the last few years. In failure scenarios, where replicas or erasure-coded shards of data have to be recreated, recovery can take many, many hours with drives of such high capacity.
So the rule of thumb remains the same: a larger number of smaller drives leads to a more predictable system for any given capacity. Of course, you do have to remain pragmatic and balance capacity needs against the cost of additional spindles.
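To put rough numbers on why this matters, here is a minimal back-of-the-envelope sketch of per-drive rebuild time. The drive capacities and the 100 MB/s sustained recovery rate are illustrative assumptions, not benchmark results.

```python
# Back-of-the-envelope rebuild-time estimate. The capacities and the
# 100 MB/s recovery rate are illustrative assumptions, not measurements.

def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to re-replicate a full drive at a sustained recovery rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal TB -> MB
    return capacity_mb / throughput_mb_s / 3600

for size_tb in (4, 10, 20):
    print(f"{size_tb} TB drive @ 100 MB/s: ~{rebuild_hours(size_tb, 100):.0f} hours")

# With these assumptions:
#   4 TB drive @ 100 MB/s: ~11 hours
#   10 TB drive @ 100 MB/s: ~28 hours
#   20 TB drive @ 100 MB/s: ~56 hours
```

In practice, recovery is parallelised across many drives, but the per-drive ceiling still grows linearly with capacity, which is why smaller drives keep recovery windows more predictable.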
Flash: denser and faster
Over the last few years, we have seen huge leaps forward in capacity-oriented flash. Intel recently launched a 30TB QLC 3D NAND drive, surpassing even the largest of traditional spinning drives. Whilst we wouldn’t suggest using these for very write-heavy workloads, there is definitely a place for them in storage systems to increase throughput above traditional spindle-based configurations. There are also power usage benefits, which become more and more important as large-scale clusters grow – and even at the Edge, where power budgets might be quite limited!
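To make the density and power argument concrete, here is a minimal sketch comparing device counts and power draw for a fixed raw capacity built from high-capacity spindles versus 30TB QLC flash. All per-device figures are assumed, illustrative values rather than vendor specifications.

```python
import math

# Rough device-count and power comparison for a fixed raw capacity.
# Per-device capacity and wattage are assumed, illustrative values only.

def devices_and_watts(target_pb: float, device_tb: float, watts_per_device: float):
    """Return (device count, total power in watts) for a target raw capacity."""
    devices = math.ceil(target_pb * 1000 / device_tb)
    return devices, devices * watts_per_device

for name, device_tb, watts in (("20 TB HDD", 20, 8.0), ("30 TB QLC SSD", 30, 5.0)):
    count, power = devices_and_watts(10, device_tb, watts)
    print(f"10 PB on {name}: {count} devices, ~{power:,.0f} W")

# With these assumptions:
#   10 PB on 20 TB HDD: 500 devices, ~4,000 W
#   10 PB on 30 TB QLC SSD: 334 devices, ~1,670 W
```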
Computational storage
An interesting and novel area in hard drive technology is the concept of computational storage, that is, adding more intelligence to the hard drives and SSDs that we use in servers and storage clusters.
We have seen work in this area before, but the use case was almost too narrow. Seagate created a hard drive called Kinetic, which exposed a key/value object storage interface over Ethernet, rather than the usual block interfaces of SAS or SATA. This was interesting for those of us building larger scale object stores. It meant that, with each hard drive added to a cluster, an additional amount of compute resource was added too, leading to a highly scalable sea-of-compute-and-storage. Furthermore, it reduced failure domains significantly, to a single disk rather than a whole server containing multiple disks. However, this concept didn’t really gain much traction, as it required significant changes to the software used to build storage clusters. In the case of Ceph, for example, there just weren’t enough resources on each drive to run an entire OSD.
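To illustrate the conceptual shift, a key/value drive is addressed by object name rather than by block offset. The sketch below is purely hypothetical: the class and method names are invented for illustration and do not reproduce the actual Kinetic protocol.

```python
# Hypothetical sketch only: these interfaces are invented for illustration
# and do not reproduce the actual Kinetic protocol.

class BlockDrive:
    """Traditional SAS/SATA model: the host addresses fixed-size blocks."""
    def read(self, lba: int, num_blocks: int) -> bytes: ...
    def write(self, lba: int, data: bytes) -> None: ...

class KeyValueDrive:
    """Kinetic-style model: each drive stores named objects over Ethernet,
    so an object store can talk to drives directly, without a block layer."""
    def put(self, key: bytes, value: bytes) -> None: ...
    def get(self, key: bytes) -> bytes: ...
    def delete(self, key: bytes) -> None: ...
```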
Fast forward to 2021, and we see some smaller companies starting to offer products that maintain typical SAS and SATA interfaces, but also provide capacity efficiency options such as on-drive compression or encryption, without requiring any host processing power or changes to the software running on the server.
This is a lot like what we have already seen in the Ethernet space, where certain tasks are offloaded to SmartNICs. With some computationally aware storage devices, it is already possible to access the compute resources on these drives and use them for pre-processing datasets. In a storage system with thousands of drives, this adds up to a huge amount of additional computing power at your disposal.
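As a hypothetical sketch of what dataset pre-processing on a computationally aware drive could look like, the snippet below runs a filter “on-drive” so that only matching records travel back to the host. The ComputationalDrive class and its filter method are invented for illustration and do not represent any specific vendor’s API.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical sketch: ComputationalDrive and filter() are invented for
# illustration and do not correspond to any specific vendor API.

class ComputationalDrive:
    def __init__(self, records: Iterable[dict]):
        self._records = list(records)  # stand-in for data stored on the device

    def filter(self, predicate: Callable[[dict], bool]) -> Iterator[dict]:
        """Evaluate the predicate on-drive; only matching records cross the bus."""
        return (r for r in self._records if predicate(r))

# The host only receives the pre-filtered results from each drive.
drives = [ComputationalDrive({"sensor": i, "temp": 20 + i} for i in range(5))]
hot_readings = [r for d in drives for r in d.filter(lambda r: r["temp"] > 22)]
print(hot_readings)  # [{'sensor': 3, 'temp': 23}, {'sensor': 4, 'temp': 24}]
```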
Data repatriation – the post-pandemic splurge
Over the last two years, we have all seen huge changes in the way that we work. To support that, many companies have turned to public clouds to help them scale their operations immediately and maintain business as usual. Cost optimisation has largely been a secondary consideration.
However, as companies have settled into these new ways of operating, we now see a renewed focus on cost optimisation and efficiency. Storage remains the least cloud-friendly piece of infrastructure, as usage is typically static or growing, without the peaks and troughs that compute workloads have.
More and more companies are waking up to the costs of storing data in the cloud, and are considering near-cloud solutions: operating their own hardware in co-location facilities adjacent to major cloud provider facilities, linked together with private interconnects. Not only does this reduce costs immediately, it also means there are no penalties when migrating to other cloud providers in the future!
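As an illustration of the arithmetic behind these decisions, the sketch below compares a monthly cloud storage bill with an amortised near-cloud deployment. Every figure is an assumed placeholder rather than a quote from any provider or vendor.

```python
# Illustrative comparison only: every price below is an assumed placeholder,
# not a quote from any cloud provider or hardware vendor.

def cloud_monthly(stored_tb: float, price_per_gb: float,
                  egress_tb: float, egress_per_gb: float) -> float:
    """Monthly cost of storing data in the cloud plus reading some of it back."""
    return stored_tb * 1000 * price_per_gb + egress_tb * 1000 * egress_per_gb

def near_cloud_monthly(hardware_cost: float, amortisation_months: int,
                       colo_and_power: float) -> float:
    """Monthly cost of owned hardware amortised over its lifetime, plus co-lo fees."""
    return hardware_cost / amortisation_months + colo_and_power

# 1 PB stored, 50 TB/month egress (assumed figures).
print(f"Cloud:      ${cloud_monthly(1000, 0.02, 50, 0.05):,.0f} per month")
print(f"Near-cloud: ${near_cloud_monthly(300_000, 60, 4_000):,.0f} per month")
```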
Wrap up
Open source storage solutions such as Ceph can readily help solve the growth and scaling challenges seen across the industry. Learn more about deploying Ceph in our recent webinar.
We wish you all Happy Holidays and a wonderful New Year!