Some thoughts on the performance of SSD RAID 0 arrays

My friend Alan Rocker and I often discuss ideas about technology and tradeoffs.  Alan asked about SSDs for Linux:

> I haven’t been following hardware developments very closely for a while, so I
> find it hard to judge the arguments. What’s important?

Ultimately what’s important is the management software, the layer above the drivers, off to one side. That applies regardless of the media and means that the view the applications take of storage is preserved regardless of changes in the physical media.

> The first question is, what areas are currently the bottlenecks and
> constraints, at what orders of magnitude?

The simple answer is ‘channels’.

> Are processors starved for data, or drives waiting around for processors to
> send them data?

All of the above and don’t forget the network!

Ultimately everything has to move through memory, wherever it’s coming from, wherever it’s going to. Can you bypass memory? Yes, but then you have SEVERE management problems. Don’t go there in the general case, and when you DO go there for specialized functionality you can’t do much else.

Back in the early 1970s I was working with a British GEC 4000 series machine. It had been built as a controller system, think what we call SCADA today, for railways and the like. I’m sure there’s a manual on-line for it somewhere; I had one but seem to have lost it.

What characterized it was that it had FIVE main buses or channels.
The memory was 5-ported, so the CPU could be accessing a region through one port while one disk was writing to memory through another, a second disk was reading from another region, and the network was doing input to one region and output from another region.

All in parallel.
The bus selection/contention was managed individually by the devices, all of which had ‘autonomous data transfer’ facilities. Nothing new here. Rather than shove values into registers on the devices as we do today, there was a chain of control blocks that the CPU built in memory, and the devices each had what amounted to a DMA mechanism to read and inwardly digest and act on each control block, set the complete flag for it, then go on to the next. This was normal for large machines of the time. Smaller machines like the PDP-11 introduced register stuffing, even though some of their controllers were this older style.
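As a rough illustration of that control-block chain, here is a minimal Python sketch. The field names (`op`, `device_block`, `memory_address`) and the `device_run` function are invented for this example; they are not the GEC 4000’s actual layout, only the general shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ControlBlock:
    """One request the CPU leaves in memory for a device (hypothetical layout)."""
    op: str                                  # "read" or "write"
    device_block: int                        # block number on the device
    memory_address: int                      # where the data lives in main memory
    complete: bool = False                   # set by the device when it has finished
    next: Optional["ControlBlock"] = None    # link to the next request in the chain


def device_run(head: Optional[ControlBlock],
               transfer: Callable[[ControlBlock], None]) -> int:
    """What the device does on its own: walk the chain, act, flag, move on."""
    done = 0
    cb = head
    while cb is not None:
        transfer(cb)          # the actual data movement, no CPU involved
        cb.complete = True    # the CPU can poll or be interrupted on this flag
        cb = cb.next
        done += 1
    return done


# The CPU builds a chain of three requests and then gets on with other work.
third = ControlBlock("read", device_block=12, memory_address=0x3000)
second = ControlBlock("write", device_block=7, memory_address=0x2000, next=third)
first = ControlBlock("read", device_block=3, memory_address=0x1000, next=second)

log = []
handled = device_run(first, lambda cb: log.append((cb.op, cb.device_block)))
```

The point of the design is that the CPU touches the device once per chain, not once per transfer.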

You can find academic papers and studies that show that parallelism is one of the greatest accelerators of performance. Go google for that and read up on that in your own time.

There are few reasons we can’t build a machine like that today; ultimately it’s economics.

In order to have block level granularity where blocks were 512 bytes, the memory had to be (chip level) addressable at that granularity. Two different devices might need to access adjacent sections of memory. It’s not that memory wasn’t aligned; it was, on 512 byte blocks.

If you think about it for a moment you’ll see why. When we have, say, 4k by 8 chips, never mind larger, we can’t have two different processes addressing adjacent 512 byte blocks on that one chip. It only has one set of address lines and one set of data lines. With 4k by 8 chips we now have a threshold of 4k granularity, no matter how sophisticated the ‘multiplexing’ hardware between the chip and the bus or buses.

Well, that’s OK, Linux can live with 4k memory block allocation for the virtual memory system and many modern disks are moving to 4k block addressability.

But reality: how available and expensive are 4k by 8 chips?
Does this have 4k granularity addressability?

Yes, but that’s only a single channel. You still need to have a card that plugs into 5 buses and has the multiplexing hardware. That gets to be expensive.

Some models of the PDP-11 and the VAX made do with dual channel memory; one bus was devoted to the CPU alone, the other to devices, primarily the disk, but later a form of terminal IO that was buffered and could send whole lines or screen updates. A lot of the character by character handling, certainly in line mode rather than RAW mode such as used by VI, was carried out by a dedicated terminal server. I used a system like that at HCR. It was very effective and reduced the immediate per character interrupt rate on the PDP-11/45 to a level where it could happily support 40 concurrent users. So much so that when we moved to the nominally more powerful VAX that didn’t have this parallelism, the perceived performance and responsiveness dropped to a level where a machine with nominally 4-6 times the ‘power’ could only happily support about half the number of users with the same kind of job mix.

It’s all very well to say that it was about off-loading IO, but that off-loading involved parallel processing, the parallel subsystem doing what the main processor would otherwise have to do. This isn’t like the traditional (e.g. IBM) mainframes that simply couldn’t even do that kind of IO at all and had to use peripheral processors.

> SSDs have clear advantages over spinning rust in robustness and lack of
> latency, offset by cost/byte. When a terabyte disk costs $60,

What do you mean, ‘terabyte’, Kimosabe?
Shop around and you can get a 2T SATA 2.5″ for US$60, a few outlets might stretch to a 3T for less than US$75. I’m sure discount/reconditioned houses will have some 2T under CDN$60 if I start looking.

> and you are neither taking lots of video nor running a bit barn, cost/byte’s
> not a big deal. What are the relative speeds of disks, SSDs and cards.
> (There’s at least an order of magnitude variation in quoted SD card transfer rates.)

In that exclusion set we can count a lot of home computing.
Some people like John might run a family SAN and stretch to a few terabytes, but a single 3T drive represents a SPOF. His SAN might be better served by mirrored 500G drives or, if performance is critical, mirror-striped or striped-mirror (preferably RAID 1+0, a “stripe of mirrors”, i.e. the disks within each group are mirrored, but the groups themselves are striped).

Ultimately, striping is about parallelism.
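To make the “stripe of mirrors” layout concrete, here is a small Python sketch of how a logical block number could map onto a RAID 1+0 set. The function name `raid10_locate`, the in-order pairing of disks and the `stripe_blocks` parameter are assumptions for illustration; real implementations (md, LVM, hardware controllers) have their own layouts.

```python
from typing import List, Tuple


def raid10_locate(block: int, disks: List[str],
                  stripe_blocks: int = 1) -> Tuple[List[str], int]:
    """Map a logical block to (mirror group, block offset on each disk in it).

    Disks are paired in order: (d0, d1) is mirror group 0, (d2, d3) group 1, ...
    Stripe units are dealt round-robin across the groups, and both disks
    in a group hold an identical copy.
    """
    if len(disks) < 4 or len(disks) % 2:
        raise ValueError("RAID 1+0 needs an even number of disks, at least 4")
    groups = [disks[i:i + 2] for i in range(0, len(disks), 2)]
    stripe = block // stripe_blocks        # which stripe unit holds this block
    group = stripe % len(groups)           # round-robin over the mirror groups
    row = stripe // len(groups)            # full rows of stripe units before it
    offset = row * stripe_blocks + block % stripe_blocks
    return groups[group], offset


disks = ["sda", "sdb", "sdc", "sdd"]
for logical in range(4):
    mirrors, offset = raid10_locate(logical, disks)
    print(f"logical block {logical} -> disks {mirrors}, offset {offset}")
```

Consecutive blocks land on different mirror groups, so sequential I/O keeps both groups busy at once, while each write still goes to two disks for redundancy.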

> Infinite expandability gives the edge to cards, provided they’re big enough and fast enough.

No, Alan, that’s incorrect.
Infinite expandability is a function of the file system or file system manager.

Both LVM and BtrFS are indefinitely expandable; you can keep adding spindles (or the SSD equivalent) and growing an individual file system up to the limit of the inode fields, or possibly recompiling your system to have bigger inodes/more fields.

BtrFS has a “one file system to rule them all” approach; the ‘them’ being all the drives/spindles. It supposes 64-bit machines and hence…

Max. volume size 16 EiB
Max. file size 16 EiB
Max. number of files 2^64

What’s an “EiB”? That’s an ‘exbibyte’.
1 exbibyte = 2^60 bytes = 1152921504606846976 bytes = 1024 pebibytes
What’s a “pebibyte”?
That’s 1024^5 bytes.
So 1 exbibyte is 1024^6 bytes.

1TB is, if we use the binary form, 1024^4 bytes.
So 1 EiB is 1024^2 TB.
That’s a lot of ‘spindles’.
And the limit for BtrFS is 16 times that.
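The arithmetic above is easy to check. This is just a sanity-check sketch in Python; the constant names are mine, and the “2 TiB drives” figure is an illustrative assumption, not a BtrFS parameter.

```python
KiB = 1024
TiB = KiB ** 4          # "1TB" in the binary form used above
PiB = KiB ** 5          # pebibyte
EiB = KiB ** 6          # exbibyte

# The identities quoted in the text.
assert EiB == 2 ** 60 == 1152921504606846976
assert EiB == 1024 * PiB

tib_per_eib = EiB // TiB                # 1024**2 binary terabytes per EiB
btrfs_max_bytes = 16 * EiB              # the quoted 16 EiB volume limit, i.e. 2**64

# Illustrative only: how many 2 TiB drives it would take to reach that limit.
drives_needed = btrfs_max_bytes // (2 * TiB)

print(tib_per_eib, btrfs_max_bytes, drives_needed)
```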

So that’s what ‘indefinitely’ amounts to for BtrFS.

Well, no, you are going to run out of kernel space resources before that 🙂


You can also have an indefinite number of sub-volumes within that.
These aren’t partitions; they are more like ‘thin partitions’ where the space comes out of the general pool.
They can be mounted with access controls just like regular volumes.

BtrFS is important in another regard: it is one of the few file systems that do not need provisioning when created.

Back in the original version 6 days of the late 1960s and early 1970s the original file system was designed for simplicity and hence a very small amount of code; memory was very limited. The KISS principle meant that there was nothing dynamic or adaptive being done. There was a hard division between the number of blocks devoted to the inodes and the number of blocks devoted to data. It could never be varied, and the superblock said what those numbers were and where the boundary was. Even the later Berkeley Fast Filesystem followed the principle of preallocation; it just rearranged the positioning.

Some, but sadly not all, later model file systems, ReiserFS, XFS and BtrFS among them, removed this restriction. They use dynamically balanced b-trees for allocation pretty much on demand. There is a pool, and tree-parts are allocated for inodes, name space and data space as needed. Unlike the preallocated file systems, they can never run out of inodes before data or out of data before inodes.  Or the other way round.

Sadly not all late model file systems follow this sensible approach. Even though the ext4FS uses b-trees internally, it still requires preallocation to set the inode/data ratios. Big OUCH!
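The difference between a fixed inode/data split and a shared pool can be shown with a toy model. This is a deliberately crude Python sketch: the class names, the “slot” unit and the 16:1 cost ratio are all assumptions for illustration and do not describe any real on-disk format.

```python
SLOTS_PER_BLOCK = 16   # assumption: a data block costs as much space as 16 inodes


class PreallocatedFS:
    """Split fixed when the file system is made: N inodes, M data blocks."""

    def __init__(self, inode_slots: int, data_blocks: int):
        self.free_inodes = inode_slots
        self.free_blocks = data_blocks

    def create_file(self, blocks: int) -> bool:
        if self.free_inodes < 1 or self.free_blocks < blocks:
            return False               # one side ran out; the other may sit idle
        self.free_inodes -= 1
        self.free_blocks -= blocks
        return True


class PooledFS:
    """One pool; inodes and data are both carved out of it on demand."""

    def __init__(self, pool_slots: int):
        self.free_slots = pool_slots

    def create_file(self, blocks: int) -> bool:
        cost = 1 + blocks * SLOTS_PER_BLOCK
        if self.free_slots < cost:
            return False
        self.free_slots -= cost
        return True


def fill(fs, blocks_per_file: int) -> int:
    """Create files of a fixed size until the file system refuses."""
    count = 0
    while fs.create_file(blocks_per_file):
        count += 1
    return count


# Same raw space in both cases, filled with small one-block files.
fixed = PreallocatedFS(inode_slots=50, data_blocks=100)
pooled = PooledFS(pool_slots=50 + 100 * SLOTS_PER_BLOCK)
fixed_files = fill(fixed, 1)
pooled_files = fill(pooled, 1)
print(fixed_files, fixed.free_blocks, pooled_files)
```

With a wrong guess at the ratio, the preallocated model stops with half its data blocks unused; the pooled model keeps going until the space itself is gone.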

Back to management.
BtrFS does not make the distinction between the file system and the management of media. There are other supports for that. Under Linux the principal one is LVM.

Volume management goes back some way; I used the Veritas Volume Manager on the IBM AIX in the 1990s, and the Linux LVM is derived from that. There’s an amazing amount you can do with LVM.
An example

You start with LVM by assigning a disk or a disk partition as a ‘physical volume’ (PV). You can then go on to create one or more ‘Volume Groups’ (VG) that perhaps span the physical media.

On my primary drive, for example, I create a /boot partition, a SWAP partition and devote the rest of the drive to LVM. My ROOTFS is in LVM.

Within a VG you can then go on to create Logical Volumes, which are akin to disk partitions.
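The three layers map onto three commands. The sketch below is a Python dry run that only builds and prints the command lines; it executes nothing. The device `/dev/sdb1` and the names `vg_data` and `lv_home` are placeholders I have made up, and `lvm_setup_commands` is a hypothetical helper.

```python
import shlex
from typing import List


def lvm_setup_commands(pv: str, vg: str, lv: str, size: str,
                       fstype: str = "ext4") -> List[List[str]]:
    """Return the commands for one PV -> VG -> LV -> file system stack."""
    return [
        ["pvcreate", pv],                          # mark the partition as a PV
        ["vgcreate", vg, pv],                      # create a VG containing it
        ["lvcreate", "-L", size, "-n", lv, vg],    # carve an LV out of the VG
        [f"mkfs.{fstype}", f"/dev/{vg}/{lv}"],     # put a file system in the LV
    ]


commands = lvm_setup_commands("/dev/sdb1", "vg_data", "lv_home", "20G")
for cmd in commands:
    print(shlex.join(cmd))   # print only; run by hand as root if you mean it
```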

I’m probably making this sound a lot more complex than it is, but for each layer of abstraction you have specific functionality which is bundled and either indecipherable or inaccessible in BtrFS. It basically comes down to

* create
* modify
* check
* delete

for each layer. Having separate layers lets you do things that you can’t easily or perhaps *ever* do with BtrFS because BtrFS does not adequately differentiate Volume Management from file system management.

You can, for example, have separate VGs for separate businesses or business segments.

I’ve mentioned that I don’t like the Virtual Machine model for a variety of reasons: replicating the whole OS, libraries and basic file system just to run one application is heavy-handed in storage, in bandwidth to access/load, and in memory to run. That’s why we have “Docker”. In some ways LVM is the Docker of disk management.

The PV and VG layers are normally not something you deal with. If you’re commissioning a new machine, installing a new drive, or having to remove an old or flaky drive that is erroring or about to die, then they concern you. Otherwise they are once in a (machine) lifetime occurrences.

What you may be concerned with more often is Logical Volumes (LV).  These are akin to disk partitions. Unlike a disk partition using FDISK they are managed in the VG space. You can grow them or shrink them or move them from media to media after creation.

The lvcreate command and its companions for modifying an LV afterwards (lvchange, lvconvert, lvresize) have many options, if you choose to employ them. You can, for example, specify how striped or how mirrored an LV is.  Unlike BtrFS you can have separate LVs with quite different characteristics to suit specific system and business needs.

And yes, you can change from mirrored or striped to ‘linear’.
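As a sketch of what “how striped, how mirrored” looks like on the command line, here is another print-only Python helper. The `-i` (stripes) and `-m` (mirrors) options and the `raid1`/`raid10` segment types are from the LVM2 `lvcreate` manual; the helper `lvcreate_args` and the volume names are hypothetical.

```python
import shlex
from typing import List


def lvcreate_args(vg: str, name: str, size: str,
                  stripes: int = 0, mirrors: int = 0) -> List[str]:
    """Build an lvcreate command line for a linear, striped or mirrored LV."""
    args = ["lvcreate", "-L", size, "-n", name]
    if stripes > 1 and mirrors > 0:
        # striped across mirrored groups
        args += ["--type", "raid10", "-i", str(stripes), "-m", str(mirrors)]
    elif stripes > 1:
        args += ["-i", str(stripes)]                      # plain striping
    elif mirrors > 0:
        args += ["--type", "raid1", "-m", str(mirrors)]   # plain mirroring
    args.append(vg)
    return args


# Different LVs in the same VG, each with its own characteristics.
print(shlex.join(lvcreate_args("vg_data", "lv_scratch", "50G", stripes=2)))
print(shlex.join(lvcreate_args("vg_data", "lv_accounts", "10G", mirrors=1)))
print(shlex.join(lvcreate_args("vg_data", "lv_db", "40G", stripes=2, mirrors=1)))
```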

You then go on to create file systems in the LVs.
Oh, yes, you could create a BtrFS in a LVM LV. 🙂

One of the things about using LVM is that it avoids the provisioning problems you have with FDISK. The “partition” boundaries are dynamically flexible. You can grow (or shrink) the size of an LV with the machine running, with the disk in use. There are some file systems that allow you to grow the file system itself just as the LV it is in is growing. Or shrinking.
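Growing an LV and the file system inside it can be a single step. Again a print-only Python sketch: `lvextend` with `-r` (`--resizefs`) is the real LVM2 option for resizing the file system along with the LV, while `grow_lv_command` and the names passed to it are made up for the example.

```python
import shlex
from typing import List


def grow_lv_command(vg: str, lv: str, extra: str) -> List[str]:
    """Command to add `extra` space to an LV and grow its file system too."""
    # -r / --resizefs asks lvextend to resize the file system with the LV
    return ["lvextend", "-r", "-L", f"+{extra}", f"/dev/{vg}/{lv}"]


grow = grow_lv_command("vg_data", "lv_home", "5G")
print(shlex.join(grow))   # print only; nothing is executed here
```

Whether shrinking works the same way depends on the file system; some can only grow.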

You might gather at this point that I have developed a dislike for pre-provisioning. The reality is that even with Terabyte drives it’s better to have a ‘pool’ of slack that can be managed in the future, rather than imposing the kind of boundaries that go with the architectures of the 1960s.

There’s one other thing you can do with LVM and that is termed ‘thin provisioning’. It pushes the concept of avoiding provisioning even further.  Now you have an LV with a file system that is, potentially, 20G in size, but is currently only 4G. Rather than allocate 20G of physical extents right now, hard binding them to that LV, just arrange to ‘bind on demand’ as the file system in that LV actually uses them.

It’s worth noting that the Red Hat manual says:

logical volumes can be thinly provisioned. This allows you to create logical
volumes that are larger than the available extents. Using thin provisioning, you
can manage a storage pool of free space, known as a thin pool, which can be
allocated to an arbitrary number of devices when needed by applications. You can
then create devices that can be bound to the thin pool for later allocation when
an application actually writes to the logical volume. The thin pool can be
expanded dynamically when needed for cost-effective allocation of storage space.


By using thin provisioning, a storage administrator can over-commit the physical
storage, often avoiding the need to purchase additional storage. For example, if
ten users each request a 100GB file system for their application, the storage
administrator can create what appears to be a 100GB file system for each user
but which is backed by less actual storage that is used only when needed.
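The ‘bind on demand’ idea and the over-commit in the manual’s example can be modelled in a few lines. This is a toy Python sketch, not how the kernel’s thin target works: the `ThinPool` class, the one-extent-per-GB scale and the 400-extent pool are assumptions chosen to mirror the ten users with 100GB each.

```python
class ThinPool:
    """Toy thin pool: physical extents are bound to a volume on first write."""

    def __init__(self, physical_extents: int):
        self.free = physical_extents
        self.volumes = {}                 # name -> {"virtual": int, "bound": set}

    def create_volume(self, name: str, virtual_extents: int) -> None:
        # Creation consumes no physical extents at all.
        self.volumes[name] = {"virtual": virtual_extents, "bound": set()}

    def write(self, name: str, extent: int) -> None:
        vol = self.volumes[name]
        if not 0 <= extent < vol["virtual"]:
            raise IndexError("write beyond the volume's virtual size")
        if extent in vol["bound"]:
            return                        # already backed by physical storage
        if self.free == 0:
            raise RuntimeError("thin pool exhausted; extend the pool")
        self.free -= 1                    # bind one physical extent on demand
        vol["bound"].add(extent)

    def virtual_total(self) -> int:
        return sum(v["virtual"] for v in self.volumes.values())


# Ten users each "get" 100 extents, backed by a pool of only 400.
pool = ThinPool(physical_extents=400)
for user in range(10):
    pool.create_volume(f"user{user}", 100)
    for extent in range(4):               # each user has only written 4 so far
        pool.write(f"user{user}", extent)

print(pool.virtual_total(), pool.free)
```

The catch is the `RuntimeError` branch: over-commit only works while someone watches the pool and extends it before the users collectively outgrow it.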

LVM also offers some interesting features using, for example, CoW and the ability to make snapshots of LVs for backups.

Along the way the question arises: “Can you add SSD to LVM?”
Yes you can; it’s just another volume, in one way of looking at it. However, you may want to differentiate the fast from the slow Physical Volumes in a volume group when you allocate LVs. I referenced above a way to use the fast PVs to act as a cache for slower LVs. Part of the power of LVM over BtrFS is that you can do things like this.

> Where in the spectrum from “To be stored undisturbed for posterity” to “extremely
> transient” does the data fall, and along another axis from “Entirely in the
> public domain” to “horrible things will happen if anyone else sees this”?

Those are quite separate issues: one concerns the fact that backups and archives are meant to be immutable (though some backup modes that use RSYNC end up looking more like Revision Control Databases!).

The other is about access control, and hence identification and authentication.

One can reasonably say that we have solved the technical aspects of both of those. Getting people to employ, well, not so much the technology as the operational practices, is, as I’ve mentioned quite a number of times, a completely different matter. All too many problems with the issues you mention arise from that failure.


About the author

Security Evangelist
