Spine and Leaf Basics
As virtualization, cloud computing, and distributed computing frameworks (Hadoop, for example) become more popular in the data center, a shift away from the traditional three-tier networking model is taking place as well.
The traditional core-aggregation-access model is efficient for traffic that travels “North-South”, meaning traffic that flows in and out of the data center. This kind of traffic is typically a web service of sorts–HTTP/S, Exchange, and SharePoint, for example–where there is a lot of remote client/server communication. This type of architecture is usually built for redundancy and resiliency against failure. However, half of the critical network links are typically blocked by Spanning Tree Protocol (STP) to prevent network loops, sitting idle as backups, which means 50% of your maximum bandwidth is wasted (until something fails). Here is an example:
This type of architecture is still very widely used for service-oriented traffic that travels North-South. However, traffic patterns are changing along with the types of workloads that are common in today’s data centers: East-West traffic, or server-to-server traffic. Take a look at the diagram above. If a server connected to the left-most access switch needs to communicate with a server connected to the right-most access switch, what path does it need to take? It travels all the way up to the core switch and back down again. That is not an efficient path, and it adds latency while consuming extra bandwidth. If a cluster of servers (this number can be in the hundreds, or even thousands) is performing a resource-intensive calculation in parallel, the last thing you want to introduce is unpredictable latency or a bandwidth bottleneck. You can have extremely powerful servers performing these calculations, but if the servers can’t talk to each other efficiently because of a bottleneck in your network architecture, that is wasted capital expenditure.
So how do you design for this shift from North-South to East-West traffic? One way is to create a Spine and Leaf architecture, also known as a Distributed Core. This architecture has two main components: spine switches and leaf switches. You can think of the spine switches as the core, but instead of being a large, chassis-based switching platform, the spine is composed of many high-throughput Layer 3 switches with high port density. You can think of the leaf switches as your access layer; they provide network connection points for servers, as well as uplinks to the spine switches. Now, here is the important part of this architecture: every leaf switch connects to every spine switch in the fabric. That point is important because no matter which leaf switch a server is connected to, its traffic always crosses the same number of devices to reach another server (unless the other server is located on the same leaf). This keeps latency at a predictable level because a payload only has to hop to a spine switch and then to another leaf switch to reach its destination.
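That constant-path-length property can be sketched in a few lines of Python. This is purely illustrative (the function name is mine, not from any vendor tool): it just encodes the rule that a frame crosses either one switch (same leaf) or exactly three (ingress leaf, any spine, egress leaf).

```python
# Illustrative sketch: in a spine-leaf fabric every leaf connects to every
# spine, so any server-to-server path crosses the same number of switches.

def switch_hops(src_leaf: int, dst_leaf: int) -> int:
    """Number of switches a frame traverses between two servers."""
    if src_leaf == dst_leaf:
        return 1  # same leaf: only the shared leaf switch
    return 3      # ingress leaf -> any spine -> egress leaf

# The path length is constant no matter which leaves the servers sit on:
assert switch_hops(0, 5) == switch_hops(2, 7) == 3
assert switch_hops(4, 4) == 1
```

Because every leaf-to-leaf path is equal length, adding equal-cost multipath routing across the spines spreads load evenly without creating longer detours.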
You would typically have many more spine and leaf switches in a deployment, but this small-scale diagram gets the fundamental design points across. The beautiful thing about this design is that instead of relying on one or two monster chassis-based switches at the core, the load is distributed across all spine switches, making each spine individually insignificant as you scale out.
Before you design an architecture like this, you will need to know what the current and future needs are. For example, if you have a server count of 100 today that will eventually scale up to 1000 servers, you need to make sure your fabric can accommodate that growth. There are two important variables for calculating your maximum scalability: the number of uplinks on a leaf switch and the number of ports on your spine switches. The number of uplinks on a leaf switch determines how many spine switches you can have in your fabric–remember: every leaf switch has to connect to every spine switch in the fabric! Likewise, the number of ports on a spine switch determines how many leaf switches you can have; this is why spine switches need high port density. Let’s take the example of 100 servers today with a need to scale to 1000 servers in the future. If we plan on using a 24-port 10Gbps switch for the leaf layer, using 20 ports for servers and 4 ports for uplinks, we can have a maximum of 4 spine switches. If each spine switch has 64 10Gbps ports, we can scale out to a maximum of 64 leaf switches. 64 leaf switches x 20 servers per switch = 1280 maximum servers in this fabric. Keep in mind this is a theoretical maximum, and you will need to account for the ports used to connect the fabric to the rest of the data center. Regardless, this design allows for seamless scalability without re-architecting your fabric: you can start off with 5 leaf switches and 4 spine switches to meet your current need of 100 servers, then add leaf switches as more servers are needed.
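The sizing arithmetic above is simple enough to capture in a short helper. This is a minimal sketch using the numbers from the example (a 24-port leaf split into 20 server ports and 4 uplinks, and 64-port spines); the function name and signature are mine, for illustration only.

```python
# Back-of-the-envelope spine-leaf fabric sizing.
# Rules from the text:
#   - every leaf connects to every spine, so leaf uplinks cap the spine count
#   - each leaf consumes one spine port, so spine ports cap the leaf count

def max_fabric_size(leaf_uplinks: int, spine_ports: int,
                    server_ports_per_leaf: int) -> tuple[int, int, int]:
    max_spines = leaf_uplinks                        # one uplink per spine
    max_leaves = spine_ports                         # one spine port per leaf
    max_servers = max_leaves * server_ports_per_leaf
    return max_spines, max_leaves, max_servers

spines, leaves, servers = max_fabric_size(leaf_uplinks=4,
                                          spine_ports=64,
                                          server_ports_per_leaf=20)
print(spines, leaves, servers)  # 4 64 1280
```

Plugging in the example numbers reproduces the figures from the text: 4 spines, 64 leaves, and a theoretical maximum of 1280 servers.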
Another factor to keep in mind when designing your fabric is the oversubscription ratio. This ratio is calculated on the leaf switches, and it is defined as the max throughput of active southbound connections (down to servers) divided by the max throughput of active northbound connections (uplinks). If you have 20 servers each connected with 10Gbps links and 4 10Gbps uplinks to your spine switches, you have a 5:1 oversubscription ratio (200Gbps/40Gbps). It is not likely that all servers are going to be communicating at 100% throughput 100% of the time, so it is okay to be oversubscribed. Keeping that in mind, work with the server team to figure out what an acceptable ratio is for your purpose.
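The oversubscription calculation is just southbound capacity over northbound capacity. Here is a minimal sketch (the function name is mine, for illustration) that reproduces the 5:1 figure from the example above.

```python
def oversubscription_ratio(server_links: int, server_link_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Southbound (server-facing) capacity divided by northbound (uplink) capacity."""
    southbound = server_links * server_link_gbps  # e.g. 20 x 10 = 200 Gbps
    northbound = uplinks * uplink_gbps            # e.g.  4 x 10 =  40 Gbps
    return southbound / northbound

# 20 servers at 10 Gbps behind 4 x 10 Gbps uplinks:
ratio = oversubscription_ratio(20, 10, 4, 10)
print(f"{ratio:.0f}:1")  # 5:1
```

A higher ratio means more contention for the uplinks if many servers burst at once, which is why the acceptable value depends on the actual workload profile.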
Cisco Nexus 5500 vs 6001
A common deployment of a spine and leaf architecture I have seen is to use Cisco Nexus 5548 or 5596 switches with a Layer 3 daughter card as the choice of spine switch. At first glance, this looks like a great switch at a low price point to use as a spine: 960 Gbps of forwarding on the 5548 and 1920 Gbps on the 5596, as well as 48 or 96 10Gbps ports, which is plenty of density for a small to mid-sized implementation (48 or 96 leaf switches). However, what is commonly missed in the specification sheet is that once you add the Layer 3 daughter card to one of these switches, Layer 3 forwarding drops to 160 Gbps, or 240 Mpps (240 million packets per second). This is a huge performance hit and is definitely not sufficient for a spine switch at a large scale. Also of note, the MAC address table can handle 32,000 entries. The list price for a Cisco Nexus 5548 with the 16-port expansion module and Layer 3 daughter card comes out to $41,800 (without SMARTnet services attached).
Now let’s take a look at the Cisco Nexus 6001. The reason I’m comparing it to the Nexus 5548 for this application is that Cisco recently dropped the list price of the 6001 by 42%; if you buy through a reseller, the discounts will be even deeper. On to the specification comparison–take a look at the table below:
| Switch | Port Density | Forwarding Rate – Layer 2 | Forwarding Rate – Layer 3 | MAC Entries | List Price |
|---|---|---|---|---|---|
| Nexus 5548 | 48 SFP+ ports | 960 Gbps or 714.24 Mpps | 160 Gbps or 240 Mpps | 32,000 | $41,800 |
| Nexus 6001 | 48 SFP+ and 4 QSFP+ ports | 1.28 Tbps | 1.28 Tbps | 256,000 | $40,000 |
The Nexus 6001 beats the 5548 in every comparison above, even port density, because you can use a breakout cable to convert each QSFP+ interface into 4 SFP+ interfaces, adding a total of 16 more 10Gbps interfaces. Performance on the 6001 is the same whether it is forwarding at Layer 2 or Layer 3. The MAC table is 8 times larger. Plus, by the time you add a 16-port expansion module (to bring the port count from 32 to 48) and a Layer 3 daughter card to the Nexus 5548, the 6001 list price (with the recent discount) actually comes in lower. If you already have a Cisco infrastructure and are looking to build a small distributed core architecture, the Nexus 6001 is a no-brainer as a spine switch. For a larger-scale architecture, the Nexus 6004 provides up to 384 10Gbps interfaces or 96 40Gbps interfaces, but at a much higher price point, of course.
This post was a ton of information, so if you have any questions or comments, I will be glad to answer them. Leave a comment or email me at firstname.lastname@example.org. I look forward to hearing some discussion around this!