Let’s Encrypt – How do I Cron?

Let’s Encrypt was really easy to set up, but Cron was less so. I kept getting emails that the Let’s Encrypt renewal was failing:

2017-03-09 02:51:02,285:WARNING:letsencrypt.cli:Attempting to renew cert from /etc/letsencrypt/renewal/bbbburns.com.conf produced an unexpected error: The apache plugin is not working; there may be problems with your existing configuration.
The error was: NoInstallationError(). Skipping.
1 renew failure(s), 0 parse failure(s)

I had a cron job set up with the absolute bare minimum:

crontab -e
56 02 * * * /usr/bin/letsencrypt renew >> /var/log/le-renew.log

When I ran
/usr/bin/letsencrypt renew
at the command line, everything worked just fine. I was like, “Oh – this must be some stupid cron thing that I used to know, but never remember.”

Turns out the problem was the cron environment’s PATH variable. Cron didn’t have /usr/sbin in its PATH, and certbot apparently needs that to find the apache2 binary. The fix was to change the crontab entry to the following:

56 02 * * * /root/le-renew.sh

Then create a script that runs the renewal after the PATH variable is set correctly:

cat /root/le-renew.sh
#!/bin/bash
#Automate the LE renewal process

#Need /usr/sbin for apache2
# https://github.com/certbot/certbot/issues/1833
export PATH=$PATH:/usr/sbin

#Renew the certs and log the results
/usr/bin/letsencrypt renew >> /var/log/le-renew.log

It was a good thing I put the link to the problem right in the script, or I never would have been able to find it again to write this blog.

NOW my renewal works absolutely fine. Problem solved. Thanks Cron.
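For the record, if you’d rather not maintain a wrapper script, most cron implementations (including the Vixie cron that ships with Debian and Ubuntu) let you set PATH at the top of the crontab itself, which should accomplish the same thing:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
56 02 * * * /usr/bin/letsencrypt renew >> /var/log/le-renew.log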

Let’s Encrypt – Easy – Free – Awesome

I recently saw a news article about StartCom being on Mozilla and Google’s naughty list. Things looked bad, and my StartCom certs were up for renewal on the blog.

I had been seeing articles about Let’s Encrypt fly around for a while. The idea seemed awesome, but the website seemed so light on technical instructions that I didn’t know if it would actually work. I wanted to know EXACTLY what lines it would propose to hack into my carefully manicured Apache configuration. And by carefully manicured, I mean “strung together with stuff I copied and pasted from Stack Overflow”.

I couldn’t find the information I really wanted – so I just JUMPED in and started installing things and running commands. 30 seconds later, I had a fully functioning cert on my site. I was blown away. It copied my existing non-SSL vhost config and created a new vhost with SSL enabled. All I had to do was enter my email address, select the vhost to enable SSL for, and hit GO.
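For the curious, “installing things and running commands” was roughly the following on my Ubuntu box – treat it as a sketch, since package names vary by distro and newer releases ship the same tool as certbot:

sudo apt-get install python-letsencrypt-apache
sudo letsencrypt --apache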

I had to put in a crontab entry myself to get the auto-renewal to work, but that wasn’t so bad. I would hope they improve that in the future – but cron is no big deal.

I’m interested to see if everything works when my web certs expire 90 days from now! Crazy times. I used to do this once per year and dread it because the process was so manual. Now that it’s automated – I’ll get new certs while I’m sleeping. Woohoo.

Nutanix AHV Best Practices Guide

In my last blog post I talked about networking with Open vSwitch in the Nutanix Acropolis Hypervisor. Today I’m happy to announce the continuation of that initial post – the Nutanix Acropolis Hypervisor Best Practices Guide.

Nutanix Acropolis introduced the concept of AHV, based on the open source Linux KVM hypervisor. A new Nutanix node comes installed with AHV by default with no additional licensing required. It’s a full-featured virtualization solution that is ready to run VMs right out of the box. ESXi and Hyper-V are still great on Nutanix, but AHV should be seriously considered because it has a lot to offer, with all of KVM’s rough edges rounded off.

Part of introducing a new hypervisor is describing all of the features, and then recommending some best practices for those features. In this blog post I wanted to give you a taste of the doc with some choice snippets to show you what this Best Practice Guide and AHV are all about.

Take a look at Magnus Andersson’s excellent blog post on terminology for some more detailed background on terms.

Acropolis Overview

Acropolis (one word) is the name of the overall project encompassing multiple hypervisors, the distributed storage fabric, and the app mobility fabric. The goal of the Acropolis project is to provide seamless, invisible infrastructure whether your VMs live in AWS, Hyper-V, ESXi, or AHV. The sister project, Prism, provides the user interface to manage via GUI, CLI, or REST API.

Acropolis_Prism_Block_Diagram
AHV Overview

AHV is based on the open source KVM hypervisor, but is enhanced by all the other components of the Acropolis project. Conceptually, AHV has access to the Distributed Storage Fabric for storage, and the App Mobility Fabric powers the management plane for VM operations like scheduling, high availability, and live migration.

Acropolis Architecture CVM Scale

The same familiar Nutanix architecture exists, with a network of Controller Virtual Machines providing storage access to VMs. The CVM takes direct control of the underlying disks (SSD and HDD) with PCI passthrough, and exposes these disks to AHV via iSCSI (the blue dotted VM I/O line). The management layer is spread across all Nutanix nodes in the CVMs using the same web-scale principles as the storage layer. This means that, by default, a highly available VM management layer exists. No single point of failure anymore! No additional work to set up VM management redundancy – it just works that way.

AHV Networking Overview

Networking in AHV is provided by an Open vSwitch (OVS) instance running on each AHV host. The BPG doc has a comprehensive overview of the different components inside OVS and how they’re used. I’ll share a teaser diagram of the default network config on a single AHV node after installation.

acropolis_initial_install
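If you want to poke at the default config on your own cluster, a couple of read-only commands show the same picture. This is a sketch – ovs-vsctl runs on the AHV host itself, while manage_ovs is the helper available on the CVM – so check the BPG for the exact syntax in your release:

# On the AHV host: list bridges, bonds, and ports as OVS sees them
ovs-vsctl show

# From the CVM: a friendlier summary of uplinks and interfaces
manage_ovs show_uplinks
manage_ovs show_interfaces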

AHV Networking Best Practices

Bridges, Bonds, and Ports – oh my. What you really want to know is “How do I plug this thing into my switches, set up my VLANs, and get the best possible load balancing?” You’re in luck, because the Best Practice Guide covers the most common scenarios for creating different virtual switches and configuring load balancing.

Here’s a closer look at one possible networking configuration, where the 10-gigabit and 1-gigabit adapters have been connected to separate OVS bridges. With this design, User VM2 can connect to multiple physically separate networks, which allows for things like virtual firewalls.

acropolis_ovs_reco_2-10g
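The BPG has the authoritative commands, but the shape of the change looks roughly like this. Consider it a hedged sketch – the bridge and bond names are the defaults I’d expect, and the manage_ovs flags should be verified against the guide:

# On each AHV host: create a second bridge for the 1-gigabit adapters
ovs-vsctl add-br br1

# From the CVM: keep only the 10g uplinks on the default bridge br0
manage_ovs --bridge_name br0 --bond_name br0-up --interfaces 10g update_uplinks

# Then hang the 1g uplinks off the new br1
manage_ovs --bridge_name br1 --bond_name br1-up --interfaces 1g update_uplinks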

After separating network traffic, the next thing is load balancing. Here’s a look at another possible load balancing method, balance-slb. Not only does the BPG provide the configuration for this, but also the rationale. Maybe fault tolerance is important to you. Maybe an active-active configuration with LACP is important. The BPG covers the config and the best way to achieve your goals.
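As a taste, switching the default bond over to balance-slb is a quick change to the bond port on each AHV host. Again, a sketch – the BPG also recommends a specific rebalance interval, so take the value below as a placeholder:

# On each AHV host: hash source MACs across both uplinks in the bond
ovs-vsctl set port br0-up bond_mode=balance-slb

# Optionally adjust how often OVS rebalances MACs across the uplinks (milliseconds)
ovs-vsctl set port br0-up other_config:bond-rebalance-interval=30000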

For information on VLAN configuration, check out the Best Practices Guide.

Other AHV Best Practices

This BPG isn’t just about networking. The standard features you expect from a hypervisor are all covered.

  • VM Deployment
    • Leverage the fantastic aCLI, GUI, or REST API to deploy or clone VMs.
  • VM Data Protection
    • Back up VMs with local or remote snapshots.
  • VM High Availability
    • During physical host failure, ensure that VMs are started elsewhere in the cluster.
  • Live Migration
    • Move running VMs around in the cluster.
  • CPU, Memory, and Disk Configuration
    • Add the right resources to machines as needed.
  • Resource Oversubscription
    • Rules for fitting the most VMs onto a running cluster for max efficiency.

Take a look at the AHV Best Practice Guide for information on all of these features and more. With this BPG in hand you can be up and running with AHV in your datacenter and get the most out of all the new features Nutanix has added.

Survivable UC – Avaya Aura and Nutanix Data Protection

I wanted to share a bit of cool “value add” today, as my sales and marketing guys would call it. This is just one of the things a Nutanix deployment can bring to the table for Avaya Aura, and for UC in general.

Nutanix has the concept of Protection Domains and Metro Availability that have been covered in pretty great detail by some other Nutanix bloggers. Check out detailed articles here by Andre Leibovici, and here by Magnus Andersson for in depth info and configuration on Metro Availability.

Non-redundant Applications

In an Avaya Aura environment, most machines will be protected from failure at the application level. A hot standby VM will be running to take over operation in the event of primary machine failure, as with Session Manager and Communication Manager. In the following example we see that System Manager, AES, and a number of other services don’t have a hot standby. This might be because it’s too expensive resource-wise or licensing-wise, or the application demands don’t call for it.

1000-user_topology

If multiple Nutanix clusters are in place, we actually have two ways to protect these VMs at the Nutanix level.

Nutanix Protection Domains

First, let’s look at Protection Domains. With a Protection Domain, we configure an NDFS (Nutanix Distributed Filesystem) level snapshot that happens at a configurable interval. This snapshot is intelligently (with deduplication) replicated to another Nutanix cluster. It’s different from a vSphere snapshot because the Virtual Machine has no knowledge that a snapshot took place and no VMDK fragmentation is required. None of the standard warnings and drawbacks of running with snapshots apply here. This is a Nutanix metadata operation that can happen almost instantly.

We pick individual VMs to be part of the Protection Domain and replicate these to one or more sites.

In the event of a failure of a site or cluster, the VM can be restored at another site, because all of the files that make up the Virtual Machine (excluding memory) are preserved on the second Nutanix cluster.

ProtectionDomain
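If you prefer a prompt to Prism, the same setup can be scripted from the CVM with ncli. Treat this as a purely illustrative sketch – the protection domain and VM names are made up, and the exact entity names and flags should be confirmed against ncli’s built-in help:

# Create a protection domain and add the non-redundant Aura VMs to it
# (names here are placeholders)
ncli protection-domain create name="Aura-NonRedundant"
ncli protection-domain protect name="Aura-NonRedundant" vm-names="SystemManager,AES"

# Verify what is protected; schedules and remote sites are configured the same way
ncli protection-domain list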

 

Nutanix Metro Availability

But I hear you saying, “Jason that’s great, but a snapshot taken at intervals is too slow. I can’t possibly miss any transactions. My UC servers are the most important thing in my Data Center. I need my replication interval to be ZERO.” This is where Metro Availability comes in.

Metro Availability is a synchronous write operation that happens between two Nutanix clusters. The requirements are:

  1. A new Nutanix container must be created for the Metro Availability protected machines.
  2. RTT latency between clusters must be less than 5 milliseconds (about 400 kilometers)

Since this write is synchronous, all disk write activity on a Metro Availability-protected VM must be completed on both the local and the remote cluster before it’s acknowledged. This means all data writes are guaranteed to be protected in real time. The real-world limitation here is that every bit of distance between clusters adds latency to writes. If your application isn’t write-heavy you may be able to hit the max RTT limit without noticing any issues. If your application does nothing but write constantly to disk, 400 km may need to be re-evaluated. Most UC machines are generally not disk intensive, though. Lucky you!

MetroAvailability

In the previous image we have two Nutanix clusters separated by a metro ethernet link. The standalone applications like System Manager, Utility Services, Web License Manager, and Virtual Application Manager are being protected with Metro Availability.

In the event of a Data Center 1 failure, all of the redundant applications will already be running in Data Center 2. The administrator can then start the non-redundant VMs, either manually or through a detection script, using the synchronous copies residing in Data Center 2.

Summary

Avaya Aura Applications are highly resilient and often provide the ability for multiple copies of each app to run simultaneously in different locations, but not all Aura apps work this way. With Nutanix and virtualization, administrators have even more flexibility to protect the non-redundant Aura apps using Protection Domains and Metro Availability.

These features present a consumer-friendly GUI for ease of operation, and also expose APIs so the whole process can be automated into an orchestration suite. These Nutanix features can provide peace of mind and real operational survivability on what would otherwise be very bad days for UC admins. Nutanix allows you to spend more time delivering service and less time scrambling to recover.

 

 

Virtualized Avaya Aura on Nutanix – In Progress


The Avaya Technology Forum in Orlando was a great success! Thanks to everyone who attended and showed interest in Nutanix by stopping at the booth. I met a lot of interested potential customers and partners and was also able to learn more about what people are virtualizing these days. There is nothing quite like asking people directly “What virtualization projects do you have coming up?”

Explaining the Nutanix Distributed Filesystem

After talking about Nutanix and what I do on the Solutions team, some key themes I heard repeated by attendees were:

“Wow, that’s really cool technology!”

and

“When will you have a document for Avaya Aura?”

The response to the first one is easy. Yeah, I think it’s really cool technology too. Nutanix will allow you to compress a traditional three-tier architecture into just a few rack units. It gives you the benefits of locally attached fast flash storage AND the benefits of a shared storage pool. Customers can use this to save money, improve performance, and focus on their applications instead of their infrastructure. After you compress, you can also scale out the Nutanix cluster by adding nodes, with no hard limit in place. Performance grows directly with cluster growth.

The second question is actually why I’m writing this blog today. When will the reference architecture for Avaya Aura on Nutanix be completed?

I’m in the research phase now because Avaya Aura is a monster of an application. It’s actually a set of dozens of different systems that all work together. Each system will have its own requirements for virtualization. Part of getting a reference architecture or best practices guide right is figuring out what each individual component requires to succeed.

Let’s give an example by looking at the Avaya Aura Virtual Environment overview doc. This is the list of the different OVAs that are available:

Avaya Aura® applications for VMware
• Avaya Aura® Communication Manager
• Avaya Aura® Session Manager
• Avaya Aura® System Manager
• Avaya Aura® Presence Services
• Avaya Aura® Application Enablement Services
• Avaya Aura® Agile Communication Environment (ACE)
• Avaya Aura® Messaging
• Communication Manager Messaging
• Avaya Virtual Application Manager
• Avaya Aura® Utility Services
• WebLM
• Secure Access Link
• Session Border Controller for Enterprise
• Avaya Aura Conferencing

Avaya Call Center on VMware (OVA files)
• Avaya Aura® Call Center Elite
• Elite Multichannel Feature Pack
• Avaya Aura® Experience Portal
• Call Management System

Each of the applications listed above is a separate OVA file available from Avaya. Each application has its own sizing, configuration, and redundancy guides. To deploy an Aura solution you can use some or all of these components.

An Aura document on Nutanix is in the works, but it’s going to be a lot of WORK. I plan on focusing on just the core components at first and a few sample deployments to cover the majority of cases.

I’ve read every single Avaya Virtual Environment document and now just need to compile this information into an easy-to-digest, Nutanix-centric format. In the meantime, if you have questions about Avaya Aura on Nutanix, feel free to reach out to me @bbbburns.

The great thing so far is that I don’t see any potential roadblocks to deploying Aura on Nutanix. In fact, at the ATF we performed a demo Aura deployment on a single Nutanix 3460 block (4 nodes). We demonstrated Nutanix node failure and Aura call survivability of the active calls and video conferences.

Part of the challenge of deploying any virtual application, especially real-time applications, is that low latency is KING. This was repeated over and over by all the Avaya Aura experts at the conference. Aura doesn’t use storage very heavily, but since it’s a real-time app the performance had better be there when the app asks for it. All the war stories around virtualizing Aura dealt with oversubscribed hosts, oversubscribed storage, or contention for resources.

Deploying Aura on Nutanix is going to eliminate these concerns! Aura apps will ALWAYS have fast storage access. There will never be any contention because our architecture precludes it. I’m excited to work on projects like this because I know customers are going to save HUGE amounts of money while also gaining performance and reliability.

We really will change your approach to the data center.

Nutanix and UC – Part 3: Cisco UC on Nutanix

In the previous posts we covered an Introduction to Cisco UC and Nutanix as well as Cisco’s requirements for UC virtualization. To quickly summarize… Nutanix is a virtualization platform that provides compute and storage in a way that is fault tolerant and scalable. Cisco UC provides a VMware-centric virtualized VoIP collaboration suite that allows clients on many devices to communicate. Cisco has many requirements that must be met before their UC suite can be deployed in a virtual environment, and the Nutanix platform is a great way to satisfy them.

In this post I’m going to cover the actual sizing and implementation details needed to design and deploy a real world Cisco UC system. This should help tie all the previous information together.

Cisco UC VM Sizing

Cisco UC VMs are deployed in a two-part process. The first part is a downloaded OVA template and the second part is an installation ISO. The OVA determines the properties of the VM, such as the number of vCPUs, the amount of RAM, and the number and size of disks, and creates an empty VM. The installation ISO then copies the relevant UC software into the newly created blank VM.
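If you would rather script the OVA half than click through the vSphere wizard, VMware’s ovftool can push the template for you. A hedged sketch – the file name, datastore, network, and vCenter path below are placeholders for whatever exists in your environment:

# Deploy a downloaded Cisco UC OVA to a Nutanix-backed datastore through vCenter
# (OVA file, datastore, network, and inventory path are all placeholders)
ovftool --name=CUCM-PUB --datastore=NTNX-UC-CTR --network="VM Network" \
  cucm.ova \
  "vi://administrator@vcenter.example.com/UC-DC/host/UC-Cluster/"

The application install ISO is still a separate step, which we’ll get to below.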

There are two ways to size Cisco UC VMs:

  1. Wing it from experience
  2. Use the Cisco Collaboration Sizing Tool

I really like “Option 1 – Wing it from experience” since the sizing calculator is pretty complicated and typically provides output that I could have predicted based on experience. “Option 2 – Collaboration Sizing Tool” is a requirement whenever you’re worried about load and need to be sure a design can meet customer requirements. Unfortunately, the Sizing Tool can only be used by registered Cisco partners, so for this blog post we’re just going to treat it as a black box.

Determine the following in your environment:

  • Number of Phones
  • Number of Lines Per Phone
  • Number of Busy Hour calls per line
  • Number of voicemail boxes
  • Number of Jabber IM clients
  • Number of Voice Gateways (SIP, MGCP, or H.323)
  • Redundancy Strategy (where is your failover, what does it look like?)

Put this information into the Collaboration Sizing Tool and BEHOLD the magic.

Let’s take an example where we have 1,000 users and we want 1:1 call processing redundancy. This means we need capacity for 1,000 phones on one CUCM call processor, and 1,000 phones on the failover system. We would also assume each user has one voicemail box and one Jabber client.

This increases our total to 2,000 devices (1 phone and 1 Jabber per user) and 1,000 voicemail boxes.

Let’s assume that experience, the Cisco Sizing Tool, or our highly paid and trusted consultant tells us we need a certain number of VMs of a certain size to deploy this environment. The details are all Cisco UC specific and not really Nutanix specific so I’ll gloss over how we get to them.

We need a table with “just the facts” about our new VM environment:

Product   VM Count   vCPUs   RAM    HDD     OVA
CUCM      2          1       4GB    80GB    2,500 user
IM&P      2          1       2GB    80GB    1,000 user
CUC       2          2       4GB    160GB   1,000 user
CER       2          1       4GB    80GB    20,000 user
PLM       1          1       4GB    50GB    N/A

The first column tells us the Cisco UC application. The second column tells us how many VMs of that application are needed. The rest of the columns are the details for each individual instance of a VM.

The DocWiki page referenced in the last article has details of all OVAs for all UC products. In the above example we are using a 2,500 user CUCM OVA. If you wanted to use a 10,000 user OVA for each CUCM VM, the stats can easily be found:

CUCM OVA Sizes

 

Visit the DocWiki link above for all stats on all products.

Reserving Space for Nutanix CVM

The Nutanix CVM runs on every hypervisor host in the cluster so it can present a virtual storage layer directly to the hypervisor using local and remote disks. By default it will use the following resources:

  • 8 vCPU (only 4 reserved)
    • Number of vCPUs actually used depends on system load
  • 16GB RAM
    • Increases if compression or deduplication are in use
  • Disk

In a node where we have 16 cores available, this means we’d have 12 cores (16 minus the 4 reserved) for all guest VMs such as Cisco UC. A cautious reading of Cisco’s requirements, though, would instruct us to be more careful with the math.

The Cisco DocWiki page says “No CPU oversubscription for UC VMs”, which means that in theory we could be in an oversubscribed state if we provision the following in a 16 core node:

CVM x 4 vCPUs, UC VMs x 12 vCPUs = 16 total

It’s safer to provision:

CVM x 8 vCPUs, UC VMs x 8 vCPUs = 16 total

Even though it’s unlikely the CVM will ever use all 8 vCPUs.
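If you like to sanity-check the per-node budget while planning, the arithmetic is simple enough to script. A trivial sketch – adjust the core count and CVM size to match your node model and CVM configuration:

#!/bin/bash
# Per-node vCPU budget, assuming 1:1 vCPU-to-core and a fully provisioned CVM
CORES_PER_NODE=16
CVM_VCPUS=8
echo "vCPUs available for UC VMs per node: $((CORES_PER_NODE - CVM_VCPUS))"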

Placing Cisco UC VMs

That’s a lot of text. Let’s look at a picture of how that placement works on a single node.

I’ve taken a single Nutanix node and reserved vCPU slots (on paper) for the VMs I want to run. Repeat this process for additional Nutanix nodes until all of your UC VMs have a place to live. Depending on the Nutanix system used you may have a different number of cores available. Consult the Nutanix hardware page for details about all of the available platforms. As new processors are released this page is sure to be updated.

*EDIT on 2015-10-23* Nutanix switched to a “Configure To Order” model and now many more processor core options are available, from 2×8 core all the way up to 2×18 core. This provides a lot of flexibility for sizing UC solutions.

The shaded section of the provisioned, but not reserved, CVM vCPU allocation is critical to sizing and VM placement. These are 4 vCPUs that will go unused unless the system is running at peak load. UC VMs are typically not IOPS intensive, so I would recommend running some other non-Cisco workload in this free space. This allows you to get full efficiency from the Nutanix node while also following Cisco guidance.

Follow best practices by spreading important functions across multiple separate nodes in the cluster. This applies to ALL virtualization of UC. If we have one piece of hardware running our primary server for 1,000 users, it’s probably a good idea that the backup unit run on a DIFFERENT piece of hardware. In this case, another Nutanix node is how we accomplish that.

Remember that at least 3 Nutanix nodes must be used to form a cluster. In the diagram above I’ve shown just a single node, but we’ll have at least two more nodes to place any other VMs we like following all the same rules. In a large Nutanix environment a cluster could contain MANY more nodes.

Installation Considerations

After the UC VM OVAs are deployed the next step is to actually perform the application installation. Without installation the VM is just an empty shell waiting for data to be written to the disk.

I’ll use an example CUCM install because it’s a good proxy for other UC applications.

Cisco_UC_Diagrams_shadow_ISO

The first Nutanix node has two CUCM servers and the second Nutanix node also has two CUCM servers. The installation ISO has to be read somehow by the virtual machine as it’s booted. In VMware we have a number of options available.

  • Read from a drive on the machine where vSphere Client is running
  • Read from a drive inserted into the ESXi Host
  • Read from an ISO located on a Datastore

DataStoreISO

When we select Datastore we can leverage a speedup feature of the Nutanix NDFS. If we put the CUCM ISO in the same NDFS container where the VM disk resides we can use Shadow Clones to make sure that the ISO is only ever read over the network once per Nutanix node.

In our previous example with two CUCM servers, the first CUCM server on the second node would be installed from Datastore. When the second CUCM installation was started on that same second node, it would read the ISO file from the local NDFS shadow clone copy.

 Rinse and Repeat

For all of the UC VMs and all Nutanix nodes the same process would be followed:

  1. Figure out how many and what size UC VMs are needed.
  2. Plan the placement of UC VMs on Nutanix nodes by counting cores and staggering important machines.
  3. Deploy the OVA templates according to your plan.
  4. Install the VMs from ISO making sure to use the Datastore option in vSphere.

In our next blog post we’ll  look at tools that can be used to make VM placement a bit easier and size Nutanix for different workloads.

Thanks for following along! Your comments are always welcome.

Nutanix and UC – Part 2: Cisco Virtualization Requirements

In the last post I covered an Introduction to Cisco UC and Nutanix. In this post I’ll cover UC performance and virtualization requirements.

A scary part of virtualizing Cisco Unified Communications is worrying about being fully supported by Cisco TAC if a non-standard deployment path is chosen. This is due to a long history of strict hardware requirements around UC. When Cisco UC was first released in its current Linux-based incarnation around 2006 as version 5.0, it could only be installed on certain HP and IBM server models. Cisco was VERY strict about hardware revisions of these servers, and a software-to-hardware matrix was made available.

This led to the creation of a “Specifications” table, listing exact processors, disks, and RAM for each supported server model. When you hear “Specifications Based” or “Spec Based” it all started here.

Customers were welcome to purchase a server directly from HP or IBM that used all of the same hardware components, but the Cisco MCS server (which was just a rebranded HP or IBM server) was recommended. If it was discovered that a customer had deviated from the hardware specs listed in the matrix, they could be in an unsupported configuration. If that unsupported configuration was found to be causing a particular problem, the customer might have had to change out the server hardware before further support could be obtained. These calls to technical support were often stressful and harrowing if it turned out the hardware purchase didn’t follow the Spec-based matrix exactly.

From a support perspective this makes sense. UC is a critical real-time application and non-standard hardware with less than excellent performance characteristics could cause all sorts of hard to diagnose and hard to troubleshoot problems. Working in support I saw my fair share of these cases where failing disks or problem hardware caused periodic interruptions that only revealed themselves through odd and intermittent symptoms.

UC Performance Needs

Let’s take a break from history to look at why performance is so critical to UC.

Signal-vs-Media
Figure 1: Signal vs Media

 

Figure 1 shows where the CUCM Virtual Machine fits into the call path. Each IP Phone will have a TCP session open at all times for call control, sometimes called signaling. Typically in a Cisco environment this is the SCCP protocol, but things are moving to the SIP protocol as an open standard. All the examples below assume SCCP is in use.

The SCCP call control link is used when one phone wants to initiate a call to another phone. Once a call is initiated, a temporary media link carrying Real-time Transport Protocol (RTP) audio/video traffic is established directly between phones. The following process is used to make a phone call.

Basic Phone Call Process

  1. User goes off hook by lifting handset, pressing speaker, or using headset
  2. User receives dial-tone
  3. User dials digits of desired destination and prepares a media channel
  4. CUCM performs destination lookup as each digit is received
  5. CUCM sends back confirmation to calling user that the lookup is proceeding
  6. CUCM sends “New Call” to destination IP Phone
  7. Destination phone responds to CUCM that it is ringing
  8. CUCM sends back confirmation to calling phone that the destination is ringing
  9. Destination phone is answered
  10. CUCM asks destination phone for media information (IP, port, audio codec)
  11. CUCM asks originating phone for media information (IP, port, audio codec)
  12. CUCM relays answer indication and media information to the originating phone
  13. CUCM relays media information to the destination phone
  14. Two way audio is established directly between the IP phones

At every step in the above process one or more messages have to be exchanged between the CUCM server and one of the IP phones. There are three places delay is commonly noticed by users:

  1. Off-hook to dial-tone
    1. User goes off hook, but CUCM delays the acknowledgement. This leads to perceived “dead air”
  2. Post dial delay
    1. User dials all digits, but doesn’t receive lookup indication (ringback). This can cause users to hang up. This is EXTREMELY important to avoid because during a 911 call users will typically only wait a second or two to hear some indication that the call is in progress before hanging up. Consider the psychological impact and stress of even momentary dead air during an emergency situation.
  3.  Post answer, media cut-through delay
    1. Destination phone answers, but audio setup is delayed at the CUCM server. This leads to a user picking up a phone saying “Hello, this is Jason”, and the calling user hearing “o, this is Jason”.

Also consider that each of the above messages for a single phone call had to be logged to disk. Huge advances have been made in compression and RAM-disk usage, but log writing is still a critical component of a phone call. Call logs and call records are crucial to an enterprise phone system.

Let’s look at this at scale.

cluster-scale
Figure 2: Cluster Scale

With a cluster of fully meshed call control servers and tens of thousands of IP phones, the situation is slightly more complex. Any single phone can still call any other phone, but now an extra lookup is needed. Where the destination phone registers for call control traffic is now important. Users in the Durham office may be located on a different Call Control server than users in the San Jose office. This means all of the above steps must now be negotiated between two different CUCM servers as well as the two phone endpoints.

CUCM uses Intra-Cluster Communication Signaling (ICCS) for lookups and call control traffic between servers. A problem on any one server could now spell disaster for thousands of users who need to place calls and get an immediate response. Any server response time latency will be noticed.

Now that we have some background on why performance is so crucial to a real time communication system, let’s get back to the history.

Enter virtualization!

Cisco was slow to the virtualization game with Unified Communications. All the same fears about poor hardware performance were amplified, with the hypervisor adding another potentially hard-to-troubleshoot abstraction layer. Virtualization support was first added only for certain hardware platforms (Cisco MCS) and only with certain Cisco UC versions. All the same specifications-based rules applied to IBM servers (by this point HP was out of favor with Cisco).

What everyone knew is that virtualization was actually amazing for Cisco UC – in the lab. Every aspiring CCIE Voice candidate had snapshots of Cisco UC servers for easy lab recreates. Customers had lab or demo deployments for proof of concept or testing. Cisco used virtualization extensively internally for testing and support.

A Cisco UC customer wanting to virtualize had two options at this point for building a virtual Cisco UC cluster on VMware.

  1. Buy Cisco MCS servers (rebranded IBM)
  2. Buy IBM servers

The Cisco DocWiki page was created and listed the server requirements and IBM part numbers and a few notes about VMware configuration.

To any virtualization admin it should be immediately clear that neither of the above options is truly desirable. Virtualization was supposed to give customers choice and flexibility, and so far there was none. Large customers were clamoring for support for Hardware Vendor X, where X was whatever their server virtualization shop was running. Sometimes Cisco UC customers were direct competitors to IBM, so imagine the conversation:

“Hello IBM competitor. I know you want Cisco UC, but you’ll have to rack these IBM servers in your data center.”

Exceptions were made and the DocWiki was slowly updated with more specifications based hardware.

Cisco UCS as Virtualization Door Opener

Cisco Unified Computing System (UCS) is what really drove the development of the Cisco DocWiki site to include considerations for Network Attached Storage and Storage Area Networks. Now Cisco had hardware that could utilize these storage platforms, and best practices needed to be documented for customer success. It also started the process of loosening the tight coupling between UC support and very specific server models. Now a whole class of servers based on specifications could be supported. This is largely the result of years of caution and strict requirements that allowed UC and virtualization to mature together. Customers had success with virtualization and demanded more.

UC Virtualization Requirements Today

Today everything about Cisco UC Virtualization can be found on the Cisco DocWiki site. A good introductory page is the UC Virtualization Environment Overview, which serves to link to all of the other sub pages.

In these pages you’ll find a number of requirements that cover CPU, RAM, Storage, and VMware. Let’s hit the highlights and show how Nutanix meets the relevant requirements.

Oversubscription

This isn’t anything Nutanix-specific, but it’s important nonetheless. No oversubscription of ANY resource is allowed. CPUs must be mapped 1 vCPU to 1 physical core (ignoring hyper-threaded logical core counts). RAM must be reserved for the VM. Thick provisioning is recommended for storage, but thin provisioning is allowed.

The big one here is 1:1 vCPU to core mapping. This will be a primary driver of sizing and is evidenced in all of the Cisco documentation. If you know how many physical cores are available, and you know how many vCPUs a VM takes, most of the sizing is done already!

CPU Architecture

Specific CPU architectures and speeds are listed in order to be classified as a “Full Performance CPU”. The Nutanix home page provides a list of all processors used in all node types. All Nutanix nodes except the NX-1000 series are classified as Full Performance CPUs at the time of this writing. That means the NX-1000 is not a good choice for Cisco UC, but all other platforms such as the very popular NX-3000 are a great fit.

Storage

Nutanix presents an NFS interface to the VMware Hypervisor. The Nutanix Distributed Filesystem backend is seen by VMware as a simple NFS datastore. The DocWiki page lists support for NFS under the Storage System Design Requirements section. There is also a listing under the storage hardware section. Most of the storage requirements listed apply to legacy SAN or NAS environments so aren’t directly applicable to Nutanix.

The key requirements that must be met are latency and IOPS. This is another area where the calculation differs from a traditional NAS. In a legacy NAS environment, the storage system’s performance was divided among all hosts accessing the storage. In the Nutanix environment each host accesses local storage, so no additional calculations are required as the system scales! Each node has access to the full performance of the NDFS system.

Each UC application has some rudimentary IOPS information that can be found here on the DocWiki storage site. These aren’t exact numbers and are missing some information about the type of testing that was performed to achieve these values, but they get you in the ballpark. None of the UC applications listed are disk intensive, with average utilization of less than 100 IOPS for most. This shows again that the CPU will be the primary driver of sizing.

VMware HCL

Cisco requires that any system for UC Virtualization must be on the VMware HCL and Storage HCL. Nutanix works very hard to ensure that this requirement is met, and has a dedicated page listing Nutanix on the VMware HCL.

 

With the above requirements met we can now confidently select the Nutanix platform for UC virtualization and know it will be supported by Cisco TAC. The DocWiki is an incredibly useful tool for confirming that all requirements are met. Check the Cisco DocWiki frequently, as it’s updated often!

Cisco UC OVA Files

Before we conclude let’s take a look at one more unique feature of Cisco UC and the DocWiki page.

Each Cisco UC application is installed using the combination of an OVA file and an install ISO. The OVA is required to ensure that exact CPU, RAM, and Disk sizes and reservations are followed. All Cisco OVA files can be found here on the DocWiki. Be sure to use these OVA files for each UC application and use the vCPU and RAM sizes from each OVA template to size appropriately on Nutanix. The ISO file for installation is a separate download or DVD delivery that happens on purchase.

In the next post, we’ll cover the exact sizing of Cisco UC Virtual Machines and how to fit them onto an example Nutanix block.

Nutanix and UC – Part 1: Introduction and Overview

I’ll be publishing a series of blog posts outlining Cisco Unified Communications on Nutanix. At the end of this series I hope to have addressed any potential concerns running Cisco UC and Nutanix and provided all the tools for a successful deployment. Your comments are welcome and encouraged. Let’s start at the beginning, a very good place to start.

Cisco UC Overview

Let’s start with an overview of Cisco Unified Communications just to make sure we’re all on the same page about the basics of the solution. UC is just a term used to describe all of the communications technologies that an enterprise might use to collaborate. This is really a series of different client and server technologies that might provide Voice, Video, Instant Messaging,  and Presence.

Clients use these server components to communicate with each other. They also use Gateway components to talk to the outside world. The gateway in the below image shows how we link into a phone service provider such as AT&T or Verizon to make calls to the rest of the world.

Cisco UC Overview
Cisco Unified Communications Overview

 

Each of the above components in the Cisco UC Virtual Machines provides a critical function to the clients along the bottom. In the past there may have been racks full of physical servers to accomplish these functions, but now this can be virtualized. Redundancy is still one of our NUMBER 1 concerns in a UC deployment, but scale is also important. When the phone system goes down and the CEO or CIO can’t dial into the quarterly earnings call there is huge potential for IT staff changes. Even more importantly, everyone relies on this system for Emergency 911 calls. The phone system MUST be up 100% of the time (or close to it).

Virtualization actually helps both in terms of scale AND redundancy on this front. Let’s look at each component of the UC system and see what it does for us as well as how it fits into a virtual environment.

Cisco Unified Communications Manager

Cisco Unified Communications Manager (CUCM) is the core building block of all Cisco UC environments. CUCM provides call control. All phones will register to the CUCM and all phone calls will go through the CUCM for call routing. Because the CUCM call control is such a critical function it is almost always deployed in a redundant full-mesh cluster of servers. A single cluster can support up to 40,000 users with just 11 VMs. Additional clusters can be added to scale beyond 40,000 users.

Once the size of the Cisco CUCM cluster is determined the next step is to deploy the VMs required. Each VM is deployed from an OVA which has a number of fixed values that cannot be changed. The number of vCPUs, the amount of RAM, and the size of the disks is completely determined by the Cisco OVA.

The Cisco DocWiki site lists various OVAs available to deploy a CUCM server. The size of the CUCM server OVA used depends on the number of endpoints the cluster will support.

Cisco Unity Connection

Cisco Unity Connection (CUC) provides Voice Message services, acting as the voice mailbox server for all incoming voice messages. CUC can also be used as an Interactive Voice Response server, playing a series of messages from a tree structure and branching based on user input. For redundancy, each CUC cluster is deployed as an Active/Active pair that can support up to 20,000 voice mailboxes. Scaling beyond 20,000 users is just a matter of adding clusters.

The OVA for CUC can be found on the Cisco DocWiki site. Notice that these OVAs for CUC have much larger disk sizes.

Cisco Instant Messaging & Presence

Cisco IM&P is the primary UC component that provides Presence and Instant Messaging services to Cisco Jabber endpoints. Jabber clients will register to the IM&P server for all contact list functions and IM functions. The Jabber clients ALSO connect to the CUCM server for call control and the CUC server for Voice Messaging.

IM&P servers are deployed in pairs called subclusters. Up to 3 subclusters (6 IM&P servers total) can be paired with a single CUCM cluster supporting up to 45,000 Jabber clients. The OVA templates for IM&P can be found on the DocWiki site. Each IM&P cluster is tied to a CUCM cluster. Adding more IM&P clusters will also mean adding more CUCM clusters.

Cisco Emergency Responder

911 emergency calls using a VoIP service often fall under special state laws requiring the exact location of the emergency call to be sent to the Public Safety Answering Point (PSAP). The 911 operator needs this location to dispatch appropriate emergency services. VoIP makes this more complex because the concept of a phone now encompasses laptops and phones with wireless roaming capabilities, which are often changing locations.

Cisco Emergency Responder (CER) is deployed in pairs of VMs (primary and secondary) to provide the emergency location to the PSAP when a 911 call is placed. CER uses SNMP discovery of switch ports, IP subnet-based discovery, or user-provided locations to determine the location it sends to the PSAP.

OVAs for CER can be found on the Cisco DocWiki.

Additional voice server components can be found on the DocWiki page. They follow a similar convention of describing the number of vCPUs, RAM, and Disk requirements for a specific platform size.

We’ll talk more about these individual components in the next part of this series, but for now it’s enough to just understand that each of these services will be provisioned from an OVA as a VM on top of VMware ESXi.

Nutanix Overview

Nutanix has been covered in great detail by Steven Poitras over at the Nutanix Bible. I won’t repeat all of the work Steve did because I’m sure I wouldn’t do it justice. I will however steal a few images and give a brief summary. For more info please head over to Steve’s page.

The first image is the most important for understanding what makes Nutanix so powerful. Below we see that the Nutanix Controller Virtual Machine (CVM) has direct control of the attached disks (SSD and HDD). The Hypervisor talks directly to the Nutanix CVM processes for all disk IO using NFS in the case of VMware ESXi. This allows Nutanix to abstract the storage layer and do some pretty cool things with it.

The Hypervisor could be VMware ESXi, Microsoft Hyper-V, or Linux KVM. We’ll focus on ESXi here because Cisco UC requires VMware ESXi for virtualization.

Nutanix Node Detail
Nutanix Node Detail

The great thing is that to User Virtual Machines such as Cisco Unified Communications this looks exactly like ANY OTHER virtual environment with network storage. There is no special work required to get a VM running on Nutanix. The same familiar hypervisor you know and love presents storage to the VMs.

Now we have the game changer up next. Because the CVM has control of the Direct Attached Storage, and because the CVM runs on every single ESXi host, we can easily scale out our storage layer by just adding nodes.

Nutanix CVM NDFS Scale
Nutanix CVM NDFS Scale

Each Hypervisor knows NOTHING about the physical disks, and believes that the entire storage pool is available for use. The CVM optimizes all data access and stores data locally in flash and memory for fast access. Data that is less frequently accessed can be moved to cold tier storage on spinning disks. Once local disks are exhausted the CVM has the ability to write to any other node in the Nutanix cluster. All writes are written once locally, and once on a remote node for redundancy.

Because all writes and reads happen locally, we can scale out while preserving performance.

Nutanix Distributed Filesystem requires at least 3 nodes to form a cluster. Lucky for us the most common “block” comes with space for 4 “nodes”. Here’s an inside view of the 4 nodes that make up the most common Nutanix block. The only shared components between the 4 nodes are the redundant power supplies (2). Each node has access to its own disks and 10GbE network ports.

Back of Nutanix Block
Back of Nutanix Block

Additional nodes can easily be added to the cluster, one to four at a time, using an auto-discovery process.

Up Next – Cisco Requirements for Virtualization

Now that I’ve been at Nutanix for a few months I’ve had a chance to really wrap my head around the technology. I’ve been working on lab testing, customer sizing exercises, and documentation of UC Best Practices on Nutanix. One of the most amazing things is how well UC runs on Nutanix and how frictionless the setup is.

I had to do a lot of work to document all of the individual Cisco UC requirements for virtualization, but with that exercise completed, the actual technology runs extremely well.

In the next blog post I’ll cover all of the special requirements that Cisco enforces on a non-Cisco hardware platform such as Nutanix. I’ll cover exactly how Nutanix meets these requirements.

Nutanix and Unified Communications

The past week has been a whirlwind of studying, research, and introductions now that I’ve started at Nutanix! I’m happy to be on the team working on Reference Architectures for Unified Communications.

Nutanix Logo

I’m planning to investigate the major Unified Communications platforms (VoIP, Voice Messaging, IM & Presence, E911) from the top vendors and come up with Best Practices for deployment on Nutanix. This is a hot opportunity because customers are excited about Nutanix and have real need for Unified Communications.

The savings and consolidation that Nutanix can bring to other areas in the data center can also be applied to Unified Communications. Imagine ditching all of your SAN or NAS storage and deploying on a hyper-converged solution that utilizes the on-box storage of every node in the cluster to its full potential. Imagine scaling up the size of your cluster by simply adding new nodes and not worrying about the storage.

With my past Cisco CCIE experience I’ll be tackling these technologies first, but I’m also planning on working on Microsoft Lync and Avaya Aura. To me this seems like the key area of opportunity at Nutanix, proving that any workload can run successfully on our systems.

Cisco has a great resource in the DocWiki pages that identify how to design and deploy Cisco UC in a virtual environment. I’m getting started there and hope to have a Best Practices guide (including sample cluster builds) put together by the end of November. After reading through all of the requirements and restrictions on the Cisco DocWiki site I’m confident Nutanix and Cisco UC will be successful!

What Unified Communications platform is YOUR company using? Is it virtualized? How much of your cost is in the SAN?

 

I’m looking forward to your comments. Keep an eye on this space and the Nutanix website for releases of our Best Practice documents in the future.

OpenID Connect

At work I’ve been doing a ton of Single Sign On, SAML, and certificate-based authentication. I wanted to try that out for my own personal use. It turned out to be much easier than I expected.

I’ve already blogged here about updating my site certificates using StartCom SSL. Another free service they offer is an OpenID Connect certificate.

The process is actually pretty straightforward.

To get started with StartCom in the first place you have to download a client-side certificate into your web browser. This is a file that you must keep on your computer and must protect. When your web browser connects to StartCom services it presents this certificate and says “Here I am”. Since you should be the only person with that certificate, StartCom can say “OK, come on in”.

This is nice because you don’t have to remember a username and password. The downside is that you have to keep this certificate handy to load it onto each machine you connect from. An encrypted USB key can be handy for this. You have to figure out how to make your Operating System and Browser combination present this certificate as an identity cert. Often this is an advanced setting in the browser that allows you to import an Identity Certificate.

OpenID Connect takes this to the next step.

StartCom knows who I am and knows my certificate. As a provider of web services (this WordPress site at bbbburns.com), I can decide to allow in certain users that an OpenID provider has authenticated. I downloaded the WordPress OpenID plugin and tied my ID of bbbburns.startssl.com to my WordPress account.

When I log in to my own WordPress site now, I can just type in “bbbburns.startssl.com” as the user and hit Login. The site redirects me to startssl.com for authentication using my client certificate. If successful, I get redirected back to bbbburns.com with an authentication assertion. Since bbbburns.startssl.com is tied to my WordPress account on the server, I’m automatically logged in as that user.

The setup for all of this took just a few minutes!