antipaucity

fighting the lack of good ideas

but, i got them on sale!

Back in August 2008, I had a one-week “quick start” professional services engagement in Nutley, New Jersey. It was supposed to be a super simple week: install HP Server Automation at BT Global.

Another ProServe engineer was onsite to set up HP Network Automation.

Life was gonna be easy-peasy – the only deliverable was to set up and verify a vanilla HPSA installation.

Except, like every Professional Services engagement in history, all was not as it seemed.

First monkey wrench: our primary technical contact / champion was an old-hat Sun Solaris fan (to the near-exclusion of any other OS for any purpose – he even wanted to run SunOS on his laptop).

Second monkey wrench: expanding on the first, our technical contact was super excited about the servers he’d gotten just the weekend before from Sun because they were “on sale”.

It’s time for a short background digression. Because technical intricacies matter.

HP Server Automation was written on Red Hat Linux. It worked great on RHEL. But, due to some [large] customer requests, it also supported running on Sun Solaris.

In late 2005, Sun introduced a novel architecture dubbed “Niagara”, aka the UltraSPARC T1, which they offered in their T1000 and T2000 series servers. Niagara did several clever things – it ran multiple hardware threads per core, with as many as 32 threads (8 cores × 4 threads each) executing simultaneously.

According to AnandTech, the UltraSPARC T1 was a “72 W, 1.2 GHz chip almost 3 times (in SpecWeb2005) as fast as four Xeon cores at 2.8 GHz”.

But there is always a tradeoff. The tradeoff Sun chose for the first CPU in the product line was to share a single FPU (floating point unit) between the integer cores and pipelines. For workloads that mostly involve static / simple data (ie, not much in the way of calculation), they were blazingly fast.

But sharing an FPU brings problems when you actually need to do floating-point math – which cryptographic algorithms and protocols all end up relying on to gather entropy for their random value generation. Why does this matter? Well, in the case of HPSA, not only is all interprocess, intraserver, and interserver communication secured with HTTPS certificates, but because large swaths of the product are written in Java, each JVM needs to emulate its own FPU – so not only is the single FPU shared between all of the integer cores of the T1 CPU, it is further time-sliced and shared amongst every JRE instance.

At the time, the “standard” reboot time for a server running in an SA Core was generally benchmarked at ~15-20 minutes. That time encompassed all of the following (sketched in shell form below the list):

  • stop all SA processes (in the proper order)
  • stop Oracle
  • restart the server
  • start Oracle
  • start all SA components (in the proper order)
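In shell terms, the full cycle looks roughly like the sketch below. The opsware-sas init script is the real control point (it sequences the SA components itself); the Oracle commands, and the assumption that everything lives on a single host, are placeholders for whatever your site actually uses:

# stop all SA components (the init script handles the proper ordering)
/etc/init.d/opsware-sas stop

# stop Oracle – listener first, then the database (commands/paths vary per site)
su - oracle -c "lsnrctl stop"
echo "shutdown immediate" | su - oracle -c "sqlplus / as sysdba"

# restart the server
reboot

# ...then, once the host is back up:
su - oracle -c "lsnrctl start"
echo "startup" | su - oracle -c "sqlplus / as sysdba"

# start all SA components (ordering again handled by the init script)
/etc/init.d/opsware-sas start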

As you’ll recall from my article on the Sun JRE 1.4.x from 6.5 years ago, there is a Java component (the Twist) that already takes a long time to start as it seeds its entropy pool.

So when it is sharing the single FPU not only with other JVMs, but with every other process which might end up needing it, the total start time balloons dramatically.

How dramatically? Shutdown alone was taking upwards of 20 minutes. Startup was north of 35 minutes.

That’s right – instead of ~15-20 minutes for a full restart cycle, if you ran HPSA on a T1-powered server, you were looking at ~60+ minutes to restart.

Full restarts, while not incredibly common, are not all that unusual, either.

At the time, it was not unusual to want to fully restart an HPSA Core 2-3 times per month. And during initial installation and configuration, restarts need to happen 4-5 times in addition to the number of times various components are restarted during installation as configuration files are updated, new processes and services are started, etc.

What should have been about a one-day setup, with 2-3 days of knowledge transfer – turned into nearly 3 days just to install and initially configure the software.

And why were we stuck on this “revolutionary” hardware? Because of what I noted earlier: our main technical contact was a die-hard Solaris fanboi who’d gotten these servers “on sale” (because their Sun rep “liked them”).

How big a “sale” did he get? Well, his sales rep told him they were getting these last-model-year boxes for 20% off list plus an additional 15% off! That sounds pretty good – depending on how you do the math, he was getting somewhere between 32% and 35% off the list price – for a little over $14,000 apiece (they’d bought two servers – one to run Oracle RDBMS (which Oracle themselves recommended not running on the T1 CPU family), and the other to run HPSA proper).
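For the curious, the math: taking 20% off and then another 15% off compounds to 0.80 × 0.85 = 0.68 of list – ie 32% off – while simply adding the two discounts gives 35%; hence the range.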

Except his sales rep lied. Flat-out lied. How do I know? Because I used Sun’s own server configurator site and was able to configure two identical servers for just a smidge over $15,000 each – with no discounts. That means they got about 7% off list (~$14,000 paid against ~$15,000 list) … tops.

So not only were they running hardware barely discounted off list (and, interestingly, only slightly cheaper – by less than $2000 – than the next-generation T2-powered servers, which had an FPU per core rather than per CPU, and which still had some performance issues but at least weren’t dog-vomit slow), but they were running it on Solaris – which had always been a second-class citizen when it came to HPSA performance: all things being roughly equal, x86 hardware running RHEL would always smack the pants off SPARC hardware running Solaris under Server Automation.

For kicks, I configured a pair of servers from Dell (because their online server configurator worked a lot better than any other I knew of, and because I wanted to demonstrate that just because SA was an HP product didn’t mean you had to run HP servers), and was able to spec two x86 servers that massively outclassed the Sun boxes – more CPU cores, more RAM, more storage, etc – for less than $14,000 a pop, and presented my findings as part of our write-up for the week.

Also for kicks, I demoed SA running in a 2-CPU, 4GB VM on my laptop, rebooting faster than either of the T1000 servers they had purchased.

What’s the moral of this story? There are two (at least):

  1. Always always always find out from your vendor if they have a preferred or suggested architecture before namby-pamby buying hardware from your favorite sales rep, and
  2. Be ever ready and willing to kick your preconceived notions to the sidelines when presented with evidence that they are not merely ill thought out, but out and out, objectively wrong

These are fundamental tenets of automation:

“Too many people try to take new tools and make them fit their current processes, procedures, and policies – rather than seeing what policies, procedures, and processes are either made redundant by the new tools, or can be improved, shortened, or – wait for it – automated!”

You must always be reviewing and rethinking your preconceived notions, what policies you’re currently following, etc. As I heard recently, you need to reverse your benchmarks: don’t ask, “why are we doing X?”; ask, “what would happen if we didn’t do X?”

That was a question never asked by anyone prior to our arrival to implement what sales had sold them.

a smart[ish] dhcpd

After running into some wacky networking issues at a recent customer engagement, I had a brainstorm about a smart[ish] DHCPd server that could work in conjunction with DNS and static IP assignment to more intelligently fill subnet space.

Here’s the scenario we had:

The lab network space was fairly heavily populated with statically-assigned addresses – in a /23 network (ie ~510 usable addresses on the subnet), about 420 addresses were in use.

Not all statically-assigned IPs were registered in DNS.

The in-use addresses did not leave much contiguous, unused space (little groups of 2 or 4 addresses open – not ~80 in a row, or even a couple of small batches of 20-30).

DNS was running on a Windows 2012 host.

DHCPd (ISC’s) was set up on a RHEL 5 x64 Linux machine.

The problem with using the ISC DHCPd server, as supplied by HPSA, is that while you can configure multiple subnets to hand out addresses on, you cannot configure multiple ranges on a single subnet. So we were unable to effectively utilize all the little gaps in assigned addresses.

Maybe this is something DNS/DHCP can do from a Windows DC, but I have an idea for how DHCPd could work a little smarter:

  • give a very large range on a given subnet (perhaps all but the gateway and broadcast addresses)
  • before handing an address out, in addition to checking the leases file for if it is free, check against DNS to see if it is in use
  • if an address is in use because it is static, update the leases file with the statically-assigned information as if it were assigned dynamically – but give it an unusually-long lease time (eg 1 month instead of 4 hours)
  • on a periodic basis (perhaps once an hour, day, week – it should be configurable), scan the whole subnet for in-use addresses (via something like nmap and checking against DNS)
    • remove all lease file entries for unused/available IPs
    • update lease file entries for used/unavailable IPs, if not already recorded

This would have the advantage of intelligently filling address gaps on a given subnet, and require less interaction between teams that want/need to be able to use DHCP and those that need/want static addresses.
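Here’s a rough sketch, in shell, of what the periodic reconcile pass could look like. The subnet, DNS server address, and file paths are placeholders, and actually feeding the results back into dhcpd’s leases file (or excluding them from the dynamic range) is exactly the part I’m wishing dhcpd did natively:

#!/bin/bash
# sketch only – assumes nmap and dig are installed
SUBNET="192.168.0.0/23"      # placeholder for the lab's /23
DNSSERVER="192.168.0.10"     # placeholder for the Windows DNS host

# addresses that answer on the wire right now (ping scan, no port scan)
nmap -sP -n "$SUBNET" -oG - | awk '/Up$/ {print $2}' | sort -u > /tmp/live.txt

# addresses that have a PTR record in DNS, whether or not they answered
: > /tmp/in-dns.txt
for ip in $(nmap -sL -n "$SUBNET" -oG - | awk '/Host:/ {print $2}'); do
    if dig +short -x "$ip" @"$DNSSERVER" | grep -q .; then
        echo "$ip" >> /tmp/in-dns.txt
    fi
done

# the union is the set that should get an unusually-long pseudo-lease
# (or otherwise be kept out of the dynamic pool)
sort -u /tmp/live.txt /tmp/in-dns.txt > /tmp/reserve-these.txt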

Or maybe what I’m describing has already been solved, and I just don’t know how to find it.

defaulting pxe boots with hpsa 10.0

In follow-up to my last post, which itself was a commentary on an earlier topic, here are the additional steps you need to perform the previous procedure (which is to edit /opt/opsware/boot/tftpboot/pxelinux.cfg/default):

/etc/init.d/opsware-sas stop smartboot

Edit /opt/opsware/boot/tftpboot/pxelinux.cfg/default.
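For reference, the piece you’re after is the default/timeout stanza near the top of that file – the label name below is illustrative, not the exact entry HPSA ships (and if the file drives its menu through menu.c32, the directive you want may be ONTIMEOUT instead):

# /opt/opsware/boot/tftpboot/pxelinux.cfg/default (excerpt – label illustrative)
default winpe64       # the menu label to boot if nothing is chosen
timeout 100           # tenths of a second to wait before booting the default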

/etc/init.d/opsware-sas start smartboot

pxe works differently in hpsa 10.x

2 years ago I wrote up how you can change the default choice for the PXE menu in HP Server Automation. Found out this week that those instructions are not valid if you are running 10.0 (released this past summer).

HP changed how they present their PXE menu with 10.x, and I have filed an RFE (on 18 Dec 2013) with them to get this fixed back to how it was (or provide a solid alternative).

This is one of the few times I’ve ever seen a vendor remove functionality from a product (at least, remove it without providing an alternative).

Thanks, HP 😐

call

I learned about the call command in Windows recently.

Some context – I was trying to run a command via HPSA at a customer site, but kept getting an error that the program was not a recognized internal or external command.

Very frustrating.

Then one of the guys I worked with suggested adding a “call” to the front of my script. That worked like a champ. Here’s why.

When the HPSA Agent on a Managed Server receives a script to run from the Core, it runs it in a headless terminal session. This means that while environment variables (eg %ProgramFiles%) expand properly, if the first part of the command is NOT a built-in from cmd.exe, it won’t execute. Unlike *nix, which is designed to run most things headless, Windows never was (and still isn’t, as of Win2k8R2).

The built-in command ‘call‘ forks the next command to a full session (albeit still headless), and enables cmd.exe to run it properly.
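A minimal illustration of the fix, as it would look in a script pushed from the Core – the program path and switch here are made-up placeholders:

:: fails when run in the Agent's headless session – reported as not a
:: recognized internal or external command
"%ProgramFiles%\SomeVendor\sometool.exe" /run

:: works – prefixing the cmd.exe built-in 'call' lets it execute properly
call "%ProgramFiles%\SomeVendor\sometool.exe" /run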

Now you know.

certifications and dependencies

Last week I participated in a beta class for HP’s new Cloud Service Automation 3.0 product release (ok, so it’s a prerelease, and “product” is a strong term). 3.0 is a full rewrite from 2.x, so there is no upgrade path. Also, not everything that “appears” to be in place OOB is actually working – and there is no way to grey-out options that are unavailable.

We were told this should be addressed in a patch sometime in the next 6 months. Yay us. Oh, and did I mention I’m involved in a project to implement this currently? Woot!

After taking this class, I found out that a prerequisite for the class is some Operations Orchestration training from HP – without which HP will not certify I took the class. Right. So, I have to take those classes via HP University over the next couple weeks so that by the time the CSA 3 class is “live” next month I can be officially-verified as having taken it.

And, if I’m going to take those classes, I might as well also go for the certification from HP to add to my CV 🙂

Also by about my birthday, I will be taking the VCP week-long class and test to learn and be certified on VMware’s vCenter, vSphere, and ESXi product lines from an architectural and implementation standpoint.

These next several weeks are going to be a blast 🙂

automation

I have been deeply involved in data center management and automation for well over 5 years.

Most companies still view automation the Wrong Way™, though – and it’s a hard mindset to change. Automation is NOT about reducing your headcount, or reducing hiring.

Automation is used to:

  • improve the efficiency of business tasks
  • improve employee productivity
  • reduce human error
  • ensure consistency, and auditability
  • improve/ensure repeatability
  • replace “fire fighting” with planning and proactivity
  • ensure an organization can pass the bus test (which disturbingly-few can)
  • free engineers to work on interesting, engineering problems – not day-to-day busywork

Cringely has an article on this topic this week, entitled “An IT labor economics lesson from Memphis for IBM”.

How can a company 1/100,000th the size of IBM afford to have monitoring?  Well, it seems DBADirect has its own monitoring tools and they are included as part of their service.  It allows them to do a consistently good job with less labor.  DBADirect does not need to use the cheapest offshore labor to be competitive.  They’ve done what manufacturing companies have been doing for 100+ years – automating!

Even today IBM is still in its billable hours mindset.  The more bodies it takes to do a job the better.  It views monitoring and automation tools as being a value added, extra cost option.  It has not occurred to them you could create a better, more profitable service with more tools and fewer people.  When you have good tools, the cost of the labor becomes less important.

Any company that fails to realize that throwing more people at the problem is rarely the answer (something former IBMer Fred Brooks wrote about as a post-mortem of the OS/360 project in The Mythical Man-Month) is doomed to fail – consistently, and tragically.

And yet IBM is still in the mindset of the 1960s and raw, manual labor in an increasingly-connected, -compliant, -complex, and -cloudy world. They are still trying to solve problems the Risk way – throw a gob o’ guys at the problem, and roll over your opponents through sheer numbers.

In many ways, it is sad to see the demise of once-great companies like IBM. There’s the loss of competition, the passing of the Old Guard, etc.

But it’s also a huge opportunity for new businesses to come in, compete, and clean up in sectors the Big Guys can’t (or won’t) touch well.