Archive for the ‘technical’ Category

how intel could save itanium

Monday, September 19th, 2011

It’s not news that the Itanium processor is doomed in its current incarnations. Microsoft has dropped support, as has Red Hat – meaning only HP supports it with an active platform in the US (with HP-UX).

The Itanium should have been Intel’s chance to walk away with the processor market – but they blew it by breaking x86 compatibility outright, and not in a good way: their bungle allowed AMD to develop the x86-64 extensions that are now ubiquitous – even on Intel’s own processors (though Intel calls them EM64T, or more recently Intel 64).

With the multi-core industry in full swing, Intel has a chance to make the Itanium relevant once again – package it alongside x86-64 cores in a new CPU, à la IBM’s Cell processor.

With Microsoft now looking to branch Windows onto ARM with Windows 8, the x86 architecture is no longer the only [major] game in town.

If Intel were to co-package the Itanium and x86 cores on one chip, or at least in one package, they could start to reap the benefits of both worlds – keep the high-end, “mission critical” market they have that lives on HP-UX (though “mission critical” and “HP-UX” don’t jibe in my head), and bring into that realm the x86 world.

Hey, it’s a thought.

ogsh/ogfs for fun and profit

Saturday, September 17th, 2011

The absolute coolest feature of HP’s Server Automation suite is the OGSH (or OGFS) – the Opsware Global SHell (or FileSystem).

I worked for Opsware before HP acquired them, and the OGSH was a new feature of the product (then called Opsware SAS (Server Automation System)). It’s a FUSE module that gives a [limited] bash interface to the managed environment by presenting a live query/view into the database, and, ultimately, allowing manipulation of managed servers in the environment.

For example, to access a list of all managed servers, you log in to the global shell, then run:

cd /opsw/Server/@

The ‘@’ sign is used to indicate you are “there” – at the limit of that particular filter (in this case, “Server”).

Since it’s bash, you can run most common *nix utilities and commands. But the one that’s most handy, in my opinion, is rosh – the Remote Opsware SHell.

Remote shell opens an authenticated, logged session to a remote machine (*nix or Windows – doesn’t matter), based on your user’s/group’s permissions. For testing purposes, I always configure one group (and add myself) that can connect using root for *nix machines (and Administrator on Windows).

The basic command to connect to a machine is:

rosh -l [username] -n [machine]

You can also pass commands to rosh as if it were an ssh session:

rosh -l [username] -n [machine] '[command]'

For the fullest power of rosh, though, use it in a script or loop. For example:

for sn in *; do rosh -l root -n "$sn" 'uptime ; uname -a'; done

That will remote shell into every server in the current view, using standard shell expansion of the splat (*), and run uptime and uname -a, printing the results to screen. That particular command is handy for quick-and-dirty reports on the managed environment to see

  • which servers are up, and which aren’t
  • how long they’ve been up

In addition to rosh, global shell exposes nearly all of the SA API (which is also accessible via Java, web services, and Python, using the “PyTwist” bindings written to access the Java interfaces).

the ticket smash, raw metrics, and communication – how to have a successful support organization

Thursday, September 15th, 2011

When I worked at Opsware, and for a while after HP bought us, we used to try to have once- or twice-a-week meetings for each support group wherein we would bring our most difficult cases (with the difficulty being determined by the case owner), and have an opportunity for everyone on the team to ask questions, contribute, and maybe even solve the problem our customer was having.

Novel idea, isn’t it? The typical Support team is driven by stats – the number of tickets in their queue, age of the ticket, number solved/closed, number escalated, etc. Support is driven by these numbers because managers don’t think of any better way to do it.

All things being equal, if you can close 40 cases in a week, that’s a lot better than your podmate who “only” finished-out 12. But what about the complexity of each of those cases? And how much effort did each engineer put into them? Did the customer come back and ask for it to be closed because it’s either no longer an issue, or they solved it themselves? Is it a question that can be answered with a reference to a specific page/section of a manual? Or was it a problem that took multiple webex engagements, and dozens of contacts back and forth to find a solution because it was a deep bug?

Theoretically, the goal of “support” is to, well, support – get the problem reporter a solution of some kind they can use. That solution may be a bug fix, an RFE, a reference to a tutorial, a reconfiguration, or a workaround / alternative approach to their problem. A big problem with this setup is that the reporter rarely asks the right question. They ask about what they have already decided the problem is – and by biasing their initial report, they can often end up dragging out the solution process far longer than it should take. I recently wrote a guide on creating effective support tickets, based on my experience working in support, and interacting with various support organizations both before and since.

Reporter bias is the hardest issue to overcome, in my opinion; engineer bias is easier to get past because (hopefully) there are folks you can bounce the problem off of in the team who can help narrow-down the problem and find a solution … or at least figure out where to try looking next.

Communication is the key to solving problems – when I was at Opsware we utilized internal IRC channels and (gasp!) talking with each other to try to find solutions to customer issues. We also spent a lot of time wording inquiries to the reporter to try to gain as much information as possible on each iteration of the communication process.

Another key to solving problems was to make records of cases with the following:

  • initial reported behavior (or lack thereof)
  • actual problem
  • solution

Those records were sometimes on wiki pages, sometimes in our Plone internal KB, and sometimes got “promoted” out to the customer-facing KB. All of these approaches helped us get problems solved faster – either by offloading the “work” to the customer (via a KB reference), or by being able to apply previous answers more quickly when new-but-similar/identical problems were reported.

The end goal of a support team is not to outdo one another on how many cases one engineer has in his queue, or how many another has closed – the end goal is to solve customer problems. “Works well in a team setting” is a qualification typically associated with support engineering employment listings – but all too often that gets reduced to a cliche that practically means “tries to outdo his cubemates by closing more cases than the next guy”.

I’m as much a fan of personal responsibility and action as the next red-blooded capitalist, so don’t take this next section to imply I’m promoting communalism.

The way a support team should work is the way [good] sports teams work, or the way a NASCAR team operates: yeah, it’s the driver of the car who gets the “glory”, but without his pit and maintenance crew, he’d be no better than you or me going to the grocery store. Any given support engineer gets to have his name tagged to the case for posterity – both with the good things he did, and the not so good ones. But since the goal is really to get the customer’s problem addressed, the ego of the engineer needs to be removed from the equation.

Bob Smith might be “the guy” who informed his customer of a solution, but generating the solution involved the other 7 people in his office. He gets the “fame” from Universal Widgets LLC, but he was just one of the [important] cogs in the process of resolving the issue.

The number of cases Bob has in his queue should have [almost] ZERO correlation to his skill as a technical engineer: it’s the 7 people behind him whom he can ask and brainstorm with that get the job done.

Maybe Bob gets to handle most of the “customer” action, but the other 7 are writing bug reports, solutions articles, etc. When evaluating that team, management needs to do just that: evaluate the team first, and the individuals second.

bglug meeting – 17 september – topic: data center automation

Wednesday, September 14th, 2011

The September meeting of the Bluegrass Linux User Group will be this Saturday, 17 Sep.

We’ll be meeting at Collexion’s facilities in Lexington at 2:30p.

I will be presenting on data center automation, specifically on HP’s Server Automation platform (the tool I use on my day job).

Some [limited] history of HPSA is available on the Opsware wikipedia page.

We’ll also briefly touch on some of the OSS alternatives to a full-blown environment like HPSA, such as:

debugging authorized_keys and ssh

Tuesday, September 13th, 2011

I saw an interesting question this morning on ServerFault, entitled “SSH Prompts for password even though private keys are available, presented to server and known to it”.

  • when my user is not already connected to the server (first ssh connexion), it prompts for password even though privates keys are availiable (PuTTY + Pagent). After that first connection, if I open a secondary or a third connection it gets connected with the keys.
  • If I close all connections and open a new one it prompts for the password.
  • If I have let say 4 open connections and I close the first one (the one that prompted for the password), the fifth connection will be opened with the keys

Now that is an interesting problem. The answer supplied, with its follow-on comments, was also interesting, but the process behind solving this is even more fascinating, I think.

The issue is that password-less logins should work. sshd_config has been set properly, and there is a set of matching keys in authorized_keys.

But it doesn’t work, obviously – or there’d be no question raised.

A list of items to look into, both from the supplied answer, and from my own thoughts (somebody else beat me to an answer):

  • permissions on .ssh/authorized_keys (should be 600; with StrictModes on, sshd ignores key files that are group- or world-writable)
  • verify sshd has been started/restarted post changes to sshd_config
  • check to see if home directory is remotely mounted / mounted on demand
  • check to see if key has a passphrase in use
  • look at /var/log/auth.log for errors
  • check to see if the home directory is encrypted (actual answer)
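
The permissions and home-directory items on that list can be checked in one pass. The function below is a rough sketch, not a complete diagnostic: check_ssh_perms is a made-up name, the 700/600 targets are the commonly recommended modes rather than anything stated in the ServerFault thread, and it assumes GNU stat (the -c '%a' form).

```shell
# Hypothetical helper: check_ssh_perms HOME_DIR
# Prints one line per path that is missing or whose mode differs from the
# usual recommendation (700 for .ssh, 600 for authorized_keys).
# Assumes GNU stat; BSD/macOS stat takes different options.
check_ssh_perms() {
  local home=$1 spec path want mode status=0
  for spec in "$home/.ssh:700" "$home/.ssh/authorized_keys:600"; do
    path=${spec%:*}    # everything before the last colon
    want=${spec##*:}   # everything after the last colon
    if [ ! -e "$path" ]; then
      echo "MISSING $path"
      status=1
      continue
    fi
    mode=$(stat -c '%a' "$path")
    if [ "$mode" != "$want" ]; then
      echo "MODE $path is $mode, expected $want"
      status=1
    fi
  done
  return $status
}
```

Run as root while the affected user is logged out, a MISSING line would point toward a home directory that is unmounted or still encrypted at that moment.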

Debugging is something I have written about recently – it seems to come up over and over in my line of work.

It’s a skill that’s vital to have in the IT world, and yet an awful lot of folks do not have it.


The answer, for those interested:

It sounds like, for whatever reason, the user’s home directory is not available if the user is not logged in, so that sshd can’t find the authorized_keys file

The user’s home directory must be using ecrypt or something like that

that’d be the cause, then, since sshd can’t decrypt the contents of the home directory

Ubuntu Desktop asks if you want to encrypt the home directory (why not?) without mentioning what it may do to ssh… a simple “note: this will effect SSH…” would be helpful

assessment and capacity analysis and planning for virtualization initiatives

Thursday, August 25th, 2011

Q:

What would need to go into an assessment tool for a virtualization initiative?

A:

Typical factors may include:

  • current CPU load per server
  • what’s running on each server
  • current hardware of each server
  • expected percentage increase in usage
  • OS usage – homogeneous or heterogeneous
  • new hardware or re-use current hardware
  • storage needs
  • vendor for virtualization (VMware, Microsoft, Xen)

And don’t forget the all-important:

  • BUDGET

My experience is all with VMware, but what I’ve seen and used in the past is the following:

  • look at all CPU utilizations currently
  • sum the average and peak percentages in two separate columns
  • plan for ~10% overhead from your hypervisor of choice
  • for every 40% of ‘average’ or 80% of ‘peak’, use one server of the type you now consider “high-end” (i.e., if you have a total of 687% of ‘peak’, that is 687 / 80 ≈ 8.6, which rounds up to 9 physical servers running your hypervisor of choice)
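
The arithmetic behind that rule of thumb is a pair of round-up divisions. In the sketch below the function names are made up, and taking the larger of the two results is my reading of “40% of average or 80% of peak”; the ~10% hypervisor overhead is left out because the 687% example does not appear to include it.

```shell
# ceil_div A B: integer ceiling of A / B
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# Hypothetical helper: servers_needed AVG_TOTAL PEAK_TOTAL
# Totals are the summed CPU percentages across all current servers.
# One host per 40% of average or per 80% of peak, whichever needs more hosts
# (taking the maximum is an assumption, not something the rule states).
servers_needed() {
  local by_avg by_peak
  by_avg=$(ceil_div "$1" 40)
  by_peak=$(ceil_div "$2" 80)
  if [ "$by_avg" -gt "$by_peak" ]; then
    echo "$by_avg"
  else
    echo "$by_peak"
  fi
}

# 687 is the peak total from the example; 300 is a made-up average total.
servers_needed 300 687
```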

Other thoughts:

  • I like to plan for 1 full spare physical server per ~6, so that I can utilize VMware’s vMotion for migrating servers around
  • plan for buying/utilizing SAN storage of some form so your VMs can be moved to different physical servers easily

I originally answered this topic ~2 years ago on serverfault.com.

new connexions collection available

Friday, August 19th, 2011

I have been working on my Connexions submissions again recently, and have a collection ready for use (it will be growing as time goes on): “Debugging and Supporting Software Systems”.

I realize there are some small typos in the current text, but I will be addressing those in an upcoming revision :)

I’d love to get feedback from anyone on how it could be improved/expanded.

lightsquared attacking gps manufacturers

Saturday, August 13th, 2011

The LightSquared situation keeps getting more interesting. InfoWorld has another story on them attacking GPS manufacturers for not being more careful about filtering adjacent frequency bands (per a DoD recommendation from 2008).

LightSquared is at loggerheads with makers and users of GPS (Global Positioning System) over interference between the navigation system and its planned cellular LTE (Long-Term Evolution) network. That network would transmit on frequencies close to those used for GPS. The company has long argued that makers of GPS equipment are to blame for the interference because they don’t use strong enough filters to keep their receivers from searching for signals in LightSquared’s bands. But this is the first time LightSquared has accused the vendors of flouting a specific rule.

The DoD’s GPS Standard Positioning Service Performance Standard called for GPS receivers to filter out transmissions on frequencies adjacent to the GPS band, LightSquared told the FCC in a filing related to the agency’s ongoing consideration of the company’s network proposal. The standard, issued in September 2008, recommends that receivers reject all transmissions on frequencies that are more than 4MHz outside the GPS band, said Jeffrey Carlisle, LightSquared’s executive vice president for regulatory affairs and public policy. That 4MHz buffer is essentially a “guard band” to protect operations on either side, he said.

LightSquared plans eventually to use frequencies adjacent to the GPS band for its LTE network, but after mandatory tests earlier this year showed strong interference in that area, the company said it would start out in a slightly lower-frequency block.

Here’s something that’s a little disturbing, though:

There is no mandatory standard for filtering in GPS receivers, and the FCC does not certify the devices for this

And here:

In addition to the DoD recommendation, the International Telecommunication Union, a United Nations agency, has also warned since 2000 that stronger filtering might be necessary to protect GPS from nearby transmissions

The ‘Coalition to Save Our GPS’ had the following to say:

“GPS receivers incorporate filters that reject transmissions in adjacent bands that are hundreds of millions of times more powerful than those of GPS. What LightSquared is proposing, however, is to transmit signals that are at least one billion times more powerful,” the group said in a statement. “There has never been, nor will there ever be, a filter that can block out signals in an immediately adjacent frequency band that are so much more powerful, nor has LightSquared put forward any credible, independent expert opinion or other evidence that this is possible.”

I’m no expert, but “hundreds of millions” is distinctly not far-off from “one billion” (since one billion is equal to ten hundred million). I also acknowledge not having much domain expertise in radio signals, transmission, etc – but what LightSquared is looking to do seems a lot more useful than worrying about some poorly-built GPS receivers.

The FCC said earlier this week that it would not allow the LTE service to launch unless the interference issue was resolved.

LightSquared has said it is confident the plan will be approved next month.

the fcc decides to intervene on lightsquared

Wednesday, August 10th, 2011

I’ve taken an interest in LightSquared recently.

Today InfoWorld reports that the FCC “won’t allow LightSquared’s proposed mobile broadband service to interfere with GPS signals, even though the potential interference would be caused by GPS receivers picking up signals outside of their designated spectrum”.

So, the devices are in error, but the FCC is going to prevent LightSquared from interfering?

Sounds like the FCC should be going after the receiver manufacturers to ensure their systems don’t bleed over, rather than after a company not operating on GPS spectrum.

Wait, I forgot: that’d be too logical for a government agency :|

why technical intricacies matter

Monday, August 8th, 2011

I have been working on an upgrade for one of our customers for nearly a month.

Last week we spent about two hours focused on one specific problem that had been rearing its ugly head on an exceedingly-frequent basis: one of the components of the application was routinely pitching OutOfMemory errors from the Java Virtual Machine (jvm). The errors were actually being returned from WebLogic (currently an Oracle product; previously from BEA).

Much googling of the error messages returned the following Sun bug:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4697804, and the workaround:
Disable VM heap resizing by setting -mx and -ms to the same value.
This will prevent us from hitting the most common sources of the vm_exit_out_of_memory exits.
The best thing to do is increase swap size on the machines encountering this error.
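
On current Sun/Oracle JVMs the heap bounds are normally given as -Xms and -Xmx (as far as I know, the -ms and -mx spellings in the bug report are older forms of the same options). A minimal sketch of the “same value” workaround follows; fixed_heap_opts is a made-up helper, and how the options reach WebLogic’s start scripts is not covered here.

```shell
# Hypothetical helper: fixed_heap_opts SIZE_MB
# Emits matching minimum and maximum heap flags, so the JVM allocates its
# full heap up front and never needs to resize it.
fixed_heap_opts() {
  printf -- '-Xms%sm -Xmx%sm\n' "$1" "$1"
}

# Example: a fixed 2560MB heap.
fixed_heap_opts 2560
```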

[If you want to skip the rest of this, feel free: the short version is we boosted swap space from 1GB to 13GB, and it works like a champ now.]

Important Things You Should Know™

  • The version (1.4) and platform (32-bit) of Java are used for a variety of reasons by this product in this component
  • A 32-bit OS/machine[1] can only access ~3GB of RAM (due to OS overhead and bus address mapping strategies)
  • A 64-bit OS/machine can access between 2^48 and 2^64 bytes (256TB-16EB) of memory (depending on addressing model used)
  • There are two types of memory a system can use: heap and stack
  • The jvm gets memory for itself from the host OS from the heap
  • If more memory is needed by the Java application in question, and it has not yet exceeded the max (-Xmx argument) amount available to the jvm, the jvm will get more memory for itself from the system
  • The 32-bit jvm has a certain amount of overhead itself (I have seen 5-25%, depending on the application)
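
The address-space figures in that list are plain powers of two, which can be checked quickly. In this sketch addressable is a made-up helper, and the units are binary (GiB, TiB, EiB), which is what the GB/TB/EB figures above amount to.

```shell
# addressable ADDRESS_BITS UNIT_BITS: 2^ADDRESS_BITS bytes expressed in units
# of 2^UNIT_BITS bytes. Shifting by the difference avoids overflowing bash's
# 64-bit signed integers, which cannot hold 2^64 directly.
addressable() { echo $(( 1 << ($1 - $2) )); }

echo "32-bit: $(addressable 32 30) GiB"   # raw 32-bit limit, before OS overhead
echo "48-bit: $(addressable 48 40) TiB"
echo "64-bit: $(addressable 64 60) EiB"
```

The raw 32-bit limit comes out at 4 GiB; the ~3GB figure is what remains after the overhead and address mapping mentioned above.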

Environmental issues for the application in question

  • 8 CPUs
  • 32GB physical memory
  • ~9GB RAM in use, the rest unused
  • RHEL 4 64-bit
  • 1GB swap

Go check out this video while you think for a few seconds :)

Oh, you’re back? Welcome!

More details about the Sun jvm: when the jvm needs more memory, so long as the system can issue it, it will ask for a multiple of what it really needs (observationally about 40%, or 1.4x the “actual” request). And while it is asking for more memory, it swaps itself out to swap space (virtual memory, or a special location/partition on the drive). After it gets its new allocation, it loads itself back in from swap, and goes on its merry way.

Why does it ask for more than what the application “actually” requested? It’s a best-guess on the part of the jvm – if you have allocated 256M of RAM minimum, and 1G max, when the application asks for 257M, the jvm doesn’t want to ask for more RAM too often from the OS, so it asks for ~360M, with the theory being that if you needed 1M over your initial amount, you will likely need yet more. This continues on until the jvm has asked for as much RAM as it is allowed, or until the application quits – whichever comes first.
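
To get a feel for how many times the jvm would go back to the OS under that model, here is a small simulation. It is a sketch of the observed ~1.4x behaviour described above, not of the jvm’s real resizing policy; heap_growth is a made-up name, and it assumes each resize is triggered by a request 1MB over the current allocation.

```shell
# Hypothetical model: heap_growth START_MB CAP_MB
# Each time the application needs 1MB more than the current allocation, the
# allocation becomes 1.4x that request (rounded to the nearest MB), limited
# to the cap. Prints the allocation after each resize, one per line.
heap_growth() {
  local size=$1 cap=$2
  while [ "$size" -lt "$cap" ]; do
    size=$(( ((size + 1) * 14 + 5) / 10 ))
    if [ "$size" -gt "$cap" ]; then
      size=$cap
    fi
    echo "$size"
  done
}

heap_growth 256 1280
```

Under these assumptions, a 256MB start with a 1280MB cap goes through five resizes (the first to 360MB), and raising the cap to 2560MB makes it seven.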

Last piece of useful technical data:

  • The specific component in the application I was working with asks for 256MB to start, with a cap of 1280MB (we raised that to 2560MB (2.5GB) as an initial attempt to stave-off OutOfMemory errors)

I know it’s been a little while, but think back to that initial list of Important Things … and add into the mix that the component in question was chewing an entire CPU (in normal operation it rarely will go above 25%), and was using 3600MB of virtual memory and 2.8GB of real RAM. That’s a problem. First, because we have 32GB of real memory – there’s no reason the whole component can’t sit in physical RAM (2.8GB is equal to our 2.5GB max plus some jvm overhead). Second, because while it’s chewing an entire CPU, it’s never actually coming up, or, if it does, it’s taking an hour or more (when normally the entire application will start in 12-20 minutes from power on).

What was the problem with this ONE component? The detail is in the list of environmental factors: there was only 1GB of swap space. Uh oh. That means that unless the jvm asks for all 2.5GB up front, it will have to keep re-allocating memory to itself from the system. But with only 1GB of swap space, it has no place to unload itself to while it asks for more and then load itself back into RAM.

What to do? Let’s go back to that obscure Sun bug: “increase swap size on the machine”. We tried going from 1GB to 13GB (had a 12GB partition not being used, so we flipped it to be a swap partition) and rebooting the server.

After increasing swap space, not only does the application start in about the expected amount of time (~15 minutes), but it never pegs the CPU! Woot!

With a newer version of the product, there is an installation prerequisite check to ensure that there is as much swap space as physical RAM installed – but no explanation of why this is now the case.
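
A check along those lines is easy to run by hand before installing. The product’s actual prerequisite check is not something I can show, so this is just a sketch with a made-up name, reading the MemTotal and SwapTotal fields from Linux’s /proc/meminfo.

```shell
# Hypothetical helper: swap_covers_ram [MEMINFO_FILE]
# Succeeds (exit 0) when SwapTotal is at least MemTotal. Reads /proc/meminfo
# by default; the optional file argument is only there so the check can be
# tried against sample data.
swap_covers_ram() {
  awk '/^MemTotal:/  { mem = $2 }
       /^SwapTotal:/ { swap = $2 }
       END { if (swap >= mem) exit 0; exit 1 }' "${1:-/proc/meminfo}"
}
```

On the server described above (32GB RAM, 1GB swap) this would have failed, which is the condition the newer installer now refuses.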

Whether the above travails are the entire reason, or merely a single example of why it’s important, I won’t be installing onto any machine that doesn’t have enough swap again.


[1] without special drivers/kernel modifications