Category Archives: Extreme Availability

Category for our extreme availability blog topics and posts. Look here for thoughts and advice on the subject.

The Extreme Availability Blog is our attempt at framing what this is and is not, and establishing a solid dialog on the subject in the hopes of raising the expectations we have for our friend the computer. In our daily experience with computers, we see spam, viruses, slow web pages, crashes, spinning clocks, and other minor annoyances that really don’t help our perception of what computers can do. What we don’t see is the infrastructure that quietly runs in the background, making sure that our money is moving around correctly without prying eyes, running our power plants, giving us the security of knowing that we can pick up the phone and call 9-1-1, and letting us go to the grocery store and pay for food with confidence that the computers will be up, even if our credit is maxed-out.

Success Study: Massive Performance for Financial Switch Ported to NonStop

The Client’s Profile

Our client is an international corporation that has a mature and well-adopted financial switch in the UNIX/Oracle space. They have over one hundred financial customers.

The Client’s Challenge

As Indestructibility and Continuous Availability requirements become more desirable, our client wanted to port their code to the HP NonStop™ platform. Their challenge was to have their system perform better than it had on other platforms. We were engaged to make recommendations on how to improve the code and to augment their porting effort.

Our Mandate

As part of achieving our client’s objectives, we were given the mandate to fine-tune their code, practices, SQL structures, and application management framework. Also, we were engaged to make recommendations on design changes for porting code to the NonStop architecture.

The Key Benefits Our Client Received

In order to help our customer, Nexbridge contributed in the following ways:

  1. Using our indestructible architecture constructs, we helped move their code to the NonStop TS/MP environment, with online code update capabilities.
  2. Migrated to SQL/MX 3.2.1 to support large tables and higher performance.
  3. Reduced the number of system transactions to match actual business transactions, thereby reducing transaction times and massively increasing capacity (see the sketch after this list).
  4. Leveraging new technology available in the IP CLIM, we built the ability to preserve socket connections across failures.
  5. Revised their code to function at an unprecedented business transaction rate.
  6. Acted as guides for their technology plans.
  7. Conducted a benchmark to demonstrate the new speed and capabilities of their application and to enable the publication of their results.
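
To make point 3 above concrete, here is a minimal sketch of the idea in Python, using SQLite purely for illustration (the real work was done against SQL/MX under TMF, and the table and column names here are invented): instead of paying commit overhead for every step, all the steps of one business transaction are grouped under a single system transaction.

```python
import sqlite3

# Hypothetical schema for illustration only; the real system used SQL/MX under TMF.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE journal  (id INTEGER PRIMARY KEY AUTOINCREMENT, note TEXT);
    INSERT INTO accounts VALUES (1, 1000), (2, 1000);
""")

def transfer_one_txn_per_step(conn, src, dst, amount):
    """Anti-pattern: each step commits on its own, so one business
    transaction costs three system transactions."""
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
    conn.commit()
    conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
    conn.commit()
    conn.execute("INSERT INTO journal (note) VALUES (?)", (f"{src}->{dst}:{amount}",))
    conn.commit()

def transfer_one_business_txn(conn, src, dst, amount):
    """One system transaction per business transaction: all steps commit
    (or roll back) together, and the commit overhead is paid only once."""
    with conn:  # BEGIN ... COMMIT/ROLLBACK around the whole unit of work
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("INSERT INTO journal (note) VALUES (?)", (f"{src}->{dst}:{amount}",))

transfer_one_business_txn(conn, 1, 2, 250)
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```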

End Conditions

Once our services were completed, the client had the fastest and most cost-effective financial switch available: reliable, hot-upgrade capable, and delivering performance levels of over 6,000 TPS. Their success is documented on the BPC Banking Technologies website. We are honoured to have played a part in this historic and ground-breaking project. We are also honoured to continue our association with this client.

For more information about this and our other successes, please contact us.

With Flying Colours!

I am so pleased to have just done an Indestructibility Assessment for the Richmond Hill Chamber of Commerce, who passed with flying colours. It is actually rare that organizations are so well prepared. Some of the things they did right:

  • Weathered the recent Ice Storm very nicely. Their operations barely noticed it.
  • Have sound backup plans for loss of technical and human resources.
  • Replace their UPS batteries on a regular basis.
  • Manage their systems very nicely.
  • Understand and follow their data retention requirements.
  • Plus lots more…

As a member of the Chamber, I am very happy that we are so well represented and glad to be associated with them.

WordPress: So Where are We Now?

It’s been four days since the deployment and we’ve been live for two. So how are things? Not too bad. We’re having problems with analytics and spiders finding us, but that’s expected. Changing DNS makes spiders pretty angry at the best of times. I’m just glad that Shelob isn’t after me – it’s not like I have an Elven Sword or Starlight.

Right now, things are reasonably stable. No reports of problems have come my way, and backups are being taken without issue. I can take a backup and Git can tell me what changed between versions – a big plus if you have to manage content as a database. The ECLIPSE compare engine does take a long time, even on an i7, because the MySQL snapshot is so massive; Git’s own compare tools might be better here.
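
As an aside, the comparison itself doesn’t need a heavyweight IDE. Here is a rough sketch of the idea in Python, with invented file names, assuming each backup is a mysqldump text snapshot checked into Git (dumping with mysqldump’s --skip-extended-insert option keeps one row per INSERT line, which makes the diffs far more readable):

```python
import difflib
from pathlib import Path

# Hypothetical file names: two successive mysqldump snapshots of the content database.
old_snapshot = "backup-2014-01-10.sql"
new_snapshot = "backup-2014-01-11.sql"

old_lines = Path(old_snapshot).read_text(encoding="utf-8").splitlines()
new_lines = Path(new_snapshot).read_text(encoding="utf-8").splitlines()

# A unified diff of the two text images shows exactly which rows changed
# between backups -- essentially what `git diff` reports on the same files.
for line in difflib.unified_diff(old_lines, new_lines,
                                 fromfile=old_snapshot, tofile=new_snapshot,
                                 lineterm=""):
    print(line)
```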

Our one annoyance? Spambots. Yes, they caught up with us moments after the DNS switch happened. Fortunately, we are able to block posts. We have a large number of fake users that I have already deleted – hopefully I did not remove anyone real. Don’t worry, our editors won’t blindly delete users with posts so even if we accidentally delete a real user, the posts will be kept (or at least reviewed first for spam). We’re researching how to prevent the initial registration, but that’s a harder problem than catching actual spam posts.

More tomorrow on a really cool idea one of my sons had for learning about disaster management training.

WordPress Deployment: Here is What We Did

It was a dark and stormy night…

The adventure started by building a local copy of WordPress using what is basically MAMP (Mac, Apache, MySQL, and PHP), although I configured each component on its own rather than using the MAMP product. No real differences there. DNS on a Mac is a bit of a gotcha, because it sometimes does not like localhost vs. 127.0.0.1 if the server is not itself a DNS host, but that’s a different problem. The Apache instance had to be on a different port, so at least our development URLs were clearly distinguishable. This turned out to be a good thing.

We then migrated whatever content we wanted from the old website. Forget Import. It doesn’t do what you’ll want. Probably never. At this point, some of the plug-ins worked, some didn’t. Anything with email was DOA because we are inside our firewall. No problem, we expected that.

So now the fun bits. There is no “deploy” button. FrontPage had one. DreamWeaver has one. Most things do. WordPress? Not so much. Sure, FTP worked to move the PHP code and ECLIPSE update sites. But there was more…

To move the content, we had to take a copy of the database. Oh, there’s nothing like TMF online dumps here, mind you; it’s dump-to-text via mysqldump, because WordPress requires MySQL. (Yes, that’s in red because of all the blood and pain.) Our web host provider then used one of their magic scripts to go through the text image of the database and change all the URLs. Are you afraid yet? They then created our database from that script. This was early last Friday.
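
I don’t know exactly what the host’s magic script does, but the general shape of such a rewrite, sketched in Python with made-up URLs and file names, looks something like this. The nasty detail is that WordPress stores some settings as serialized PHP, which embeds byte lengths, so a blind find-and-replace can corrupt those rows unless the lengths are recomputed – and even this regex is a sketch, not a proper parser of the serialization format:

```python
import re

OLD_URL = "http://localhost:8080"    # assumed local development site root
NEW_URL = "http://www.example.com"   # assumed production site root

def fix_serialized_lengths(sql_text: str) -> str:
    """Recompute the byte length in PHP-serialized strings such as s:21:"http://...";
    so they stay valid after the URL substitution. A sketch only: it assumes the
    serialized string itself contains no embedded '";' sequence."""
    pattern = re.compile(r's:\d+:"(.*?)";')
    return pattern.sub(
        lambda m: 's:%d:"%s";' % (len(m.group(1).encode("utf-8")), m.group(1)),
        sql_text)

with open("dump.sql", encoding="utf-8") as f:   # text image produced by mysqldump
    sql = f.read()

sql = sql.replace(OLD_URL, NEW_URL)   # the blunt part: swap every occurrence of the site root
sql = fix_serialized_lengths(sql)     # the subtle part: keep serialized PHP values consistent

with open("dump-production.sql", "w", encoding="utf-8") as f:
    f.write(sql)
```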

Once that was done, we had to check everything. PayPal had to be used once, or errors were displayed to the user; that’s gone now. The Contact plug-in had to be disabled and re-enabled. The Captcha plug-in had a security problem in its internal directory that we had to set and reset, and now it’s happy too. Not bad, but very manual.

Forty-eight hours later, the DNS caches have all flushed and the new site is visible to one and all. We’re breathing a little easier, but only a little. When we make content changes, instead of pushing them up to the host, we have to make them in place and then take a backup – that, or we lose comments and posts. I can force an on-demand backup any time I want (there are plug-ins like BackWPup to do that), but there’s still no effective way to push wholesale changes from development to production. So what’s the plan?

Well, we’re talking to people. WordPress hasn’t quite figured out this discipline yet; that or they are letting the community handle it. They do not have an official position on deployment. Yet. I know a blogger who is getting comments out there. You might know him. There are plug-ins being built that are starting to take this stuff into account. They aren’t free, and if I’m going to pay for something, it better be rock solid – you know, like we expect in the NonStop community. Maybe it is an opportunity for one of us to do something about it.

WordPress Lessons Learned – The Hard Way

Deep Breath

It’s difficult for me, used to a high level of operations management discipline, to look back on our website deployment with anything but shame. Here’s the story of our modest accidental success, or horrible failure – depending on your point of view.

Last fall, our web host provider informed us that support for FrontPage extensions, which we used on the previous incarnation of the website, was being dropped. Great! I finally had the motivation to get rid of the cobwebs and move to something modern. We’ve all been there. This started a long internal discussion of what to use for the new improved nexbridge.com. Our hosting provider suggested WordPress.

Moving the content was interesting, and difficult. Import did not work, so we had to move each page by hand. The WordPress editor is not 100% consistent with HTML, so simple copy-and-paste didn’t work out either. But that’s just a side note. We’re currently fixing some of the paste issues that accidentally carried image links over from the older site. Paste only text; do not assume images are preserved.

But the work we did was all on a local server, to try things out. How to get the content up to the hosted environment? That was the question. The website has many moving parts:

  • Our ECLIPSE update site,
  • the PHP code for WordPress, its plug-ins, our customizations,
  • and the website content, which hides inside a MySQL database – and it has to be MySQL. That’s a WordPress requirement.

The ECLIPSE update site and our customizations are no problem. We’re good at that. Standard staging, install, fallback using Git. No worries.

The WordPress code has internal pointers to the DNS root of the website, so those have to be changed when you deploy, or you can use redirection code, which still needs to be modified. Cue fear here.

Some of the plug-ins have caches and internal pointers that depend on the DNS site root also. Those had to be reset after we deployed. I was not happy about that.

An important lesson we learned was that the security settings inside the PHP code area are crucial, and generally wrong. There are caches, you see. The security settings and users on the target system are usually different from those on your build machine. Enter annoyance.

And probably the worst part: the way to deploy website content is through SQL scripts. You basically dump the database to text, upload it, change a bunch of things in the script, and apply it to the target database – and every table is blown away and recreated. OK, I can accept that once. But what about our next set of changes?
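
For anyone who hasn’t lived through it, the whole cycle looks roughly like the sketch below, with made-up database, user, and file names, and mysqldump/mysql invocations you would adapt to your own hosting setup (the “change a bunch of things” step is sketched separately in the follow-up post):

```python
import subprocess

DB_NAME = "wordpress_site"    # assumed database name
DB_USER = "wpuser"            # assumed database user
DUMP_FILE = "site-dump.sql"   # assumed name for the text image of the database

# 1. Dump the whole content database to text. mysqldump emits DROP TABLE /
#    CREATE TABLE / INSERT statements for every table in one big script.
with open(DUMP_FILE, "w") as out:
    subprocess.run(["mysqldump", "-u", DB_USER, "-p", DB_NAME],
                   stdout=out, check=True)

# 2. Edit the text image (site-root URLs, serialized values, and so on),
#    then upload it to the hosting provider.

# 3. Apply the script to the target database. Because the script drops and
#    recreates every table, the whole content database is blown away and
#    rebuilt -- which is exactly why incremental changes are so awkward.
with open(DUMP_FILE) as inp:
    subprocess.run(["mysql", "-u", DB_USER, "-p", DB_NAME],
                   stdin=inp, check=True)
```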

In any event, we’re still working on it and are not live yet. We’ve made some of the site available for now. More to come on how we solved it…. [followup-post]

The Indestructible Intersect Point

I’m writing this blog while watching the first game of the Montreal Canadiens’ playoff hopes. There was a time when I was growing up when they were virtually undefeatable. We can only hope now. Yes, I’m a fan. Is there a relationship? Probably not, but who knows? I’m also still snickering at the maintenance outage that the hosting site for this blog had on Thursday morning. Does anyone see the irony in that?

To continue from where I left off last week: improving availability typically follows an exponential cost function, where each 9 of availability costs substantially more than all the previous 9’s combined. So at some point you reach diminishing returns, where it costs you more than it’s worth.

But what if you looked at the problem from the other direction – specifically, assume that the system will never go down, and work backwards? Wouldn’t that be unattainable and cost an infinite amount of money? If you come at it from the wrong direction, starting from an unreliable system and trying to make it perfect, you’ll never get there – yes, that’s a debatable point, so go ahead and argue with me. So start from the assumption that outages are unacceptable right from inception. There’s a lot you’ll need to build and invest in, but it’s actually a quantifiable cost that you can arrive at using traditional project budgeting techniques. Traffic routing, reliable platforms, and sophisticated version control are all elements of it. Infrastructure is a huge part of the cost, as is cultural change in the operations and development groups – we’ll go there later – but is it worth it?

[Figure: the indestructibility cost line added to last week’s availability cost curve]

Here’s another cost graph where the cost of indestructibility is added to last week’s picture. Strangely, it’s a straight line. It makes no difference in the cost whether you run an indestructible solution 7x24x365.25 or 20 hours a day. So again, why bother?

In many situations these days, installation can take days, not hours. Try renormalizing a multi-terabyte database in your normal outage window. You can’t. It doesn’t matter whether you are at 99.9999% or 99.99%. Indestructibility means you have to have the ability to perform the renormalization while the system is up – a daunting task, but possible.

What the cost curve shows is what I call the Indestructibility Intersect Point – let’s say iIP, to coin an acronym. It’s where the availability curve and the indestructibility curve meet. If your indestructibility investment is less than your outage cost, which it can easily be (again, you know who you are out there), why bother chasing the 9’s curve? Sometimes the iIP will be above the outage cost; that’s when you don’t bother. So the question for you to think about – who knew there would be homework in a blog – is this: do you know what your three costs are, so that you can decide what to do?
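
If you want to play with the homework, here’s a toy model of the three costs in Python. Every number in it is an assumption for illustration only; plug in your own figures to see where your iIP lands:

```python
# Toy model of the Indestructibility Intersect Point (iIP).
# All dollar figures below are illustrative assumptions, not real project costs.

BASE_COST = 100_000              # assumed cost of the first 9 (90% availability)
COST_MULTIPLIER = 3.0            # assumed: each extra 9 costs ~3x the previous one
INDESTRUCTIBLE_COST = 5_000_000  # assumed flat, quantifiable up-front investment
OUTAGE_COST_PER_MINUTE = 20_000_000 / 60  # assumed penalty rate
MINUTES_PER_YEAR = 365.25 * 24 * 60

def cost_of_chasing_nines(nines: int) -> float:
    """Exponential curve: each 9 costs more than all the previous 9's combined."""
    return sum(BASE_COST * COST_MULTIPLIER ** n for n in range(nines))

def outage_exposure(nines: int) -> float:
    """Expected downtime cost per year at a given number of 9's."""
    downtime_minutes = MINUTES_PER_YEAR * 10.0 ** -nines
    return downtime_minutes * OUTAGE_COST_PER_MINUTE

for nines in range(2, 8):
    chase = cost_of_chasing_nines(nines)
    print(f"{nines} nines: chasing-9's cost ${chase:>13,.0f}, "
          f"outage exposure ${outage_exposure(nines):>13,.0f}, "
          f"indestructibility is "
          f"{'cheaper' if INDESTRUCTIBLE_COST < chase else 'not cheaper'}")
```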

What’s coming next? Well, we’re starting to get past the introduction, so I’m going to have to write about details soon. Stay tuned and give me feedback.

Indestructibility vs. Availability

An interesting perspective came from a discussion I had recently with Richard Buckle of the Real Time View (http://itug-connection.blogspot.com/). We talked about a major difference in perspective between highly/continuously available systems and indestructible systems. In the case of indestructibility, people take the position that the system will always be running, forever, right from the beginning. It’s the core assumption. How you build your infrastructure, software, and platforms needs to keep that firmly in mind right from the initial concept. With highly available systems, people start with a trade-off of what is good enough vs. cost, whether that’s four 9’s, five 9’s, or better. The trade-offs made during a project come down to how many nines people are willing to pay for. That gives marketing people a really interesting pitch:

We can give you a brand new system that has five 9’s of availability, and it will only cost you ___.

Sure they can, but who pays for the other changes? You know, the small stuff, like retraining everyone, change management, business process re-engineering (BPR), and testing cycles – all the “your mileage may vary” costs that are somehow always much bigger than anyone expects, and often larger than the technology outlay to get you that extra 9 in the first place. At some point, and it’s very specific to your organization, the cost of indestructibility is actually less than chasing the exponential 9’s curve when you start from a system that is fundamentally fragile.

Now, suppose you’ve got a system that is happily running along at 99.99% of the time, and somebody figured out that every minute of outage costs your company twenty million dollars in penalties (You know who you are out there), and you’ve had outages. Or worse, suppose your outages are larger than that, and you cross the critical fifteen-minutes-down-and-lose-your-charter line. In order to add another 9 to your availability numbers, you’re going to have to rework your environment, maybe change platforms, change your processes, rewrite your software, build new deployment technology, get user acceptance testing signoffs, and worse, try to find funding in the organization to make all that happen. That’s pretty daunting. The fear of having to go through that for every nine is what led me down the indestructibility path in the first place. Organizational change is far harder than technological change, but that’s often what we have to do to add that elusive and expensive additional 9.

[Figure: sample graph of the risks and rewards of availability]

To illustrate this point, here’s a sample graph of the risks/rewards of availability. Next time, I’ll talk more about this cost function and what it looks like in the indestructible world. You might be very surprised.

What Is Indestructible Computing Anyway?

I was recently asked a good question by one of the readers: “What is indestructible computing, and why should I care?” Here are a few common terms. What you should keep firmly in mind is that whichever aspect of a system you look at, the actual service level you experience is usually that of your weakest component. Guess which one is almost always the weakest? If you’ve been following the blog, you already know: software change.

General Purpose Computing

Well, you’re probably reading this blog from a general purpose environment. A workstation or laptop can be considered general purpose hardware. Your browser could probably be considered general purpose software. The combination of the two gives you a general purpose environment.

Highly Available Systems

These systems are available most of the time – generally 99.99% of the time, or slightly under 5 minutes of unplanned or planned downtime a month. Banking systems are typical of these. Fitting maintenance into even a five minute window is difficult, particularly when you’re upgrading disks or restructuring your Operational Data Store (ODS).

Continuously Available Systems

These systems are available virtually all of the time – generally 99.999% of the time, or about 30 seconds of downtime a month. Extensive use of independent components allows these systems to operate virtually without any unplanned outages. Planned outages do occur for upgrades, but the window for these outages is very small. There’s a lot of confusion between Highly Available and Continuously Available systems; the lines are pretty blurry, and I won’t really differentiate between them much. That there is even a distinction is arguable.
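
If you like to see the arithmetic, translating these percentages into downtime budgets is a one-liner. A quick sketch, assuming a 30-day month:

```python
# Downtime budgets implied by each availability level.
MINUTES_PER_MONTH = 30 * 24 * 60      # assuming a 30-day month: 43,200 minutes
MINUTES_PER_YEAR = 365.25 * 24 * 60   # 525,960 minutes

for label, availability in [("four 9's (99.99%)", 0.9999),
                            ("five 9's (99.999%)", 0.99999),
                            ("six 9's (99.9999%)", 0.999999)]:
    monthly = MINUTES_PER_MONTH * (1 - availability)
    yearly = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label:20s} {monthly:6.2f} min/month "
          f"({monthly * 60:6.1f} s/month, {yearly:7.2f} min/year)")
```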

Critical Systems

These systems include some of the obvious life-critical systems: flight control systems; rockets; many health monitor devices. Systems like this do not have the same level of long-term availability that continuously available systems have, but during their duty cycle, no outage is permitted at all. Fortunately, no changes are generally permitted while the systems are up. How many launches were delayed because of sensor or software issues?

Long-Life Systems

In long-life systems, reliability is the number one priority. Unscheduled maintenance is usually impossible or cost-prohibitive. Scheduled maintenance is possible but not desirable, and usually involves only software components. During maintenance, rigorous testing is done to ensure that the system will function reliably when back online. Communication satellites and the Mars rovers fall into this category. Even then, subtle defects, like mixing imperial and metric units in a calculation, can cause disastrous failures.

Indestructible Systems

A truly indestructible system builds on the best of all of these systems. The systems are expected to be long-life, yet dynamic. Change is not only possible, but expected. Yet there are no unplanned outages and no planned outages. Not only small components, but major components like data centres can go offline without a perceived outage or a noticeable reduction in service levels. Maintenance is done while the system is up.

And I don’t blame anyone for thinking indestructibility is unattainable. It’s very hard to get right and even then, it’s always possible that something will go wrong. In future posts I’ll go into what it takes to make this work. Hopefully you’ll see that indestructible systems are practical in the real world and understand what it takes to make them work for you.

The next post will go into the starting points of view for building these systems and how money gets wrapped up in it.

Perceptions on Destruction

So, my system blew up. Ok, not really, but what does that mean to you? It means a lot of different things to me. Here are a few things that have gone wrong with my computers recently:

  • A monitor refused to power on at start-up.
  • A patch came in that caused a test system to refuse to boot.
  • A software license key got corrupted – you saw this one last week.
  • An external battery melted into a puddle of smoking silicon, plastic, cadmium, and goo.
  • A laptop wouldn’t even power up, but I had complete backups.

There are more, but I think you get the point: I’ve had my share of interesting experiences. So which one caused me the most grief (and embarrassment)? Well, I have many monitors, so losing one is never an issue. External batteries are interesting devices, but it only cost me some time on the plane that I used for some much needed rest anyway. The software vendor with the license key issue responded quickly, so, while I lost six hours, it was overnight and I should have been sleeping. So what bothers me the most?

From a convenience standpoint, there’s no question that losing my laptop was the worst. It took me six weeks to get back to full speed. However, the laptop was at the end of its life. I had a backup and lost virtually nothing, but I couldn’t buy a comparable laptop that had the version of the operating system and productivity software I wanted on it, so reconstructing my environment was painful. But the key part of it was that I didn’t slow down at all for my customers. Not a single document or presentation revision was lost, nor any e-mails, nor anything else of importance.

From an operational standpoint, my company took a hit because one of our servers refused to boot. We had to run without our core document repository for almost a day. There was no issue with the hardware, though. The problem was that an operating system patch came down automatically, but the BIOS couldn’t handle it. The hardware vendor issued a BIOS patch after the operating system patch came down, and the problem was resolved. So who’s to blame?

In a word: me. It’s always my fault. Hey, when you run a company, it’s your fault. Take the blame and move on. I’ve since made sure to disable automatic patch installation on all my servers, and patches now go through quarantine prior to installation on business-critical systems. The operating system vendor has to take some of the blame for not testing their patch adequately on very common BIOS releases where their operating system is installed as OEM software. And it’s also the hardware vendor’s fault for not warning their customers in advance of the issue. I can’t believe that we were the first company to encounter the problem – although, as some of you readers already know, it’s happened before.

So what did I learn? There was egg on my face, of course. My staff was pretty upset that the server was down, even if only for a short period. My customers didn’t notice, because it was an internal server, fortunately. But, mostly, I learned that it’s not the hardware. The physical machine had no problems, although there were lots of amber warning lights flashing at me. It’s not the individual software components, because they all worked to specification, according to the vendors anyway. It was the inter-relationship between changing software components. And you know what? That’s usually what burns you. A few years ago, I gave a keynote where I asked the audience to choose which of a set of disaster scenarios represented the most likely risk to their organization: an asteroid, nuclear war, floods, mould, and a picture of my (then toddler) son at a keyboard. I have to give the group (one of the HP NonStop user groups) credit for getting the answer right – my son. Changing software represents something far different from a disaster scenario. It is a quantifiable, known, and expected risk. Each time you change your system, you are putting your company, your customers, and your stakeholders at risk, because something may break. On the other side of the coin, not changing your system also puts the same group at risk of no longer being competitive. So what do you do?

It’s simply not an option to stand still. Could we surf the web if we all were using bicycle-chain driven computers or upgraded looms? I don’t know about you, but I can’t blog on a loom. My cats can blog on a rug, but that’s a different story and very messy. So change is something we want and have to embrace. Dealing with the risk-reward of moving forward is pervasive in technology. We can’t ignore it, so we have to change and put in new releases. The questions are where to find a balance and how to do it safely.

Stay tuned for upcoming entries where I’ll talk about this balance. The next blog will go into a comparison of different levels of expectation of reliability.

Running with a Security Blanket

I remember, back when I was two, I had a little blue blanket. It had a blue shiny border and made me feel warm, comfortable, and secure. I cried when my mom washed it because somehow it was different when it came back. It smelled nice and clean, but it just wasn’t the same somehow. By now, you’re probably wondering why I’m bringing this up. So am I, actually, but I’ll get there.

This blog is coming at you directly from the HP NonStop Security SIG in Canada. A number of vendors showed up today to present their capabilities and perspectives. It was a pretty good event. Topics included: PCI Compliance, Sarbanes-Oxley (SOX) Compliance, Kerberos, single logon, various protocols, emulations, integration points, auditing and reporting. OK, so why am I rambling on about security on an Indestructible Computing blog?

Security is rarely considered in the Indestructible Computing domain. Yet security breaches definitely contribute to outages, particularly when the criminal is bent on malicious damage rather than data access. Fortunately, of all the breaches, this kind is not that common. A bigger concern these days for security is protecting data from prying eyes. But come to think of it, if you get audited and get shut down because you’re too vulnerable, that’s a pretty big problem for your customers.

But if you look at indestructibility, security and authentication can play in other ways that are not obvious, but annoyingly interrupting. Suppose a customer logs onto your banking application using their card number and password and then changes their password. If everything goes right, all the servers happily running in the data centre pick up the credential change and are able to service the customer’s requests for balances, transfers, and other inquiries. But what if one system is down for maintenance (it happens)? The password update isn’t picked up by that system immediately, but that should be OK: the user gets a note that some part of the system they don’t care about isn’t available, and they carry on happily. An hour later, the system comes back and the user returns. The user now wants to use resources on the system that was down, but the batch job that updates passwords from the master password server hasn’t run yet. Now you have an unhappy user who has to call customer support. That costs you money for the support agent and costs you credibility with the customer. So the important part of this equation is that password and credential management must have no latency. It may even be that the servers that process your credentials belong right up there with the most critical parts of your service offering, because they are customer-facing.
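
To make “no latency” concrete, here is a minimal sketch in Python of one way to treat credential updates as first-class traffic rather than a batch job. The class and method names are invented for illustration: updates are pushed synchronously to every node that is up, queued for nodes that are down, and a node is not allowed back into service until it has replayed what it missed.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One back-end server that validates customer credentials (hypothetical model)."""
    name: str
    up: bool = True
    credentials: dict = field(default_factory=dict)   # card number -> password hash
    pending: list = field(default_factory=list)       # updates missed while offline

class CredentialFabric:
    """Treat credential changes as first-class traffic: pushed everywhere, no batch job."""

    def __init__(self, nodes):
        self.nodes = nodes

    def update_password(self, card: str, new_hash: str) -> None:
        for node in self.nodes:
            if node.up:
                node.credentials[card] = new_hash       # synchronous push to live nodes
            else:
                node.pending.append((card, new_hash))   # queue for nodes down for maintenance

    def rejoin(self, node: Node) -> None:
        # A node may not take customer traffic until it has replayed what it missed.
        for card, new_hash in node.pending:
            node.credentials[card] = new_hash
        node.pending.clear()
        node.up = True

# Usage: a customer changes their password while node B is down for maintenance.
node_a, node_b = Node("A"), Node("B", up=False)
fabric = CredentialFabric([node_a, node_b])
fabric.update_password("4111-xxxx-xxxx-1111", "hash-v2")
fabric.rejoin(node_b)   # B catches up before it serves a single request
assert node_b.credentials["4111-xxxx-xxxx-1111"] == "hash-v2"
```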

But that’s not all. Current audit requirements mean that so much logging is going on that companies need increasing amounts of disk every year. We’re even hearing that we’re going to have to eventually keep records of all traffic going through our routers. Who makes this stuff up, disk drive manufacturers? We’re projecting a need for terabytes of storage just for security and audit compliance. And, the rub is that if you run out of disk, your application cannot process transactions or even inquiries. You actually have to shut down until you can start logging again. Now that is not indestructible, is it?

Random rant: In an effort to keep things politically correct to reduce HR vulnerabilities and access to bad sites, some companies are putting in activity loggers as part of their security and audit infrastructures. These are going on your laptops and workstations! The concept is great for catching slackers and indiscriminate porn surfers. The problem is that some of these tools can also capture credit card information, passwords, and other identifying information. Who is securing the HR department? What if their tracking data is hacked?

Security is like a warm blanket. You can wrap your systems up in it and feel all nice and comfortable. But some hackers might want to take away your blanket or poke at you through it from places you can’t see. More importantly, you can’t hide from your customers under it. And unlike some superheroes’ capes, blankets are not indestructible.