Cloud, DevOps, ITIL, Provisioning – A reflection on James Urquhart’s and Dan Woods’ articles

By pfuetz, March 30, 2010 16:57

James Urquhart (@jamesurquhart) posted a series of articles on operational management in the cloud on his blog: The Wisdom of Clouds.

Following are my comments on his series and the discussions that followed on Twitter.

But first, the links to James’ articles, entitled “Understanding cloud and ‘devops’”:

It also refers to his series around Payload description and Application packaging as well as Dan Woods’ article on Virtualization’s Limits

Dan states:

“But automated provisioning and management of cloud computing resources is just one of three elements needed for intelligent workload management. The other two are the ability to set up and configure an application that is going to run on a cloud or virtualized server, and then making sure the data ends up where it is needed. Here’s where the hard problems arise.”

Dan is right in his observation, but wrong in parts of his argument.

Let’s look back a bit in IT history: 5-10 years ago, the notion of “provisioning” tried to shape the way DCs should be managed. Terms like SODC (service oriented datacenter) and OM (operational maturity) were hip. Still, they neglected a couple of seemingly trivial things, like the inconsistent upgrade paths of software stacks, and the inherent “need” of app users to tweak the apps according to their perceived needs.

Let’s look at the latter first: Why did that culture of “tweaking” or “tuning” apps arise? Because in many cases the HW was not fast enough to fulfill the needs of the end users. That’s why tuning was very popular, and happened nearly always. But there’s a side effect to that:

R. Needleman, Editor in Chief of Byte Magazine decades ago, once wrote on this topic in an editorial:

“And no matter what hardware you have, it’s really hard to learn to play piano.”

This might be proof of Dan’s statement, but it also points to a dilemma that many companies creating and selling hardware face today: the software’s need for CPU cycles didn’t keep up with Moore’s Law. That’s why we see more and more underutilized systems, and why we experience a shift towards appliances. This seems to be the only way for a hardware creator and vendor to survive: create the margin from something other than the hardware, and add stuff to the stack so that a competitive advantage arises across the stack. That’s, for example, why Oracle bought Sun.

From this also comes a second thing: standardization. In order to be able to exchange the underlying hardware for cheaper and more powerful hardware, app deployers and users now tend to tweak and tune far less than they did decades ago. Today, we see way more “standardized” deployments of software stacks than we saw back then. This is also reinforced by the broad acceptance of virtualization: V12N at least provides a standardized layer for the operating system, so that no tweaking or tuning is needed there any longer. That in turn led to the notion of applying the same methods to the apps on top of the OS, and we see so-called “images” becoming the unit of access in virtualized environments.

Back to Dan’s argument, and his problem statement:

I’ve been in provisioning for more than a decade now, and I’ve seen 100% automated setups: from Deutsche Bank’s working RCM (Reliable Configuration Management) and its next version, RCMNG (Next Generation), to the never-deployed APE (A Provisioning Environment) at DaimlerChrysler, to the systems in production at BG-Phoenics or Deutsche Bahn. These things do work, and, yes, they do a 100% automated bare-metal install, up through app deployment and app configuration management, even up to content provisioning.

So, back to James’ points, which also address the former pain point mentioned above!

The main problem of all these environments is that the “meta data” James refers to needs to be adapted and kept up to date, over the lifetime of an environment, to track the ever-changing pieces it is built of. Never assume that the install for version X of app C can also be used for version Y of app C. Here, a big maintenance effort has to be made, and given the diversity of the apps themselves, even across versions, this is something that can’t be neglected. And in an environment where time-to-market and a fine-tuned setup are key, spending time on shaping the meta-data handling simply didn’t happen, or wasn’t worthwhile.
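To make the version-coupling problem concrete, here is a minimal sketch (the app names, paths, and flags are invented for illustration; this is not any specific tool’s data model). The point is that install metadata is only valid for the exact version it was written against, so a provisioning system must refuse to reuse it across versions:

```python
# Hypothetical illustration of the meta-data maintenance burden:
# each (app, version) pair needs its own validated install metadata,
# because upgrades rename config files, drop flags, and so on.

install_meta = {
    ("appC", "1.0"): {"config": "/etc/appC.conf", "flags": ["--classic-init"]},
    # Version 2.0 moved the config file and dropped the old flag,
    # so the 1.0 entry cannot simply be reused.
    ("appC", "2.0"): {"config": "/etc/appC/main.conf", "flags": []},
}

def plan_install(app, version):
    meta = install_meta.get((app, version))
    if meta is None:
        # Never fall back to another version's metadata "because it
        # looks similar" -- that is exactly how automated installs break.
        raise LookupError(f"no validated metadata for {app} {version}")
    return meta

print(plan_install("appC", "2.0")["config"])   # /etc/appC/main.conf
```

Every new app version adds a row that somebody has to write and validate, which is exactly the maintenance effort described above.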

So, with the advent of V12N and the term “Cloud Computing”, we now enter an era where, thanks to the more standardized deployments of OSes as well as apps, and the fact that most of the “configuration” of the apps can already be done during installation, the amount of work needed to manage the “meta data” shrinks. That in turn allows us to think about provisioning on a broader scale again.

James describes in his “Payload description” article and its predecessor exactly the factors that drove companies like TerraSpring or CenterRun to create their provisioning tools. James calls the envelope a pCard. CenterRun, over a decade ago, called this a resource. In CenterRun, resources can inherit capabilities (parameters, deployment routines, and so on; it’s a truly object-oriented approach!) from other resources, and can also query their installation targets (called hosts, which can be physical or virtual; a virtual host in turn can be an “entity” like a web-server farm, into which you can deploy content or “apps”) for their specific capabilities, like spare room for the payload, OS version, CPU type, or you-name-it.
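The resource/host model described above can be sketched in a few lines. To be clear: the class and attribute names below are my own illustration of the idea (inheritance of parameters, querying a target’s capabilities), not CenterRun’s actual API:

```python
# Sketch of a resource/host capability model, in the spirit of the
# tools described above. All names here are illustrative.

class Resource:
    def __init__(self, name, parent=None, **params):
        self.name = name
        self.parent = parent
        self._params = params

    def param(self, key):
        # Parameters are inherited from parent resources,
        # much like attributes in a class hierarchy.
        if key in self._params:
            return self._params[key]
        if self.parent is not None:
            return self.parent.param(key)
        raise KeyError(key)

    def deployable_on(self, host):
        # Ask the installation target for its capabilities first.
        return host.capabilities.get("free_disk_mb", 0) >= self.param("min_disk_mb")

class Host:
    # A host may be a physical box, a VM, or a whole web-server farm.
    def __init__(self, name, **capabilities):
        self.name = name
        self.capabilities = capabilities

base = Resource("generic-webapp", min_disk_mb=500)
app = Resource("shop-frontend", parent=base)    # inherits min_disk_mb

small = Host("vm-01", free_disk_mb=200, os="Solaris 10")
farm = Host("farm-a", free_disk_mb=9000, os="Solaris 10")

print(app.deployable_on(small))  # False
print(app.deployable_on(farm))   # True
```

The inheritance is what keeps the meta-data manageable: generic knowledge lives in one base resource instead of being copied into every deployable.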

So, what was needed in order to successfully use tools like CenterRun (and, no, that wasn’t the only tool of its time; there were many more!) was a modeling of the overall stack, breaking it down into generic, but specific enough, resources and hosts, so that deployment could be done over a longer period of time. The main pitfall was that the term “hosts” led people to believe that a host must be a physical machine.

Now that we see that James’ ideas are nothing new, and had already been proven to work close to a decade ago, why were they not a great success over time, and why does James still see them as part of the solution to his problem statement? The same goes for Dan’s ideas about the need for “systems management” at a higher level.

I see mainly two reasons for that, both already mentioned above:

  • It’s tedious to manage all of the needed meta-data of the whole stack.
  • The stack changed too often to make “provisioning” or “automation” of the stack worthwhile. I once stated: “If you want to automate chaos, you’ll get chaos automatically!”

So, why do people like Dan or James believe, and why do I agree, that now, with the notion of “Cloud Computing”, it’s time again to think about “provisioning”?

First, as mentioned above, the complexity of the stack is shrinking because V12N is helping with standardization: fewer options, easier to manage!

Second, many after-the-fact config and tuning options are now options to the installer, or will simply never be exercised. There are a couple of reasons for that. CPU cycles are now more easily available, so fine-grained tuning is no longer a necessity. Many config options are now install-time options, which also makes the handling easier, because the steps to achieve a given goal are reduced. And many customers learned the hard way that tweaking a piece of software to its limits killed a possible upgrade path to newer versions, as some features or “tweaks” had simply disappeared in newer versions. So customers now tend to stick to more off-the-shelf installs, hoping to be able to upgrade to newer versions more quickly. This in turn also reduces the complexity of the pCard (in James’ terms) or the meta-data modeling, making it possible to perform such tasks.

Third, we see a “reduction” in options for tasks or problems. There’s a concentration going on in the IT industry, which some publications call the “industrialization of IT” or “commoditization”. With that comes a reduction in, for example, the number of software solutions for a given task, and also a concentration in the hands of single companies. That leads to more integrated software stacks, which in turn simplifies the meta-data and makes it feasible to start looking again at provisioning the whole stack. As in the car industry: you’re no longer looking for the individual parts to build a car from, you’re buying it off the shelf. And on the manufacturing side of the story, since Ford’s invention of the assembly line, you’re no longer looking for people to build the car, but at automating the building of the car.

So, what is James saying in the so-far two-part DevOps series?

He’s going back to what I referred to above as “Operational Maturity” (in ITIL speak): no longer managing individual pieces and being forced to react to changes in those resources, but “designing” things so that they can benefit from whatever underlying layers are available.

In my world, there are also constraints that need to be acknowledged. In order to design things, you need at least two ingredients: the freedom (and the capabilities!) to “implement” your dreams, and elements simple enough to build the implementation of those “dreams” from. If you were forced to create a one-off for the implementation of your dream (or design), then some basic requirements might become difficult to achieve, like “elasticity” or “rapid deployment”.

So here, too, the basic rule of “managing constraints” is still in place. Yes, James is right that the focus shifts from OSes and servers to applications. That’s why the term “appliance” was created a while ago, and why all vendors today are shifting their focus to easily providing “services” in the form of an appliance. A current example from the company I work for is the Exadata 2 database machine: order it, get it, and use it at the latest two days after delivery. No more tweaking, configuring, and exception handling when the pieces don’t work as expected. You get what you want and what you need.

This appliance approach, when brought to the “Cloud” needs rules, so that these appliances can happily live together in the cloud. That’s what James describes in his second article of the series.

Still, my mantra from years ago, applies: “If you automate chaos, you’ll get chaos automatically!”

But: today it gets easier to manage the chaos, as there are fewer switches and glitches to handle, due to more standardized elements of the stack. That in turn makes life easier for the “provisioning tool provider”, as the tools themselves no longer need to be over-sophisticated, but can be stripped down to simpler approaches. That’s why, for example, in Oracle Enterprise Manager Grid Control the provisioning part is becoming more important over time, and will be an important part of the systems management portfolio. Without elasticity management capabilities and deployment capabilities, you can no longer manage, and therefore sell, software.

But let’s not forget: here, we’re talking about the “back-end” side of things! The “front-end” side, with the “desktop computing” part, I covered in my former post: VDI and its future

Finally, I’ll leave you with Tim O’Reilly, who published his thoughts on the Internet Operating System, which Sam Johnston calls the cloud… ;-)

Enjoy!

Matthias


8 Responses to “Cloud, DevOps, ITIL, Provisioning – A reflection on James Urquhart’s and Dan Woods’ articles”

  1. pfuetz says:

    I’ll add my first comment myself:

    In http://twitter.com/jamesurquhart/status/11170633199 and http://twitter.com/jamesurquhart/status/11199705169 James questioned automation, as he thinks 100% automation also needs to include “exception handling”. That’s ridiculous! Errors can never be fully predicted; therefore, trying to include exception or error handling in the automation piece is a recipe for disaster. What’s needed is the inclusion of (rapid) change management in the automation process, so that the reactive part can also be done quickly and in an automated fashion. BUT: there still needs to be the ability to “humanly control” the automation!
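    One way to read the “humanly control the automation” requirement is the following sketch (my own toy illustration, not any specific product): the automation executes changes it knows, keeps destructive steps behind human approval, and escalates anything unrecognized to an operator instead of attempting built-in “exception handling”:

```python
# Toy sketch: automation that escalates the unpredictable to a human
# instead of trying to "handle" errors it cannot foresee.
# Action names and return shapes are invented for illustration.

KNOWN_ACTIONS = {"deploy", "restart", "rollback"}

def run_change(action, approved_by=None):
    if action not in KNOWN_ACTIONS:
        # Unpredictable situation: hand control back to a human.
        return ("escalate", f"unknown action {action!r}: operator needed")
    if action == "rollback" and approved_by is None:
        # Destructive steps stay under human control, even when automated.
        return ("blocked", "rollback requires human approval")
    return ("done", action)

print(run_change("deploy"))              # ('done', 'deploy')
print(run_change("rollback"))            # ('blocked', ...)
print(run_change("fix-weird-state"))     # ('escalate', ...)
```

    The “(rapid) change management” part would then be the process by which newly understood situations get promoted into the known set.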

    Matthias

  2. Matthias,

    You bring up some great points in this post, and I will attempt to address them in part 3 of the series. I think you are generally correct that we are simply moving the complexity farther up the stack, and that “if you automate chaos, you get chaos automatically”.

    In your comment, you missed my point entirely. We were talking about the role of operations in devops, and I was simply trying to note that there are still reactive elements to the ops pro’s role. Those exceptions have to be handled, whether or not they count towards the success or failure of the automation involved, and it ultimately falls on operations to figure it out.

    That said, 100% automation coverage of what you know needs to be covered ahead of time is indeed possible.

    Thanks for taking the time to lay this all out.

    James

  3. pfuetz says:

    James, thanks for the comments. Yes, I did miss your point in my article; that’s why I added a first comment… ;-) But I assume we now have a small misunderstanding about what we mean by “automation”… and that’s why you assume I missed it in my first comment…

    Yes, 100% automation can never be achieved, and I assume anybody who tries might be a bit silly… because, as mentioned in my first comment, errors cannot be predicted.

    Look, for example, at automated workload distribution (as an example of something that might need to be handled either by people (devops) or by a “tool” (automation)). You might hit a case where, due to the workload characteristics, some workloads get shifted indefinitely between “hosting hosts”, as every attempt to host them elsewhere overloads that system in turn. So, in order to catch all such corner cases, the automation process and its logic might get so complex that it can no longer be handled or managed.
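    The ping-pong corner case is easy to reproduce in a toy simulation (my own illustration; no real scheduler works exactly like this): a naive “move the biggest workload off an overloaded host” rule shuffles a workload that fits nowhere back and forth forever:

```python
# Toy illustration of workload-distribution oscillation: a job that
# overloads every host it lands on gets bounced around indefinitely.

def rebalance_step(hosts, capacity=100):
    # Naive rule: if a host is overloaded, move its biggest workload
    # to the least-loaded other host.
    for name, loads in hosts.items():
        if sum(loads) > capacity:
            big = max(loads)
            loads.remove(big)
            target = min((h for h in hosts if h != name),
                         key=lambda h: sum(hosts[h]))
            hosts[target].append(big)
            return f"{name} -> {target}"
    return None  # nothing overloaded, nothing to do

hosts = {"a": [60, 70], "b": [60]}   # the 70-unit job fits nowhere
moves = [rebalance_step(hosts) for _ in range(6)]
print(moves)  # ['a -> b', 'b -> a', 'a -> b', 'b -> a', 'a -> b', 'b -> a']
```

    Catching this case needs extra logic (move history, damping, admission control), and every such corner case adds more, which is exactly how the automation grows unmanageable.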

    We see that today, for example, with the HA tools available on the market. Many customers now use a “poor man’s cluster” instead of highly sophisticated tools like Oracle Solaris Cluster.

    So, what I mainly wanted to state are two things:

    1.) Provisioning, as we talked about it in the past, was way too complex to reach mass adoption; therefore, in order to be considered useful, it needs to get simpler.

    2.) Don’t think of provisioning as “roll-out only”. The “elements to be provisioned” also need handling “capabilities”, like, for example, moving to another “host”, stop, start, re-configure, you name it. The “design” of these “methods” and “elements” is what’s key to its success. If we also have “management-like” “features” in the provisioning methods, we can use those as part of “devops” daily business…

    A bit clearer now?

    Matthias

  4. Sam Johnston says:

    Another strong driver for standardised workloads is horizontal scalability – spreading the workload across many commodity machines rather than growing one to cater for changing demands. It’s fine to change the number of Apache worker threads when you’re just running one box, but when you have 100,000 it’s a very different story and you’ll certainly want automated tools (assuming you haven’t got the build “just right” from the start).

    It’s also worth bearing in mind that most cloud computing providers today use virtualisation sparingly, if at all. Amazon use it for EC2 but I would be surprised if any of their other services use it (e.g. S3). The Google platform spans a global network of data centers and runs on bare metal. Similarly, the database engines driving Twitter and Facebook are almost certainly running on large clusters of dedicated machines.

    From my point of view virtualisation is little more than a bridge from legacy to the cloud, and an optional one at that. I think we may well actually be better off if infrastructure had been left out of the picture altogether (it would have evolved on its own anyway) and we had focused on “fabric” rather than “instance” based services. I’m quite convinced this will happen in due course anyway, but we may have been able to save some time had we compartmentalised properly at the beginning.

  5. pfuetz says:

    Sam,

    I’m not as optimistic as you are w.r.t. neglecting the underlying HW and OS “segments”. We all know that even building large SMP systems was, and still is, a very difficult task (Sun was very successful at that, both in HW and in scalability of the OS across multiple CPUs/cores), and you need an OS that helps with that. If you add even more distance (different chassis), you hit the problem of the speed of light. All known, partially solved, still not used in the mainstream.

    The main inhibitor I see is that, from the APP-DEV side, creating apps that really exploit multiple threads or multiple CPUs is still an art. It has even become more difficult, as with the advent of Java the developers’ thinking shrank to “applets” and no longer encompassed “big complex systems”. OK, that shift also helped with “network communication” between the applets (needed because single systems hadn’t been powerful enough), but it still hits the speed-of-light problem.

    And with the advent of big multi-core commodity CPUs (think: Intel Nehalem) being able to talk to one another directly (a big advantage of the Nehalem bus system here!), my guess is that we will still see big single-OS systems and fewer “distributed” apps… This is also driven by the fact mentioned in my article, that the CPU needs of software didn’t keep up with Moore’s Law.

    Yes, you’re right that Google, Facebook or Twitter are more “cloudy” than systems using virtualization as a layer underneath (like Amazon EC2). BUT: those are only three, albeit well-known and prominent, examples of cloud computing.

    So, the main point here is: Who will be those, building apps to be run in the cloud, and where do they come from?

    Just as we see that Linux doesn’t scale well in SMP beyond a certain number of cores (developers don’t have that much money to run and test big machines!), we might also see devs coming from the desktop and expanding their experience “to the cloud”, using things like Amazon EC2 and possibly even “desktop hypervisors” as their development model (see my “VDI future” article, linked in the post above), which might make a lot of sense for “smaller” apps.

    We do not yet know whether the future of computing will again take place in big, centralized DCs, or on small handheld, but powerful, always-on devices. That’s a bet on the future, and I’m not willing to bet here, at least not right now… ;-)

    Matthias

  6. Sam Johnston says:

    The trend I see with devices is that they are turning into single-purpose appliances – look at the iPad for example… its primary function is the browser. Similarly, ChromeOS essentially sheds the OS entirely (except for the bare essentials required for browser life support). The days of having a 1kW+ space heater under your desk are numbered – I’m responding to you on a grossly overpowered MacBook Pro and have a grossly overpowered Mac Pro under my desk and a dirty big iMac at home. To be honest I could get by with something like a MacBook Air running ChromeOS, as virtually everything I do is in the browser.

    On the server side there are many reasons for centralising systems, most notably the massive economies of scale that stem from serving hundreds of millions of users rather than just one. Fortunately the problems you talk about re: multi-threading etc. only need to be solved once and the results are available for everyone – you don’t need to run your own BigTable clusters to benefit from the service via AppEngine for example.

    The way I see it there will be a single, loosely coupled computer (“The Cloud™”) powered by a small handful of very large providers (Amazon, Apple, Google, Microsoft, etc.) and a myriad smaller ones (Joyent et al). There will also be “community” clouds operated by companies like IBM on behalf of large enterprise, but as these will be able to communicate with others they can be considered part of the overall system too. The days of people running their own services are numbered.

    Sam

  7. pfuetz says:

    Sam,

    it seems, you only follow Cloud People. Try, for example:

    http://twitter.com/douglasabrown/status/11404352185

    Or: http://twitter.com/brianmadden, http://twitter.com/crod, http://twitter.com/rspruijt or http://twitter.com/drtritsch

    Those are the VDI type of people… ;-)

    But, you’re right:

    http://twitter.com/thinguy/status/11315541279

    And: I’m writing this at my Sun Ray @ Home; the computing takes place in the office on a central server, and I assume, looking at ThinGuy’s tweet above, that these will get even more centralized in the future. Cool, good! Let’s go that route. My Sun Ray only consumes 4W, plus a monitor. Silent, quiet, cheap! My PC is only switched on when I need flashy Flash stuff, as that doesn’t work too well across WANs, where display and compute are separated. Speed of light and latency are the problems here… And, as you can see, my “bit-hoarding” is done on a cheap Intel Atom system using between 40 and 65W (read my blog), so in total my power consumption is small for the computing I need to do…

    OK, after a bit of twitter-mania (and Sun glorification), back to arguments:

    Yes, I also believe that client-side computing will be reduced, BUT, as we can see, Citrix and VMware have also created apps for the iPad. So there is still the chance of “fully loaded stacks” on “client-side virtualisation” or “server-side virtualisation”. For decades we’ve seen the pendulum swing between client-server and big-client, again and again… So why should we assume that the current trend towards big DCs with “clouds” is the end of all that? And with the exchangeability of the “images”, it may very well go back to client-side computing. We could all ask the simple question: why, in order to write an email or a document, do I need to flood the air with network traffic? I can do that locally on my iPhone; no need to be “online”.

    We’ll see which of these will be out there, but I assume we’ll see a big mix. Yes, there are rumors that, once again, Oracle might offer an “Office in the Cloud” solution, something Sun did shortly after it bought StarOffice. Sadly, in those old days that offering wasn’t very successful; let’s see if the times have changed… Also, there are rumors that VDI appliances might come out (see, for example, ThinGuy’s tweet above). So, yes, centralization of certain things might happen, and, yes, it can be called a “cloud” or “happen somewhere in the cloud”, no doubt about that. Yes, it’s driven by economy of scale, no doubt about that either!

    But I guess you didn’t completely understand my point w.r.t. multithreading; then again, it might not be a big point after all, with Moore’s Law still active. I’m talking about the ability to spread the load of a SINGLE system over multiple threads. I’m not talking about “distributed systems” per se (that was called grid computing in the old days!); I’m talking about massive DBs that perform parallel tasks and need massive memory and CPU in order to do their jobs (Oracle RAC, Hadoop, you name it). These things aren’t easy to program, but you might be right: there aren’t that many such apps. Most can be segmented at higher levels (applets talking to each other), and therefore that argument might be moot… especially given that Moore’s Law will sooner or later provide more CPU cycles than the app needs, so thinking and programming multi-threaded might no longer be necessary. It’s only been a kludge to circumvent the problems that came from systems not being powerful enough…

    Overall I agree with your last paragraph!

    Still, I’m not as sure as you are that that will be the overall future. I assume we will also see “photorealistic image editing” directly on the phone, writing docs and email directly on the phone, and speech recognition on the phone (although we also see Google trying to put many of these things into the cloud and remove them from the actual device). I assume we need one powerful and long network outage to see the benefits of “local compute power”… And with the “provisioning” features of the “cloud”, these things can also easily be provisioned to the device itself.

    So, thanks for the comments, I agree in many parts, but don’t see the future as “cloud only”…

    Matthias

  8. pfuetz says:

    I’ll just add:

    Read: http://blog.drtritsch.com/?p=98

    Benny discusses VDI and its applicability to enterprises and desktops (yes, there’s a difference!).

    Matthias
