mom: monitoring cpu spikes the right way

NOTE: this script is deprecated. feel free to use it, but you should refer to this post, which actually has a newer, cooler script.

one of the things i can't stand about most monitoring systems nowadays is that they're not really designed to be viewed by an operator. i think we've diluted that term. we don't enable "operators" to really do much of anything. we give them a little console they can stare at and hope that if they see some alert pop up, they'll wake up and dial someone. how does that translate into a successful use of technology? i think we've all been around a phone long enough to know how to dial it. so ... why not take some baby steps and move forward?

here's my baby step. i don't really do things of my own volition because unless it's making my life easier, it's hard to be inspired. anyway, a coworker received an alert on a cpu spike and asked the obvious question: what's making the condition occur? this raises interesting questions on its own because in order for anyone to answer this, they'd have to be at the machine at the time the problem occurred...

or at least in spirit, proxy, or whatever. then, you've armed your operator with at least a tad more information than what they had before. for mom anyway, the best way to do this is letting the agent handle it.

i wrote up a script that was bastardized out of the microsoft windows base operating system state monitoring script. it's the one used to detect cpu spike conditions. that script returns a list of processes utilizing more than 10% of the cpu. so... i took most of the pieces, rearranged them, added a parameter for the threshold ... and added it to our environment. as mentioned, it doesn't make sense to use this as a task or anything like that since you'd have to be sitting there glaring at the console, waiting for a cpu spike, and then executing it to catch the problem occurrence. just add the script as a response to an event or maybe a threshold rule.
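to make the idea concrete, here's a rough python sketch of the filtering step. this is not the posted script, just an illustration of "return the processes above a threshold". the function names, the sample numbers, and the event text layout are all made up for the example, and the per-process cpu percentages are assumed to be already collected by whatever runs on the machine.

```python
from typing import Dict, List, Tuple


def top_processes(cpu_by_process: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Return (name, percent) pairs strictly above the threshold, busiest first."""
    busy = [(name, pct) for name, pct in cpu_by_process.items() if pct > threshold]
    return sorted(busy, key=lambda item: item[1], reverse=True)


def format_event_text(processes: List[Tuple[str, float]], threshold: float) -> str:
    """Build the description text for the event the script would write."""
    if not processes:
        return f"no processes above {threshold:g}% cpu"
    lines = [f"processes above {threshold:g}% cpu:"]
    lines += [f"  {name}: {pct}%" for name, pct in processes]
    return "\n".join(lines)


# hypothetical sample, standing in for what would be read from the machine
sample = {"sqlservr": 42.5, "w3wp": 7.25, "svchost": 1.0, "Idle": 0.0}

# 5 mirrors the "Percentage" parameter value used in the sample setup
report = top_processes(sample, threshold=5)
print(format_event_text(report, threshold=5))
```

the threshold is a parameter rather than a hardcoded 10 so the same script can be reused by rules with different sensitivities.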

it'll create an event so make sure you have an alert that'll pick it up. now, i suppose for things that happen over a duration, the information returned may be pointless... since there could be multiple things going on over that duration. oh well... it's a start.

for my sample setup, i created a performance threshold rule that would alert on the processor's % processor time counter. i set it to continuously fire just for my test. on top of that, i created a response to run the script to return processes. since the script writes an event, i set up an event rule to grab the event and generate an informational alert. anyway, here are the details:

script properties:
  • name: Top Processes
  • parameters: Percentage
  • value: 5
threshold rule properties:
  • rule name: [Test Rule] Processor spike occurring!
  • provider: Processor-% Processor Time-_Total-2.0-minutes
  • threshold: the sampled value
  • match when: always
  • response: Top Processes
event rule properties:
  • rule name: [Test Rule] Pick up events for top processes.
  • source: Top Processes Script
  • event id: 40100
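
the three pieces above chain together: the threshold rule fires, its response runs the script, the script writes an event, and the event rule turns that event into an alert. here's a toy python model of that flow. it's only a sketch to show the wiring; the function names and dict layout are invented for the example and have nothing to do with how mom stores or evaluates rules internally. only the source name, event id, and rule name come from the setup above.

```python
from typing import Optional

EVENT_SOURCE = "Top Processes Script"
EVENT_ID = 40100


def threshold_rule_matches(sampled_value: float) -> bool:
    # "match when: always" means every sample triggers the response
    return True


def run_response(description: str) -> dict:
    # stands in for the Top Processes script writing its event
    return {"source": EVENT_SOURCE, "event_id": EVENT_ID, "description": description}


def event_rule(event: dict) -> Optional[dict]:
    # only events from the script's source with the script's id become alerts
    if event["source"] == EVENT_SOURCE and event["event_id"] == EVENT_ID:
        return {
            "severity": "informational",
            "rule": "[Test Rule] Pick up events for top processes.",
            "description": event["description"],
        }
    return None


def process_sample(sampled_value: float, description: str) -> Optional[dict]:
    """Walk one performance sample through the whole chain."""
    if not threshold_rule_matches(sampled_value):
        return None
    return event_rule(run_response(description))


# hypothetical sample value and script output
alert = process_sample(87.0, "sqlservr: 42.5%")
print(alert)
```

if the event rule's source or event id doesn't match what the script writes, the event is still logged but nothing ever reaches the console, which is why the alert rule matters.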

i've posted the script to momresources.org and myitforum.com. pete's usually great about getting back to me once the file has been posted so i'm sure it'll happen soon. have fun with it and let me know what you think. it's rough around the edges, but i think you get the idea.

Comments

  1. that's cool marcus - Matthew G.

  2. Available for download from:
    http://www.momresources.org/momscripts/TopProcesses.txt

