opalis: monitoring grey (gray) agents in opsmgr

let's talk about gr(a|e)y agents.  (did you like my nerdy regex reference?)  a friend of mine who is fairly new to opsmgr was chatting with me about grey agents one day, which led to a search on how to detect them (since opsmgr doesn't do this natively).  that spurred an idea for something to try out.

for those that don't know, grey agents occur when an opsmgr agent goes into a strange state where it's possibly not being monitored (not communicating, healthservice isn't receiving data, etc).  essentially, the agent looks grey.  more detail about grey agents and how to troubleshoot them can be found here.

since grey agents can lead to grey hair, let's look at how to find them.

 

detecting grey agents

andreas zuckerhut posted a powershell script that gets at this information quite easily.  here are the contents of the script:

$WCC = get-monitoringclass -name "Microsoft.SystemCenter.Agent"
$MO = Get-MonitoringObject -monitoringclass:$WCC | where {$_.IsAvailable -eq $false}
$MO | select DisplayName

simple, right? 

...and just for reference, here's a sql script which produces the same result:

SELECT ManagedEntityGenericView.DisplayName, ManagedEntityGenericView.AvailabilityLastModified
FROM ManagedEntityGenericView
INNER JOIN ManagedTypeView ON ManagedEntityGenericView.MonitoringClassId = ManagedTypeView.Id
WHERE (ManagedTypeView.Name = 'Microsoft.SystemCenter.Agent') AND (ManagedEntityGenericView.IsAvailable = 0)
ORDER BY ManagedEntityGenericView.DisplayName

 

preparing for opalis

with this knowledge, you can do a number of different things to get this information to you in a useful way.  since opalis is the playground I seem to be in most these days, I decided to use it as the engine to make some stuff happen.  depending on how you proceed, you could use the sql object or the "run .net script" object from opalis to get the information.  I chose the powershell path.

to get this to work in opalis, a few slight modifications had to be made to the original script.  basically, opalis needs the opsmgr snapin loaded since the default profile doesn't have it.  I suppose you could make the default profile load the snapin?  anyhow, here's the modified script:

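# load the opsmgr snapin, since the default profile does not include it
add-pssnapin microsoft.enterprisemanagement.operationsmanager.client
# switch to the opsmgr provider and connect to the management group
cd operationsmanagermonitoring::
new-managementgroupconnection myOpsMgrServer

# get all agents, then keep only those where IsAvailable is false (the grey ones)
$WCC = get-monitoringclass -name "Microsoft.SystemCenter.Agent"
$MO = Get-MonitoringObject -monitoringclass $WCC | where {$_.IsAvailable -eq $false}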


additionally, I removed the last line of the original script since there's no need to send this through the select cmdlet.
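
if the snapin might already be loaded in the session (for example, if you do go the profile route), add-pssnapin complains that it has already been added.  a minimal sketch of a guard, assuming the same snapin name as in the script above and not tested inside opalis itself:

```powershell
# snapin name taken from the script above
$snapin = 'Microsoft.EnterpriseManagement.OperationsManager.Client'

# Get-PSSnapin returns nothing when the snapin is not loaded in this session;
# -ErrorAction SilentlyContinue hides the "not found" error in that case
if (-not (Get-PSSnapin -Name $snapin -ErrorAction SilentlyContinue)) {
    Add-PSSnapin $snapin
}
```

the rest of the script would follow unchanged after the guard.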

 

creating the opalis workflow

in a very simple sense, all you really need is one object, the "run .net script".  however, since we're in opalis, there are other useful things we could do.  anyway, here we go...

[image: the opalis workflow]

(I left out a start parameter on purpose, so that if you choose to do something like this, you can start it by whatever means you like.  I would probably use a scheduler object and have it run every hour or so.  because of the way opalis handles multiple values going through the pipeline, it is necessary to use junctions and text files to hold the data together before passing it to the "send email" object.  I'll document this in more detail in the next blog post.  for now, just keep in mind that steps 1, 2, and 3 are of primary concern.)

the first step of this workflow kicks off the powershell script that detects grey agents.  we need to make sure the "run .net script" object is properly configured.  use the modified code snippet above for powershell as illustrated below.

[image: the "run .net script" object configured with the powershell script]

to get the information out of this object, the variable in the script needs to be passed as published data. (if you need more information about it, I posted an article titled opalis: properly retrieving published data from powershell scripts that should be able to fill in the gaps.)  I set it up as follows:

[image: published data settings for the "run .net script" object]
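
if the downstream objects only need computer names, one option is to flatten the results to plain strings before publishing them.  this is a sketch rather than what the workflow above does: the sample objects and the $GreyAgents variable name are hypothetical stand-ins, and only the DisplayName and IsAvailable properties come from the script in this post.

```powershell
# hypothetical stand-ins for the objects Get-MonitoringObject returns;
# only the two properties used in this post are modelled
$MO = @(
    New-Object PSObject -Property @{ DisplayName = 'web01.example.local'; IsAvailable = $false }
    New-Object PSObject -Property @{ DisplayName = 'sql02.example.local'; IsAvailable = $false }
)

# reduce each object to its DisplayName so the variable holds plain strings;
# @() keeps the result an array even when only one agent is grey
$GreyAgents = @($MO | ForEach-Object { $_.DisplayName })
```

in the real script, $MO would come from get-monitoringobject and $GreyAgents would be the variable named in the published data settings.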

after detecting grey agents, an attempt is made to reach each server.  if it's offline, that would certainly explain its grey condition.  now, the cool thing about the link coming off of "get computer/ip status" is that it defaults to only passing along the objects that return a success value, which results in a list of computers that responded to a ping yet are in grey status.
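
for anyone who would rather do the ping filter inside the script instead of using the "get computer/ip status" object, here is a rough powershell equivalent.  the function name and its -Probe parameter are made up for illustration.  the default probe uses test-connection, which returns true or false when called with -Quiet.

```powershell
# hypothetical helper: returns only the names for which the probe succeeds.
# the default probe sends a single ping per computer.
function Select-ReachableComputer {
    param(
        [string[]]$ComputerName,
        [scriptblock]$Probe = { param($name) Test-Connection -ComputerName $name -Count 1 -Quiet }
    )
    foreach ($name in $ComputerName) {
        # emit the name only when the probe reports success
        if (& $Probe $name) { $name }
    }
}
```

calling `Select-ReachableComputer -ComputerName 'web01.example.local','sql02.example.local'` would ping each name once and return the ones that answered.  the -Probe parameter is only there so the check can be swapped out, for example when testing without a network.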

the last step uses the "send email" object to send that information to a designated recipient, with the list of computers in the body of the message.  that's pretty much it.

 

additional notes

so now you're wondering, what was all that junk up there?  why couldn't it have been as easy as this?

[image: a simpler version of the workflow]

well, as I alluded to earlier, the way opalis handles multiple objects coming down the pipeline is rather interesting.  we'll talk about that more in the next post.

Comments

  1. when I run the add-pssnapin.... new-managementgroupconnection servername cmdlet via a PowerShell session on my Opalis Management Server, I'm golden. However, when I run the same cmdlet inside a "Run .Net Script" object - I receive the following error message:

    The term 'new-managementgroupconnection' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.

    I added Set-ExecutionPolicy Unrestricted to the top of the cmdlet and that didn't do anything. I've also ensured the OpsMgr console and command shell is installed on my Action servers. Any other ideas?

  2. hey seth -

    just remember that anything that runs in opalis is generally executed under the action service account (or some other specified security context). in your case, the interactive powershell works because the cmdlet is most likely in your path somewhere, whereas it does not reside in a path for the action service account.

    i would recommend running a powershell prompt as the action service account and trying to run through the add-pssnapin part.

  3. ok, i think we're getting somewhere now. i added my opalis action account (svc-opalis) to be a scom admin and now i can run powershell as the opalis action account and run all the commands successfully, whereas i was not able to do that before. but still when i run the script in the opalis, no dice. on my action server, the "opalis action service" is set to svc-opalis and the "opalis remoting service" is set to local system.
