O R G A N I C / F E R T I L I Z E R: 06.10

Jun 25, 2010

opalis: controlling maintenance mode with opalis, sccm, and scom

UPDATE: found a problem in the "retrieve ads and updates from sccm" script that was causing the script to stop working under certain conditions.  I've modified it slightly and posted it at the bottom of the blog post.

 

WARNING: this is a proof of concept.  don't just load this in your production environment and kick it off.  you'll be totally on your own (as if that weren't the case already).  while it works in my test environment, it may not in yours.  test, test, test.

 

I've been spending some time toying around with opalis.  the first hurdle, if you've been reading my posts, was in setting up the thing.  the second hurdle was actually getting something useful to work, and the third hurdle was to figure out how to export something for use by communities.  fortunately, I've cleared them all.  the particular proof of concept I wanted to try was using opalis to integrate scom with sccm.

whenever we go through patching cycles, we tend to spend more time than necessary scheduling maintenance modes for our servers.  this goes for patches, software deployments, etc.  since it's a proof of concept, I don't get to all the scenarios that are actually required to make this thing function in its entirety.  I'll make sure to list the things I think are open items that we need to address.  guess what?  that's where you come in.  I'm hoping you guys that are reading this will help out.  it works in the exact scenario that I'll outline... but you know... there are always holes.

 

requirements

  • system center operations manager 2007
  • system center configuration manager 2007 or systems management server 2003 (should work)
  • opalis integration server 6.x (yes, it must be configured AND working)
  • opalis integration packs
    • microsoft sms
    • microsoft operations manager 2007
  • windows powershell v2.0

 

scenario

to get a level set, I want to lay out exactly what this thing should do.  this way, there are no lofty expectations or unexpected outcomes.  :)  that said, at a high level, the opalis action server will read out the upcoming advertisements and deployments from a sccm server, validate a few things (dates, times, exclusions, etc), retrieve the collections involved, retrieve the members of those collections, and then put them into maintenance mode.  if this is interesting to you, we'll move on to the finer details of what I'm talking about.  but to make it a little clearer, I'm going to add in some screenshots from the opalis policy that controls all this stuff.

 

how it works

okay, I promised screenshots so here we go.

image

so if you'll follow along, I circled where we're starting since it may not be entirely clear.

  1. Run Every 10 Minutes: this object simply schedules the policy to run every 10 minutes.  it's configurable, of course.  you may want to modify it based on your needs.
  2. Get Date from Start: we need to get the value from this object so that we can use it later for comparisons.  trust me.  it'll make sense.
  3. Retrieve Ads and Updates from SCCM: this is a powershell script that goes out and pulls all open advertisements (software) or deployments (patches).
  4. Write Discovered Advertisements: self-explanatory, I think.
  5. Get Date from Script: this is another date object that we pull from the script in step 3.  we want to hold on to this so we can compare the two dates.
  6. Compare Dates: now we take the date from step 2 and the date from step 5 and compare them.  if they match, that means the ad is scheduled to deploy on the same day.  we'll discard anything that doesn't match so that we're not scheduling for things on a future date.  I'm skipping the logging functions since they're self-explanatory as well.
  7. Check Duration: on this step, we're looking to see if there's a time skew variable (I'll get to that soon).  if not, we're just going to use the default of 30 minutes.

on to the second half of the policy...

image

  1. Log the New Duration: basically we're just writing the duration value to the log.
  2. Compare Time Skew: this step is kind of interesting so I'll outline it.*
    • get the time value from step 2 above and add the time skew figure to it.  if the time value is 2:15 PM and the time skew is the default of 30 minutes, the new value is 2:45 PM.
    • get the time value from the ads/deployments in step 3.
    • determine if the skewed time is greater than the ad/deployment time.  if the ad time is 2:30 PM, we check if the skewed time of 2:45 PM is greater.  if it is, then we assume it's safe to start.
    • at the same time, determine if the current time is less than or equal to the ad time.  after all, we don't need to have maintenance mode set for an advertisement every 10 minutes.  if the ad time is 2:30 PM and the current time is 2:15 PM, cool.  run it.  if the current time is 2:35 PM, then don't run it again.
    • if we can start, write that to a log and move on.  if not, write that to a log and skip that ad.
  3. Get Collection Members: this is where we go out and retrieve the members of each collection referenced in the eligible advertisements and deployments.
  4. Check Member Names: in this step, we're sending each of the collection members (server names) through a powershell script.  if a name does not match any value in the variable RMS*, we send the server name through to the Maintenance Mode objects.  the links labeled "safe" are where this evaluation occurs.
  5. Start Maintenance Mode for Windows Computer / Healthservice Watcher: these two steps are identical.  we send the server into both because a healthservice watcher object and a windows computer object exist for every server in scom.
  6. Junction: this is a waiting point so that both forks can finish placing machines in maintenance mode.
  7. Updated: and finally, here we write out the servers that were placed into maintenance mode.

* the reason I'm using time skew is so that opalis will catch advertisements before they're scheduled to start.  let's say that you have the policy set to run every 10 minutes.  you have an ad that's scheduled to start at 2:13 PM.  your policy started its execution at 2:10 PM, found nothing, and is waiting to start again at 2:20 PM.  with a time skew value, we're basically telling the opalis server that it should add 30 minutes to the execution time of the policy.  so in this case, the opalis server believes it is 2:40 PM instead of 2:10 PM.  since 2:40 PM is greater than 2:13 PM, it'll get started.  the time skew value is also used to indicate how long the server should be placed into maintenance mode.  a time skew of 30 minutes equals a maintenance mode duration of 30 minutes.
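
to make the date and time skew checks concrete, here's a minimal powershell sketch of the comparisons described above.  the function name Test-AdInWindow and its parameters are hypothetical (they are not objects in the opalis policy), and it assumes the policy time and the ad time are both plain datetime values.

```powershell
# hypothetical helper: not part of the opalis policy, just the same comparisons in one place
function Test-AdInWindow {
    param(
        [datetime]$Now,            # the time the policy started (step 2, "get date from start")
        [datetime]$AdTime,         # the scheduled time of the ad or deployment
        [int]$SkewMinutes = 30     # default time skew when none is supplied
    )
    # "compare dates": only ads scheduled for today are considered
    $sameDay = $Now.Date -eq $AdTime.Date
    # "compare time skew": pretend it is later than it really is
    $skewedTime = $Now.AddMinutes($SkewMinutes)
    # eligible when the skewed time has passed the ad time but the real time has not
    $sameDay -and ($skewedTime -gt $AdTime) -and ($Now -le $AdTime)
}

$adTime = [datetime]'2010-06-25 14:30'
Test-AdInWindow -Now ([datetime]'2010-06-25 14:15') -AdTime $adTime   # True
Test-AdInWindow -Now ([datetime]'2010-06-25 14:35') -AdTime $adTime   # False, the ad time has passed
Test-AdInWindow -Now ([datetime]'2010-06-25 13:45') -AdTime $adTime   # False, 2:15 PM is not past 2:30 PM
```

the 2:10 PM / 2:13 PM example from the footnote comes out True as well, since 2:40 PM is greater than 2:13 PM.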

* it's a bad idea for the rms server to be placed into maintenance mode.  if the rms server is inadvertently added to a collection for deployment, this step will not send the rms server to the maintenance mode objects.  if there are multiple values for the rms variable, it's looped through so even clustered rms environments are safe (assuming you list all the cluster members).
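
the rms check can be sketched like this.  it is not the script from the policy; the function name Test-SafeForMaintenance is hypothetical, and it assumes the rms variable holds quoted, comma-separated names and that collection members arrive in the same form (short name or fqdn) as the names in that variable.

```powershell
# hypothetical helper: returns $true when the server is NOT an rms and is safe to send on
function Test-SafeForMaintenance {
    param(
        [string]$ServerName,
        [string]$RmsVariable    # e.g. '"myrms1","myrms2"'
    )
    # split the variable on commas and strip the surrounding quotes and spaces
    $rmsNames = @($RmsVariable -split ',' |
        ForEach-Object { $_.Trim().Trim('"') } |
        Where-Object { $_ -ne '' })
    # -notcontains is case-insensitive, so MYRMS1 and myrms1 are treated alike
    $rmsNames -notcontains $ServerName
}

$rms = '"myrms1","myrms2"'
Test-SafeForMaintenance -ServerName 'websrv01' -RmsVariable $rms   # True
Test-SafeForMaintenance -ServerName 'MYRMS2'   -RmsVariable $rms   # False
```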

 

setting it all up

  • if your opalis environment is already set up, make sure to load up the integration packs listed in the requirements section above.  you'll want those in place before importing an ois_export file that calls objects that aren't there.  trust me.  I've been there.  It's easier this way. :)
  • when you import your ois_export file, remove all checkboxes except for the policies and the variables.  though I promise you, I removed these on export, it's probably a good practice to do this anyway just in case.  you'll probably get a dialog screen indicating that you're going to overwrite your "ops console" variables.  whether you choose to or not is up to you.  if you choose not to, it'll just create a harmless, empty folder that you can remove.

image

  • the next thing we should talk about are the variables.  when you import the ois_export file, it should import all the variables that you need and place them into a folder called maintenance mode as shown here:

image

    • advertisement prefix: this is used as a means of filtering the advertisements or deployments that you want to look at.  for example, if you start all of your ads or deployments with MM: the script will only pull back those ads.  this is important because most environments will have ads that target workstations as well.  there's no point in needlessly looping workstations through to opsmgr for maintenance.
    • package: the name used for the program or package doesn't really matter.  what's important is that the program advertised has the right flag set.  make sure that "disable operations manager alerts while this program runs" is set.

image

    • rms: in this variable, set the rms name to guard it from being added to maintenance mode.  if you have a clustered rms environment, list each cluster member individually, in quotes, separated by a comma.  ex: "myrms1","myrms2"
    • site code: this is the site code of your sccm environment.
    • site server: this variable holds the name of your sccm server.
    • time skew: this is where you can adjust the time skew value.  it will default to 30 if none is supplied.

 

  • once you have that squared away, there are some places you'll have to configure in the policy.
    • logging: anywhere that a log item exists, you may want to change the default logging location.  it's set to c:\temp\datetime.log.

image

    • get collection members: this step will require your credentials to the sccm server.  I went through the painstaking process of narrowing it down to just the necessary permissions for the action service account.  I've outlined this below, if you're interested.*

image

    • start maintenance mode: both of these objects will require you to define the connection to the opsmgr server.  this is described in the supporting documentation for the opsmgr integration pack if you need help.  additionally, "mydomain.com" will need to be replaced with your domain suffix on the monitor line.

image

image

 

Security Permissions for SCCM

I mentioned having to correct permissions for the action account above in the get collection members step.  this is what I had to do to make it work:

  1. dcom permissions adjustment (on the sccm server)
    • launch dcomcnfg.
    • navigate to component services \ computers \ my computer.  right-click my computer, choose properties.
    • under the com security tab, click edit limits in both sections.
    • grant the following rights to the ois action account:
      • remote access
      • remote launch
      • remote activation
    • navigate to the dcom config section under my computer, locate windows management instrumentation
    • right-click windows management instrumentation, choose properties.
    • under the security tab, click edit under the launch and activation permissions section.
    • grant the ois action account the following permissions:
      • remote launch
      • remote activation
  2. sccm permissions
    • in the configuration manager console, grant the ois action account the following permissions:
      • collections: read, read resource
      • advertisement: read
      • deployment: read
      • package: read
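
if you want a quick way to see whether the action account can actually read from the sms provider after these changes, something like the sketch below may help.  both function names are hypothetical, the server name and site code in the usage line are placeholders, and it only exercises a read of sms_collection, so a pass here doesn't prove every permission above is right.

```powershell
# hypothetical helper: builds the sms provider namespace from a site code
function Get-SccmNamespace {
    param([string]$SiteCode)
    "root\sms\site_$SiteCode"
}

# hypothetical helper: tries a read-only wmi query with the supplied credentials
function Test-SccmRead {
    param(
        [string]$SiteServer,
        [string]$SiteCode,
        [System.Management.Automation.PSCredential]$Credential
    )
    try {
        $colls = @(Get-WmiObject -ComputerName $SiteServer `
            -Namespace (Get-SccmNamespace $SiteCode) `
            -Query "select collectionid, name from sms_collection" `
            -Credential $Credential -ErrorAction Stop)
        "ok: read $($colls.Count) collections"
    }
    catch {
        # access denied here usually points back at the dcom or sccm rights
        "failed: $($_.Exception.Message)"
    }
}

# usage (placeholder server name and site code):
# Test-SccmRead -SiteServer 'mysccm' -SiteCode 'ABC' -Credential (Get-Credential)
```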

 

additional stuff to fix

  • I think I put enough checks in place to keep the ad or deployment from needlessly getting reevaluated throughout the day and having maintenance mode set over and over again.  however, if your time skew is large enough, say 1 hour, and your scheduling object is set to run every 10 minutes, you would be attempting to put the same machines into maintenance several times during the same hour until the current time passes the ad time.
  • the time skew challenge really needs to be adjusted.  I haven't figured out what the right formula is since the equipment I've been using to do all this is lab equipment.  it's not exactly the fastest stuff.  I also have the sql server running on the same server.  in a production scenario, this would all be separated.
  • I haven't done any timing tests to determine how fast machines actually go into maintenance mode.  the scheduling (every 10 minutes) and the time skew are critical here based on how fast opalis can drop machines into maintenance mode and how many machines are expected in any given collection.  obviously the schedules have to be far enough apart to allow for adequate processing.
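
to put a number on the first item, here's a rough sketch that counts how many policy runs would find the same ad eligible.  the function name is hypothetical, and it assumes the runs are aligned to midnight and use the same two comparisons as the compare time skew step.

```powershell
# hypothetical helper: counts the policy runs in one day that would act on a single ad
function Get-EligibleRunCount {
    param(
        [datetime]$AdTime,
        [int]$SkewMinutes,
        [int]$IntervalMinutes
    )
    $count = 0
    $run = $AdTime.Date                      # assumption: the first run of the day is at midnight
    $endOfDay = $AdTime.Date.AddDays(1)
    while ($run -lt $endOfDay) {
        # same test as the policy: skewed time is past the ad time, real time is not
        if (($run.AddMinutes($SkewMinutes) -gt $AdTime) -and ($run -le $AdTime)) {
            $count++
        }
        $run = $run.AddMinutes($IntervalMinutes)
    }
    $count
}

$ad = [datetime]'2010-06-25 14:30'
Get-EligibleRunCount -AdTime $ad -SkewMinutes 60 -IntervalMinutes 10   # 6
Get-EligibleRunCount -AdTime $ad -SkewMinutes 30 -IntervalMinutes 10   # 3
```

so under these assumptions, even the default 30 minute skew would hit the same collection three times with a 10 minute schedule.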

 

where to get stuff

I've posted the file to my skydrive:

 

 

updated script

it's a real pain to do exports so for now, I'm putting in the modified script I mentioned at the top of the post below:

$mySCCMServer = "\`d.T.~Vb/{2E9411B1-8303-40F4-AF6F-0D914047D89A}\`d.T.~Vb/"
$myNamespace = "root\sms\site_\`d.T.~Vb/{D30862CE-1972-40CF-AAC5-B17FAE687E3C}\`d.T.~Vb/"
$myAdvPrefix = "\`d.T.~Vb/{4644F218-B446-4519-9E7F-E806969BD13D}\`d.T.~Vb/"

$myAds = Get-WmiObject -ComputerName $mySCCMServer -Namespace $myNameSpace `
-Query "select * from sms_advertisement where advertisementname like '$myAdvPrefix%'"

$myPrgs = Get-WmiObject -ComputerName $mySCCMServer -Namespace $myNameSpace `
-Query "select * from sms_program"

$myPkgs = Get-WmiObject -ComputerName $mySCCMServer -Namespace $myNameSpace `
-Query "select * from sms_package"

$myColls = Get-WmiObject -ComputerName $mySCCMServer -Namespace $myNameSpace `
-Query "select * from sms_collection"

$myDeployments = Get-WmiObject -ComputerName $mySCCMServer -Namespace $myNameSpace `
-Query "select * from sms_updatesassignment where assignmentname like '$myAdvPrefix%' and disablemomalerts = 1"

$finalCollection = @()
$finalDate = @()
$finalDuration =@()

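# only walk the advertisements if the query returned something
if ( $myAds -ne $null ) {
    foreach ($ads in $myAds) {
        foreach ($prgs in $myPrgs) {
            # match the advertisement to the program it advertises
            if ($ads.packageid -eq $prgs.PackageID -and $ads.programname -eq $prgs.programname) {
                # bit 5 of programflags is "disable operations manager alerts while this program runs"
                if ($prgs.ProgramFlags -band [math]::pow(2,5)) {
                    # re-fetch the advertisement by path; assignedschedule is a lazy
                    # property and is not filled in by the query above
                    $AdInst = Set-WmiInstance -Path $ads.__PATH
                    if ( $($adinst.assignedschedule).starttime -ne $null ) {
                        # convert the wmi (dmtf) date string to a datetime
                        $AdDate = ([management.managementdatetimeconverter]::todatetime($($adinst.assignedschedule).starttime))
                        # look up the name of the targeted collection
                        foreach ($colls in $myColls) {
                            if ( $ads.collectionid -eq $colls.collectionid ) {
                                $finalCollection += $colls.name
                            }
                        }
                        $finalDate += $AdDate
                        $finalDuration += $prgs.Duration
                    }
                }
            }
        }
    }
}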




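# only walk the deployments if the query returned something
if ( $myDeployments -ne $null ) {
    foreach ($Deployment in $myDeployments) {
        # look up the name of the targeted collection
        foreach ($colls in $myColls) {
            if ( $Deployment.TargetCollectionid -eq $colls.collectionid ) {
                $finalCollection += $colls.name
            }
        }
        $finalDate += ([management.managementdatetimeconverter]::todatetime($Deployment.EnforcementDeadline))
        # deployments get a fixed duration of 60 minutes
        $finalDuration += 60
    }
}

# tell the policy there is something to process
if ( $finalDate -ne $null ) {
    $continue = "Y"
}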
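
if the programflags test in the script looks cryptic, here it is in isolation.  this is just a sketch: the function name is hypothetical and the sample flag values are made up, but 0x20 is the same number as [math]::pow(2,5).

```powershell
# hypothetical helper mirroring the programflags test in the script
function Test-DisableMomAlerts {
    param([uint32]$ProgramFlags)
    $disableMomAlerts = 0x20    # bit 5, the same value as [math]::pow(2,5)
    # -band keeps only the bits set in both numbers, so the result is non-zero when bit 5 is on
    ($ProgramFlags -band $disableMomAlerts) -ne 0
}

Test-DisableMomAlerts 0x20                # True
Test-DisableMomAlerts 0x400               # False
Test-DisableMomAlerts (0x400 -bor 0x20)   # True
```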

Jun 24, 2010

using process explorer to examine runaway processes

i was watching russinovich’s presentation called “the case of the unexplained 2010” when something he mentioned caught my attention: wmiprvse.  if you watch the segment from 21:00 to 32:00, you’ll get some good insight on troubleshooting with process explorer.  fun stuff…

image

you can watch the video here and get process explorer here.

Jun 15, 2010

how to get around dhcp in virtual pc

 

yes, virtual pc has dhcp built-in.  found this out in a training class where virtual pc was being used to host virtual machines in an osd (operating system deployment) scenario.  one of the best things you can do with imaging pcs is pxe boot.  this kind of falls apart when you have competing dhcp services.

with statically assigned ips, you’d never realize that anything was going on.  however, in an osd training scenario, you usually have a machine that will get an ip address, connect to pxe (wds), and retrieve a boot image.  we fully expected the pxe client to receive its ip address from the dhcp service that was loaded on the domain controller.  instead, we kept getting an apipa address (automatic private ip addressing, 169.254.0.0 through 169.254.255.255).  you would generally expect to see this on a windows 98 or later machine that can’t get an ip address and assigns one to itself.

because of that behavior, it seemed as if the pxe boot machine couldn’t get an ip address from the dhcp server.  further investigation revealed that the dhcp server integrated in virtual pc answers requests faster than the dhcp service on the dc.  there are two ways to correct this problem, if the scenario you’re using is like mine.

use a loopback adapter

  • on the host machine, create a loopback adapter (available under the manually added hardware with microsoft as the vendor).
  • on each virtual machine, use the loopback adapter as the network connection.
  • restart the nic on each virtual machine.

 

shut down the virtual pc dhcp service

  • shut down any open virtual machines.
  • kill the vpc.exe process if it’s still running.
  • open options.xml for editing (%localappdata%\microsoft\windows virtual pc).
  • change the value of the dhcp “enabled” element to false.
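
if you’d rather script the edit, a minimal sketch could look like this.  it works on an inline sample so it can be tried safely; the root element name in the sample and the xpath are my assumptions, so check them against your own options.xml before pointing it at the real file.

```powershell
# sample fragment (the <preferences> root is an assumption for illustration)
$sample = @'
<preferences>
  <virtual_network id="0">
    <virtual_server>
      <dhcp>
        <enabled type="boolean">true</enabled>
      </dhcp>
    </virtual_server>
  </virtual_network>
</preferences>
'@

$xml = [xml]$sample

# flip every dhcp "enabled" element under a virtual network to false
foreach ($node in $xml.SelectNodes('//virtual_network/virtual_server/dhcp/enabled')) {
    $node.InnerText = 'false'
}

$xml.OuterXml

# for the real file, with virtual pc shut down (and after taking a backup):
# $path = "$env:LOCALAPPDATA\microsoft\windows virtual pc\options.xml"
# $xml = [xml](Get-Content $path)
# ... same foreach as above ...
# $xml.Save($path)
```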

<virtual_network id="0">
    <id type="bytes">50C6B929053B4BF0B5ADAEEF5818A133</id>
    <gateway type="integer">0</gateway>
    <name type="string">Internal Network</name>
    <virtual_server>
        <dhcp>
            <enabled type="boolean">false</enabled>
            <ending_ip_address type="integer">2851998462</ending_ip_address>
            <network type="integer">2851995648</network>
            <network_mask type="integer">4294901760</network_mask>
            <starting_ip_address type="integer">2851995664</starting_ip_address>
        </dhcp>
    </virtual_server>
</virtual_network>
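
the integer values in the dhcp section are ip addresses stored as 32-bit numbers.  here’s a small sketch to decode them (the function name is hypothetical).  the values above come out inside 169.254.0.0 / 255.255.0.0, which would be consistent with the addresses the pxe clients were picking up.

```powershell
# hypothetical helper: turns a 32-bit integer into dotted-quad notation
function ConvertTo-DottedQuad {
    param([uint32]$Value)
    $bytes = [BitConverter]::GetBytes($Value)
    # GetBytes is little-endian on most machines; put the most significant byte first
    if ([BitConverter]::IsLittleEndian) { [Array]::Reverse($bytes) }
    $bytes -join '.'
}

ConvertTo-DottedQuad 2851995648   # network:             169.254.0.0
ConvertTo-DottedQuad 4294901760   # network_mask:        255.255.0.0
ConvertTo-DottedQuad 2851995664   # starting_ip_address: 169.254.0.16
ConvertTo-DottedQuad 2851998462   # ending_ip_address:   169.254.10.254
```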


 

credit: found the second solution here at http://www.simonsen.bz/blog/Lists/Posts/Post.aspx?ID=117.

Jun 10, 2010

opalis: operator console nuances to avoid

i’ve been saying on twitter that i was going to rebuild my opalis environment.  i did – and i did it just to capture some good notes.  the previous environment was kind of built with spit, glue, prayer, hope, and duct tape.  before i turn anything over for implementation, i like to make sure things are pretty sound.  anyway, i never could get the operator console to work before.  after i saw mark gosson’s demo again, i was determined to figure out why.  the answer was so stupid i wanted to smack myself.  oh well.  at least there are some good items that may help someone else.

 

use the OPCONSOLEINSTALLER command-line wizard (or script, if you prefer)

the first thing you should be aware of is that there’s an installer for the operator console.  to get this to work right, this is what you’ll have to do.

  1. create a folder for your console.  make it simple.  mine was c:\opalis\jboss.
  2. copy the contents of jboss-4.2.3.GA to this folder.  your directory contents should look like this:

    image

  3. copy the operatorconsole directory (\opalis integration server\operatorconsole) to a location like c:\temp\operatorconsole.
  4. now copy all of the downloaded content to c:\temp.

now you’re ready to start the script.  just point to the directories you created when the “command-line wizard” asks you for them.

 

if you’re using active directory authentication, use the right credentials

even though the configured contents of opalis-activedirectory-service.xml indicate the directory structure and paths to use, you still have to supply a domain name when authenticating.  the jboss cmd window (if you’re running it interactively instead of as a service) will throw this error if it encounters an authentication attempt without it:

10:10:05,381 ERROR [AccountServicesActiveDirectory] LDAP error
javax.naming.AuthenticationException: [LDAP: error code 49 - 80090308: LdapErr:
-0C090334, comment: AcceptSecurityContext error, data 525, vece ]

by the way, you get this same error if you use the wrong account as well.  so, make sure you add the domain!  ex: MYDOM\MYUSERID

 

if you’re using sql server, good grief, configure it right

in a production environment, i would have never run into this issue.  why?  because there are much smarter people running sql server here that know how to configure it.  :)  unfortunately, i missed the part in the guide about tcpip.  if you’re trying this out and can’t get your operator console to work either, enable tcpip in sql server configuration manager.

image

sadly, yes, it was that simple.  i didn’t realize it until i saw this error code in the jboss cmd window:

09:43:23,630 WARN  [JBossManagedConnectionPool] Throwable while attempting to get a new
connection: null org.jboss.resource.JBossResourceException: Could not create connection; -
nested throwable: (com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection
to the host has failed. java.net.ConnectException: Connection refused: connect)
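
a quick way to see whether anything is listening before digging through jboss logs is to try a plain tcp connection.  this is only a sketch: the function name is hypothetical, the server name in the usage line is a placeholder, and port 1433 assumes a default instance with the default port.

```powershell
# hypothetical helper: returns $true if a tcp connection to the host and port succeeds
function Test-TcpPort {
    param(
        [string]$HostName,
        [int]$Port = 1433,        # assumption: default sql server port
        [int]$TimeoutMs = 2000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($HostName, $Port, $null, $null)
        # give up if the connection attempt does not finish in time
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs, $false)) { return $false }
        # EndConnect throws when the connection was refused
        $client.EndConnect($async)
        return $true
    }
    catch {
        return $false
    }
    finally {
        $client.Close()
    }
}

# usage (placeholder server name):
# Test-TcpPort -HostName 'mysqlserver'
```

if this comes back False from the box running the operator console, the tcp/ip protocol setting (or a firewall) is a good place to look.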