
scom: overloading the consolidation module (and how to avoid it)

in a previous post titled "using repeat count to detect a problem in a window of time", I described how you can use consolidation settings to detect something happening in a window of time.  for example, event id 529 basically means "bad password".  if we alerted on every bad password, that'd be problematic.  however, if we looked at every one and alerted only when the count of bad passwords for a single user exceeded a threshold, that might be useful.
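to make the counting idea concrete, here's a minimal python sketch of "count per user inside a sliding window, alert at a threshold".  it is not how scom implements consolidation; the function name make_detector, the restart-after-firing behavior, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def make_detector(threshold, window_seconds):
    """Return a function that records one event and reports whether the
    per-key count inside the sliding window has reached the threshold.
    (hypothetical helper, for illustration only)"""
    seen = defaultdict(deque)  # key -> timestamps still inside the window

    def record(key, timestamp):
        stamps = seen[key]
        stamps.append(timestamp)
        # drop timestamps that have slid out of the window
        while stamps and timestamp - stamps[0] > window_seconds:
            stamps.popleft()
        if len(stamps) >= threshold:
            stamps.clear()  # assumption: restart the count after firing
            return True
        return False

    return record


# made-up events: (user, seconds since start)
record = make_detector(threshold=3, window_seconds=1800)
for user, ts in [("alice", 0), ("bob", 5), ("alice", 60), ("alice", 90)]:
    if record(user, ts):
        print(f"alert: {user} hit the bad password threshold")
```

notice that every key with recent activity holds its own list of timestamps.  the more distinct keys and the longer the window, the more state there is to keep around.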

apparently there's this concept called a "consolidation module".  this module has a limit of 128k.  if you go beyond this limit, you tend to overload the module and cause the event scraping to backlog.  on very active domain controllers, using a large sliding window, it's very easy to overrun this limit.  it results in odd errors like this:

(event id 11105)

The Microsoft Operations Manager Condolidator Module failed to save the state after processing and might loose data. 
Error: 0x80070057
One or more workflows were affected by this.
 
 
The Windows Event Log Provider monitoring the System Event Log is 317 minutes behind in processing events. This can occur when the provider is restarted after being offline for some time, or there are too many events to be handled by the workflow.

One or more workflows were affected by this.

before you try to correct me, I copied and pasted that.  I know how to spell "consolidator" and "lose data".  the title promised a way to avoid this.  there are two ways.  the most obvious is to reduce your sliding window time frame so that you're not collecting as many events in a given period of time.  the second is to simply set the storestate value to false.  the first one you should be able to work out quite easily.  setting storestate to false tells the agent not to store the internal state.  the trade-off here, albeit a small one, is that the state does not survive healthservice restarts.

as for the second method, it's not available in the console (surprise, surprise!) and must be done by editing the xml (surprise, surprise!).  so, you'll need to export your rule, modify your xml, and import it again.  once you have your xml, locate an area of the xml that should look suspiciously like this:

<Consolidator>
  <ConsolidationProperties>
    <PropertyXPathQuery>EventDisplayNumber</PropertyXPathQuery>
    <PropertyXPathQuery>PublisherName</PropertyXPathQuery>
    <PropertyXPathQuery>LoggingComputer</PropertyXPathQuery>
    <PropertyXPathQuery>Params/Param[1]</PropertyXPathQuery>
    <PropertyXPathQuery>Params/Param[2]</PropertyXPathQuery>
  </ConsolidationProperties>
  <TimeControl>
    <WithinTimeSchedule>
      <Interval>1800</Interval>
    </WithinTimeSchedule>
  </TimeControl>
  <CountingCondition>
    <Count>20</Count>
    <CountMode>OnNewItemTestOutputRestart_OnTimerSlideByOne</CountMode>
  </CountingCondition>
</Consolidator>
 
 
okay, now modify it by adding the one StoreState entry, as shown below.
 
<Consolidator>
  <ConsolidationProperties>
    <PropertyXPathQuery>EventDisplayNumber</PropertyXPathQuery>
    <PropertyXPathQuery>PublisherName</PropertyXPathQuery>
    <PropertyXPathQuery>LoggingComputer</PropertyXPathQuery>
    <PropertyXPathQuery>Params/Param[1]</PropertyXPathQuery>
    <PropertyXPathQuery>Params/Param[2]</PropertyXPathQuery>
  </ConsolidationProperties>
  <StoreState>false</StoreState>
  <TimeControl>
    <WithinTimeSchedule>
      <Interval>1800</Interval>
    </WithinTimeSchedule>
  </TimeControl>
  <CountingCondition>
    <Count>20</Count>
    <CountMode>OnNewItemTestOutputRestart_OnTimerSlideByOne</CountMode>
  </CountingCondition>
</Consolidator>

after making the modification, import it back in and the problems should go away.
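if you have a lot of rules to touch, hand-editing gets old.  here's a rough python sketch that inserts the StoreState entry into every Consolidator block of an exported xml string.  the function name add_store_state is made up, and the sketch assumes the Consolidator elements carry no xml namespace and that each one has a ConsolidationProperties child (blocks without one are skipped).  note that python's ElementTree drops comments and the xml declaration when it re-serializes, so diff the result against your export before importing it.

```python
import xml.etree.ElementTree as ET


def add_store_state(xml_text):
    """Insert <StoreState>false</StoreState> right after
    <ConsolidationProperties> in every <Consolidator> that lacks one.
    (hypothetical helper, for illustration only)"""
    root = ET.fromstring(xml_text)
    for cons in root.iter("Consolidator"):
        if cons.find("StoreState") is not None:
            continue  # already set, leave it alone
        props = cons.find("ConsolidationProperties")
        if props is None:
            continue  # unexpected layout, skip rather than guess
        entry = ET.Element("StoreState")
        entry.text = "false"
        # place the new element directly after ConsolidationProperties
        cons.insert(list(cons).index(props) + 1, entry)
    return ET.tostring(root, encoding="unicode")
```

read your exported file into a string, pass it through the function, and write the result to a new file so the original export stays untouched.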
