O R G A N I C / F E R T I L I Z E R: 12.10

Dec 14, 2010

opalis: working around limitations with workflow objects and link operators

UPDATE: pete zerger was kind enough to point out that sometimes i don't make sense, and i missed a very obvious point in the documentation.  since the post itself is still useful, i didn't just scrap it.  :)  instead, i added an addendum.

if you've been working with opalis long enough, you might will find that there are moments when hacks are required to get you from one point to the next point.  i've been experimenting a lot with nested workflows.  it's like evolving from inline scripting to scripting with functions and/or subs.

i discovered that when using trigger policy to run a nested workflow, a bizarre thing happens.  even if the nested workflow executes with an error, the status returned by the calling trigger policy object is "success".  it didn't make sense at first until i realized that by all accounts, the trigger policy did execute successfully.

image image

well, there's a problem with this.  if it comes back as success, even though something failed, the policy will continue on down the path unless you tell it otherwise.  enough of that.  let's talk specifics about my scenario.

the workflow i created was designed to do one thing: usher alerts from opsmgr into tickets in remedy.  since remedy is divided into many different operating queues, i had to consider how to create tickets into the correct queues.  i decided to try it based on computer group membership.

in order to get the group membership, i had to query the opsmgr database.  i decided to push that into a nested workflow so that it could be reused in other workflows at some later point.  the information retrieved from the nested workflow would be the basis of information fed to a text file.  the master workflow could reference the text file to search through for cross checking names.

now what would happen if the database failed to query, and the associated text file never filled with any data?  if you're cross checking the alerts against an empty text file, chances are you will never have a match and as such no tickets generated.

but if opalis is returning a success on the nested workflow, how do you know the query is failing?  that seems simple.  if the published data returned from the nested workflow is empty, then obviously the query failed.  too bad the link operators don't have any filters for stuff like "is empty" or "is not empty".

image

all isn't lost though.  to get the effect that we want, we simply have to know what to look for.  going to the nested workflow, we can use the query database object status as our criteria to branch appropriately.  if successful, the publish policy data object writes the expected server list.  if it runs into a warning or failure, we publish static text to a different publish policy data object in the form of "FAILED".

image

back in the master workflow, we can now use the link operator to cull out anything that tries to come through with "FAILED".  if it matches the include filter, the policy processing stops.

image

 

addendum:

keep in mind that link operators do not have an "AND" operation.  instead the filters are evaluated as "OR" expressions.  however, the include/exclude tabs are separate so mixing and matching is a possibility, assuming you have the right content coming through. 

in the opalis client user guide, the trigger policy object section has a table that states this description for the child policy status: the status that was returned by the child policy.  it's important to clarify that the default behavior of a link operator coming from the trigger policy object is to set the filter to look for anything coming from trigger policy itself to "success".

image

if you're looking for the status coming from the child policy, you should change the link operator filter to look for something like this:

image

Dec 8, 2010

opalis: multiple policy instances running in the operator console

here is an interesting and rather complex problem i experienced with opalis.  honestly, this thing has probably been going on since i built the environment.  however, i have only begun working with nested workflows which really helped flush it out.

 

summary

essentially, when a master workflow launches a sub workflow, the policy spins in the background and never does anything.  another indication of the problem is that the summary of policies indicate a high number of running instances.  lastly, the problem appears to be environmental since only one opalis instance has the problem. 

here is a view of the operator console with this problem.

image

 

troubleshooting

the first thing i prefer to do in scenarios like these is to try to recreate the problem.  that wasn't difficult on the server having the problem.  the operator console shows a high number of instances, sub-workflows hang indefinitely, etc.  when i tried it in a different environment, replicating the same master/sub nested test, it worked fine.  i decided it must be my policy and exported the test version into my broken opalis instance, ran it, and got the same result -- it hung.

i decided to look down another path into the running instances count to see if that was part of the problem.  i began by restarting the opalis service, hoping instances that were trapped somewhere would free themselves and disappear.  not a chance!  next, i started looking through the process list and found policymodule.exe which appears to be the running process that a policy resides in.  i figured with that many policy instances running, surely, i was way over the concurrent policy instance maximum.  unfortunately, i only found two of these processes which matched up to the two running policies as seen in the screenshot above. 

since that looked fine, i went back to sifting the logs (yeah, i mistakenly dismissed it the first time around).i found this statement in the logs (full snip below)[3] which had me scratching my head:

2010-12-07 15:12:38 [1640] 1 Opalis Event: Frequent DB errors

that information wasn't very telling but eventually a little further down in the logs, i found this error:

<Param>The EXECUTE permission was denied on the object 'sp_UnpublishPolicyRequest', database 'Opalis', schema 'dbo'.</Param>

apparently, i had the wrong set of permissions.  according to the administrator guide, the action server requires "part of database users group on the datastore computer".  going by this guidance, the action server service account was granted db_datareader and db_datawriter. 

this made perfect sense.  all of my other environments were ones i set up simply to test.  with this broken one, it was set up as a dev environment utilizing best practices such as using databases from people who know how to run them (not me).  adding to that, using the minimal level of rights required is followed.  by the log output, you can see i missed one of those such rights, however.  i did not grant the "execute" permissions that the account required.

this seems to speak to the root of the problem but in my research found that it did not correct the running instances count that appears in the operator console.  looking through the database views, i found dbo.policies_view (which details a lot more than what you see in the operator console, by the way).  i opened this view and found "runningpolicyinstances" column which contained the exact counts i saw in the operator console:

image

the design of the dbo.policies_view indicated very clearly how this column was being constructed.

(SELECT COUNT(*) AS Expr1 FROM dbo.POLICYINSTANCES WHERE (dbo.POLICIES.UniqueID = PolicyID) AND (TimeEnded IS NULL))

so basically, if the timeended value in the dbo.policyinstances table is null, it counts them up and displays it here.  sure enough, the table contained a high number of rows with null values.  i searched through all of the stored procedures to try to locate an entry that indicated inserting values into policyinstances but in almost every case, data was drawn from the table and rarely written to.  to add to this problem, log purging doesn't delete entries where the timeended value is null.  i suspect this is because it believes the policies haven't finished executing yet.  thus, these counts would have never gone away on their end.

 

resolution

:: database ::

the problem with the database permissions can be fixed in one of two ways:

  • grant the account db_owner rights
  • grant the account execute rights on all stored procedures [1]

once either one of these is done, the error in the opalisactionservice logs stop generating.  i would be cautious with this.  first, granting db_owner is probably granting rights beyond what the account actually needs.  second, if you choose to just grant rights to the stored procedures, it may not be the entire set of rights required.  my testing has been limited so far, and thus i may find in doing so that i run into other issues [2].

 

:: operator console ::

correcting the summary count will require some changes to the policyinstances table.  obviously going in and messing around with tables is not going to be supported by microsoft so proceed at your own risk.  my environment is a testing environment so it makes sense for me.  you may want to call microsoft support.

to begin, i shutdown all running policies so that any timeended date stamps would write in as necessary (allowing me to avoid unnecessarily changing data that didn't need modification).  afterwards, i ran the following sql query to set the timeended value to the current timestamp.

update dbo.policyinstances 
set [timeended] = getdate()
where timeended is null

this simple query adds the current datetime to the timeended field where the value is null.  as shown in the screenshot below, the instance summary now displays the correct count.

image

 

addendum

[1] if you're interested in the sql query to provide execute permissions for the action server service account, here it is.  many thanks to patrick for whipping this up!

declare @str_objname sysname,
@str_execsql nvarchar(256)
 
declare cur_spname cursor for
select name from sys.procedures
 
--select schema_name(schema_id), name from sys.procedures
 
open cur_spname
 
fetch cur_spname into @str_objname
 
while @@fetch_status = 0
begin
set @str_execsql = 'grant execute on '+@str_objname+' to [myServiceAccount];'
print @str_execsql
exec sp_executesql @statement = @str_execsql
fetch cur_spname into @str_objname
 
end
 
deallocate cur_spname

here is what the query ended up executing:

grant execute on sp_insertevent to [myServiceAccount];
grant execute on sp_GetLogEntriesForDelete_FilterByEntries to [myServiceAccount];
grant execute on sp_GetLogEntriesForDelete_FilterByDays to [myServiceAccount];
grant execute on sp_GetLogEntriesForDelete_FilterByEntriesAndDays to [myServiceAccount];
grant execute on sp_CustomLogCleanup to [myServiceAccount];
grant execute on sp_PublishPolicy to [myServiceAccount];
grant execute on sp_UnpublishPolicyRequest to [myServiceAccount];
grant execute on sp_UnpublishPolicy to [myServiceAccount];
grant execute on sp_DeleteTreeData to [myServiceAccount];
grant execute on sp_FindTreeInsertionPoint to [myServiceAccount];
grant execute on sp_InsertTreeData to [myServiceAccount];
grant execute on sp_MoveTreeBranch to [myServiceAccount];
grant execute on sp_StopAllRequestsForPolicy to [myServiceAccount];
grant execute on sp_StopAllRequests to [myServiceAccount];

[2] if you want, you can follow this forum post. looking for the complete list of database rights required by the action server service account: http://social.technet.microsoft.com/Forums/en-US/opalisv5v6/thread/faf9c8a0-a1b2-4442-8391-1ff738826f28

[3] here's the full log snippet:

2010-12-07 15:12:38 [1640] 1 Opalis Event: Frequent DB errors

2010-12-07 15:12:38 [1640] 1 Exception caught in void __cdecl OpalisEventDeliveryStrategyComposite::sendAndTraceExceptions(const class OpalisEventDeliveryStrategy &,const class OpalisEvent &,const class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> > &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisEventDelivery\OpalisEventDeliveryStrategyComposite.cpp(83):
<Exception>
<Type>Opalis::Exception</Type>
<Location>
void __cdecl OpalisEventDeliveryStrategyComposite::sendAndTraceExceptions(const class OpalisEventDeliveryStrategy &,const class OpalisEvent &,const class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> > &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisEventDelivery\OpalisEventDeliveryStrategyComposite.cpp(80)
</Location>
<MsgCode>Cannot deliver Opalis Event</MsgCode>
<Params>
<Param>EW</Param>
</Params>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
void __cdecl StorageCallExecutor::throwChained(const class Opalis::Exception &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisActionService2\StorageCallExecutor.cpp(47)
</Location>
<MsgCode>SCE: ActionServerStorage call failed</MsgCode>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
long __stdcall CODBDataStore::ReportEventW(struct tagVARIANT)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\ODBDataStore.cpp(6002)
</Location>
<MsgCode>_com_error</MsgCode>
<Params>
<Param>Unknown error 0x800A0CC1</Param>
<Param>Item cannot be found in the collection corresponding to the requested name or ordinal.</Param>
<Param>-2146825023</Param>
</Params>
</Exception></Prev>
</Exception></Prev>
</Exception>

2010-12-07 15:12:38 [1640] 1 Exception caught in void __thiscall StorageCallExecutor::onStorageException(class Opalis::Exception &,bool)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisActionService2\StorageCallExecutor.cpp(120):
<Exception>
<Type>Opalis::Exception</Type>
<Location>
void __thiscall PublishedPoliciesManager::removePolicyRequest(const class _bstr_t &,const __int64 &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\PublishedPoliciesManager.cpp(850)
</Location>
<MsgCode>Cannot unpublish policy</MsgCode>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
void __thiscall PublishedPoliciesManager::removePolicyRequest(const class _bstr_t &,const __int64 &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\PublishedPoliciesManager.cpp(838)
</Location>
<MsgCode>No more attempts</MsgCode>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
void __thiscall DBAccessor::execute(const class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> > &) const
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\DBAccessor.cpp(37)
</Location>
<MsgCode>_com_error</MsgCode>
<Params>
<Param>IDispatch error #3081</Param>
<Param>The EXECUTE permission was denied on the object 'sp_UnpublishPolicyRequest', database 'Opalis', schema 'dbo'.</Param>
<Param>-2147217911</Param>
</Params>
</Exception></Prev>
</Exception></Prev>
</Exception>

Dec 7, 2010

powershell: using convertfrom-csv

a friend of mine pointed out this really cool cmdlet called convertfrom-csv today.  using it, you can immediately create a PSCustomObject.  pretty cool!  as a practical example, you can dump out repadmin and use the object to work with data any way you see fit.

[96] {D:\temp} > $foo = convertfrom-csv @(repadmin /showrepl * /csv)
[96] {D:\temp} > $foo | gm

TypeName: System.Management.Automation.PSCustomObject

Name MemberType Definition
---- ---------- ----------
Equals Method bool Equals(System.Object obj)
GetHashCode Method int GetHashCode()
GetType Method type GetType()
ToString Method string ToString()
Destination DC NoteProperty System.String Destination DC=myDC
Destination DC Site NoteProperty System.String Destination DC Site=mySite
Last Failure Status NoteProperty System.String Last Failure Status=0
Last Failure Time NoteProperty System.String Last Failure Time=0
Last Success Time NoteProperty System.String Last Success Time=2010-12-07 14:21:39
Naming Context NoteProperty System.String Naming Context=DC=myDomain,DC=com
Number of Failures NoteProperty System.String Number of Failures=0
showrepl_COLUMNS NoteProperty System.String showrepl_COLUMNS=showrepl_INFO
Source DC NoteProperty System.String Source DC=myDC2
Source DC Site NoteProperty System.String Source DC Site=mySite
Transport Type NoteProperty System.String Transport Type=RPC

Dec 6, 2010

opalis: trigger policy fails in the testing console

don't be alarmed if you find that while testing nested workflows, the testing console generates an error when it hits the "trigger policy" object.  this is "by design" as the testing console is only designed to test a single policy instance. 

it's referenced in this forum post: http://social.technet.microsoft.com/Forums/en-GB/opalisv5v6/thread/b9ac5928-f4b5-4160-ba60-00f683614426

image

Dec 2, 2010

opalis: correcting sql port changes for the operator console

let's say that a friend breaks the operator console in an opalis lab that you helped set up.  if you run into this problem, this is the way to unbreak it.  so to begin with, no matter what you attempt to do, your login fails to work.  in the console, you receive this message: The username or password you have entered is not correct.  Transaction failed.

image

don't be fooled by these messages since they provide very little value with what the actual problem is.  instead, go look at the server.log in the \jboss\server\default\log directory.  in it, you may find information a little more valuable like this message:

Caused by: org.jboss.util.NestedSQLException: Could not create connection; - nested throwable: (com.microsoft.sqlserver.jdbc.SQLServerException: The connection to the named instance  has failed. Error: java.net.SocketTimeoutException: Receive timed out.); - nested throwable: (org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (com.microsoft.sqlserver.jdbc.SQLServerException: The connection to the named instance  has failed. Error: java.net.SocketTimeoutException: Receive timed out.))

 
that message makes a little more sense to me.  I asked about any recent changes and was told that the port of the sql server changed.  to correct the problem, I modified the opalis-ds.xml (\jboss\server\default\deploy) file.  it was as simple as adding the port value to the instance name.  before editing it looked like this:
 
<datasources>
<local-tx-datasource>
<jndi-name>OpConsoleDS</jndi-name>
<connection-url>jdbc:sqlserver://HOST2\SRS;database=Opalis;IntegratedSecurity=true;</connection-url>
<driver-class>com.microsoft.sqlserver.jdbc.SQLServerDriver</driver-class>
<min-pool-size>5</min-pool-size>
<max-pool-size>25</max-pool-size>
<track-statements>nowarn</track-statements>
<connection-property name="autoReconnect">true</connection-property>
<idle-timeout-minutes>3</idle-timeout-minutes>
<background-validation>true</background-validation>
<background-validation-minutes>2</background-validation-minutes>
<check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
<prepared-statement-cache-size>30</prepared-statement-cache-size>
</local-tx-datasource>
</datasources>

and once corrected, here it is:

<datasources>
<local-tx-datasource>
<jndi-name>OpConsoleDS</jndi-name>
<connection-url>jdbc:sqlserver://HOST2\SRS:49175;database=Opalis;IntegratedSecurity=true;</connection-url>
<driver-class>com.microsoft.sqlserver.jdbc.SQLServerDriver</driver-class>
<min-pool-size>5</min-pool-size>
<max-pool-size>25</max-pool-size>
<track-statements>nowarn</track-statements>
<connection-property name="autoReconnect">true</connection-property>
<idle-timeout-minutes>3</idle-timeout-minutes>
<background-validation>true</background-validation>
<background-validation-minutes>2</background-validation-minutes>
<check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
<prepared-statement-cache-size>30</prepared-statement-cache-size>
</local-tx-datasource>
</datasources>
 

Dec 1, 2010

sccm: counting versions by site

if you want to track your site progress as they upgrade to a new client version, just use a query like this:

select
site.SMS_Assigned_Sites0 as 'Site',
COUNT(case when sys.client_version0 = '4.00.6221.1000' then 1 end) as 'SP1',
COUNT(case when sys.Client_Version0 = '4.00.6487.2000' then 1 end) as 'SP2'
from v_R_System sys
inner join v_RA_System_SMSAssignedSites site on sys.ResourceID = site.ResourceID
group by site.SMS_Assigned_Sites0

 

your results should look like the following:

Site SP1 SP2
ABC 130 10
XYZ 362 2000