Jul 13 2022
Dealing with Failure: Failure Escalation Policy in CLR Hosts
Offensive tooling built upon the .NET framework and its runtime environment, the Common Language Runtime (CLR), is an important part of the red teaming ecosystem. .NET tools offer rapid development times, a low barrier to entry, and are highly extensible through native interoperability. These tools have been and will continue to be used effectively on offensive engagements. Perhaps the cornerstone of continued interest in .NET offensive tooling is the ability to execute .NET assemblies from an unmanaged host using the CLR hosting interfaces, commonly referred to as running an assembly inline. Unfortunately, running arbitrary assemblies inline can be quite dangerous. Let’s see if we can address this problem.
Safer Inline Assembly Execution
By definition, inline assemblies share the same process as their unmanaged host: they run from within the host. For offensive use, sharing the same process is the principal motivating factor for running assemblies inline. However, there are some known drawbacks to this approach. Namely, exceptions originating from within the inline assembly will cause the process to exit unceremoniously if the CLR deems the exception unrecoverable, a decidedly less desirable consequence of sharing a process. Even isolating inline assemblies in individual AppDomains does not contain catastrophic failures.
When a managed exception is thrown and either remains unhandled or cannot be handled from user code, as is the case with particularly nasty exceptions (e.g. StackOverflowException), the default policy for the CLR host is to exit the process. This scenario is not ideal for offensive tooling, especially for an operationally sensitive CLR host such as a long-running agent. Fortunately, there are (documented!) ways to configure and extend the failure escalation policy that the CLR uses to determine which actions to take in the event of failures and timeouts.
Before continuing, most of this post was facilitated by the definitive (by virtue of being the sole printed reference available) resource on this subject, Customizing the Microsoft .NET Framework Common Language Runtime, written by Steven Pratschner.
Failure Escalation Policy
As the CLR matured from its first iterations into the 2.0 release, there was a need to expose functionality for scenarios which require long process lifetimes such as servers and operating system processes. Thus, starting with version 2.0 of the CLR there exists infrastructure that unmanaged hosts can use to remove and therefore isolate exceptional code without affecting the availability of the process itself. First, let’s examine the different types of failures that are exposed to the host, the actions the host may choose to take in response to a failure, and the escalation policy, which directs the execution flow of the aforementioned operations. Then we will implement a custom policy capable of continuing unmanaged code execution after catastrophic managed exceptions.
The EClrFailure enumeration describes which types of failures are available to be customized through an escalation policy. I believe the commented fields were introduced with the release of the CoreCLR versions.
https://medium.com/media/6b9600949e607c471669230635eb9559/href
- Failure to allocate a resource: A resource typically refers to a thread, block of memory, synchronization primitive, or some other resource managed by the operating system.
- Failure to allocate a resource in a critical region: A critical region is defined as any code that might be dependent on a shared state between threads. This is distinguished from the previous failure because a resource that relies on states from other threads cannot be safely cleaned up by terminating only the exceptional thread. The CLR treats any region of code that depends on a synchronization primitive as a critical region.
- Orphaned lock: An orphaned lock is an abandoned synchronization primitive that is likely to leave the execution context in an inconsistent state. This can occur when resource allocation fails in code regions that are awaiting a synchronization primitive. Additionally, a thread may be aborted before the synchronization primitive is freed. In both scenarios, the primitive is lost and cannot be freed. This failure is also a resource leak and can eventually lead to resource exhaustion.
- Fatal runtime error: If the CLR encounters a fatal internal error and is no longer able to run managed code, the default behavior is to terminate the process, with varying levels of regard for cleanup. It is possible to override this behavior and continue execution of native code.
The EPolicyAction enumeration describes the actions the host may take when presented with the different types of failures. The CLR provides two flavors of actions: a graceful action and a rude action.
https://medium.com/media/45aa99129b82a142fc75d8c790fdced2/href
- Graceful Action: Graceful actions attempt to properly free resources by running exception-handling routines and finalizers, freeing associated CLR data structures, and in the case of process exit, finishing processing necessary for a proper shutdown.
- Rude Action: Rude actions make no such attempts. The CLR does not guarantee any finalizers are run, with the exception of critical finalizers.
Additionally, the unmanaged host may set timeouts for the specified policy actions to complete. This is especially useful when dealing with unresponsive code, such as an abandoned synchronization primitive or infinite loop. Timeouts are configured by specifying an operation upon which a policy action is taken after a determined interval. The set of operations exposed to the host is documented in the EClrOperation enumeration.
https://medium.com/media/ec5e4493b7ead1cda119898a278ae73a/href
Together, the failure types, policy actions, and operations make up the failure escalation policy of a host. It can be put no more succinctly than by Pratschner, “The escalation policy is the host’s expression of how failures in a process should be handled.” Still, it’s easiest to visualize how a custom escalation policy might look when an exception occurs.
The above escalation policy will be expressed in the process as follows:
- The CLR first determines if the exceptional code depends on a synchronization primitive, to establish if the exception originates from a non-critical or critical code region.
- If the exception occurs from within a non-critical region, the CLR will attempt to gracefully abort the thread. If the graceful thread abortion times out then the action is escalated to rudely abort the thread. Additionally, if the graceful thread abortion occurs in a critical region, it is escalated to gracefully unload the AppDomain.
- If the exception occurs from a critical region, the policy will escalate to gracefully unload the AppDomain.
- If the graceful unloading of the offending AppDomain times out, the action will be escalated to rudely unload the AppDomain.
- If rudely unloading the AppDomain times out, the runtime is disabled. No more managed code may be run.
- If at any point the CLR encounters a fatal runtime failure, the policy replaces the default action of shutting down the process with disabling the runtime.
Note: Rude thread aborts are not able to be escalated to anything more useful than disabling the runtime. We will come back to this point in a bit.
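The walk-through above can be modeled as a small state machine. The following is a minimal, portable sketch of that chain only; the `Action` and `Trigger` names, and the helper functions, are hypothetical stand-ins invented for illustration and are not the enumerations or calls exposed by the CLR hosting headers.

```cpp
// Illustrative model only. These names are hypothetical stand-ins, not the
// EPolicyAction values declared in mscoree.h.
enum class Action {
    AbortThread,          // graceful thread abort
    RudeAbortThread,
    UnloadAppDomain,      // graceful AppDomain unload
    RudeUnloadAppDomain,
    DisableRuntime        // no more managed code may run
};

// What went wrong while the current action was being carried out.
enum class Trigger {
    Timeout,         // the action did not complete in the configured interval
    CriticalRegion   // the action was attempted inside a critical region
};

// First response to a failed resource allocation.
inline Action initialAction(bool inCriticalRegion) {
    return inCriticalRegion ? Action::UnloadAppDomain : Action::AbortThread;
}

// Response to a fatal runtime error under this policy.
inline Action onFatalRuntimeError() {
    return Action::DisableRuntime;
}

// Next action in the chain, or the same action when the policy described
// in the text names no further step.
inline Action escalate(Action current, Trigger trigger) {
    switch (current) {
    case Action::AbortThread:
        return trigger == Trigger::Timeout ? Action::RudeAbortThread
                                           : Action::UnloadAppDomain;
    case Action::UnloadAppDomain:
        return trigger == Trigger::Timeout ? Action::RudeUnloadAppDomain
                                           : current;
    case Action::RudeUnloadAppDomain:
        return trigger == Trigger::Timeout ? Action::DisableRuntime : current;
    case Action::RudeAbortThread:   // no further step listed for this action
    case Action::DisableRuntime:    // end of the chain
        return current;
    }
    return current;
}
```

For example, calling `escalate` twice with `Trigger::Timeout`, starting from `Action::UnloadAppDomain`, ends at `Action::DisableRuntime`, which matches the last two timeout steps in the list.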
Policy Configuration
Now that we know what a failure escalation policy is, at least in the context of CLR hosts, we can dive into customizing a runtime host with the policy pictured in the diagram above. Hopefully, this will remedy some stability issues facing long running native agents.
CLR hosts implement escalation policies using the failure policy manager exposed by the CLR hosting interfaces. The failure policy manager is composed of two interfaces: ICLRPolicyManager and IHostPolicyManager. The ICLRPolicyManager interface is implemented by the CLR and subsequently exposed to the user, whereas the IHostPolicyManager interface is implemented by the host. This design pattern is common throughout the CLR hosting interfaces.
The first step in implementing a custom failure escalation policy is to obtain a pointer to the CLR’s ICLRPolicyManager implementation. The following code shows how to do this.
Note: All code will assume a pointer to the ICLRRuntimeHost interface has already been obtained from calling either CorBindToRuntime/CorBindToRuntimeEx or CLRCreateInstance.
https://medium.com/media/230c4583a3f5285a95d7ff176a689e18/href
After obtaining the ICLRPolicyManager implementation, the host sets actions to take on failures.
https://medium.com/media/0d015664018ab68ff5e632ab5d46390c/href
Most actions can be taken on failures. There are a few exceptions. A failure associated with an orphaned lock must at least unload the AppDomain gracefully. Similarly, a stack overflow failure commands at least a rude unloading of the AppDomain from which it occurred. When a fatal runtime failure occurs, then the only suitable actions are to exit the process or disable the runtime. There are some additional cases not covered; they can be found in the remarks section of the ICLRPolicyManager::SetActionOnFailure MSDN page.
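A host author may want to sanity-check a planned policy against these constraints before handing it to the CLR. The sketch below encodes only the three restrictions named in the paragraph above. The `Failure` and `Action` names are hypothetical, the severity ordering of `Action` is an assumption made so that "at least" can be expressed as a comparison, and the additional cases from the MSDN remarks are deliberately not modeled.

```cpp
// Hypothetical stand-ins for the failure and action enumerations.
// Assumption: actions are declared from least to most severe so that
// "at least X" can be written as a >= comparison.
enum class Action {
    AbortThread,
    RudeAbortThread,
    UnloadAppDomain,
    RudeUnloadAppDomain,
    ExitProcess,
    DisableRuntime
};

enum class Failure {
    NonCriticalResource,
    CriticalResource,
    OrphanedLock,
    StackOverflow,
    FatalRuntime
};

// Returns true when the action satisfies the restrictions stated in the text.
inline bool isAllowed(Failure failure, Action action) {
    switch (failure) {
    case Failure::OrphanedLock:
        // Must at least unload the AppDomain gracefully.
        return action >= Action::UnloadAppDomain;
    case Failure::StackOverflow:
        // Must at least rudely unload the AppDomain.
        return action >= Action::RudeUnloadAppDomain;
    case Failure::FatalRuntime:
        // Only exiting the process or disabling the runtime is suitable.
        return action == Action::ExitProcess ||
               action == Action::DisableRuntime;
    default:
        // Other failures: no restriction is modeled in this sketch.
        return true;
    }
}
```

With this model, `isAllowed(Failure::StackOverflow, Action::UnloadAppDomain)` is false, because a graceful unload is weaker than the rude unload that a stack overflow requires.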
Then, the host sets the timeout period associated with an operation and the subsequent action to take upon exceeding the defined timeout.
https://medium.com/media/2fe2d6da666ce82d7f89a559d106e800/href
There are some nuances to be aware of. First, the CLR’s default implementation specifies no timeouts (read: infinite) for any operation other than OPR_ProcessExit. For OPR_ProcessExit, if the process does not cleanly exit within 40 seconds, the action is escalated to rudely exit the process. Second, the MSDN documentation’s description of the subset of EClrOperation values for which a timeout and action can be specified is inaccurate. Consulting the SSCLI2 source code, we can see the EEPolicy::SetTimeoutAndAction method validates the operation and action by calling the EEPolicy::IsValidActionForTimeout method, shown below. As long as this switch statement is satisfied, the combination of operation and action is valid.
https://medium.com/media/1444267a7aa7dbcde71a42d0282cd463/href
The host now sets default actions to take in response to a given operation. In our case, we only want to override the default action of one operation — OPR_ProcessExit. This way, the host will stop the CLR from shutting down the process.
https://medium.com/media/d4c55eae582c1c9d88532a6b0a5cf361/href
Default actions must only be used to escalate the action taken on failure; one cannot downgrade the action. One may consult the SSCLI2 implementation for the complete list of valid failure and action combinations.
Finally, the host must specify that the unhandled exception policy is defined by the host rather than the CLR.
https://medium.com/media/36d7a6b67cac0ffd6fa303e8bb03699f/href
The failure escalation policy is now configured and takes effect once the CLR is started.
Receiving Notifications and Host-Implemented Managers
The CLR host may also choose to implement the IHostPolicyManager interface. This allows the host to receive basic notifications resulting from either the default escalation policy or a custom one. Before implementing this interface, let’s take a step back and understand how the CLR host discovers our host-implemented manager.
As previously shown, the CLR host accesses specific CLR-implemented managers by calling ICLRControl::GetCLRManager. The ICLRPolicyManager interface, used above to configure the failure escalation policy, is one of a handful of CLR-implemented classes that may be accessed by the host. Inversely, the CLR hosting interfaces also expose a way for the CLR to discover host-implemented managers. This is accomplished by supplying a host-implemented instance of a class derived from the IHostControl interface to the ICLRRuntimeHost::SetHostControl method. This must be done before the CLR is started. Just like there are a number of CLR-implemented managers, there are many host-implemented managers which can further customize the functionality of the host. One such manager that may be of interest for offensive usage would be the IHostMemoryManager. It can be used to configure a custom memory manager for the CLR.
To better illustrate how the CLR discovers host-implemented managers, let’s take a look at code which will:
- Implement the IHostPolicyManager interface; this class will receive notifications related to the failure policy
- Create a class which derives from the IHostControl interface
- Register the host-implemented class with the CLR
First, the CLR host creates a class derived from the IHostPolicyManager interface.
https://medium.com/media/814e6ab714385fab0eba310934d9fe0d/href
https://medium.com/media/c97ae329bd120bd598a6f925603b2847/href
This host-implemented manager receives notifications of the following events: OnDefaultAction, OnFailure, and OnTimeout. The information exposed to the host is not particularly verbose, but nonetheless may be useful for monitoring events related to the failure policy.
For the CLR to be notified of the existence of any host-implemented manager, the host must additionally implement the IHostControl interface. The CLR discovers host-implemented managers by calling the IHostControl::GetHostManager method which associates instances of host-implemented managers with their IID.
https://medium.com/media/bf05f3ed2fea7e5fbacb84462b0cd140/href
https://medium.com/media/16b75410349599426ebcf440b5deb472/href
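The lookup pattern behind IHostControl::GetHostManager can also be shown in isolation. The following is a portable sketch of the idea only: `Iid`, `Result`, `PolicyManager`, `HostControl`, and the two identifier constants are hypothetical stand-ins so the example compiles without the Windows headers. A real host would derive from IHostControl and compare against the IID constants that ship with the hosting interfaces.

```cpp
#include <cstring>

// Stand-in for a 128-bit interface identifier (a GUID on Windows).
struct Iid {
    unsigned char bytes[16];
};

inline bool sameIid(const Iid& a, const Iid& b) {
    return std::memcmp(a.bytes, b.bytes, sizeof(a.bytes)) == 0;
}

// Placeholder identifiers, NOT the real interface IDs.
const Iid kPolicyManagerId = {{0x01}};
const Iid kMemoryManagerId = {{0x02}};

// Stand-in for the success / "interface not supported" result codes.
enum class Result { Ok, NoInterface, InvalidPointer };

// Stand-in for a class that would implement IHostPolicyManager.
struct PolicyManager {};

class HostControl {
public:
    explicit HostControl(PolicyManager* policyManager)
        : policyManager_(policyManager) {}

    // Mirrors the shape of GetHostManager: hand back the manager registered
    // for a known identifier and report "no interface" for everything else.
    Result getHostManager(const Iid& id, void** out) const {
        if (out == nullptr) {
            return Result::InvalidPointer;
        }
        if (sameIid(id, kPolicyManagerId)) {
            *out = policyManager_;
            return Result::Ok;
        }
        *out = nullptr;  // this host does not implement the requested manager
        return Result::NoInterface;
    }

private:
    PolicyManager* policyManager_;
};
```

In this model the host answers only for the policy manager. A request for any other manager, such as the placeholder `kMemoryManagerId`, yields `Result::NoInterface` and a null pointer, which leaves the default behavior for that manager in place.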
Finally, the host must call ICLRRuntimeHost::SetHostControl before starting the CLR.
https://medium.com/media/0d890b7de96e255e6b5fa55cb24ef865/href
Parting Thoughts
Implementing a failure escalation policy in unmanaged CLR hosts is a powerful tool for handling the inline execution of arbitrary assemblies. Such customization may help alleviate some operational difficulties associated with process termination due to unhandled exceptions. Additionally, there are a number of other interesting host-implemented managers which have possible offensive uses.
One may also question the usefulness of a process left unterminated yet unable to continue managed code execution. There may be a workaround for this: manually unloading and reloading the CLR. Doing so would be extremely dangerous, unreliable, and undocumented, and would undoubtedly compromise the integrity of the process the host worked so hard to preserve. Nonetheless, it sounds like a worthwhile exercise and area of continued research.
Finally, if you found this post interesting and would like to learn more about customizing the CLR, please check out Customizing the Microsoft .NET Framework Common Language Runtime, by Steven Pratschner.
Dealing with Failure: Failure Escalation Policy in CLR Hosts was originally published in Posts By SpecterOps Team Members on Medium, where people are continuing the conversation by highlighting and responding to this story.