Writing controllers for CKRM
----------------------------

17 Apr 2006

Shailabh Nagar (nagar@watson.ibm.com)
Chandra Seetharaman (chandra.seetharaman@us.ibm.com)
Vivek Kashyap (kashyapv@us.ibm.com)
Gerrit Huizenga (gerrit@us.ibm.com)


1.0 Introduction

Class-based Kernel Resource Management (CKRM) is a framework for
generalizing the Linux kernel's management of resources such as cpu
time, physical pages and disk I/O bandwidth. CKRM introduces a grouping
of tasks called a class, which is visible to the kernel. CKRM manages
resources for each class, whereas the kernel typically operates at the
granularity of a task.

CKRM is implemented as a set of components, some of which are optional.

Required components

1. Core : defines classes, resource controller hookup logic

2. RCFS (Resource Control File System): a filesystem interface that is
   built on top of configfs for defining classes, setting their
   attributes and getting their resource usage.

Optional components

1. Classification Engine (CE): User space component that assists in
   automatic classification of tasks.

2. Resource controllers: These are the components that regulate the
   resources consumed by a class. They are typically written either as
   incremental modifications of existing schedulers in the kernel (e.g.
   the cpu controller and memory controller) or as an alternate
   scheduler (e.g. the I/O controller). There is a well-defined
   interface between the resource controllers and CKRM Core.

The controllers currently provided by the CKRM open-source project are:

   CPU:       regulates CPU time.
   Memory:    regulates number of physical memory pages resident.
   Disk I/O:  regulates I/O bandwidth per block device.
   Numtasks:  regulates the rate and number of children that can be
              forked.

However, it is hoped that the framework will encourage the development
of new controllers for other resources or even alternate implementations
for the ones listed above.
This document furthers that goal by explaining how one can write a
controller for CKRM to manage any resource of interest. We will use the
numtasks controller as an example since it is simple and easy to
understand.

ckrm_numtasks is a simple yet meaningful example of regulation. The
controller tries to regulate two aspects
   - number of tasks in a class
   - rate at which tasks get forked in a class

Its control is exercised by inserting a callback into the fork system
call. The callback checks if the class of the task doing the fork is
above its "share", either in the number of tasks or rate of forking. If
the class is above its share, the fork fails.

Thus, the regulation done by the numtasks controller is quite simple and
the majority of the controller's code consists of defining the
interfaces to CKRM Core.


2.0 Overview

The lifecycle of a controller is as follows:

1. Initialization:
   The controller is initialized when the module is loaded or during
   system boot. CKRM Core is always initialized before any controller.
   During initialization, a controller registers itself with CKRM Core
   using ckrm_register_controller().

2. Allocate resource class object for root and any other classes created
   earlier than controller initialization:
   The root core class exists as soon as CKRM Core is initialized
   (regardless of rcfs being mounted or even loaded). Hence, one of the
   first things a controller needs to do is allocate a resource class
   object for the root core class. The controller exports only one
   function for creating a resource class object, and CKRM Core calls it
   to create the root resource class object. If a user has explicitly
   created more core classes before the controller is loaded, Core calls
   the same function at registration time to create resource class
   objects for those classes as well.

   Note: Do not hold any lock in the init function that is used in the
   alloc function.

3. Respond to user commands:
   A user can create and delete classes, change their share settings,
   query statistics and set configuration parameters at any time after
   controller initialization. The controller responds to each of these
   requests through the functions it defines for each of these actions.

4. Teardown:
   When a controller's module is unloaded or the system is being shut
   down, the controller initiates teardown consisting mainly of
   deregistration of the controller using ckrm_unregister_controller().

These are the steps that can be followed to write your controller:

a) Define a resource class object
b) Define a resource controller descriptor (struct ckrm_controller)
c) Provide the functions needed by the controller-core interface
   - allocation/free of a class
   - set/get share values
   - get statistics
   - change of class membership
d) Provide initialization/teardown code for the controller involving
   registration/deregistration of the controller using the descriptor
   created
e) Optionally provide any controller specific configuration interface.
f) Modify kernel code to enforce the resource class's shares


2.1 Define a resource class object

The CKRM core defines a generic object (struct ckrm_class) to represent
a class. This object
   - is named
   - is linked into a tree/hierarchy of classes
   - corresponds to the user's notion of a class
   - is on the system-wide list of classes

Each ckrm_class contains pointers to controller-specific "resource
class" objects, one per controller.
These resource class objects
   - have a pointer to the class to which they belong
   - are not linked together directly (to access another resource class
     in the hierarchy, you need to go through the utility function,
     for_each_child)
   - are not visible to the user
   - are not named
   - contain all the information needed by the controller to do
     regulation and monitor statistics

At the very least a resource class object needs to define

   struct ckrm_class *core;
   struct ckrm_shares shares;

though it is often useful to have a pointer to the parent core class (if
any)

   struct ckrm_core_class *parent;

to save on indirecting through ckrm_core_class each time you need to
traverse the hierarchy. Caching the parent pointer is safe since a
class's parent cannot change or be removed during the class's lifetime.

In the numtasks controller's case, the resource class object is
struct ckrm_numtasks.


2.2 Define a resource controller descriptor

Each resource controller has a descriptor object (struct
ckrm_controller). The fields needed are

a) name : String with name of controller. Seen by the user.
   The same name is used when shares or configuration values are set for
   this controller (for any core class).

b) depth_supported : Depth of hierarchy supported by this controller.
   Through this parameter a controller can limit the depth of hierarchy
   it supports. The resource controller will _not_ be part of any class
   that is deeper than this value.

c) ctlr_id : Internal id assigned by CKRM Core.
   The controller should initialize this to CKRM_NO_RES_ID. It gets
   filled in by CKRM Core at registration. This id represents the index
   of this controller in the array of resource class objects for each
   class.
d) function pointers : Functions forming the interface between the core
   and the controller

      alloc_shares_struct
      free_shares_struct
      shares_changed
      show_stats
      reset_stats
      move_task

   These are called by CKRM Core in response to corresponding events
   typically initiated by user-space actions like creating/deleting a
   class, setting shares, reading the stats files, or moving a task to a
   class.

Providing these functions is the major part of writing a controller.


2.3 Provide the functions needed by the controller-core interface

2.3.1 alloc_shares_struct

Usually called when a class is created by the user. Also called for each
existing class when the controller registers.

Typically needs to allocate a resource class object for the class,
initialize it, and return a pointer to the shares data structure which
is part of the resource class object.

This is a mandatory function.

2.3.2 free_shares_struct

Usually called when a class is deleted by the user. Also called for each
existing class when the controller unregisters.

Core makes sure that there are no tasks in the class and clears the
share values of the resource object before it calls this function.

Needs to deallocate the resource class object and perform any
class-specific cleanup.

This is a mandatory function.

2.3.3 shares_changed

Called when a user changes the controller's share values. The controller
is called only after the core verifies and sets the values provided by
the user.

Note that even though controller classes aren't visible to the user
directly, she must specify the controller when a share is being set
(since shares only have meaning for a controller).

The shares set by the user through RCFS are relative to the parent, with
root being defined to have 100% of the system's resources. To calculate
the effective share of a class, one needs to start at root and traverse
the path to the class, calculating the absolute shares along the way.
e.g.
If the hierarchy looks like

   root (N.A,100) -> A (25,50)  -> A1 (10,30)
                  |
                  |-> B (10,200) -> B1 (100,300)

where the numbers in parentheses are the values of min_shares and
child_shares_divisor for a class and the total number of tasks allowed
in the system is 131072 (this corresponds to the 100% allotted to root),
then the absolute shares (rounded down to whole tasks) are as follows

   A  = 25/100  * 131072 = 32768
   B  = 10/100  * 131072 = 13107
   A1 = 10/50   * 32768  = 6553
   B1 = 100/200 * 13107  = 6553

While regulating, if the number of tasks in A1 or B1 reaches 6553, all
future forks by tasks in those classes fail.

Most controllers will want to maintain these absolute shares as part of
the resource class object so that the above calculations don't have to
be repeated for each regulation decision. Hence, the shares_changed
function needs to
   - adjust the absolute share values of the parent resource class
   - adjust the absolute share values of all children resource classes

ckrm_numtasks.c:numtasks_shares_changed can be used as a template to
implement shares_changed.

2.3.4 get_stats

Called when a user reads from the stats file of any class. Each
registered controller for that core class is called to return the
statistics (stats) of its resource class object for that core class.

Unlike the shares struct which is common to all controllers, the
statistics for a resource class are controller-specific. Hence, this
function uses a character buffer to return whatever the controller
considers statistics for that resource class object.

2.3.5 move_task

Called when a task is moved from one class to another. This can happen
when a user manually reclassifies a task (by writing its pid to the
members file), when a class is removed, or when the controller
registers/unregisters.

This function needs to
  - make any changes to its resource class object that are affected by
    a task being removed from or added to it. Typically, this won't be
    necessary.
  - do controller specific work to "account" for the move

The latter needs a little explanation and is best done by using the
example of ckrm_numtasks.

ckrm_numtasks regulates how many tasks can be forked within a class.
Each class gets an absolute number, calculated from its share, of tasks
that can be present in the class. Whenever a task forks, its class's
count is checked. If the count will exceed the class's allocation due to
that fork, the fork fails.

When a task moves from one class to another, the numtasks controller
must decrement the count for the source class and increment it for the
destination class. This is done by calling dec/inc_usage_count(). These
functions also take care of another feature of numtasks which is the
"borrowing" between a child and its parent. This feature is too numtasks
specific to warrant a discussion here.


2.4 Initialization/teardown code for the controller

A controller needs to provide initialization and teardown functions.
The initialization function is called either at system startup (if the
controller is compiled in) or when the controller module is loaded. It
is preferable to write the controller as a module if possible.

The main responsibility of the initialization and teardown functions is
to register/deregister the controller with CKRM Core.

The functions init_ckrm_numtasks_res and exit_ckrm_numtasks_res in
ckrm_numtasks.c are quite simple and can be used as a template for most
controllers.


2.5 Enforcing the resource class's shares

All the previous sections dealt with the controller's interface to Core.
But the real purpose of a controller is to regulate some kernel resource
in accordance with the shares set by the user. How this is done is
specific to each controller and can get complex (as in the case of the
CPU, mem and I/O controllers).

In the numtasks controller, the logic for doing this is within three
functions: numtasks_allow_fork(), inc_usage_count() and
dec_usage_count().
These functions get called in two ways:

   - fork system call
     If numtasks_allow_fork returns non-zero, the fork fails. The exit
     system call does not need to be hooked by ckrm_numtasks since every
     exiting task gets reclassified (through move_task) and the
     accounting update is handled properly there.

   - when a task changes its class membership
     The "from" class's count is decremented and the "to" class's count
     is incremented. Note that this mode of invoking inc_usage_count/
     dec_usage_count is not affected by the "rate" of forks.

Controller Configuration information:

Each controller may have some configuration values that may be modified
by the user. For example, the numtasks controller has three parameters

   - total_numtasks: Number of tasks at the root level. Since there is
     no real limit on the number of tasks in a system, controlling the
     per-class numbers relative to a system-wide amount requires a
     mechanism for defining that amount. This parameter provides that
     mechanism. Default value is infinite.
   - forkrate: Number of forks allowed in forkrate_interval.
   - forkrate_interval: Interval, in seconds, over which forkrate is
     measured.

These parameters can be exported to user level by way of module
parameters, which allow the user to change them. See the numtasks
controller for a template.
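
The absolute-share arithmetic of section 2.3.3 and the fork and
move_task accounting of sections 2.3.5 and 2.5 can be illustrated with a
small user-space sketch. This is not code from ckrm_numtasks.c: the
struct and every demo_* name below are hypothetical, integer division is
assumed for rounding, and the fork rate check and parent/child borrowing
are deliberately left out.

```c
/* User-space illustration only; all demo_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>

/* Stand-in for a controller's resource class object (section 2.1). */
struct demo_numtasks {
	struct demo_numtasks *parent;	/* NULL for the root class */
	int min_shares;			/* share relative to the parent */
	int child_shares_divisor;	/* denominator for children's shares */
	int abs_limit;			/* cached absolute task limit */
	int cnt_cur;			/* tasks currently in the class */
};

/*
 * Set up a class and cache its absolute limit so that it need not be
 * recomputed on every fork. total_numtasks is only used for root.
 */
static void demo_init(struct demo_numtasks *c, struct demo_numtasks *parent,
		      int min_shares, int divisor, int total_numtasks)
{
	c->parent = parent;
	c->min_shares = min_shares;
	c->child_shares_divisor = divisor;
	c->cnt_cur = 0;
	if (parent == NULL) {
		c->abs_limit = total_numtasks;	/* root has 100% */
		return;
	}
	assert(parent->child_shares_divisor > 0);
	/* child limit = min_shares / parent's divisor * parent's limit */
	c->abs_limit = (int)((long long)min_shares * parent->abs_limit /
			     parent->child_shares_divisor);
}

/* Fork check: 0 lets the fork proceed, non-zero makes it fail. */
static int demo_allow_fork(const struct demo_numtasks *c)
{
	return c->cnt_cur >= c->abs_limit ? -1 : 0;
}

/* Usage accounting, without the borrowing done by the real controller. */
static void demo_inc_usage(struct demo_numtasks *c)
{
	c->cnt_cur++;
}

static void demo_dec_usage(struct demo_numtasks *c)
{
	c->cnt_cur--;
}

/* move_task accounting: either side may be NULL (task created/exiting). */
static void demo_move_task(struct demo_numtasks *from,
			   struct demo_numtasks *to)
{
	if (from != NULL)
		demo_dec_usage(from);
	if (to != NULL)
		demo_inc_usage(to);
}
```

Initializing root with total_numtasks = 131072 and then A (25,50),
B (10,200), A1 (10,30) and B1 (100,300) under their respective parents
gives absolute limits of 32768, 13107, 6553 and 6553, matching the
example in section 2.3.3. Once a class's cnt_cur reaches its abs_limit,
demo_allow_fork() returns non-zero until a task leaves the class through
demo_move_task().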