System Recovery

Philip A. Bernstein, Eric Newcomer, in Principles of Transaction Processing (Second Edition), 2009

Transaction-Based Server Recovery

Transactions simplify server recovery by focusing clients’ and servers’ attention on the transactions executed by each server, rather than on individual calls within a transaction. That is, the server does all its work within transactions. The client tells the server to start a transaction, the client makes some calls to the server within that transaction, and then the client tells the server to commit the transaction.

If a server that supports transactions fails and subsequently recovers, its state includes the effects of all transactions that committed before the failure and no effects of transactions that aborted before the failure or were active at the time of the failure. Comparing this behavior to a nontransactional server, it is as if the transactional server performs a checkpoint every time it commits a transaction, and its recovery procedure discards all effects of aborted or incomplete transactions. Thus, when a transactional server recovers, it ignores which calls were executing when it failed and focuses instead on which transactions were executing when it failed. So instead of recovering to a state as of the last partially-executed call (as in checkpoint-based recovery), it recovers to a state containing all the results of all committed transactions and no others.

For this to work, the server must be able to undo all of a transaction’s operations when it aborts. This effectively makes the operations redoable when the transaction is re-executed. That is, if an operation was undone, then there’s no harm in redoing it later, even if it is non-idempotent. This avoids a problem that was faced in checkpoint-based recovery—the problem of returning to a state after the last non-idempotent operation. This isn’t necessary because every non-idempotent operation was either part of a committed transaction (and hence won’t be redone) or was undone (and hence can be redone).

If all operations in a transaction must be redoable, then the transaction must not include the non-idempotent operations we encountered in the earlier section, Server Recovery, such as printing a check or transferring money. To cope with such a non-idempotent operation, the transaction should enqueue a message that contains the operation. It’s safe for the transaction to contain the enqueue operation, because it is undoable. The program that processes the message and performs the non-idempotent operation should use the reply handling techniques in Section 4.4 to get exactly-once execution of the actual operation (printing the check or sending a money-transfer message).
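As a concrete illustration of this pattern, the following sketch shows a transfer transaction that enqueues the money-transfer message instead of performing the external action directly. It is Python-flavored pseudocode: the db and queue objects and their method names are hypothetical, not an API from the book.

```python
# Hypothetical transactional API; object and method names are illustrative only.
def transfer_and_notify(db, queue, from_acct, to_acct, amount):
    tx = db.begin()                      # start the transaction
    try:
        tx.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                   (amount, from_acct))
        tx.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                   (amount, to_acct))
        # Do NOT send the money-transfer message here. Enqueue it instead:
        # the enqueue is undone if the transaction aborts, so it is safe to redo.
        queue.enqueue(tx, {"op": "send_transfer", "to": to_acct, "amount": amount})
        tx.commit()
    except Exception:
        tx.abort()                       # account updates and the enqueue are both undone
        raise

# A separate program dequeues the message and performs the external,
# non-idempotent action exactly once, using the reply-handling techniques
# referenced in Section 4.4.
```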

Transactions not only simplify server recovery, they also speed it up. A memory checkpoint is expensive, but transaction commitment is relatively cheap. The trick is that the transactional server is carefully maintaining all its state on disk, incrementally, by writing small amounts to a log file, thereby avoiding a bulk copy of its memory state. It is designed to suffer failures at arbitrary points in time, and to reconstruct its memory state from disk using the log, with relatively modest effort. The algorithms to reconstruct its state in this way are what gives transactions their all-or-nothing and durability properties. Either all of a transaction executes or none of it does. And all of its results are durably saved in stable storage, even if the system fails momentarily after the transaction commits. These algorithms are the main subject of the rest of this chapter.
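The following toy sketch illustrates the idea of reconstructing state from a log at recovery time. It is a deliberately simplified, single-threaded illustration in Python, not the algorithms the chapter goes on to present: updates are written to an append-only log, and recovery replays only the updates of transactions that reached a commit record, so aborted or incomplete transactions leave no effects.

```python
import json
import os

LOG = "server.log"   # illustrative log file name

def log_append(record):
    """Append one log record and force it to disk before returning."""
    with open(LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())

def commit(txid, writes):
    """Log a transaction's updates followed by its commit record."""
    for key, value in writes.items():
        log_append({"tx": txid, "type": "update", "key": key, "value": value})
    log_append({"tx": txid, "type": "commit"})

def recover():
    """Rebuild in-memory state by replaying updates of committed transactions only."""
    state, pending, committed = {}, {}, set()
    if not os.path.exists(LOG):
        return state
    with open(LOG) as f:
        for line in f:
            rec = json.loads(line)
            if rec["type"] == "update":
                pending.setdefault(rec["tx"], {})[rec["key"]] = rec["value"]
            elif rec["type"] == "commit":
                committed.add(rec["tx"])
    for txid in committed:
        state.update(pending.get(txid, {}))   # aborted/incomplete transactions are dropped
    return state
```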

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B978155860623400007X

Disaster Recovery

Kelly C. Bourne, in Application Administrators Handbook, 2014

12.5.4 Software licensing and your DR site

Before acquiring and setting up DR servers, make sure you understand the licensing ramifications of a DR site. If your DR site includes servers with software already loaded onto them, you don’t want to do anything illegal. You need to consider the following questions to ensure you aren’t violating your contract with the software vendor.

Will the application software vendor allow you to load their software on a DR server without additional licensing fees?

Does the database engine vendor allow you to create DR database servers without additional licensing fees?

Is all of the support software on the DR servers properly licensed?

Is there a limit to the number of days that DR servers can be used without being considered a test or production server? Will your testing schedule violate these restrictions?

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123985453000121

Recovery

Pierre Bijaoui, Juergen Hasslauer, in Designing Storage for Exchange 2007 SP1, 2008

Recovery Storage Group Creation

Your next step is to create an RSG on your recovery server. The RSG is used to restore your last backup of the MB4-SG1-PRIV1 database to recover your historical data.

The Database Recovery Management tool in ExTRA provides a GUI to simplify the steps and guide you in the recovery procedure. After you specify the Exchange server that you want to work with and the domain controller, ExTRA should connect to the menu that allows you to create an RSG (Figure 9-33).

Figure 9-33. Creating an RSG

ExTRA by default creates a subdirectory beneath the database and log file directory of your production storage group to host the RSG database and log files (Figure 9-34).

Figure 9-34. RSG directories

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9781555583088000090

Disaster Recovery Options

Fergus Strachan, in Integrating ISA Server 2006 with Microsoft Exchange 2007, 2008

Activation Using Database Portability

To activate an SCR target database on a different server from its original host, we need to use the database portability feature of Exchange. Unlike the server recovery method, this method requires some reconfiguration within Active Directory—moving the location of the mailbox database and reconfiguring mailbox settings to point to the new mailbox store—and depending on the clients may require client reconfiguration as well.

What it boils down to, operationally, is little more than a restore of an offline database backup onto a separate server because we are attaching a copy of some database files onto a new storage group and mailbox database.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9781597492751000060

Taking Responsibility for an Application

Kelly C. Bourne, in Application Administrators Handbook, 2014

6.7 Hardware

What server(s) support this application?

Compile a list of all application, development, database, and SMTP servers.

Be sure to include ALL servers:

Production servers

Test and development servers

DR servers

Report servers

Can you remotely access all of the servers used by the application?

What tool is used to access them?

RDC—Remote Desktop Connection

WinSCP

PuTTY

Are the application servers physical or virtual?

If they are virtual servers, then what software is being used? For example:

VMWare

Citrix XenApp

Microsoft Hyper-V

Where are the servers physically located?

Who has physical access to them if they don’t respond remotely?

What operating system is loaded on each server?

What version of the operating system?

What patch level or service pack?

What resources does each server have?

Processors: make, model, and speed in GHz

Memory

Disk drives

NICs (Network Interface Cards)

Is any special or unique hardware needed for this application? This would include:

RAID drives

Multiple processors

Load balancers

Clusters

Proxy servers

Has any hardware (servers, disk drives, memory, etc.) been added recently?

If so, what and when?

Why was it added?

Are any hardware additions planned for the future?

Is any of the existing hardware scheduled to be replaced in the near future?

If so, what hardware and when?

When is the “End of Life” for the servers?

What is your organization’s process for replacing servers as they approach their end of life?
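One lightweight way to capture the answers to these questions is a structured record per server. The sketch below is plain Python; the field names simply mirror the questions above and are an illustrative starting point, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServerRecord:
    name: str
    role: str                      # production, test/dev, DR, report
    access_tool: str               # e.g., RDC, WinSCP, PuTTY
    is_virtual: bool
    hypervisor: Optional[str]      # e.g., VMware, XenApp, Hyper-V (if virtual)
    location: str
    os_name: str
    os_version: str
    patch_level: str
    cpu: str                       # make, model, and speed
    memory_gb: int
    disks: List[str] = field(default_factory=list)
    nics: List[str] = field(default_factory=list)
    special_hardware: List[str] = field(default_factory=list)  # RAID, load balancers, etc.
    end_of_life: Optional[str] = None

# Example entry; values are placeholders.
inventory = [
    ServerRecord(name="APP01", role="production", access_tool="RDC",
                 is_virtual=True, hypervisor="VMware", location="HQ data center",
                 os_name="Windows Server", os_version="2012 R2",
                 patch_level="SP1", cpu="2 x Xeon 2.4 GHz", memory_gb=32),
]
```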

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123985453000066

What is virtualization?

Thomas Olzak, ... James Sabovik, in Microsoft Virtualization, 2010

Virtualization and business continuity

Business continuity is an important consideration in system design, including both system failures and datacenter destruction scenarios—and everything in between. Traditional system recovery documentation provides instructions for rebuilding a system when the original hardware is no longer accessible or operational. The problem is that there are usually no guarantees your disaster recovery or hardware vendors will be able to duplicate the original hardware.

Using different hardware can result in extended rebuild times as you struggle to understand why your applications do not function. Even if you can get the same hardware, you need to rebuild the environment from the ground up.

Finally, interruptions in business processes occasionally happen when systems are brought down for maintenance. You understand the necessity, but your users seldom do.

Virtualization provides advantages over traditional recovery methods, including:

Breaking hardware dependency. Since the hypervisor provides an abstraction layer between the operating environment and the underlying hardware, you do not need to duplicate failed hardware to restore critical processes.

Increased server portability. If you create virtual images of your critical system servers, it does not matter what hardware you use to recover from a failure—as long as the recovery server supports your hypervisor and, if necessary, the load of multiple child partitions. Enhanced portability extends to recovering critical systems at your recovery test site, using whatever hypervisor-compatible hardware is available.

Elimination of server downtime (almost). You may never reach the point at which maintenance downtime is eliminated, but virtualization can get you very, very close. Because of increased server portability, you can shift critical virtual servers to other devices while you perform maintenance on the production hardware. You can also patch or upgrade one partition without affecting other partitions. One way to accomplish this is via clustering, failing over from one VM to another in the same cluster. From the client perspective, there is no interruption in service—even during business hours.

Quick recovery of end-user devices. When a datacenter goes, the offices in the same building often go as well. Further, satellite facilities can suffer catastrophic events requiring a complete infrastructure rebuild. The ability to deliver desktop operating environments via a centrally managed virtualization solution can significantly reduce recovery time.

It might seem that virtualization is an IT panacea. It is true that it can solve many problems, but it also introduces new challenges.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9781597494311000011

SQL Server 2000 Overview and Migration Strategies

In Designing SQL Server 2000 Databases, 2001

Selecting a Recovery Model

The trunc. log on chkpt and select into/bulk copy options have been replaced with three recovery models in SQL Server 2000. To simplify recovery planning and backup and recovery procedures, you can select one of the recovery models shown in Table 1.3 from the database Properties dialog box on the Options tab.

Table 1.3. SQL Server 2000 Recovery Models

Model | Recovery Capabilities | Previous Version Settings
Simple | Recover up to the last successful backup. Remaining changes must be redone. | Trunc. log on chkpt: True; Select into/bulkcopy: True or False
Full | Recover to any point in time. | Trunc. log on chkpt: False; Select into/bulkcopy: False
Bulk-Logged | Recover up to the last successful backup. Remaining changes must be redone. | Trunc. log on chkpt: False; Select into/bulkcopy: True

Each recovery model offers certain advantages for performance, space requirements, and data loss recovery. The simple recovery model requires the least amount of resources; log space is reclaimed because it is no longer needed for server recovery, similar to the trunc. log on chkpt option in previous versions of SQL Server. The disadvantage here is that data recovery to a particular point beyond the latest backup is not possible. Simple recovery should not be used where data recovery is critical and reentry is not possible.

Full and bulk-logged recovery models offer greater data recovery capabilities. Full recovery supports recovery to any point in time given that the current transaction log is not damaged. Bulk-logged recovery provides similar capabilities to the full recovery model, but bulk operations such as SELECT INTO, bulk loads, CREATE INDEX, image, and text operations are minimally logged, increasing your exposure to data loss in the event of a damaged data file. Selecting between full and bulk-logged recovery is dependent on your database structure and data operations. If complete data recovery is essential and the performance of bulk operations is not critical to your application, you should select the full recovery model for your database.
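If you prefer to script the change rather than use the Properties dialog, the same recovery models can be selected with the T-SQL ALTER DATABASE statement. The sketch below uses Python with the pyodbc driver; the connection string and the database name MyDatabase are placeholders, not values from the chapter.

```python
import pyodbc

# Placeholder connection string; adjust server, database, and authentication.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=master;Trusted_Connection=yes",
    autocommit=True,   # ALTER DATABASE cannot run inside a user transaction
)
cursor = conn.cursor()

# Choose SIMPLE, FULL, or BULK_LOGGED to match the models in Table 1.3.
cursor.execute("ALTER DATABASE MyDatabase SET RECOVERY FULL")

# Verify the setting; DATABASEPROPERTYEX reports the active recovery model.
cursor.execute("SELECT DATABASEPROPERTYEX('MyDatabase', 'Recovery')")
print(cursor.fetchone()[0])
conn.close()
```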

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B978192899419050004X

Business Continuity and Disaster Recovery in Energy/Utilities

Susan Snedaker, Chris Rima, in Business Continuity and Disaster Recovery Planning for IT Professionals (Second Edition), 2014

Compute and data

Not so long ago, servers were purposely built for nearly every piece of application software, the two were configured to work with each other and were inextricably linked once put into production, and the term “bare metal recovery” was part of the lexicon of standard IT disaster recovery operations. As a result, restoration of the software and data, or IT service, always depended upon additional plans, procedures, and costs associated with restoration of the server hardware, operating system, and individually configured network interfaces. As a result, recovery operations for IT infrastructure had to be designed more specifically for each and every application and took additional time and resources. In addition, the cost of the infrastructure design was much higher if continuous data protection (e.g., mirrored hot sites), spare parts/redundant server hardware, or server clustering software was required for high-availability applications. Today, virtualization technologies have obfuscated much of the underlying hardware to the point where recovery operations are dramatically less expensive and take much less time than they used to. ABC was an early adopter of such technologies and has nearly a decade under its belt in terms of how to best manage compute and data to achieve drastically improved BC/DR results.

As was stated earlier, ABC’s most stringent RTO is 30 minutes; however, they currently operate with recovery times under 5 minutes for most applications in most unplanned outage scenarios. ABC does not implement full mirroring between its dual production data centers in order to achieve this performance, primarily because virtualization technologies cost much less and produce better than required performance based on documented requirements. Instead, they physically separate Production (Prod) systems from Development (Dev), Test, or Quality Assurance (QA) systems. All production systems are split between the two data centers. If one data center houses Prod, the other houses the Dev, Test, and/or QA instances for that particular application or service. They then back up these systems and data to the alternate data center bidirectionally. In addition, they maintain a reserve of additional compute and storage capacity at each data center in the event an entire data center is lost to a disaster and they need to recover the most critical systems and data within 24 hours (i.e., Tier 1 and 2 applications, as defined in their Service Level Matrix). Backups of remote systems and data to one of the HQ data centers occur at regularly scheduled intervals and at off-peak times over slower WAN links.

ABC was an early adopter of server virtualization, in part due to its significant BC/DR and cost benefits. Today, it has a “virtualization first” policy; all new server instances are virtualized, regardless of application vendor support, unless requirements or contractual obligations dictate otherwise. For example, if a server requires a special communication adapter or there are exceptional performance requirements, a physical server will be used instead. From a support standpoint, moving from a virtual server back to a physical server (V2P), if need be for specific troubleshooting circumstances, is easily accomplished with Novell’s PlateSpin tool. Today, VMWare ESX and VCenter technologies are employed to virtualize nearly 100% of Microsoft Windows instances using commodity Dell rack mount or blade server hardware. In addition, all Oracle Sun Solaris server instances are virtualized zones using Solaris’ ZFS file system on shared physical Oracle Sun Sparc servers.

VMWare allows for virtual server instances to be housed on shared enterprise storage using any of a number of storage protocols, including file-based protocols such as NFS. ABC uses NFS over 10 Gbps Ethernet extensively to serve up nearly all virtual server instances to clients on shared NetApp storage systems. By doing so, each virtual server instance is stored as a single VMDK file on a shared enterprise storage volume. Since each server is essentially a network file, ABC’s BC/DR recovery strategy has improved significantly for several reasons. First, files are much easier to manage than block-based fiber channel LUNs, which require full restoration on available storage (and significantly more time) before data inside can be recovered. Second, ABC uses the built-in data protection technologies of the shared NetApp storage to snapshot the volumes housing the VMDK file, locally, and then replicate the snapshots to off-site storage periodically using NetApp’s SnapMirror technology. Snapshots are essentially very fast, point-in-time backups of the storage volume housing the virtual server instances. Snapshot operations have virtually no effect on server performance, and no server-based backup software is required to be purchased or managed. To restore a virtual server instance from backup disk media, one simply connects to the snapshot or remote mirrored snapshot and either connects to the VMDK file in order to boot the virtual server or peeks inside to extract volume-level folders and files using a free tool such as UFS Explorer.

In addition, VMWare employs two primary technologies, VMotion and HA, which build on the concept of a “server as a file” and further enhance disaster recovery capabilities. VMWare server VMotion and storage VMotion allow server instances to move seamlessly between disparate server and storage hardware, without taking an outage on the server itself. VMWare HA (high availability) detects if virtual servers are unresponsive for any reason and automatically resurrects the server on an available ESX node using VMotion. Together, these two technologies allow for virtually no server down time for planned maintenance, and near seamless automatic server recovery without human intervention in the case of most unplanned hardware failures.

Critical Concept

Server Virtualization and Shared Network-Attached Storage: Transformational for IT Disaster Recovery

The capabilities that server virtualization brings to BC/DR operations have been transformational for ABC. Instead of having to perform slow recovery from tape, having to reconstitute entire block-based LUNs on separate available storage, having to use cumbersome backup and recovery software to manage distinct backup jobs, or even having to deal with bare metal recovery of physical servers, ABC can recover virtual servers easily and reliably in minutes instead of hours or days. In addition, virtualization drastically improves availability and service levels by allowing ABC’s IT staff to put physical ESX hosts (server hardware) in “maintenance mode” at any time to perform planned hardware maintenance without having to take down virtual server instances; VMWare simply VMotions all the virtual servers to another ESX host with available capacity, with no client impact, so the physical server can be powered off or rebooted at any time without incurring an outage. Moreover, server virtualization technology even enables automatic, scripted failover and failback processes so that ABC IT staff can provide self-service failover to DR sites for individual applications or sets of applications, as you will read about later in this chapter.

In the case of Oracle Sun Solaris, the ZFS file system is used to create virtual Zones which house each virtual “zoned” server. Tools such as NetApp’s Open System SnapVault (OSSV) can then be used to snapshot the zone/server locally, while the server is live, on the same production storage volume, similar to virtual servers managed by VMWare. OSSV can also be installed on physical server instances to create snapshots of entire physical server data volumes on shared storage, and it is generally licensed at little to no cost if you already license the NetApp hardware and replication software. Replication of historical snapshots to off-site storage (i.e., mirrored snapshots) occurs exactly the same way as with VMware virtual servers, via NetApp SnapMirror.

ABC’s use of shared enterprise storage to house network files and virtual server files together, along with local historical backups of the data (i.e., snapshots), has transformed system-level backup and recovery operations entirely. By employing similar storage systems at each data center location, production and backup data can be seamlessly served up at the same time. By employing clusters of ESX or Solaris ZFS hosts connected to enterprise storage at each location with sufficient reserve compute capacity, restoration of virtual servers at off-site facilities is as simple as opening a file from within VMWare VCenter or Solaris. This is one reason why server virtualization is a huge enabler of private or public cloud computing for disaster recovery operations.

Virtualization also improves database recoverability. At ABC, databases are backed up to local server volumes using vendor backup toolsets, and a limited number of local backups are kept to minimize required volume sizes. Since these server volumes are contained within a single VMDK or Solaris zone, database backups can easily be recovered from “local” volumes once a snapshot is restored and the virtual server is started back up. There is no need at ABC to manage separate backup jobs of databases over WAN links to separate off-site storage volumes.

Since reliability, availability, and recoverability of virtual servers are tied heavily to shared enterprise storage, and to a lesser extent, individual physical ESX or Solaris servers at ABC, it is important that significant fault tolerance be designed into their enterprise storage. All NetApp storage systems employed by ABC which house production servers and network files (as opposed to merely backup data) utilize a pair of clustered HA controllers, redundant fiber loops to each disk shelf and RAID DP disk groups. Clustered controllers (or heads) allow the system to automatically fail over from the primary head to a secondary head if any problems occur. In addition, redundant fiber loops ensure that all disks are presented to and accessible by either head. Moreover, the disks are arranged into fault tolerant RAID DP groups, allowing up to two disks in any RAID DP group to fail without losing access to the data volume. Spare disks on each system are available to reconstitute any RAID DP group on the fly, and phone home capability ensures that NetApp is automatically notified of any disk failure so that they can send out replacement disks to be received and replaced within 4-8 hours. For physical servers, such as VMWare ESX or Solaris hosts, boot partitions are mirrored in a RAID 1 disk configuration. Therefore, if one disk fails, the other disk can still be read from and written to, allowing the hosted hypervisor or operating system to remain operable. All physical server and storage system network interfaces and power supplies are redundant, as well, with both A and B connections to different clustered distribution switches or different PDUs (power distribution units) within a data center rack.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780124105263099773

Infrastructure as a Service

Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012

Implementing the Pustak Portal Infrastructure

CloudSystem Matrix can be used for several IaaS use cases [19]. A portal like Pustak Portal can be implemented using the CloudSystem Matrix service catalog templates and self-service interfaces previously described. CloudSystem Matrix service templates are typically authored with a built-in graphic designer and then published into the catalog in an XML format. It is also possible to create the XML representations using other tools and import the templates using the CloudSystem Matrix APIs.

Template Design for Pustak Portal

As stated earlier, service template design is the first step in service setup using CloudSystem Matrix. Subsequently, the template can be used to instantiate the service [20]. The template design for the Pustak Portal is shown in Figure 2.17. The design uses a combination of virtual machines and physical servers to realize the service in order to leverage the flexibility conferred by virtualization. This is illustrated in Chapter 8 Managing the Cloud where scaling the service up or down is considered.

Figure 2.17. CloudSystem Matrix service template example.

The service is realized as a conventional three-tier application. In the example template, the web tier is connected to the Internet and contains six ESX host VMs running a Linux operating system realized as a set of linked clones. These VMs share a file system used as a cache for frequently used web data. The web tier connects to a private service internal network that is used for communication between the web tier servers and the application and database servers. The App Server tier contains four HyperV VMs running Windows, while the database tier contains two physical servers also running Windows. The physical server database cluster shares a 300 GB Fibre Channel disk.

Resource Configuration

After template definition, it is necessary to configure the resources (server, storage, network) used in the service template. These attributes are set in the Service Template Designer Portal. As an example for a virtual server configuration (see Figure 2.18), it is possible to set:

Cost Per Server used for charge back

Initial and Maximum number of servers in the tier

Option to deploy servers as linked clones

Number of CPUs per VM

VM Memory size

Server recovery automation choice

Figure 2.18. CloudSystem Matrix server configuration example.

For the configuration of the physical servers, there are additional configuration parameters for processor architecture and minimum clock speed. The software tab in the designer allows configuration of the software to be deployed to the virtual or physical server.

Similarly for disk configuration, Figure 2.19 shows an example of a Fibre Channel disk, with the following configuration parameters:

Disk size

Cost Per GB used for charge back

Storage type

RAID level

Path redundancy

Cluster sharing

Storage service tags

Figure 2.19. CloudSystem Matrix storage configuration example.

Storage service tags are used to specify the needs for storage security, backup, retention and availability requirements.

Network configuration allows the service network requirements to be specified including requirements regarding:

Public or private

Shared or exclusive use

IPV4 or IPV6

Hostname pattern

Path redundancy

IP address assignment policy (Static, DHCP or Auto-allocation)

For example, specifying a private, exclusive-use network would provide the servers a network isolated from other servers in the environment.
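To make the shape of these settings concrete, the sketch below models a template’s server, disk, and network attributes as plain Python dataclasses. This is purely illustrative: real CloudSystem Matrix templates are XML documents authored in the graphic designer, and the class and field names here are assumptions rather than the product’s schema.

```python
from dataclasses import dataclass

@dataclass
class ServerTierConfig:
    cost_per_server: float          # used for charge back
    initial_servers: int
    max_servers: int
    linked_clones: bool             # deploy servers as linked clones
    cpus_per_vm: int
    memory_gb: int
    recovery_automation: bool       # server recovery automation choice

@dataclass
class DiskConfig:
    size_gb: int
    cost_per_gb: float              # used for charge back
    storage_type: str               # e.g., "Fibre Channel"
    raid_level: str
    path_redundancy: bool
    cluster_sharing: bool
    service_tags: tuple = ()        # security, backup, retention, availability needs

@dataclass
class NetworkConfig:
    public: bool                    # public or private
    exclusive: bool                 # exclusive-use networks isolate the tier
    ip_version: str                 # "IPv4" or "IPv6"
    hostname_pattern: str
    path_redundancy: bool
    ip_assignment: str              # "Static", "DHCP", or "Auto-allocation"

# Example web tier matching the values described for Figure 2.18 (illustrative).
web_tier = ServerTierConfig(cost_per_server=10.0, initial_servers=6, max_servers=12,
                            linked_clones=True, cpus_per_vm=2, memory_gb=4,
                            recovery_automation=True)
```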

Pustak Portal Instantiation and Management

Once the Pustak Portal templates have been created, the self-service interface of CloudSystem Matrix can be used by consumers to perform various lifecycle operations on the cloud service. Lifecycle operations are major management operations, such as creation, destruction, and addition and removal of resources. More specific details of lifecycle operations as per DMTF reference architecture can be found in Chapter 10. Consumer lifecycle operations are available either from a browser-based console or via the published web service APIs. The browser-based console provides a convenient way for the consumer to view and access their services, browse the template catalog and create new services and delete existing ones, view the status and progress of the infrastructure requests they have initiated, examine their resource pool utilization, and view their resource consumption calendar.

The lifecycle operations include the ability to adjust the resources associated with a particular service. Referring back to Figure 2.18 as an example, the number of servers in the web tier was initially specified to be 6 servers, with 12 as the maximum number of servers in the tier. From the self-service portal, the consumer has the ability to request additional servers to be added, up to the maximum of 12 servers. The consumer also has the ability to quiesce and reactivate servers in a tier. For example, in a tier that has 6 provisioned servers, the consumer can request 3 servers be quiesced, which will cause those servers to be shut down and the associated server resource released. However, a quiesced server’s disk image and IP address allocation are retained, so that subsequent reactivate operations can occur quickly, without requiring a server software re-provisioning operation.

In order to maintain service levels and contain costs, the owner can dynamically scale the resources in the environment so that the service has just enough server and storage resources to meet current demand, without pre-allocating a large pool of idle resources. The service can be scaled according to the number of concurrent users accessing the system. As stated previously, this can be done manually via the Consumer Portal. In Chapter 8, Managing the Cloud, there is a detailed description of how this can be accomplished automatically using the CloudSystem Matrix APIs.
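As a rough picture of what such automation might look like, the sketch below assumes a hypothetical Python client wrapping the CloudSystem Matrix web service APIs; the method names (concurrent_users, count_servers, add_server, quiesce_server) are invented for illustration and are not the published API.

```python
def scale_web_tier(matrix, service_id, tier="web",
                   min_servers=6, max_servers=12,
                   users_per_server=500):
    """Adjust the number of active servers in a tier to the current load.

    `matrix` is a hypothetical client object wrapping the web service APIs.
    """
    users = matrix.concurrent_users(service_id)       # hypothetical call
    active = matrix.count_servers(service_id, tier)   # hypothetical call

    # Ceiling division: one server per `users_per_server` concurrent users,
    # clamped to the tier's configured minimum and maximum.
    target = max(min_servers, min(max_servers, -(-users // users_per_server)))

    while active < target:
        matrix.add_server(service_id, tier)           # provision up to the maximum
        active += 1
    while active > target:
        matrix.quiesce_server(service_id, tier)       # disk image and IP are retained
        active -= 1
    return target
```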

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9781597497251000020

Security component fundamentals for assessment

Leighton Johnson, in Security Controls Evaluation, Testing, and Assessment Handbook (Second Edition), 2020

F Contingency plan testing

Contingency plan testing always requires special attention from assessors, as it is often the only way to fully check out the alternative operations and support efforts that the organization has put in place but activates only when required. The following table reflects the areas of CP controls to evaluate and obtain evidence and proof of accomplishment for testing of the various parts of the system or organization’s contingency plans and COOP preparations.

Control | Testing event | Sample event to document
CP-3 | CP training | A seminar and/or briefing used to familiarize personnel with the overall CP purpose, phases, activities, and roles and responsibilities
CP-3 | Instruction | Instruction of contingency personnel on their roles and responsibilities within the CP; includes refresher training and, for high-impact systems, simulated events
CP-4 | CP testing/exercise | Test and/or exercise the CP to determine the effectiveness and the organization's readiness. This includes both planned and unplanned maintenance activities
CP-4 | Tabletop exercise | Discussion-based simulation of an emergency situation in an informal, stress-free environment; designed to elicit constructive scenario-based discussions for an examination of the existing CP and individual states of preparedness
CP-4 | Functional exercise | Simulation of a disruption with a system recovery component such as backup tape restoration or server recovery
CP-4 | Full-scale functional exercise | Simulation prompting a full recovery and reconstitution of the information system to a known state; ensures that staff are familiar with the alternate facility
CP-4, CP-7 | Alternate processing site recovery | Test and/or exercise the CP at the alternate processing site to familiarize contingency personnel with the facility and available resources and to evaluate the site's capabilities to support contingency operations. Includes a full recovery and return to normal operations to a known secure state. For a high-impact system, the alternate facility should be fully configured as defined in the CP
CP-9 | System backup | Test backup information to verify media reliability and information integrity. For a high-impact system, use sample backup information to validate the recovery process and ensure backup copies are maintained at the alternate storage facility

Now, each of these areas of focus for assessment of the CP controls should be tied into and reflected in the system contingency plan and its design efforts.

Incident response

The current state of the security of systems across the enterprise often requires organizations to develop and conduct incident response activities due to breaches, malware infections, “phishing” events, and outright external attacks. The state of the cybercrime and hacking communities has developed dramatically over the past few years and now includes “hack-in-a-box” kits and fully developed malicious software development efforts, including formal version control, automated delivery channels, testing against known antivirus signatures, and malware-as-a-service (MaaS) cloud-based delivery mechanisms. The goals of any incident response effort are as follows:

Detect incidents quickly

Diagnose incidents accurately

Manage incidents properly

Contain and minimize damage

Restore affected services

Determine root causes

Implement improvements to prevent recurrence

Document and report

The purpose of incident response is to manage and respond to unexpected disruptive events with the objective of controlling impacts within acceptable levels. These events can be technical, such as attacks mounted on the network via viruses, denial of service, or system intrusion, or they can be the result of mistakes, accidents, or system or process failure. Disruptions can also be caused by a variety of physical events such as theft of proprietary information, social engineering, lost or stolen backup tapes or laptops, environmental conditions such as floods, fires, or earthquakes, and so forth. Any type of incident that can significantly affect the organization's ability to operate or that may cause damage must be considered by the information security manager and will normally be a part of incident management and response capabilities.

The US Government has long recognized the need and requirements for computer incident response and as a result has developed many documented resources and organizations for incident response to include the US-CERT, various DOD CERT organizations, joint ventures between various governmental agencies, incident handling guides, procedures and techniques, and the NIST SP 800-61.

SP 800-61—computer security incident handling guide

As the introduction at the beginning of SP 800-61 says: “Computer security incident response has become an important component of information technology (IT) programs. Cybersecurity-related attacks have become not only more numerous and diverse but also more damaging and disruptive. New types of security-related incidents emerge frequently. Preventive activities based on the results of risk assessments can lower the number of incidents, but not all incidents can be prevented. An incident response capability is therefore necessary for rapidly detecting incidents, minimizing loss and destruction, mitigating the weaknesses that were exploited, and restoring IT services. To that end, this publication provides guidelines for incident handling, particularly for analyzing incident-related data and determining the appropriate response to each incident. The guidelines can be followed independently of particular hardware platforms, operating systems, protocols, or applications.

Because performing incident response effectively is a complex undertaking, establishing a successful incident response capability requires substantial planning and resources. Continually monitoring for attacks is essential. Establishing clear procedures for prioritizing the handling of incidents is critical, as is implementing effective methods of collecting, analyzing, and reporting data. It is also vital to build relationships and establish suitable means of communication with other internal groups (e.g., human resources, legal) and with external groups (e.g., other incident response teams, law enforcement).”12

Incident handling

“The incident response process has several phases. The initial phase involves establishing and training an incident response team, and acquiring the necessary tools and resources. During preparation, the organization also attempts to limit the number of incidents that will occur by selecting and implementing a set of controls based on the results of risk assessments. However, residual risk will inevitably persist after controls are implemented. Detection of security breaches is thus necessary to alert the organization whenever incidents occur. In keeping with the severity of the incident, the organization can mitigate the impact of the incident by containing it and ultimately recovering from it. During this phase, activity often cycles back to detection and analysis—for example, to see if additional hosts are infected by malware while eradicating a malware incident. After the incident is adequately handled, the organization issues a report that details the cause and cost of the incident and the steps the organization should take to prevent future incidents.”13

Preparation

Incident response methodologies typically emphasize preparation—not only establishing an incident response capability so that the organization is ready to respond to incidents but also preventing incidents by ensuring that systems, networks, and applications are sufficiently secure. Although the incident response team is not typically responsible for incident prevention, it is fundamental to the success of incident response programs.

As an assessor of incident response capacity and incident handling activities, you should understand that the process itself is often chaotic and can appear haphazard while the response is active. One of the critical areas to focus on during the review is the documented and defined training for the responders, as well as the organizational policies and procedures for incident response. Each of these areas helps determine the success or failure of the response team, their interactions with the rest of the organization, and ultimately the minimization of the impact of the incident on the organization, its people, and its mission.

Detection and analysis

“For many organizations, the most challenging part of the incident response process is accurately detecting and assessing possible incidents—determining whether an incident has occurred and, if so, the type, extent, and magnitude of the problem. What makes this so challenging is a combination of three factors:

Incidents may be detected through many different means, with varying levels of detail and fidelity. Automated detection capabilities include network-based and host-based IDPSs, antivirus software, and log analyzers. Incidents may also be detected through manual means, such as problems reported by users. Some incidents have overt signs that can be easily detected, whereas others are almost impossible to detect.

The volume of potential signs of incidents is typically high—for example, it is not uncommon for an organization to receive thousands or even millions of intrusion detection sensor alerts per day.

Deep, specialized technical knowledge and extensive experience are necessary for proper and efficient analysis of incident-related data.

Signs of an incident fall into one of two categories: precursors and indicators. A precursor is a sign that an incident may occur in the future. An indicator is a sign that an incident may have occurred or may be occurring now.

Incident detection and analysis would be easy if every precursor or indicator were guaranteed to be accurate; unfortunately, this is not the case. For example, user-provided indicators such as a complaint of a server being unavailable are often incorrect. Intrusion detection systems may produce false positives—incorrect indicators. These examples demonstrate what makes incident detection and analysis so difficult: each indicator ideally should be evaluated to determine if it is legitimate. Making matters worse, the total number of indicators may be thousands or millions a day. Finding the real security incidents that occurred out of all the indicators can be a daunting task.

Even if an indicator is accurate, it does not necessarily mean that an incident has occurred. Some indicators, such as a server crash or modification of critical files, could happen for several reasons other than a security incident, including human error. Given the occurrence of indicators, however, it is reasonable to suspect that an incident might be occurring and to act accordingly. Determining whether a particular event is actually an incident is sometimes a matter of judgment. It may be necessary to collaborate with other technical and information security personnel to make a decision. In many instances, a situation should be handled the same way regardless of whether it is security related. For example, if an organization is losing Internet connectivity every 12 hours and no one knows the cause, the staff would want to resolve the problem just as quickly and would use the same resources to diagnose the problem, regardless of its cause.”14

Containment, eradication, and recovery

“Containment is important before an incident overwhelms resources or increases damage. Most incidents require containment, so that is an important consideration early in the course of handling each incident. Containment provides time for developing a tailored remediation strategy. An essential part of containment is decision-making (e.g., shut down a system, disconnect it from a network, or disable certain functions). Such decisions are much easier to make if there are predetermined strategies and procedures for containing the incident. Organizations should define acceptable risks in dealing with incidents and develop strategies accordingly.

Containment strategies vary based on the type of incident. For example, the strategy for containing an email-borne malware infection is quite different from that of a network-based DDoS attack. Organizations should create separate containment strategies for each major incident type, with criteria documented clearly to facilitate decision-making.”15

“After an incident has been contained, eradication may be necessary to eliminate components of the incident, such as deleting malware and disabling breached user accounts, as well as identifying and mitigating all vulnerabilities that were exploited. During eradication, it is important to identify all affected hosts within the organization so that they can be remediated. For some incidents, eradication is either not necessary or is performed during recovery.

In recovery, administrators restore systems to normal operation, confirm that the systems are functioning normally, and (if applicable) remediate vulnerabilities to prevent similar incidents. Recovery may involve such actions as restoring systems from clean backups, rebuilding systems from scratch, replacing compromised files with clean versions, installing patches, changing passwords, and tightening network perimeter security (e.g., firewall rulesets, boundary router access control lists). Higher levels of system logging or network monitoring are often part of the recovery process. Once a resource is successfully attacked, it is often attacked again, or other resources within the organization are attacked in a similar manner.

Eradication and recovery should be done in a phased approach so that remediation steps are prioritized. For large-scale incidents, recovery may take months; the intent of the early phases should be to increase the overall security with relatively quick (days to weeks) high value changes to prevent future incidents. The later phases should focus on longer-term changes (e.g., infrastructure changes) and ongoing work to keep the enterprise as secure as possible.”16

Postincident activity

“One of the most important parts of incident response is also the most often omitted: learning and improving. Each incident response team should evolve to reflect new threats, improved technology, and lessons learned. Holding a “lessons learned” meeting with all involved parties after a major incident, and optionally periodically after lesser incidents as resources permit, can be extremely helpful in improving security measures and the incident handling process itself. Multiple incidents can be covered in a single lessons learned meeting. This meeting provides a chance to achieve closure with respect to an incident by reviewing what occurred, what was done to intervene, and how well intervention worked.

Small incidents need limited post-incident analysis, with the exception of incidents performed through new attack methods that are of widespread concern and interest. After serious attacks have occurred, it is usually worthwhile to hold post-mortem meetings that cross team and organizational boundaries to provide a mechanism for information sharing. The primary consideration in holding such meetings is ensuring that the right people are involved. Not only is it important to invite people who have been involved in the incident that is being analyzed, but also it is wise to consider who should be invited for the purpose of facilitating future cooperation.”17

As an incident response assessor and evaluator, you will be looking for the required training and exercise documentation for each responder on the team. The policies for incident response, handling, notification, and board review all need to be identified, reviewed and assessed. The supporting procedures for handling and response efforts all need review and correlation to the policies, the security controls for IR from SP 800-53 and the actual incident response Plan for each system as it is reviewed and assessed.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128184271000112

What refers to the varying levels that define what a user can access, view, or perform?

Accessibility refers to the varying levels that define what a user can access, view, or perform when operating a system.

What enables a computer to run multiple operating systems and multiple software applications at the same time?

Virtualization software — programs that allow you to run multiple operating systems simultaneously on a single computer — allows you to do just that. Using virtualization software, you can run multiple operating systems on one physical machine.

What is the ability to get a system to get up and running in the event of a system crash or failure that includes restoring the information backup known as?

Backup is an exact copy of a system's information; recovery is the ability to get a system up and running in the event of a system crash or failure.

What refers to how well a system can adapt to increase demands?

Scalability describes an organization's capacity to adapt to increased workload or market demands. A scalable firm is able to quickly ramp up production to meet demand and at the same time benefit from economies of scale.