Tuesday, March 24, 2009

ICAT4221B Locate equipment, system and software faults

1: Rectify fault and test

This unit shows you how to rectify faults, test for the success of the solution and perform acceptance testing of the system to ensure the problem has been satisfactorily solved.

Outcomes for this unit

After completing this learning pack you will be able to:

  • Rectify possible causes, testing for the success of the solution
  • Test the system to ensure the problem has been solved
Activity 1.1: Action Plan

This activity will require you to prepare an action plan for a given fault. The fault is described below. You will need to formulate this plan in fairly generic terms since you would be working without having had exposure to the system described.

The fault

You have been assigned to troubleshoot a network server. The server has been operational for over 18 months and has recently started to experience some problems. The symptoms described are as follows:

  • System hangs intermittently when accessing disk drives
  • The Windows 2000 Event Log shows several entries relating to I/O and CRC errors
  • The lights in front of the RAID enclosure sometimes blink continuously, even when disk activity is nonexistent
You suspect that the RAID subsystem is failing.

Q: How would you develop an action plan that will enable you to get to the bottom of this problem?

A: An appropriate action plan would incorporate the following characteristics:

  • Acknowledges the presence of a fault, providing justification for the action to be taken
  • Identifies the systems or components affected or impacted
  • Identifies the objectives of the plan (e.g. restore optimum functionality)
  • Identifies the resources needed, including hardware, software, people and procedures
  • Identifies severity and criticality, and hence priority
  • Identifies a timeframe for implementation, according to priority
  • Identifies any support contracts that might exist and be applicable to the system in question
  • Indicates the actual remedial steps to be taken. These might include system reconfiguration, re-installation, software patches, component replacement and consultation with vendors, to be engaged as needed
  • Indicates risks, including the expected disruption as a result of remedial action
  • Identifies a workaround solution in case the previous steps fail to rectify the fault
Note that not all items in the list above need to be included, but they should at least be considered. An appropriate way of developing this action plan would be to use a pre-existing form, available as an organisational document.

Note that fully featured help desk software often includes all of the above items as part of its standard description of faults and their management.
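As a rough illustration, the items of an action plan can be captured in a simple record, much as a help desk form would. The sketch below is written in Python; the class name, field names, severity scale and sample values are all assumptions made for this example and do not come from any particular help desk product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionPlan:
    """Hypothetical action plan record; all field names are illustrative."""
    fault_summary: str             # acknowledges the fault and justifies action
    affected_systems: List[str]    # systems or components affected
    objective: str                 # e.g. restore optimum functionality
    severity: int                  # assumed scale: 1 = critical ... 4 = minor
    timeframe_hours: int           # assumed unit for the implementation timeframe
    resources: List[str] = field(default_factory=list)
    support_contracts: List[str] = field(default_factory=list)
    remedial_steps: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)
    workaround: str = ""

    def missing_items(self) -> List[str]:
        """Return the names of the optional items that are still empty."""
        optional = ["resources", "support_contracts", "remedial_steps",
                    "risks", "workaround"]
        return [name for name in optional if not getattr(self, name)]


# Sample values are made up to match the scenario in this activity.
plan = ActionPlan(
    fault_summary="Server hangs intermittently on disk access; I/O and CRC errors logged",
    affected_systems=["network server", "RAID subsystem"],
    objective="Restore optimum functionality",
    severity=1,
    timeframe_hours=24,
    remedial_steps=["Back up data", "Run RAID diagnostics", "Replace failed components"],
    risks=["Server downtime during component replacement"],
)
print(plan.missing_items())  # ['resources', 'support_contracts', 'workaround']
```

The missing_items check reflects the note above: not every item has to be filled in, but each one should at least be considered.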

Activity 1.2: Rollback strategy


This activity will require you to devise a rollback strategy based on the scenario from the previous activity. The fault is described below.

The fault

You have been assigned to troubleshoot a network server. The server has been operational for over 18 months and has recently started to experience some problems.

The symptoms described are as follows:
  • System hangs intermittently when accessing disk drives
  • The Windows 2000 Event Log shows several entries relating to I/O and CRC errors
  • The lights in front of the RAID enclosure sometimes blink continuously, even when disk activity is nonexistent
You suspect that the RAID subsystem is failing.

Q: How would you develop a rollback strategy for this situation?

A: A rollback strategy is a series of steps or measures that would enable you to restore the system being troubleshot to the state it was in before troubleshooting began.

In this particular case, your rollback strategy would have considered the following:
  • Steps from the action plan can be reversed, or an equivalent system state can be achieved with alternative steps
  • No data loss will be incurred. Full system and data backups are to be made before enacting the action plan
  • Spare components are available, if needed
  • Expertise is available for system reconfiguration. This might include internal personnel and external personnel (vendors or contractors)
  • An alternative solution is available, e.g. a backup server
  • The consequences and impact of the rollback are understood
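The idea of reversing the steps of an action plan can be sketched in code. The following Python example is a minimal sketch only: the function name, the step names and the simulated system state are assumptions for illustration, and a real rollback would involve backups and hardware rather than a dictionary.

```python
from typing import Callable, List, Tuple

# Each step is (name, action that returns True on success, undo action).
Step = Tuple[str, Callable[[], bool], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> List[str]:
    """Apply steps in order; if one fails, undo the completed ones in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, do, _undo = step
        if do():
            log.append(f"applied: {name}")
            done.append(step)
        else:
            log.append(f"failed: {name}")
            for done_name, _do, undo in reversed(done):
                undo()
                log.append(f"rolled back: {done_name}")
            break
    return log


# Simulated system state (hypothetical values).
state = {"controller": "old", "firmware": "1.0"}


def swap_controller() -> bool:
    state["controller"] = "spare"
    return True


def restore_controller() -> None:
    state["controller"] = "old"


def update_firmware() -> bool:
    return False  # simulate a remedial step that does not succeed


def restore_firmware() -> None:
    state["firmware"] = "1.0"


log = run_with_rollback([
    ("swap RAID controller", swap_controller, restore_controller),
    ("update firmware", update_firmware, restore_firmware),
])
print(log)
print(state)  # {'controller': 'old', 'firmware': '1.0'}
```

After the failed step, the system is back in the state it was in before the action plan was enacted, which is the purpose of a rollback strategy.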
Activity 1.3: Acceptance Testing

This activity will require you to devise an acceptance test procedure based on the scenario from the previous activity. The fault is described below.

The fault

You have been assigned to troubleshoot a network server. The server has been operational for over 18 months and has recently started to experience some problems.

The symptoms described are as follows:
  • System hangs intermittently when accessing disk drives
  • The Windows 2000 Event Log shows several entries relating to I/O and CRC errors
  • The lights in front of the RAID enclosure sometimes blink continuously, even when disk activity is nonexistent
You suspect that the RAID subsystem is failing.

Q: How would you develop an acceptance test procedure?

A: The development of an Acceptance Test involves a number of iterative steps:

  1. Assess the type of testing required
  2. Develop the procedures and instructions for testing
  3. Develop the necessary test scripts
  4. Execute the test scripts
  5. Report any defects
  6. Retest any fixes
Your acceptance test procedure might have included some of the following items:
  1. Test type to be carried out, e.g. simple, iterative, sequential
  2. Instructions to be carried out, e.g. any necessary preparations such as installation of monitoring software, auditing, load testing, benchmarking
  3. The sequence (order) of tests to be done
  4. Resulting data that will be analysed following the execution of tests, e.g. reports, charts, benchmarking results, system log events
  5. Definitions of what constitutes failure. Criteria or metrics are to be stipulated here, e.g. repetition of original symptoms, new symptoms
  6. Repetition of testing after new fixes have been actioned
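The execute, report and retest steps above can be sketched as a small test runner. This Python example is illustrative only: the test names, the simulated system dictionary and the failure criteria are assumptions, and real test scripts would query the event log and the disks rather than a dictionary.

```python
from typing import Callable, Dict, List


def run_acceptance_tests(tests: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each test script in order and return the names of those that failed."""
    defects = []
    for name, test in tests.items():
        try:
            passed = test()
        except Exception:
            passed = False  # a test that crashes counts as a failure
        if not passed:
            defects.append(name)
    return defects


# Simulated measurements (hypothetical values for the RAID scenario).
system = {"crc_errors_last_24h": 3, "disk_read_ok": True}

tests = {
    "no new CRC errors in event log": lambda: system["crc_errors_last_24h"] == 0,
    "disk read completes": lambda: system["disk_read_ok"],
}

first_run = run_acceptance_tests(tests)   # execute the test scripts, report defects
system["crc_errors_last_24h"] = 0         # simulate a fix being applied
retest = run_acceptance_tests(tests)      # retest after the fix

print(first_run)  # ['no new CRC errors in event log']
print(retest)     # []
```

An empty defect list on the retest corresponds to the acceptance criteria being met.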
2: Obtain appropriate fault-finding tools

Fault-finding is a crucial skill in the life of the IT professional, no matter what area of IT you are in. Fault finding can be very challenging indeed, yet being able to solve a difficult problem can bring enormous satisfaction and recognition. The good news is that fault-finding skills can be developed. Fault-finding is a skill that will accompany you throughout your professional career.

The aim of this unit is to allow you to develop an understanding of fault-finding tools and methods. You will have an opportunity to practise using fault-finding tools and methods to solve real problems.

In this topic, you will have an opportunity to learn about tools that are used for fault-finding and troubleshooting purposes. You will also learn about generic cyclic fault-finding methods. Additionally, you will have an opportunity to practise fault-finding using commonly available tools for a range of computer systems, both standalone and networked.

Outcomes for this unit

After completing this learning pack you will be able to:
  • Analyse and document the system that requires troubleshooting
  • Research specifically designed troubleshooting tools for the system
  • Investigate generic cyclic fault finding tools
  • Obtain required specialist tools
Activity 2.1: Web search

This activity will require you to use the Internet to search for fault-finding tools that might be appropriate for an IT environment.

Use the following as search criteria:
  • One software-based tool that performs standalone PC diagnostics. This tool must be freeware/open source/GNU GPL.
  • One software-based tool that performs standalone PC diagnostics. This tool must be commercial.
  • One software-based tool that performs network diagnostics, for example, network discovery, packet capture and analysis. This tool must be freeware/open source/GNU GPL.
  • One software-based tool that performs network diagnostics, for example, network discovery, packet capture and analysis. This tool must be commercial.
Q: What fault-finding tools did you find that might be appropriate for an IT environment?

A: There are literally hundreds of software-based tools available. The real challenge is to sort through them all and find the ones that will enhance your ability to find problems and fix them. Some possible answers are listed below:
  • Sandra
  • Systemworks
  • Ethereal (since renamed Wireshark)
  • Fluke Network Inspector
  • Protocol Inspector
Activity 2.2: Hardware tools

The aim of this activity is for you to find out about hardware based tools that can assist you in the troubleshooting process. You will use the Internet, trade magazines and books to find out about hardware tools. Use the following criteria to narrow down your search.

You need to find:
  • A multimeter that can at least measure AC/DC voltage, resistance and continuity
  • A basic cable tester that can at least measure wiremap, continuity (opens/shorts) and length
  • A network analyser that can detect common errors (collisions, errors and utilisation/throughput)
Q: What hardware tools did you find that can help you in the troubleshooting process?

A: The sheer number of products available in the marketplace can be quite staggering. Pricing can vary significantly from product to product and from brand to brand.
You don’t need to purchase expensive products. The key is functionality and value for money. Some cable testing equipment can cost upwards of $5000! If you are a support technician, you might find that a basic model that performs basic tests will be more than adequate.

One word of advice, though: when buying economical testing equipment, make sure that it is approved for use in Australia. Many products imported into Australia (for example, very low-priced multimeters) might not comply with Australian safety standards.

Activity 2.3: Testing network connectivity

This activity will require you to use some commonly available software utilities for troubleshooting network connectivity problems. You will require access to a networked computer that runs TCP/IP. If you have an Internet connection at home, you may use this connection for the purpose of this activity.

Use the ping utility to test connectivity, as follows.

First, ping your own system. This checks that the TCP/IP software is working and configured correctly. Use one of the commands below:

ping localhost
or
ping 127.0.0.1
or ping your own IP address, which you can find out with the command
ipconfig
in Windows (ifconfig if using Linux/Unix).

Repeat the command, but this time add the output redirection >ping_localhost.txt:

ping localhost >ping_localhost.txt


This command will result in the output being saved to a file rather than being displayed on the screen. Check with your teacher as she might want a copy sent to her.

Next, try pinging some Internet sites. Try a few. Note that many websites will not answer.

This does not mean that there is a problem, as some websites block ping requests. For example:

ping www.cisco.com >cisco.com.txt
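If you need to run the same ping test repeatedly, it can be scripted. The sketch below uses Python, which is an assumption of this example and not part of the activity; the function names are also hypothetical. It relies on the -n option (Windows) and the -c option (Linux/Unix) of ping, both of which set the number of echo requests to send.

```python
import platform
import subprocess
from typing import List


def build_ping_command(host: str, count: int = 4, system: str = "") -> List[str]:
    """Return the ping command line for the given operating system."""
    system = system or platform.system()
    # Windows uses -n for the number of echo requests; Unix-like systems use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def ping_to_file(host: str, path: str) -> int:
    """Run ping, save its output to a text file and return the exit code."""
    result = subprocess.run(build_ping_command(host),
                            capture_output=True, text=True)
    with open(path, "w") as out:
        out.write(result.stdout)
    return result.returncode


print(build_ping_command("localhost", system="Windows"))
# ['ping', '-n', '4', 'localhost']
```

Calling ping_to_file("localhost", "ping_localhost.txt") would have much the same effect as the redirected command shown earlier. An exit code of 0 generally indicates that replies were received, but check the saved output to be sure.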

Next, try using the tracert command (traceroute if using a Unix-like system). The tracert command is very similar to ping, except that it does not just ping the destination, but every system in between the sender and the receiver. For example:

tracert www.cisco.com
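The tracert output lists one numbered line per hop, with asterisks where a system did not answer. As a rough sketch, the Python function below counts the hops and the time-outs in saved output. The sample text is invented for illustration (it uses reserved documentation addresses, not a real trace), and the function name is hypothetical.

```python
from typing import List, Tuple


def summarise_trace(output: str) -> Tuple[int, List[int]]:
    """Return (number of hops listed, hop numbers that timed out)."""
    hops = 0
    timed_out = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip headers, blank lines and the footer
        hops += 1
        if fields[1:4] == ["*", "*", "*"]:
            timed_out.append(int(fields[0]))  # all three probes went unanswered
    return hops, timed_out


# Invented sample in the general layout of Windows tracert output.
SAMPLE = """\
Tracing route to www.example.com [192.0.2.10]
over a maximum of 30 hops:

  1     1 ms    <1 ms     1 ms  192.0.2.1
  2    12 ms    11 ms    12 ms  198.51.100.1
  3     *        *        *     Request timed out.

Trace complete.
"""

print(summarise_trace(SAMPLE))  # (3, [3])
```

To analyse a real trace, save the output with tracert www.cisco.com >trace.txt and pass the contents of that file to the function.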

Q: How many hops did it take to get to Cisco? Did all the intermediate systems answer the requests? Did it time out?

A: This was a rather simple but very useful activity. The ping command is one of the most commonly used troubleshooting tools. Without a doubt, you will use ping regularly if you work on networked systems. The tracert command is also very useful but more specific, in that it analyses the route followed by network packets. The following screen dump is an example of a trace to www.cisco.com. Note that the trace stops (times out) before reaching Cisco (quite possibly due to security restrictions).

Another tool worth mentioning is Telnet. Telnet is regarded as a Layer 7 (application) tool, so a successful Telnet session exercises all seven OSI layers. If Telnet works, all the lower layers have at least basic functionality.
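A similar connectivity check can be scripted by opening a TCP connection to a given port. Note that this is not Telnet itself: the Python sketch below only tests whether a connection can be established, which is what a Telnet client does first. The function name is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs and name resolution failures.
        return False
```

For example, can_connect("www.cisco.com", 80) tests whether the web server port answers. A False result may mean a fault, but it may also mean that a firewall is blocking the port.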

Activity 2.4: Gathering system information

The aim of this activity is for you to practise gathering basic information about a computer system. All computer systems feature some form of utility that reports on system information. For instance, Windows features the System Information utility. Unix-like systems also include some form of tool, which can be run from within a GUI, that displays system information.

If you are using Windows, launch the System Information utility: click Start, then Run, type msinfo32 and press Enter. If you are using Linux/Unix, utilities will vary from distribution to distribution. Most distributions feature a ‘Control Panel’ tool that generally includes system information and reporting utilities.

Q: What are the built-in utilities in your computer system? Does your operating system allow you to save or export the output of commands to a text file?

A: This activity has allowed you to explore some of the built-in utilities in a computer system. Most operating systems feature some tool that enables you to view and save a report on the software and hardware components and the configuration of a system.
It is certainly worthwhile becoming acquainted with such tools, as they generally produce a wealth of very useful information.
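Basic system information can also be gathered with a short script and saved to a text file. The sketch below uses Python's standard platform module; choosing Python is an assumption of this example, and the report it produces is far less detailed than the output of a tool such as msinfo32. The function names are hypothetical.

```python
import platform


def system_report() -> str:
    """Collect basic system information as plain text, one item per line."""
    items = {
        "Operating system": platform.system(),
        "OS release": platform.release(),
        "OS version": platform.version(),
        "Machine type": platform.machine(),
        "Processor": platform.processor(),
        "Host name": platform.node(),
        "Python version": platform.python_version(),
    }
    return "\n".join(f"{name}: {value}" for name, value in items.items())


def save_report(path: str) -> None:
    """Write the report to a text file so that it can be kept or sent on."""
    with open(path, "w") as out:
        out.write(system_report() + "\n")


print(system_report())
```

Calling save_report("system_info.txt") stores the same text in a file. Some values, such as the processor name, may be empty on some systems.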

Key terms

Acceptance Test: An ‘Acceptance Test’ can be defined as a formal test conducted to determine whether or not a system satisfies its acceptance criteria and to enable the customer to determine whether or not to accept the system. Acceptance tests may also be known as Functional Tests.

Acceptance test criteria: Acceptance test criteria define the conditions that must be met before declaring success.

Action Plan: In terms of fault resolution and rectification, action plans are the summary of steps to be taken in order to solve a fault.

Application Software: Application software refers to software programs that are not part of the operating system. Application software is not critical to the functions of the computer system as a whole. Application software allows a user to perform a specific task such as word processing, accessing a database, or accessing the Internet and email.

Auditing: Operating System or Software Application feature that enables the capture of specific data for monitoring and testing purposes.

Boot Time Fault: Fault which might have its source in hardware or software, and that occurs during a computer system's start-up. Boot time faults may halt the normal start-up sequence, rendering a system unusable.

Cyclic Fault Finding: Generic method used during fault-finding and troubleshooting. The method is based on a series of steps aimed at diagnosing a problem by investigating symptoms. It is called cyclic because the process may be repeated from the beginning until the actual fault has been identified and satisfactorily solved.

Debugging: Operating System or Software Application feature which enables the detailed capture of data for troubleshooting purposes.

Decision Tree: A decision tree is a schematic aid (e.g. a flowchart) that takes the troubleshooter through a series of processes, presenting questions, making decisions and indicating suggested steps toward finding a solution.

Disaster Recovery and Contingency Plan: Set of policies and procedures formulated by a company, to deal with the recovery from major (critical) faults that are likely to impair normal business operations.

FRU (Field Replaceable Unit): Hardware device which is practical and reasonable to replace in the field, without the need to remove the failed device from customer premises.

FTA (Fault Tree Analysis): Fault tree analysis is the process of analysing a fault by using a decision tree. Decision trees can be constructed in advance, for common troubleshooting tasks or they can be constructed ad-hoc for new faults.

HTA (Hierarchical Task Analysis): HTA is a logical representation of a process and steps that must occur for this process to begin and finish successfully.

Hardware Fault: Fault which has its source in a hardware component.

ITIL (Information Technology Infrastructure Library): ITIL is a set of best practice standards for Information Technology service management. ITIL is controlled by the Office of Government Commerce (OGC) in the United Kingdom.

Non-Routine Fault: Non-routine faults are those faults that occur unexpectedly. These faults may be serious, so their criticality must be evaluated. A non-routine fault may trigger the enactment of a ‘Disaster Recovery and Contingency’ plan.

Response Time: The length of time between a request to a computer system and the response from the computer system.

Rollback strategy: Rollback and back-out plans are the strategies that you might need to implement if things do not work out. If the steps that you took as per your action plan weren’t effective and the fault is not resolved, you need to take a step back or rollback.

Routine Fault: Routine faults are those faults that are expected to occur somewhat regularly. Due to the fact that some problems can be foreseen, businesses may develop procedures and practices for dealing with problems considered routine.

Software Fault: Fault which has its source in a software component or application.

System software: Usually system software refers to software components that are part of the operating system. System software can sometimes be critical to the proper functioning of the operating system and system services.

The Scientific Method: Method that proposes logical and systematic analysis of data, with the aim of gathering information, stating the problem, formulating a hypothesis, testing the hypothesis and drawing conclusions. The scientific method forms the basis for cyclic fault-finding.
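The cyclic fault-finding method defined in the key terms above can be sketched as a loop that tests each hypothesis in turn and starts again from the beginning until one is confirmed. The Python example below is a minimal sketch; the function name, the suspected causes and the simulated test results are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Each hypothesis is (suspected cause, test that returns True if confirmed).
Hypothesis = Tuple[str, Callable[[], bool]]


def find_fault(hypotheses: List[Hypothesis], max_cycles: int = 3) -> Optional[str]:
    """Test each hypothesis in turn, repeating the cycle until one is confirmed."""
    for _cycle in range(max_cycles):
        for cause, confirmed in hypotheses:
            if confirmed():
                return cause
    return None  # fault not identified within the allowed number of cycles


# Simulated results of a test for an intermittent fault: it only shows up
# on the third attempt (invented data).
readings = iter([False, False, True])

hypotheses = [
    ("faulty cable", lambda: False),
    ("failing RAID disk", lambda: next(readings, False)),
]

print(find_fault(hypotheses))  # failing RAID disk
```

Repeating the cycle matters most for intermittent faults, where a single pass through the tests may not reproduce the symptom.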
