The Council

The Council has a Corporate Information Technology (CIT) division and five main Departments, housed across numerous sites, each with its own independent Information Technology Unit (ITU). Most systems in use are based on a Microsoft® platform.

The Council's computer network infrastructure is supported primarily by CIT, with desktop and application support devolved to the local ITUs.

CIT and each Department ITU are responsible for maintaining anti-virus security in their own areas.

Network-based computer systems are a key business tool for most of the several thousand Council staff, and it is essential to keep disruption to a minimum.

Staff connect to the network for work from both office and mobile (including home-working) environments.

A Help Desk Service provides a central fault reporting service.

Issues Facing the Council

There has been an increasing number of alerts about system vulnerabilities, along with corresponding patches to remove or reduce the risk of exploits being used in virus attacks.

CIT and the ITUs have a large number of disparate applications and systems to support, and each patch needs to be tested to ensure it does not have an adverse effect on them before it can be issued. However, resources to test systems are very limited.

Each Department is managed independently and, as well as the ‘official’ local ITUs, some sections have their own independent specialist IT support group.

On information technology issues not directly related to infrastructure, CIT can only advise Departments.

The Virus

The W32.Welchia.worm, also known as the W32/Nachi.worm, exploits a vulnerability in the operating system’s Remote Procedure Call (RPC) interface by scanning port 135 for target machines. It then uses an ICMP ping to locate a victim. Once the victim responds, exploit data is sent to it.

This exploit data creates a remote shell on the victim's machine, enabling the attacker to connect it to an infected machine on a Transmission Control Protocol (TCP) port in the range 666-765. Victim machines are then instructed to download the worm via Trivial File Transfer Protocol (TFTP).

The newly infected machine then sends packets of information across the local subnet to the RPC service running on port 135 to locate further vulnerable machines.

When a vulnerable computer gets these packets, the virus creates a buffer overflow and crashes the RPC service on that system. This can occur without the worm actually being on the machine.

The virus can also exploit a security vulnerability in an operating system component used by WebDAV, a set of extensions to the Hyper Text Transfer Protocol (HTTP) that provides a standard for editing and file management between computers on the Internet. Here too the attack works by creating a buffer overflow.

No matter what anti-virus detection application is used, the computer system is susceptible to a buffer overflow attack from an infected host machine unless the operating system vulnerability has been removed.

To remove the virus, the patch that removes the vulnerability must first be installed on the infected computer. An appropriate version of anti-virus software is then used to remove the virus itself.

A Summary of How Events Unfolded

The virus attack was first identified by CIT networking staff as infecting two departments. The local ITUs were informed and requested to install system patches and clean infected machines.

CIT’s network staff isolated the infected local area networks (LANs) and set up monitoring to identify infected machines. Other employees were informed of the situation.

After about a week, with ITU staff working overnight and at weekends to clean infected machines and ensure others were protected, it appeared the infection had been contained, with a reducing number of machines being affected.
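
The report does not say how the monitoring was done. As a minimal sketch of one possible approach, assuming connection records are available as (source, destination, destination port) tuples, a machine could be flagged when it contacts an unusually large number of distinct addresses on port 135. The record format, the function name and the threshold of 20 are all assumptions made for illustration, not details from the incident.

```python
from collections import defaultdict

RPC_PORT = 135          # port the worm probes, per the description above
FANOUT_THRESHOLD = 20   # assumed cut-off; tune to the site's normal traffic


def suspected_hosts(flows, threshold=FANOUT_THRESHOLD):
    """Return source addresses that contacted at least `threshold`
    distinct destinations on RPC_PORT, sorted for stable output.

    `flows` is an iterable of (source, destination, destination_port)
    tuples, a hypothetical record format used only in this sketch.
    """
    targets = defaultdict(set)
    for source, destination, port in flows:
        if port == RPC_PORT:
            targets[source].add(destination)
    return sorted(src for src, dsts in targets.items() if len(dsts) >= threshold)


if __name__ == "__main__":
    # One host sweeping 30 neighbours, one host making a single RPC call.
    sample = [("10.0.1.5", f"10.0.1.{n}", 135) for n in range(10, 40)]
    sample.append(("10.0.1.7", "10.0.1.2", 135))
    print(suspected_hosts(sample))
```

A simple fan-out count like this will also flag legitimate servers that talk to many clients over RPC, so any real deployment would need an allow-list and a threshold based on observed normal traffic.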

Then the virus spread wider and more quickly as staff returned to work following summer holidays. Cleaned machines were being re-infected, and the network and support teams were being inundated with fault calls as the user community became less patient with slow network services.

The attempt to minimise disruption of service whilst resolving the virus problem had only partially succeeded. A more direct, intrusive approach was therefore applied.

Staff in all departments were advised of what was planned and were kept informed at regular intervals as the plans were put into action.

First, each LAN was closed down and all network components on it were cleaned and protected against re-infection.

As each LAN was reopened for staff use, any device or computer found to be infected was immediately isolated by network staff until it could be fixed by the appropriate ITU.

The virus has now been defeated but monitoring continues to ensure no vulnerable device has been overlooked.

Some Issues Raised

As events unfolded several serious issues became apparent.

  • The lack of a standard build on computer equipment delayed the fixing of vulnerable systems.

  • Despite corporate-wide policies to the contrary, some computers did not have anti-virus software and others were using out-of-date versions.

  • Some machines were incorrectly patched, causing them to be re-infected.

  • Delays in clearing the virus were partly due to a lack of effective central control, as there were five separate command structures to deal with.
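
Several of these issues (missing or out-of-date anti-virus software, incorrectly patched machines) are the kind of thing a routine inventory check can surface. The sketch below assumes a simple inventory record per machine. The `Machine` fields, the dotted version format and the `audit` function are hypothetical and are not drawn from the Council's systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    department: str
    av_version: Optional[str]    # None means no anti-virus software found
    rpc_patch_installed: bool    # True once the vulnerability patch is on


def parse_version(text):
    """Turn a dotted version such as '3.10' into a comparable tuple (3, 10)."""
    return tuple(int(part) for part in text.split("."))


def audit(machines, minimum_av_version):
    """Return {machine name: [problems]} for machines needing attention."""
    minimum = parse_version(minimum_av_version)
    findings = {}
    for machine in machines:
        problems = []
        if machine.av_version is None:
            problems.append("no anti-virus software")
        elif parse_version(machine.av_version) < minimum:
            problems.append("out-of-date anti-virus software")
        if not machine.rpc_patch_installed:
            problems.append("vulnerability patch missing")
        if problems:
            findings[machine.name] = problems
    return findings


if __name__ == "__main__":
    inventory = [
        Machine("pc-001", "Housing", "3.72", True),
        Machine("pc-002", "Housing", None, True),
        Machine("pc-003", "Education", "3.9", False),
    ]
    for name, problems in audit(inventory, "3.72").items():
        print(name, "->", ", ".join(problems))
```

Versions are compared as tuples of integers rather than as strings, so that "3.9" is correctly treated as older than "3.72". A check of this kind only helps if the inventory itself is complete, which is where a common build and clear ownership come in.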

Lessons Learnt

The incident, the first serious one involving virus attacks, resulted in a full review of actions taken and a review of service provision and business continuity.

Some of the issues reviewed were:

  • Identifying a problem owner at a sufficiently high level to ensure all departments comply immediately and appropriately with the problem resolution actions identified.

  • Ensuring all systems are of a common build, with common application version numbers where appropriate.

  • Improving crisis communications.

  • Developing a common framework to manage vulnerabilities and the patching of systems.

  • Looking at various British Standards such as BS7799 Information Security Management and BS15000 IT Service Management Standard, for guidance on improving risk and service management.

 

We Did it Right, but it Went Wrong

Case Study of a patching exercise that didn't go according to plan

The following event demonstrates the risks in supporting just part of a secure IT environment for approximately three thousand email users.

The scenario: We run a server that provides mail scanning, both inbound and outbound. It filters out “Spam” and checks content for suitability for forward processing (delivery). Should a message fail this check, it is held for inspection and subsequent processing. The process also checks for email messages that contain viral infections or worms and deals with them accordingly.

The technology used: Microsoft Windows 2000 Advanced Server, Clear Swift Mail Sweeper and Sophos Anti Virus. At the time of the incident there was also a run-time copy of the Microsoft SQL 2000 database, supplied as part of the Clear Swift product. All of this is housed on a rack-mounted Compaq (HP) DL380 server, which sits in our DMZ.

It should be noted that we keep the Anti Virus signatures up to date, and keep up with service and security packs from Clear Swift and Microsoft, applying them after testing on a “like” machine.

What happened: On the day in question, at around 7:30am, we responded to a security alert from Microsoft that came with a security update. We tested this update per our procedure and found that it installed and worked correctly. We then installed it on our “live” server, which ceased to work following a reboot: it was unable to reload its operating system fully and went into a reboot cycle.

What happened next: The senior technician did some initial diagnostic work and sought the support of a Microsoft Server specialist from the main IT support team. Between them they worked empirically through a number of ideas to resolve the problem, including reversing the update, running the system repair option, and reinstalling or reconfiguring some of the software on the server. None of this restored the machine to correct operation.

At around lunch time, it was decided to cease the patch / test / adjust approach and go for a complete rebuild of the server. The server had originally been configured by a contractor, but the two technicians were familiar with the software and the configuration and so, with supporting documentation, commenced the rebuild.

It was agreed that the rebuild would differ slightly from the original specification: a full version of Microsoft SQL 2000 would be used rather than the run-time copy, as the run-time component had been disappointing in terms of reporting on the work of the Mail Sweeper product.

By mid afternoon things were looking better. The operating system was installed and fully patched, MS SQL 2000 was installed and patched, Sophos Anti Virus was installed and fully up to date with the necessary signature files, and the Mail Sweeper software was loaded. The routes for the movement of email between this server and both the almost redundant MS Exchange system and the new GroupWise SMTP server were proven, as was the route to the Internet.

Some complexity in the relationship between the Mail Sweeper data and the Anti Virus product was overcome through conversations with the vendors and the contractors who originally set the server up.

The server was made live for a short period to test that all was working with “real” data. This processing was carefully studied and, after agreement between the two technicians and me, we decided to go back to “normal operation” at around tea time.

Outcomes:

  • We have now formulated a fall-back plan in case this happens again. It involves quickly setting up a fall-back server with just anti-virus scanning on it while the main system is rebuilt. We also resolved to review further resilience in this area, and other methods of achieving content and virus scanning.

  • We had upgraded from a run-time to a full version of MS SQL 2000, which helps with reports from the Mail Sweeper product.

  • We had tolerance tested the lack of email! We found (with no surprise) that whereas the loss of inbound and outbound email for a day a year ago was “not good, but not the end of the world”, it was now in fact “the end of the world”....

  • Due to time constraints and the need to get systems back online, we never did get to the bottom of what went wrong, although we have many ideas.

  • We now have a clearer focus on when to feed information in to the WARP. This data would have been useful at the point the problem occurred, again around lunch time in this instance, and finally when all was up and running again.