«Dehumanized» data center
We have come a long way in the development of our data centers. At first, we consolidated our information systems in data centers located within synchronous replication distance of each other, to avoid losing data in the event of a failure.
A power failure in Moscow in May 2005 made everyone in the country, including us, think about reserving our information systems in a data center that would not be affected by a disaster in any single region. We reviewed many options and settled on a data center in Zurich: it lies at a fair distance, has redundant communication channels and power supply, and sits in an earthquake-safe area. It is also within easy operational reach of the service organizations of most equipment manufacturers.
On the other hand, we had no administrators of our own there. This forced us to rethink the approaches to data center maintenance that were traditional at the time and required a duty shift in close proximity to the equipment. Then we asked ourselves: how would we organize the data center if it were located on the Moon or on Mars? Not an easy place to visit often! We realized that we needed fully remote management of the information systems, the virtualization platform, and the physical hardware: servers, storage systems, backup systems, and switches. We selected equipment with remote power management and fitted the racks with remotely controlled power distribution units (PDUs).
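To give a sense of what remote power management looks like in practice, here is a minimal Python sketch that drives a server's power state over IPMI through the standard ipmitool utility. The host name, credentials, and the idea of wrapping the call in a script are assumptions made for the example, not a description of our actual tooling.

```python
import subprocess

def ipmi_power(host: str, user: str, password: str, action: str) -> str:
    """Query or change a server's power state via its BMC over the network.

    action is one of the standard ipmitool verbs: "status", "on", "off", "cycle".
    """
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
         "chassis", "power", action],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Example: check (and, if needed, restore) power without anyone entering the
# data center. "bmc1.example.org" and the credentials are placeholders.
print(ipmi_power("bmc1.example.org", "admin", "secret", "status"))
```

A remotely controlled PDU closes the same loop for devices that have no management controller of their own: the outlet itself can be switched over the network.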
We got down to business: we bought the appropriate equipment, doubled up network access, set up video surveillance, and signed agreements with a data center management company and with the service organizations that support the equipment. Equally important, we connected all the equipment and information systems to our centralized monitoring system. We are pleased with it: it regularly polls every node of the infrastructure (physical and virtual), tracking only the parameters that matter for operation. At the same time, it maintains a tree of dependencies between components, so that once it finds the failed component, it immediately determines exactly which processes are affected.
The monitoring system tracks thousands of components across dozens of parameters. It is very important not to overload administrators with information, so we decided that the monitoring system should show only the part of the service affected by the failed component. As a result, the round-the-clock duty service, located thousands of miles from the data center, quickly learns that a failure has occurred and sees in the monitoring system exactly what caused it.
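To make the dependency-tree idea concrete, here is a small Python sketch. The component names and the topology are invented for illustration only; the point is that a single failed node is reported as the root cause together with the services it actually impacts, rather than as thousands of secondary alarms.

```python
# Each component lists the components it directly relies on.
# Names and structure are purely illustrative, not our real inventory.
DEPENDS_ON = {
    "billing-service":  ["app-server-1", "oracle-db"],
    "customer-portal":  ["app-server-2", "oracle-db"],
    "app-server-1":     ["esx-host-1", "core-switch-1"],
    "app-server-2":     ["esx-host-2", "core-switch-1"],
    "oracle-db":        ["storage-array-1", "core-switch-1"],
    "esx-host-1":       [], "esx-host-2": [],
    "storage-array-1":  [], "core-switch-1": [],
}

def affected_by(failed: str) -> set[str]:
    """Return every component that directly or indirectly depends on `failed`."""
    hit: set[str] = set()
    changed = True
    while changed:
        changed = False
        for comp, deps in DEPENDS_ON.items():
            if comp not in hit and (failed in deps or hit & set(deps)):
                hit.add(comp)
                changed = True
    return hit

# One poll cycle finds the failed node; the duty shift sees the root cause
# and only the services it impacts.
failed_component = "storage-array-1"
print("root cause:", failed_component)
print("impacted:", sorted(affected_by(failed_component)))
```

Running this reports the storage array as the root cause and lists the database and the two services built on top of it as the impacted part of the tree.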
We travel to the remote data center once a year to install new equipment, replace hardware that has reached the end of its life, and recable it. That week is full of fun and intense work. In all the time the data center has been in service, we have never had to travel there to deal with an outage. Only once did we fail to remotely power on the good old SF25K, which had shut itself down to protect against overheating during a cooling system failure. That server physically flips a toggle switch that can only be turned back on by hand, so we had to ask the staff of the company operating the data center to walk over to the server and switch the power back on. During the year a service company comes on site to replace failed components, and even those visits can be arranged remotely.
We are proud of the way we organized the operation of this data center. We are proud that we do not need people on site to maintain the equipment. That is why we proudly call it "dehumanized".