Archive for the ‘Outages’ Category

ISG Helpdesk Service Interruption

Monday, December 17th, 2012

UPDATE: the power test has been cancelled. Helpdesk duty as usual.

On Thursday, December 20, ETH facility services will conduct extensive power network tests in the HPT building, where ISG (and hence the helpdesk) is located. We were told to expect at least one power cut lasting at least 15 min, possibly longer. During this time we will not be able to answer the helpdesk phone or work on your tickets. We'll post an update when power is back.

Power outage Monday evening + cleanup

Tuesday, November 20th, 2012

On Monday evening (19.11.2012) around 18:30, a power outage in the HIT server room shut down most of our core infrastructure servers. Apparently the building automation system had turned off the cooling in HIT D 13, and when the temperature in the server room reached 37 °C, there was an emergency power cut. After the electricians had restored power around 21:00, we started bringing our servers back up. Around 23:00 most of the services were back, with the exception of the main web server, which we managed to recover on Tuesday around 9:00. Webmail also took a bit longer.

We apologize for any inconvenience.

Maintenance downtime for group share fileserver

Wednesday, September 5th, 2012

In order to upgrade the operating system and the server hardware, we have scheduled a maintenance downtime on

Wednesday, 12. September 2012, starting at 17:00 and lasting for several hours.

During this time, you will not have access to the group directories.

We apologize in advance for any inconvenience this service interruption might cause.

HIT Building: Electric Power Interruption on Wednesday, 25th of July

Monday, July 16th, 2012

Due to maintenance work relating to the electric power supply of the HIT building, there will be a planned interruption from 5:00am to 8:00am on Wednesday the 25th of July.

Please note that the whole HIT building will be without electric power during this time (the server room HIT D 13 is excepted from this interruption). Shut down your computer and switch off your electrical devices (use the main switch if available, or unplug them) in advance to avoid local data loss and to help prevent start-up peaks when electric power is switched back on.

HIT Building: Network Interruption next Friday Morning, 9th of March

Tuesday, March 6th, 2012

ID-Kom plans to upgrade the access routers of the HIT building next Friday morning (9th of March) between 6:00 and 7:30am. This will cause a network interruption of about 15 minutes in the HIT building at some point during this window.

All D-PHYS servers located in HIT D 13 are not affected by this interruption and remain reachable from outside the HIT building at all times.

Short maintenance downtime on Sun, Feb 12

Friday, February 10th, 2012

Yesterday's outage could be traced to a flaky voltage controller on one of our RAID adapters. We have scheduled a short maintenance downtime on

Sunday, Feb 12, around 13:00

in order to replace the faulty controller. Most services will be affected.

Update 14:57: Cleanup took a bit longer than expected, but now all systems are back again.

Hardware failure

Thursday, February 9th, 2012

Severe hardware failure on one of our core infrastructure servers. We're working on it.

Update 11:44: The hardware problem could be fixed and all services are recovering now. Sorry for the downtime.

Emergency file server migration

Thursday, January 12th, 2012

On Jan 5, after weeks of thorough planning and rigorous testing, we performed a migration of the home directories and group shares to our new SAN system. Soon afterwards, the first phone calls started coming in. The initial problem was very exotic and affected very few people (that's why we had no chance to detect it during the testing period), but the action we took to address it unfortunately caused a cascade of consecutive faults that led to the instabilities you had to endure for one week now and for which we are truly sorry. We now know how to fix the underlying problem, but we cannot operate on the running server. That's why we have to schedule an

emergency file server migration on Sat, Jan 14, starting at 07:00 and probably lasting well into the afternoon.

During this time, you will not have access to your home or group directories, and also email will only work intermittently. Please stop all running jobs and log out before Saturday morning.

We apologize for the suboptimal performance since Jan 5. You have every right to expect better, but this caught us completely off guard. Thank you for your understanding.

Update, Sat 14:15: mounts and email are up and running again. The problem on 32bit machines still persists, but we have an idea how to fix it on Monday.

Update Fri 20.01: we (hence you) are still suffering from severe stability problems on the file server. We are very hard at work and now have a plan that we really really hope will solve the problems. There will be another migration sometime next week. We're truly sorry for the inconvenience you have to endure.

Emergency downtime of D-PHYS mail server today

Tuesday, December 27th, 2011

There will be an emergency maintenance downtime of the D-PHYS mail server later today (Tuesday, 27th of December) due to unexpected hardware issues.

Update 16:30: Everything seems to work fine again. It is likely, though, that there will be another downtime for further maintenance in early 2012.

Migration of Home Directories

Thursday, December 22nd, 2011

In order to gain more flexibility and performance, the home directories will move to our new SAN setup.

This will be done on Thursday, 5. January 2012, between 18:00 and 22:00.

During this time the home directories (winhome, machome, unixhome), the mail services and some websites will not be available.

To protect you from losing or corrupting any of your files, we strongly recommend you close all open files on the home directories before the migration.

Since we have switched to generic names for our services, the home directories will still be accessible the same way as before after the migration is over, so you don't have to change anything.

Update, Jan 10: We are experiencing some unexpected and puzzling problems with 32bit binaries (and therefore, 32bit machines). The symptoms range from not being able to log in (GNOME and KDE) to acroread and mathematica not starting. Workarounds while we're working on a solution: for failing logins, please call us. For acroread, use evince instead. For mathematica, log in to a 64bit machine, e.g. login.phys, and start it remotely.

Update, 21:25: The migration is finished and everything should work again! In case of problems please contact the ISG Helpdesk (3 26 68)