Posts Tagged ‘service interruption’

Group share woes

Friday, December 8th, 2017

Update 20.12.: the strange intermittent permission problems some of you experienced could be traced back to a kernel regression. We're now back to using an older kernel.

Update 13.12.: we're cautiously optimistic that the problems have been fixed. Since Monday the file server has survived everything we threw at it. The culprit seems to be an Infiniband switch that sporadically disconnected under heavy load. We're now also turning on some performance improvements again, so you should see a speed increase when browsing files.

Update 06:45: group shares are back. Please let us know if you encounter any problems.

As some of you might have noticed, we've had some service quality issues with our group share server in the last few months. While not all interruptions are under our control (Informatikdienste lately have been very busy upgrading the ETH network, causing various network disruptions), we do have a problem with the group share server: it runs fine for weeks on end until it suddenly doesn't. To this day we have not been able to pinpoint the underlying problem, despite having changed a lot of parameters, both software and hardware. Our next step will be replacing the kernel on the disk backends and switching some hardware. For that we need a scheduled downtime on

Monday, December 11, starting at 06:00

during which the group shares will be unavailable for about 90 minutes. This affects all D-PHYS and IGP shares except the Astro and newly migrated IPA ones. We will post an update when the system is back.

We do apologize for the inconvenience these service issues might have caused you. Please bear with us while we try to locate and eliminate the root cause. We're monitoring the situation 24/7 and trying to react as quickly as possible whenever a problem occurs. But wait! You can help! There seems to be a correlation between crash probability and large-scale small-file I/O. This means you should, whenever possible, avoid reading or writing a lot of small files and instead bundle your data into fewer, larger files. This also increases performance!
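If you're wondering what "bundling" could look like in practice, here is a minimal sketch in Python that packs a directory full of small files into a single compressed tar archive using the standard `tarfile` module. The function name `bundle_directory` and the paths in the usage note are hypothetical, chosen for illustration only; adapt them to your own workflow.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir: str, archive_path: str) -> int:
    """Pack every regular file under source_dir into one gzip-compressed tar archive.

    Hypothetical helper for illustration. Returns the number of files added.
    """
    source = Path(source_dir)
    archive_file = Path(archive_path).resolve()
    count = 0
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories, and skip the archive itself in case it
            # was created inside source_dir.
            if not path.is_file() or path.resolve() == archive_file:
                continue
            # Store paths relative to source_dir so the archive unpacks cleanly.
            archive.add(path, arcname=path.relative_to(source).as_posix())
            count += 1
    return count
```

For example, `bundle_directory("results/run42", "run42.tar.gz")` (made-up paths) would leave you with one large file to copy to the group share instead of thousands of small ones. Ideally you build the archive on local disk first and copy only the finished archive to the share.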

Server room migration on Wed, Aug 23

Tuesday, July 25th, 2017

Update Thursday 01:45: we hit some unexpected problems with the non-Astro group shares. Everything is back now, please let us know if you experience any problems.

Some months ago, we were informed by Informatikdienste that we would have to migrate our two water cooled racks in the HIT server room due to upcoming remodeling. This move will take place on

Wednesday, August 23, starting at 16:00

and last for several hours. During this time, all our IT services will be unavailable, including login, e-mail, storage and ISG-hosted websites. Incoming e-mail will be held back and delivered afterwards. We will do our best to have login and e-mail back up within the first two hours, but group drives will take a bit longer due to the sheer amount of hardware we have to move.
We apologize for any inconvenience. Unfortunately, this migration cannot be performed on a weekend as we might have to interact with our colleagues at Informatikdienste, but it will ensure secure and enduring operation of our servers in the future.

some impressions from the migration - thanks to the whole team!

Maintenance window on Monday, September 5, 17:00

Tuesday, August 30th, 2016

In order to perform some core service upgrades, we have scheduled a server maintenance window on

Monday, September 5, starting at 17:00 and lasting for approximately 3 hours.

Most D-PHYS IT services will be affected by that downtime, including logins, file servers and e-mail services.
E-mails coming in during the downtime will be held on the sender’s side and will arrive at D-PHYS with a delay. Sending e-mails won’t be possible during the window.

We’ll update this posting as soon as things are back to normal.

Update Monday 18:30: We managed to complete the migration ahead of time; everything should be back to normal. If you still encounter any problems, please let us know.

Severe server failure

Wednesday, August 27th, 2014

sometimes it just has to work, and fast!


UPDATE Thu 09:30 - all systems should be back to normal. Please let us know if you still encounter problems. Thanks to Axel and Paddy for their commitment and the incredible Dalco service for fixing it within 6h (at 8am, mind you).

UPDATE Thu 00:50 - a broken valve blocked the cooling water in the HIT D 13 server room and all 14 water cooled racks severely overheated (not just D-PHYS). We managed to revive almost all services with the exception of the GGL file shares (this server is dead). We'll post updates later today when we have more information.

Complete loss of cooling in the server room. We have yet to assess the damage.

Maintenance Downtime of D-PHYS Mail Server on 9-Jan-2014

Monday, January 6th, 2014

On Thursday, the 9th of January 2014, starting in the late afternoon, we will run multiple software updates on the D-PHYS mail server. We do expect multiple downtimes throughout the evening, partially of single mail services, partially of the whole mail server.

This will likely also delay the delivery of incoming mails by up to several hours.

Update, 22:30: Everything back to normal.

General IT services downtime on Wed Sep 11 17:00

Tuesday, September 3rd, 2013

UPDATE Thu 12.09. 07:30 If you're trying to connect to a SMB share from an unmanaged Windows machine, you have to use "ad\USERNAME" instead of just "USERNAME" from now on.

UPDATE 21:15 apart from the IGP group shares (which will be back in a few hours) all systems are back to normal. Please let us know if you experience any problems.

In order to upgrade the operating system on several core infrastructure servers of the Department, we have scheduled a general maintenance downtime on

Wednesday September 11, starting at 17:00, lasting for several hours.

Most services will be affected and unavailable during that time, as they require an authentication with your D-PHYS account (email, file server, print server, managed workstations). Note that, even though you will not be able to check your emails or send new ones, all incoming mails will be received and safely delivered to your inbox afterwards.

Please make sure to save all open documents before 17:00 on that day.

Since we will also change the way file server mounts are authenticated, users who haven't updated their passwords in a very long time might not be able to mount their home directories or group shares after the migration. If you run into this problem on Thursday morning, please first change your password. If the issue persists, contact us.

We will post an update when things are back to normal.

ISG Helpdesk Service Interruption – Reprise

Tuesday, February 12th, 2013

Apparently this time they mean it:

On Thursday, February 14, ETH facility services will conduct extensive power network tests in the HPT building, where ISG (and hence the helpdesk) is located. Power will be gone for ~ 2 hours starting around 13:30. During this time we will not be able to answer the helpdesk phone or work on your tickets. We'll post an update when power is back.

Update, 10:30h: We're already offline because power has already been cut on other floors of the building, causing a network outage.

ISG Helpdesk Service Interruption

Monday, December 17th, 2012

UPDATE: the power test has been cancelled. Helpdesk duty as usual.

On Thursday, December 20, ETH facility services will conduct extensive power network tests in the HPT building, where ISG (and hence the helpdesk) is located. We were told to expect at least one power cut lasting at least 15 min, possibly longer. During this time we will not be able to answer the helpdesk phone or work on your tickets. We'll post an update when power is back.

Power outage Monday evening + cleanup

Tuesday, November 20th, 2012

On Monday evening (19.11.2012) around 18:30 a power outage in the HIT server room shut down most of our core infrastructure servers. Apparently the building automation system had turned off the cooling in HIT D 13 and when the temperature in the server room reached 37 °C, there was an emergency power cut. After the electricians had restored power around 21:00, we started bringing our servers back up. Around 23:00 most of the services were back, with the exception of the main web server, which we managed to recover on Tuesday around 9:00. Webmail also took a bit longer.

We apologize for any inconvenience.