Posts Tagged ‘downtime’

File server migration

Tuesday, January 24th, 2012

To solve our recent file server problems, we have scheduled another migration on

Wednesday, January 25, starting at 17:00 and lasting for several hours.

During this time, you will not have access to your home or group directories, and email will work only intermittently. Please stop all running jobs and log out prior to the migration.

Update, 20:30: Migration completed. Every test we could think of passed. Please let us know if you find any remaining issues. Thanks for your patience.

Emergency file server migration

Thursday, January 12th, 2012

On Jan 5, after weeks of thorough planning and rigorous testing, we migrated the home directories and group shares to our new SAN system. Soon afterwards, the first phone calls started coming in. The initial problem was very exotic and affected very few people (that's why we had no chance to detect it during the testing period), but the action we took to address it unfortunately caused a cascade of further faults that led to the instabilities you have had to endure for a week now, and for which we are truly sorry. We now know how to fix the underlying problem, but we cannot operate on the running server. That's why we have to schedule an

emergency file server migration on Sat, Jan 14, starting at 07:00 and probably lasting well into the afternoon.

During this time, you will not have access to your home or group directories, and email will work only intermittently. Please stop all running jobs and log out before Saturday morning.

We apologize for the suboptimal performance since Jan 5. You have every right to expect better, but this caught us completely off guard. Thank you for your understanding.

Update, Sat 14:15: Mounts and email are up and running again. The problem on 32-bit machines still persists, but we have an idea of how to fix it on Monday.

Update, Fri 20.01: We (and hence you) are still suffering from severe stability problems on the file server. We are hard at work and now have a plan that we really, really hope will solve the problems. There will be another migration sometime next week. We're truly sorry for the inconvenience you have to endure.

Emergency downtime of D-PHYS mail server today

Tuesday, December 27th, 2011

There will be an emergency maintenance downtime of the D-PHYS mail server later today (Tuesday, 27th of December) due to unexpected hardware issues.

Update, 16:30: Everything seems to work fine again. It is likely, though, that there will be another downtime for further maintenance in early 2012.

Mail Server Upgrade on Wednesday, 5th of October

Monday, October 3rd, 2011

On Wednesday, 5th of October 2011, starting at 16:30, we will upgrade the operating system on several servers of our mail server cluster. This will result in temporary unavailability of most e-mail-related services we provide: sending and receiving e-mails, mailing lists and webmail access.

Due to the maintenance, e-mails may be delayed and arrive a few hours later than usual.

Update, 21:55: The upgrades of the incoming and IMAP/webmail/mailing list servers were successful. Everything is back to normal.

New SSL and HTTPS certificates for many ISG D-PHYS services

Thursday, September 30th, 2010

In the past, all HTTPS-secured web sites hosted or provided by us used certificates issued by ourselves. This caused unsettling warnings in most browsers unless the user had manually added the root certificate of our certification authority (CA) to their web browser.

To allow SSL certificates other than those signed by ourselves, namely certificates automatically accepted by all browsers, but also community-backed CACert certificates issued by ETH ID, we will change the configuration of our web server zwoelfi this evening. This may cause short interruptions to some of the hosted sites, but these should not last long.

Some of these web sites will already get new SSL certificates issued by QuoVadis (accepted by nearly all browsers by default) this evening.

Update Friday, 1. Oct. 2010, 21:00h: Due to several unexpected issues with the new QuoVadis certificate, the web server is, for now, running again with the old ISG-signed SSL certificate on all virtual hosts.

Update Thursday, 7. Oct. 2010, 23:00h: Most of the issues with the new QuoVadis certificate have now been solved, and all virtual hosts planned for the QuoVadis SSL certificate are using it again.

File server maintenance on Wed Aug 4, 22:00

Monday, August 2nd, 2010

A firmware update on the RAID controller of our file server requires a reboot of this server. We have scheduled the reboot for Wednesday, August 4, at 22:00. The downtime should take no more than 30 min. Please make sure all your data has been saved by then.

We apologize for any inconvenience.

Update, 22:20: The server has rebooted and everything should work again.

Short maintenance downtime of LDAP server on Mon Aug 2

Thursday, July 29th, 2010

On Monday, August 2nd, starting at 18:00, we need to modify our LDAP user database to incorporate structural changes needed for a new service we're currently setting up. This will cause a downtime of about 1 h, probably less, that will affect user logins, email and file server access. We will post an update when things are back to normal.

We apologize in advance for any inconvenience this service interruption might cause.

Update, 18:30: Things are now back to normal. 🙂

Short downtime for plempy, plompy and plumpy on Monday Aug 2

Tuesday, July 27th, 2010

On Monday, August 2nd, starting at 07:00, our terminal server / computation nodes plempy, plompy and plumpy will be moved into the water-cooled racks in the HIT server room. These machines will be down for about 30 min. If your thin client connects to plompy or if you're performing calculations on plempy or plumpy, please make sure your data has been saved by Monday morning. After the move, the trio will enjoy the amenities of our most advanced server room that only another thunderstorm could disrupt.

Update, Mon Aug 2, 08:40: All servers have reached their final destination.

Major outage due to water ingress

Monday, July 5th, 2010


This morning around 03:00, water ingress in our HIT server room shut down most of our essential infrastructure servers. As soon as power was back, around 08:00, we started to bring our services back online.
Please let us know if you still experience any problems. We apologize for the inconvenience. I guess water and servers just don't mix very well.

Status, 12:14: Apart from the BackupPC server, everything should be working again.

Maintenance Downtime of IDL License and Condor Master Server on 14. April, 5 pm

Tuesday, April 13th, 2010

Because of hardware problems with one of our infrastructure servers, we couldn't perform the planned software upgrade on the IDL license server and Condor master server during the big maintenance downtime last week.

Those hardware problems have now been fixed, so we will install the software upgrade on the IDL license and Condor master server tomorrow, Wednesday, 14th of April 2010, starting at 5 pm. The maintenance downtime will last approximately two hours.

Update, 20:20h: The IDL license and Condor master servers are both back online.