Maintenance downtime for group share fileserver

In order to upgrade the operating system and the server hardware, we have scheduled a maintenance downtime on

Wednesday, 12. September 2012, starting at 17:00 and lasting for several hours.

During this time, you will not have access to the group directories.

We apologize in advance for any inconvenience this service interruption might cause.

Upgrade of Roundcube webmail

For some years now we have been providing you with the nice webmail solution of the Roundcube project. Last night they released a new major version which we will install

tomorrow Thursday, August 9, at 07:00.

Note that for about 30 minutes you won't have access to Roundcube. The new version brings a very nice new theme (see screenshot) which we will enable by default. If for some reason you'd like to keep the old one, you can switch back under Settings -> User Interface -> Interface skin.

Thu 07:15 Upgrade completed.

New hardware for the laptop and lab PC backup service

We just replaced our ailing backup server for laptops and lab PCs in place with new, more powerful hardware running a newer version of the same proven BackupPC software. Its web interface is available under the same URL.

The new server will start making backups from scratch for every host. Backups on the old server will still be available for a while at the address

Our old backup server became a victim of its own success. In the end it handled backups for over 100 hosts every day (over 160 during its whole lifetime) and stored about 10 terabytes of backups.

The BackupPC server of the Institute for Astronomy is not affected.

HIT Building: Electric Power Interruption on Wednesday, 25th of July

Due to maintenance work relating to the electric power supply of the HIT building, there will be a planned interruption from 5:00am to 8:00am on Wednesday the 25th of July.

Please note that the whole HIT building will be without electric power during this time (the server room HIT D 13 is exempt from this interruption). Shut down your computer and switch off your electrical devices (use the main switch if available, or unplug them) in advance to avoid local data loss and to help prevent start-up peaks when electric power is switched back on.

Login Server Downtime

Apart from the file, mail and web servers, there is another crucial element of our core server infrastructure, namely the servers managing the account information and logins (LDAP). With the current uptime at a remarkable 550 days, it is time for an upgrade of the operating system and high-availability cluster software. For this reason we have scheduled a downtime on Wednesday 18th July between 07:00 and 08:00.

Most services will be affected and unavailable during that time, as they require authentication with your D-Phys account (email, file server, print server, managed workstations). Note that, even though you will not be able to check your emails or send new ones, all incoming mails will be received and safely delivered to your inbox afterwards.

If everything goes smoothly, the actual downtime should be considerably less than the scheduled hour.

Short Mail Server Downtime on Friday Late Afternoon

We have scheduled a downtime of the D-PHYS mail server for hardware maintenance on Friday, 1st of June 2012, starting after 4pm and lasting approximately half an hour. During the downtime sending and receiving mails will not be possible. Incoming mails will be slightly delayed.

Update 19:30h: Everything is back up and looks fine. Performance may be slightly reduced for a day or so until the RAID has rebuilt.

Preparations took longer than expected, so we started later. The maintenance downtime itself also took longer due to some unexpected issues which had to be resolved before we could continue and start all services again. We're sorry for any inconvenience this may have caused.

Mobile printing

Until now, it was not possible to print on D-PHYS printers from mobile devices while on the road. Since several people expressed their interest in such a possibility, we have created two methods that allow you to do just that: read more.
As there's no common standard for mobile printing, certain restrictions apply. If you find yourself with an email that you think should print but doesn't, please let us know.

The Art of Scaling

Note: this is a purely anecdotal posting about our struggles with some performance bottlenecks in the last few months. If you're not interested in such background information, just skip.

You might have noticed that since about January 2012 using our file and mail servers hasn't been as smooth as usual. This posting will give you some background information concerning the challenges we encountered and why it took so long to fix them. Let's begin with the file server.

Way back in the day (i.e. 5 years ago), when the total file server data volume at D-PHYS was about 10 TB, we used individual file servers to store this data. When one server was full, we got a bigger one, copied all the data and life was good for another year or two. Today, the file server data volume (home and group shares) is above 150 TB and growing fast, and this strategy doesn't work any longer: individual servers don't scale and copying this amount of data alone takes weeks. That's why in 2009 we started migrating the 'many individual servers' setup to a SAN architecture in which the file servers are just huge hard drives (iSCSI over Infiniband, for the technically inclined) connected to a frontend server that manages space allocation and the file system. The same is true for the backup infrastructure, where the data volume is even bigger.
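To get a feeling for why copying this amount of data takes weeks, here is a rough back-of-the-envelope sketch. The sustained throughput values are purely illustrative assumptions, not measurements from our servers, and the helper name copy_time_days is made up for this example.

```python
def copy_time_days(volume_tb: float, throughput_mb_s: float) -> float:
    """Days needed to copy volume_tb terabytes at a sustained rate in MB/s."""
    total_mb = volume_tb * 1_000_000      # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / throughput_mb_s  # time for one full pass over the data
    return seconds / 86_400               # 86,400 seconds per day


if __name__ == "__main__":
    # Assumed sustained copy rates in MB/s (hypothetical values).
    for rate in (50, 100, 200):
        days = copy_time_days(150, rate)
        print(f"150 TB at {rate} MB/s: about {days:.1f} days")
```

Under these assumptions a single full copy of 150 TB takes between roughly one and five weeks, whereas the 10 TB of five years ago would have been done in a day or two.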

This new setup had to be developed, tested and put in place as seamlessly and unobtrusively as possible while ensuring data access at all times (apart from single hour-long migrations). The SAN architecture was implemented for Astro in December 2010 and has been running beautifully ever since. In 2011 we laid the groundwork to adopt this system for the rest of D-PHYS's home and group shares and after a long and thorough testing period the rollout happened on January 5, 2012. Unfortunately, that's when things got ugly.

At first, we noticed some exotic file access problems on 32-bit workstations. It took us some time to understand that the underlying issue was an incompatibility with the new filesystem using 64-bit addresses for the data blocks. As a consequence, we had to replace the filesystem of the home shares. Independently, we ran into serious I/O issues with the installed operating system, so we had to upgrade the kernel of the frontend server and move the home directories onto a dedicated server. In parallel, we had to incorporate some huge chunks of group data while always making sure that nightly backups were available. All this necessitated a few more migrations until we finally achieved a stable system on March 28.

The upshot: what we had hoped to be a fast and easy migration turned out to cause a lot of problems and take much longer than anticipated, but now we have a stable and solid setup that will scale up to hundreds or even thousands of TB of data.
See live volume management and usage graphs for our file servers.

As for the mail server, matters are to some extent related and partly just coincidental in time. The IMAP server does need access to the home directories and hence also suffered when their performance was impaired. But even after having solved the file server issues, we still saw single load peaks on the IMAP server that prevented our users from working with their email. Again, we put a lot of time and effort into finding the reason. As of April 13, we're back to good performance and have arrived at the following set of conclusions:

Particular issues:

  • a covertly faulty hard disk in the mail server RAID seems to have impaired performance
  • CPU load of the individual virtual machines on the mail server was not distributed across the available CPU cores in an optimal way

General mail server load:

  • while incoming mail volume hasn't increased much, outgoing mail volume has grown by 50% in the last year alone
  • more and more sophisticated spam requires more thorough virus and spam scanning, increasing the load on the mail server
  • our users have amassed 1.1 TB of mail storage (up from 400 GB in January 2010), which needs to be accessed and organized
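For the curious, the mail storage figures above can be turned into an approximate yearly growth rate. This is only a sketch: it assumes the 1.1 TB figure dates from about April 2012, i.e. roughly 2.25 years after January 2010, and the function name annual_growth_rate is made up for this example.

```python
def annual_growth_rate(start: float, end: float, years: float) -> float:
    """Compound annual growth rate between two measurements, as a fraction."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # 400 GB in January 2010 and 1100 GB now; the 2.25 years are an assumption.
    rate = annual_growth_rate(400, 1100, 2.25)
    print(f"Mail storage grows by roughly {rate:.0%} per year")
```

A compound rate is used here because storage tends to grow in proportion to what is already there; a simple linear estimate would give a different number.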

Bottom line:

We'd like to thank you for your patience during the last 4 months and apologize for any inconvenience you might have had to endure. In all likelihood the systems will be a lot more stable in the future, but of course we're constantly working to ensure the D-PHYS IT infrastructure is able to keep up with the fast-growing demand for disk space (the data volume has tripled in the last year alone). We've learned a lot and we'll put it to good use.

Mail Server Maintenance Downtime this Evening

For some hardware and other maintenance we have scheduled a downtime of our mail server for this evening (Fri, 13th of April 2012) after 6pm.

The downtime will likely take less than one hour. During the downtime you will be able neither to access your mails on the server nor to send mails via our server. Mails which are sent to the Dept. of Physics won't get lost, but will be delivered with some delay.

Temporary SMB access restriction

Last night a security problem was detected in the SMB server software we use for our group and home shares. In order to protect your data and our systems, we are

temporarily restricting access to our group and home shares to the ETHZ IP address range

until security updates are available. If you're outside the ETH network and need to access your data, use VPN. We expect the updates to be released later today or tomorrow and will then go back to worldwide access.
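For the technically inclined, restricting a service to an IP address range boils down to a membership check like the one sketched below. The network used here is a placeholder from the range reserved for documentation, not the actual ETHZ range, and the names ALLOWED_NETWORKS and is_allowed are made up for this example; the real restriction is configured on the servers themselves.

```python
import ipaddress

# Placeholder network (reserved for documentation); NOT the real ETHZ range.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip lies inside one of the allowed networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        status = "allowed" if is_allowed(ip) else "blocked (use VPN)"
        print(f"{ip}: {status}")
```

A VPN connection works because it gives your computer an address inside the allowed range.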