Author Archive

Short service interruption

Monday, January 28th, 2013

Tomorrow,

Tuesday, January 29, starting at 07:15 am

there will be a short downtime of our LDAP server while we perform some maintenance work. You will not be able to log in or use our file services during this downtime. The expected duration is less than 15 minutes, however, so most of you won't even notice.

2012 in Review

Wednesday, December 19th, 2012

This post is meant to give you a short overview of what has been accomplished in D-PHYS IT by ISG this year. We've been hard at work to further improve and extend our services for you, our customers. Some highlights of 2012:

  • Integration of IGP's IT into ours: as you might recall, our coworker Thomas is actually paid by D-BAUG's IGP institute in exchange for us providing our IT services to their users. Over the last 12 months, we have migrated their servers, data, users and software into our setup so that in the future we can all benefit from a unified solution.
  • File servers and backup: after some difficulties earlier this year, our file server and backup infrastructure is now rock solid and ready for the fast growth in data volume that we expect in the coming years. All disk backends have been integrated into our SAN setup and are connected via either Infiniband or 10G Ethernet for maximum speed. Just yesterday we passed the 1 PB mark in file server disk space. Yes, that's 1024 TB.
  • Mail server bottleneck: also in spring, the sporadic performance bottleneck on our mail server was found and fixed. The server is now running at full steam again.
  • Personal user groups: probably completely unnoticed by our customers, all D-PHYS user accounts have been migrated to personal user groups this year. While this has been the standard behavior on modern Unix systems for many years now, our LDAP directory dates back to SunOS, which combined all user accounts into a single 'staff' group (see the short sketch after this list). Not a big deal for you, but it makes life much easier for us.
  • Group share reporting: in order to provide a better overview of space allocation and usage on our group shares, we introduced a periodic report email containing the link to an interactive usage graph.
  • Mac OS X 10.7: the Mac workstations have been migrated to OS X 10.7, providing a unified setup that facilitates software distribution.
  • Ubuntu 12.04 LTS: the Linux workstations were upgraded to Ubuntu 12.04, a long-term support (LTS) version that benefits from an extended support cycle.
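For those curious about the difference: with personal groups, each account's primary group carries the same name as the account itself, instead of everyone sharing 'staff'. Below is a minimal, purely illustrative Python sketch (not our migration tooling) for checking this on a Linux workstation:

    # Minimal sketch: check whether an account's primary group is personal
    # (group name == user name) or a shared group such as 'staff'.
    # Purely illustrative; this is not part of our migration tooling.
    import grp
    import pwd
    import sys

    def primary_group_is_personal(username: str) -> bool:
        user = pwd.getpwnam(username)       # account lookup via NSS (files/LDAP)
        group = grp.getgrgid(user.pw_gid)   # resolve the numeric primary group ID
        return group.gr_name == username    # personal groups carry the user's name

    if __name__ == "__main__":
        name = sys.argv[1]
        kind = "personal" if primary_group_is_personal(name) else "shared"
        print(f"{name}: primary group is {kind}")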

Apart from these highlights, there have of course been numerous smaller projects and improvements to our setup, making life easier for both you and us.
I would like to take this opportunity to thank my whole team for their hard and dedicated work all year long.

Happy Holidays and see you in 2013!

ISG Helpdesk Service Interruption

Monday, December 17th, 2012

UPDATE: the power test has been cancelled. Helpdesk duty as usual.

On Thursday, December 20, ETH facility services will conduct extensive power network tests in the HPT building, where ISG (and hence the helpdesk) is located. We were told to expect at least one power cut lasting at least 15 min, possibly longer. During this time we will not be able to answer the helpdesk phone or work on your tickets. We'll post an update when power is back.

Workshop: IT at D-PHYS

Thursday, November 22nd, 2012

Update: Attendance at the workshop was somewhat lower than expected (3 participants). We would still like to offer this workshop on a regular basis, but we'll have to see how many registrations we get. We'll keep the registration page open, but without a predefined date.

When you first start your career at our department, the IT situation might look a bit confusing: there's ID and there's ISG, two different accounts and e-mail addresses, the question of where to store your data... We figured that instead of fixing things after they have gone wrong simply because people didn't know how to handle them, it would be easier for everybody to introduce new arrivals to D-PHYS' IT as early as possible. That's why we are hereby starting our series of introductory IT workshops. The first one took place on Thursday, November 29, at 14:00 in HPF G 6.
So if you feel you have some questions regarding the IT landscape at D-PHYS, please consider attending this ~1 h workshop. Before you do, please take a look at the website, where you can also find the PDF of the talk. Maybe all your questions are already answered there. If not, head over to the registration website and provide your name and e-mail address. In order to allow for a lively discussion, we have limited registration to the first 25 applicants. If you're too late, don't despair: there will be enough workshops for everyone; we'll simply schedule more.

Looking forward to seeing many of you in G 6!

Power outage Monday evening + cleanup

Tuesday, November 20th, 2012

On Monday evening (19.11.2012) around 18:30, a power outage in the HIT server room shut down most of our core infrastructure servers. Apparently the building automation system had turned off the cooling in HIT D 13, and when the temperature in the server room reached 37 °C, there was an emergency power cut. After the electricians had restored power around 21:00, we started bringing our servers back up. By around 23:00 most of the services were back, with the exception of the main web server, which we managed to recover on Tuesday around 9:00. Webmail also took a bit longer.

We apologize for any inconvenience.

Group drive usage reports

Thursday, October 4th, 2012

Most of you use at least one of our group drives. Up to now, there wasn't much oversight of the disk space usage on those group drives, apart from a global quota for each share. We have now developed a monthly report that can be sent to a designated administrator of a share and that nicely lists total disk usage, the biggest directories and the share's members. It also contains a link to an interactive usage chart (see example here) that can be used to explore the directory tree. A demo report can be seen here.
So if you think you'd be a good candidate to receive the monthly report for your share(s) and would like to get a good idea of the share's usage, please contact us.
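For the technically curious, here is a minimal Python sketch of the kind of numbers such a report aggregates (total usage and the biggest top-level directories). It is purely illustrative: the real reports are produced by our internal tooling, and the path in the example is made up.

    # Minimal sketch: total usage of a share and its biggest top-level directories.
    # Purely illustrative; the real reports come from internal tooling, and the
    # example path below is made up.
    import os

    def directory_size(path):
        """Sum the sizes of all regular files below `path`, in bytes."""
        total = 0
        for root, _dirs, files in os.walk(path, onerror=lambda err: None):
            for name in files:
                try:
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        return total

    def share_report(share, top_n=5):
        # Note: walking a large share twice like this is fine for a sketch,
        # but far too slow for multi-TB shares.
        print(f"Total usage: {directory_size(share) / 1024**3:.1f} GiB")
        subdirs = [os.path.join(share, d) for d in os.listdir(share)
                   if os.path.isdir(os.path.join(share, d))]
        biggest = sorted(((directory_size(d), d) for d in subdirs), reverse=True)
        for size, d in biggest[:top_n]:
            print(f"{size / 1024**3:8.1f} GiB  {d}")

    if __name__ == "__main__":
        share_report("/path/to/groupshare")  # hypothetical mount point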

Upgrade of Roundcube webmail

Wednesday, August 8th, 2012


For some years now we have been providing you with the nice webmail solution from the Roundcube project. Last night they released a new major version, which we will install

tomorrow Thursday, August 9, at 07:00.

Note that you won't have access to Roundcube for about 30 minutes. The new version brings a very nice new theme (see screenshot) which we will enable by default. If for some reason you'd like to keep the old one, you can switch back under Settings -> User Interface -> Interface skin.

Thu 07:15 Upgrade completed.

Mobile printing

Monday, May 7th, 2012

Until now it was not possible to print on D-PHYS printers while on the road with a mobile device. Since several people expressed interest in such a possibility, we have created two methods that allow you to do just that: read more.
As there's no common standard for mobile printing, certain restrictions apply. If you find yourself with an email that you think should print but doesn't, please let us know.
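To illustrate the email-based method, here is a minimal, hypothetical Python sketch that sends a PDF as a mail attachment; the recipient address and SMTP host are placeholders only, and the actual addresses, supported formats and restrictions are described on the page linked above.

    # Minimal sketch: send a PDF as an e-mail attachment to a print-by-mail address.
    # The recipient address and SMTP host are hypothetical placeholders; the real
    # addresses and restrictions are documented on the linked page.
    import smtplib
    from email.message import EmailMessage

    def mail_pdf_for_printing(pdf_path, recipient, sender, smtp_host):
        msg = EmailMessage()
        msg["Subject"] = "Print job"
        msg["From"] = sender
        msg["To"] = recipient
        with open(pdf_path, "rb") as fh:
            msg.add_attachment(fh.read(), maintype="application", subtype="pdf",
                               filename=pdf_path.rsplit("/", 1)[-1])
        with smtplib.SMTP(smtp_host) as smtp:
            smtp.send_message(msg)

    # Placeholder values only:
    # mail_pdf_for_printing("report.pdf", "printer@example.org",
    #                       "me@example.org", "mail.example.org")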

The Art of Scaling

Thursday, April 19th, 2012

Note: this is a purely anecdotal post about our struggles with some performance bottlenecks over the last few months. If you're not interested in such background information, feel free to skip it.

You might have noticed that since about January 2012 using our file and mail servers hasn't been as smooth as usual. This posting will give you some background information concerning the challenges we encountered and why it took so long to fix them. Let's begin with the file server.

Way back in the days (i.e. five years ago), when the total file server data volume at D-PHYS was about 10 TB, we used individual file servers to store this data. When one server was full, we got a bigger one, copied all the data, and life was good for another year or two. Today, the file server data volume (home and group shares) is above 150 TB and growing fast, and this strategy no longer works: individual servers don't scale, and copying this amount of data alone takes weeks. That's why in 2009 we started migrating the 'many individual servers' setup to a SAN architecture in which the file servers are just huge hard drives (iSCSI over Infiniband, for the technically inclined) connected to a frontend server that manages space allocation and the file system. The same is true for the backup infrastructure, where the data volume is even bigger.

This new setup had to be developed, tested and put in place as seamlessly and unobtrusively as possible while ensuring data access at all times (apart from single hour-long migrations). The SAN architecture was implemented for Astro in December 2010 and has been running beautifully ever since. In 2011 we laid the groundwork to adopt this system for the rest of D-PHYS's home and group shares and after a long and thorough testing period the rollout happened on January 5, 2012. Unfortunately, that's when things got ugly.

At first, we noticed some exotic file access problems on 32-bit workstations. It took us some time to understand that the underlying issue was an incompatibility with the new filesystem, which uses 64-bit addresses for its data blocks. As a consequence we had to replace the filesystem of the home shares. Independently, we ran into serious I/O issues with the installed operating system, so we had to upgrade the kernel of the frontend server and move the home directories onto a dedicated server. In parallel, we had to incorporate some huge chunks of group data while always making sure that nightly backups were available. All this necessitated a few more migrations until we finally achieved a stable system on March 28.

The upshot: what we had hoped would be a fast and easy migration turned out to cause a lot of problems and take much longer than anticipated, but we now have a stable and solid setup that will scale up to hundreds or even thousands of TB of data.
See live volume management and usage graphs for our file servers.

As for the mail server, matters are partly related and partly just coincidental in time. The IMAP server needs access to the home directories and hence also suffered when their performance was impaired. But even after having solved the file server issues, we still saw isolated load peaks on the IMAP server that prevented our users from working with their email. Again, we put a lot of time and effort into finding the reason. As of April 13, we're back to good performance and arrive at the following conclusions:

Particular issues:

  • a silently failing hard disk in the mail server RAID seems to have impaired performance
  • the CPU load of the individual virtual machines on the mail server was not distributed optimally across the available CPU cores

General mail server load:

  • while the incoming mail volume isn't increasing much, the outgoing mail volume has grown by 50% in the last year alone
  • increasingly sophisticated spam requires more thorough virus and spam scanning, increasing the load on the mail server
  • our users have amassed 1.1 TB of mail storage (up from 400 GB in January 2010), which needs to be accessed and organized

Bottom line:

We'd like to thank you for your patience during the last four months and apologize for any inconvenience you might have had to endure. In all likelihood the systems will be a lot more stable in the future, but of course we're constantly working to ensure the D-PHYS IT infrastructure can keep up with the fast-growing demand for disk space (the data volume has tripled in the last year alone). We've learned a lot, and we'll put it to good use.

Temporary SMB access restriction

Wednesday, April 11th, 2012

Last night a security problem was detected in the SMB server software we use for our group and home shares. In order to protect your data and our systems, we

temporarily restrict access to our group and home shares to the ETHZ IP address range

until security updates are available. If you're outside the ETH network and need to access your data, use VPN. We expect the updates to be released later today or tomorrow and will then restore worldwide access.
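If you are unsure whether your current connection is inside the ETH range, the following minimal Python sketch shows how such an address check works; the 129.132.0.0/16 prefix is used for illustration only, and the authoritative ranges are those published by ETH.

    # Minimal sketch: test whether an IP address lies inside a given network.
    # The 129.132.0.0/16 prefix is illustrative; the authoritative ETH ranges
    # are those published by ETH/ID.
    import ipaddress

    ETH_NET = ipaddress.ip_network("129.132.0.0/16")

    def inside_eth(ip):
        return ipaddress.ip_address(ip) in ETH_NET

    print(inside_eth("129.132.1.2"))   # True  -> direct access to the shares
    print(inside_eth("203.0.113.7"))   # False -> connect through VPN first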