Archive for the ‘Experiences’ Category

2021 in review

Friday, December 17th, 2021

This post is meant to give you a short overview of what has been accomplished in D-PHYS IT by ISG this year. We’ve been hard at work to further improve and extend our services for you, our customers. Some highlights of 2021:

  • Network migration: as first announced in 2018 and later detailed in July of this year, we had to completely restructure the D-PHYS network this fall. This reorganization was prompted by a segmentation of the router infrastructure at Hönggerberg and will render the network more redundant and resilient. Visible changes include a NAT network, new DHCP/DynDNS technology and the foundation for IPv6 in all network zones.
  • Hypervisor setup: we run a lot of virtual machines at ISG and this spring we remodeled our hypervisor infrastructure to make it more flexible and capable. Hourly snapshots now allow us to roll back if something goes wrong in a VM. The new setup also allowed us to move our InfluxDB server to an SSD-backed hypervisor, increasing performance and stability.
  • Office 365 migration: the Microsoft Office suite was upgraded to M365 on all managed Windows workstations this year.
  • Proprietary software woes: two major software companies caused us (and you!) a lot of headaches this year: on July 5, Microsoft broke Windows printing while trying to fix a security problem and it took them until the end of November to really repair it for everyone. Good job. Meanwhile, Adobe managed to break Acrobat logins for months on end and there's no general solution yet.
  • Windows configuration synchronization: the technology used to sync your desktop settings between managed Windows workstations was migrated from 'roaming profiles' to UE-V this year for greater speed and better reliability.
  • New lab PC backup solution: after we've had a good experience with our 2020 laptop backup system based on restic, we set up a similar system for lab PCs in 2021. We're currently migrating the last machines from the old BackupPC server.
  • 2021 Hardware Crisis: you might have noticed that a lot of hardware components are available only at outrageous prices, with lead times measured in months, or not at all. The situation is especially bad for graphics cards and storage components.
  • ISG lecture series: reacting to a growing demand for IT-related knowledge in the department, we established the Basics of Computing Environments for Scientists lecture series that we'll repeat each semester.
  • Matrix/Element: in 2021 we continued to extend the feature set of our popular chat & collaboration system. We contributed bug fixes and invested a lot of time in bringing usable maths support to Element (our supported Matrix client), as this was our number one most wanted feature. The second most wanted was better support for managing groups, which was added this year with spaces. Behind the scenes we have been scaling out our homeserver to keep up with demand and to remain stable and responsive. This year we counted 702 active users, who sent 927'123 messages in 4'571 rooms that were created on our server. Our users also participated in 396 rooms that were not created on our server, where 731'451 messages were sent.
  • Storage: in 2021 the disk space occupied by data and backup grew from 3.2 PiB to 3.7 PiB, continuing the obvious trend of ever-growing data. In spring (just in time before the 2021 Hardware Crisis) we replaced the older disk backends in our SAN with fewer, bigger disks.
  • Outages: apart from some short-term network interruptions, the only noteworthy service interruptions this year were two update-induced storage hiccups on June 10 and December 7.
  • OS upgrades: most managed Linux workstations were upgraded to Ubuntu 20.04 and a first batch of servers is now running Debian bullseye.
  • Software upgrades: mostly incremental upgrades in our Windows and Linux software list this year.
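
The storage figures quoted in this and the previous annual reviews can be put side by side with a few lines of Python. This is only a minimal sketch: the numbers are the rounded PiB values from the reviews, and the function name is made up for illustration.

```python
# Occupied disk space (data + backup) at the end of each year, in PiB,
# as quoted in the annual reviews from 2017 to 2021 (rounded values).
occupied_pib = {
    2016: 1.3,
    2017: 1.6,
    2018: 2.1,
    2019: 2.7,
    2020: 3.2,
    2021: 3.7,
}


def yearly_growth(series):
    """Return {year: (absolute growth in PiB, relative growth in %)}."""
    years = sorted(series)
    growth = {}
    for prev, curr in zip(years, years[1:]):
        delta = series[curr] - series[prev]
        growth[curr] = (round(delta, 2), round(100 * delta / series[prev], 1))
    return growth


for year, (delta, percent) in yearly_growth(occupied_pib).items():
    print(f"{year}: +{delta} PiB ({percent:+.1f} %)")
```

Because the inputs are rounded to one decimal place, the percentages are only indicative.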

I would like to take this opportunity to thank my whole team for their hard and dedicated work all year long.

Happy Holidays and see you in 2022!

2020 in review – Corona edition

Thursday, December 17th, 2020

oh boy, what a year.

This post is meant to give you a short overview of what has been accomplished in D-PHYS IT by ISG this year. We’ve been hard at work to further improve and extend our services for you, our customers. Some highlights of 2020:

  • Home office: on March 12, due to rapidly rising Covid-19 numbers, ISG was sent to work from home, along with most of the department. While we had somewhat anticipated this step and were prepared for it, the first two weeks were very busy because we had to assist a lot of people who weren't. In the end I believe we got everyone set up and we have been fully operational from home with only occasional individual visits to the office since then.
  • Matrix/Element/Riot: one of the most pressing issues with everyone working from home was an efficient and versatile tool for team communication. We had started internal tests of our Matrix chat system in late 2019, but then intensified our efforts in February and were able to release the system for general D-PHYS availability in home office week (HOW) 2. During the course of 2020, we continually kept working on the system and added new exciting features.
    We also run a Jitsi instance for privacy-aware video conferencing.
  • New laptop backup: our traditional BackupPC backup system for laptops and lab computers relies on each backup client to be reachable in the D-PHYS network, which obviously didn't work any longer in the home office regime. In HOW 17, we released a new backup system for laptops that works from any internet connection worldwide. Unfortunately, only very few of you have signed up for the service so far. Please make sure you have a backup of your laptop!
  • Ansible deployment: more servers and finally also the managed Linux workstations have been added to our ansible configuration management, allowing for completely automated installation of our systems.
  • Network migration: the extensive Hönggerberg network reorganization we reported two years ago hasn't seen much progress by Informatikdienste, but we have been working on our side to make the first steps. In early 2020 we migrated the dhcp.phys DNS service from our servers to ID's as a prerequisite for the eventual Gebäudezonen split.
  • Storage: in 2020 the disk space occupied by data and backup grew from 2.7 PiB to 3.2 PiB, continuing the obvious trend of ever-growing data. We have now also started the process of phasing out the oldest disk backends in order to replace them with fewer, bigger disks.
  • Software licenses: in the past 12 months, both Adobe and Microsoft decided to switch to a new license system in which each installation requires a license tied to a personal user account. In future, we can't create or extend your Adobe or Microsoft licenses for you, no matter how often you ask us to. You have to do it yourself, according to our instructions for Adobe and Microsoft (you might also want to think about switching to less oppressive software alternatives).
  • Outages: apart from two pre-announced storage migration windows (one of which took a bit longer than expected), 4 h of mail server hardware issues and some short-term network interruptions, our systems have been very stable in 2020. We are aware of the fact that wifi is quite often an issue, and we're trying to convince Informatikdienste to take it seriously.
  • OS upgrades: The Windows team was active migrating the Windows 2016 servers to 2019 while on the Linux side the first workstations were upgraded to Ubuntu 20.04 and most servers are now running Debian buster.
  • Software upgrades: mostly incremental upgrades in our Windows and Linux software list this year.
  • UCC: in February, the old non-VoIP phones in HPT, HPF and HPK were replaced by shiny new ones, just a few weeks before we were all sent home...
  • ISG staff changes: Patrick Schmid left us at the end of 2019 and was replaced by Maciej Bonin in February. Christian Schneider was replaced by Stephan Müller in September. And finally, Sukash Sugumaran superseded Janosch Bühler as our apprentice.

I would like to take this opportunity to thank my whole team for their hard and dedicated work all year long.

Happy Holidays and see you in 2021!

2019 in review

Friday, December 13th, 2019

This post is meant to give you a short overview of what has been accomplished in D-PHYS IT by ISG this year. We’ve been hard at work to further improve and extend our services for you, our customers. Some highlights of 2019:

  • Ansible deployment: while we had already started to deploy servers using ansible as early as 2015, it was in 2019 that we consolidated and migrated almost all server configuration to this system and now have a common base for the D-PHYS server infrastructure.
  • Storage server separation: in the past years a constant growth in both volume and bandwidth of our SAN storage system caused occasional performance issues for some users. To alleviate this, we split our single SAN frontend file server into 4 individual machines (D-PHYS general, IPA, IGP and galaxy) in order to distribute the load.
  • New web server: at the end of 2018 we purchased a new D-PHYS web server to replace the previous 10-year-old system. In 2019 we devised a completely new and upgraded web server setup on this new machine and migrated all D-PHYS hosted web shares to the new system. If you are the owner of one of our web shares, please make sure to read the updated documentation for things that have changed.
  • Network migration: the extensive Hönggerberg network reorganization we reported last year is even more complex than we initially thought, so there's no end-user-tangible progress this year - which doesn't mean there hasn't been a lot of behind-the-scenes work.
  • Storage: in 2019 the disk space occupied by data and backup grew from 2.1 PiB to 2.7 PiB, continuing the obvious trend of ever-growing data. The end of 2019 also saw a substantial expansion of the available disk capacity.
  • Clusters: we inherited two HPC clusters from CSCS that we're now running locally.
  • InfluxDB / Grafana: we included this popular time-series database / visualization combination into our service catalog.
  • Outages: apart from a pre-announced migration window and some short-term network interruptions, our systems have been very stable in 2019.
  • OS upgrades: The Windows team was active in getting rid of the remaining Windows 7 machines and upgrading Windows 10 to the 1809 build, while on the Linux side workstations were upgraded to Ubuntu 18.04 and a first batch of servers to Debian buster.
  • Software upgrades: the FileMaker server has been upgraded.
  • UCC: the UCC project of Informatikdienste was stopped due to nonfulfillment of the technical requirements and all deployed services and devices have been rolled back. The whole project will be reevaluated from scratch.
  • IT security: we participate in and support the ETH-wide IT security initiative.

I would like to take this opportunity to thank my whole team for their hard and dedicated work all year long.

Happy Holidays and see you in 2020!

2018 in review

Tuesday, December 18th, 2018

This post is meant to give you a short overview of what has been accomplished in D-PHYS IT by ISG this year. We’ve been hard at work to further improve and extend our services for you, our customers. Some highlights of 2018:

  • New mail server: between January and March, the virtual machines that make up the D-PHYS mail server were migrated to new hardware. We're now running on a state-of-the-art server with SSD storage that will serve the department's needs for many years to come.
  • New LDAP servers: in late 2017 we started a big migration to a cluster of new LDAP servers. This move was completed in the spring of 2018 and the old server turned off.
  • group membership edit: one of the benefits of the LDAP migration is that group memberships can now be managed directly by dedicated owners of a group. If you feel responsible for one such group and would like to be able to perform member management yourself without having to go through us each time, please get in touch.
  • New web server: we purchased new D-PHYS web server hardware to replace the 10-year-old system. Since we're also planning to change the setup of your web hosting, migrating the existing web sites to the new hardware will be a long process that will extend well into 2019.
  • Network migration: while we were in an advanced planning stage of a segmentation of the D-PHYS network and had already started to implement the first changes, Informatikdienste announced that the underlying network layout of the whole Hönggerberg campus would be redesigned in 2018/19 which deeply influences and impacts our work as well. We're now on hold until we know details of ID's technical implementation.
  • Storage: in 2018 the disk space occupied by data and backup grew from 1.6 PiB to 2.1 PiB, which means that growth in storage has picked up steam again after two slow years.
  • Outages: apart from the above-mentioned pre-announced migration windows and some short-term network interruptions, our systems have been very stable in 2018.
  • OS upgrades: the Windows 10 rollout has been largely completed and most Linux workstations have been upgraded to Ubuntu 18.04.
  • WiFi change: we accompanied and supported ETH's wifi change project in November.
  • UCC: the UCC rollout which will replace the existing ETH telephony system with an all-IP based solution has been put on hold by Informatikdienste since the service quality was severely lacking. We'll know more in 2019.Q2.
  • IT security: we participate in and support the ETH-wide IT security initiative.

I would like to take this opportunity to thank my whole team for their hard and dedicated work all year long.

Happy Holidays and see you in 2019!

2017 in review

Monday, December 18th, 2017

This post is meant to give you a short overview of what has been accomplished in D-PHYS IT by ISG this year. We’ve been hard at work to further improve and extend our services for you, our customers. Some highlights of 2017:

  • Account expiry: in early 2017 we finished assessing all ~7600 D-PHYS accounts and blocked the expired ones. We also tied all D-PHYS accounts to their nethz counterparts wherever possible. This allows us to make use of ETH's employment information from now on. While we were at it:
  • New LDAP servers: Since implementing account expiration meant touching most aspects of our identity management infrastructure anyway, we decided to completely overhaul our LDAP user database. We reworked the LDAP schema (the original one dating back to the early 90s) and set up a 3-way replicating OpenLDAP cluster.
  • Windows Server Cluster: Several mission critical Windows Server instances have been moved to a newly created Windows Cluster. This complements last year's Linux cluster.
  • Storage: in 2017 the disk space occupied by data and backup grew from 1.3 PiB to 1.6 PiB, making this a very slow year as far as storage growth is concerned.
  • Server room migration: in August we had to move most of D-PHYS's servers three rack rows down in the HIT D 13 server room. We now have a solid foundation for our servers for the next years.
  • Outages: apart from the above-mentioned migration, some short-term network interruptions and the unfortunate file server issues of late, our systems have been very stable in 2017.
  • Web server upgrade: in January we upgraded the operating system on the D-PHYS web server. We also used the occasion to clean up a lot of legacy cruft.
  • OS upgrades: 2017 brought new OS versions for almost every system: the Windows 10 rollout picked up steam, High Sierra arrived on the Macs and Ubuntu 16.04 on the remaining Linux workstations.
  • eXile: we migrated the configuration management from Puppet to Ansible and then re-installed all eXile gateways in a fully automated way with the latest Debian release.
  • UCC: we laid the technical groundwork and performed implementation tests for the upcoming UCC rollout which will replace the existing ETH telephony system with an all-IP based solution.
  • IT security: we participate in and support the ETH-wide IT security initiative.

I would like to take this opportunity to thank my whole team for their hard and dedicated work all year long.

Happy Holidays and see you in 2018!

Some incoming mails lost between Jan 9, 6pm and Jan 13, 11am

Tuesday, January 14th, 2014

On Monday morning we found out that large incoming mails (1 MByte or larger) were dropped without leaving any error messages in our log files. These mails were lost between Thursday (Jan 9) evening 18:27 and Monday (Jan 13) morning 11:06. Based on some indicators (i.e. spam filter rules for this case), we estimate about 560 failed local deliveries to about 300 unique recipients.

If you expected e-mails with attachments close to 1 MB or larger within this time frame, there is a high likelihood that they got lost. The only information we still have about these mails is sender, recipient and arrival date and time. If you were one of these recipients, please ask the sender to send the mail again.

You can check on this web page if mails you should have received were lost. You'll have to log in with your D-PHYS account and will then see the sender (or mailing list) and the arrival time of each lost mail. Additionally, we'll inform all affected recipients individually.

The problem occurred after one of the software updates on Thursday, which brought stricter code checking, and has been solved since Monday morning 11:06.

The issue was caused by a long-standing and subtle programming error in the check which, for performance reasons, prevents bigger mails from being inspected closely by the main spam filter. It was only triggered upon local mail delivery, so mails sent from D-PHYS to outside D-PHYS were not affected. E-mails to D-PHYS mailing lists (or other mailing lists) with an archive should be available in the corresponding mailing list archives.
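
To illustrate the kind of check involved, here is a minimal sketch of a size gate that "fails open", i.e. delivers a mail unscanned rather than dropping it when something goes wrong. It is not our actual filter code: the threshold, the function name and the `deep_scan`/`deliver` callables are all hypothetical.

```python
# Hypothetical threshold: mails at or above this size skip the deep spam scan.
SIZE_LIMIT_BYTES = 1_000_000


def route_mail(raw_mail: bytes, deep_scan, deliver):
    """Deliver a mail, scanning it first only if it is small enough.

    deep_scan and deliver are placeholder callables supplied by the caller.
    Any error in the size check or the scan leads to unscanned delivery
    instead of a silent drop.
    """
    try:
        if len(raw_mail) < SIZE_LIMIT_BYTES:
            verdict = deep_scan(raw_mail)
        else:
            verdict = "skipped-too-large"
    except Exception as exc:  # fail open: never lose the mail
        verdict = f"scan-error: {exc}"
    deliver(raw_mail, verdict)
    return verdict
```

The design choice shown here is that `deliver` is called on every path, so a bug in the check can at worst let an unscanned mail through.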

We're truly sorry for any inconvenience this may have caused and have already taken measures so that similar issues won't result in mail loss from now on.

Update: it happens to the best of us: Gmail for iOS bug might cause data loss

Notes on warranty (Garantie vs. Gewährleistung)

Tuesday, February 19th, 2013

This post might help to clarify some questions related to the warranty conditions of new hardware. It is the result of internal inquiries we performed in reply to customer requests. Skip if you're not interested.

Switzerland, like other European countries, knows two forms of liability a vendor has to/can offer to clients of its hardware products: Gewährleistung and Garantie.

  • Gewährleistung is mandated by law and covers basic liability if a piece of hardware fails. In Switzerland, Gewährleistung was just extended from 12 to 24 months on Jan 1st, 2013. This means that for the first two years, any defect whose cause was already present at the time of purchase has to be covered by the vendor. As you can probably guess, the condition "whose cause was already present at the time of purchase" can be the crucial one.
  • Garantie is a voluntary service offered by most, but not all vendors. Its conditions can be pretty freely chosen by the vendor, unlike Gewährleistung where the terms are given by law. Garantie can cover a wider range of defects and it can also be a service you have to pay for.

Now how does this matter to you? Let's take a current real life example: you'd like to buy a new Apple MacBook Pro 13". Right now, you have a number of interesting options:

  • Neptun: CHF 1305.-, 2 years of Gewährleistung (by law), 3 years of Apple Garantie (price of Apple Care included)
  • Dataquest: CHF 1240.-, 2 years of Gewährleistung (by law), 2 years of Dataquest Garantie. Additionally, you can pay CHF 99.- for a third year of Dataquest Garantie.
  • Apple EDU Store: CHF 1268.-, 2 years of Gewährleistung (by law), 1 year of Apple Garantie. Additionally, you can pay CHF 210.- for another 2 years of Apple Garantie. IDES offers the same to ETH employees for CHF 195.-

It's hard to tell if the conditions of the additional Garantie are really more accommodating than those of the mandatory Gewährleistung. Wear parts like the battery, for example, are typically covered by neither. Hard disks, on the other hand (the most commonly failing part in a laptop), should be covered by both. In the end the best option will also depend on your usage pattern and the expected lifetime of your device. Regardless of the type of warranty you have, you should always report any problem you'd like to get fixed as soon as possible.

Sources:
Apple warranty conditions
Computer World article

The Art of Scaling

Thursday, April 19th, 2012

Note: this is a purely anecdotal posting about our struggles with some performance bottlenecks in the last few months. If you're not interested in such background information, just skip.

You might have noticed that since about January 2012 using our file and mail servers hasn't been as smooth as usual. This posting will give you some background information concerning the challenges we encountered and why it took so long to fix them. Let's begin with the file server.

Way back in the day (i.e. 5 years ago), when the total file server data volume at D-PHYS was about 10 TB, we used individual file servers to store this data. When one server was full, we got a bigger one, copied all the data and life was good for another year or two. Today, the file server data volume (home and group shares) is above 150 TB and growing fast, and this strategy doesn't work any longer: individual servers don't scale and copying this amount of data alone takes weeks. That's why in 2009 we started migrating the 'many individual servers' setup to a SAN architecture in which the file servers are just huge hard drives (iSCSI over Infiniband, for the technically inclined) connected to a frontend server that manages space allocation and the file system. The same is true for the backup infrastructure, where the data volume is even bigger.
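
A rough back-of-the-envelope calculation shows why copying the data to a bigger server stops being practical. This is a sketch only: the sustained throughput of 100 MB/s is an assumed figure for illustration, not a measurement from our servers, and the function name is made up.

```python
def copy_time_days(volume_tb: float, throughput_mb_per_s: float) -> float:
    """Days needed to copy volume_tb terabytes at a sustained rate in MB/s."""
    seconds = volume_tb * 1e12 / (throughput_mb_per_s * 1e6)
    return seconds / 86_400


# 100 MB/s is an assumed sustained rate, not a measured figure.
for volume_tb in (10, 150):
    print(f"{volume_tb} TB: {copy_time_days(volume_tb, 100):.1f} days")
```

Under this assumption 10 TB is a matter of about a day, while 150 TB takes more than two weeks of uninterrupted copying.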

This new setup had to be developed, tested and put in place as seamlessly and unobtrusively as possible while ensuring data access at all times (apart from single hour-long migrations). The SAN architecture was implemented for Astro in December 2010 and has been running beautifully ever since. In 2011 we laid the groundwork to adopt this system for the rest of D-PHYS's home and group shares and after a long and thorough testing period the rollout happened on January 5, 2012. Unfortunately, that's when things got ugly.

At first, we noticed some exotic file access problems on 32-bit workstations. It took us some time to understand that the underlying issue was an incompatibility with the new filesystem, which uses 64-bit addresses for the data blocks. As a consequence we had to replace the filesystem of the home shares. Independently, we ran into serious I/O issues with the installed operating system, so we had to upgrade the kernel of the frontend server and move the home directories onto a dedicated server. In parallel, we had to incorporate some huge chunks of group data while always making sure that nightly backups were available. All this necessitated a few more migrations until we finally achieved a stable system on March 28.
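
As a side note on why a filesystem of this size needs 64-bit block addresses at all, the following sketch computes the largest volume that a given address width can cover. The 4 KiB block size is an assumption chosen for illustration, and the calculation says nothing about the exact client-side incompatibility we ran into.

```python
def max_volume_tib(address_bits: int, block_size_bytes: int = 4096) -> float:
    """Largest volume (in TiB) addressable with the given block address width."""
    return (2 ** address_bits) * block_size_bytes / 2 ** 40


# With the assumed 4 KiB blocks, 32-bit block numbers cover 16 TiB.
print(max_volume_tib(32))
print(max_volume_tib(64))
```

With 32-bit block numbers the limit under this assumption is 16 TiB, well below the 150 TB mentioned above.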

The upshot: what we had hoped to be a fast and easy migration turned out to cause a lot of problems and take much longer than anticipated, but now we have a stable and solid setup that will scale up to hundreds or even thousands of TB of data.
See live volume management and usage graphs for our file servers.

As for the mail server, matters are to some extent related and partly just coincidental in time. The IMAP server does need access to the home directories and hence also suffered when their performance was impaired. But even after having solved the file server issues, we still saw single load peaks on the IMAP server that prevented our users from working with their email. Again, we put a lot of time and effort into finding the reason. As of April 13, we're back to good performance and arrive at the following set of conclusions:

Particular issues:

  • a covertly faulty hard disk in the mail server RAID seems to have impaired performance
  • CPU load of the individual virtual machines on the mail server was not distributed across the available CPU cores in an optimal way

General mail server load:

  • while incoming mail volume doesn't increase much, outgoing mails have grown 50% in the last year alone
  • more and more sophisticated spam requires more thorough virus and spam scanning, increasing the load on the mail server
  • our users have amassed 1.1 TB of mail storage (up from 400 GB in January 2010), which need to be accessed and organized

Bottom line:

We'd like to thank you for your patience during the last 4 months and apologize for any inconvenience you might have had to endure. In all likelihood the systems will be a lot more stable in the future, but of course we're constantly working to ensure the D-PHYS IT infrastructure is able to keep up with the fast-growing demand for disk space (the data volume has tripled in the last year alone). We've learned a lot and we'll put it to good use.

Do not blindly trust mail

Tuesday, November 3rd, 2009

The current wave of password phishing mails seems to attract an unusually high level of attention. People seem to think that mail allegedly coming from help@ethz.ch may be genuine. The German text itself is so bad that its spammy character is obvious to long-time mail users.

Remember: any part of a mail can be faked. This is in the design of the mail system and cannot be fixed without making mail usage a lot harder for everybody.  And even if we used a better system (like cryptographic signatures) the rest of the world would not follow.
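
The following minimal sketch shows that the From line is just text that the sender chooses. It uses Python's standard `email` module to compare the domain in From with the domain in Return-Path; the sample message and the function name are made up for illustration. A mismatch is only a hint (mailing lists legitimately produce one), and a match proves nothing, since both headers can be faked.

```python
from email import message_from_string
from email.utils import parseaddr


def from_matches_return_path(raw_mail: str) -> bool:
    """Return True if the From and Return-Path headers share the same domain."""
    msg = message_from_string(raw_mail)
    from_domain = parseaddr(msg.get("From", ""))[1].rpartition("@")[2].lower()
    path_domain = parseaddr(msg.get("Return-Path", ""))[1].rpartition("@")[2].lower()
    return bool(from_domain) and from_domain == path_domain


# Made-up sample: the From header claims help@ethz.ch, the Return-Path does not.
SAMPLE = (
    "Return-Path: <bounce@mailer.example.net>\n"
    "From: ETH Helpdesk <help@ethz.ch>\n"
    "Subject: Please confirm your password\n"
    "\n"
    "Dear user, please reply with your password.\n"
)

print(from_matches_return_path(SAMPLE))
```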

Therefore, be sceptical about any mail until the complete impression, including the writing style, fits the picture. No IT support worth their salt will ask you to reveal your password. And if they do, they deserve to be ignored!

Solution for PDF printing problems on Windows

Friday, December 19th, 2008

We found the solution for our long term PDF printing problems, which occurred mostly on Windows computers.

The latest workaround was to use the option print as image in Adobe Acrobat. This was very slow, and sometimes even that didn't work.

Please follow the instructions in this readme, if you have such problems!