Berkeley CSUA MOTD:Entry 54244
2011/11/27-2012/1/10 [Computer/HW/Drives] UID:54244 Activity:nil
11/27   CalMail has been down for a few days (hardware failure and database
        corruption -- sounds like fun!) and is starting to come back online.
        Looks like they're planning to outsource all campus mail to either
        Google Apps or Microsoft 365 as part of Operational Excellence.
        <DEAD>kb.berkeley.edu/jivekb/entry!default.jspa?externalID=2915<DEAD>
        \_ http://ist.berkeley.edu/ciocalmailupdates/november-30-2011
You may also be interested in these entries...
2012/1/4-2/6 [Computer/HW/Drives] UID:54281 Activity:nil
1/4     I want to test how my servers behave during a disk failure and
        a RAID reconstruction so I want to simulate a hardware failure.
        How can I do this in Linux without having to physically pull
        a drive? These disks are behind a RAID card and run Linux. -ausman
        \_ According to the Linux RAID wiki, you might be able to use mdadm
           to do this with something like the following:
	...
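           A minimal sketch, assuming a software-RAID (md) array /dev/md0
           with member /dev/sdb1 (both names hypothetical); note that mdadm
           drives the md layer, so disks behind a hardware RAID card may
           need the card vendor's tool instead:
               # mark the member faulty, dropping the array to degraded mode
               mdadm --manage /dev/md0 --fail /dev/sdb1
               # remove it from the array, as if the drive had been pulled
               mdadm --manage /dev/md0 --remove /dev/sdb1
               # re-add it to trigger a RAID reconstruction
               mdadm --manage /dev/md0 --add /dev/sdb1
               # watch the rebuild progress
               cat /proc/mdstat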
2011/9/14-10/25 [Computer/HW/Drives] UID:54173 Activity:nil
9/13    Thanks to Jordan, our disk server is no longer virtualized. Our long
        nightmare of poor IO performance should hopefully be over. Prepare for
        another long nightmare of poor hardware reliability!
        ...
        Just kidding! (I hope)
        In any case, this means that cooler was taken out back and shot, and
	...
2011/2/14-4/20 [Computer/SW/Unix] UID:54039 Activity:nil
2/14    You sure soda isn't running windows in disguise?  It would explain the
        uptimes.
        \_ Hardly, my winbox stays up longer.
        \_ Nobody cares about uptime anymore, brother; that's what web2.0
           has taught us.  Everything is "stateless".
           \_ You'd think gamers would care more about uptime.
	...
2010/7/22-8/9 [Computer/SW/OS/FreeBSD, Computer/HW/Drives] UID:53893 Activity:nil
7/22    Playing with dd if=/dev/random of=/dev/<disk> on linux and bsd:
        Two questions: on Linux, when <disk>==hda, it always gives me an
        off-by-one report (records out == records in - 1) and says there
        is an error. Has anyone else seen this?  Second, when trying to
        repeat this on BSD (<disk>==rwd0 now), to my surprise, using the
        install disk and selecting (S)hell, when I try to dd a 40 gig disk
        it says "409 records
	...
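        A possible explanation for the Linux off-by-one, as a sketch (device
        name and counts hypothetical): dd only learns the disk is full when
        a write fails, so the last block it read is never written and ends
        up uncounted:
            dd if=/dev/urandom of=/dev/hda bs=512
            # dd: writing to '/dev/hda': No space left on device
            # 78165361+0 records in
            # 78165360+0 records out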
2009/10/27-11/3 [Computer/HW/Drives] UID:53474 Activity:nil
10/27   I just read an article that Facebook had moved their database
        to all SSD to speed throughput, but now I can't find it. Has
        anyone else seen this? Any experience with doing this? -ausman
        \_ I hope you're not running mission critical data:
           http://ask.slashdot.org/story/09/10/27/1559248/Reliability-of-PC-Flash-SSDs?from=rss
        \_ Do you have any idea how much storage space is used by Facebook,
	...
2009/8/4-13 [Computer/SW/OS/Windows] UID:53239 Activity:kinda low
8/3     VMWare + Windows XP + Validation question. I need to test stuff with
        Service Pack 3 installed. I have a valid key that I own (yeah yeah I
        actually *bought* a copy, please don't flame me for supporting evil
        M$). Is it possible to register the key once, and then duplicate it
        for testing purposes?  Will Windows or Microsoft detect copies and
        disable the rest of the copies?
	...
2009/7/28-8/6 [Computer/HW/Drives] UID:53216 Activity:nil
7/28    Does it make sense to defragment disks on VMWare? My 80GB disk
        on VMWare isn't really using 80GB, it just uses what it needs.
        Will defragment do anything to it?
        \_ If you want to speed up disk operation in your VM, it's best to
           defragment the disks in your VM, then defragment the disk on your
           host machine where the VM files are.
	...
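           A minimal sketch of that two-step pass, assuming a Windows guest,
           a growable disk.vmdk (hypothetical path), and VMware's
           vmware-vdiskmanager tool on the host:
               # inside the guest: defragment the filesystem first
               defrag c:
               # on the host, with the VM powered off: defragment the
               # virtual disk file, then shrink it to reclaim free space
               vmware-vdiskmanager -d disk.vmdk
               vmware-vdiskmanager -k disk.vmdk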
2009/7/24-27 [Computer/SW/WWW/Browsers, Computer/SW/OS/OsX] UID:53191 Activity:kinda low
7/24    Firefox 3.5.1 on MacOS is a piece of crap. It crashes ALL THE TIME.
        It has crashed 3 or 4 times on me in the last hour, and not on
        the same pages either. The new Yahoo! home page also sucks ass.
        \_ os x keeps trashing my raid disk: "11 hours to rebuild. have fun
           with the kernel IO subsystem running like shit until then."
           Worthless piece of shit.
	...
2009/7/17-24 [Computer/SW/OS/OsX] UID:53156 Activity:kinda low
7/17    -rw-r--r--@
        What does the "at sign" mean? This is on Mac OS. VMWare disk file.
        \_ The file has metadata attributes
           \_ How do I add/delete attributes to files? What about
              -rw-r--r--+ <-- what is the "+" sign? Also how do you make
              tar preserve these attributes?
	...
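              A sketch of the relevant Mac OS commands ("file.txt" and the
              attribute name are hypothetical; the "+" means the file has
              an ACL):
                  ls -l@ file.txt                    # list xattr names/sizes
                  xattr -w com.example.note hi file.txt   # add an xattr
                  xattr -d com.example.note file.txt      # delete an xattr
                  ls -le file.txt                    # show the ACL behind "+"
                  chmod +a "admin allow read" file.txt    # add an ACL entry
                  # Apple's bundled tar (bsdtar) preserves xattrs and ACLs
                  # by default when creating and extracting archives
                  tar -cf out.tar file.txt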
Cache (7622 bytes)
ist.berkeley.edu/ciocalmailupdates/november-30-2011
IST Service Status
November 30, 2011 Update, 3:00pm

CalMail supports 70,000 accounts over 100 subdomains for students, faculty, staff, emeriti, and retirees, and also provides forwarding services for 140,000 alumni, handling more than 3 million messages a day. The current environment is five years old and is reaching its normal end of life. The hardware was scheduled to be replaced this year during a normal refresh cycle; however, the replacement is expensive (over $1M), and with the acceptance of the OE Productivity Suite project, as well as the strong interest in external services such as Google and Microsoft, the decision was made to pursue those options rather than investing in a platform we would only be shutting down in the near future. This summer, we began evaluating our sourcing options while maintaining CalMail in preparation for replacement. Our original timeline was to continue to support CalMail through 2012 while we undertook the selection and migration to a new platform.

Excess demand

Unfortunately, usage patterns did not conform to our expectations, and demand on the system is significantly greater than anticipated. CalMail, like most email platforms, is affected by the number of accounts, the number of devices connected to it, and how those devices connect. In the past, we had a predictable pattern that closely correlated to the number of accounts. During the last 18 months, we have seen a tremendous increase in the number of unique devices connecting to CalMail. Each of those devices consumes connection resources, increasing both the frequency and the number of simultaneous attempts to sync or download mail. Much of this can be attributed to the ongoing explosion in smartphones and mobile devices (Android and iOS devices such as the iPad and iPhone), which are often configured to automatically check for new email. The additional load from increased connections, combined with an already aging, fully utilized environment, made for the slower performance experienced during October. We expected to manage under those conditions until we had a hardware failure on October 25. The storage subsystem, which normally would have been able to handle such an outage through its redundant configuration, instead caused CalMail to crash. While the environment was brought back up later that day, that one event, combined with the increasingly high connection load, has contributed to the extended outage we are dealing with now. Essentially, with each new outage more messages back up and more people repeatedly check for email, each action driving more load, until the system exceeds its ability to keep up with the volume. To try to manage this load, the CalMail team "throttles", or attempts to slow down, those connections (hence the "failed to connect" messages that many people have received) via a rolling connection brownout to make sure everything stays up and all mail continues to be sent and delivered. That was working, albeit unsatisfactorily, until Thanksgiving weekend.

More hardware failure

On Friday, November 25 at 9:53 am, the storage array had an additional hardware failure. The system crashed as it had in October, and on-call staff proceeded to work on the environment to bring it back up. During that process, the database that maintains all login information and account details (the "who you are" and "where to deliver your emails") was corrupted and couldn't be cleanly recovered. That meant that mail couldn't be delivered, and if you tried to log in (if you could at all), the system wouldn't know who you were.

The database had to be rebuilt from scratch, which required recovering information from backups and then adding in all needed changes from many different log files. It was an exceptionally challenging activity that took a team of people working around the clock from Friday morning until Sunday night. The primary impact beyond lost access during that period was that outside mail servers had to continually retry sending email (per regular internet mail protocol) until the CalMail environment was again available and messages could be delivered. The team completed its work Sunday night around 11:00 pm and confirmed that email delivery queues were slowly coming back under control.

Then, at 12:45 am, another disk failed in the storage array. Disk failure is a relatively common occurrence; however, the storage environment needs spare capacity to recover all the data off of the failed disk and move it to other disks. This would have been manageable under normal load; however, once the campus community returned from the Thanksgiving holiday, we again had a huge spike with everyone logging in and syncing for the first time since the beginning of the holiday weekend. When combined with the added mobile device demand, it simply overwhelmed the system. The team again throttled connections to allow the system to stay up and deliver the backlogged email more slowly while the rebuild of the failed disk proceeded. Tuesday, the system simply ran out of capacity and locked up: pushed so far beyond its operating capacity that, while it was still running, it was so slow as to be unusable.

Processes implemented

To allow mail to keep flowing, we have made several difficult decisions that we felt were necessary under the circumstances to allow access to email while protecting the integrity of each person's mailbox. First, we are encouraging students to forward their mail in order to take the load of student email connections off of the server. More than 20 percent of our students already forward from CalMail to other email accounts. Encouraging students to take this action is intended to create some "headroom" on the environment while allowing mail to keep flowing. We also shut off the ability to connect from most mobile devices and regular email clients (like Outlook or Apple Mail) via the IMAP and POP protocols, leaving open the ability to connect via the two webmail clients, SquirrelMail and RoundCube. The result is that the system can be effectively throttled while still allowing everyone to access mail and have mail flowing in and out. It also puts a tremendous, unprecedented load on the servers that support the webmail clients, which can result in slow performance. Those 10 servers are being monitored closely and are being tuned and augmented where we can, but system performance will continue to be an issue.

Next steps

While we continue to pursue the end objective of migrating off of CalMail to a third-party email solution (either Google or Microsoft), we have multiple efforts underway, each with the objective of supporting the system through the end of the semester, at which time we will change out the back-end storage for faster equipment. Those interim plans include bringing in some of the world's leading technical experts on email to evaluate what options we have for creating more headroom in our specific email architecture. We are also bringing up an additional email server and storage environment and will be moving some accounts off to create capacity on the primary cluster.

Finally, we have engaged a hardware professional services firm to both assist with maintaining the existing environment and accelerate the implementation of the new hardware, which is expected to arrive late Wednesday night or early Thursday morning. This equipment is very large and contains hundreds of disk drives that must be "burned in" before we can put production email on it. Moving to it before that testing is completed isn't viable, as we would risk permanent mailbox loss if there were a failure.