System Upgrade Details

eric :)


    This evening, I completed the first major software update of RA's servers.  For the most part, the servers had been humming along for two years with minimal interruption until one of the firewall's two internet-facing network cards failed 6 months ago.  I had to shut down the firewall to replace it, which meant RA would be unreachable during that time.  It was a good opportunity to install patches on the servers that provide various services for RA.

    I created an upgrade plan to ensure that I wouldn't forget anything.  It documented the steps I would take, from sending out a tweet 15 minutes before shutting down the firewall, to the software patches needed by each server, to putting in the new network card.

    I told the data center's technician that I would need to remove the server from the cabinet to put in the new network card.  The room was deafeningly loud from all the air conditioning and fans used to keep the servers cool.  I had to shout next to him in order for him to hear me.  Maybe it was the noise, or maybe he was overzealous about the mini project, but the next thing I knew, he had yanked out the server's power cables and it was sitting on a cart in front of me.

    I was so shocked by what had just happened that I could only mutter "You shouldn't have done that..." to him.  It is always a bad idea to shut down a computer by yanking out the power cord.  The server he pulled out ran the database and file server that store your GPS data.  These services needed to be shut down cleanly or they might lose any unsaved data.  Worse yet, the data files might be corrupted so badly that I would have to restore the data from backup, which could take hours if not days because the backups are stored off site.

    I don't know why he was so callous about it.  One would think that he would coordinate with me and wait for me to power down everything.  What's done is done.  I needed to focus on the task at hand and hope for minimal damage.  In less than 30 seconds, my carefully choreographed plan was scrapped and I had to improvise the entire upgrade operation.

    After putting in the new network card, it was time to power everything up and survey the damage.  The first sign that something was not right was the RAID array complaining about inconsistencies.  The RAID array stores your data on multiple hard drives.  Should one of the disks fail, your data can still be retrieved from the other disks.  The array can also automatically take the bad disk offline and replace it with a spare disk.  The unclean shutdown triggered the RAID array's consistency check, which will probably take 12 to 24 hours to complete.  During this time, the servers may be a little slower than usual.  You might not notice because these servers are quite beefy.
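    For the curious, here is a minimal sketch of how an array can recover data from the remaining disks and what a consistency check compares.  It assumes a parity-based layout (in the style of RAID 5); I am not stating which RAID level RA's servers use, and the disk contents and function names below are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def is_consistent(disks, stored_parity):
    """A consistency check recomputes parity and compares it to what is stored."""
    return xor_blocks(disks) == stored_parity


# Three hypothetical data disks, each holding one 4-byte block.
data_disks = [b"GPS1", b"GPS2", b"GPS3"]
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
```

    A real array does this across every stripe on every disk, which is why a full check after an unclean shutdown can run for many hours.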

    After everything booted up, RA did not come back online.  The various servers couldn't talk to each other, and I couldn't connect to the firewall to diagnose the problem.  All traffic goes through the firewall, and it has special security to prevent unauthorized access to itself.  I had to move different machines around to bypass the security.  After a long stressful hour, I tracked the problem down to a configuration parameter that was reset during the upgrade.  The fix required several reboots.  While these servers are fast, they take minutes to reboot.  There was nothing I could do but sit and watch the time tick by.

    The entire upgrade took about 4 hours, much longer than the estimated 1.5 hours.  Part of it was beyond my control.  I did trial runs days before, but they didn't expose the configuration problem because the much more powerful production servers have different hardware.  In the end, I am just relieved that it was resolved.

    As for the backup servers, they have been delivered to the other data center.  I haven't been able to connect to them yet, nor have they made contact with the primary servers.  I don't know if that's because they're not plugged in yet, because the technician did not connect them to the correct network card, or because I have made a configuration mistake.  This other data center is so secure that I am not allowed to be there.  It will be interesting if they don't come online by later today.

     

    eric :)

      Wow, Eric :) Thanks for all the work you put into this site. That's a bummer about having everything planned out, then somebody messing things up. Hope you've gotten something to eat and had some sleep.

      "So many people get stuck in the routine of life that their dreams waste away. This is about living the dream." - Cave Dog
      wcrunner2


      Are we there, yet?

        Why in the world would anyone familiar with computers, much less a data center technician, just yank the cables out? The extended downtime made me realize how much I appreciate this site and what you do to maintain it.

         2024 Races:

              03/09 - Livingston Oval Ultra 6-Hour, 22.88 miles

              05/11 - D3 50K
              05/25 - What the Duck 12-Hour

              06/17 - 6 Days in the Dome 12-Hour.

         

         

             

        Julia1971


          I didn't understand any of the technical parts but it sounds like a big headache.  Thanks for all you do, Eric!


          Prince of Fatness

            Thanks Eric.

            Not at it at all. 

            beat


            Break on through


              I told the data center's technician that I will need to remove the server from the cabinet to put in the new network card.  The room was deafeningly loud from all the air conditioning and fans used to keep the servers cool.  I had to shout next to him in order for him to hear me.  Maybe it was the noise, or maybe he was overly zealous about the mini project, but the next thing I knew, he had yanked off the server's power cables and it was sitting on a cart in front of me.

               

              "What we have here... is failure to communicate..."

              Way to keep a cool hand and improvise!

              "Not to touch the Earth, not to see the Sun, nothing left to do but run, run, run..."


              hairshirt knitter

                But Mousie, thou art no thy lane,
                In proving foresight may be vain:
                The best-laid schemes o' mice an' men
                Gang aft agley,
                An' lea'e us nought but grief an' pain,
                For promis'd joy!

                 

                --Robert Burns

                Zelanie


                  Thank you for all of your work to keep this place running!

                    It could be worse:

                     

                    2014 Goals: (Yeah I suck)

                    • Sub 22  5K
                    • Sub 1:35 1/2 marathon 
                    • Sub 3:25:00 Marathon
                    Jack K.


                    uʍop ǝpᴉsdn sǝʇᴉɹʍ ʇI

                      Thanks for everything.  FYI... I tried to upload a workout from my watch and it doesn't work yet. I'm not complaining, just letting you know.

                      AutBatgirl


                        It could be worse:

                         

                         

                        I used to be a SysAdmin and we actually had this happen to us. For some reason a new cleaning person thought our LAN room needed to have the floors vacuumed every day (they were concrete). I finally figured out what was going on by standing in front of the server at the right time and telling her she was not allowed in that room ever again.

                        No act of kindness, however small, is ever wasted.

                          I was so shocked by what just happened that I could only mutter "You shouldn't have done that..." to him.  It is always a bad idea to shut down a computer by yanking out the power cord.  The server he pulled out ran the database and file server that stores your GPS data.  These services needed to be shut down cleanly or they might lose any unsaved data. 

                           

                          WTF? It's hard to imagine that level of incompetence from a data center employee working inside the server room. I'm really sorry you had to be on the receiving end of that. Hope the damage isn't too serious.

                             

                            I used to be a SysAdmin and we actually had this happen to us. For some reason a new cleaning person thought our LAN room needed to have the floors vacuumed every day (they were concrete). I finally figured out what was going on by standing in front of the server at the right time and telling her she was not allowed in that room ever again.

                             

                            Way back a LONG time ago (30 years or so), when I used to work on satellite ground stations, one guy told me they had a ground station in Saudi Arabia, and they could never understand why the station would go down at around 9:30 every night.

                             

                            Well, they had a guy hang out there one time, and it turns out the cleaning people would open up all the doors and windows to the station because it was too cold for them, what with the air conditioning and all.  Thus, the computers, which were a little more fragile back then, would overheat and stop working.

                            Jeff


                            an amazing likeness

                              ...

                              Of course...this has nothing to do with Runningahead.com's storage array getting hosed...

                              Acceptable at a dance, invaluable in a shipwreck.