
September 11 Outage

eric :)


    Hi all,

    RA suffered the longest outage in its history.  According to the log, connectivity was lost at 12:47 AM EDT.  Some of you emailed me soon after to report the problem.  Thank you.  I contacted the data center at 8 AM EDT.  They responded within minutes, but the reply came from their support desk, which forwarded the ticket to the technician on call.

     

    Six hours after the ticket was created, and after multiple emails requesting a diagnosis, the field tech finally told me that they were in the process of migrating all the servers' IP addresses to different routers and had been encountering network issues since Friday.  He added that connectivity should be restored within minutes, but that didn't happen until a little over an hour later, at 3:20 PM EDT.  RA was inaccessible for nearly 15 hours.
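For anyone who wants to check the arithmetic, here is a minimal sketch of the timeline using the times quoted above. The year is a placeholder (the post only gives the day), and the helper name `hours_minutes` is made up for this example.

```python
from datetime import datetime

# Times quoted in the post, all EDT on the same day.
# The year is a placeholder; only the time differences matter here.
lost = datetime(2000, 9, 11, 0, 47)       # connectivity lost at 12:47 AM
contacted = datetime(2000, 9, 11, 8, 0)   # data center contacted at 8 AM
restored = datetime(2000, 9, 11, 15, 20)  # connectivity restored at 3:20 PM


def hours_minutes(delta):
    """Split a timedelta into whole hours and leftover minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


outage = hours_minutes(restored - lost)
ticket_open = hours_minutes(restored - contacted)

print("Total outage: %d h %d min" % outage)
print("From first contact to restore: %d h %d min" % ticket_open)
```

The total comes to 14 hours 33 minutes, which is where the "nearly 15 hours" figure comes from.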

     

    Since the problem was with the data center's routers, it meant the servers were operating normally.  They just weren't receiving any traffic.

     

    While waiting to hear back from the data center, I was preparing to bring the backup servers online.  It is not a simple process, even though it should be.  All of the data is replicated to the backup servers, so even if there's a catastrophic failure of the primary servers, there should be almost no data loss.  The backup server's web server was out of date, so I had to update it before it could handle the traffic.  The problem was resolved before the web server was ready.

     

    There are ways to reduce the total downtime, even with a data center failure, but they require significant infrastructure rearrangement.  I think the current setup should be adequate once the backup web servers are updated and ready to handle the traffic.  With the current network layout, the maximum downtime can be reduced to one hour, which is the time it takes to update the DNS records to point to the new servers.
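To make the one-hour figure a bit more concrete, here is a rough sketch of how a failover decision could be automated. This is not how RA is actually set up. The function names, the check interval, and the three-failure threshold are all assumptions for illustration; only the 60-minute DNS update time comes from the post.

```python
import socket


def is_reachable(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_fail_over(check_results, threshold=3):
    """Fail over only after `threshold` consecutive failed checks.

    `check_results` is a list of booleans, oldest first, such as the
    values returned by repeated calls to is_reachable().
    """
    if len(check_results) < threshold:
        return False
    return not any(check_results[-threshold:])


def worst_case_downtime_minutes(check_interval_min, threshold, dns_update_min=60):
    """Time to detect the failure plus the time for the DNS change to take effect.

    dns_update_min defaults to the one hour mentioned in the post;
    the detection time is an assumption added for this sketch.
    """
    return check_interval_min * threshold + dns_update_min


# Example with made-up check results: one success followed by three failures.
history = [True, False, False, False]
print(should_fail_over(history))
print(worst_case_downtime_minutes(check_interval_min=5, threshold=3))
```

Requiring several consecutive failures avoids switching the DNS records over a brief network blip, at the cost of a slightly longer detection time on top of the DNS update.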

     

    I have sent the data center another email requesting information about this outage.  If they respond, then I'll add the update here.  Thank you for your patience today.  It had been quite stressful.

     

    eric :)

    CanadianMeg


    #RunEveryDay

      Thank you for taking such good care of our running logs and our community, Eric!

      It is always appreciated. :)

      Half Fanatic #9292. 

      Game Admin for RA Running Game 2023.


      running metalhead

        Thanks for keeping us up to date.
        Your efforts are greatly appreciated.

        - Egmond (14 January): 1:41:40 (21K)
        - Vondelparkloop (20 January): 0:58.1 (10K but did 13.44!!!)
        - Twiskemolenloop (4 March): 1:35:19 (3rd M45!)

        - Ekiden Zwolle (10K) (25 March)
        - Rotterdam Marathon (8 April)
        - Leiden Marathon Halve (27 May)
        - Marathon Amersfoort (10 June)

        LRB


          Thank you for your patience today.  It had been quite stressful.

           

          I can't even begin to imagine.

           

          Thank you for all that you do.

          Neil Gunn


          Gandalf the Grey

            Eric - thank you. I cannot imagine how stressful this must have been as you take such pride in the customer service, design, functionality & reliability of RunningAhead.

             

            I'm sure that everybody who uses this great site appreciates your hard work so once again, a big thank you from me!

             

            Neil

            UK

            Running ... just keep running!

            stealth.rnr


            She laughs at me......

              Thank you for everything you do, Eric.

               

                                                               

              GinnyinPA


                Thank you Eric.  We really do appreciate your hard work.


                Prince of Fatness

                  Thanks for everything, Eric.

                   

                  This reminded me that I have been slacking and have not donated in a while.  Donation submitted.

                  Not at it at all. 

                    Thank you, Eric.  I only panicked a little!  I have faith in you! :)

                    Out there running since dinosaurs roamed the earth

                     

                    obiebyke


                      I re-upped my subscription just now. Sending you a big happy hug, if you're into that sorta thing. ;) <3

                      Call me Ray (not Ishmael)


                      tomatolover

                        Thanks for everything, Eric.

                         

                        This reminded me that I have been slacking and have not donated in a while.  Donation submitted.

                         

                        Ditto- Thanks for all your hard work Eric!!  I think we should hold another pledge drive soon!

                        HCH


                          Wow. Thank you, Eric! A big old reminder to get my subscription up-to-date. (Although I have to admit it's been interesting seeing the ads Google has chosen for me. 55+ travel? Oh man......)

                          Only 26.2 miles more to go.

                          eric :)


                            The data center manager replied to my displeased email but did not provide much additional info.  No surprise there.  Essentially, they added another router to the network to improve redundancy.  The new router and the old one were not communicating properly, which caused some of the servers' traffic, including RA's, to be dropped.  There's really nothing anyone can do about that, aside from writing an angry email to help them remember this incident so that they'll do more testing during any future upgrades.

                            eric :)


                              Thank you, Eric.  I only panicked a little!  I have faith in you! :)

                               

                              The data is actually quite safe.  It is replicated to two sets of servers located in Los Angeles and Boston.  The database is replicated on a per-second basis, while data files are replicated every 10 minutes.  Should Los Angeles fall into the ocean, only 10 minutes' worth of workout data files would be lost.  It may take a couple of days to bring up replacement servers, but your data is not lost.
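As a rough illustration of what those replication intervals mean for data loss, here is a small sketch. It assumes each replication run starts on a fixed schedule and finishes instantly, which is a simplification, and the function names are made up for this example.

```python
# Replication intervals quoted in the post, in seconds.
REPLICATION_INTERVAL_S = {
    "database": 1,           # replicated on a per-second basis
    "data files": 10 * 60,   # replicated every 10 minutes
}


def lost_window_s(seconds_since_first_sync, interval_s):
    """Seconds of writes not yet replicated when the primary fails.

    Assumes syncs run at t = 0, interval, 2 * interval, ... and
    complete instantly (a simplification for this sketch).
    """
    return seconds_since_first_sync % interval_s


def worst_case_loss_s(interval_s):
    """Upper bound on the loss: a failure just before the next sync."""
    return interval_s


# Hypothetical failure 25 minutes 30 seconds after a sync cycle began.
failure_at = 25 * 60 + 30
for name, interval in REPLICATION_INTERVAL_S.items():
    print(
        "%s: lost %d s in this scenario, at most %d s in the worst case"
        % (name, lost_window_s(failure_at, interval), worst_case_loss_s(interval))
    )
```

In this made-up scenario the data files would lose five and a half minutes of changes, and never more than the 10 minutes mentioned above.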

                              eric :)


                                I re-upped my subscription just now. Sending you a big happy hug, if you're into that sorta thing. ;) <3

                                 

                                Who doesn't enjoy a big happy hug?
