
Race time predictors, revisited

eric :)


    Fetch of FetchEveryone.com wrote an article about his analysis of Peter Riegel's race time prediction formula.  The basic idea of the formula is that our pace slows as a power function of distance, which plots as a straight line on log-log axes.  This decay is expressed as the performance degradation coefficient.
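For reference, Riegel's formula is usually written T2 = T1 x (D2/D1)^k, where k is the degradation coefficient (1.06 in Riegel's version, 1.15 in Fetch's proposal). A minimal Python sketch follows; the function names are made up for this illustration and do not come from either site.

```python
def predict_time(t1_seconds, d1, d2, k=1.06):
    """Predict the time at distance d2 from a known time t1 at distance d1.

    Riegel's formula: T2 = T1 * (D2 / D1) ** k. The distances can be in
    any unit as long as d1 and d2 use the same one.
    """
    return t1_seconds * (d2 / d1) ** k


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Compare the two coefficients for a 2:00:00 half marathon.
half_km, full_km = 21.0975, 42.195
for k in (1.06, 1.15):
    print(k, format_hms(predict_time(2 * 3600, half_km, full_km, k)))
```

For a 2 hour half, the two coefficients give marathon predictions of roughly 4:10 and 4:26.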

    Fetch attempted to verify the formula against the data collected on his site by calculating the coefficient using the finish times from half and full marathons of 1,071 users.  He found that the 1.06 coefficient only applies to elite runners.  He proposed that a value of 1.15 should be used for the general public because it described his data better.

    We had a discussion about the validity of his analysis here on RA.  In my only post, I suggested that Fetch's methodology might be flawed.  He did not explain how he calculated the 1.15 coefficient.  I don't know if FetchEveryone has predefined workout types such as race.  Without them, it would be hard to separate races from training runs of the same distance, which would skew the results.  How did he choose the half and full marathons?  One user said if the coefficient fits the data, then it must be correct.

    With some free time tonight, I decided to run my own analysis against RA's data.  The first step was to find all foot races from 41,000 to 43,000 meters.  The exact distances of 26.2 miles or 42.195 km were not used because users do not always enter the exact distance.  Although some of these races are not exactly the marathon distance, the small differences should not affect the result.


    The pace was calculated for each of these races.  Races with finish times under 2 hours were removed.  They tend to be caused by users omitting the duration's seconds component (e.g. 3:45 instead of 3:45:00).  Finish times larger than 15 hours were also removed.  Each user's marathon PR was calculated by finding the race with the fastest pace.  All non-PR races were removed, leaving one race per person in our marathon data set.

    Since we're comparing a person's marathon time against the half marathon time, we can only look for half marathons run by the people that also ran at least one marathon.  Furthermore, only half marathons up to 365 days before each person's marathon PR were included.  Half marathons outside of the one year period may not reflect the person's physical condition around the time of the marathon.

    These half marathon results were whittled down using the same method as the full marathon.  That is, improbable entries were removed, and a PR was picked for each person.  In the end, there were 4,402 people who ran a half marathon up to a year before their full marathon.  These half and full marathon pairs also represent each person's personal best during the one year period.
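The selection steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the record fields (`user`, `date`, `distance_m`, `duration_s`) are hypothetical, and the half marathon distance window and time limits are guesses, since the post only gives the thresholds used for the marathon.

```python
from datetime import date


def pace(race):
    """Seconds per meter; lower is faster."""
    return race["duration_s"] / race["distance_m"]


def fastest_per_user(races):
    """Keep one race per user: the one with the fastest pace."""
    best = {}
    for race in races:
        current = best.get(race["user"])
        if current is None or pace(race) < pace(current):
            best[race["user"]] = race
    return best


def pair_prs(races, half_range_m=(20500, 21500), half_limits_h=(1, 7.5)):
    """Return {user: (half_pr, full_pr)} following the steps in the post.

    half_range_m and half_limits_h are assumptions, not values from the post.
    """
    # Marathons: 41,000 to 43,000 m, finish time between 2 and 15 hours.
    fulls = [r for r in races
             if 41000 <= r["distance_m"] <= 43000
             and 2 * 3600 <= r["duration_s"] <= 15 * 3600]
    full_pr = fastest_per_user(fulls)

    halves = []
    for r in races:
        pr = full_pr.get(r["user"])
        if pr is None:
            continue  # this user never logged a valid marathon
        if not half_range_m[0] <= r["distance_m"] <= half_range_m[1]:
            continue
        if not half_limits_h[0] * 3600 <= r["duration_s"] <= half_limits_h[1] * 3600:
            continue
        # Only halves run up to 365 days before the marathon PR.
        days_before = (pr["date"] - r["date"]).days
        if 0 <= days_before <= 365:
            halves.append(r)
    half_pr = fastest_per_user(halves)
    return {user: (half_pr[user], full_pr[user]) for user in half_pr}


# Made-up records, for illustration only.
example = [
    {"user": "a", "date": date(2013, 10, 1), "distance_m": 42195, "duration_s": 14400},
    {"user": "a", "date": date(2013, 5, 1), "distance_m": 21097, "duration_s": 6840},
]
print(pair_prs(example))
```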

    Below is a chart of the coefficient distribution of the raw data:



    The chart showed that some of the data is invalid.  Any coefficient less than 1 is impossible since no one can run a full marathon at a faster pace than a half at a 100% effort.  The largest coefficient was 1.94, which should also be invalid.  Using 1.94 as the coefficient, a 2 hour half marathon would translate to a 7 hour 40 minutes marathon.
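The post does not spell out how each person's coefficient was computed. Presumably it comes from solving Riegel's formula for the exponent, k = ln(T2/T1) / ln(D2/D1). A small sketch under that assumption (the function name is made up for this example):

```python
import math


def degradation_coefficient(t1, d1, t2, d2):
    """Solve T2 = T1 * (D2 / D1) ** k for k.

    t1 and t2 are times in the same unit; d1 and d2 are distances in the
    same unit.
    """
    return math.log(t2 / t1) / math.log(d2 / d1)


# A 2:00:00 half and a 7:40:00 marathon.
k = degradation_coefficient(2 * 3600, 21.0975, 7 * 3600 + 40 * 60, 42.195)
print(round(k, 2))
```

With a 2:00:00 half and a 7:40:00 marathon this gives about 1.94, consistent with the example above. A value below 1 means the marathon was run at a faster pace than the half.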

    Just for chuckles, I did some simple statistics on the raw data:

    Mean: 1.14

    Median: 1.13

    Mode: 1.09

    Standard deviation: 0.13

    The data clean up involved removing coefficients less than 1.  I also arbitrarily removed coefficients greater than 1.4 since that value would produce a predicted marathon time of 5 hr 16 min using a 2 hour half marathon.


    The cleaned up data produced the following statistics:

    Mean: 1.15

    Median: 1.13

    Mode: 1.09

    Standard Deviation: 0.084
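The clean-up and summary statistics can be reproduced along these lines with Python's standard `statistics` module. Two details are assumptions, because the post does not state them: the mode is taken after rounding each coefficient to two decimals, and the standard deviation is the sample (n - 1) version.

```python
import statistics


def summarize(coefficients, low=1.0, high=1.4):
    """Drop coefficients outside [low, high] and summarize the rest."""
    kept = [k for k in coefficients if low <= k <= high]
    # Assumption: mode is computed on values rounded to two decimals.
    rounded = [round(k, 2) for k in kept]
    return {
        "n": len(kept),
        "mean": statistics.mean(kept),
        "median": statistics.median(kept),
        "mode": statistics.mode(rounded),
        "stdev": statistics.stdev(kept),  # sample standard deviation
    }


# Made-up coefficients, for illustration only.
print(summarize([0.9, 1.05, 1.09, 1.09, 1.2, 1.5]))
```

Moving `low` and `high` shifts the mean noticeably more than it shifts the mode, so the choice of cut-offs matters when comparing means.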


    Since both RunningAHEAD.com and FetchEveryone.com have sufficiently large user bases, I assume that the data from both sites should be distributed similarly.  And from the numbers above, it appears that Fetch was right, that a 1.15 coefficient would work better than Riegel's 1.06.  Or was he?

    I think the mean and median coefficients are meaningless.  From the raw chart above, it is obvious that the data contained invalid coefficients.  How one determines which values to eliminate from the final result will impact both the mean and median.  Assuming that the data reflects the distribution of runners from around the world, the mode of 1.09 would be a better coefficient to use.

    So what does this all mean?  It means each person is different.  We all slow down differently as we run longer distances.  As I wrote in my post in the discussion thread, instead of applying one fixed value to predict one's marathon time, it is better to calculate the degradation coefficient for each person.

    Feel free to download the raw data of half and full marathon distances and durations that I used to produce the charts.  It contains no user identifiable information.  You are welcome to run your own analysis and report back.

    eric :)

      One thing I think people are missing, in the previous discussion and here, is the following from the FetchEveryone article:

       

      I fed all of my 1071 runners through that formula, and found that only 49 of them managed to hold on to the tails of 1.06 - it was far more common to see a score of 1.15. So if we adjust our formula to look like this (for half marathon to marathon only):

       

      T2 = T1 x (D2/D1)^1.15

       

      we see an instant overall improvement. The number of bad predictions drops from 65% to just 27%.

       

      But let's not stop there! We know that faster runners tend to be capable of holding on to their pace for longer. So instead of using a constant 1.15, let's connect that number to the speed of the runner, using their half marathon time. [emphasis added]

       

      He doesn't give the final formula, though.

       

      Thanks for providing this data, Eric -- I do want to find the time to see whether there is a relationship between speed and exponent, as indicated by the article (although my wife is now sorry she pointed this out to me, as I'm going to get distracted by all this while I have other stuff to do around the house :) ).

      Lou, (aka Mr. predawnrunner), MD, USA | Lou's Brews | lking@pobox.com

      xhristopher


        So what does this all mean?  It means each person is different.  We all slow down differently as we run longer distances.  As I wrote in my post in the discussion thread, instead of applying one fixed value to predict one's marathon time, it is better to calculate the degradation coefficient for each person.

         

        Wow! Thanks for that. Very interesting.

         

        In my case I've found that, in most calculators, my half marathon times predict a marathon about 10 minutes faster than I've run so far. I understand the "properly trained for the distance" disclaimer given by calculators, but I believe that if I trained to get my marathon time to where my half time predicts, my half time would then end up being a minute or two faster, and my PRs would continue to be out of alignment, at least according to most calculators.

         

        I've also found it's much, much harder to run a marathon correctly than a half. Therefore, my "misalignment" in PRs may be as much about experience as fitness.

          So what does this all mean?  It means each person is different.  We all slow down differently as we run longer distances.  As I wrote in my post in the discussion thread, instead of applying one fixed value to predict one's marathon time, it is better to calculate the degradation coefficient for each person.

           

          Here is where I propose that instead of using a race time to predict another race time, how about using a natural log regression on a bunch of recent races to predict race time at another distance?

           

          Here is my regression for my past 5 races. By putting in the race distance (x) it spits out miles/hour race speed (y) for that distance. This method is automatically personal because it uses recent race results at different distances therefore no guess/check should be needed and as your fitness changes the shape of the regression will also change along with it:
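As an illustration only (this is not the poster's actual regression or data), a fit of this form, speed = a + b * ln(distance), can be computed with ordinary least squares. The race values below are invented placeholders and the function names are hypothetical.

```python
import math


def fit_log_speed(races):
    """Least-squares fit of speed_mph = a + b * ln(distance_miles).

    races is a list of (distance_miles, hours) tuples. Returns (a, b).
    """
    xs = [math.log(d) for d, _ in races]
    ys = [d / hours for d, hours in races]
    n = len(races)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_hours(a, b, distance_miles):
    """Predicted finish time in hours at the given distance."""
    speed_mph = a + b * math.log(distance_miles)
    return distance_miles / speed_mph


# Invented race results: (distance in miles, time in hours).
example_races = [(3.1, 0.40), (6.2, 0.84), (13.1, 1.85)]
a, b = fit_log_speed(example_races)
print(round(predict_hours(a, b, 26.2), 2))
```

Note that predicting a marathon from races no longer than a half extrapolates beyond the fitted range, so the result should be treated with caution.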

           

            Thanks Eric!  I think it would be interesting to break the results out into grouping by training MPW, for example <40 MPW, 40-60, 60-80, 80-100, 100+.  I think we might find different coefficients based on training load.  And to Xhristopher's point, maybe we need to factor in # of marathons?

             

            Personally, I think both types of marathon predictors are right, just for different applications.  I would use the 1.15 coefficient for an inexperienced marathoner or someone who was not trained properly for the marathon to predict an average finish time for that type of runner.  OTOH, if you want to predict the best possible performance you would use McMillan, Daniels, etc.

            GC100k


               This method is automatically personal because it uses recent race results at different distances therefore no guess/check should be needed and as your fitness changes the shape of the regression will also change along with it:

               

              That's also true of the method we've been talking about.  In fact, if you did ln(pace), it'd be exactly the same thing.

                That's also true of the method we've been talking about.  In fact, if you did ln(pace), it'd be exactly the same thing.

                 

                The problem that I see with using 2 races to get a coefficient (which is basically trying to predict the natural log behavior) is that there is obviously some variation in what your real 5k time is or half marathon time (meaning that if you run 5x half marathons you will have 5x different times…which one do you use? Was your best time because it was your maximum effort or was the course a little short? Was one on a hot day? Was one very hilly?).

                 

                By including more races at more distances, the regression becomes more accurate and more able to wash out the impact of a particular race course/day. You can also have multiple times at each distance to help improve the accuracy of the regression. In my opinion, using two specific races to generate a factor assigns too much value to those particular races.
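One way to combine several races with the Riegel-style exponent is to fit a straight line in log-log space, ln(T) = c + k * ln(D), where the slope k plays the role of the degradation coefficient. With exactly two races this reduces to the two-point calculation. A minimal sketch, with hypothetical function names and generated sample data:

```python
import math


def fit_riegel_exponent(races):
    """Least-squares fit of ln(T) = c + k * ln(D).

    races is a list of (distance, seconds) tuples. Returns (k, c).
    """
    xs = [math.log(d) for d, _ in races]
    ys = [math.log(t) for _, t in races]
    n = len(races)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = mean_y - k * mean_x
    return k, c


def predict_seconds(k, c, distance):
    """Predicted time in seconds at the given distance."""
    return math.exp(c + k * math.log(distance))


# Sample data generated from T = 300 * D ** 1.08 (distances in km).
sample_races = [(d, 300 * d ** 1.08) for d in (5, 10, 21.0975)]
k_fit, c_fit = fit_riegel_exponent(sample_races)
print(round(k_fit, 2))
```

Each extra race at a new distance gives the fit another point, so a single unusually fast or slow result has less influence on k.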

                GC100k


                   

                  The problem that I see with using 2 races to get a coefficient (which is basically trying to predict the natural log behavior) is that there is obviously some variation in what your real 5k time is or half marathon time (meaning that if you run 5x half marathons you will have 5x different times…which one do you use? Was your best time because it was your maximum effort or was the course a little short? Was one on a hot day? Was one very hilly?).

                   

                  By including more races at more distances, the regression becomes more accurate and more able to wash out the impact of a particular race course/day. You can also have multiple times at each distance to help improve the accuracy of the regression. In my opinion, using two specific races to generate a factor assigns too much value to those particular races.

                  In the other thread we were talking about throwing all your races or some subset of your races into the regression, so that's what I was referring to.  Ya, using just two races has limited utility.

                    In the other thread we were talking about throwing all your races or some subset of your races into the regression, so that's what I was referring to.  Ya, using just two races has limited utility.

                     

                    Got it. I have to admit that I didn't read the other thread, just this one. And I agree that whether you use pace, speed, or time doesn't really matter; what matters is the use of just two races. Sorry for not getting up to speed on the previous discussions before jumping in. Cheers!

                    CSP


                      Thanks for posting the results of your work--interesting information.

                       

                      Whenever I read about a study, the second thing I look at is always the profile of the participants selected.  Often it's quite telling.  In anything athletic-related published out of a university, it's usually: "college-aged" "trained" "male" etc.  Who else would volunteer, be near a university that conducts studies, or be required to volunteer due to class requirements?  (My own studies--different field--were all based on college freshman participants because the 101 class was required to participate in an upperclassman's research project in order to receive an easy credit toward their grade.)

                       

                      When you pull data from a website, I would guess that you have a better chance of getting more than just the college-aged Joe--makes it more interesting for the general masses, and for me, [not a college-aged Joe].

                       

                      In the Fetch analysis, the participants were:

                       

                      runners who had completed at least five half marathons and five full marathons

                       

                      It was also noted that: 

                       

                       if a man and woman can both run a 90 minute half, the woman is likely to beat the man by nearly four minutes over marathon distance. That difference gets bigger as pace reduces - women capable of two hour half marathons can look forward to beating their male equals by nearly 9 minutes in the marathon.

                       

                       

                      Recently, I read that endurance increases with age.  (I think the source was Hal Higdon's book Run Fast--apologies for not providing a good reference).  Would this suggest that the factor decreases with age?

                       

                      Formulas always fascinate me--yet, I tend to forget that life is filled with outliers--we tend to be more "range-ish" than "average".  This quote from the Fetch analysis summed it up quite well:

                       

                      Of course, it's just a prediction, and within each prediction there's leeway, and plenty of people who stray from the norm. It's based on averages, and as a runner you are anything but average. It doesn't take into account anything that happens on race day, and it doesn't know anything about how hard or effectively you trained. And therein lies the real secret towards converting a good half marathon time into a great marathon experience.

                      TheProFromDover



                        For me a 1/2M is not enough info.  Or the distribution covers me but I'm out at one end.  Being a 5k guy, my McMillan 5k predicts a 2:25 marathon.  That is just not happening.  I am human so my 1/2M does predict a time that lies within the expectations.  But knowing exactly how I falter with increasing distance is pretty helpful.

                         

                        (I only skimmed all the posts and data, so I should prob do more homework. But I'm lazy.)

                        -Craig - "TPFD53 at gmail dot com"

                          Thanks Eric!  I think it would be interesting to break the results out into grouping by training MPW, for example <40 MPW, 40-60, 60-80, 80-100, 100+.  I think we might find different coefficients based on training load. 

                           

                          +1

                           

                          It is a great analysis.

                           

                          The problem with the data is that the HM and M may not have similar route profiles. For instance, some may run a hilly HM but a flat M, and weather conditions may differ as well. Training load may be different too for the two events.

                           

                          As many said, beginners usually improve dramatically when training consistently. So the fitness may be very different between the two events.

                          5k - 20:56 (09/12), 7k - 28:40 (11/12), 10k trial - 43:08  (03/13), 42:05 (05/13), FM - 3:09:28 (05/13), HM - 1:28:20 (05/14), Failed 10K trial - 6:10/mi for 4mi (08/14), FM - 3:03 (09/14)

                            I did a bit of analysis with Eric's data, threw out the same points (exponents < 1 or > 1.4), binned by half marathon time (15 minute bins) and calculated the mean, median and mode for each bin.  I threw out bins with < 50 entries (these turned out to be the longer half marathon times).
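The binning described here could be sketched as below. The input layout (pairs of half marathon time in seconds and coefficient) and the rounding used for the mode are assumptions, not details taken from the post.

```python
import statistics
from collections import defaultdict


def bin_stats(pairs, bin_minutes=15, min_count=50):
    """Group coefficients by half marathon time and summarize each bin.

    pairs is an iterable of (half_seconds, coefficient). Returns a dict
    keyed by the bin's starting minute.
    """
    bins = defaultdict(list)
    for half_seconds, k in pairs:
        if 1.0 <= k <= 1.4:  # same cut-offs as the original clean-up
            start = int(half_seconds // (bin_minutes * 60)) * bin_minutes
            bins[start].append(k)

    summary = {}
    for start in sorted(bins):
        ks = bins[start]
        if len(ks) < min_count:
            continue  # too few runners in this bin
        summary[start] = {
            "n": len(ks),
            "mean": statistics.mean(ks),
            "median": statistics.median(ks),
            # Assumption: mode on values rounded to two decimals.
            "mode": statistics.mode([round(k, 2) for k in ks]),
        }
    return summary
```

For example, a 1:35:00 half (5,700 seconds) falls in the bin that starts at minute 90.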

                             

                            Here are my results.  Bin size is the number at the top.   I think the data is getting skewed by outliers.  Not sure the best way to get rid of them.  I don't think mode is the answer, but something like mode...

                             

                             

                             

                            Lou, (aka Mr. predawnrunner), MD, USA | Lou's Brews | lking@pobox.com

                              ...

                              The problem with the data is that the HM and M may not have similar route profiles. For instance, some may run a hilly HM but a flat M, and weather conditions may differ as well. Training load may be different too for the two events.

                              ...

                              I've been amazed at how many people must run similarly profiled races (generally flat, roads) across a distance range to make these things work.

                               

                              Most of my races are hilly trails, so I can't even comprehend the calculations needed to do anything meaningful with this type of prediction. Training relevancy (vertical, agility) may be more important than mpw - or not.

                              "So many people get stuck in the routine of life that their dreams waste away. This is about living the dream." - Cave Dog
                              zonykel


                                I've been amazed at how many people must run similarly profiled races (generally flat, roads) across a distance range to make these things work.

                                 

                                Most of my races are hilly trails, so I can't even comprehend the calculations needed to do anything meaningful with this type of prediction. Training relevancy (vertical, agility) may be more important than mpw - or not.

                                 

                                It's probably doable. The margin of error may need to be increased to account for the larger variability of the terrain, though. Seems like a probability and statistics problem, not a running problem.
