March 31, 2008

In a fit of scientific skepticism, we decided to calculate how unlikely Joltin’ Joe’s achievement really was. Using a comprehensive collection of baseball statistics from 1871 to 2005, we simulated the entire history of baseball 10,000 times in a computer.
  • "Also, instead of assuming that a player has a 50 percent chance of hitting successfully in each game, we used baseball statistics to calculate each player’s odds, as determined by his actual batting performance in a given year." IANAStatistician, but isn't this importing the one real year into all of the hypothetical universes -- that is, using the real hits as the basis for calculating the imaginary hits? Isn't that completely contrary to the exercise? The point is to see what batters would produce in these imagined universes. To use that one real event as a basis for all others is to restrict all the possible outcomes to that one set of batting averages -- which would not be the case. I would think, and again, I'm completely unqualified to pronounce on such matters, that coming up with an accurate set of hypothetical universes, ones where Joe DiMaggio had 1941 seasons where he didn't always hit .357, but also ones where he hit .312 or a measly .220, is the real impossibility. Universes where pitching was up or down, weather was good or bad, stitching on the balls was tighter or looser, universes where the stands were empty as white folk went to the Negro Leagues for the higher quality of play, universes where there wasn't a war... The star of 1941 wouldn't always be Joe DiMaggio -- it could have been Joe Garagiola instead. Baseball is a game of innumerable factors, keeping an accurate grasp of what's about to happen maddeningly out of reach. Creating the 10,000 myriad seasons of baseball would require taking thousands of different factors into account, everything from how the batter's wife was feeling neglected as buddy was on the road, to whether there was some fan sitting by the left fence who was going to catch a stray fly ball and influence the game unwittingly. I wouldn't say that breaking Joltin' Joe's record is impossible, but it is rare.
Not rare because we haven't seen it, but rare in that so many things can go wrong (or right) that haven't been accounted for by plugging in that one set of real numbers which, despite being real, are completely ephemeral.
  • If I can elaborate on my ignorance for a moment, it's simply to say that a hitting streak can be ruined by an errant pigeon intersecting the ball's trajectory. The streak can be extended by the shortstop tripping on a piece of gravel. These artificial universes don't -- and can't -- take these innumerable factors into account. I don't think that breaking the record is impossible -- I think coming up with accurate alternative universes is.
  • I think that the real life stats for the players are used for the rankings in the computer generated seasons, in roughly the same way that a D&D character is generated. Players have strengths and weaknesses. The computer 'dice' in each simulation then weight the players' skills against a range of random factors. Statistically, the 10,000 iterations should filter out the most bizarre sets of circumstances and the statistical mean should be reasonably close to the observed reality. Initially, I guess, the model would have to be tweaked until the meaty part of the bell curve bore a reasonable similarity to what actually happened. I'm not a statistician. I'm not even good at sums.
  • Those aberrations are why they ran the simulation 10,000 times, instead of just once. The assumptions are that random streak-ending or streak-extending events, though rare, happen with equal distribution and will be filtered out over numerous trials. They didn't give much detail, but even if they just simulated the 1941 season, I would guess the longest streak would happen to someone else. DiMaggio didn't have the highest batting average that year - he was third behind Ted Williams (.406!!!) and Cecil Travis (.359).
  • I just don't get how you can simulate the entire history of baseball by using the same set of batting averages. I just don't.
  • Let us not forget that in some of these universes, the games would be interrupted by swarms of giant wasps, or robo-Hitler.
  • I think they're simulating each season with the players' final averages at the end of that season. For instance, in 1941 DiMaggio averaged .357. If he averaged 3.9 at-bats per game that year, then his chances of hitting at least once in any one game were 1-((1-0.357)^3.9) or 0.821. Running a simulated season would give hitting streaks of varying length. The more simulated seasons you run, the more chances you get to see a freakishly long streak.
  • WE WANT A HITTER, NOT A ROBO-HITLER! WE WANT A CATCHER, NOT A ROBO-MARGARET THATCHER!
  • Really rather prejudicially ignorant of you, koko. I played a double-header one time with a robo-Margaret Thatcher behind the plate. She caught the entire fourteen innings without so much as one passed ball, threw out four of five trying to steal second, laid down a beautiful sacrifice bunt, and added two doubles and a single. Really couldn't have asked for any more.
  • That skin bag wouldn't have lasted one pitch in the old Robot Leagues! Now Wireless Joe Jackson, there was a blern hitting machine!
  • Say what you will about Margaret Thatcher, but I corked the bat.
  • Too much numberfication in this thread!
  • So, how many cylons are as-yet unrevealed in the majors?
  • Yay, baseball season!! That is all.
  • I'm not sure, because details are sketchy, but I think this is utterly bogus. What it looks like they're doing is this: suppose you listed a batter's hitting performance as a series of 1s and 0s indicating games where he hit or did not hit. So a player who played 10 games in a row and hit in the 2nd, 4th, and 10th game would look like this: 0 1 0 1 0 0 0 0 0 1. Then they're just finding all possible orderings of these digits. Or not even all possible orderings (because that would be a simple math function that I'm too lazy to figure out right now), but only at most 10,000 possible orderings generated randomly. Obviously, for anyone who played in, say, 100 games, and hit in more than 56 of them, there's going to be some "universe" where at least 56 of those hits come in sequential games.
  • Monkeyfilter: the meaty part of the bell curve
  • Best Game Ever.
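
For anyone who wants to poke at the arithmetic in the comment about DiMaggio's .357 average, here is a rough Python sketch. It is not the method the researchers actually used, which the thread does not describe in detail. It computes the per-game hit chance from the formula 1-((1-avg)^at_bats) and then replays a season many times to see how long the longest streak gets. The 139-game season length, the function names, and the fixed random seed are assumptions added for illustration; the thread only supplies the .357 average, the 3.9 at-bats per game, and the 10,000 runs. It also treats every game as an independent coin flip with the same odds, which is exactly the simplification several commenters object to.

```python
import random


def per_game_hit_prob(batting_avg, at_bats_per_game):
    """Chance of at least one hit in a game, assuming independent at-bats."""
    return 1 - (1 - batting_avg) ** at_bats_per_game


def longest_streak(p_hit, games, rng):
    """Play one simulated season and return its longest hitting streak."""
    best = current = 0
    for _ in range(games):
        if rng.random() < p_hit:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def simulate_seasons(p_hit, games, seasons, seed=0):
    """Replay the season many times; return the longest streak from each run."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps runs repeatable
    return [longest_streak(p_hit, games, rng) for _ in range(seasons)]


if __name__ == "__main__":
    p = per_game_hit_prob(0.357, 3.9)  # figures quoted in the thread
    print(f"per-game hit chance: {p:.3f}")

    # games=139 is an assumed season length, not a number from the thread.
    streaks = simulate_seasons(p, games=139, seasons=10_000)
    hits_56 = sum(1 for s in streaks if s >= 56)
    print(f"longest streak seen in any run: {max(streaks)}")
    print(f"runs with a streak of 56 or more: {hits_56} of {len(streaks)}")
```

This only replays one player's season. Simulating every player in every season, as the quoted article describes, would mean repeating the same loop across all of them, which is why a long streak by somebody, somewhere, becomes much more likely.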