Football analytics/advanced stats...

AUSpur

Well-Known Member
Aug 31, 2010
1,145
1,116
Advanced stats are starting to get a lot more attention in our sport and I thought a thread to discuss them might be of interest. To break the ice, here's a blog post by MCofA at Cartilage Free Captain who's created a minute-by-minute database of "events" in matches for the 2012-13 season.


The Minute-by-Minute Database III: New Expected Goals and introducing SiBoT
By MCofA on Jun 27 2013

Shots in the Box on Target are hardly a complicated thing to track, and yet the stat is not widely available. I argue here that the basically objective SiBoT are a better marker of the quality of a club's attack or defense than the highly subjective "big chances" tracked by Opta Sports. SiBoT also explain Tottenham's poor conversion rate last year.

In my last piece, I explained why I want to build club quality statistics off of the underlying stats, things like shots on target, shots in the box, and big chances. During the last season, while I was running power rankings and season projections, I was using a goals scored / goals conceded estimator built from just these underlying stats. I primarily used shots on target and big chances, with adjustments for schedule difficulty.

I want to use the minute-by-minute database to renoobulate my expected goals formula. If possible, I'd like to get away from using "big chances." I used them for the last season because big chances correlate very well with goals scored, and if you adjust a team's G/SoT based on their goals scored, you get a better correlation between expected goals and actual goals. We use the tools we have, if they work.

But I don't like big chances. Opta Sports defines a "big chance" as follows:

"A situation where a player should reasonably be expected to score usually in a one-on-one scenario or from very close range."

This is a mostly subjective definition. For an in-game example of my problems with it, take the home match against Everton toward the end of the season, when we were missing Gareth Bale. Both of the goals for Tottenham Hotspur, by Gylfi Sigurdsson and Emmanuel Adebayor, were classified as converted big chances. Sigurdsson's open-net finish of a rebound is about as classic a "big chance" as you'll see. Adebayor was at full stretch, but he did find himself clear on the end of a beautiful cross in the box. Those are ok, and give you a sense of what a "big chance" is. The other big chance in the game troubles me. Phil Jagielka's equalizing header also gets classified as a "big chance." Yes, it was a free header from reasonably close range, but that was a tight angle and free headers go every which way all the time in this sport. Would Jagielka have been credited with a "big chance" if he'd knocked it wide of the post or softly into the chest of Hugo Lloris? I don't know, but I am concerned. Further, after Sigurdsson's equalizer, Everton had a major opportunity on the counter which Victor Anichebe just failed to convert.

This example isn't meant to be definitive, but I hope it gets at my concerns. Do "big chances" correlate with goals scored merely because they're a really good statistic on their own, or do they also correlate with goals scored because Opta stringers are more likely to classify chances as "big" if they are converted into goals?

But at the time, I didn't have anything better to use. Now, thanks to the minute-by-minute database, I think I do.

When I argued during the season that I thought Tottenham had been unlucky in their goal-scoring, one of the common responses had to do with shots in the box. Tottenham took tons of shots from outside the box, more than anyone else in the EPL. They weren't converting because they were shooting from so far away. I said in response that I had found a mostly poor correlation between shots in the box and goals scored, certainly much weaker than the correlation with shots on target. In most publicly available slices of the Opta data, you can get information about team shots in the box or about team shots on target, but not about the combination of the two. So since I didn't have data on shots in the box on target, I didn't use that.

I have the data now, and I think it's really important. Shots in the box on target are converted at a way, way higher rate than shots out of the box on target. For SiBoT, it's 35%. SoBoT are about one-third as likely to be scored, at a 12% conversion rate. Spurs not only led the Premier League in shots outside the box on target with 109, we were lapping the field. Queens Park Rangers were second with 76, Liverpool third with 70. You expect a club with that many of their SoT coming from outside the box to convert at a lower rate than the average club, and that's exactly what Tottenham did.
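As a rough illustration of the point above, here is a minimal Python sketch of how the shot mix drags down a club's overall conversion rate. The 35% and 12% rates and the 109 SoBoT figure come from the post; the SiBoT counts and the second club are made up for illustration, and `blended_conversion` is a hypothetical helper name.

```python
# League conversion rates quoted in the post.
SIBOT_RATE = 0.35  # shots in the box on target
SOBOT_RATE = 0.12  # shots outside the box on target


def blended_conversion(sibot, sobot):
    """Expected goals per shot on target, given the in/out-of-box mix."""
    expected_goals = sibot * SIBOT_RATE + sobot * SOBOT_RATE
    return expected_goals / (sibot + sobot)


# Two hypothetical clubs with the same (assumed) 100 SiBoT.
long_range_club = blended_conversion(sibot=100, sobot=109)  # Spurs-like SoBoT count
typical_club = blended_conversion(sibot=100, sobot=60)      # assumed SoBoT count

print(f"long-range club: {long_range_club:.3f} goals per SoT")
print(f"typical club:    {typical_club:.3f} goals per SoT")
```

The club taking more of its on-target shots from distance ends up with the lower expected conversion rate, with no bad luck or bad finishing required.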

There's a second way in which SiBoT are really useful. They correlate extremely well with big chances. The correlation for attack is .90, for defense .78. Shots in the Box on Target are a good proxy for Big Chances, but without all that problematic subjectivity.

It makes sense, when you think about it. Imagine a big chance in your head. It's going to be an opportunity inside the 18-yard box. And then, with SiBoT, you get to group together just those big chances that someone was skilled enough to direct to the mouth of goal. Along with the big chances, you'll get mostly opportunities that perhaps aren't as "big," but they probably had to be pretty good. Only about 35% of shots in the box end up on target; most either miss the target or get blocked on the way in. Finding an open shooting lane from a reasonably short distance, and then finding the time and space to direct the ball down that shooting lane, usually requires at least a "half chance."

Renoobulating the Expected Goals Formula

For the past few months, I have been running spreadsheets using an expected goals formula based on shots on target, big chances, and shots in the box. I want to test a different expected goals formula, one based only on shots on target in the box and shots on target outside the box. As a control, I will also use a dummy formula based only on shots on target.

This is the part where we get into the nerdery, and it's going to take a few paragraphs. You can skip on down to the "preliminary expected goals" table if you just want to see the results.

Ok, for the six of you still reading, these are the three formulae for expected goals. SoT are shots on target, SiBoT and SoBoT are shots in or out of the box on target. Lg G/SoT is the league average for goals per shot on target. Lg G/SiBoT you can extrapolate. BC+ and SiB+ are the rates at which teams produce big chances and shots in the box, compared to league average. So if a club has 60 big chances, compared to a league average of 50, their BC+ will be 1.20.

1) SoT * ((.6 * Lg G/SoT) + (.3 * Lg G/SoT * BC+) + (.1 * Lg G/SoT * SiB+))

This was my old formula. I adjust the rate of goals scored per shot on target according to the club's rate of big chances and shots in the box. (I "derived" the 60/30/10 weights by trial and error. I wanted to use round numbers to avoid over-fitting.) As you can see, Big Chances make a big difference here.

2) SiBoT * Lg G/SiBoT + SoBoT * Lg G/SoBoT

This is my new proposed formula. It's so simple. Just divide shots on target between shots in the box and shots outside the box, multiply by league average conversion rates, you're done. Hope it works!

3) SoT * Lg G/SoT

This is basically a control. It's the simplest expected goals formula there is. My new numbers have to be better than this.
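To make the three formulae concrete, here is a small Python sketch. The function and parameter names are mine, not the author's; the 0.35 and 0.12 league rates are the ones quoted earlier in the post, while the league G/SoT of 0.30 and the example club's numbers are assumptions for illustration only.

```python
def xg_old(sot, lg_g_per_sot, bc_plus, sib_plus):
    """Formula 1: league G/SoT adjusted 60/30/10 by BC+ and SiB+."""
    rate = lg_g_per_sot * (0.6 + 0.3 * bc_plus + 0.1 * sib_plus)
    return sot * rate


def xg_new(sibot, sobot, lg_g_per_sibot, lg_g_per_sobot):
    """Formula 2: split shots on target by location, apply league rates."""
    return sibot * lg_g_per_sibot + sobot * lg_g_per_sobot


def xg_control(sot, lg_g_per_sot):
    """Formula 3: every shot on target converts at the league rate."""
    return sot * lg_g_per_sot


# Hypothetical club: 120 SiBoT + 60 SoBoT = 180 SoT, BC+ of 1.20, SiB+ of 1.00.
LG_G_PER_SOT = 0.30  # assumed league average, not taken from the post
print("old:    ", round(xg_old(180, LG_G_PER_SOT, 1.20, 1.00), 2))
print("new:    ", round(xg_new(120, 60, 0.35, 0.12), 2))
print("control:", round(xg_control(180, LG_G_PER_SOT), 2))
```

One property worth noticing: when BC+ and SiB+ are both exactly 1.0 (a league-average club), the weights sum to 1 and formula 1 collapses to the control.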

To test the models, I'm comparing the results of each expected goals formula to the actual goals the actual clubs scored or allowed, using two methods. The first is simple correlation. This will mostly tell me if I put the teams in the right order, if I have a good ranking of the clubs based on the underlying stats. The second is something called the "root mean square error" method. This measures how much my estimates missed the mark with every team, and it particularly punishes big misses. If I'm way off in estimating goals scored or goals conceded for just one or two clubs, the RMSE will have my shirt. This is good, because if I'm going to be using this model for estimating team quality, I don't want to be the idiot out there in November talking about how great Reading are. When in fact Reading are terrible.

So, these are the correlation coefficients and root mean square errors for each of my different estimators, both for attack and defense. A good correlation coefficient is as close to 1.0 as possible. A good RMSE is as close to 0 as possible.
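For anyone who wants to see the two yardsticks in action, here is a minimal Python sketch of Pearson correlation and RMSE. The goal totals below are invented; the point is only that two sets of estimates with the same total absolute error can have very different RMSE when one of them contains a single big miss.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rmse(predicted, actual):
    """Root mean square error: squaring makes big misses count extra."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Invented numbers: both estimators are off by 12 goals in total.
actual_goals = [40, 50, 60, 70]
even_misses = [43, 47, 63, 67]     # off by 3 for every club
one_big_miss = [40, 50, 60, 82]    # perfect except one club, off by 12

for name, est in (("even misses", even_misses), ("one big miss", one_big_miss)):
    print(name, "corr:", round(pearson(est, actual_goals), 3),
          "rmse:", round(rmse(est, actual_goals), 2))
```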

[Image: correlation coefficients and RMSE for each estimator, attack and defense (null_zps9413864b.png)]


Shots in the Box on Target are generally a better predictor of goals scored than Big Chances. They produce expected goals numbers that correlate roughly as well with actual goals as the BC-based expected goals do. But when they miss, they miss by less. I want to run some of these tests using a larger set of seasons, but for now, I'm feeling pretty good about SiBoT.

The final thing I want to offer here is a nice table of preliminary expected goals scored and expected goals conceded for the 2012-2013 Premier League. There are a couple little things I should explain before getting to the table. First, since I found some meaningful correlations week-to-week for defensive G/SoT, I am using actual G/SoT as a small part of my goals conceded formula. I'm not regressing all the way to league average. Second, for penalties, which are obviously converted at a totally separate rate, I'm regressing the number of penalties given halfway to league average and the rate of penalty conversion 100% to league average.
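The penalty adjustment can be sketched as a simple pull toward league average. This is only one reading of "regressing halfway": `regress` is a hypothetical helper, and every number in the example (penalties won, league averages, conversion rates) is an assumption for illustration rather than data from the post.

```python
def regress(observed, league_avg, fraction):
    """Move `observed` the given fraction of the way toward `league_avg`."""
    return observed + fraction * (league_avg - observed)


def expected_penalty_goals(pens_won, pen_conversion, lg_pens, lg_conversion):
    """Penalties regressed halfway, conversion regressed fully to league average."""
    pens = regress(pens_won, lg_pens, 0.5)
    conversion = regress(pen_conversion, lg_conversion, 1.0)
    return pens * conversion


# Hypothetical club: won 8 penalties, scored half of them.
# Assumed league averages: 4 penalties per club, 78% conversion.
print(round(expected_penalty_goals(8, 0.50, 4, 0.78), 2))
```

Regressing conversion 100% means the club's own penalty record is ignored entirely, while the halfway regression on penalties won keeps some credit for clubs that earn more of them.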

Ok, on to the numbers.

Preliminary Expected Goals Table

[Image: preliminary expected goals table, 2012-2013 Premier League (null_zps7d8d9047.png)]


- Yeah, I don't know what the deal with Manchester United is either. No matter what slices of the stats you control for, they converted more of every kind of opportunity. I said the RMSE punished me for large errors, but basically every measure misses on Manchester United by a dozen or more goals so the punishments evened out.

- This method brings Tottenham to roughly equal footing with their London rivals, but it does so by taking the air out of Arsenal and Chelsea's numbers more so than by inflating Spurs'.

- I think it's interesting to compare the three relegated clubs. Reading and QPR were terrible, but Wigan despite their goal difference had quite respectable underlying stats. Not surprising for a club that won a cup title, I guess. Obviously all of these clubs will lose important players, but I think these numbers suggest that Wigan are not a bad bet for a quick re-promotion.

- I say "preliminary" because I'm not taking into account, say, the game state analysis stuff I've been working on. And because it's still just June, there's more time to incorporate more data and renoobulate again. I appreciate any feedback you can offer, and hopefully I can take it into account for my further revisions of these methods.

http://www.cartilagefreecaptain.com...otspur-analysis-epl-expected-goals-statistics
 

eddiev14

SC Supporter
Jan 18, 2005
7,174
19,687
I reckon that fella has watched Moneyball, read Soccernomics and is American...
 

Booney

Well-Known Member
Dec 2, 2004
2,837
3,481
Wow....I need a lie down now. I was just about holding it together until 'Renoobulating the Expected Goals Formula'

I think I'd do a lot more renoobulating if I knew what it was. Sounds fun.

Anyway...heroic effort! Good man.
 

eddiev14

SC Supporter
Jan 18, 2005
7,174
19,687
Like the idea of the thread though. I find the stats, post-match, quite interesting. The Stats Zone app is great!
 

Gedson100

Well-Known Member
Feb 13, 2012
4,487
14,648
The whole stats and football thing does seem to have gone past a tipping point this season.
The rise of whoscored.com has been beneficial but all the TV shows are now using a lot of stats and the blogs have multiplied 10fold recently.
Loads of info around for the stats fans.
 

AUSpur

Well-Known Member
Aug 31, 2010
1,145
1,116
The whole stats and football thing does seem to have gone past a tipping point this season.
The rise of whoscored.com has been beneficial but all the TV shows are now using a lot of stats and the blogs have multiplied 10fold recently.
Loads of info around for the stats fans.


The trick is figuring out which ones are useful and what exactly they're telling us.
 

Gedson100

Well-Known Member
Feb 13, 2012
4,487
14,648
The trick is figuring out which ones are useful and what exactly they're telling us.
Well... yeah... if you are managing a team or gambling!

Otherwise it's generally informative and interesting, but sure: 'Lies, damn lies and statistics'
 

AUSpur

Well-Known Member
Aug 31, 2010
1,145
1,116
Well... yeah... if you are managing a team or gambling!

Otherwise it's generally informative and interesting, but sure: 'Lies, damn lies and statistics'


Or just want a "better" understanding of the game. That's why I like his Shots in the Box on Target stat, for example. If you just stop at shots on target, you don't get the whole story since the shots could be sitters or from your own half of the pitch.
 

stormfly

Well-Known Member
Dec 6, 2006
4,608
12,074
What the hell have I just read!? I don't come on here to be educated I come here for knee jerk reactions! Damn you! I blame Jenas for leaving us, it gives us too much time to think.
 

parklane1

Well-Known Member
May 4, 2012
4,390
4,054
Stats are useful but do not always tell the full story. One of the first stats men was a guy called Charles Hughes, and what a mess he made of English football.
 

AUSpur

Well-Known Member
Aug 31, 2010
1,145
1,116
Stats are useful but do not always tell the full story. One of the first stats men was a guy called Charles Hughes, and what a mess he made of English football.


He's heavily discussed in "The Numbers Game, Why Everything You Know About Soccer Is Wrong" as a cautionary tale of exactly what you said. Stats are great, but just as important is what you do with them/learn from them. He kind of sucked at the latter.

There's a balance to be struck between "stats are for dorks" and "stats tell us everything."
 

parklane1

Well-Known Member
May 4, 2012
4,390
4,054
He's heavily discussed in "The Numbers Game, Why Everything You Know About Soccer Is Wrong" as a cautionary tale of exactly what you said. Stats are great, but just as important is what you do with them/learn from them. He kind of sucked at the latter.

There's a balance to be struck between "stats are for dorks" and "stats tell us everything."

I agree that they can be useful but only as a part of the whole picture. It seems that nowadays (more than ever) some fans get sucked into stats and formations, and treat them as bible.
 

Bus-Conductor

SC Supporter
Oct 19, 2004
39,837
50,713
I agree that they can be useful but only as a part of the whole picture. It seems that nowadays (more than ever) some fans get sucked into stats and formations, and treat them as bible.

As opposed to the decades of fans getting sucked into using their warped inability to use two eyes to assess and correctly analyse the multitude of happenings occurring simultaneously every second during football matches and treat that as their bible you mean.

Stats are rarely the complete picture, but they are rarely pure bollocks, unlike a lot of opinions formed without them.
 

parklane1

Well-Known Member
May 4, 2012
4,390
4,054
As opposed to the decades of fans getting sucked into using their warped inability to use two eyes to assess and correctly analyse the multitude of happenings occurring simultaneously every second during football matches and treat that as their bible you mean.

Stats are rarely the complete picture, but they are rarely pure bollocks, unlike a lot of opinions formed without them.

And if you read my post you would have seen that is more or less what I posted.
 

AUSpur

Well-Known Member
Aug 31, 2010
1,145
1,116
Part I of II:

Here's his post introducing how he'll do his weekly table projections. I'll post the weekly updates (if there's interest in it):

2013-2014 Premier League projections: I need your help
By MCofA on Aug 13 2013

Wanna help out MCofA try to take on the big boys at Bloomberg Sports Proactive Synergies? Wanna read several paragraphs about bivariate Poisson distributions? Well then this is the article for you.

I will again be running weekly power rankings, season projections, and game projections in this space during the whole 2013-2014 English Premier League season. But before I can unveil the initial numbers, I'm going to need your help. You see, I have team ratings for attack and defense based on the underlying stats from the previous season. I have some vaguely useful adjusted numbers for the promoted clubs from the Championship. But there were lots of transfers this summer. And as I've said repeatedly, I don't think the world is anywhere close to having a good player value stat from which you could build team projections, from the micro- to the macro-level.

Me and You vs. Bloomberg Sports Analytics

So in order to account for roster changes, I'm going to be including in my projections a subjective component, crowd-sourced from all of y'all. In the comments -- along with, you know, constructive discussion -- please list your projected EPL tables from first to twentieth. I will use the average commentariat table as a subjective element in my season projections.

I want these numbers because I have decided that in my head we're in competition with Bloomberg. I took a look at their projected EPL standings last week, and I came away uncertain how advanced their under-the-hood methods might be. BSA will apparently be continuing to project the EPL season throughout the year, using a simulator method that appears to be quite similar to mine. According to their Projected Tables FAQ, they will be simulating each match of the season "over 10,000 times." For reasons of synergy and proactivity, they are keeping their methods otherwise hidden from public view.

I am not.

My hope for this season is that we can stage a little competition and see how my numbers, open-sourced and informed by the subjective opinions of the commentariat, do in competition with Bloomberg. The sample of a single season is too small for any result to actually be meaningful, but I think a little bit of competition is fun anyway. Especially when your opponent is a massive corporation that has no idea you exist. Right?

Methodology

I basically do two things to project the EPL season. First, I build team ratings for attack and defense. The primary inputs to these ratings are my "expected goals" calculations. As I discussed earlier this summer, I have re-worked my expected goals formula based on Shots on Target in the Box, and those, along with big chances and (for defense) opponent conversion rate, will be the primary inputs to my expected goals ratings. I will make a small (roughly 15%) adjustment of SiBoT conversion based on big chances, and a somewhat larger (roughly 25%) adjustment of opponent SiBoT conversion based on big chances and real conversion rate.

Now, I can't just start predicting the season results off one game once I get the data. And I need to have some sort of preseason projection. So what I will do is begin the season using mostly my preseason projections, and over the first twenty weeks of the EPL season I will slowly phase them out until my projections are based almost entirely on the in-season data.

My preseason projections have three basic inputs. First is team strength based on expected goals from 2012-2013. Second is regression to the mean. Third is your subjective ratings. These will be weighted roughly 60/10/30. For the promoted teams, I looked at past data to try to see what projects the quality of a promoted team from the Championship to the Premier League. I found that there is basically no relationship between points from the Championship and points from the EPL. There is a small (roughly .25) correlation between goal difference in the Championship and points in the EPL. So for the promoted clubs, I'm using a number based mostly on the average strength of all promoted Championship clubs, with a small adjustment for Championship goal difference.
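The two blending steps described above can be sketched in a few lines of Python. The 60/10/30 weights and the twenty-week window come from the post; treating "regression to the mean" as a 10% weight on the league-average rating, and phasing the preseason number out linearly, are my assumptions, since the exact schedule isn't given. The function names and example ratings are hypothetical.

```python
def preseason_rating(last_season, league_avg, crowd, weights=(0.6, 0.1, 0.3)):
    """Weighted mix of last season's xG rating, league average, and crowd rating."""
    w_last, w_mean, w_crowd = weights
    return w_last * last_season + w_mean * league_avg + w_crowd * crowd


def in_season_blend(preseason, in_season, week, phase_out_weeks=20):
    """Linearly shift weight from the preseason rating to in-season data."""
    w_pre = max(0.0, 1.0 - week / phase_out_weeks)
    return w_pre * preseason + (1.0 - w_pre) * in_season


# Hypothetical attack ratings in expected goals per game.
pre = preseason_rating(last_season=1.5, league_avg=1.0, crowd=1.2)
for week in (0, 5, 10, 20):
    print(week, round(in_season_blend(pre, in_season=1.1, week=week), 3))
```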

So what's changed from last year? I'm using a new and improved expected goals formula, using statistics I previously did not have. I'm using preseason projections as part of the projections for the first half of the season. I have also entirely re-done my projection engine.

One Million Seasons

Bloomberg Sports Proactive Synergies says they will be simulating the Premier League season 10,000 times. Ten thousand seasons isn't cool. You know what's cool? A million seasons. With my new projection algorithm, I can simulate 1,000,000 seasons in about 30 minutes. So I will be doing that. It's true that for the top-line numbers, 10,000 seasons should be enough to get you a projection-to-projection variance of only a point or so. But I like getting that variance down into the low decimals. And more importantly, this large expansion of projected seasons will allow me to do more granular pre-game projections. Late in the season, I can consider the effects of different possible outcomes in multiple games and their effects on different clubs' chances of winning the league, finishing top four, or escaping relegation. With 10,000 simulations, the samples get too small to do this sort of work. With a million, no problem.
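A back-of-the-envelope way to see why the extra simulations matter is the standard error of a simulated probability, sqrt(p(1-p)/N). The Python sketch below applies it to the full run and to a narrow late-season scenario; the assumption that only 1 in 100 simulated seasons matches the scenario is invented for illustration.

```python
import math


def mc_standard_error(p, n_sims):
    """Standard error of a probability p estimated from n_sims simulations."""
    return math.sqrt(p * (1.0 - p) / n_sims)


for n_sims in (10_000, 1_000_000):
    overall = mc_standard_error(0.5, n_sims)
    # Assumed: only 1 in 100 simulated seasons matches the scenario of interest.
    matching = n_sims // 100
    conditional = mc_standard_error(0.5, matching)
    print(f"{n_sims:>9} sims: overall +/-{overall:.4f}, "
          f"scenario ({matching} seasons) +/-{conditional:.4f}")
```

With 10,000 seasons the headline odds are fine, but a 1-in-100 scenario leaves only 100 seasons to work with, which is where the sample gets too thin.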

The other change is to my game simulation engine. Bloomberg doesn't specify what they're using, but I think it's fair to guess that they're using some sort of Poisson-based simulator. You need to simulate goals scored and goals against in every game in order to have goal difference numbers at the end of the year to break ties. It is reasonably well established in the academic literature that a random sampling of the Poisson distribution simulates goals scored in football matches to a reasonable level of confidence.

However, there is one small problem with random sampling from Poisson. It underestimates draws. Football managers and players, apparently, have a small tendency to play for a draw, and this must be accounted for in simulating matches. To solve this, I'm using a sampling from a bi-variate Poisson distribution, as suggested by Dimitris Karlis and Ioannis Ntzoufras. (Karlis and Ntzoufras, "Analysis of sports data by using bivariate Poisson models," The Statistician 52 (2003), 381-393.) With a bi-variate Poisson distribution based on an in-game goals scored correlation of about .15, I can simulate team outcomes and not underestimate draws.
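For the curious, here is a minimal Python sketch of one standard way to sample a bivariate Poisson: give both teams a shared Poisson component on top of their own independent ones. This is not the author's engine. The expected goals of 1.5 and 1.2 are assumed values, the function names are mine, and turning the .15 correlation into a shared rate of rho * sqrt(mu_home * mu_away) is my reading of how that figure would be used.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def split_rates(mu_home, mu_away, rho):
    """Turn target means and correlation into (own_home, own_away, shared) rates."""
    shared = rho * math.sqrt(mu_home * mu_away)
    if shared > min(mu_home, mu_away):
        raise ValueError("rho is too large for these expected goals")
    return mu_home - shared, mu_away - shared, shared


def simulate_match(mu_home, mu_away, rho, rng):
    """One scoreline: each side's own goals plus a shared component."""
    lam_home, lam_away, lam_shared = split_rates(mu_home, mu_away, rho)
    shared = poisson_sample(lam_shared, rng)
    home = poisson_sample(lam_home, rng) + shared
    away = poisson_sample(lam_away, rng) + shared
    return home, away


def draw_rate(mu_home, mu_away, rho, n_matches, seed=0):
    """Share of simulated matches that end level."""
    rng = random.Random(seed)
    draws = 0
    for _ in range(n_matches):
        home, away = simulate_match(mu_home, mu_away, rho, rng)
        draws += home == away
    return draws / n_matches


print("independent Poisson draw rate:", draw_rate(1.5, 1.2, 0.0, 20_000))
print("bivariate (rho=.15) draw rate:", draw_rate(1.5, 1.2, 0.15, 20_000))
```

Because the shared component adds the same number of goals to both sides, it leaves each team's expected goals unchanged while making level scorelines more likely.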

So what's new here is a 100x increase in the number of simulations and a new game simulation formula that does not underestimate draws.

With your help, I hope to unveil the initial team ratings and expected table either tomorrow or Thursday.
 

AUSpur

Well-Known Member
Aug 31, 2010
1,145
1,116
Part II of II:

Projected Premier League table and title odds
By MCofA on Aug 14 2013

I have incorporated your subjective team rankings, and I can unveil the preseason English Premier League table projection and odds for important events including title winning, top four finish and relegation. I've even got Europa odds!

Yesterday I laid out my method and asked for your help. Today I've got the projections. I combined your subjective rankings (a little over 60 in total) with regressed, expected-goals-based power rankings from last season to create team quality projections. I then simulated a million seasons. Now I have the results.

Premier League Projected Table and Odds

[Image: projected Premier League table and odds (null_zps4ee708ca.png)]


So what have we got here? First, I think it's much more instructive to look at the "%" columns than the overall table. Spurs do project 4th ahead of Arsenal—Tottenham had slightly better underlying numbers last year—but the difference in percentage chance of Top 4 is really rather negligible. 55% vs 52%. That's a dead heat. Likewise, there isn't much of significance in the gap between Chelsea and United in second and third, and really the run from Chelsea down to Liverpool is pretty close to a tie. There's no way for a projection engine to be accurate past a delta of 3-4 points per team, and that's enough to cover the gap.

This is perhaps particularly important when looking at the middle of the table. Yes, maybe it's sort of odd that Newcastle stand in 10th place after their weak season, but the difference between 10th place Newcastle and 14th place West Ham is too small to draw any conclusions from. Newcastle have a 10% chance of relegation by these numbers, West Ham 15%.

Let's Talk Bloomberg

At this point preseason, I think everyone's mostly guessing, me and the big boys at Bloomberg. The variance in any season is huge, and putting a percentage point number on outcomes is an exercise in hubris. I have tried to limit the appearance of confidence in these numbers by rounding everything to whole numbers. Bloomberg has emphasized their proactive synergy by displaying results down to the decimal point, as if somehow that extra .1% chance of Chelsea winning the Premier League was a meaningful figure. In general, if someone hands you a projection, ignore everything after the decimal point. That's just rhetorical, an attempt to convince you that the person who handed you the projection is smart and knows about numbers. It isn't actually information.

But in any case, most differences between our projections and real outcomes can probably be attributed to a combination of random chance and the infinite number of complicated football facts that we did not and perhaps could not account for before the fact.

Nonetheless, I think there are some interesting differences. I'm sure you noticed the first one, which is Chelsea. City and Chelsea are most people's pick to win the EPL this year, but I have a pretty big gap between Chelsea and City. This is for a simple reason. Last year, City had the best underlying stats in the Premier League—slightly better than league champions United—while Chelsea's underlying numbers were unimpressive. They had fewer shots on target in the box than either Spurs or Arsenal, and allowed many more. Chelsea did do an excellent job preventing big chances, but even with that added in, I have them overall as a worse attacking and defensive side than the 4th/5th place finishers. Because a significant part of my projections are the club's stats from the previous year, Chelsea's excellent subjective placement in the table is only enough to lift them to the head of the trailing group, not up to even with Manchester City.

The other thing I see, looking at these tables, is that my numbers rate the league as significantly more wide open. Bloomberg thinks that Chelsea, City and United are unstoppable forces headed for huge nearly +50 goal difference numbers. No club last year had a goal difference as high as the ones Bloomberg projects for City, United and Chelsea. So they have those three dominating the title race (combined odds about 90%) and the top four positions (average Top% likewise 90%). My numbers are a little bit more flattened, perhaps in part because of regression to the mean. I also have Tottenham, Arsenal, and Liverpool as capable possible competitors for at least Chelsea and United, which makes for a notably wide-open race for the top four positions and creates the possibility of an upset title winner.

I'm not sure I necessarily like my numbers better. City and Chelsea really did spend ludicrous amounts of money buying great players and managers. I'm not sure they should have a 15% or 35% chance of missing the top four. At the same time, I think a reminder that one should have a little bit of skepticism toward Chelsea is worthwhile. They were good last year, not great. They've made good additions, but haven't City, who were better to begin with, done at least as well? Obviously there's the Mourinho factor. My guess is that Bloomberg's numbers are driven by some sort of Mourinho adjustment, and I'm not making any special allowances beyond the subjective factor included from your tables. We shall see.

Indeed, with all of this, we shall see. Aesthetically, I like the open season projection a little better, so I'll go with it.

In conclusion, I think it is likely that one or both of Hull City and Crystal Palace are terrible.
 