Guess Some New MLS Salaries, Win a Prize

As is often the case in MLS (and other soccer leagues) this time of year, its clubs have been introducing quite a few new players. What’s unique to MLS right now is that the most famous ones are committing to clubs that won’t play any official matches until next March. Yesterday, Frank Lampard signed with New York City FC, joining David Villa as the faces of that club heading into their expansion season next year. Kaka joined MLS’ other incoming club, Orlando City, weeks before. Villa and Kaka are going on loan elsewhere until they have a US-based squad to play with, and it’s unclear if Lampard will do the same, or just rest up for a while.

Meanwhile, existing MLS clubs have been bringing in Americans (DaMarcus Beasley) and foreign players that each have somewhat lower profiles. Alongside these signings, some MLS players have also been getting extensions and raises, most notably Sporting Kansas City turning Matt Besler and Graham Zusi into designated players.

If you’re anxious to hear about the contest I mentioned in the title, here’s where I get to that. Whoever does the best job of guessing the salaries of the players below will get a prize. Type in your estimates of each player’s guaranteed compensation (annual wages, not weekly, as is usually reported in European leagues), alongside your name and either your email address or Twitter handle. I will shut off the poll on July 31st, or when the MLS Players Union (MLSPU) updates their listing of salaries, whichever comes first.

(the form doesn’t like commas or dollar signs, so you would type $1,000,000 as “1000000”)

After the MLSPU makes their newest salary release, which happened on the first of August in both 2012 and 2013, I’ll compare the guesses to the listed guaranteed compensation. Guessers will get a point for each percentage point above or below the actual wage for each player. Whoever ends up with the lowest point total wins. For example, if you guess $3 million and it turns out that the player makes $4 million, you get 25 points, because $3mm is 75% of $4mm, and 100-75=25. I won’t make my own guesses public, so as to avoid swaying the vote, but I’ll hold some general discussions on the topic on Twitter (@StatHunting).
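For anyone who wants to check their own math, here is a minimal Python sketch of that scoring rule. The function names are my own for illustration, and it assumes that a guess above the actual salary is penalized by the same percentage of the actual figure as a guess below it.

```python
def guess_points(guess: float, actual: float) -> float:
    """Return how many percentage points a guess misses the actual salary by."""
    if actual <= 0:
        raise ValueError("actual salary must be positive")
    # Distance from the real figure, expressed as a percentage of the real figure.
    return abs(guess - actual) * 100 / actual


def total_points(guesses: dict[str, float], actuals: dict[str, float]) -> float:
    """Sum the points over every player in the contest; the lowest total wins."""
    return sum(guess_points(guesses[name], actuals[name]) for name in actuals)


# The example from the text: a $3 million guess for a $4 million salary.
print(guess_points(3_000_000, 4_000_000))  # 25.0
```

Under this assumption a $5 million guess for the same player would also cost 25 points.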

The Prize

I will tailor the prize to the winner. I don’t have the budget to offer something that anyone would want, but I’ll talk to the winner via Twitter or email and see what they are into. It will probably end up being a soccer scarf, DVD, or something along those lines. I also have an autographed picture of Kenny Cooper circa 2008 that I’d happily give to someone who would use it for something other than a dartboard.

So, I don’t know what the prize will be, but I promise to do everything I can to make it fun. Given my lack of a budget, I might have to get pretty creative if a European wins this thing, but we’ll see.

The Reveal

Every time the MLSPU releases salary data, I visualize it. This time, that post will include a wisdom of crowds visualization with all the guesses I received compared to the actual listed figure for each player, as well as an announcement of the winner. Here’s what my most recent MLS salary visualization looked like, which may help some of you make an educated guess for some of these players:

I’ll make a couple tweaks when I chart the next release, because there are a couple parts of my approach above that I’m not entirely content with. Also, anytime the MLSPU salary release is discussed, it has to be mentioned that this data is often in some level of dispute, as coaches, owners, and technical directors have publicly stated that there are inaccuracies there. However, it’s the only information we have to go off of, and this seemed like a fun way to see how well people can guess these players’ wages.

A Guided, Interactive Tour of FIFA World Cup History

After this year’s World Cup ended, I went back to my History of the World Cup visualization and added in the 2014 data. Not content with just a data enhancement, I have also decided to break World Cup history into storypoints (a really interesting new feature in Tableau) for each era, so that readers could more easily see the record of various nations over the shorter periods of time.

What follows should speak for itself, and every era’s page is fully interactive so that those who would like to explore a specific nation’s performance can still do so by clicking or hovering, and all filters are active.

What is most striking to me here is the way that focusing on eras can provide shape that the overall record of teams does not offer. We almost take it as fact that Brazil and Germany have always been world conquerors, but they both had dry spells where they were not one of the best sides. Brazilian dominance in the 1950-1962 and 1982-1994 eras drives their standing as the slim all-time leader in points and goal difference per game, while few have ever been as good as Germany have been during the current 32-participant era.

Please explore the data yourself, but beware of low-sample outliers within eras. For example, while Eusebio-led Portugal had a stellar record in 1966, it was that nation’s only appearance in the era, so their standing atop the goal difference per game table should not imply that they were the top team of the era. I hope that what I’m offering here makes it much easier for people to meaningfully explore the history of this tournament.

This article serves as my entry for Tableau’s Storytelling Viz Contest. If you like this, please vote for me by tweeting #StorytellingSteve today, as the number of hashtags drives a $500 prize. Here, I’ll make it easy for you:

The Story of the 2014 World Cup through Google Doodle

Google’s Doodles, the animations and games that adorn the search giant’s home page, blew past their own ridiculously high standards during the last month. That’s because they have produced 63 light, fun doodles dedicated to the World Cup. Google even sent members of their Doodle team to Brazil so they could better capture the spirit of the tournament. This led to a ton of fun little cultural references and on-the-ground imagery in many of the images. These Doodles deserve preservation beyond their relatively dry Doodle archive page, where small versions are listed out of order and without notes on which games they referenced.

Whether you choose to scroll through all of the doodles or use your browser’s find feature to search for particular teams or dates, I hope you enjoy.

Thursday, June 12, 2014

Opening Ceremonies

Assessing the World Cup Field Using Goals, Shots, and Expected Goals

Days without fixtures in the middle of the World Cup can be disarming, and especially so after the relentlessly entertaining start to this year’s tournament. Whatever time zone you are in, a certain hour hits and you instinctively flip on the TV or a streaming service, check Google just to see their match-specific doodle, or look for the latest Cup chatter on Twitter.

For the second day in a row, today you won’t find any matches, and that’s a good time to sit back and assess teams’ accomplishments thus far in the tournament. To mark the lack of occasion, I’ve taken non-PK goal, shot, and expected goal data (gathered by the excellent Colin Trainor) and broken them down to per 90 minute figures (extra time would skew per game results). We’ll get a good look at the whole field, and focus on the United States, which along with being my home country and the team I follow closest, has some unique attributes.
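To make the per 90 adjustment concrete, here is a small Python sketch. The numbers are made up purely for illustration (they are not taken from Colin’s data), and the minutes ignore stoppage time.

```python
def per_90(total: float, minutes_played: float) -> float:
    """Scale a raw count (goals, shots, expected goals) to a per 90 minute rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total * 90 / minutes_played


# Hypothetical team: 6 goals across three 90-minute matches
# plus one 120-minute match that went to extra time.
goals = 6
minutes = 3 * 90 + 120  # 390 minutes, ignoring stoppage time

print(goals / 4)               # per game: 1.5
print(per_90(goals, minutes))  # per 90 minutes: roughly 1.38
```

The per game figure treats the extra 30 minutes as free scoring time, while the per 90 figure accounts for it.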

It is natural to start with goals, excluding penalty kicks and own goals, which aren’t very likely to recur. After all, outscoring opponents is the whole point of the game. The USA falls just below average, tied for 17th in the field of 32, as they finished with a -1 goal differential, while they and Algeria played more minutes than the other minus ones. However, there are issues with trying to analyze teams using only goals for and against. Performances from Switzerland, Ecuador, Croatia, and Côte d’Ivoire have not been of equal quality, but if you’re forced to look at them only in terms of rare scoring events, they are all tied with goal differentials of zero. Over a small sample, and everything in a tournament like the World Cup is a small sample, it is very difficult to say much based only on goals, which are driven by fixture difficulty, game states, and other factors that some would call luck.

One rule of thumb in statistics is that if you need to track an important but rare event, you’re often better off focusing on the more common events that lead to it. With that in mind, we step back from goals to use shots (again, without PKs) for that wider focus.

Now we’ve at least got some clearer differentiation between teams. So where is the good old USA? Oh, that’s kind of embarrassing. Klinsmann’s crew allowed 94 shots, worst in the field. Alongside their taking only the 27th-most shots per 90 minutes, that massive level of shot absorption slams them to the bottom of this shot differential list. Remember what we said about the importance of fixture difficulty, though? Belgium, Ghana, and Germany all sit in the top six here, and Portugal are 13th. The USA played against four very shot-happy teams. Surely the US’ opponents reached these levels in part because the United States was so permissive of en masse shooting, but they also sit relatively high here because of performances in their other matches.

Now for an even deeper cut we go to expected goals (again, invaluably provided by Colin Trainor). For the uninitiated, expected goals are an adjustment of shots based primarily on location. Different analysts approach expected goals in slightly different ways, and you can read about the specifics of Colin’s work with Constantinos Chappas on this metric here. Basically, shot distance, angle, and body part striking the ball make a big difference in the likelihood of that shot scoring, and expected goals reflect this.
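As a rough illustration of the idea (and not a reproduction of Colin and Constantinos’ model), expected goals can be thought of as the sum of each shot’s estimated chance of scoring. In this Python sketch the per-shot probabilities are hypothetical inputs; in practice they would come from a model built on distance, angle, body part, and so on.

```python
def expected_goals(shot_probabilities):
    """Add up per-shot scoring probabilities to get an expected goals total."""
    total = 0.0
    for probability in shot_probabilities:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("each shot probability must be between 0 and 1")
        total += probability
    return total


# Hypothetical shots: a long-range effort, a header from a tight angle,
# and a close-range chance in front of goal.
shots = [0.05, 0.10, 0.35]
print(round(expected_goals(shots), 2))  # 0.5
```

Three shots count as three in a shot table, but here they add up to only about half a goal, which is the kind of distinction the metric is meant to capture.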

While this only lifts the USA up to 27th, keep in mind just how far behind they were on unadjusted shots. Now we see them pass the truly weaker parts of the field in Honduras, Cameroon, and Korea, alongside some more impressive teams like Chile and Nigeria. All told, the offense appears to be a little below average in this field, and the defense allowed 1.47 fewer goals than the expected goal metric would project. While some would chalk that up entirely to Tim Howard, there are other positives here for the defense. While some portray Howard’s work in this tournament as herculean, if I had to choose a Greek mythology descriptor, I’d go with Sisyphean.

While Howard certainly had some very highlight-worthy saves, for the most part his teammates guided opponents into difficult shooting angles, and goaded them into attempts from distances unlikely to bother a keeper of Howard’s quality. The overall image for me is of consistency and persistence, pushing back a boulder’s worth of chances, not monster-conquering, unbelievable feats of agility. As another way to compare shots and expected goals allowed, here’s a scatter plot of the two:

Note the distance between the trendline and the United States here. This again lays out that the Americans’ average shots allowed were markedly less promising than those allowed by most defenses in Brazil. Brazil, France, Italy, and Colombia have been similar in this respect (much better company than Cameroon, Honduras, and Spain, who swung most strongly in the other direction), though those powers have also allowed fewer shots overall. This is a nuance that gets bypassed when pundits or statisticians focus only on shots, scorelines, and levels of bracket invasion for the USA.

None of this is to say that the USA were great in this tournament, or that they should still be in Brazil, preparing for Argentina right now. They weren’t, and they shouldn’t be, but they also weren’t as bad as some are arguing right now. Yes, they were often outplayed by their opponents, and while we should rightly praise Howard for his contributions, some credit goes to US defensive schemes and players, who mostly kept the likes of Cristiano Ronaldo, Thomas Muller, Asamoah Gyan, and Eden Hazard out of the prime real estate directly in front of Howard.

Whether it was a conscious choice or not, over the last couple weeks Jurgen Klinsmann’s team allowed shot quantity and limited quality. While that’s no recipe for overwhelming success, it did keep them in games, and gives them a platform to grow from. Stay stingy on quality while keeping opponents out of the final third more regularly, and this team could really be onto something. Of course that’s overwhelmingly easier to say than do, but ascending the ranks is supposed to be hard.

A Widget for Efficiently-Spaced World Cup Tables

When I was a more casual fan, World Cup group tables annoyed me. This pinnacle of the beautiful game was the only sports competition I knew of in which standings could not be viewed in total without scrolling through a website. This was particularly problematic if I wanted to look up a particular nation but had not memorized their group assignment. This year, I’ve designed an alternative. Above you’ll see the result. I’ve packed flags of every World Cup nation into images for each group, and when you hover or click on a particular group, you get a view of games played, goal differential, and points (presented in a bar graph as well as a number). The point here is ease of access to information. Anyone who is familiar with a nation’s flag will find its standing in the space of a second. The widget is small enough (320×100 pixels) that it can be placed anywhere on any website, using the following embed code:

<script type='text/javascript' src='http://public.tableausoftware.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 324px; height: 169px;'><noscript><a href='#'><img alt='Black ' src='http:&#47;&#47;public.tableausoftware.com&#47;static&#47;images&#47;Wo&#47;WorldCupTableWidget&#47;Black&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' width='324' height='169' style='display:none;'><param name='host_url' value='http%3A%2F%2Fpublic.tableausoftware.com%2F' /> <param name='site_root' value='' /><param name='name' value='WorldCupTableWidget&#47;Black' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='http:&#47;&#47;public.tableausoftware.com&#47;static&#47;images&#47;Wo&#47;WorldCupTableWidget&#47;Black&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div><div style='width:324px;height:22px;padding:0px 10px 0px 0px;color:black;font:normal 8pt verdana,helvetica,arial,sans-serif;'><div style='float:right; padding-right:8px;'><a href='http://www.tableausoftware.com/public/about-tableau-products?ref=http://public.tableausoftware.com/views/WorldCupTableWidget/Black' target='_blank'>Learn About Tableau</a></div></div>

If you have an html portal for a website that will be covering the World Cup, please consider pasting the above somewhere on the main page so that your readers will be able to see whatever table they want very quickly.

Enormous thank yous go out to Colin Trainor and Jerry Tweedy, who are helping me keep this as up to date as possible throughout the group stage. My wife and I just had a child yesterday, which doesn’t bode well for the likelihood that I would be able to update results in a timely fashion on my own.

Learning to Fish in All-Time World Cup Data

I have been drafted to write for soccer.fusion.net through the World Cup. Yesterday, I posted a summary of all-time World Cup records on that site. That article does the fishing for the reader, laying out this year’s World Cup field in terms of their points per 38 matches (as a proxy for a league table), and talks about a few noteworthy stats for some of the teams.
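For the curious, the points per 38 matches conversion is simple arithmetic, sketched here in Python. The function name and the sample record are my own for illustration, and the sketch assumes three points for a win and one for a draw.

```python
def points_per_38(wins: int, draws: int, losses: int, points_per_win: int = 3) -> float:
    """Scale a team's points haul to the length of a 38-match league season."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("team has not played any matches")
    points = wins * points_per_win + draws
    return points * 38 / games


# Hypothetical record: 10 wins, 5 draws, 5 losses across 20 World Cup matches.
print(points_per_38(10, 5, 5))  # 66.5
```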

Whenever I post something on Fusion, I will also mention it in this space, and when appropriate spin off a little something that will often be more interactive than my writing for Fusion. Today I’ll share a dashboard that allows users to use lots of filters to look at an historical record of World Cup data from any angle they like.

Give a man a fish, and you feed him for a day; show him how to catch fish, and you feed him for a lifetime.

- Anne Ritchie

For the most part, people receive a statistic or an analysis as a single item presented to them. You’re watching a game and the announcer will read a graphic on the screen that details how often the winning team holds off their opponents when ahead with x minutes left, or a writer will mention a particular fact that supports their current point; that sort of thing. While I am often a simple datafish provider myself on this site, sometimes I try to build enough interactivity into my visualizations that readers can fish for themselves. Consider the following breakdown of all-time World Cup records by country as a flexible fishing rod that lets you explore the data however you see fit.

Dashboard now includes 2014 results, but text of the article, written pre-World-Cup, is unchanged

The above Dashboard and its filters allow you to look through World Cup history via ten measures for each country, and you can filter the whole thing based on years, games played (you can move it below the default minimum of four), confederation, and 2014 World Cup groups (with a “not in 2014 WC” category if you’d like to focus on or exclude that segment). You can also filter by country simply by selecting bar graphs of particular nations.

Also, hover over any country (tapping should work on mobile browsers, too) in either the map or the bar graph and you’ll get an overall summary for them that looks like this:

This summary changes if you use the years filter to focus on a particular range of World Cups. For example, if you’re looking at years only from 1990 forward, the above Germany listing starts like this:

The map isn’t pinned to any particular geographic bounds, so it will zoom in and out depending on your use of the filters. Or you can click in the upper left of the map to draw a rectangle around an area you would like to focus on.

Meanwhile, the bar graph is always sorted by the measure currently chosen in the top right, and the map is color-coded by the same. Want to know who has had the most or least ejections per match across their history? Just click “Titles” in the top right and change it to “Red Cards Per Match,” and you’ll see that 17 nations have played at least 4 matches without a single ejection, and you can scroll down to see that Australia most commonly sees red, with 4 across their 10 matches.

Data Source, Methods, and Clarifications:

The data driving this comes from worldcup-history.com, which is the only resource I could find online that breaks down not only nations’ World Cup totals, but their stats from individual World Cups as well. I simply used the site’s filters to drive team totals from each Cup individually into a spreadsheet, then used Tableau Public to visualize the aggregations of each side’s data for selected years.

All stats, except top three finishes, are listed per game, because matches played totals vary so much that talk of raw World Cup totals for, say, Germany and Portugal can be as tricky as analyzing unaggregated data for Los Angeles and San Jose, California. When one group’s sample is roughly four times the size, comparing them without digging at least a little deeper is fraught with pitfalls.
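I did all of this in a spreadsheet and Tableau Public, but for readers who think in code, here is a rough Python sketch of the same idea: one row per team per World Cup, filtered to a range of years, summed, and then divided by games played. The rows, team names, and field names are hypothetical placeholders, not figures from worldcup-history.com.

```python
from collections import defaultdict


def per_game_summary(rows, first_year, last_year):
    """Sum each team's per-Cup totals over a year range and convert to per-game rates."""
    totals = defaultdict(lambda: {"games": 0, "goals_for": 0, "goals_against": 0})
    for row in rows:
        if first_year <= row["year"] <= last_year:
            team_totals = totals[row["team"]]
            for key in ("games", "goals_for", "goals_against"):
                team_totals[key] += row[key]

    summary = {}
    for team, t in totals.items():
        if t["games"] == 0:
            continue  # nothing to divide by for this team
        summary[team] = {
            "games": t["games"],
            "goal_diff_per_game": (t["goals_for"] - t["goals_against"]) / t["games"],
        }
    return summary


# Made-up rows, one per team per World Cup.
rows = [
    {"team": "Team A", "year": 2006, "games": 7, "goals_for": 12, "goals_against": 5},
    {"team": "Team A", "year": 2010, "games": 3, "goals_for": 2, "goals_against": 4},
    {"team": "Team B", "year": 2010, "games": 4, "goals_for": 5, "goals_against": 5},
]

print(per_game_summary(rows, 2006, 2010)["Team A"])
```

Narrowing the year range changes both the totals and the per-game rates, which is what the years filter in the dashboard does.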

Of note, worldcup-history.com seems to count wins and losses from penalty shootouts, differing from FIFA official records, which mark them as draws for both sides. Because I felt my fusion.net piece should be more official, it was driven by stats on the FIFA website. I built the visualization above before recognizing the PK W/L issue, but I stand by it as a fun way to explore the data.

In building the visualization, I used this neat Tableau trick to allow my audience to choose which measure would drive the graph and map. Then I chose a red-white-yellow color scheme for the map, but set the mid-point as zero to ensure that only goal differential would show both colors. I also set the map to display negative values for goals allowed, red cards, and own goals, because red made the most sense to me for those negative attributes. To include gold, silver, and bronze medals in countries’ details, I tweaked Andy Cotgreave’s clever method for bar charts in a tooltip, using the character “●,” and some very large font sizes.

It is also worth noting that because Tableau maps by country, it aggregated the United Kingdom as one team there, but England, Scotland, Wales, and Northern Ireland are listed separately in the bar chart.

Anyway, I hope you enjoy this visual as much as I enjoyed making it, as I think it’s a fun way to explore World Cup history. But, be warned that one World Cup does a pretty poor job of predicting the next one. Four years is a lot of time, during which players, coaches, prevailing tactics, etc. tend to shift quite a lot. Four years ago people trumpeted that only seven countries had ever won a Cup, but Spain moved that up to eight in South Africa. I’m not predicting a brand new champion, mind you, just warning that a nation’s pedigree won’t buy a great deal on the field in Brazil this month, and particularly not in the knockout rounds.

Educated Guesses, Analytics, and Tactics of United States World Cup Roster Decisions

Jurgen Klinsmann plans to submit his 30-man provisional United States World Cup roster to FIFA tomorrow afternoon. These 30 players will go to a national team camp, and by June 2nd, Klinsmann will have to cut that roster down to the 23 who will travel to Brazil shortly thereafter.

Who will be on that plane to Brazil? Fortunately, many well-informed journalists in this country have posted their expert opinions lately, and their combined picks for the 23-man roster seem telling. The following visualization comes from adding up the picks of Steven Goff, Matthew Doyle, Matthew Tomaszewicz, Brian Sciaretta, Liviu Bird, Jeff Carlisle, Doug McIntyre, Adrian Melville, Alex Labidou, and Jason Davis.

These media projections go a long way toward identifying the players who are, barring injury, probably locked into a roster spot already. 18 players appeared on every single list, but Clarence Goodson seems markedly less important than the other 17. Regardless of opponent or formation, there is little reason to believe that Goodson will start a World Cup match unless Matt Besler or Omar Gonzalez is injured at the time. All of the other consensus picks have a pre-defined role on the field that does not presuppose ailments elsewhere on the roster. Given that standing, I can’t count Goodson as a certainty for the roster, but it does seem very likely that only injuries could leave the other 17 off the squad.

This leaves only six remaining World Cup spots. Goodson, Julian Green, Michael Parkhurst, Brad Evans, Chris Wondolowski, Mix Diskerud, and Maurice Edu all show up on at least half of the journalists’ 23-man rosters. Add them in and you’re at 24, so at least one of them will miss out, but I would expect that each of them will at least get the chance to prove themselves during the 30-man camp.

To evaluate the rest of the US player pool, I’m going to lean on GoalImpact, a very advanced plus-minus metric, alongside my own qualitative observations. Jorg Seidel, the creator and curator of this analysis, takes goal differentials while a player is on the pitch and heavily adjusts them for home field advantage, an aging curve, red cards, and, most crucially, strength of both teammates and opponents. You may recall that my article on Julian Green’s commitment to the United States team was centered on GoalImpact.

GoalImpact of the United States Player Pool

Thankfully, Jorg offered GoalImpact summaries of World Cup nations’ players to any bloggers that asked (all entries catalogued here), and I was the first to volunteer for the United States. Here are the top GoalImpact performers in the USA pool, alongside others I know to be in the mix, with the unanimous 18 filtered out:

Hover over any player’s name and click “GoalImpact Graph” to see Jorg’s graph of their careers by GoalImpact. Also, move the filter up to 10 in order to see the full list of players.

Overall, I’m a fan of what GoalImpact is doing here, but I feel that the metric’s strength lies in identifying general tiers of players, not in settling specific hierarchies. It is my feeling that no single metric will ever form an inarguable ordering of soccer players, though, because there are far too many variations in formations, tactics, and quirks of particular players’ strengths and weaknesses for any single measure to pick up on. However, I am comfortable saying that a high GI score would generally mean that a national team coach should be paying attention to that player, and it does seem that in this case, across the board, Klinsmann and his staff have managed to do just that.

The only player on the above list who hasn’t received some level of attention from Jurgen Klinsmann is Brad Friedel, and that’s because the keeper retired from international play in 2005 (but has offered to unretire in the case of an injury crisis). George John, Seth Sinovic, and Chance Myers are the only uncapped players listed, but the latter two attended a national team camp in January, while John has been invited to USA (and Greek, he’s dual-national) camps, but club opportunities and injuries (chronic injury is a notable blindspot of GoalImpact) always got in the way of his attending.

It has to be noted, though, that it is decidedly odd to see that the highest-rated outfield player in the pool is not Michael Bradley, Landon Donovan, or Clint Dempsey, but… wait, Sacha Kljestan? Indeed, Tim Howard’s GI score is the only one higher than Kljestan’s. No question, this seems weird, but we should keep in mind that Sacha is the highest profile US player whose league matches are not televised in the States. In his years with Anderlecht in Belgium, he has played well over 100 fixtures across all competitions with the club, and the only ones that get broadcast in America are their annual sampling of Europa or Champions League matches. His performances represent a significant blindspot for USMNT fans, and for surely quite a few journalists as well. This isn’t to say that we should take it on faith that he is the best US player by any stretch, but according to GI, Anderlecht thrives when Kljestan is on the pitch, and that has real value. This also goes a ways toward explaining his 46 caps under Klinsmann and Bob Bradley, who surely have kept better tabs on Kljestan’s career than most fans are capable of.

Back to the 23

Taking a closer look at the 24 players who made at least half of journalists’ projected World Cup roster lists, these players represent three keepers, six centerbacks, two leftbacks, four rightbacks, five defensive midfielders, seven midfielders, and six forwards, counting players capable of filling in at multiple positions multiple times.

As far as I’m concerned, the number of defensive options among that group is highly redundant. Defenders are the most likely position to go to a World Cup and never leave the bench, as generally subbing in a defender has far fewer tactical implications than does inserting a midfielder or forward. In almost any league or tournament defenders are the least popular substitutions, so why not take an extra attacker or two instead?

Given all of the above, where do I see this roster ending up? For my picks I’m treating all unanimous journalist selections (in bold) except for Clarence Goodson as givens, and piling on players that provide Klinsmann with interesting options upfield.

Keepers (3): Tim Howard, Brad Guzan, Nick Rimando

Defenders (6): Omar Gonzalez, Matt Besler, Geoff Cameron, DaMarcus Beasley, Fabian Johnson, Michael Parkhurst

Midfielders (9): Alejandro Bedoya, Graham Zusi, Jermaine Jones, Kyle Beckerman, Landon Donovan, Michael Bradley, Mix Diskerud, Maurice Edu, Sacha Kljestan

Forwards (5): Aron Johannsson, Clint Dempsey, Jozy Altidore, Chris Wondolowski, Julian Green

Because Cameron and Parkhurst can cover multiple positions on the backline, and Bedoya or Edu could play as defenders in an emergency, I would have no problem bringing only 6 nominal defenders. Nicely, a plethora of options further afield all offer greater tactical advantages than would a seventh defender, such as Goodson or Brad Evans. If I were to entertain bringing a seventh, it might actually be John Anthony Brooks or Oguchi Onyewu because this squad seems to possibly have need of a defender who is strong in the air.

This roster’s attack can shift between some combination of Dempsey’s peskiness, Donovan’s smarts, Johannsson’s movement, Altidore’s holdup play, Diskerud’s creativity, Green’s speed, Wondolowski’s positioning, Zusi’s passing, and Bedoya’s steadiness. While none of those players are world class on their own, their strengths together provide valuable flexibility.

For comparison, here are the top 23 vote-getters from journalists:

Keepers (3): Tim Howard, Brad Guzan, Nick Rimando

Defenders (8): Omar Gonzalez, Matt Besler, Geoff Cameron, DaMarcus Beasley, Fabian Johnson, Clarence Goodson, Michael Parkhurst, Brad Evans

Midfielders (7): Alejandro Bedoya, Graham Zusi, Jermaine Jones, Kyle Beckerman, Landon Donovan, Michael Bradley, Mix Diskerud

Forwards (5): Aron Johannsson, Clint Dempsey, Jozy Altidore, Julian Green, Chris Wondolowski

And the top 23 GoalImpact players (excluding the retired Brad Friedel, and capping defenders at eight):

Keepers (3): Tim Howard, Brad Guzan, Nick Rimando

Defenders (8): Omar Gonzalez, Matt Besler, Clarence Goodson, Michael Parkhurst, Tony Beltran, Seth Sinovic, John Anthony Brooks, Chance Myers

Midfielders (8): Alejandro Bedoya, Graham Zusi, Kyle Beckerman, Landon Donovan, Michael Bradley, Sacha Kljestan, Maurice Edu, Brek Shea

Forwards (4): Jozy Altidore, Chris Wondolowski, Mike Magee, Terrence Boyd

Obviously, I favor my own roster choices, and the GoalImpact list is too lacking in interior tactical knowledge to land anywhere but third, but the metric does provide interesting information that challenges preconceptions of this player pool. Klinsmann has tough decisions ahead of him, and some of his roster decisions are surely dependent to some extent on what he sees in camp over the next couple weeks.

Why I Use Tableau, and a Disappearing Instructions Trick for Dashboarding

For the most part, I shy away from walking my readers through every detailed step of my data analysis and visualization process. I want to maximize the likelihood that readers will come across substantive insights rather than staid process even if they only skim my articles. This is also linked to my general goal of limiting word counts and letting interactive data visualizations in Tableau do a lot of the heavy lifting.

Note: if you are already familiar with Tableau, feel free to skip down to the section titled “My Contribution to Tableau Tips” for a trick involving disappearing instructions.

While I think these guidelines have served me well, they do leave me open to questions. I’m always happy to engage on Twitter or in the comments section about the details of my approach, but I feel I should directly address the most common question, namely how I build my data visualizations and with what software. These questions picked up again after the visualization I posted last Friday concerning online American interest in 16 different sports:

Tableau Public is my not-so-secret weapon of choice. I have yet to find a data product that matches it in terms of visualization, analysis, or, especially, cost (free). Its options for visualization (particularly mapping) are deep, and don’t require any programming knowledge, and when you get to know the tool, interactivity and analysis naturally build themselves into these visualizations.

I use its full-featured older sibling, Tableau Desktop, for proprietary work in my day job, which I can’t discuss here. I will say, though, that Tableau is being quite honest when it trumpets that Public and Desktop are nearly identical. The only substantive differences are that 1) everything in Public has to be saved openly on the web, 2) Desktop can import data from far more sources than Excel, Access, and text files, and 3) Public can’t use data sources with greater than 1,000,000 rows. If there’s no harm in someone’s data being publicly available, these limitations will seldom be a problem for them.

Long recommendation short, if you make data charts and have a PC (I have heard Tableau hopes to roll out their products for Mac as early as this summer), you owe it to yourself to install Tableau Public and try it out. There is a tutorial treasure trove on Tableau’s own website and message boards. Also, if you want to do something with the software and the path to a solution isn’t readily apparent, Google and Bing are your best friends. For example, Ben Jones’ post on embedding YouTube videos was vital to my working up a chalkboard linking shot locations to video highlights.

April is Tableau Tips month, which has sparked a boom in writings about tricks and hacks of the software. Tableau’s official blog named the month’s best tips this morning, including Nelson Davis’ blog, The Visioneer, where Nelson posted a new trick on each of April’s 30 days. If you are already familiar with Tableau, click those links to expand your toolbox.

My Contribution to Tableau Tips Month:

Most of the tricks I use are borrowed. I find someone else’s explanation of a technique and use it in conjunction with other things I know about Tableau and my data to build a visualization. In preparing last week’s post, though, I stumbled on a very basic trick I haven’t seen elsewhere, and that others could benefit from.

Space is often at a premium in a dashboard, and that can be doubly true in Tableau, where you often need to include some explanations of interactive features built into the visualization. Others have used a button that viewers can hover over for detailed instructions, or a launch screen with instructions that fade away with one click.

In the viz I embedded above, I felt that it was necessary to point out that the graphs and calendar all act as filters for each other. There was a nice chunk of negative space in the line graph where I could put such a note, but if I just floated a standard text box there, it would often get in the way post-filtering. For example, someone isolating soccer, baseball, and ice hockey (via clicking on those sports in the bar graph) would have seen this:

You can still see the graph, thanks to text box transparency, but this is hardly ideal. So, instead of a standard text box, I put my note in an area annotation. This way, the instructions usually disappear once the viewer filters the data, at which point they no longer have any need of the note. This is accomplished in a few quick steps:

1) Right-click in an empty space on the graph (preferably just on the edge of your chart’s negative space) and click “Annotate,” then “Area …”

2) Type up your note and click OK.

3) By default it will probably be cramped, so tinker with the height, width, and placement of your note. If it still doesn’t fit comfortably in the negative space, double-click on your text then adjust font size, line spacing, and/or beef up the brevity.

      

4) If, like me, you don’t want the default border, right-click your note, select “Format…” then go to the “Border” dropdown menu on the left side of the screen and click “None.”

That’s it. If your readers filter in such a way that the point on which you right-clicked in step one is no longer graphed, then the note disappears instead of potentially getting in the way. Axes of Tableau graphs are automatic by default, and adjust for filtered data. If you fix the axis minimums and maximums instead, the note will hang around.

There are certainly other techniques for offering viz guidance to your audience, but this is my favorite so long as my graphs have enough negative space to accommodate a sufficient explanation. I wouldn’t be surprised if someone else has done the same, but Googling “Tableau disappearing instructions” didn’t yield anything. If I’ve actually discovered a new, useful application of area annotation in Tableau, woohoo.

Google Trends in Sports Part 1: Seasonality and “Popularity”

Nate Silver’s refurbished FiveThirtyEight has run some interesting columns recently, centered around North American sports teams’ and leagues’ results in Google Trends. All of them are recommended reading. I do have some reservations, particularly around describing Trends data in terms of popularity, but we’ll get to that later. Mostly I came away curious about some issues on the periphery of FiveThirtyEight’s analysis. Silver and his colleagues used worldwide search data, but what would this look like when confined to the United States? It stands to reason that in this vast country, there should be some interesting seasonal and regional issues at play. Also, the FiveThirtyEight focus on teams and leagues was interesting, but since most sports have multiple college and pro options for people to follow, I wanted to investigate at the level of the sports themselves.

Thankfully, Google Trends makes it very easy to gain insight on all these levels. I narrowed searches to April 2006 through March 2014, in order to get a large sample without over-representing particular months or even the four-year cycles that are key to some sports’ biggest events. Searching for what are commonly thought of as the big five sports in the USA yielded the following graph:


“Football” and “American Football” are Trends’ official categories for soccer and tackle football (and looking at Google’s export data in detail, I’m confident that the results’ term confusion is minimal). I’ve chosen to make the terminology more USA-centric within the rest of the article and my own visualizations.

Online interest in these sports certainly ebbs and flows over the course of every year, and it appears to follow annual patterns, with the biggest divergences coming from soccer and hockey, in association with World Cups and Olympics, respectively. Tackle football and basketball have the highest peaks, while ice hockey only very rarely even threatened to be the fourth most-searched sport in the country. But it is a bit jumbled, and would be even more so if I added in other sports.

Nicely, Google Trends allows users to download the data so they can take a deeper dive themselves. I pulled reports for every sport I could think of (always including tackle football in the query, as Trends scales results relative to the most popular single instance) and put them into one master list. Trends did not offer mixed martial arts and golf as sports (if there isn’t a Trends category for a sport, then the data would also reflect searches for the Volkswagen Golf, for example), so I searched for their dominant competitions, the UFC and PGA, which have Trends categories as leagues, instead. Sports with overall ratings of zero (sorry, bicycle racing and rugby union) were excluded. With that data set in hand, Tableau Public made it easy to compile a graph of online interest in each of these sports, averaging across each individual week of the year. The minor sports are hard to see, but using the column graph to filter to those smaller competitions makes it easy to compare them.
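
For anyone who would rather script the averaging step than build it in Tableau, here is a minimal Python sketch of the idea. It is not how I actually built the graph. The row layout, the sample scores, and the choice of ISO week numbers as the definition of "week of the year" are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical master list rows: (week_start, sport, trends_score).
# The scores are made up for illustration, not real Google Trends data.
rows = [
    (date(2012, 9, 2), "tackle football", 100),
    (date(2013, 9, 1), "tackle football", 90),
    (date(2012, 9, 2), "soccer", 30),
    (date(2013, 9, 1), "soccer", 34),
    (date(2012, 3, 11), "basketball", 80),
    (date(2013, 3, 10), "basketball", 84),
]

def average_by_week_of_year(rows):
    """Average each sport's scores across years, grouped by ISO week number."""
    buckets = defaultdict(list)
    for week_start, sport, score in rows:
        iso_week = week_start.isocalendar()[1]  # assumption: ISO week numbering
        buckets[(sport, iso_week)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}

weekly = average_by_week_of_year(rows)
for (sport, week), avg in sorted(weekly.items()):
    print(f"{sport}, week {week}: {avg}")
```

Because every pull included tackle football, all scores share one scale, so the sketch can average them without any further rescaling.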

Unsurprisingly, the patterns that were looped eight times in the first graph are easier to spot in this view. The two consistent high-points are tackle football cresting as August turns to September (just prior to the NFL season), and basketball doing so in mid-March. It’s probably not a coincidence that both fantasy football drafts and March Madness brackets are commonly researched online.

Soccer is steadily in second or third place throughout the year, even popping into first during most of June and July. Hockey just doesn’t see nearly as much interest as the other major sports (but does feature prominently in some states, which we’ll get into when I write about regionality during part two of this series), and even during the NHL season it is sometimes out-searched by auto racing, tennis, swimming, MMA, and boxing. Online interest in baseball certainly exceeds that of hockey, but not many others, and its scores are highest in the first couple months of the MLB season.

Counter-intuitively, online interest seems to rarely peak in association with the end of a sports season, when trophies get handed out, and TV ratings are at their highest. This gets at the heart of some issues around the word FiveThirtyEight used most often when presenting Google Trends data. It is literally a record of how often Google users search regarding topics, and while that points in the direction of popularity, labeling it as online interest or online intrigue would be more accurate, and reflects verbiage from Google’s own descriptions of the service. To illustrate, let’s look at a couple hypothetical fans:

1) A baseball fan in his sixties, who has season tickets to his local MLB park, and new information is regularly available to him through TV, newspapers, magazines, etc., which he takes advantage of religiously. Maybe he bookmarks his favorite baseball websites, but unless he’s big on sabermetrics, he probably doesn’t need to make baseball-related Google searches very often. Even then he might just thumb through his well-worn Baseball Almanac.

2) A twenty-something soccer supporter, who follows multiple teams and leagues around the world. Even if she has a great TV package that carries all the matches she wants to see, she still hits Google regularly to make sure she knows which channel to turn to for a particular match. If her cable package doesn’t include the ESPN, Fox Sports, beIN, GolTV, etc. channel that she needs to watch that fixture, a mass of Googling will probably be necessary to find a reliable-enough, illicit stream of the match.

Is one of them more interested in their favorite sport than the other? They could both be labeled fanatics, but they would have enormously different impacts on Google Trends. Obviously, these are two extremes, but it seems to me that sports that attract younger fans and provide them a natural need to search for information pertinent to that sport online are generally going to have inflated Trends scores, relative to their overall popularity.

Driving the earlier graphs by median provides another useful angle to explore this data, albeit one that would seem to reflect overall popularity to an even lesser extent.

Soccer is number one! Well, at least it is when isolating the most middle-of-the-road week (layman’s definition of the median) and judging by online intrigue. There isn’t an offseason, since European leagues, MLS, Liga MX, and others all have different schedules, and as I implied in describing the hypothetical fan earlier, soccer fans in this country generally tend to be young, open to technology, and have a real need to use Google in order to follow their sport.
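
To see why switching from average to median can reorder the sports, consider a minimal Python sketch with made-up weekly scores (these are hypothetical numbers, not real Trends data). One sport spikes for a few weeks, while the other holds steady all year.

```python
from statistics import mean, median

# Hypothetical weekly scores, invented for illustration only.
spiky = [10, 10, 10, 10, 90, 100]   # quiet most weeks, huge in season
steady = [30, 30, 35, 35, 30, 30]   # no real offseason

spiky_mean, spiky_median = mean(spiky), median(spiky)
steady_mean, steady_median = mean(steady), median(steady)

print(f"spiky:  mean {spiky_mean:.1f}, median {spiky_median}")
print(f"steady: mean {steady_mean:.1f}, median {steady_median}")
```

The spiky sport wins on the mean because its few huge weeks pull the average up, but the steady sport wins on the median, which only looks at the middle-of-the-road week.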

Also worth noting here that while Trends enhancements claim to specialize in sorting out tricky terms, “football” is absolutely one of the trickiest, and is very much central to this study. I’m not as concerned with this as some, because I would hope that Google is using outcomes as well as search terms in its criteria. It seems unlikely that the average American would click a link concerning David Beckham when they were trying to search regarding NFL football.

Based on media exposure, advertising revenue, and other factors, we know that soccer isn’t one of the top three of the States’ sports hierarchy, let alone at the head of it. But when factoring in searches for every professional and amateur competition, as well as equipment and miscellaneous other sport-related queries, the beautiful game is clearly the steadiest presence in this Trends data.

While it would be reckless to claim that Google Trends illustrates an indisputable hierarchy of sports in the United States, beware any claims that it is utterly irrelevant. Every year, more and more of American life takes place online, and online interest should at the very least be of great interest to anyone writing about sports online, or who is tasked with marketing to the valuable demographics that skew toward Google Trends.


Since the Tableau visualizations in this post have implications well beyond my usually soccer-centric audience, it’s a good time to note that anyone can feel free to embed any of my Tableau Public work on their own website. Just click the “Share” link in the bottom left of any viz and copy the embed code into your site’s editor that is labeled HTML or text. Mentioning me in the article and on Twitter (@StatHunting) would of course be appreciated as well.

2014 MLS Salaries Visualized

Yesterday was Christmas for a certain breed of MLS nerd. The MLS Players Union released 2014 player salaries for public consumption. The release does not reflect the most labyrinthine aspects of MLS wage rules, like retention funds, allocation money, etc., but I don’t know of another soccer league anywhere that has player wage data like this available mid-season.

I have delved into all the previous MLSPU salary releases, and I strongly advise that people resist the urge to focus on specifics here. The odds that a player’s salary listed here is their exact salary cap cost fall somewhere on a spectrum between unlikely and impossible. Only the first 20 players on the roster count toward the cap, and the MLSPU release does not reflect allocation money, retention funds, former/lending clubs continuing to pay some wages, or special player statuses like Homegrown, Generation Adidas, or especially Designated Player. It almost feels appropriate that on the MLSPU website this file is listed as up to date through April Fools’ Day. However, we can get a good sketch of which clubs spend most, and of the wage disparities between the top and bottom of each club’s roster.

According to this release, yes, the guaranteed compensations of Clint Dempsey, Michael Bradley, Jermain Defoe, Landon Donovan, Robbie Keane, and Thierry Henry are higher than those for the full rosters of at least 12 of the 19 clubs in MLS. Those six (1.1% of players on club rosters) make 28.5% of the league’s full player wages. Also, the lowest salary reported, $36,500 made by 54 different players, is 0.5% of the highest, Dempsey’s $6,695,189. Of course, those players all bring in merchandising, ticket sales, and headlines that the grunts don’t. It would not be surprising if MLS and club accountants file some portion of the big player expenditures under marketing, instead of wages.
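
For anyone who wants to check the lowest-versus-highest comparison, a quick Python sketch reproduces it from the two figures quoted above. The variable names are just illustrative.

```python
# Figures quoted from the MLSPU release discussed above.
lowest_salary = 36_500
highest_salary = 6_695_189

# Lowest salary as a percentage of the highest.
ratio_pct = lowest_salary / highest_salary * 100

# How many minimum salaries fit inside the top salary.
minimums_per_top = highest_salary / lowest_salary

print(f"Lowest is {ratio_pct:.1f}% of highest")
print(f"Top salary equals roughly {minimums_per_top:.0f} minimum salaries")
```

Put another way, the top salary would cover roughly 183 players at the minimum.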

About a month ago I showed that total salaries have been a very poor predictor of league points going all the way back to the first MLSPU release in 2007. That’s not to say they are irrelevant, but their influence is overwhelmingly more subtle on the field than in big European leagues, which some have joked might as well be played on a balance sheet.

Figures like this are sure to be a major topic of conversation in upcoming collective bargaining negotiations between MLS and its Players Union. The current CBA expires after this season, and most of the expected points of contention are related in small or total ways to salary disparity. The players will want higher minimums, free agency, and a big boost to the salary cap, while the league will likely seek to maintain as much of the status quo as they can in the name of profitability and stability. Fans of the league would be well-served by becoming at least passingly familiar with the wage dynamics at play as MLS heads toward this critical juncture. Hopefully the above visual will clarify the issues for some.