After seeing a Reddit post on the convergence of Miss Korea faces, supposedly due to high rates of plastic surgery, graduate student Jia-Bin Huang analyzed the faces of 20 contestants. Below is a short video of each face slowly transitioning to the other.

From the video and pictures it's pretty clear that the photos look similar, but Huang took it a step further with a handful of computer vision techniques to quantify the likeness between faces. And again, the analysis shows similarity between the photos, so the gut reaction is that the contestants are nearly identical.

However, you have to assume that the pictures are accurate representations of the contestants, which doesn't seem to pan out at all. It's amazing what some makeup, hair, and Photoshop can do.

You gotta consider your data source before you make assumptions about what that data represents.

David Brooks for *The New York Times* on the philosophy of data and what the future holds:

If you asked me to describe the rising philosophy of the day, I’d say it is data-ism. We now have the ability to gather huge amounts of data. This ability seems to carry with it certain cultural assumptions — that everything that can be measured should be measured; that data is a transparent and reliable lens that allows us to filter out emotionalism and ideology; that data will help us do remarkable things — like foretell the future.

Be sure to read the comments. There's actually quite a bit of anti-data talk.

Jeff Leek, an Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, is teaching a course on data analysis on Coursera, appropriately named Data Analysis.

This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write-up data analyses. Then we will cover some of the most popular and widely used statistical methods like linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis.

The course starts on January 22, 2013.

You might also be interested in Computing for Data Analysis taught by Roger Peng, who is also a biostatistics professor at Johns Hopkins. Leek's course is focused on statistical methods, whereas Peng's course is focused on programming. Better take both. [via Revolutions]

Thomas H. Davenport and D.J. Patil give the rundown on what a data scientist is, what to look for and how to hire them. It's an article in Harvard Business Review, so it's geared towards managers, and I felt like I was reading a horoscope at times, but there are some interesting tidbits in there.

Data scientists don’t do well on a short leash. They should have the freedom to experiment and explore possibilities. That said, they need close relationships with the rest of the business. The most important ties for them to forge are with executives in charge of products and services rather than with people overseeing business functions. As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.

I still call myself a statistician. The main difference between data scientist and statistician seems to be programming skills, but if you're doing statistics without code, I'm not sure what you're doing (other than theory).

**Update:** This recent panel from DataGotham also discusses the data scientist hiring process. [Thanks, Drew]

In most papers we at Ars cover, we'll be pleasantly surprised to find a single clever turn of phrase that has survived multiple rounds of editing and peer review. So it was an unexpected surprise to come across a paper where the authors, all professors of economics, have spent the entire text with tongues so firmly planted in their cheeks that they threatened to burst out, alien-style. It surprised me even more to find it in a journal that is produced on behalf of the Royal Statistical Society and American Statistical Association. Credit to the statisticians, though, for the journal's clever name: *Significance*.

What topic allowed the economists to cut loose? Bank robberies—or more specifically, the finances thereof. The UK's banking trade organization decided it wanted an analysis of the economic effectiveness of adding security measures to bank branches. The professors did that, but in the process, they also did an analysis that looked at the economics of bank robbery from the thieves' perspective.

The results were not pretty. For guidance on the appropriateness of knocking over a bank, the authors first suggest that a would-be robber might check with a vicar or police officer, but "[f]or the statistics, look no further. We can help. We can tell you exactly why robbing banks is a bad idea."

Every time I have talked to someone about learning more machine learning they always point me to *The Elements of Statistical Learning* by Hastie, Tibshirani, and Friedman. This book has the good fortune of being available online for free (a hard copy does have a certain appeal, but is not required) and it is a really great introduction to the subject. I have not read everything in it yet, but I have read much of it and it has really helped me understand things better.

Another resource that I have been working my way through is the Stanford Machine Learning class, which is also online and free. Andrew Ng does a great job of walking you through things. I find it particularly helpful, because my background in implementing algorithms is weak (I am a self-taught programmer) and it shows you how to implement things in Octave (granted, R has much of it implemented in packages already). I also found these notes on reddit statistics a few months ago, so I kind of skim through those and then watch the video and reflect on it with my own notes.

My background is in statistics and I got some exposure to machine learning concepts (a good buddy of mine is really into it), but I have always felt like I am lacking on the machine learning front, so I have been trying to learn it all a bit more on my own. Thankfully there are a ton of great resources out there.

As far as getting a job in the industry or graduate school requirements I am not in a great position to advise (turns out I have never hired anyone), but I have noticed that the business world seems to really like people that can do things and are a bit less concerned with pieces of paper that say you can do something.

If I were you, I would spend some of my free time getting confident in my machine learning knowledge and then implement things as you see opportunities. Granted your position may not give you that opportunity, but if you can get something implemented that adds value to your company (while maintaining your other obligations), I can't imagine anyone being upset with you. The nice thing here is if you do find yourself doing a bit of machine learning at this job, when you go out looking for a new job you can talk about the experience you already have, which would help folks look past a lack of a specific degree.

There are a lot of resources and it's incredibly interesting. I wish you luck!

Another idea: You could start a blog about your Machine Learning learning process and maybe document a few projects you work on in your free time. I have done this with a programming project and it allows you to talk about a project you are working on in your free time (looks good to the employer) and you can also direct them to the blog (obviously keep it professional) about your work. So far I have sent quite a few people to my dorky little programming blog (I have been a bit lazy on posting lately, but I kept it up to date when I was applying to jobs) and everyone I have talked to has been impressed with it.

I've been reading papers on how people learn statistics (and thoughts on teaching the subject) and came across the frequently-cited work of mathematical psychologists Amos Tversky and Daniel Kahneman. In 1972, they studied statistical misconceptions. It doesn't seem much has changed. Joan Garfield (1995) summarizes in How to Learn Statistics [pdf].

**Representativeness:**

People estimate the likelihood of a sample based on how closely it resembles the population.

You can't always judge how likely or improbable a sample is based on how it compares to a known population. For example, let's say you flip a coin four times and get four tails in a row (TTTT). Then you flip four more times and get HTHT. In the long run, heads and tails are going to be split 50/50, but that doesn't mean the second sequence is more likely.

Similarly, a sequence of ten heads in a row isn't the same as getting a million heads in a row.
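To make the coin example concrete, here is a minimal sketch that assumes a fair coin and independent flips. The function names are made up for illustration. It shows that TTTT and HTHT are equally likely as exact sequences, even though "two heads in four flips, in any order" is far more common than "no heads at all," which is probably where the intuition goes wrong.

```python
from fractions import Fraction
from itertools import product


def sequence_probability(sequence: str, p_heads: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability of seeing this exact sequence of H/T flips, in this order."""
    prob = Fraction(1)
    for flip in sequence:
        prob *= p_heads if flip == "H" else 1 - p_heads
    return prob


def probability_of_heads_count(n_flips: int, n_heads: int) -> Fraction:
    """Probability that n_flips fair flips contain exactly n_heads heads, in any order."""
    outcomes = list(product("HT", repeat=n_flips))
    matching = [o for o in outcomes if o.count("H") == n_heads]
    return Fraction(len(matching), len(outcomes))


print(sequence_probability("TTTT"))      # 1/16
print(sequence_probability("HTHT"))      # 1/16
print(probability_of_heads_count(4, 2))  # 3/8
print(probability_of_heads_count(4, 0))  # 1/16
```

HTHT looks more "random" because it belongs to the bigger group of two-heads outcomes, but as one specific sequence it has the same 1-in-16 chance as TTTT.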

**Gambler's fallacy:**

Use of the representative heuristic leads to the view that chance is a self-correcting process.

The history boards at roulette tables mean nothing. They're just for show. Just because a red hasn't come up in a while doesn't mean the roulette wheel is due for a red soon. Each spin is independent of the spins that came before it.
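If you'd rather see it than take it on faith, a quick simulation helps. This is only a sketch: it assumes a single-zero wheel (18 red pockets out of 37) and truly independent spins, and the function name and parameters are invented for the example. It compares the overall share of reds with the share of reds on spins that come right after five or more non-red results.

```python
import random


def red_rate_after_drought(n_spins: int, drought: int, p_red: float = 18 / 37, seed: int = 0):
    """Simulate independent spins.

    Returns (overall red rate, red rate on spins that follow at least
    `drought` consecutive non-red spins).
    """
    rng = random.Random(seed)
    spins = [rng.random() < p_red for _ in range(n_spins)]  # True means red

    after_drought = []
    run_without_red = 0
    for is_red in spins:
        if run_without_red >= drought:
            after_drought.append(is_red)
        # Reset the streak on red, otherwise extend it.
        run_without_red = 0 if is_red else run_without_red + 1

    overall = sum(spins) / len(spins)
    conditional = sum(after_drought) / len(after_drought)
    return overall, conditional


overall, conditional = red_rate_after_drought(n_spins=1_000_000, drought=5)
print(f"red overall: {overall:.3f}, red after 5+ misses: {conditional:.3f}")
```

The two rates should land close to each other and close to 18/37, apart from sampling noise. A drought doesn't make red any more likely.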

**Base-rate fallacy:**

People ignore the relative sizes of population subgroups when judging the likelihood of contingent events involving the subgroups.

You have to consider the base population for comparison. Maybe a company is made up of 80 percent men and 20 percent women. If your base is the US population, you might consider that an inequality, but what if the applicant breakdown was 90 percent men and 10 percent women? In the latter case, a higher percentage of women than men were actually hired.
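Here's the arithmetic behind that, as a small sketch. The head counts are hypothetical, chosen only to match the percentages in the example (1,000 applicants split 90/10, and 100 hires split 80/20), and it assumes the company's makeup comes entirely from that applicant pool.

```python
def hire_rate(hired: int, applicants: int) -> float:
    """Share of a group's applicants who were hired."""
    return hired / applicants


# Hypothetical counts matching the 90/10 applicant split and 80/20 hire split.
applicants = {"men": 900, "women": 100}
hired = {"men": 80, "women": 20}

rates = {group: hire_rate(hired[group], applicants[group]) for group in applicants}

for group, rate in rates.items():
    print(f"{group}: {rate:.1%} of applicants hired")
# men: 8.9% of applicants hired
# women: 20.0% of applicants hired
```

Against the applicant pool, women were hired at more than twice the rate of men, which is the opposite of what the raw 80/20 split suggests.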

**Availability:**

Strength of association is used as a basis for judging how likely an event will occur.

Just because some percentage of your friends are designers doesn't mean that the same percentage of people are designers elsewhere (obviously). Or the example that Garfield uses: a ten percent divorce rate among people you know isn't necessarily the same nationwide or globally.

**Conjunction fallacy:**

The conjunction of two correlated events is judged to be more likely than either of the events themselves.

The common example from Tversky and Kahneman:

"Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations." A group of people were asked if it was more probable that Linda was a bank teller or a bank teller active in the feminist movement (a sign of the times this poll was taken).

Eighty-five percent of respondents chose the latter, but the probability of two things happening together is always less than or equal to the events occurring individually.
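The inequality takes one line to show. Let A be "Linda is a bank teller" and B be "Linda is active in the feminist movement" (the labels A and B are mine, not from the paper):

```latex
P(A \cap B) = P(A)\,P(B \mid A) \le P(A),
\qquad \text{because } 0 \le P(B \mid A) \le 1 .
```

The same argument with the roles swapped gives P(A ∩ B) ≤ P(B), so however well the description fits B, adding the bank teller condition can only keep the probability the same or shrink it.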

Notice that there's still not much math involved in these examples. It's logic that plays into thinking like a statistician without the math (with statistical foundations). You can get a lot done just by thinking critically about your data.

This is a topological similarity network of 452 NBA players during the 2010-2011 season. Players (in circles) are connected to other players by edges (lines) based on how similar they are with regard to points, rebounds, assists, steals, blocks, turnovers and fouls, all normalized to per-minute values in the 2010-2011 season. Further, the network is colored by a player's points-per-minute average, with blue being low and red being high.

For as long as basketball has been played, it’s been played with five positions. Today they are point guard, shooting guard, small forward, power forward and center. A California data geek sees 13 more hidden among them, with the power to help even the Charlotte Bobcats improve their lineup and win more games.

Muthu Alagappan is a Stanford University senior, a basketball fan and an intern at Ayasdi, a data visualization company. Ayasdi takes huge amounts of info like tumor samples and displays it in interactive shapes that highlight patterns like genetic markers that indicate a likelihood of ovarian cancer. It’s called topological data analysis, and it can be applied to sports, too.

That is exactly what Alagappan did.

Dallas Mavericks' Dirk Nowitzki (41) is not a forward and Jason Terry (31) is not a guard, but rather a scoring rebounder and an offensive ball handler under an analytics model that reveals 13 new positions. *Photo: David J. Phillip/Associated Press*

He used the company’s software to crunch a data set of last season’s stats for 452 NBA players. He discovered new ways to group players (.pdf) based on performance after noting, for example, that Rajon Rondo of the Boston Celtics had more in common with Miami Heat forward Shane Battier than with fellow point guard Tony Parker of the San Antonio Spurs.

After reading his map, Alagappan came up with 13 new positions based on the three typical roles of guard, forward and center:

- Offensive Ball-Handler. This guy handles the ball and specializes in points, free throws and shots attempted, but is below average in steals and blocks. Examples include Jason Terry and Tony Parker.
- Defensive Ball-Handler. This is a defense-minded player who handles the ball and specializes in assists and steals, but is only so-so when it comes to points, free throws and shots. See also: Mike Conley and Kyle Lowry.
- Combo Ball-Handler. These players are adept at both offense and defense but don’t stand out in either category. Examples include Jameer Nelson and John Wall.
- Shooting Ball-Handler. Someone with a knack for scoring, characterized by above-average field goal attempts and points. Stephen Curry and Manu Ginobili are examples.
- Role-Playing Ball-Handler. These guys play fewer minutes and don’t have as big a statistical impact on the game. Hello, Arron Afflalo and Rudy Fernandez.
- 3-Point Rebounder. Such a player is a ball-handler and big man above average in rebounds and three-pointers, both attempted and made, compared to ball-handlers. Luol Deng and Chase Budinger fit the bill.
- Scoring Rebounder. He grabs the ball frequently and demands attention when on offense. Dirk Nowitzki and LaMarcus Aldridge play this position.
- Paint Protector. A big man like Marcus Camby and Tyson Chandler known for blocking shots and getting rebounds, but also for racking up more fouls than points.
- Scoring Paint Protector. These players stand out on offense and defense, scoring, rebounding and blocking shots at a very high rate. Examples include Kevin Love and Blake Griffin.
- NBA 1st-Team. This is a select group of players so far above average in every statistical category that the software simply groups them together regardless of their height or weight. Kevin Durant and LeBron James fall in this category.
- NBA 2nd-Team. Not quite as good, but still really, really good. Rudy Gay and Caron Butler are examples.
- Role Player. Slightly less skilled than the 2nd-team guys, and they don’t play many minutes. Guys like Shane Battier and Ronnie Brewer fall under this position.
- One-of-a-Kind. These guys are so good they are off the charts — literally. The software could not connect them to any other player. Derrick Rose and Dwight Howard are examples, but you already knew that.

The 13 positions are based on how players compare to the league average in seven statistical categories: Points, rebounds, assists, steals, blocked shots, turnovers and fouls. The stats were normalized on a per-minute basis to adjust for playing time, so starters got the same consideration as backups.
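The per-minute normalization and the comparison to league average can be sketched in a few lines. To be clear, this is not Ayasdi's topological method. It is a much simpler stand-in (z-scores plus nearest neighbor by Euclidean distance), and the players and season totals below are invented for illustration.

```python
from math import dist
from statistics import mean, pstdev

STATS = ["points", "rebounds", "assists", "steals", "blocks", "turnovers", "fouls"]

# Hypothetical season totals, invented for illustration only.
players = {
    "Guard A": {"minutes": 2800, "points": 1500, "rebounds": 250, "assists": 600,
                "steals": 120, "blocks": 15, "turnovers": 220, "fouls": 160},
    "Guard B": {"minutes": 1400, "points": 760, "rebounds": 120, "assists": 310,
                "steals": 58, "blocks": 8, "turnovers": 105, "fouls": 85},
    "Center A": {"minutes": 2600, "points": 900, "rebounds": 800, "assists": 100,
                 "steals": 40, "blocks": 150, "turnovers": 130, "fouls": 260},
    "Center B": {"minutes": 1200, "points": 420, "rebounds": 360, "assists": 50,
                 "steals": 20, "blocks": 66, "turnovers": 62, "fouls": 125},
}


def per_minute(totals):
    """Divide each counting stat by minutes played to adjust for playing time."""
    return [totals[stat] / totals["minutes"] for stat in STATS]


def standardize(rows):
    """Z-score each column so every stat is measured relative to the league average."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread
    return [[(v - m) / s for v, m, s in zip(row, means, spreads)] for row in rows]


names = list(players)
profiles = dict(zip(names, standardize([per_minute(players[n]) for n in names])))


def most_similar(name):
    """Return the other player whose standardized profile is closest."""
    others = [n for n in names if n != name]
    return min(others, key=lambda n: dist(profiles[name], profiles[n]))


print(most_similar("Guard B"))   # Guard A
print(most_similar("Center B"))  # Center A
```

Because everything is per minute, the backup "Guard B" lands next to the starter "Guard A" despite playing half the minutes, which is the point of the normalization.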

That said, the names of some of these new positions could use a bit of work. For example, Rondo, the Celtics’ floor leader, is classified as a “role player,” which is commonly used in basketball to describe a so-so player with a specific, if unremarkable, set of skills.

This is the same topological network of players, with red regions indicating the Dallas Mavericks. This representation shows the diversity of playing styles of Mavericks’ players.

Even if no one is going to refer to Dirk Nowitzki of the Dallas Mavericks as one of the league’s best “scoring rebounders” any time soon, Alagappan’s prize-winning analysis could change how coaches and general managers think about the roles their players fill. Alagappan proved the title-winning Mavs had a solid diversity of “ball handlers” and “paint protectors,” giving them the ability to put a balanced lineup on the floor with few weak spots. The Western Conference cellar dwellers the Minnesota Timberwolves, on the other hand, had too many players with similar styles and a dearth of “scoring rebounders” and “paint protectors,” leaving them vulnerable along the front line.

This is the same topological network of players, with red regions indicating the Minnesota Timberwolves.

Alagappan’s findings won the award for best Evolution of Sport this spring at the annual MIT Sloan Sports Analytics Conference.

Whenever sports and numbers meet, the *Moneyball* question inevitably arises: Is it possible to use big data sets to find undervalued players? Alagappan believes it is.

He isolated the 40 players in the “scoring rebounder” section who best epitomized that group. At the top were the stars you might expect: Carmelo Anthony and Amare Stoudemire of the New York Knicks, along with Nowitzki and the Los Angeles Lakers’ Pau Gasol. But lesser-known players like Marreese Speights of the Memphis Grizzlies and the Lakers’ Devin Ebanks produced statistically similar per-minute results. Even better, where Anthony’s salary averages around $18.5 million per year, the Lakers are paying Ebanks about $740,000.

Another inevitable question: Could Ayasdi’s software have predicted the success of Knicks rookie Jeremy Lin? Alagappan concedes Lin’s college stats wouldn’t have suggested or predicted Linsanity, but he did create a similarity network to identify those players most similar to Lin in college. Three names emerged from the 3,400 analyzed: DeMarcus Cousins, who the Sacramento Kings picked fifth overall in the 2010 NBA draft; Alec Burks, picked 12th in 2011 by the Utah Jazz; and Nik Raivio, a University of Portland guard currently playing ball in Kaposvar, Hungary.

The lesson? For teams who buy into this new classification of players, the next Jeremy Lin might be in Hungary, awaiting your call.

*Photo: Dallas Mavericks’ Dirk Nowitzki (41) and Jason Terry (31) defend Miami Heat’s Dwyane Wade during the second half of Game 2 of the 2011 NBA Finals. Photo: David J. Phillip/Associated Press*

George E.P. Box, a statistician known for his body of work in time series analysis and Bayesian inference (and his quotes), recounts how he became a statistician while trying to solve actual problems. He was a 19-year-old college student studying chemistry. Instead of finishing, he joined the army, fed up with what the British government was doing to stop Hitler.

Before I could actually do any of that I was moved to a highly secret experimental station in the south of England. At the time they were bombing London every night and our job was to help to find out what to do if, one night, they used poisonous gas.

Some of England's best scientists were there. There were a lot of experiments with small animals, I was a lab assistant making biochemical determinations, my boss was a professor of physiology dressed up as a colonel, and I was dressed up as a staff sergeant.

The results I was getting were very variable and I told my colonel that what we really needed was a statistician.

He said "we can't get one, what do you know about it?" I said "Nothing, I once tried to read a book about it by someone called R. A. Fisher but I didn't understand it". He said "You've read the book so you better do it", so I said, "Yes sir".

Box eventually worked with Fisher, studied under E. S. Pearson in college after his discharge from the army, and started the Statistical Techniques Research Group at Princeton at the insistence of one John Tukey.

The following outline is provided as an overview and guide to the variety of topics included within the subject of statistics:

**Statistics** pertains to the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities; it is also used and misused for making informed decisions in all areas of business and government.

## Describing data

## Experiments and surveys

### Sampling

## Analysing data

## Filtering data

## Statistical inference

- Statistical inference
- Mathematical statistics
- Bayesian inference
- Frequentist inference
- Decision theory
- Estimation theory
- Non-parametric statistics

## Probability distributions

- Probability distribution
- Conditional probability distribution
- Probability density function
- Cumulative distribution function
- Characteristic function