Analyzing Data with ADAGE

Exploring Events by Type

One of the first questions I had for our data was:

What Does the Distribution of Event Types Look Like?

First, I'm going to show the Python code I used to get a table of event counts by type. Then, I'm going to explain why the code looks the way it does. (Also, please note that I am not a perfect programmer, so this may not be the best way, but it is a way.)

ms.groupby('key').count().sort(columns=['timestamp'], ascending=False)[timestamp]

key	timestamp
MakeConnectComponent	2609
MakeSnapshot	2086
MakeDisconnectComponent	943
MakeAddComponent	891
MakeRemoveComponent	840
MakeCircuitCreated	229
MakeResetBoard	211
MakeSpawnFish	161
MakeCaptureFish	136
MakeEndGame	100
MakeSummonBoard	96
MakeStartGame	92
MakeModeChange	45
MakeVisabilityChange	43
ADAGEStartSession	23

Two things to note here. First, in the table above the numbers are counts. It's slightly confusing because the column title you'll get in output will be timestamp in our case, but what pandas is reporting is the number of timestamp events that match each key type in the left-hand column.

Second, I'm using a pattern of programming called method chaining, which is a common practice in other programming frameworks such as d3. In short, ms is a dataframe object, when I call methods on it, those methods may return other objects which also have methods I might want to call in sequence. So, the groupby() function takes ms as input, returns a grouped dataframe, and passes that grouped dataframe as input to .count(). Another way of thinking of method chaining is that a.b().c() can be understand as the function composition $c(b(a))$.

So, let's break down my line of code from left to right:

ms.groupby('key').count().sort(columns=['timestamp'], ascending=False)[timestamp]

ms - take the ms dataframe, then
groupby('key') - group its data by key, then
count() - count the number of data items in each group, then
sort() the data, in descending order, according to the count in the timestamp column, and finally
[timestamp] selects just the timestamp column

It seems like one big ol' line of code, but it's actually a complex stepwise procedure that I got to express in a compact way because of method chaining.

Visualizing Events by Type

Now, if we'd prefer to visualize our data, we can actually do so quite easily. Here, we'll make a bar chart of the types of events in our data and how many of each type are in the dataset:

msdata = ms.groupby('key').count().sort(columns=['timestamp'], ascending=False)
p = msdata['timestamp'].plot(kind='bar')
print(p)

Bar chart of events by type

Houston, We Have a Problem

A big one. Continue on to the next section to see why this bar chart is bad news.