Mobile gaming experiences range from casual, non-committal, completely free affairs to potential monetary sinkholes that can demand hours of your time.
I personally play several mobile games and enjoy jumping from game to game just to see what each one is like. Sometimes, though, I get curious about the app store statistics behind these titles and how they rank relative to other games.
The main inspiration for this curiosity was this article (https://www.gamespot.com/articles/an-anime-inspired-game-surpassed-fortnite-as-twitt/1100-6464600/) from GameSpot. It states that Fortnite, the mobile version of the famous battle royale game that was hugely popular at the time, was actually talked about less on Twitter than the mobile game Fate/Grand Order, which I was an avid player of at the time. While I'm not completely sure how much revenue or how big a fanbase each game had back then (the article is from 2018), I have been curious to see how some of the mobile games I play stack up against other games on the market.
This project analyzes 2019 app data from the iTunes and Google Play stores, cleaning and combining both datasets and looking for trends that can be derived from them, in effect establishing a tier list of sorts across these games. We will also take a brief look at revenue streams for certain apps and examine the relationship between ranking, reviews, and revenue.
The iTunes mobile game dataset can be found here: https://www.kaggle.com/tristan581/17k-apple-app-store-strategy-games
The Google Play Store mobile game dataset can be found here: https://www.kaggle.com/lava18/google-play-store-apps/home?select=googleplaystore.csv
This project will also scrape data from https://thinkgaming.com/app-sales-data which covers data pertaining to iTunes mobile game sales (not Android). Unfortunately, most data on mobile game sales is locked behind paywalls on sites such as Think Gaming and Sensor Tower, which provide a wide variety of metrics mainly intended for corporate use.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re
While I use this pre-formatted, established iTunes app dataset here, there is also a way to access iTunes app data using the search API that Apple provides. The API is limited to roughly 20 queries per minute and is geared more towards music. I mention this because it is possible to scrape current data for ALL games by combining the iTunes sitemap with the iTunes API. However, for the purposes of this tutorial, I will leave the creation and curation of a more "current" dataset up to you.
The documentation is here: https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/
The iTunes sitemap (for games) is here: https://apps.apple.com/us/genre/ios-games/id6014
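For the curious, here is a minimal sketch of what pulling app links out of a sitemap-style listing page could look like. The HTML below is a simplified stand-in for the real page (whose exact markup may differ), so treat the `#selectedcontent` selector as an assumption to verify against the live page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the sitemap HTML; the live page's markup may differ
sample_html = """
<div id="selectedcontent">
  <ul>
    <li><a href="https://apps.apple.com/us/app/fate-grand-order/id1183802626">Fate/Grand Order</a></li>
    <li><a href="https://apps.apple.com/us/app/clash-of-clans/id529479190">Clash of Clans</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Collect (name, URL) pairs for every app link in the listing
app_links = [(a.text, a["href"]) for a in soup.select("#selectedcontent a")]
for name, url in app_links:
    print(name, url)
```

With the real sitemap you would fetch each genre page with `requests.get` first, then feed the app URLs into the API lookup shown below.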
I also provide a sample query below:
# Sample query of the itunes API for the game Fate/Grand Order which is one of the games I play
query = requests.get("https://itunes.apple.com/search?producerTerm=Aniplex+Inc.&term=Fate/Grand+Order&entity=software")
print(f"Status of request: {query}")
# To iterate through all of the results, use a simple for loop like this:
'''
for item in query.json()["results"]:
print(item["trackName"] + ", " + str(item["description"]))
'''
# Just grab the first result for brevity
item = query.json()["results"][0]
print(item["trackName"] + ", " + str(item["description"]))
# We can read in the itunes dataset using a simple pandas read_csv() command
itunes_df = pd.read_csv("./itunes_dataset/appstore_games.csv")
itunes_df.head(5)
For our purposes, it's safe to drop the URL, Subtitle, Icon URL, and Age Rating columns from this dataframe for the time being.
try:
    del itunes_df["URL"]
    del itunes_df["Icon URL"]
    del itunes_df["Age Rating"]
    del itunes_df["Subtitle"]
except KeyError:
    print("Columns already deleted")
I will also rename the columns and convert some of them to a more manageable, readable format.
# Rename the columns
renamed_cols = [item.lower() for item in itunes_df.columns.tolist()]
itunes_df.columns = renamed_cols
# Convert the release date and current version release date columns into datetime objects
itunes_df["current version release date"] = pd.to_datetime(itunes_df["current version release date"])
itunes_df["original release date"] = pd.to_datetime(itunes_df["original release date"])
# Convert the size (in bytes) to megabytes
itunes_df["size"] = itunes_df["size"].apply(lambda item: item/1000000)
# Break up the languages and genres columns into lists
itunes_df["languages"] = itunes_df["languages"].apply(lambda langs: str(langs).split(","))
itunes_df["genres"] = itunes_df["genres"].apply(lambda genres: str(genres).split(","))
# Rename some columns for brevity
itunes_df.rename(columns={"size": "size (MB)"}, inplace=True)
itunes_df.head(5)
# We can read in the google play CSV using the read_csv pandas function too
google_play_df = pd.read_csv("./google_play_dataset/googleplaystore.csv")
google_play_df.head(5)
Looking at this dataset at a glance, I think it's safe to remove the Current Ver, Android Ver, and Content Rating columns.
try:
    del google_play_df["Current Ver"]
    del google_play_df["Android Ver"]
    del google_play_df["Content Rating"]
except KeyError:
    print("Columns already deleted")
From here, I'm going to lowercase the column names, filter down to the GAME category, convert size to a proper float in megabytes, strip the extra characters from installs (so it's a flat install number), break genres up into lists, and convert last updated to a proper datetime.
# Helper functions

# Extract the digits from a number formatted with commas (e.g. "1,000,000+").
# We strip the commas first so the regex captures the full numerical amount.
def get_numeric(item):
    value = str(item).lower()
    value = value.replace(",", "")
    return re.search(r"\d+", value).group(0)
# Take the Size entries and convert them to a readable amount in megabytes
def convert_to_megabytes(item):
    value = str(item).upper()
    # Any value that involves "varies with device" becomes a NaN
    if value == "VARIES WITH DEVICE":
        return np.nan
    # Search for a numerical value followed by an M or a K (megabytes or kilobytes)
    size_regex = r"^(\d+(?:\.\d+)?)([MK])$"
    result = re.search(size_regex, value)
    if result.group(2) == "K":
        # Divide kilobyte values by 1000 to get megabytes
        return float(result.group(1)) / 1000
    return float(result.group(1))
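As an aside, the same size conversion can be done with pandas' vectorized string methods instead of a row-wise regex. A quick sketch on toy values (made up to mimic the formats in the dataset):

```python
import numpy as np
import pandas as pd

# Toy 'Size' column mimicking the Google Play formats: 'M', 'k', and the sentinel string
sizes = pd.Series(["19M", "512k", "Varies with device", "3.5M"])

# Normalize case, map the sentinel to NaN, then split number and unit
norm = sizes.str.upper().replace("VARIES WITH DEVICE", np.nan)
parts = norm.str.extract(r"^(\d+(?:\.\d+)?)([MK])$")
values = parts[0].astype(float)

# Kilobyte values get divided by 1000; megabyte values pass through
megabytes = values.where(parts[1] == "M", values / 1000)
print(megabytes.tolist())
```

Either approach works here; the helper-function version above is easier to read, which is why I stick with it.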
# Make the column names lowercase
renamed_cols = [item.lower() for item in google_play_df.columns.tolist()]
google_play_df.columns = renamed_cols
# Filter out the non game categories
google_play_df = google_play_df[google_play_df['category'] == "GAME"]
# Convert the last updated column to a proper datetime
google_play_df["last updated"] = pd.to_datetime(google_play_df["last updated"])
# Convert the size column to megabytes using the helper above
google_play_df["size"] = google_play_df["size"].apply(convert_to_megabytes)
# Break up the genres of each game into lists; the genres are semicolon-separated within each entry
google_play_df["genres"] = google_play_df["genres"].apply(lambda genres: str(genres).split(";"))
# Remove the plus from the installs column and extract the actual number of installs
google_play_df["installs"] = google_play_df["installs"].apply(lambda item: int(get_numeric(item)))
# Rename some columns for brevity
google_play_df.rename(columns={"size": "size (MB)"}, inplace=True)
# Display the data frame
google_play_df.head(10)
For this part of the project, we scrape data from separate pages of a website and consolidate it using a for loop. The same process could be applied to the iTunes store data, though there would be many more pages to parse.
For this demonstration, I will extract app sales data from four pages on the Think Gaming website and consolidate them into one dataset. This code grabs the most recent results from the site.
# The site can't serve all four pages in one request, so we make a separate query for each
pages = ["https://thinkgaming.com/app-sales-data/", "https://thinkgaming.com/app-sales-data/?page=2",
"https://thinkgaming.com/app-sales-data/?page=3", "https://thinkgaming.com/app-sales-data/?page=4"]
mobile_game_sales_df = None
# Iterate through each page, parse its table, and stack the results
for page in pages:
    query = requests.get(page)
    soup = BeautifulSoup(query.text, "html.parser")
    table = soup.find("table")
    df = pd.read_html(str(table))[0]
    # The first page's table seeds the dataframe; subsequent pages are
    # appended to it (every page's table has the same columns)
    if mobile_game_sales_df is None:
        mobile_game_sales_df = df
    else:
        mobile_game_sales_df = pd.concat([mobile_game_sales_df, df], ignore_index=True)
mobile_game_sales_df.head(10)
From here, we can drop the Free and Paid columns, since those pertain to rankings among free and paid games, and rename the columns for rank (i.e. "#"), revenue, and new installs.
try:
    del mobile_game_sales_df["Free"]
    del mobile_game_sales_df["Paid"]
except KeyError:
    print("Columns already deleted")
# Rename the # column, the Revenue column and the New Installs column
# Lowercase first
mobile_game_sales_df.rename(columns=str.lower, inplace=True)
# Rename columns
mobile_game_sales_df.rename(columns={"#": "rank", mobile_game_sales_df.columns[4]: "revenue", mobile_game_sales_df.columns[5]: "new installs"}, inplace=True)
# Convert the revenue column into a proper numeric value
mobile_game_sales_df["revenue"] = mobile_game_sales_df["revenue"].apply(lambda item: float(get_numeric(item)))
mobile_game_sales_df.head(5)
import matplotlib.pyplot as plt
from plotnine import *
from sklearn.linear_model import LinearRegression
from scipy import stats
itunes_df.shape
google_play_df.shape
mobile_game_sales_df.head(5)
# Combine the itunes dataframe and google dataframe into one dataframe
condensed_itunes_df = pd.DataFrame(itunes_df, columns=["name", "average user rating", "user rating count"])
condensed_google_df = pd.DataFrame(google_play_df, columns=["app", "rating", "reviews"])
# Rename the columns so it's easier to merge
condensed_google_df.rename(columns={"app": "name", "rating": "average user rating", "reviews": "user rating count"}, inplace=True)
# Stack the dataframes (DataFrame.append is deprecated, so use pd.concat)
condensed_df = pd.concat([condensed_itunes_df, condensed_google_df], ignore_index=True)
# Convert user rating count to a numeric value
condensed_df["user rating count"] = condensed_df["user rating count"].apply(lambda item: float(item))
condensed_df
# Plot histogram of ratings
condensed_df["average user rating"].plot.hist(bins=12, alpha=1)
plt.xlabel('Rating')
plt.title("Frequency of App Ratings")
plt.show()
# Plot histogram of user rating counts
condensed_df["user rating count"].plot.hist(bins=5, alpha=1)
plt.xlabel('Rating Counts')
plt.title("Frequency of App Rating Counts")
plt.show()
Between these two histograms, ratings for gaming apps are generally skewed left (good news for developers, since most ratings cluster at the high end), while the rating counts (i.e. the total number of people rating a given app) are skewed right. That right skew isn't as bad as it sounds: given the scale involved, the number of reviews still ranges from hundreds to millions of people.
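The skew described above can be quantified with `scipy.stats.skew`, where negative values indicate a left skew (long left tail) and positive values a right skew. A toy illustration with made-up data shaped roughly like the histograms:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up data: ratings clustered near the 4-5 ceiling,
# rating counts with a long right tail
ratings = np.clip(rng.normal(4.3, 0.6, 1000), 1, 5)
rating_counts = rng.lognormal(mean=8, sigma=2, size=1000)

print(f"ratings skew:       {stats.skew(ratings):.2f}")        # negative => skewed left
print(f"rating counts skew: {stats.skew(rating_counts):.2f}")  # positive => skewed right
```

Running the same calls on `condensed_df["average user rating"]` and `condensed_df["user rating count"]` (after dropping NaNs) gives the corresponding figures for the real data.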
The only reason I am combining the scraped revenue data with the iTunes dataset is that the revenue data pertains only to iTunes apps. One major caveat at the time of writing is that the user ratings and rating counts are from 2019, while the revenue and new install figures are current.
Grabbing historical revenue data for itunes apps, especially gaming apps, is incredibly difficult as a majority of that content is behind paywalls, so for the sake of this tutorial, I will just be walking through a basic analysis of this merged dataset.
condensed_sales_df = pd.DataFrame(mobile_game_sales_df, columns=["game", "revenue", "new installs"])
condensed_sales_df.rename(columns={"game": "name"}, inplace=True)
merged_df = pd.merge(condensed_sales_df, condensed_itunes_df, on='name')
merged_df
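Note that `pd.merge` defaults to an inner join, so any game whose name doesn't appear in both frames is silently dropped, which is part of why the merged result is so small. A toy illustration (the names and numbers here are made up):

```python
import pandas as pd

# Two frames sharing only one game name
left = pd.DataFrame({"name": ["Game A", "Game B"], "revenue": [100, 200]})
right = pd.DataFrame({"name": ["Game B", "Game C"], "average user rating": [4.5, 3.9]})

# Inner join (the default): only 'Game B' appears in both, so only it survives
merged = pd.merge(left, right, on="name")
print(merged)
```

Passing `how="outer"` would instead keep every game and fill the missing columns with NaN, at the cost of incomplete rows.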
Looking at the merged data, there unfortunately aren't enough rows to merit splitting into training, validation, and testing sets for a supervised machine learning model.
However, we can still fit a multiple linear regression on this data to see how statistics such as the number of new installs, user rating, or user rating count relate to the total revenue earned by a game.
# These are the parameters we're interested in
X = merged_df[['new installs', 'average user rating', 'user rating count']]
Y = merged_df['revenue']
# Fit the regression model with these terms
regr = LinearRegression()
regr.fit(X, Y)
# Print to verify
# print('Intercept: \n', regr.intercept_)
# print('Coefficients: \n', regr.coef_)
# Shorten the names for retrieving intercept and coefficient
intercept = regr.intercept_
coef = regr.coef_
# Print out the formula
print("y = {}x_1 + {}x_2 + {}x_3 + {}".format(coef[0], coef[1], coef[2], intercept))
print("x_1 is new installs\nx_2 is average user rating\nx_3 is user rating count")
Looking at these coefficients, average user rating has by far the largest coefficient in the model, followed by the number of new installs. Keep in mind that raw coefficient sizes reflect the scale of each feature and are not, by themselves, a test of statistical significance.
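sklearn's `LinearRegression` reports coefficients but not p-values, so actually testing significance takes a bit of extra work. Below is a sketch that computes classical OLS p-values by hand, run on synthetic data standing in for `merged_df` (the values are made up; with the real data you would build `X` from the dataframe's columns instead):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50
# Synthetic stand-in for merged_df's columns; revenue is driven mostly by installs
installs = rng.integers(1_000, 100_000, n).astype(float)
rating = rng.uniform(3.0, 5.0, n)
rating_count = rng.integers(100, 1_000_000, n).astype(float)
revenue = 0.5 * installs + rng.normal(0, 5_000, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), installs, rating, rating_count])
beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)

# Standard errors from the residual variance and the diagonal of (X'X)^-1
resid = revenue - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

# Two-sided t-test p-value for each coefficient
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
for name, p in zip(["intercept", "new installs", "avg rating", "rating count"], p_values):
    print(f"{name}: p = {p:.3g}")
```

On this synthetic data only the installs coefficient comes out significant, by construction; the point is just the mechanics of the test.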
# We can also analyze scatterplots for how user ratings affect revenue
(ggplot(merged_df, aes(x='average user rating', y='revenue', color = 'name'))
+ geom_point()
+ labs(title="Average User Rating vs Revenue",
x = "Average User Rating",
y = "Revenue")
)
# Same thing for rating counts...
(ggplot(merged_df, aes(x='user rating count', y='revenue', color = 'name'))
+ geom_point()
+ labs(title="User Rating Count vs Revenue",
x = "User Rating Count",
y = "Revenue")
)
# ... and for number of installs
(ggplot(merged_df, aes(x='new installs', y='revenue', color = 'name'))
+ geom_point()
+ labs(title="New Installs vs Revenue",
x = "New Installs",
y = "Revenue")
)
Looking at these plots, Clash of Clans appears to be the biggest outlier in the merged dataset.
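One quick way to confirm an outlier like that numerically is a z-score filter on revenue. A sketch on made-up data (the names and figures below are illustrative, not real sales numbers):

```python
import pandas as pd

# Illustrative revenue numbers; only 'Big Outlier Game' sits far from the rest
toy = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D", "Big Outlier Game"],
    "revenue": [10_000, 12_000, 9_500, 11_000, 500_000],
})

# Flag rows more than 1.5 standard deviations from the mean revenue
z = (toy["revenue"] - toy["revenue"].mean()).abs() / toy["revenue"].std()
outliers = toy[z > 1.5]
print(outliers)
```

Applying the same filter to `merged_df["revenue"]` would single out the top earners; the threshold is a judgment call, and with a dataset this small a single giant like Clash of Clans inflates the standard deviation itself.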