Know Your Data
1. The ArnetMiner citation dataset (provided by arnetminer.org), as of 2012, can be downloaded from
the attached file.
(1) Count the number of authors, venues (conferences/journals), and publications in the dataset.
(2) What are the min, max, Q1, Q3, and median number of publications per author? Plot a
histogram of the number of publications per author.
(3) What are the min, max, Q1, Q3, and median number of citations per author? Plot a
histogram of the number of citations received per author.
(4) Draw a scatter plot of the number of publications vs. the number of citations for
authors who have more than 5 publications.
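The summary statistics in parts (2) and (3) can be sketched as follows. This is a minimal illustration, not a reference solution: the `papers` list is a hypothetical toy input (one author list per publication), since the layout of the attached file is not described here, and the "inclusive" quartile method is only one of several common conventions.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list with at least two numbers."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical toy data: each inner list holds the authors of one publication.
papers = [["A", "B"], ["A"], ["A", "C"], ["B"], ["D"]]

# Count how many publications each author appears on.
pubs_per_author = Counter(author for authors in papers for author in authors)

summary = five_number_summary(list(pubs_per_author.values()))
print(summary)
```

The same `five_number_summary` helper would apply to citations per author once those counts are available, and the list of counts could then be handed to a plotting library for the histograms.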
Classification for Matrix Data
2. Decision Tree
Construct a decision tree for the following training data, where “Edible” is the class to
predict. Use information gain to select the splitting attribute at each node. Write down the major
steps of the construction process, showing the information gain of each candidate attribute whenever
a new node is created in the tree.
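As a sketch of the quantity being asked for, the information gain of an attribute is the entropy of the class labels minus the weighted entropy after splitting on that attribute. The rows and the attribute names `Color` and `Size` below are hypothetical and chosen only for illustration; they are not the training data of this question.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of `target` minus the weighted entropy after splitting on `attr`."""
    base = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder


# Hypothetical toy rows (NOT the assignment's table).
rows = [
    {"Color": "red", "Size": "small", "Edible": "yes"},
    {"Color": "red", "Size": "large", "Edible": "yes"},
    {"Color": "green", "Size": "small", "Edible": "no"},
    {"Color": "green", "Size": "large", "Edible": "no"},
]

gains = {a: information_gain(rows, a, "Edible") for a in ("Color", "Size")}
best = max(gains, key=gains.get)  # attribute chosen for the new node
print(gains, best)
```

At each new node the same computation would be repeated on the subset of rows that reach that node, using only the attributes not yet used on the path from the root.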
3. Naïve Bayes
Consider a Naïve Bayes model for spam classification with the vocabulary V = {secret, offer, low, price,
valued, customer, today, dollar, million, sports, is, for, play, healthy, pizza}. Each word in the
vocabulary is treated as a binary feature whose value is 1 if the word appears in a message and 0
otherwise. The messages and their labels are given in the following table:

Message                              Class label
Million dollar offer                 Spam
Secret offer today                   Spam
Secret is secret                     Spam
Low price for valued customer        non-spam
Play secret sports today             non-spam
Sports is healthy                    non-spam
Low price pizza                      non-spam

Give the MLEs for the following parameters:
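A minimal sketch of how such estimates could be computed from the seven messages listed above, assuming the binary (word present / word absent) feature model described in the question. The MLE of a class prior is the fraction of messages with that label, and the MLE of a word feature given a class is the fraction of messages in that class containing the word. The helper names `prior` and `word_given_class` and the two words queried at the end are illustrative choices, not necessarily the parameters this question asks for.

```python
from fractions import Fraction

# The seven labeled messages from the question, lower-cased.
messages = [
    ("million dollar offer", "spam"),
    ("secret offer today", "spam"),
    ("secret is secret", "spam"),
    ("low price for valued customer", "non-spam"),
    ("play secret sports today", "non-spam"),
    ("sports is healthy", "non-spam"),
    ("low price pizza", "non-spam"),
]


def prior(label):
    """MLE of P(class = label): fraction of messages with that label."""
    matching = sum(1 for _, y in messages if y == label)
    return Fraction(matching, len(messages))


def word_given_class(word, label):
    """MLE of P(word = 1 | class = label) under the binary feature model."""
    in_class = [set(text.split()) for text, y in messages if y == label]
    containing = sum(1 for words in in_class if word in words)
    return Fraction(containing, len(in_class))


print(prior("spam"))
print(word_given_class("secret", "spam"))
print(word_given_class("sports", "non-spam"))
```

Because each feature only records presence, "Secret is secret" counts once toward `secret` in the spam class. Exact fractions are used so the estimates can be compared directly with hand-computed counts.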








Jermaine Byrant
Nicole Johnson



