STATISTICS Data Science Practice
Calculators are permitted
There are 5 questions, with a total of 125 marks.
1. In the following R code
ms <- src(MonetDBLite::src_monetdblite("WORDS/DB"))
glove <- tbl(ms, "glove")
db_words <- copy_to(ms, current_words)
word_mat <- db_words %>%
  inner_join(glove, by = "word") %>%
(a) What does inner_join do?
(b) What does select(-word) do?
(c) At what point does the SQL query involving the inner join get run?
(d) Give an advantage and a disadvantage of working with data stored in a database
rather than in memory.
(20 marks total)
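
As an illustration of the lazy evaluation behind part (c), the following is a minimal sketch rather than the exam's own setup. It assumes an in-memory SQLite database (via the DBI, RSQLite and dbplyr packages) in place of MonetDBLite, and the toy tables `glove_df` and `current_words`, with columns `d1` and `d2`, are hypothetical.

```r
library(dplyr)

# Assumption: an in-memory SQLite database stands in for the MonetDBLite source.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data, not taken from the exam.
glove_df <- data.frame(word = c("cat", "dog", "fish"),
                       d1 = c(0.1, 0.2, 0.3),
                       d2 = c(1, 2, 3))
current_words <- data.frame(word = c("dog", "fish", "unicorn"))

glove <- copy_to(con, glove_df, "glove")
db_words <- copy_to(con, current_words, "current_words")

# Building the pipeline only constructs a query; no join is run here.
lazy_query <- db_words %>%
  inner_join(glove, by = "word") %>%
  select(-word)

# Show the SQL that dbplyr has generated so far.
show_query(lazy_query)

# collect() sends the query to the database and pulls the rows into memory.
word_mat <- lazy_query %>%
  collect() %>%
  as.matrix()

DBI::dbDisconnect(con)
```

In this sketch only the two words present in both tables survive the join, so `word_mat` has two rows and the two numeric columns.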
2. Random forests and Adaboost both predict using averages of trees, but the trees are built differently.
(a) Briefly describe the two algorithms: in particular, the differences in how the observations
and variables considered in training within each node are selected or weighted.
(b) Use the differences to explain:
(i) Why increasing the number of trees will not cause overfitting with random forests,
but may cause overfitting with Adaboost.
(ii) Why it is easier to take advantage of parallel computing for random forests than for Adaboost.
(iii) Why the individual trees in random forests are typically grown to full depth, but
those in Adaboost are typically shallow.
(30 marks total)
3. Consider the multilayer neural network described by the following R keras code
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3),
                activation = 'relu', input_shape = input_shape) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3,3),
                activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = num_classes, activation = 'softmax')
(a) The layer_conv_2d() function declares a convolutional layer. What is a
convolutional layer, and what do the arguments filters = 32, kernel_size = c(3,3) mean?
(b) How many trainable parameters does this layer have?
(c) What does layer_max_pooling_2d do, and what does pool_size mean?
(d) What does layer_dropout(rate = 0.5) do?
(e) What does layer_dense(units = 128, activation = 'relu') do?
(30 marks total)
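
The parameter count asked for in part (b) can be checked with a few lines of arithmetic. This is a sketch under a stated assumption: `input_shape` is not defined in the question, so a single input channel (for example greyscale images) is assumed here, and `conv2d_params` is a hypothetical helper, not a keras function.

```r
# Hypothetical helper: trainable parameters of a 2D convolutional layer.
# Each filter has one weight per kernel cell per input channel, plus one bias.
conv2d_params <- function(kernel_size, in_channels, filters) {
  (prod(kernel_size) * in_channels + 1) * filters
}

# First layer: 3x3 kernels, 32 filters, ASSUMING one input channel.
first_layer <- conv2d_params(c(3, 3), in_channels = 1, filters = 32)

# Second layer: its input has 32 channels, one per filter of the first layer.
second_layer <- conv2d_params(c(3, 3), in_channels = 32, filters = 64)

print(c(first = first_layer, second = second_layer))
```

Under the single-channel assumption this gives 320 parameters for the first layer and 18496 for the second; with three input channels the first count would instead be (9 * 3 + 1) * 32 = 896.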
4. Briefly describe at least one way regularization is accomplished in each of
(a) subset selection for linear regression
(b) random forests
(c) neural networks
(d) boosted trees
(20 marks total)
5. Last year, a Stanford University psychologist, Michal Kosinski, and colleagues published
a paper on neural network analysis of images scraped from a dating website. He found that in
this dataset the network could predict sexual orientation from one image with 81% accuracy
for men and 71% for women, and that this was better than the accuracy of untrained
high-speed human classification using workers on the Amazon Mechanical Turk website.
(a) How would the use of images from a dating website be expected to bias the estimates of accuracy?
(b) The accuracy figures given are for a sample that is 50% heterosexual and 50%
homosexual. Suppose that the sensitivity and specificity of the classifier in men are
both 0.8. What are the positive and negative predictive values for homosexuality in a
population that is 5% homosexual and 95% heterosexual?
(c) The researchers say that their aim was to publicise the risks of automated
identification of sexual orientation. Given this aim, briefly discuss the ethical
justification of the research with reference to the ethical principles of beneficence,
respect for persons, and justice.
(25 marks total)
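
The predictive values in question 5, part (b), follow from Bayes' rule applied to the stated sensitivity, specificity and prevalence. The sketch below shows one way to compute them; `predictive_values` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: positive and negative predictive values from
# sensitivity, specificity and prevalence, via Bayes' rule.
predictive_values <- function(sens, spec, prev) {
  tp <- sens * prev              # true positives, as a population fraction
  fp <- (1 - spec) * (1 - prev)  # false positives
  tn <- spec * (1 - prev)        # true negatives
  fn <- (1 - sens) * prev        # false negatives
  c(ppv = tp / (tp + fp), npv = tn / (tn + fn))
}

# Values from the question: sensitivity = specificity = 0.8, prevalence = 5%.
pv <- predictive_values(sens = 0.8, spec = 0.8, prev = 0.05)
print(round(pv, 3))
```

With these inputs the positive predictive value is 0.04 / 0.23, about 0.174, and the negative predictive value is 0.76 / 0.77, about 0.987, far from the roughly 80% accuracy seen in the balanced sample.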