
Page 1

FIT3152 Mock eExam with brief Answers/Marking Guide

R Coding: 10 Marks

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

The following R code is run:

Petal.cor <- as.data.frame(as.table(by(iris, iris[5], function(df) cor(df[3], df[4]))))

colnames(Petal.cor) <- c("Species", "Petal.cor")

Sepal.cor <- as.data.frame(as.table(by(iris, iris[5], function(df) cor(df[1], df[2]))))

colnames(Sepal.cor) <- c("Species", "Sepal.cor")

iris.cor <- merge(Sepal.cor, Petal.cor, by = "Species")

iris.cor[,2] = round(iris.cor[,2], digits = 3)

iris.cor[,3] = round(iris.cor[,3], digits = 3)

write.csv(iris.cor, file = "Iris.cor.csv", row.names=FALSE)

Describe the action and outputs of the R code.

Calculate the correlation of sepal length and width [1 Mark]

Calculate the correlation of petal length and width [1 Mark]

Rename and merge data frames [1 Mark]

Round the values [1 Mark]

Save as a csv file [1 Mark]

Describe the action of each function or purpose of each variable in the space provided.

as.data.frame: Coerce the previous output into a data frame [1 Mark]

merge: Merge data frames using a common column as an index [1 Mark]

by: Apply a function to a data frame split by factors [1 Mark]

df: Temporary data frame passed to the function [1 Mark]

round: Round the data to a given number of decimal places (or digits) [1 Mark]

(10 marks)
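For reference, the same split, correlate, merge and round pipeline can be sketched in plain Python (the exam uses R; this is only a stdlib check of the logic). The setosa rows below come from head(iris); the versicolor rows are invented for illustration, so the printed correlations are illustrative, not the full-data values.

```python
def pearson(xs, ys):
    # Pearson correlation, the default method of R's cor().
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rows = [  # (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
    (5.1, 3.5, 1.4, 0.2, "setosa"), (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"), (5.4, 3.9, 1.7, 0.4, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"), (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
]

# Equivalent of by(iris, iris[5], ...): split the rows by Species.
by_species = {}
for r in rows:
    by_species.setdefault(r[4], []).append(r)

# One correlation per group, merged on Species and rounded to 3 digits,
# mirroring the merge() and round() calls in the R code.
iris_cor = {
    sp: {"Sepal.cor": round(pearson([r[0] for r in g], [r[1] for r in g]), 3),
         "Petal.cor": round(pearson([r[2] for r in g], [r[3] for r in g]), 3)}
    for sp, g in by_species.items()
}
print(iris_cor)
```

The R version additionally writes the merged table to Iris.cor.csv; the equivalent here would be one csv.writer call per row of iris_cor.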

Page 2

Regression: 10 Marks

A subset of the ‘diamonds’ data set from the R package ‘ggplot2’ was created. The data set reports

price, size (carat) and quality (cut, color and clarity) information as well as specific measurements (x, y

and z). The first 6 rows are printed below.

> head(dsmall)

carat cut color clarity depth table price x y z

46434 0.59 Very Good H VVS2 61.1 57 1771 5.39 5.48 3.32

35613 0.30 Good I VS1 63.3 59 473 4.20 4.23 2.67

43173 0.42 Premium F IF 62.2 56 1389 4.85 4.80 3.00

11200 0.95 Ideal H SI1 61.9 56 4958 6.31 6.35 3.92

37189 0.32 Premium D VVS1 62.0 60 973 4.40 4.37 2.72

45569 0.52 Premium E VS2 60.7 58 1689 5.17 5.21 3.15

The least squares regression of log(price) on log(size) and color is given below. Note that ‘log’ in this context means the natural logarithm, log_e(x). Based on this output, answer the following questions.

> library(ggplot2)

> set.seed(9999) # Random seed

> dsmall <- diamonds[sample(nrow(diamonds), 1000), ] # sample of 1000 rows

> attach(dsmall)

> contrasts(color) = contr.treatment(7)

> d.fit <- lm(log(price) ~ log(carat) + color)

> d.fit

> summary(d.fit)

Call:

lm(formula = log(price) ~ log(carat) + color)

Residuals:

Min 1Q Median 3Q Max

-0.97535 -0.16001 0.01106 0.15500 0.99937

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 8.61356 0.02289 376.259 < 2e-16 ***

log(carat) 1.74075 0.01365 127.529 < 2e-16 ***

color2 -0.06717 0.02833 -2.371 0.0179 *

color3 -0.05469 0.02783 -1.965 0.0496 *

color4 -0.07139 0.02770 -2.578 0.0101 *

color5 -0.21255 0.02973 -7.148 1.7e-12 ***

color6 -0.32995 0.03175 -10.393 < 2e-16 ***

color7 -0.50842 0.04563 -11.143 < 2e-16 ***

---

Residual standard error: 0.2393 on 992 degrees of freedom

Multiple R-squared: 0.9446, Adjusted R-squared: 0.9443

F-statistic: 2418 on 7 and 992 DF, p-value: < 2.2e-16

> contrasts(color)

2 3 4 5 6 7

D 0 0 0 0 0 0

E 1 0 0 0 0 0

F 0 1 0 0 0 0

G 0 0 1 0 0 0

H 0 0 0 1 0 0

I 0 0 0 0 1 0

J 0 0 0 0 0 1

Page 3

(a) Write down the regression equation predicting log(price) as a function of size and color.

log(price) = 1.74 * log(carat) + 8.61 + color(i),

where color(i) is the treatment-contrast coefficient for colour i in {D, E, F, G, H, I, J}, with D as the baseline (coefficient 0) [1 Mark]

(b) Explain the different data types present in the variables: carat and color. What is the effect of

this difference on the regression equation?

carat is a numerical variable (treated as a number) [1 Mark]

color is a factor – it is included in the regression equation

as a contrast whereby each level is estimated individually.

[1 Mark]

(c) What is the predicted price for a diamond of 1 carat of color H?

log(price) = 1.74 * log(carat) + 8.61 + color(i)

log(price) = 1.74 * log(1) + 8.61 - 0.21

log(price) = 1.74 * 0 + 8.61 - 0.21

log(price) = 8.61 - 0.21 = 8.40

price = e^8.40 = $4447.06 [1 Mark]
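A quick numeric check of this calculation (Python here, since it is plain arithmetic on the coefficients from the summary output; colour H is treatment level 5, so its coefficient is -0.21255):

```python
from math import exp, log

# Coefficients taken from the summary(d.fit) output above.
intercept, slope, color_h = 8.61356, 1.74075, -0.21255

log_price = slope * log(1) + intercept + color_h   # log(1) = 0
price = exp(round(log_price, 2))                   # the guide rounds to 8.40 first
print(round(log_price, 2), round(price, 2))        # → 8.4 4447.07
```

The guide reports $4447.06; the cent-level difference is just rounding. Without rounding log(price) first, exp(8.40101) gives roughly $4451.6, so small rounding choices move the prediction by a few dollars.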

(d) Which colour diamonds can be reliably assumed to have the highest value? Explain your

reasoning. How sure can you be?

Color D diamonds have the highest value since the coefficient

for this factor is 0 and all the others are negative. [1 Mark]

For certainty, use the significance of the regression equation overall (***), i.e. better than 0.0001 [1 Mark]

(e) Which colour diamonds have the lowest value? How reliable is the evidence? Explain your

reasoning.

Color J diamonds have lowest value (coeff = -0.51) [1 Mark]

Significance better than 0.0001 [1 Mark]

(f) Comment on the reliability of the model as a whole giving reasons.

Reliability of model is high overall:

Multiple R-squared = 0.94,

p-value very small,

median residual close to 0. [1 Mark each up to 2 Marks]

Page 4

Networks: 10 Marks

The social network of a group of friends (numbered from 1 – 7) is drawn below.

(a) Calculate the betweenness centrality for nodes 4 and 6.

Node(4) betweenness = 11 [1 Mark]

(It is in the following geodesics: 1-5, 1-6, 1-7, 2-5, 2-6, 2-7, 3-5, 3-6, 3-7, 1-3, 2-3.)

Node(6) betweenness = 8 [1 Mark]

(It is in the following: 1-5, 1-7, 2-5, 2-7, 3-5, 3-7, 4-5, 4-7.)

(b) Calculate the closeness centrality for nodes 4 and 6.

Node(4) closeness = 1/9 [1 Mark] (Since sum of shortest paths

to others = 2 + 1 + 1 + 1 + 2 + 2.)

Node(6) closeness = 1/10 [1 Mark] (3 + 2 + 2 + 1 + 1 + 1)

(c) Calculate the degree of nodes 4 and 6.

|4| = 3 [1 Mark]

|6| = 3 [1 Mark]

(d) Giving reasons based on your results in Parts a – c, which node is most central in the

network?

Node(4) is most central [1 Mark] It has the greatest

betweenness centrality and closeness centrality [1 Mark]

(e) Write down the adjacency matrix for the network.

1 2 3 4 5 6 7

1 0 1 0 0 0 0 0

2 1 0 0 1 0 0 0

3 0 0 0 1 0 0 0

4 0 1 1 0 0 1 0

5 0 0 0 0 0 1 1

6 0 0 0 1 1 0 1

7 0 0 0 0 1 1 0

Correct form [1 Mark], Correct values [1 Mark]
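The closeness and degree figures can be checked mechanically. A short Python sketch (the edge list is read off the adjacency matrix above) computes the shortest-path sums and degrees with breadth-first search:

```python
from collections import deque

# Edges read from the adjacency matrix in part (e).
edges = [(1, 2), (2, 4), (3, 4), (4, 6), (5, 6), (5, 7), (6, 7)]
adj = {n: set() for n in range(1, 8)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def dist_sum(src):
    # BFS gives shortest-path lengths in an unweighted graph;
    # closeness(src) = 1 / dist_sum(src).
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sum(dist.values())

print(dist_sum(4), dist_sum(6))   # → 9 10  (so closeness 1/9 and 1/10)
print(len(adj[4]), len(adj[6]))   # → 3 3   (degrees)
```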

[Figure: network diagram of friends 1 – 7, consistent with the adjacency matrix in part (e).]

Page 5

Naïve Bayes: 4 Marks

(a) Use the data below and Naïve Bayes classification to predict whether the following test instance

will be happy or not.

Test instance: (Age Range = young, Occupation = professor, Gender = F, Happy = ? )


p(Happy = yes) 0.5

p(Happy = no) 0.5 [1 Mark]

YES: P(yes) = 0.5, P(young|yes) = 0.250, P(professor|yes) = 0.250, P(F|yes) = 0.500; P(Cj) × P(A1|Cj) × … × P(An|Cj) = 0.016

NO: P(no) = 0.5, P(young|no) = 0.250, P(professor|no) = 0.250, P(F|no) = 0.750; P(Cj) × P(A1|Cj) × … × P(An|Cj) = 0.023

Correct calculations [1 Mark]

So classify as Happy = No [1 Mark or H]

(b) Use the complete Naïve Bayes formula to evaluate the confidence of predicting Happy = yes,

based on the same attributes as the previous question: (Age Range = young, Occupation = professor,

Gender = F).

Numerator: P(yes) × P(young|yes) × P(professor|yes) × P(F|yes) = 0.5 × 0.250 × 0.250 × 0.500 = 0.016

Denominator: P(young) × P(professor) × P(F) = 0.250 × 0.250 × 0.625 = 0.039

So p(yes|attributes) = 0.016/0.039 = 0.41 [1 Mark or H]

ID Age Range Occupation Gender Happy

1 Young Tutor F Yes

2 Middle-aged Professor F No

3 Old Tutor M Yes

4 Middle-aged Professor M Yes

5 Old Tutor F Yes

6 Young Lecturer M No

7 Middle-aged Lecturer F No

8 Old Tutor F No
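The probabilities above can be verified directly from the 8-row table. A Python check (stdlib only):

```python
# Rows of the table: (Age Range, Occupation, Gender, Happy).
data = [
    ("Young", "Tutor", "F", "Yes"), ("Middle-aged", "Professor", "F", "No"),
    ("Old", "Tutor", "M", "Yes"), ("Middle-aged", "Professor", "M", "Yes"),
    ("Old", "Tutor", "F", "Yes"), ("Young", "Lecturer", "M", "No"),
    ("Middle-aged", "Lecturer", "F", "No"), ("Old", "Tutor", "F", "No"),
]

def p(cond, rows):
    # Fraction of rows satisfying the predicate.
    return sum(cond(r) for r in rows) / len(rows)

yes = [r for r in data if r[3] == "Yes"]
no = [r for r in data if r[3] == "No"]

# P(C) * P(young|C) * P(professor|C) * P(F|C) for each class C.
score_yes = 0.5 * p(lambda r: r[0] == "Young", yes) \
                * p(lambda r: r[1] == "Professor", yes) \
                * p(lambda r: r[2] == "F", yes)
score_no = 0.5 * p(lambda r: r[0] == "Young", no) \
               * p(lambda r: r[1] == "Professor", no) \
               * p(lambda r: r[2] == "F", no)

# Full-formula denominator for part (b): P(young) * P(professor) * P(F).
denom = p(lambda r: r[0] == "Young", data) \
      * p(lambda r: r[1] == "Professor", data) \
      * p(lambda r: r[2] == "F", data)

print(round(score_yes, 3), round(score_no, 3), round(score_yes / denom, 2))
# → 0.016 0.023 0.4
```

The unrounded posterior is exactly 0.40; the guide's 0.41 comes from dividing the already-rounded 0.016 by 0.039.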

Page 6

Visualisation: 6 Marks

A World Health study is examining how life expectancy varies between men and women in different

countries and at different times in history. The table below shows a sample of the data that has been

recorded. There are approximately 15,000 records in all.

Country Year of Birth Gender Age at Death

Australia 1818 M 9

Afghanistan 1944 F 40

USA 1846 F 12

India 1926 F 6

China 1860 F 32

India 1868 M 54

Australia 1900 F 37

China 1875 F 75

England 1807 M 15

France 1933 M 52

Egypt 1836 M 19

USA 1906 M 58

Using one of the graphic types from the Visualization Zoo (see formulae and references for a list of

types) suggest a suitable graphic to help the researcher display as many variables as clearly as

possible.

Explain your decision. Which graph elements correspond to the variables you want to display?

Appropriate main graphic [1 Mark] with explanation. [1 Mark]

Mapping of variables to attributes in the graphic and/or data

reduction (summary) as appropriate with explanation. [1 Mark

each up to 4 Marks]

For example, one approach would be a heat map with time

intervals on the x axis (perhaps every 10 or 50 years

depending on range) and continents or countries on the y axis

(depending on how many countries there are). Each cell could

then be coloured for average age of death. You could either

have two heat maps (male/female) or interleave cells so that

m/f for each time period were adjacent.

Page 7

Decision Trees: 10 Marks

Eight university staff completed a questionnaire on happiness. The results are given below.

A decision tree was generated from the data.

(a) Using the decision tree generated from the data provided, assuming a required

confidence level greater than 60% to classify as ‘Happy’, what is the predicted

classification for the following instances:

Instance 1: (Age Range = Young, Occupation = Professor, Gender = F, Happy = ? )

Instance 2: (Age Range = Old, Occupation = Professor, Gender = F, Happy = ? )

Instance 1: Happy = No, because confidence for Happy = Yes is

50%, which is less than required confidence level. [1 Mark]

Instance 2: Happy = Yes, because confidence for Happy = Yes is

66.67%, which is greater than required confidence level. [1

Mark]

(b) Is it possible to generate a 100% accurate decision tree using this data? Explain your answer.

Instances 5 and 8 have identical decision attributes but belong to different classes (Old, Tutor, F = Yes; Old, Tutor, F = No), so no: a 100% accurate decision tree cannot be generated from this data. (Or equivalent) [1 Mark]

ID Age Range Occupation Gender Happy

1 Young Tutor F Yes

2 Middle-aged Professor F No

3 Old Tutor M Yes

4 Middle-aged Professor M Yes

5 Old Tutor F Yes

6 Young Lecturer M No

7 Middle-aged Lecturer F No

8 Old Tutor F No

Page 8

(c) Explain how the concept of entropy is used in some decision tree algorithms.

Information gain is used in the ID3 algorithm to determine

which attribute to split on. Information gain calculates the

reduction in entropy when splitting on a specific attribute

and chooses the attribute which gives the greatest reduction

in entropy or greatest information gain. (Or something

similar) [1 Mark]

(d) Do you think entropy was used to generate the decision tree above? Explain your answer.

The Occupation attribute appears more homogeneous in terms of

the class attribute Happy than the Age attribute. (Or Similar)

[1 Mark] Therefore, no, entropy was not used. (or similar) [1

Mark]

(e) What is the entropy of “Happy”?

50:50 Yes:No = 1 by inspection. [1 Mark]

(f) What is information gain after the first node of the decision tree (Age Range) has been

introduced?

E(Young) = 1; E(Middle-aged) = E(Old) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.918 [1 Mark]

Gain(Happy, Age) = E(Happy) - ((2/8) × 1 + (3/8) × 0.918 + (3/8) × 0.918)

Gain(Happy, Age) = 1 - 0.939 = 0.061 [1 Mark]
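A numeric check of the entropy and information-gain figures, with the partition counts read from the table (Young = 1 Yes / 1 No, Middle-aged = 1 Yes / 2 No, Old = 2 Yes / 1 No):

```python
from math import log2

def entropy(counts):
    # Shannon entropy of a class distribution given as raw counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

h_happy = entropy([4, 4])   # Happy is 4 Yes, 4 No

# Age Range partitions: (subset size, [Yes count, No count]).
splits = [(2, [1, 1]), (3, [1, 2]), (3, [2, 1])]
remainder = sum(n / 8 * entropy(c) for n, c in splits)
gain = h_happy - remainder
print(round(h_happy, 3), round(gain, 3))   # → 1.0 0.061
```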

(g) Explain why some decision tree algorithms are referred to as greedy algorithms.

Decision tree algorithms always choose the best option to branch on at each step without taking future choices into account. They are never able to backtrack in order to improve the final solution. [1 Mark]

Page 9

ROC and Lift: 10 Marks

The following table shows the outcome of a classification model for customer data. The table

lists customers by code and provides the following information: The model confidence of a

customer buying/not buying a new product (confidence-buy); whether in fact the customer did

or did not buy the product (buy = 1 if the customer purchased the product, buy = 0 if the customer did not).

customer confidence-buy buy-not-buy 20%+ 80%+

c1 0.9 1 1 1

c2 0.8 1 1 1

c3 0.7 0 1 0

c4 0.7 1 1 0

c5 0.6 1 1 0

c6 0.5 1 1 0

c7 0.4 0 1 0

c8 0.4 1 1 0

c9 0.2 0 1 0

c10 0.1 0 0 0

(a) Calculate the True Positive Rate and the False Positive Rate when a confidence

level of 20% is required for a positive classification.

TP = 6, FP = 3, TN = 1, FN = 0. All correct 1 Mark

TPR = 6/(6+0) = 1, FPR = 3/(3+1) = 0.75. All correct 1 Mark

(b) Calculate the True Positive Rate and the False Positive Rate when a confidence level

of 80% is required for a positive classification.

TP = 2, FP = 0, TN = 4, FN = 4. All correct 1 Mark

TPR = 2/(2+4) = 0.33, FPR = 0/(0+4) = 0. All correct 1 Mark

(c) The ROC chart for the previous question is shown below. Comment on the quality of

the model overall. Give a single measure of classifier performance.

AUC (exact) = 0.83; accept 0.7 – 0.9. [1 Mark]

Classifier is good. [1 Mark]

(d) What is the lift value if you target the top 40% of customers that the classifier is most

confident of?

P(true) = 6/10, for top 40% P(true) = 3/4 [1 Mark]

Lift = (3/4) / (6/10) = 1.25 [1 Mark or H]
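All three calculations, parts (a), (b) and (d), can be reproduced from the customer table:

```python
# Confidence-buy and buy columns from the table, already sorted by confidence.
conf = [0.9, 0.8, 0.7, 0.7, 0.6, 0.5, 0.4, 0.4, 0.2, 0.1]
buy  = [1,   1,   0,   1,   1,   1,   0,   1,   0,   0]

def rates(threshold):
    # Classify positive when confidence >= threshold; return (TPR, FPR).
    pred = [c >= threshold for c in conf]
    tp = sum(p and b for p, b in zip(pred, buy))
    fp = sum(p and not b for p, b in zip(pred, buy))
    fn = sum(not p and b for p, b in zip(pred, buy))
    tn = sum(not p and not b for p, b in zip(pred, buy))
    return tp / (tp + fn), fp / (fp + tn)

print(rates(0.2))   # 20% threshold → (1.0, 0.75)
print(rates(0.8))   # 80% threshold: TPR ≈ 0.33, FPR = 0.0

# Lift for the top 40% most-confident customers (first 4 rows).
top = buy[:4]
lift = (sum(top) / len(top)) / (sum(buy) / len(buy))
print(lift)   # → 1.25
```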

Page 10

(e) Explain what the value of lift means in the previous question.

Lift is the increase in the response rate over random selection [1 Mark] obtained by choosing those you are most confident of. [1 Mark]

Clustering: 10 Marks

(a) What does the ‘k’ refer to in k-means clustering? Who/what determines the value of k?

k is the number of clusters. [1 Mark] This is pre-defined by the user before running the algorithm. [1 Mark]

(b) Describe the steps involved with k-means clustering.

1. Define the number of clusters required, k. [1 Mark]

2. Declare k centroids. [1 Mark]

3. Assign each data point to the closest centroid; 4. Re-

calculate the centroids. [1 Mark]

5. Repeat 3 and 4 until the cluster centroids do not change.

[1 Mark]
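The five steps above can be sketched in a few lines of Python. This is a deliberately minimal illustration (1-D data and hand-picked initial centroids for determinism), not a course implementation:

```python
points = [1.0, 2.0, 10.0, 11.0]   # toy 1-D data
centroids = [1.0, 10.0]           # steps 1-2: k = 2 centroids declared

while True:
    # Step 3: assign each point to its closest centroid.
    clusters = [[], []]
    for x in points:
        idx = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[idx].append(x)
    # Step 4: recalculate the centroids as cluster means.
    new = [sum(c) / len(c) for c in clusters]
    # Step 5: repeat until the centroids no longer change.
    if new == centroids:
        break
    centroids = new

print(centroids, clusters)   # → [1.5, 10.5] [[1.0, 2.0], [10.0, 11.0]]
```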

(c) Are clustering algorithms supervised or unsupervised learning algorithms? Explain.

They are unsupervised algorithms designed to find groupings of

similar instances. [1 Mark]

Unlike classification, there is no ‘class’ attribute that can

be used to help determine the clusters. [1 Mark]

(d) Is the k-means clustering algorithm a partitional or hierarchical clustering algorithm? Explain

your answer.

k-Means is partitional. [1 Mark] There is no hierarchy from

which clusters can be chosen. The number of clusters cannot be

changed, once set. [1 Mark]

Page 11

Text Analytics: 10 Marks

(a) Explain what is meant by the ‘bag of words’ approach to text mining.

Each document in the collection is assumed to be just a set of

words and it is the entire collection of words that is used in

the analysis. [1 Mark]

The semantics or meaning of the text in the documents is not

considered in the ‘bag of words’ approach. [1 Mark]

(b) What is the main problem associated with the bag of words approach? Provide an example.

The main problem is that semantics are not considered and two

documents that mean quite different things, but contain the

same words, will be considered to be similar. [1 Mark]

Example:

• while licking their ice creams, the children chased the dog

• the dog chased the children and licked their ice creams

[1 Mark]

(c) Describe an application where text mining could be used, giving an example of how it would

be applied.

Grouping articles by similar content (or similar). [1 Mark]

For example, job applications, tweets, emails etc. [1 Mark]

(d) Apply the five main steps required to pre-process text documents for analysis to the corpus

below. Write your processed documents in the space provided.

Doc1 = { The choir sang loudly. }

Doc2 = { The boys were singing in church. }

Doc3 = { The boy asked to sing a song. }

(choir, sing-, loud-)

(boy, sing-, church) [Tokenise and stop words 1 Mark]

(boy, ask, sing- song). [Stemming and overall format 1 Mark]

(e) Construct the term document frequency matrix for the processed text documents above. [2

Marks].

Matrix correct format: words = cols, docs = rows [1 Mark]

Indicators are correct [1 Mark or H]

ask boy choir church loud- sing- song

Doc 1 0 0 1 0 1 1 0

Doc 2 0 1 0 1 0 1 0

Doc 3 1 1 0 0 0 1 1
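Building the matrix from the processed tokens is mechanical; a short Python sketch:

```python
# Processed documents from part (d).
docs = {
    "Doc 1": ["choir", "sing-", "loud-"],
    "Doc 2": ["boy", "sing-", "church"],
    "Doc 3": ["boy", "ask", "sing-", "song"],
}

# Vocabulary = sorted set of all terms; one row of counts per document.
vocab = sorted({t for tokens in docs.values() for t in tokens})
matrix = {d: [tokens.count(t) for t in vocab] for d, tokens in docs.items()}

print(vocab)
for d, row in matrix.items():
    print(d, row)
```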

Page 12

Ensemble Methods 4 Marks

(a) Describe the main similarities of the three ensemble classifiers (bagging, boosting and random

forests) studied.

Create multiple data sets by resampling or cloning [1 Mark]

Build multiple classifiers [1 Mark] Combine the classifiers (average or vote) [1 Mark up to a total of 2]
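The shared recipe (resample, build many classifiers, combine by vote) can be sketched as a bagging-style ensemble. The threshold "stump" classifier and toy data below are invented purely for illustration:

```python
import random

random.seed(1)
# Toy labelled data: x in 0..10, label = 1 when x > 5.
data = [(x, int(x > 5)) for x in range(11)]

def train_stump(sample):
    # A "classifier" here is just the threshold on x with fewest errors.
    best = min(range(11), key=lambda t: sum((x > t) != y for x, y in sample))
    return lambda x, t=best: int(x > t)

# Step 1: multiple data sets by resampling (bootstrap, sampling with
# replacement); step 2: one classifier per resample.
models = [train_stump(random.choices(data, k=len(data))) for _ in range(9)]

# Step 3: combine the classifiers by majority vote.
def predict(x):
    votes = sum(m(x) for m in models)
    return int(votes > len(models) / 2)

print(predict(0), predict(10))   # → 0 1
```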

(b) How do boosting and random forests differ from bagging?

Boosting re-weights instances to favour hard-to-classify cases. [1 Mark] Random Forests also varies the attributes used in each sample. [1 Mark]

Dirty Data 6 Marks

The table below is an extract from the list of books in the British Library. Identify the instances of

dirty data present, stating the way in which the data is dirty.

Most of these are instances of incorrect data, although many records are also incomplete. [1 Mark each up to maximum 6]

1 = incorrect/duplicate (has publisher and place in the same cell), 2 = incorrect/duplicate etc., 3 = incorrect/inaccurate (uses an abbreviation for “Oxford”), 4 = incorrect/inaccurate etc.

Page 13

Formulas and references
