Week 11 exercises

MLPR: w11q, Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/

This is the eighth and last page of assessed questions, as described in the background notes. These questions form 70% of your mark for Week 11. The introductory questions in the notes and the Week 11 discussion group task form the remaining 30% of your mark for Week 11.

Unlike the questions in the notes, you'll not immediately see any example answers on this page. However, you can edit and resubmit your answers as many times as you like until the deadline (Friday 04 December 4pm UK time, UTC). This is a hard deadline: this course does not permit extensions, and any work submitted after the deadline will receive a mark of zero. See the late work policy.

Queries: Please don't discuss/query the assessed questions on Hypothesis until after the deadline. If you think there is a mistake in a question this week, please email Iain.

Please only answer what's asked. Markers will reward succinct, to-the-point answers. You can put any other observations under the "Add any extra notes" button (but this is for your own record, or to point out things that seemed strange, not to get extra credit). Some questions ask for discussion, and so are open-ended and probably have no perfect answer. For these, stay within the stated word limits, and limit the amount of time you spend on them (they are a small part of your final mark).

Feedback: We'll return feedback on your submission via email by Friday 11 December.

Good Scholarly Practice: Please remember the University requirements for all assessed work for credit. Furthermore, you are required to take reasonable measures to protect your assessed work from unauthorised access. For example, if you put any such work on a public repository then you must set access permissions appropriately (permitting access only to yourself). You may not publish your solutions after the deadline either.

1 The Laplace approximation

In note w10b, the second question was about inferring the parameter λ of a Poisson distribution based on an observed count r. You found the Laplace approximation to the posterior over λ given r. Now reparameterize the model in terms of ℓ = log λ. After performing the change of variables¹, the improper prior becomes uniform over ℓ, that is, p(ℓ) is constant.

a) Find the Laplace approximation to the posterior over ℓ = log λ. Include the derivation of your result in your answer. [10 marks]

[The website version of this note has a question here.]

b) Which version of the Laplace approximation is better? To help answer this question, we recommend that you plot the true and approximate posteriors of λ and ℓ for different values of the integer count r (an illustrative plotting sketch follows the footnote below). However, we don't want your code or plots. Describe the relevant findings in no more than 150 words. [15 marks]

[The website version of this note has a question here.]

2 Classification with unbalanced data

Classification tasks often involve rare outcomes, for example predicting click-through events, fraud detection, and disease screening. We'll restrict ourselves to binary classification, y ∈ {0, 1}, where the positive class is rare: P(y=1) is small.

¹ Some review of how probability densities work: conservation of probability mass means that p(ℓ) dℓ = p(λ) dλ for small corresponding elements dℓ and dλ. Dividing and taking limits: p(ℓ) = p(λ) |dλ/dℓ|, evaluated at λ = exp(ℓ). The size of the derivative |dλ/dℓ| is referred to as a Jacobian term.
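As an illustrative aside (not part of the assessed questions): the footnote's Jacobian rule can be checked against the claim that p(ℓ) is constant. Assuming the w10b prior was the improper p(λ) ∝ 1/λ (an assumption here, but the only prior consistent with that claim), the change of variables gives

\[
p(\ell) \;=\; p(\lambda)\left|\frac{\mathrm{d}\lambda}{\mathrm{d}\ell}\right|
        \;=\; p(\lambda)\, e^{\ell}
        \;=\; p(\lambda)\,\lambda
        \;\propto\; \frac{1}{\lambda}\cdot\lambda \;=\; 1,
\qquad \text{with } \lambda = e^{\ell},
\]

so p(ℓ) is indeed constant. The same rule, multiplying by λ = e^ℓ, converts any density over λ (for example the true posterior) into the corresponding density over ℓ.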
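The following is a minimal numerical sketch of the comparison recommended in question 1 b), offered only as an illustration. It assumes the improper p(λ) ∝ 1/λ prior discussed above (so the true posterior over λ is a Gamma(r, 1) density), and it finds each Laplace fit numerically rather than in closed form, so it is no substitute for the derivation asked for in part a). The function names and constants (for example laplace_fit and r = 5) are ad-hoc choices for this sketch.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import optimize, stats

    r = 5  # observed count; try several integer values (r >= 2 keeps the lambda-mode interior)

    # Unnormalized log-posteriors, assuming the improper prior p(lambda) ~ 1/lambda,
    # i.e. a uniform (constant) prior over ell = log(lambda).
    def log_post_lam(lam):
        return (r - 1) * np.log(lam) - lam

    def log_post_ell(ell):
        return r * ell - np.exp(ell)

    def laplace_fit(log_post, lo, hi):
        """Gaussian (Laplace) fit: numerically locate the mode and measure the curvature there."""
        res = optimize.minimize_scalar(lambda x: -log_post(x), bounds=(lo, hi), method='bounded')
        mode, h = res.x, 1e-3
        curv = (log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h**2
        return mode, np.sqrt(-1.0 / curv)  # mean and standard deviation of the Gaussian fit

    mu_lam, sd_lam = laplace_fit(log_post_lam, 1e-6, 10.0 * r + 10.0)
    mu_ell, sd_ell = laplace_fit(log_post_ell, -10.0, np.log(10.0 * r + 10.0))

    lam = np.linspace(0.01, 3.0 * r, 400)
    ell = np.log(lam)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].plot(lam, stats.gamma.pdf(lam, a=r), label='true posterior')
    axes[0].plot(lam, stats.norm.pdf(lam, mu_lam, sd_lam), '--', label='Laplace in lambda')
    axes[0].set_xlabel('lambda'); axes[0].legend()

    # True density over ell: multiply the lambda-density by the Jacobian |dlambda/dell| = lambda.
    axes[1].plot(ell, stats.gamma.pdf(lam, a=r) * lam, label='true posterior')
    axes[1].plot(ell, stats.norm.pdf(ell, mu_ell, sd_ell), '--', label='Laplace in ell')
    axes[1].set_xlabel('ell = log(lambda)'); axes[1].legend()
    plt.tight_layout(); plt.show()

Running this for several values of r shows how the quality of the two fits changes with r, which is what part b) asks you to describe in your own words.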
We are likely to see lots of examples before observing enough of the rare y = 1 events to train a good model. To save resources, it's common to only keep a small fraction f of the y = 0 'negative examples'. A classifier trained naively on this sub-sampled data would predict positive labels more frequently than they actually occur. For Bayes classifiers we could set the class probabilities based on the original class counts (or domain knowledge). This question explores what to do for logistic regression.

We write that an input-output pair occurs in the world with probability P(x, y), and that the joint probability of an input-output pair (x, y) and keeping it (k) is P(x, y, k). Because conditional probabilities are proportional to joint probabilities, we can write:

\[
P(y \mid x, k) \;\propto\; P(x, y \mid k) \;\propto\; P(x, y, k) =
\begin{cases}
P(x, y) & y = 1 \\
f\,P(x, y) & y = 0.
\end{cases}
\tag{1}
\]

A general result that you may quote in answering this question is that for any binary probability distribution

\[
P(z) \;\propto\;
\begin{cases}
c & z = 1 \\
d & z = 0,
\end{cases}
\tag{2}
\]

we can write P(z=1) = σ(log c − log d), where σ(a) = 1/(1 + exp(−a)).

a) We train a logistic regression classifier, with a bias weight, to match the subsampled data, so that P(y=1 | x, k) ≈ σ(w⊤x + b).

i) Use the above results to argue that we should add log f to the bias parameter to get a model for the real-world distribution P(y | x) ∝ P(y, x). [15 marks]

[The website version of this note has a question here.]

ii) Explain whether we have changed the bias in the direction that you would expect. Write no more than one sentence. [5 marks]

[The website version of this note has a question here.]

b) Consider the following approach. We may wish to minimize the loss

\[
L = \mathbb{E}_{P(x, y)}\bigl[-\log P(y \mid x)\bigr].
\]

Multiplying the integrand by 1 = P(x, y | k)/P(x, y | k) changes nothing, so we can write:

\[
L \;=\; \int \sum_y P(x, y \mid k)\,\frac{P(x, y)}{P(x, y \mid k)}\,\bigl(-\log P(y \mid x)\bigr)\,\mathrm{d}x
  \;=\; \mathbb{E}_{P(x, y \mid k)}\!\left[\frac{P(x, y)}{P(x, y \mid k)}\,\bigl(-\log P(y \mid x)\bigr)\right]
  \;\approx\; -\frac{1}{N}\sum_{n=1}^{N} \frac{P(x^{(n)}, y^{(n)})}{P(x^{(n)}, y^{(n)} \mid k)}\,\log P\bigl(y^{(n)} \mid x^{(n)}\bigr),
\]

where the (x^(n), y^(n)) come from the subsampled data. This manipulation is a special case of a trick known as importance sampling (see note w11a). We have converted an expectation under the original data distribution into an expectation under the subsampling distribution. We then replaced the formal expectation with an average over the subsampled data. We can use the same idea to justify multiplying the gradients for y = 0 examples by 1/f when training a logistic regression classifier on subsampled data.

P(x, y | k) was defined up to a constant. Thus the "importance weights" applied to each log probability are:

\[
\frac{P(x, y)}{P(x, y \mid k)} \;\propto\;
\begin{cases}
\dfrac{P(x, y)}{P(x, y)} & y = 1 \\[2ex]
\dfrac{P(x, y)}{f\,P(x, y)} & y = 0
\end{cases}
\;\propto\;
\begin{cases}
1 & y = 1 \\
1/f & y = 0.
\end{cases}
\tag{3}
\]

We can substitute these values of 1 or 1/f into the loss. The unknown constant scales the loss surface, but does not change where the minimum is. The loss for a positive example is then −log P(y=1 | x) as usual, whereas the loss for a negative example becomes −(1/f) log P(y=0 | x). The gradients for negative examples are then also scaled by 1/f. This method says that if we throw away all but 1/1000 of the negative examples, we should up-weight the effect of each remaining negative example by 1000× to make up for it.

Compare this approach to the approach from part a) for training models based on subsampled data for binary classification, giving pros and cons of each. Write no more than 150 words. (For your own experimentation only, an illustrative numerical sketch of both corrections follows this question.)
[25 marks] [The website version of this note has a question here.]
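As an illustrative aside (not part of the assessed questions, and not an answer to the comparison asked for above), the following minimal sketch tries both corrections on synthetic data. The data-generating setup, the subsampling fraction f = 0.02, and the use of scikit-learn's LogisticRegression are all assumptions made only for this sketch.

    import numpy as np
    from scipy.special import expit  # logistic sigmoid
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic world with a rare positive class: P(y=1) is about 1%.
    N = 200_000
    y = (rng.random(N) < 0.01).astype(int)
    X = rng.normal(loc=2.0 * y, scale=1.0, size=N)[:, None]  # 1-D feature, class-shifted mean

    # Keep every positive example, but only a fraction f of the negatives.
    f = 0.02
    keep = (y == 1) | (rng.random(N) < f)
    X_sub, y_sub = X[keep], y[keep]

    # Approach a): fit to the subsampled data, then add log(f) to the bias
    # before making real-world predictions.
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(X_sub, y_sub)  # large C: almost no regularization
    p_naive = clf.predict_proba(X)[:, 1]
    p_bias = expit(X @ clf.coef_.T + clf.intercept_ + np.log(f)).ravel()

    # Approach b): importance-weight the subsampled loss, weighting each
    # remaining negative example by 1/f (positives keep weight 1).
    weights = np.where(y_sub == 1, 1.0, 1.0 / f)
    clf_w = LogisticRegression(C=1e6, max_iter=1000).fit(X_sub, y_sub, sample_weight=weights)
    p_weighted = clf_w.predict_proba(X)[:, 1]

    print("true positive rate:               ", y.mean())
    print("naive subsampled model, mean p:   ", p_naive.mean())    # should be far above the true rate
    print("bias-corrected model, mean p:     ", p_bias.mean())     # should be close to the true rate
    print("importance-weighted model, mean p:", p_weighted.mean()) # should be close to the true rate

Comparing how the two corrected models behave, and how much each changes across reruns with different seeds, may help when thinking about the pros and cons the question asks for, but the written comparison is still yours to make.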
