What is Data Overfitting in Machine Learning?

justyy (74) in #cn • 8 years ago (edited)

My understanding of data overfitting is this: you have a training set and you come up with a model, but the model is tuned so closely to that data that it only works on that specific dataset. If you apply the model to other datasets (scenarios), the results are bad.

Data are never perfect. In most cases the training set contains noise, which should be filtered out rather than built into the model.
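To make the idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the data are synthetic (a linear height/weight trend plus Gaussian noise), and the names `make_samples`, `fit_line` and `fit_memorizer` are made up. The "memorizer" simply returns the weight of the nearest training height, so it reproduces the noise in the training set exactly.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_samples(n, noise_sd=5.0):
    """Synthetic (height_cm, weight_kg) pairs: assumed linear trend plus noise."""
    samples = []
    for _ in range(n):
        h = random.uniform(150.0, 200.0)
        w = -101.24 + 1.061 * h + random.gauss(0.0, noise_sd)
        samples.append((h, w))
    return samples

def fit_line(samples):
    """Ordinary least squares for w = b + k * h."""
    n = len(samples)
    mean_h = sum(h for h, _ in samples) / n
    mean_w = sum(w for _, w in samples) / n
    k = (sum((h - mean_h) * (w - mean_w) for h, w in samples)
         / sum((h - mean_h) ** 2 for h, _ in samples))
    b = mean_w - k * mean_h
    return lambda h: b + k * h

def fit_memorizer(samples):
    """Overfit on purpose: answer with the weight of the nearest training height."""
    return lambda h: min(samples, key=lambda s: abs(s[0] - h))[1]

def mse(model, samples):
    """Mean squared error of a model over a list of (height, weight) pairs."""
    return sum((model(h) - w) ** 2 for h, w in samples) / len(samples)

train = make_samples(40)   # data the models are fitted on
test = make_samples(400)   # fresh data the models have never seen

line = fit_line(train)
memo = fit_memorizer(train)

print("line      train/test MSE:", mse(line, train), mse(line, test))
print("memorizer train/test MSE:", mse(memo, train), mse(memo, test))
```

The memorizer scores a perfect zero error on the training set because it has absorbed the noise, yet it does worse than the plain line on fresh data. That gap between training and unseen data is what overfitting looks like.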

I have also written this post:

The Machine Learning Case Study – How to Predict Weight over Height/Gender using Linear Regression?

Based on many samples of weight/height relations:

Male Weight = -101.24 + 1.061 * Height

Female Weight = -110.20 + 1.062 * Height

I am 174 cm tall, so my weight should be about 83.4 kg, but I actually weigh 80.0 kg. So according to this model I am fit, which is soooo much better than the BMI.
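As a quick check, the two formulas can be evaluated directly. This is only a sketch: the function name `predicted_weight_kg` is made up, and it assumes height in centimetres and weight in kilograms, as in the example above.

```python
# Intercept and slope copied from the two formulas in this post.
COEFFICIENTS = {
    "male": (-101.24, 1.061),
    "female": (-110.20, 1.062),
}

def predicted_weight_kg(height_cm, gender):
    """Evaluate weight = intercept + slope * height for the given gender."""
    intercept, slope = COEFFICIENTS[gender]
    return intercept + slope * height_cm

print(round(predicted_weight_kg(174, "male"), 2))    # 83.37
print(round(predicted_weight_kg(174, "female"), 2))  # 74.59
```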

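Big data is all the rage these days; with big data you can seemingly get rich without doing much at all. Generally, once you have data, you can run learning algorithms over it to obtain models, and then use those models to make predictions.

However, your data (the training set) may well contain special cases, also called noise, which we need to filter out or ignore during learning. Otherwise the resulting model will be overfitted. An overfitted model fits the current training set very closely but does not suit other scenarios.

[Figure: overfitting. Image sourced from the web.]

The case study article learned the relationship between weight and height from a large number of male/female samples and produced two models:

Male Weight = -101.24 + 1.061 * Height

Female Weight = -110.20 + 1.062 * Height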

I am 174 cm tall, so my weight should be about 83.4 kg; my actual weight is 80.0 kg, so I am not fat... This is far more reliable than the BMI. 😂

Thank you for reading my post, feel free to FOLLOW and Upvote @justyy which motivates me to create more quality posts.

// Compiled from my blog posts here and here.

#cn-programming #machine-learning #overfitting

drycurrynoodles (43) 8 years ago

Thanks for sharing. I have just completed Andrew Ng's ML course and I am still a beginner in machine learning. There is so much to learn!

justyy (74) 8 years ago

Me too, I'm a beginner..

thisjourney (49) 8 years ago

Your work is very interesting and meaningful!!! Even though I don't understand the technical side of it.. lol

I'd love to learn more about AI/AGI development! May I ask, in the case of "Predict Weight over Height/Gender Using Linear Regression", does the machine process the given data and "reorganize" it (come up with a certain formula), or does it also give new insights that were previously unknown?

I'm actually wondering: if 2 AGIs are put in the same environment, do they come up with similar solutions to a specific task? If not, how is it determined which one has better "potential" or "value"? For humans, 2 people can have completely different reactions/solutions in the same situation, and it's very complicated to assess that.

Could you please explain this in layman English/Chinese? Thanks a lot! :D

justyy (74) 8 years ago

The model is built from the dataset. The example here is actually very simple: y = kx + b. You have lots of (x, y) pairs in the dataset, and you need to estimate the k and b that best fit most of them.
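A minimal sketch of how k and b can be estimated, assuming ordinary least squares is the fitting criterion. The function name `fit_line` and the sample points are made up for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares estimate of k and b in y = k * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    k = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - k * mean_x
    return k, b

# Hypothetical (x, y) pairs that lie roughly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

k, b = fit_line(xs, ys)
print(round(k, 2), round(b, 2))  # 1.97 1.09
```

So the machine does not discover anything beyond the formula's two numbers; it only picks the k and b that make the line sit as close as possible to the points it was given.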

tumutanzi (72) 8 years ago

justyy (74) 8 years ago

HAHA...

jubi (69) 8 years ago

160 and still not fat? Then at 140 I must be skinny :)

justyy (74) 8 years ago

Haha. I'm just consoling myself, Ah Q style.

dixonloveart (64) 8 years ago

Big data brings some fakers too, which needs to be taken seriously.

justyy (74) 8 years ago

Everybody is talking about Big Data....

themarkymark (81) 8 years ago

It is not uncommon to purposely try to overfit initially and then work on scaling it back so it generalizes well with unknown data. You want to focus on the best validation score you can get rather than completely eliminating overfitting.
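One way to picture this workflow is the sketch below. It is an illustration under stated assumptions, not a recipe: the data are synthetic, the model is a k-nearest-neighbour average (my choice here, not something from the post), and names such as `knn_predict` and `candidate_ks` are made up. With k = 1 the model memorizes the training set; raising k scales the overfitting back, and the k with the best validation score is kept.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def make_samples(n, noise_sd=5.0):
    """Synthetic (height_cm, weight_kg) pairs: assumed linear trend plus noise."""
    samples = []
    for _ in range(n):
        h = random.uniform(150.0, 200.0)
        w = -101.24 + 1.061 * h + random.gauss(0.0, noise_sd)
        samples.append((h, w))
    return samples

def knn_predict(train, h, k):
    """Average the weights of the k training samples closest in height."""
    nearest = sorted(train, key=lambda s: abs(s[0] - h))[:k]
    return sum(w for _, w in nearest) / k

def mse(train, samples, k):
    """Mean squared error of the k-NN model on the given samples."""
    return sum((knn_predict(train, h, k) - w) ** 2
               for h, w in samples) / len(samples)

train = make_samples(200)
valid = make_samples(200)  # held-out data used only for scoring

candidate_ks = [1, 3, 5, 9, 15, 25, 51]  # k = 1 is the deliberately overfit start
train_mse = {k: mse(train, train, k) for k in candidate_ks}
valid_mse = {k: mse(train, valid, k) for k in candidate_ks}

# Keep the setting with the best validation score, not the best training score.
best_k = min(candidate_ks, key=lambda k: valid_mse[k])

for k in candidate_ks:
    print(k, round(train_mse[k], 1), round(valid_mse[k], 1))
print("best k by validation:", best_k)
```

The training error is lowest at k = 1, so picking by training score would always choose the overfit model; the validation score is what points to a k that generalizes.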

justyy (74) 8 years ago

agreed.
