kmeans

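Playing around with k-means to poke a little fun at the Chinese national football team, and predicting the 2018 World Cup champion while I'm at it. (World Cup, 2018-7-15 23:00.)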

Birds of a feather flock together, so let's cluster 15 Asian teams.

data = {"zhongguo": [50, 50, 50, 40],
        "riben": [28, 9, 29, 12],
        "hanguo": [17, 15, 27, 26],
        "yilang": [25, 40, 28, 18],
        "shate": [28, 40, 50, 25],
        "yilake": [50, 50, 40, 40],
        "kataer": [50, 40, 40, 40],
        "alianqiu": [50, 40, 50, 40],
        "wuzibiekesitan": [40, 40, 40, 40],
        "taiguo": [50, 50, 50, 40],
        "yuenan": [50, 50, 50, 50],
        "aman": [50, 50, 40, 50],
        "balin": [40, 40, 50, 50],
        "chaoxian": [40, 32, 50, 50],
        "yinni": [50, 50, 50, 50]}

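The four values for each team come from the 2006, 2010, 2014, and 2018 World Cups, in that order, and serve as the clustering samples. A team that made it to the World Cup is scored by its ranking; a team that failed to get out of its qualifying group gets 50; a team that reached the final ten of qualifying gets 40. Australia is not included. The 2018 rankings are estimates: the final is at 11 pm, and in theory the rankings of the other teams are already settled, but I don't know how to work them out. Scored this way, the higher the number, the weaker the team.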
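The scoring rule can be written down as a small helper. This is only a sketch: the function name `campaign_score` and the outcome labels `"world_cup"`, `"final_ten"`, and `"group_exit"` are made up for illustration, and the example inputs are simply the rows for `riben` and `zhongguo` read back through the rule.

```python
# Hypothetical helper; the post itself only lists the final numbers.
FINAL_TEN_SCORE = 40    # reached the final ten of qualifying, missed the World Cup
GROUP_EXIT_SCORE = 50   # did not get out of the qualifying group


def campaign_score(outcome, rank=None):
    """Return the score for one World Cup cycle (higher means weaker)."""
    if outcome == "world_cup":
        if rank is None:
            raise ValueError("a World Cup appearance needs a ranking")
        return rank
    if outcome == "final_ten":
        return FINAL_TEN_SCORE
    if outcome == "group_exit":
        return GROUP_EXIT_SCORE
    raise ValueError("unknown outcome: {}".format(outcome))


# riben is scored by ranking in all four cycles (2006, 2010, 2014, 2018).
riben = [campaign_score("world_cup", rank) for rank in (28, 9, 29, 12)]

# zhongguo: three group exits, then a final-ten finish.
zhongguo = [campaign_score(outcome) for outcome in
            ("group_exit", "group_exit", "group_exit", "final_ten")]

print(riben)
print(zhongguo)
```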

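With k = 3 and China, Japan, and Saudi Arabia as the initial centers, first compute the Euclidean distance from every data point to each of the three centers and assign the point to the nearest one. Once all the points have been assigned, compute the mean of each cluster and use it as the new cluster center. Repeat these steps until the cluster centers stop changing. The code: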

import numpy as np


def name2indexf(names):
    # Map each team name to its row index in the data array.
    name2index = {}
    for index, name in enumerate(names):
        name2index[name] = index
    return name2index


def cacul_eudist(vec1, vec2):
    # Euclidean distance between two vectors of equal length.
    assert len(vec1) == len(vec2)
    dist = np.linalg.norm(vec1 - vec2)
    return dist


if __name__ == "__main__":
    data = {"zhongguo": [50, 50, 50, 40],
            "riben": [28, 9, 29, 12],
            "hanguo": [17, 15, 27, 26],
            "yilang": [25, 40, 28, 18],
            "shate": [28, 40, 50, 25],
            "yilake": [50, 50, 40, 40],
            "kataer": [50, 40, 40, 40],
            "alianqiu": [50, 40, 50, 40],
            "wuzibiekesitan": [40, 40, 40, 40],
            "taiguo": [50, 50, 50, 40],
            "yuenan": [50, 50, 50, 50],
            "aman": [50, 50, 40, 50],
            "balin": [40, 40, 50, 50],
            "chaoxian": [40, 32, 50, 50],
            "yinni": [50, 50, 50, 50]}
    names = list(data.keys())
    data_array = np.zeros(shape=(len(names), 4))
    name2index = name2indexf(names)
    for name in names:
        index = name2index[name]
        data_array[index] = np.array(data[name])
    # Initial centers: zhongguo (row 0), riben (row 1), shate (row 4).
    k_center = np.array([data_array[0], data_array[1], data_array[4]])
    k_center_with_near = {}

    # ------ compute the centers and their nearest points ------
    epoch = 0
    while True:
        # Start every round with empty clusters.
        for index, item in enumerate(k_center):
            k_center_with_near[index] = []

        # ------ choose the nearest center for each data point ------
        for index in range(len(names)):
            data_item = data_array[index]
            near = 0
            dist_min = float("inf")
            for i in range(len(k_center)):
                dist = cacul_eudist(data_item, k_center[i])
                if dist_min > dist:
                    dist_min = dist
                    near = i
            k_center_with_near[near].append(index)

        # ------ recompute the centers ------
        end_tag = True
        for center_near_index in k_center_with_near.keys():
            contry_index = k_center_with_near[center_near_index]
            center_near_data = []
            for item in contry_index:
                center_near_data.append(data_array[item])
            center_near_data = np.array(center_near_data)
            new_center = np.mean(center_near_data, axis=0)
            # Any center that moved means another round is needed.
            if not (k_center[center_near_index] == new_center).all():
                end_tag = False
                k_center[center_near_index] = new_center
        print("epoch:", epoch, "center:", k_center)
        epoch += 1
        if end_tag:
            # Converged: print the team names in each cluster.
            for item in k_center_with_near.keys():
                print("classv{} include".format(item), [names[n] for n in k_center_with_near[item]])
            break
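For comparison, the same loop can be sketched with NumPy broadcasting instead of explicit Python loops. This is an illustrative variant, not the code used for the post: the function name `kmeans`, the `max_iter` cap, the `np.allclose` stopping test, and the rule that an empty cluster keeps its old center are all assumptions added here.

```python
import numpy as np


def kmeans(points, init_centers, max_iter=100):
    """Plain k-means with fixed starting centers (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        # dists[i, j] is the Euclidean distance from point i to center j.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of the assigned points; an empty cluster keeps its old center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


data = {"zhongguo": [50, 50, 50, 40],
        "riben": [28, 9, 29, 12],
        "hanguo": [17, 15, 27, 26],
        "yilang": [25, 40, 28, 18],
        "shate": [28, 40, 50, 25],
        "yilake": [50, 50, 40, 40],
        "kataer": [50, 40, 40, 40],
        "alianqiu": [50, 40, 50, 40],
        "wuzibiekesitan": [40, 40, 40, 40],
        "taiguo": [50, 50, 50, 40],
        "yuenan": [50, 50, 50, 50],
        "aman": [50, 50, 40, 50],
        "balin": [40, 40, 50, 50],
        "chaoxian": [40, 32, 50, 50],
        "yinni": [50, 50, 50, 50]}

names = list(data)
points = np.array([data[name] for name in names], dtype=float)
# Same starting centers as in the post: zhongguo, riben, shate.
init = points[[names.index(name) for name in ("zhongguo", "riben", "shate")]]

labels, centers = kmeans(points, init)
clusters = {j: [n for n, lab in zip(names, labels) if lab == j]
            for j in range(len(centers))}
for j, members in clusters.items():
    print("cluster", j, members)
```

`argmin` picks the first center on a tie, which matches the strict `>` comparison in the loop version.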

Result:

classv0 include ['zhongguo', 'yilake', 'kataer', 'alianqiu', 'wuzibiekesitan', 'taiguo', 'yuenan', 'aman', 'balin', 'chaoxian', 'yinni']
classv1 include ['riben', 'hanguo']
classv2 include ['yilang', 'shate']
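As a quick spot check on the grouping, here are the distances from `hanguo` to the three initial centers, using only the numbers from the data table. It is a sketch of the very first assignment step, not of the whole run; the variable names are illustrative.

```python
import numpy as np

hanguo = np.array([17, 15, 27, 26], dtype=float)

# The three starting centers chosen in the post.
initial_centers = {
    "zhongguo": np.array([50, 50, 50, 40], dtype=float),
    "riben": np.array([28, 9, 29, 12], dtype=float),
    "shate": np.array([28, 40, 50, 25], dtype=float),
}

# Euclidean distance from hanguo to each starting center.
distances = {name: float(np.linalg.norm(hanguo - center))
             for name, center in initial_centers.items()}
nearest = min(distances, key=distances.get)

for name, dist in distances.items():
    print("{}: {:.2f}".format(name, dist))
print("nearest:", nearest)
# zhongguo: 55.13
# riben: 18.89
# shate: 35.72
# nearest: riben
```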

By this measure, the Chinese team only counts as a third-tier side in Asia.

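My prediction: Croatia wins the title. They are a bit weaker than France, but never underestimate a team that is hungry for the championship.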