Introduction to Machine Learning

Spam Filtering

The Perceptron Approach

A linear classifier is defined as follows, where $w_i$ are the weights and $x_i$ are the input features:
$$
y=\sum_{i=1}^nw_ix_i
$$
Together with a threshold $\theta$, the decision function is:
$$
f(y)=\begin{cases}
+1&y\geqslant\theta\\
-1&y<\theta
\end{cases}
$$
Introducing a zeroth term with $w_0=-\theta$ and $x_0=1$, we obtain:
$$
y=\sum_{i=0}^nw_ix_i
$$
In each training round the weights are updated as $w_i=w_i+\Delta w_i$, where $\Delta w_i=\lambda(y-\hat y)x_i$, $y$ is the true label, $\hat y$ is the current prediction, and the learning rate $\lambda\in[0.0,1.0]$.
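The update rule can be sketched as a minimal NumPy perceptron (the toy data below is hypothetical):

```python
import numpy as np

def train_perceptron(X, y, lam=0.1, epochs=10):
    """Train a perceptron with the update rule above.
    X: (n_samples, n_features); y: labels in {-1, +1}."""
    # prepend x_0 = 1 so that w_0 plays the role of -theta
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, target in zip(Xb, y):
            pred = 1 if np.dot(w, xi) >= 0 else -1
            w += lam * (target - pred) * xi  # delta w_i = lambda * (y - y_hat) * x_i
    return w

# tiny linearly separable toy set (hypothetical data)
X = np.array([[2.0, 1.0], [3.0, 4.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
preds = np.where(np.hstack([np.ones((4, 1)), X]) @ w >= 0, 1, -1)
```

Because the update only fires on misclassified samples, the loop converges on linearly separable data and then leaves the weights unchanged.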

import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from defs import plot_decision_regions
from sklearn.metrics import accuracy_score
warnings.simplefilter('ignore')
df = pd.read_csv('../datasets/sms_spam_perceptron.csv')
y = df.iloc[:, 0].values
y = np.where(y == 'spam', -1, 1) # spam -> -1, otherwise ham -> 1
X = df.iloc[:, [1, 2]].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) # 30% test data, 70% training data
p = Perceptron(max_iter=40, eta0=0.1, random_state=0) # initialize the perceptron: at most 40 iterations, learning rate 0.1
p.fit(X_train, y_train) # train
y_pred = p.predict(X_test) # predict the test data with the trained perceptron
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X=X_combined, y=y_combined,classifier=p, test_idx=range(-5, 5))
plt.xlabel('suspect words')
plt.ylabel('spam or ham')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred)) # accuracy of the predictions

The plotting helpers are:

#defs.py
from textblob import TextBlob
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
import warnings

def get_tokens(msg):
    return TextBlob(str(msg)).words

def get_lemmas(msg):
    lemmas = []
    words = get_tokens(msg)
    for word in words:
        lemmas.append(word.lemma)
    return lemmas

def versiontuple(v):
    return tuple(map(int, (v.split("."))))

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)
    # highlight test samples
    if test_idx:
        # plot all samples
        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):
            X_test, y_test = X[list(test_idx), :], y[list(test_idx)]
            warnings.warn('Please update to NumPy 1.9.0 or newer')
        else:
            X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], c=cmap(idx), alpha=1.0,
                    linewidths=1, marker='o', s=55, label='test set')

The data sms_spam_perceptron.csv:

"type","sex","buy"
"ham","0","1"
"ham","0","1"
"ham","1","1"
"spam","1","0"
"ham","0","1"
"spam","1","0"
"ham","0","1"
"ham","0","1"
"spam","2","1"
...

The SVM Approach

SVM identifies the hyperplane that separates the data classes in a multidimensional space. Here $\beta$ is the bias and $\mu$ is the margin. We look for the largest possible positive value of $\mu$, so as to obtain the maximum margin between the classes and guard against overfitting.
$$
y=\sum_{i=0}^nw_ix_i+\beta\geqslant\mu
$$
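As a sketch of how this relates to a fitted linear SVM (the toy data and parameters below are hypothetical): `SVC` with a linear kernel exposes the weight vector and bias of the separating hyperplane, from which the geometric margin $2/\lVert w\rVert$ follows.

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical, linearly separable toy data
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

svm = SVC(kernel='linear', C=1000.0)  # a large C approximates a hard margin
svm.fit(X, y)

w = svm.coef_[0]                   # weight vector of the separating hyperplane
beta = svm.intercept_[0]           # bias term
margin = 2.0 / np.linalg.norm(w)   # geometric margin between the two classes
```

Maximizing the margin is equivalent to minimizing $\lVert w\rVert$, which is what the SVC optimizer does internally.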

import warnings 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from defs import plot_decision_regions
from sklearn.metrics import accuracy_score
warnings.simplefilter('ignore')
df = pd.read_csv('../datasets/sms_spam_svm.csv')
y = df.iloc[:, 0].values
y = np.where(y == 'spam', -1, 1)
X = df.iloc[:, [1, 2]].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train, X_test, y_train, y_test)
svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined, y_combined,classifier=svm, test_idx=range(-15, 15))
plt.xlabel('suspect words')
plt.ylabel('spam or ham')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

The dataset:

type,suspect,neutral
ham,1,3
ham,49,30
spam,47,32
ham,46,31
ham,0,36
spam,4,39
ham,46,34
ham,0,34
spam,44,29
...

Logistic Regression and Decision Trees

Linear regression is not well suited to classification, so we only mention it briefly. The formula is as follows, where $y$ is the predicted value, the matrix $X$ holds the feature values, the vector $\overrightarrow w$ is the corresponding weight vector, and the constant $\beta$ is the model's systematic bias.
$$
y=\overrightarrow wX+\beta
$$
A code example:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_csv('../datasets/sms_spam_perceptron.csv')
X = df.iloc[:, [1, 2]].values
y = df.iloc[:, 0].values
y = np.where(y == 'spam', -1, 1)
linear_regression = LinearRegression()
linear_regression.fit(X,y)
print (linear_regression.score(X,y)) # built-in R^2 accuracy metric: how much better the linear model is than simply predicting the mean
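The R² score can be checked by hand as $1-SS_{res}/SS_{tot}$; a sketch on hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # hypothetical data
y = np.array([2.0, 4.1, 5.9, 8.0])

lr = LinearRegression().fit(X, y)
y_pred = lr.predict(X)

ss_res = np.sum((y - y_pred) ** 2)     # residual sum of squares of the model
ss_tot = np.sum((y - y.mean()) ** 2)   # sum of squares around the plain mean
r2_manual = 1.0 - ss_res / ss_tot      # matches lr.score(X, y)
```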

For logistic regression, the formula is as follows, where $P(y=c|x)$ is the conditional probability that a sample belongs to class $c$ given its features $x_i$.
$$
P(y=c|x)=\dfrac{e^z}{1+e^z},z=\sum w_ix_i
$$
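A minimal sketch of the mapping from the linear score $z$ to a probability (the weights and features below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    # P(y=c|x) = e^z / (1 + e^z), equivalently 1 / (1 + e^-z)
    return np.exp(z) / (1.0 + np.exp(z))

w = np.array([0.8, -0.5])   # hypothetical weights
x = np.array([2.0, 1.0])    # hypothetical features
z = np.dot(w, x)            # z = sum_i w_i * x_i
p = sigmoid(z)              # probability that the sample belongs to class c
```

The score $z$ can be any real number, while the output is squashed into $(0,1)$, which is what makes the model usable for classification.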
Here it is used to detect phishing. The dataset has already been cleaned up with one-hot encoding and records many features of phishing websites; the code is shown only for illustration and need not be studied in detail:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
phishing_dataset = np.genfromtxt('../datasets/phishing_dataset.csv', delimiter=',', dtype=np.int32)
samples = phishing_dataset[:, :-1]
targets = phishing_dataset[:, -1]
training_samples, testing_samples, training_targets, testing_targets = train_test_split(samples, targets, test_size=0.2, random_state=0)
log_classifier = LogisticRegression()
log_classifier.fit(training_samples, training_targets)
predictions = log_classifier.predict(testing_samples)
accuracy = 100.0 * accuracy_score(testing_targets, predictions)
print ("Logistic Regression accuracy: " + str(accuracy))

Phishing can likewise be detected with a decision tree:

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import warnings
from sklearn.model_selection import train_test_split
from sklearn import tree
warnings.simplefilter('ignore')
phishing_dataset = np.genfromtxt('../datasets/phishing_dataset.csv', delimiter=',', dtype=np.int32)
samples = phishing_dataset[:,:-1]
targets = phishing_dataset[:, -1]
training_samples, testing_samples, training_targets, testing_targets = train_test_split(samples, targets, test_size=0.2, random_state=0)
tree_classifier = tree.DecisionTreeClassifier()
tree_classifier.fit(training_samples, training_targets)
predictions = tree_classifier.predict(testing_samples)
accuracy = 100.0 * accuracy_score(testing_targets, predictions)
print ("Decision Tree accuracy: " + str(accuracy))

The NLP Approach

Implementation:

import matplotlib.pyplot as plt
import csv,pandas,sklearn,nltk
from textblob import TextBlob
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from defs import get_tokens,get_lemmas
# nltk.download('popular') # downloads needed on the first run
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('punkt_tab')
sms = pandas.read_csv('../datasets/sms_spam_no_header.csv', sep=',', names=["type", "text"])
text_train, text_test, type_train, type_test = train_test_split(sms['text'], sms['type'], test_size=0.3)
bow = CountVectorizer(analyzer=get_lemmas).fit(text_train) # bag-of-words model: assigns a number to each recognized word; get_lemmas returns the tokens extracted from the message text
sms_bow = bow.transform(text_train)
tfidf = TfidfTransformer().fit(sms_bow) # normalization and weighting
sms_tfidf = tfidf.transform(sms_bow)
spam_detector = MultinomialNB().fit(sms_tfidf, type_train)
msg = sms['text'][25] # simply pick message number 26 to predict
msg_bow = bow.transform([msg])
msg_tfidf = tfidf.transform(msg_bow)
print ('predicted:', spam_detector.predict(msg_tfidf)[0])
print ('expected:', sms.type[25])
predictions = spam_detector.predict(sms_tfidf) # predict over the whole training set
print ('accuracy', accuracy_score(type_train, predictions)) # compare against the (shuffled) training labels, not the first rows of the file
print (classification_report(type_train, predictions))

The dataset:

"ham","Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
"ham","Ok lar... Joking wif u oni..."
"spam","Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
"ham","U dun say so early hor... U c already then say..."
"ham","Nah I don't think he goes to usf, he lives around here though"
"spam","FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
"ham","Even my brother is not like to speak with me. They treat me like aids patent."
"ham","As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
"spam","WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."
"spam","Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
...

Malware Threat Detection

Data Extraction

import pefile
import glob
csv_file = open('MalwareArtifacts.csv', 'w') # Python 3: open() instead of the old file()
files = glob.glob('c:\\MalwareSamples\\*.exe')
csv_file.write("AddressOfEntryPoint,MajorLinkerVersion,MajorImageVersion,MajorOperatingSystemVersion,DllCharacteristics,SizeOfStackReserve,NumberOfSections,ResourceSize\n")
for path in files:
    suspect_pe = pefile.PE(path)
    csv_file.write(str(suspect_pe.OPTIONAL_HEADER.AddressOfEntryPoint) + ',') # extract selected fields of the PE file format
    csv_file.write(str(suspect_pe.OPTIONAL_HEADER.MajorLinkerVersion) + ',')
    csv_file.write(str(suspect_pe.OPTIONAL_HEADER.MajorImageVersion) + ',')
    csv_file.write(str(suspect_pe.OPTIONAL_HEADER.MajorOperatingSystemVersion) + ',')
    csv_file.write(str(suspect_pe.OPTIONAL_HEADER.DllCharacteristics) + ',')
    csv_file.write(str(suspect_pe.OPTIONAL_HEADER.SizeOfStackReserve) + ',')
    csv_file.write(str(suspect_pe.FILE_HEADER.NumberOfSections) + ',')
    csv_file.write(str(suspect_pe.OPTIONAL_HEADER.DATA_DIRECTORY[2].Size) + "\n")
csv_file.close()

The extracted dataset looks like the following; the last column has to be added by hand: 1 for legitimate, 0 for suspicious.

AddressOfEntryPoint    MajorLinkerVersion	MajorImageVersion	MajorOperatingSystemVersion	DllCharacteristics	SizeOfStackReserve	NumberOfSections	ResourceSize	legitimate
10407 9 6 6 33088 262144 4 952 1
5354 9 6 6 33088 262144 4 952 1
58807 9 6 6 33088 262144 4 136490 1
25166 9 6 6 33088 262144 4 1940 1
70387 9 6 6 33088 262144 4 83098 1
5856 9 6 6 33088 262144 4 1064 1
26798 9 6 6 33088 262144 4 1060 1
28581 9 6 6 33088 262144 4 99567 1
72841 9 6 6 33088 262144 4 1040 1
...

Malware Clustering with K-Means

For distance-based clustering algorithms, the results can be evaluated with the silhouette coefficient, where $m$ is the mean distance between a sample and all samples in the nearest neighboring cluster, and $n$ is the mean distance between that sample and all other samples in its own cluster.
$$
Sc=\dfrac{m-n}{\max(m,n)}
$$
The distance estimate can be the Euclidean, Manhattan, or Chebyshev distance, among others.

$Sc$ tending to $+1$ indicates optimal clustering, while $Sc$ tending to $-1$ indicates non-optimal clustering; values near $0$ indicate overlapping clusters. $k$ is the number of clusters.
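The formula can be verified by hand against sklearn's `silhouette_score` on a tiny hypothetical example:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# two tight, well-separated clusters (hypothetical points)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])

# by hand, for the first sample: n = mean distance within its own cluster,
# m = mean distance to the samples of the nearest other cluster
n = np.linalg.norm(X[0] - X[1])
m = np.mean([np.linalg.norm(X[0] - X[2]), np.linalg.norm(X[0] - X[3])])
sc_sample0 = (m - n) / max(m, n)

# sklearn averages the per-sample coefficients over the whole dataset
sc_all = silhouette_score(X, labels, metric='euclidean')
```

With clusters this far apart, both values come out close to $+1$, the optimal case described above.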

Implementation:

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
warnings.simplefilter('ignore')
malware_dataset = pd.read_csv('../datasets/MalwareArtifacts.csv', delimiter=',')
# Extracting artifacts samples fields 'MajorLinkerVersion,MajorImageVersion,MajorOperatingSystemVersion,DllCharacteristics'
samples = malware_dataset.iloc[:, [1,2,3,4]].values # use these fields as the malware features
targets = malware_dataset.iloc[:, 8].values
k_means = KMeans(n_clusters=2,max_iter=300)
k_means.fit(samples) # run the iterative algorithm
print("K-means labels: " + str(k_means.labels_))
print ("\nK-means Clustering Results:\n\n", pd.crosstab(targets, k_means.labels_,rownames = ["Observed"],colnames = ["Predicted"]) ) # Observed = actual labels, Predicted = cluster assignments
print ("\nSilhouette coefficient: %0.3f" % silhouette_score(samples, k_means.labels_, metric='euclidean')) # silhouette coefficient with Euclidean distance; quite slow...

The Decision Tree Approach

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import warnings
from sklearn.model_selection import train_test_split
from sklearn import tree
warnings.simplefilter('ignore')
malware_dataset = pd.read_csv('../datasets/MalwareArtifacts.csv', delimiter=',')
# Extracting artifacts samples fields "AddressOfEntryPoint" and "DllCharacteristics"
samples = malware_dataset.iloc[:, [0, 4]].values
targets = malware_dataset.iloc[:, 8].values
training_samples, testing_samples, training_targets, testing_targets = train_test_split(samples, targets, test_size=0.2, random_state=0)
tree_classifier = tree.DecisionTreeClassifier()
tree_classifier.fit(training_samples, training_targets)
predictions = tree_classifier.predict(testing_samples)
accuracy = 100.0 * accuracy_score(testing_targets, predictions)
print ("Decision Tree accuracy: " + str(accuracy))

An ensemble of decision trees is called a decision forest; here, a random forest:

import pandas as pd
import numpy as np
from sklearn import ensemble
import warnings
from sklearn.model_selection import train_test_split
warnings.simplefilter('ignore')
malware_dataset = pd.read_csv('../datasets/MalwareArtifacts.csv', delimiter=',')
# Extracting artifacts samples fields "AddressOfEntryPoint" and "DllCharacteristics"
samples = malware_dataset.iloc[:, [0,4]].values
targets = malware_dataset.iloc[:, 8].values
training_samples, testing_samples, training_targets, testing_targets = train_test_split(samples, targets, test_size=0.2)
rfc = ensemble.RandomForestClassifier(n_estimators=50)
rfc.fit(training_samples, training_targets)
accuracy = rfc.score(testing_samples, testing_targets)
print("Random Forest Classifier accuracy: " + str(accuracy*100) )

The HMM Approach

This approach is particularly suited to detecting metamorphic (self-mutating) malware.

import numpy as np
from hidden_markov import hmm
ob_types = ('W','N' ) # possible observation types: W = works, N = does not work
states = ('L', 'M') # hidden states: M = malicious, L = legitimate
observations = ('W','W','W','N') # observation sequence, associated with individual instructions the program executes
start = np.matrix('0.1 0.9') # start probabilities: first instruction malicious with 0.1, legitimate with 0.9
transition = np.matrix('0.7 0.3 ; 0.1 0.9') # transition matrix
emission = np.matrix('0.2 0.8 ; 0.4 0.6') # emission matrix
_hmm = hmm(states,ob_types,start,transition,emission)
print("Forward algorithm: ")
print ( _hmm.forward_algo(observations) )
print("\nViterbi algorithm: ")
print( _hmm.viterbi(observations) )
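For intuition, the forward algorithm can also be sketched directly in NumPy; this is an illustrative re-implementation reusing the same numbers, not the library's own code:

```python
import numpy as np

def forward(obs_idx, start, trans, emit):
    """P(observation sequence) by the forward algorithm.
    start: (S,), trans: (S, S), emit: (S, O); obs_idx: indices of the observations."""
    alpha = start * emit[:, obs_idx[0]]       # initialize with the first observation
    for o in obs_idx[1:]:
        alpha = (alpha @ trans) * emit[:, o]  # propagate states, absorb next observation
    return alpha.sum()                        # total probability over all state paths

start = np.array([0.1, 0.9])                  # same numbers as above, states in order (L, M)
trans = np.array([[0.7, 0.3], [0.1, 0.9]])
emit = np.array([[0.2, 0.8], [0.4, 0.6]])     # columns in order (W, N)
p = forward([0, 0, 0, 1], start, trans, emit) # observations W, W, W, N
```

The forward recursion sums over all hidden-state paths, whereas Viterbi keeps only the single most likely path.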