CS20SI 03. Linear and Logistic Regression

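All example code and data for this lecture are available from the course's GitHub repository. To lay out the lecture content more clearly, I will organize these notes as follows:

First, post the starter code skeleton provided by the instructor
Then, because the starter code is usually split into several blocks, walk through the implementation of each block in the order of the algorithm's logic
- If a block uses TensorFlow features that have not been covered before, fill them in
- If a block touches on theory (for example, different optimizers), excerpt the instructor's lecture notes

Let's get started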

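Implementing linear regression from scratch with TensorFlow

The first topic of this lecture is a linear regression model. The input X is the birth rate of each of 190 countries, and the output Y is that country's life expectancy. The starter code is as follows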

""" Starter code for simple linear regression example using placeholders
Created by Chip Huyen (huyenn@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
# Remember both X and Y are scalars with type float
X, Y = None, None
#############################
########## TO DO ############
#############################

# Step 3: create weight and bias, initialized to 0.0
# Make sure to use tf.get_variable
w, b = None, None
#############################
########## TO DO ############
#############################

# Step 4: build model to predict Y
# e.g. how would you derive at Y_predicted given X, w, and b
Y_predicted = None
#############################
########## TO DO ############
#############################

# Step 5: use the square error as the loss function
loss = None
#############################
########## TO DO ############
#############################

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

start = time.time()

# Create a filewriter to write the model's graph to TensorBoard
#############################
########## TO DO ############
#############################

with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    #############################
    ########## TO DO ############
    #############################

    # Step 8: train the model for 100 epochs
    for i in range(100):
        total_loss = 0
        for x, y in data:
            # Execute train_op and get the value of loss.
            # Don't forget to feed in data for placeholders
            _, loss = ########## TO DO ############
            total_loss += loss

        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    #############################
    ########## TO DO ############
    #############################
    writer.close()
    
    # Step 9: output the values of w and b
    w_out, b_out = None, None
    #############################
    ########## TO DO ############
    #############################

print('Took: %f seconds' %(time.time() - start))

# uncomment the following lines to see the plot 
# plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
# plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
# plt.legend()
# plt.show()

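Defining the computation graph

Defining the variables with placeholders

This part is straightforward: define X and Y as placeholders. The word "variable" in the lecture notes may be ambiguous for complete TF beginners, because it brings tf.Variable to mind. For the difference between tf.placeholder and tf.Variable, see this StackOverflow Q&A. In summary: tf.placeholder usually holds the data and labels fed in to train the model (here, X and Y), whereas tf.Variable usually defines the model parameters to be learned (here, for a linear model, the weight w and the bias b)

The lecture notes define the placeholders as follows, without explicitly specifying the shapes of X and Y. According to the earlier lectures, omitting the shapes can make debugging more painful

X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')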

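Defining and initializing the trainable variables

These are the variables the model solves for and, as the starter code instructs, they should be created with tf.get_variable. Variables must be given an initial value when they are defined; here the initial value is 0. When a constant is used as the initializer, the shape does not need to be specified

w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))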

Defining the prediction

Just write down how the model makes a prediction. Each X fed in is a scalar, so the * operator can be used directly

Y_predicted = w * X + b

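Defining the loss function

Again, write the squared-error expression directly. Because the computation runs once per data point, no summation is needed. If a sum is required, use tf.reduce_sum()

loss = tf.square(Y - Y_predicted, name='loss')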

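Defining the optimizer

Gradient descent could be written out by hand, but TensorFlow already provides a good wrapper that can be called directly. TF offers other optimizers as well; how they work is covered in detail later (the optimizer the course recommends most is Adam)

One point is worth noting: every optimizer defined in tf.train has a minimize method whose signature accepts a var_list argument, documented as minimizing loss "by updating var_list". Why is no var_list passed here to say which variables to update? Because w and b were created with trainable=True by default, and the loss depends on both of them, so TF knows these are the variables to update (train). When var_list is omitted, minimize falls back to the variables collected under GraphKeys.TRAINABLE_VARIABLES

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)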
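To make concrete what minimize automates for this particular model, here is a minimal plain-Python sketch of the hand-written update. It is not TensorFlow code: sgd_step is a hypothetical helper, the toy samples are made up (generated from y = 2x + 1), and the learning rate is chosen for the toy data rather than taken from the lecture.

```python
def sgd_step(w, b, x, y, learning_rate=0.001):
    """One gradient-descent update for the per-sample loss (y - (w*x + b))**2."""
    error = y - (w * x + b)
    grad_w = -2.0 * error * x   # d(loss)/dw
    grad_b = -2.0 * error       # d(loss)/db
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Made-up toy data lying exactly on y = 2x + 1 (not the birth/life dataset)
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0                 # same zero initialization as in the notes
for epoch in range(2000):
    for x, y in samples:        # one update per sample, as in the training loop
        w, b = sgd_step(w, b, x, y, learning_rate=0.01)

print(w, b)                     # approaches w = 2, b = 1 on this toy data
```

The two gradient lines are exactly what the optimizer derives from the graph automatically, which is why the TF version only needs the single minimize call.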

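tf.stop_gradient can be used to keep certain tensors from contributing to the gradient of a loss. It is useful when specific variables should be "frozen" during training. For example, when training a GAN, no backpropagation should happen while the adversarial examples are being generated

The Optimizer classes compute the derivatives on the graph automatically, but tf.gradients can also be used to compute specific gradients explicitly. This approach suits training only part of a model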

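Training the model

Initializing the variables

To initialize w and b, call tf.global_variables_initializer(), which was covered earlier. We can go one step further and create a tf.summary.FileWriter object to write logs, so that the model's graph can be inspected in TensorBoard

sess.run(tf.global_variables_initializer())
writer = tf.summary.FileWriter('./my_graph/03/linear_reg', sess.graph)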

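Training the model

Training amounts to repeatedly evaluating the optimizer op and the loss value, so sess.run can be called as shown below. X and Y are only placeholders, so a feed_dict must be passed to run to supply the actual values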

for i in range(100):
    total_loss = 0
    for x, y in data:
        _, l = sess.run([optimizer, loss], feed_dict={X:x, Y:y})
        total_loss += l

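Strictly speaking, optimizer here is an op (the training step returned by minimize) and loss is a tensor. sess.run evaluates both, but only loss yields a value, which is why the first result is discarded into _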

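Outputting the model variables

This part is simpler still: pass the variables to fetch to the run method. The results are stored under new names (w_out and b_out, as in the starter code) so that the Python names w and b keep referring to the TF variables

w_out, b_out = sess.run([w, b])

The complete code is available on GitHub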

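Improvements

The final plot shows that the dataset contains some outliers, which distort the fitted linear regression model. To reduce their influence, the Huber loss defined below can be used to make the model more robust

\[ L_\delta (y, f(x)) = \begin{cases} \frac{1}{2}(y-f(x))^2 & {\rm for\ }|y-f(x)| \le \delta \\ \delta|y-f(x)| - \frac{1}{2}\delta^2 & {\rm otherwise}\end{cases} \]

Note that the implementation cannot simply write if Y - Y_predicted < delta, because Y and Y_predicted are tensors. According to a StackOverflow discussion, comparing TF tensors requires the tf.less() op, and the result of tf.less() only becomes a Python boolean after it is evaluated with the Session's run method. A simpler way to achieve the same goal is the tf.cond op, which works a bit like the IF function in Excel: the first argument is a condition, the second is the function called when the condition is true, and the third is the function called when the condition is false. The complete TF implementation of the Huber loss is therefore: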

def huber_loss(labels, predictions, delta=1.0):
    residual = tf.abs(labels - predictions)
    def f1(): return 0.5 * tf.square(residual)
    def f2(): return delta * residual - 0.5 * tf.square(delta)
    return tf.cond(residual < delta, f1, f2)
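As a sanity check on the piecewise definition, the same formula can be evaluated on plain Python scalars. This is a sketch for checking values by hand, not the TF implementation; huber is a hypothetical name and the inputs are made up.

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for a single (label, prediction) pair of Python floats."""
    residual = abs(y - y_hat)
    if residual <= delta:
        return 0.5 * residual ** 2                # quadratic near zero
    return delta * residual - 0.5 * delta ** 2    # linear for large residuals

print(huber(1.0, 0.5))   # small residual: quadratic branch
print(huber(5.0, 0.0))   # large residual: linear branch, grows more slowly than the square
```

At a residual of exactly delta both branches give the same value, so the loss is continuous and the choice between < and <= at the boundary does not change the result.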

`tf.data`

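According to Derek Murray's article, the advantage of placeholders and feed_dict is that they keep data processing outside the TF framework, so shuffling, batching, and similar operations are easy to implement in Python. However, this approach can slow the program down: the data-processing code usually runs in a single thread and becomes a performance bottleneck. An alternative is to use TensorFlow's queues for the same job, which brings the benefits of pipelining and multithreading and reduces the time cost. Queues, however, are harder to use and prone to crashing

Starting with version 1.4, TensorFlow moved the API under tf.contrib.data from the contrib package into the core API. Input data can now be stored in a tf.data.Dataset object, used as follows

tf.data.Dataset.from_tensor_slices((features, labels))

Although features and labels are supposed to be tensors, TF integrates seamlessly with NumPy, so in practice NumPy arrays can be passed in as well, i.e.

dataset = tf.data.Dataset.from_tensor_slices((data[:, 0], data[:, 1]))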

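Other commonly used Dataset classes include

- tf.data.TextLineDataset. Treats each line of a file as one data item; suitable for machine translation
- tf.data.FixedLengthRecordDataset. Requires every record to have the same length; suitable for CIFAR, ImageNet, and similar datasets
- tf.data.TFRecordDataset. For data stored in the tfrecord format

Dataset objects provide methods such as batch, shuffle, and repeat, and also support creating a new Dataset through map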
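As a rough mental model of what shuffle, batch, and map do to a stream of samples, here is a plain-Python analogy. It does not use tf.data at all: shuffled and batched are hypothetical helpers, the data is made up, and the shuffle shown corresponds to the case where the shuffle buffer is at least as large as the dataset.

```python
import random

def shuffled(items, seed=0):
    """Return a shuffled copy of items (analogy for Dataset.shuffle)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items

def batched(items, batch_size):
    """Group consecutive items into lists of at most batch_size (analogy for Dataset.batch)."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

samples = list(range(10))                   # made-up "dataset" of 10 items
doubled = [2 * s for s in samples]          # analogy for Dataset.map
batches = batched(shuffled(doubled), 4)     # 10 items become batches of 4, 4 and 2

print([len(batch) for batch in batches])
```

The last batch is smaller than the others whenever the number of samples is not a multiple of the batch size, which matters for any code that assumes a fixed batch dimension.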

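Once the data is in a Dataset object, it can be traversed with an iterator, which returns a new sample or batch each time the tensors returned by get_next() are evaluated. To iterate over multiple epochs, use dataset.make_initializable_iterator and re-run its initializer at the start of each epoch. The key code is: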

data, n_samples = utils.read_birth_life_data(DATA_FILE)
dataset = tf.data.Dataset.from_tensor_slices((data[:, 0], data[:, 1]))
iterator = dataset.make_initializable_iterator()
X, y = iterator.get_next()

... # Use X and y as what you did when you used placeholder

with tf.Session() as sess:
    ...
    for i in range(100):
        sess.run(iterator.initializer)
        total_loss = 0
        try:
            while True:
                _, l = sess.run([optimizer, loss])
                total_loss += l
        except tf.errors.OutOfRangeError:
            pass

Note that once every sample has been iterated over, tf.errors.OutOfRangeError is raised. TensorFlow does not handle this exception for you; it has to be caught manually
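The control flow above has the same shape as Python's own iterator protocol, which may make the pattern easier to remember. The sketch below is only an analogy in plain Python with made-up samples: iter() stands in for running iterator.initializer, next() for evaluating get_next(), and StopIteration for tf.errors.OutOfRangeError.

```python
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # made-up (x, y) pairs

seen_per_epoch = []
for epoch in range(2):
    iterator = iter(samples)        # analogy: sess.run(iterator.initializer)
    count = 0
    try:
        while True:
            x, y = next(iterator)   # analogy: evaluating the get_next() tensors
            count += 1
    except StopIteration:           # analogy: tf.errors.OutOfRangeError
        pass                        # end of the epoch, move on to the next one
    seen_per_epoch.append(count)

print(seen_per_epoch)
```

Without the fresh iterator at the top of each epoch, the second epoch would see no data, which is the reason the initializer is re-run inside the epoch loop.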

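Implementing logistic regression from scratch with TensorFlow

The second part of the lecture implements a logistic regression model. The dataset is MNIST, the handwritten digit dataset that Hinton often uses, so X is the raw pixel values of an image and Y is the digit the image actually shows. The starter code is shown below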

""" Starter code for simple logistic regression model for MNIST
with tf.data module
MNIST dataset: yann.lecun.com/exdb/mnist/
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
import time

import utils

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30
n_train = 60000
n_test = 10000

# Step 1: Read in data
mnist_folder = 'data/mnist'
utils.download_mnist(mnist_folder)
train, val, test = utils.read_mnist(mnist_folder, flatten=True)

# Step 2: Create datasets and iterator
# create training Dataset and batch it
train_data = tf.data.Dataset.from_tensor_slices(train)
train_data = train_data.shuffle(10000) # if you want to shuffle your data
train_data = train_data.batch(batch_size)

# create testing Dataset and batch it
test_data = None
#############################
########## TO DO ############
#############################


# create one iterator and initialize it with different datasets
iterator = tf.data.Iterator.from_structure(train_data.output_types, 
                                           train_data.output_shapes)
img, label = iterator.get_next()

train_init = iterator.make_initializer(train_data)	# initializer for train_data
test_init = iterator.make_initializer(test_data)	# initializer for test_data

# Step 3: create weights and bias
# w is initialized to random variables with mean of 0, stddev of 0.01
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w, b = None, None
#############################
########## TO DO ############
#############################


# Step 4: build model
# the model that returns the logits.
# this logits will be later passed through softmax layer
logits = None
#############################
########## TO DO ############
#############################


# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
loss = None
#############################
########## TO DO ############
#############################


# Step 6: define optimizer
# using Adam Optimizer with pre-defined learning rate to minimize loss
optimizer = None
#############################
########## TO DO ############
#############################


# Step 7: calculate accuracy with test set
preds = tf.nn.softmax(logits)
correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(label, 1))
accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

writer = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph())
with tf.Session() as sess:
   
    start_time = time.time()
    sess.run(tf.global_variables_initializer())

    # train the model n_epochs times
    for i in range(n_epochs): 	
        sess.run(train_init)	# drawing samples from train_data
        total_loss = 0
        n_batches = 0
        try:
            while True:
                _, l = sess.run([optimizer, loss])
                total_loss += l
                n_batches += 1
        except tf.errors.OutOfRangeError:
            pass
        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))
    print('Total time: {0} seconds'.format(time.time() - start_time))

    # test the model
    sess.run(test_init)			# drawing samples from test_data
    total_correct_preds = 0
    try:
        while True:
            accuracy_batch = sess.run(accuracy)
            total_correct_preds += accuracy_batch
    except tf.errors.OutOfRangeError:
        pass

    print('Accuracy {0}'.format(total_correct_preds/n_test))
writer.close()

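The implementation is again split into defining the computation graph and training the model, and defining the graph again involves placeholders, trainable variables, and so on, much as in linear regression. For simplicity, this section does not create sub-subsections for these steps, and the explanations are shorter

Defining the computation graph

The computation graph for logistic regression is defined in the following steps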

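Define the variables with placeholders (the starter code above reads img and label from a tf.data iterator instead; the steps below use the placeholder form):
X = tf.placeholder(tf.float32, [batch_size, 784], name='image')
Y = tf.placeholder(tf.float32, [batch_size, 10], name='label')
Each original image is a \(28 \times 28\) black-and-white picture; here that matrix is flattened into a vector of length 784 per image. Instead of reading one data point at a time as in the linear regression example, we read batch_size data points at once, so the model's loss is effectively optimized with mini-batch updates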

Define and initialize the trainable variables. This is similar to the previous example

w = tf.Variable(tf.random_normal(shape=[784, 10], stddev=0.01), name='weights')
b = tf.Variable(tf.zeros([1, 10]), name='bias')

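Define the prediction and the loss function. This differs slightly from the logistic regression discussed before: that version mostly handles binary classification, so the logistic (sigmoid) function can map the result of \({\bf w}^\mathsf{T}{\bf x} + b\) directly into the interval \((0, 1)\). Here the problem is multi-class, so the softmax function is applied to the final scores to obtain the probability that \(\bf x\) belongs to each class. The usual loss function is cross-entropy, whose theory is explained further in a later theory section. The code is as follows
logits = tf.matmul(X, w) + b
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name='loss')
loss = tf.reduce_mean(entropy)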
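To show what the softmax cross-entropy computes for one example, and what averaging over the batch then does, here is a plain-Python sketch. It is an illustration under assumptions rather than TensorFlow's implementation: log_softmax and softmax_cross_entropy are hypothetical helpers, the logits and labels are made up, and three classes are used instead of ten to keep the numbers readable.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, shifted by the max logit for numerical stability."""
    m = max(logits)
    log_total = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_total for z in logits]

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between a one-hot label and softmax(logits) for one example."""
    return -sum(t * lp for t, lp in zip(one_hot_label, log_softmax(logits)))

# Made-up batch of two examples with three classes each
batch_logits = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
batch_labels = [[1, 0, 0], [0, 0, 1]]

per_example = [softmax_cross_entropy(z, t) for z, t in zip(batch_logits, batch_labels)]
loss = sum(per_example) / len(per_example)   # plays the role of tf.reduce_mean

print(per_example, loss)
```

The second example has equal logits, so the model assigns probability 1/3 to every class and the loss for that example is log 3, a handy reference value for an untrained classifier.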

Define the optimizer. Adam is used here

optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

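Training the model

The training code is almost unchanged from linear regression

_, loss_batch = sess.run([optimizer, loss], feed_dict={X: X_batch, Y: Y_batch})

With that, the logistic regression model is also fully implemented in TF. The complete code is in the course GitHub repository

The slides offer a very useful tip: with a larger mini-batch, train for more epochs, because each epoch then performs fewer weight updates and more epochs are needed to reach a sufficient number of updates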
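The arithmetic behind this tip can be checked quickly. The sketch below assumes one weight update per mini-batch and takes n_train = 60000 and n_epochs = 30 from the starter code's parameter block; updates_per_epoch is a hypothetical helper, the batch size of 1024 is made up for comparison, and the real number of training images depends on how utils.read_mnist splits off the validation set.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and thus weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

n_train, n_epochs = 60000, 30   # values from the starter code's parameter block

small_batch_updates = updates_per_epoch(n_train, 128) * n_epochs    # 469 * 30 = 14070
large_batch_updates = updates_per_epoch(n_train, 1024) * n_epochs   # 59 * 30 = 1770

print(small_batch_updates, large_batch_updates)
```

With the same 30 epochs, the larger batch size performs roughly one eighth as many updates, so matching the update count would take about eight times as many epochs.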