
I started this series because there are still too few TensorFlow tutorials, and the ones that exist were mostly written by authors abroad. For example:
TensorFlow-Examples
tensorflow_tutorials
TensorFlow-Tutorials
Tensorflow-101
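In my view, these tutorials do not explain things in enough detail for beginners. In most cases the author has put finished code in an IPython notebook; you download it, run it locally, and happily get a result, without really understanding why the model is built that way or what each step produces. Or you want to work through these experts' code yourself, but the official API documentation is not yet beginner-friendly, so reading it does not clear things up, and you badly wish someone would guide you.
So I came up with the idea of writing a "hands-on, zero-prerequisite TensorFlow tutorial in Chinese". I hope more people get to know deep learning and TensorFlow. Comments and discussion are very welcome!
Today's code again implements text classification with a CNN. A very important step in this problem is reading and preprocessing the raw data; for the details, see the full code.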
(1) load data and labels
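The experiments use movie reviews from Rotten Tomatoes. First, let's look at what the provided data looks like:
(image missing)
As you can see, each line is one review. The data has already been lightly preprocessed, but contractions such as "doesn't" and "it's" have not been split. We will come back to this below.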

    import numpy as np  # needed for np.concatenate below

    def load_data_and_labels():
        """
        Loads MR polarity data from files, splits the data into words and generates labels.
        Returns split sentences and labels.
        """
        # Load data from files
        positive_examples = list(open("./data/rt-polaritydata/rt-polarity.pos", "r").readlines())
        positive_examples = [s.strip() for s in positive_examples]
        negative_examples = list(open("./data/rt-polaritydata/rt-polarity.neg", "r").readlines())
        negative_examples = [s.strip() for s in negative_examples]
        # Split by words
        x_text = positive_examples + negative_examples
        x_text = [clean_str(sent) for sent in x_text]
        x_text = [s.split(" ") for s in x_text]
        # Generate labels
        positive_labels = [[0, 1] for _ in positive_examples]
        negative_labels = [[1, 0] for _ in negative_examples]
        y = np.concatenate([positive_labels, negative_labels], 0)
        return [x_text, y]

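This function loads the positive and negative data from files, combines them, and tokenizes every sentence, so x_text is a two-dimensional list holding every word of every review. The corresponding labels are combined as well. Because the labels correspond to the two neurons of a binary-classification output layer, they are one-hot encoded as [0, 1] (positive) and [1, 0] (negative) and returned as y.
Note that f.readlines() already returns a list in which each element is one line of text (a str ending in "\n"), so the outer list() conversion is not actually needed.
s.strip() removes leading and trailing whitespace from each sentence, including the trailing newline.
After the newline is removed, each sentence still needs some processing because of the issue mentioned earlier (this happens in clean_str()), which separates punctuation, contractions, and the like. The simplest way to tokenize an English string is to split on spaces, so we only need to insert spaces wherever a split is required and then call split(" ") on the whole string.
The labels are generated in a similar way.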
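
The post does not show clean_str() itself, so the following is only a minimal sketch of the idea described above: insert spaces around punctuation and contractions, then split on spaces. The specific regular expressions are assumptions made for illustration and are not necessarily the ones used in the original code.

```python
import re

def clean_str(string):
    """Hypothetical sketch: pad punctuation and contractions with spaces."""
    # Replace every character outside this allowed set with a space.
    string = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", string)
    # Separate common English contractions from the preceding word.
    string = re.sub(r"'s", " 's", string)
    string = re.sub(r"n't", " n't", string)
    string = re.sub(r"'ve", " 've", string)
    string = re.sub(r"'re", " 're", string)
    string = re.sub(r"'d", " 'd", string)
    string = re.sub(r"'ll", " 'll", string)
    # Surround punctuation marks with spaces.
    string = re.sub(r"([,!?()])", r" \1 ", string)
    # Collapse runs of whitespace into a single space.
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

tokens = clean_str("It's good, but doesn't last!").split(" ")
print(tokens)
# ['it', "'s", 'good', ',', 'but', 'does', "n't", 'last', '!']
```

Because every piece that should become its own token is now separated by exactly one space, a plain split(" ") is enough to tokenize the sentence.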

(2) padding sentence

    def pad_sentences(sentences, padding_word="<PAD/>"):
        """
        Pads all sentences to the same length. The length is defined by the longest sentence.
        Returns padded sentences.
        """
        sequence_length = max(len(x) for x in sentences)
        padded_sentences = []
        for i in range(len(sentences)):
            sentence = sentences[i]
            num_padding = sequence_length - len(sentence)
            new_sentence = sentence + [padding_word] * num_padding
            padded_sentences.append(new_sentence)
        return padded_sentences

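Why do sentences need padding?
In the TextCNN model, input_x is a tf.placeholder, that is, a tensor whose shape is fixed, for example [batch, sequence_len]. The rows of a tensor cannot have different lengths, so we find the length of the longest sentence in the whole dataset and append padding words to every shorter sentence, so that all input sentences have the same length.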

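Since load_data_and_labels returns the sentences as a two-dimensional list, pad_sentences returns them in the same form. The only difference is that padding words may have been appended to the end of each sentence's word list.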
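
To make the effect of padding concrete, here is a small sketch on toy data. It uses a condensed list-comprehension variant of pad_sentences, and the two toy sentences are invented for illustration only.

```python
def pad_sentences(sentences, padding_word="<PAD/>"):
    """Condensed variant: pad every sentence to the length of the longest one."""
    sequence_length = max(len(s) for s in sentences)
    return [s + [padding_word] * (sequence_length - len(s)) for s in sentences]

# Toy tokenized sentences of different lengths (illustrative data).
toy = [["a", "great", "film"], ["boring"]]
padded = pad_sentences(toy)

for sentence in padded:
    print(sentence)
# ['a', 'great', 'film']
# ['boring', '<PAD/>', '<PAD/>']
```

After padding, every row has the same length, so the list can later be converted into a rectangular array of shape [number of sentences, sequence_length].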

(3) build vocabulary

    import itertools
    from collections import Counter

    def build_vocab(sentences):
        """
        Builds a vocabulary mapping from word to index based on the sentences.
        Returns vocabulary mapping and inverse vocabulary mapping.
        """
        # Build vocabulary
        word_counts = Counter(itertools.chain(*sentences))
        # Mapping from index to word
        vocabulary_inv = [x[0] for x in word_counts.most_common()]
        vocabulary_inv = list(sorted(vocabulary_inv))
        # Mapping from word to index
        vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
        return [vocabulary, vocabulary_inv]

As we know, Counter in the collections module can count word frequencies, for example:

    >>> import collections
    >>> sentence = ["i", "love", "mom", "mom", "loves", "me"]
    >>> collections.Counter(sentence)
    Counter({'mom': 2, 'i': 1, 'love': 1, 'loves': 1, 'me': 1})

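Counter takes an iterable as its argument, but we now have many sentence lists. How can all the words in these per-sentence word lists be produced by a single efficient iterator?
This is where itertools.chain(*iterables) comes in. It is described as follows:

It takes multiple iterators as arguments but returns a single iterator that yields the contents of all the argument iterators, as if they came from one single sequence.

This gives us the word-frequency statistics over the whole dataset, word_counts.
To build the vocabulary, however, we only need the first element of each pair in word_counts, that is, the word itself (Counter effectively does the deduplication here). The vocabulary is ordered not by frequency but by the lexicographic order of the words, so calling sorted on vocabulary_inv gives a word list in dictionary order, where words with a smaller first letter come first.
Then we build a dict that stores the index of each word; this is the vocabulary variable.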
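
The steps above can be traced on a tiny example. This is a sketch on made-up sentences; it calls sorted directly on the Counter (which iterates over its keys) instead of going through most_common(), which gives the same word list.

```python
import itertools
from collections import Counter

# Two toy tokenized sentences, the second one already padded (illustrative data).
sentences = [["i", "love", "mom"], ["mom", "loves", "me", "<PAD/>"]]

# chain(*sentences) yields every word of every sentence as one flat stream.
word_counts = Counter(itertools.chain(*sentences))

# Index -> word, in lexicographic (code point) order.
vocabulary_inv = sorted(word_counts)
# Word -> index.
vocabulary = {word: i for i, word in enumerate(vocabulary_inv)}

print(word_counts["mom"])  # 2
print(vocabulary_inv)      # ['<PAD/>', 'i', 'love', 'loves', 'me', 'mom']
print(vocabulary["mom"])   # 5
```

Note that the ordering is by character code, so the padding token "<PAD/>" sorts before all lowercase words and ends up with index 0 in this example.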

(4) build input data

    def build_input_data(sentences, labels, vocabulary):
        """
        Maps sentences and labels to vectors based on a vocabulary.
        """
        x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
        y = np.array(labels)
        return [x, y]

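The functions above give us the two-dimensional list of tokenized sentences, the labels for those sentences, and the vocabulary dict that maps each word to its index.
But consider this: sentences currently stores individual word strings, which takes a lot of memory when the dataset is large. It is better to store each word's index instead; an index is an int and takes less space.
So we use the vocabulary we just built to look up every word in the two-dimensional sentences list, producing a two-dimensional list of word indices, and finally convert that list into a two-dimensional numpy array.
The corresponding labels are already a two-dimensional list of 0s and 1s, so they can be converted to an array directly.
Once converted to arrays, they can be used directly as the CNN's input and labels.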
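
As a quick illustration of this lookup, here is a sketch with a hand-written toy vocabulary. The words, indices, and labels are invented for the example and assume the sentences have already been padded to equal length.

```python
import numpy as np

# Toy vocabulary and padded sentences (illustrative values only).
vocabulary = {"<PAD/>": 0, "bad": 1, "film": 2, "good": 3}
sentences = [["good", "film"], ["bad", "<PAD/>"]]
labels = [[0, 1], [1, 0]]  # [0, 1] = positive, [1, 0] = negative

# Replace each word by its integer index, then convert to 2-D arrays.
x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
y = np.array(labels)

print(x.tolist())  # [[3, 2], [1, 0]]
print(x.shape)     # (2, 2)
print(y.shape)     # (2, 2)
```

Each row of x is one sentence and each column is one word position, which matches the [batch, sequence_len] layout expected by the model input.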

(5) load data

    def load_data():
        """
        Loads and preprocesses data for the MR dataset.
        Returns input vectors, labels, vocabulary, and inverse vocabulary.
        """
        # Load and preprocess data
        sentences, labels = load_data_and_labels()
        sentences_padded = pad_sentences(sentences)
        vocabulary, vocabulary_inv = build_vocab(sentences_padded)
        x, y = build_input_data(sentences_padded, labels, vocabulary)
        return [x, y, vocabulary, vocabulary_inv]

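Finally, load_data ties together the processing functions above:

1. First load the raw data from the text files. It is initially held in a list of sentences; each sentence is then passed through clean_str and tokenized, giving the two-dimensional list sentences with words as the basic unit. The labels are [0, 1] and [1, 0].
2. Find the maximum sentence length and pad the sentences that are shorter.
3. Build the vocabulary from the data, in lexicographic order, and obtain the index of each word.
4. Convert the two-dimensional list of str sentences into sentences of int indices, and return two-dimensional numpy arrays as the model's input and labels for later use.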
(6) generate batch

    def batch_iter(data, batch_size, num_epochs, shuffle=True):
        """
        Generates a batch iterator for a dataset.
        """
        data = np.array(data)
        data_size = len(data)
        # Ceiling division: avoids yielding an empty final batch when
        # data_size is an exact multiple of batch_size.
        num_batches_per_epoch = int((data_size - 1) / batch_size) + 1
        for epoch in range(num_epochs):
            # Shuffle the data at each epoch
            if shuffle:
                shuffle_indices = np.random.permutation(np.arange(data_size))
                shuffled_data = data[shuffle_indices]
            else:
                shuffled_data = data
            for batch_num in range(num_batches_per_epoch):
                start_index = batch_num * batch_size
                end_index = min((batch_num + 1) * batch_size, data_size)
                yield shuffled_data[start_index:end_index]

The point of this function is that, for training, you define batches = batch_iter(...) once, and the whole training process then only needs a for loop over batches to operate on each batch of data.
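
The loop described above might look like the following sketch. It is self-contained, so it repeats a compact version of batch_iter, and it uses a small integer array as stand-in data; the toy data and the print call in place of a real training step are assumptions made for illustration. The batch count here is computed with ceiling division, so no empty batch is produced when the dataset size is an exact multiple of batch_size.

```python
import numpy as np

def batch_iter(data, batch_size, num_epochs, shuffle=True):
    """Yield successive batches of `data` for the given number of epochs."""
    data = np.array(data)
    data_size = len(data)
    # Ceiling division: number of batches needed to cover all rows.
    num_batches_per_epoch = (data_size - 1) // batch_size + 1
    for _ in range(num_epochs):
        if shuffle:
            shuffled = data[np.random.permutation(data_size)]
        else:
            shuffled = data
        for batch_num in range(num_batches_per_epoch):
            start = batch_num * batch_size
            end = min(start + batch_size, data_size)
            yield shuffled[start:end]

# Stand-in dataset: 5 rows of 2 word indices each (illustrative only).
data5 = np.arange(10).reshape(5, 2)
sizes5 = []
for batch in batch_iter(data5, batch_size=2, num_epochs=1, shuffle=False):
    # A real training step would go here; we only inspect the batch.
    print(batch.shape)
    sizes5.append(len(batch))

# With 4 rows and batch_size 2 there are exactly two full batches.
data4 = np.arange(8).reshape(4, 2)
sizes4 = [len(b) for b in batch_iter(data4, batch_size=2, num_epochs=1, shuffle=False)]
print(sizes5, sizes4)  # [2, 2, 1] [2, 2]
```

Because batch_iter is a generator, batches are produced lazily one at a time, and the data is reshuffled at the start of every epoch when shuffle is True.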

