Natural Language Processing - Topic Identification

2019-05-10 08:00:00 · 飞浪


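Introduction

Natural Language Processing (NLP) is the science of working with human language or text data. One application of NLP is topic identification, a technique for discovering the topics in text documents.

In this guide, we will cover the basics of topic identification and modeling. Using the bag-of-words approach and simple NLP models, we will learn how to identify topics from text.

We will start by importing the libraries used in this guide.

Import the required libraries and modules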

    import string
    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    # Download the NLTK resources if using these modules for the first time
    nltk.download('punkt')        # tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
    nltk.download('wordnet')      # lexical database used by WordNetLemmatizer
    nltk.download('stopwords')    # stopword lists

    # For Gensim
    import gensim
    from gensim import corpora
    from gensim.corpora.dictionary import Dictionary

Bag-of-words approach

Bag-of-words is a simple way to identify the topics in a document. It rests on the assumption that the more often a term appears, the more important it is. We will use the text sample below to see how this works:

    text1 = "Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)"
    print(text1)

Output:

      Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)
    

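This text is about the Avengers film Infinity War. First, we create tokens using tokenization. The first line of code below splits the text into tokens. The second line converts the tokens to lowercase, and the third line prints the output.

    tokens = word_tokenize(text1)
    lowercase_tokens = [t.lower() for t in tokens]
    print(lowercase_tokens)

Output: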

      'avengers'
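To make the tokenization step more concrete, here is a minimal sketch that uses only the standard library. The regular expression is an assumption chosen for illustration; it only approximates what `word_tokenize` does and is not NLTK's actual algorithm. It shows why a tokenizer is preferable to a plain whitespace split: punctuation becomes its own token instead of sticking to the neighboring word.

```python
import re

sentence = "Avengers: Infinity War was a 2018 film."

# Naive approach: splitting on whitespace leaves punctuation glued to words.
whitespace_tokens = sentence.split()

# Rough tokenizer: runs of word characters, or single punctuation marks.
# This pattern is an illustrative stand-in, not NLTK's word_tokenize.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With the whitespace split, "Avengers:" and "film." would be counted as different terms from "Avengers" and "film", which distorts the frequencies that bag-of-words relies on.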

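The list of tokens generated above can be passed as the initialization argument to the "Counter" class, which was imported from the "collections" module.

The first line of code below creates a Counter object, "bagofwords_1", which lets us see each token and its frequency. The second line prints the 10 most common tokens along with their frequencies.

    bagofwords_1 = Counter(lowercase_tokens)
    print(bagofwords_1.most_common(10))

Output: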

      ('the'
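The same idea can be sketched on a tiny sentence using only the standard library. The sentence and the helper name `bag_of_words` are made up for illustration. The sketch shows how function words such as "the" tend to dominate raw counts.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Hypothetical sample sentence, not taken from the guide's data.
sample = "the cat sat on the mat and the dog sat too"
counts = bag_of_words(sample)
top_three = counts.most_common(3)
print(top_three)
```

The most frequent token here is "the", which says nothing about the subject of the sentence. This is the problem that preprocessing is meant to address.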

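Text preprocessing

The output generated above is interesting, but it is not very useful for topic identification. Tokens such as "the" and "was" are common words that do little to reveal the topic. To overcome this, we will preprocess the text.

The first line of code below creates a list called "alphabets" by looping over "lowercase_tokens" and keeping only the tokens made up entirely of letters, which drops punctuation and numbers. The second and third lines remove English stopwords, and the fourth line prints the new list, "stopwords_removed".

    alphabets = [t for t in lowercase_tokens if t.isalpha()]

    words = stopwords.words("english")
    stopwords_removed = [t for t in alphabets if t not in words]

    print(stopwords_removed)

Output: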

      'avengers'
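As a small self-contained sketch of these two filters, the example below applies them to a short token list. The stopword set is a tiny hand-written stand-in and not NLTK's English list.

```python
# Hypothetical token list, as a tokenizer might produce it.
tokens = ["avengers", ":", "infinity", "war", "was", "a", "2018", "film", "."]

# Tiny stand-in stopword set for illustration (NLTK's list is far longer).
toy_stopwords = {"was", "a", "the"}

# Keep tokens made up only of letters: drops ":", "2018" and ".".
alphabetic = [t for t in tokens if t.isalpha()]

# Drop the stopwords from what remains.
filtered = [t for t in alphabetic if t not in toy_stopwords]

print(alphabetic)
print(filtered)
```

Note that `isalpha()` also discards mixed tokens such as "19th", since they contain digits.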

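We have completed the initial preprocessing steps, but there is more to do. One important technique is lemmatization, which reduces a word to its dictionary base form (its lemma), so that "avengers" and "avenger" are counted as the same term. This is done in the code below.

The first line of code instantiates the WordNetLemmatizer. The second line uses the '.lemmatize()' method to create a new list called lem_tokens, and the third line calls the Counter class to create a new Counter called bag_words. Finally, the fourth line prints the six most common tokens. Called without a part-of-speech argument, '.lemmatize()' treats every token as a noun.

    lemmatizer = WordNetLemmatizer()

    lem_tokens = [lemmatizer.lemmatize(t) for t in stopwords_removed]

    bag_words = Counter(lem_tokens)
    print(bag_words.most_common(6))

Output: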

      ('avenger'
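The effect of lemmatization on the counts can be illustrated with a toy lookup table. This is only a stand-in for the idea: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the real WordNetLemmatizer consults the WordNet database rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMA_TABLE = {"avengers": "avenger", "films": "film", "minutes": "minute"}

def toy_lemmatize(token):
    """Return the base form if the table knows the token, else the token itself."""
    return LEMMA_TABLE.get(token, token)

tokens = ["avengers", "film", "avenger", "films", "minutes"]

raw_counts = Counter(tokens)
lemma_counts = Counter(toy_lemmatize(t) for t in tokens)

print(raw_counts.most_common(2))
print(lemma_counts.most_common(2))
```

Before lemmatization every token appears once; afterwards the inflected forms are merged, so the dominant terms stand out more clearly.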

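The output above is more useful. Stopwords such as "the" and "was" are gone, and by looking at the new set of common words, we can easily see that the topic of our text is the Avengers.

We have seen how to use the bag-of-words model, after preprocessing, to identify the topic in a corpus. Now we will look at another powerful NLP library for topic modeling, "gensim".

Using Gensim and Latent Dirichlet Allocation (LDA)

Gensim is an open-source NLP library that can be used to create and query a corpus. It works by building word embeddings or vectors, which are then used to perform topic modeling. The LDA example in this guide works on bag-of-words vectors of term counts rather than on learned word embeddings.

Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into the relationships between terms in a corpus. For example, the offset between the vectors for "India" and "New Delhi" may be similar to the offset between "China" and "Beijing", because both pairs represent a "country-capital" relationship.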
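The country-capital intuition can be sketched with made-up two-dimensional vectors. The numbers below are invented purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Invented 2-D vectors for illustration only; not learned from any corpus.
VECTORS = {
    "india": (1.0, 2.0),
    "new_delhi": (2.0, 3.0),
    "china": (4.0, 1.0),
    "beijing": (5.0, 2.1),
}

def offset(start, end):
    """Vector pointing from the `start` word to the `end` word."""
    return tuple(e - s for s, e in zip(VECTORS[start], VECTORS[end]))

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

india_offset = offset("india", "new_delhi")
china_offset = offset("china", "beijing")
similarity = cosine(india_offset, china_offset)
print(round(similarity, 3))
```

A cosine similarity close to 1 means the two offsets point in nearly the same direction, which is how a shared "country-capital" relationship would show up in vector space.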

First, we create nine sample documents taken from the Pluralsight website. They are represented as sample1 through sample9 in the lines of code below. The last line of code gathers these documents into a collection.

    sample1 = "Our board of directors boasts 11 seasoned technology and business leaders from Adobe, GSK, HGGC and more."
    sample2 = "Our executives lead by example and guide us to accomplish great things every day."
    sample3 = "Working at Pluralsight means being surrounded by smart, passionate people who inspire us to do our best work."
    sample4 = "A leadership team with vision."
    sample5 = "Courses on cloud, microservices, machine learning, security, Agile and more."
    sample6 = "Interactive courses and projects."
    sample7 = "Personalized course recommendations from Iris."
    sample8 = "We’re excited to announce that Pluralsight has ranked #9 on the Great Place to Work 2018, Best Medium Workplaces list!"
    sample9 = "Few of the job opportunities include Implementation Consultant - Analytics, Manager - assessment production, Chief Information Officer, Director of Communications."

    # compile documents
    compileddoc = [sample1, sample2, sample3, sample4, sample5, sample6, sample7, sample8, sample9]

Let's inspect the first document, which can be done with the code below.

    print(compileddoc[0])

Output:

      Our board of directors boasts 11 seasoned technology and business leaders from Adobe, GSK, HGGC and more.
    

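In the following sections of this guide, we will perform topic modeling on the corpus "compileddoc". As always, the first step is text preprocessing.

The first three statements below set up the building blocks for cleaning the documents: a stopword set, a punctuation set, and a lemmatizer. Next, we define a function that cleans a document, and in the last line we use that function to create the cleaned documents, named "final_doc".

    # Named stop_words so the imported nltk `stopwords` module is not overwritten
    stop_words = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()

    def clean(document):
        # Drop stopwords, then strip punctuation characters, then lemmatize each word
        stopwordremoval = " ".join([i for i in document.lower().split() if i not in stop_words])
        punctuationremoval = ''.join(ch for ch in stopwordremoval if ch not in exclude)
        normalized = " ".join(lemma.lemmatize(word) for word in punctuationremoval.split())
        return normalized

    final_doc = [clean(document).split() for document in compileddoc]

Now let's look at the first document before and after text cleaning, using the code below.

    print("Before text-cleaning:", compileddoc[0])

    print("After text-cleaning:", final_doc[0])

Output: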

      Before text-cleaning: Our board of directors boasts 11 seasoned technology and business leaders from Adobe, GSK, HGGC and more.
After text-cleaning: ['board', 'director', 'boast', '11', 'seasoned', 'technology', 'business', 'leader', 'adobe', 'gsk', 'hggc', 'more']
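One detail in the output above: "more" survives the cleaning even though it is a common English stopword. The likely reason is the order of operations: the document is split on whitespace first, so the stopword check sees "more." with its trailing period, and the period is only stripped afterwards. The sketch below shows an alternative ordering that removes punctuation first. The stopword set and the name `clean_punct_first` are assumptions made for this illustration, and lemmatization is left out to keep the example dependency-free.

```python
import string

# Small hand-written stopword set for illustration (not NLTK's list).
STOP_WORDS = {"our", "of", "and", "from", "more"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_punct_first(document):
    """Lowercase, strip punctuation, then drop stopwords."""
    no_punct = document.lower().translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word not in STOP_WORDS]

doc = ("Our board of directors boasts 11 seasoned technology and business "
       "leaders from Adobe, GSK, HGGC and more.")
cleaned = clean_punct_first(doc)
print(cleaned)
```

Whether to strip punctuation before or after stopword removal is a design choice; stripping it first makes the stopword match more reliable for words that end a sentence.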
    

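Now we are ready to perform topic modeling on the "final_doc" corpus using a powerful statistical method called Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model: it treats each document as a mixture of topics and each topic as a distribution over words. It is not a classification technique and needs no labels to infer patterns. Instead, it is an unsupervised method that uses a probabilistic model to identify groups of topics.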
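The generative view can be sketched in a few lines of standard-library Python. This is a simplification under stated assumptions: the topics, their word probabilities, and the document's topic mixture are fixed by hand here, whereas LDA draws them from Dirichlet priors and, in practice, runs the process in reverse to infer them from observed documents.

```python
import random

# Hypothetical topic-word distributions, invented for illustration.
TOPICS = {
    "courses": {"course": 0.5, "cloud": 0.3, "security": 0.2},
    "work": {"team": 0.4, "leader": 0.4, "job": 0.2},
}

def generate_document(topic_mixture, length, rng):
    """For each word slot, sample a topic, then sample a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = list(topic_mixture.values())
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocabulary = list(TOPICS[topic])
        probabilities = list(TOPICS[topic].values())
        words.append(rng.choices(vocabulary, weights=probabilities)[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
document = generate_document({"courses": 0.8, "work": 0.2}, length=10, rng=rng)
print(document)
```

A document generated with an 80/20 mixture will tend to contain mostly "courses" words with a few "work" words mixed in, which is the kind of pattern LDA tries to recover.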

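Preparing the document-term matrix for LDA

The first step is to convert the corpus into a matrix representation, as shown in the code below.

The first line of code creates the term dictionary of the corpus, in which every unique term is assigned an index. The second line uses that dictionary to convert the corpus into a document-term matrix. Finally, with the document-term matrix ready, the third line stores a reference to gensim's LDA model class under the name "Lda_object".

    dictionary = corpora.Dictionary(final_doc)

    DT_matrix = [dictionary.doc2bow(doc) for doc in final_doc]

    Lda_object = gensim.models.ldamodel.LdaModel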
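To show what the dictionary and the bag-of-words conversion produce, here is a pure-Python imitation of the two steps. The helpers `build_dictionary` and `doc_to_bow` are hypothetical and are not gensim's implementation; in particular, gensim may assign term ids in a different order.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Represent a document as a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Tiny made-up corpus of already cleaned, tokenized documents.
docs = [["course", "cloud", "course"], ["leadership", "team"]]
token2id = build_dictionary(docs)
bows = [doc_to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bows)
```

Each document becomes a short list of pairs that mentions only the terms it contains, which is why this representation suits a sparse document-term matrix.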

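With the LDA model class in hand, we train it on the document-term matrix. The first line of code below does this by passing "DT_matrix" to "Lda_object". We also need to specify the number of topics and the dictionary. Since we have a small corpus of nine documents, we can limit the number of topics to 2 or 3.

In the lines of code below, we set the number of topics to 2. The second line prints the result.

    lda_model_1 = Lda_object(DT_matrix, num_topics=2, id2word=dictionary)

    print(lda_model_1.print_topics(num_topics=2, num_words=5))

Output: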

(0

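In the output above, each line represents a topic, with its individual terms and their weights. Topic 1 seems to be more about the "courses" offered by Pluralsight, while the second topic seems to be about "work".

We can also change the number of topics and see how that changes the output. In the code below, we choose three topics.

    lda_model_2 = Lda_object(DT_matrix, num_topics=3, id2word=dictionary)

    print(lda_model_2.print_topics(num_topics=3, num_words=5))

Output: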

(0

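The results are almost the same: Topic 2 points to "courses", while Topics 1 and 3 both resemble "work".

Conclusion

In this guide, you learned how to use the bag-of-words technique for topic identification. You were also introduced to LDA using the powerful open-source NLP library "gensim".

The performance of a topic model depends on the terms present in the corpus, represented as a document-term matrix. Because this matrix is sparse by nature, reducing its dimensionality may improve model performance. However, since our corpus is not very large, we can be reasonably confident in the results achieved.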
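One simple way to reduce the dimensionality of a document-term matrix is to drop terms that appear in very few documents. The sketch below does this in plain Python; the function name `prune_rare_terms` and the threshold are assumptions for illustration. Gensim's `Dictionary` offers a `filter_extremes` method for a similar purpose.

```python
from collections import Counter

def prune_rare_terms(docs, min_doc_freq=2):
    """Keep only terms that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    keep = {term for term, freq in doc_freq.items() if freq >= min_doc_freq}
    return [[term for term in doc if term in keep] for doc in docs]

# Tiny made-up corpus of cleaned, tokenized documents.
docs = [
    ["course", "cloud", "security"],
    ["interactive", "course", "project"],
    ["personalized", "course", "recommendation"],
]
pruned = prune_rare_terms(docs)

vocab_before = {term for doc in docs for term in doc}
vocab_after = {term for doc in pruned for term in doc}
print(len(vocab_before), len(vocab_after))
```

On a corpus as small as the one in this guide, aggressive pruning can remove almost everything, so the threshold needs to be chosen with the corpus size in mind.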

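To learn more about natural language processing, see the following guides:

[Natural Language Processing - Text Parsing](/guides/text-parsing)
[Natural Language Processing - Machine Learning with Text Data](/guides/nlp-machine-learning-text-data)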


