Visualizing Text Data Using a Word Cloud

2019-06-27 08:00:00 · 飞浪

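Introduction

Text data has grown exponentially in recent years, and with it the need to analyze large volumes of such data. A word cloud is a useful way to explore text: it visualizes the data as tags or words, with the importance of each word conveyed by its frequency.

In this guide, we will learn how to create a word cloud and find the important words that help extract insights from the data. We will begin by understanding the problem statement and the data.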

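Problem Statement

The data concerns a topic that every email user has run into at some point: "spam" emails, which are unsolicited messages that often advertise a product, contain links to malware, or attempt to scam the recipient.

In this guide, we will use a publicly available dataset first described in the 2006 conference paper "Spam Filtering with Naive Bayes - Which Naive Bayes?" by V. Metsis, I. Androutsopoulos, and G. Paliouras. The "ham" messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as a roughly 75/25 mix of ham and spam messages.

The dataset contains just two fields:

text - the text of the email.
spam - a binary variable indicating whether the email is spam (1) or not (0).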

Let us start by importing the required libraries.

Importing Libraries

    # Suppress warnings
    import warnings
    warnings.filterwarnings('ignore')

    # Load all necessary libraries
    import numpy as np
    import pandas as pd

    import string
    import collections
    from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
    import matplotlib.cm as cm
    import matplotlib.pyplot as plt

    # Jupyter/IPython magic: render plots inline (remove when running as a plain script)
    %matplotlib inline


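Reading the File and Understanding the Data

The first line of code below reads the data into a pandas DataFrame, and the second prints its shape: 5,726 observations of 2 variables. The third line prints the first five records. As explained above, there are only two variables, "text" and "spam". Most of the emails are "ham" emails, labeled 0, which make up about 76% of the data.

    # Load the data file
    df = pd.read_csv('emails2.csv')

    # Shape of the DataFrame
    print('The shape of the dataframe is :', df.shape)

    # First few records
    df.head()

Output: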

      The shape of the dataframe is : (5726, 2)

|   	| text                                              	| spam 	|
|---	|---------------------------------------------------	|------	|
| 0 	| Subject: naturally irresistible your corporate... 	| 1    	|
| 1 	| Subject: the stock trading gunslinger fanny i...  	| 1    	|
| 2 	| Subject: unbelievable new homes made easy im ...  	| 1    	|
| 3 	| Subject: 4 color printing special request add...  	| 1    	|
| 4 	| Subject: do not have money , get software cds ... 	| 1    	|
    

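Let us check whether the "text" variable contains any missing values, which can be done with the line of code below. The output shows that there are none.

    # Check for null values in `text`
    df['text'].isnull().sum()

Output:

    0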

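We will start by building the word cloud for all the spam emails. The first line of code below filters the data down to the spam emails, while the second prints the shape: 1,368 observations of 2 variables.

    # .copy() gives an independent DataFrame, so the later in-place edits do not touch df
    spam1 = df[df.spam == 1].copy()
    print(spam1.shape)

Output: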

      (1368, 2)
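
The class balance quoted earlier can be checked from the two shapes printed so far. The sketch below is only an illustration: the tiny DataFrame `toy` and its rows are made up for the example, while the counts 5726 and 1368 are the ones reported in this guide.

```python
import pandas as pd

# Hypothetical stand-in for the real file, with the same two columns.
toy = pd.DataFrame({
    "text": ["Subject: win money", "Subject: meeting notes",
             "Subject: project update", "Subject: lunch on friday"],
    "spam": [1, 0, 0, 0],
})

# normalize=True returns proportions instead of raw counts.
label_share = toy["spam"].value_counts(normalize=True)
print(label_share)

# The same arithmetic for the shapes reported in this guide.
total_rows, spam_rows = 5726, 1368
spam_share = spam_rows / total_rows
print(f"spam: {spam_share:.1%}, ham: {1 - spam_share:.1%}")
```

On the real data, `df['spam'].value_counts(normalize=True)` should give the same roughly 24/76 split.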

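Data Cleaning and Preparation

It is important to clean the data before building the word cloud. The common steps are carried out in the sections that follow.

Converting the Text to Lowercase

The first line of code below converts the text to lowercase, while the second prints the first five records. This ensures that words such as "Enron" and "enron" are treated as the same word when counting frequencies.

    spam1['text'] = spam1['text'].str.lower()
    spam1['text'].head()

Output: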

    0    subject: naturally irresistible your corporate...
    1    subject: the stock trading gunslinger  fanny i...
    2    subject: unbelievable new homes made easy  im ...
    3    subject: 4 color printing special  request add...
    4    subject: do not have money , get software cds ...
    Name: text, dtype: object
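
To see why lowercasing matters for frequency counts, here is a minimal sketch using only the standard library. The token list is invented for illustration and is not taken from the dataset.

```python
from collections import Counter

# Made-up tokens with mixed casing.
tokens = ["Enron", "enron", "ENRON", "Free", "free"]

# Without normalization, each casing is counted as a different word.
raw_counts = Counter(tokens)
print(raw_counts)

# After lowercasing, the variants collapse into a single entry.
lowered_counts = Counter(token.lower() for token in tokens)
print(lowered_counts)
```

The second counter holds two entries instead of five, which is the behavior the word cloud needs.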
    

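Splitting and Removing Punctuation from the Text

The first line of code below splits each email into a list of tokens on the space character, and the second prints the first five records. Punctuation is stripped from the tokens in the next step.

    all_spam = spam1['text'].str.split(' ')
    all_spam.head()

Output: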

    0    [subject:, naturally, irresistible, your, corp...
    1    [subject:, the, stock, trading, gunslinger, , ...
    2    [subject:, unbelievable, new, homes, made, eas...
    3    [subject:, 4, color, printing, special, , requ...
    4    [subject:, do, not, have, money, ,, get, softw...
    Name: text, dtype: object
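
The empty entries visible in the lists above (for example after "gunslinger") come from splitting on a single space when the text contains two spaces in a row. The sketch below reproduces this with plain Python strings; the sample line is adapted from the second record shown above.

```python
line = "subject: the stock trading gunslinger  fanny"

# Splitting on one space keeps an empty string between consecutive spaces.
by_space = line.split(" ")
print(by_space)

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = line.split()
print(by_whitespace)
```

The empty tokens do no harm in this guide, because the final frequency count later calls `split()` with no argument on the joined text. Calling `.str.split()` without a pattern in pandas would avoid them from the start.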
    

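Joining the Entire Text

In this step, we strip punctuation from each token and join all the "text" records into a single string. This is required to build the text corpus on which the word cloud will be built. The lines of code below perform this task.

    all_spam_cleaned = []

    # Strip leading and trailing punctuation from every token
    for text in all_spam:
        text = [x.strip(string.punctuation) for x in text]
        all_spam_cleaned.append(text)

    # Inspect the cleaned tokens of the first email (displayed only when run as its own cell)
    all_spam_cleaned[0]

    # Join the tokens back into one string per email, then all emails into one corpus
    text_spam = [" ".join(text) for text in all_spam_cleaned]
    final_text_spam = " ".join(text_spam)
    final_text_spam[:500]

Output: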

      'subject naturally irresistible your corporate identity  lt is really hard to recollect a company  the  market is full of suqgestions and the information isoverwhelminq  but a good  catchy logo  stylish statlonery and outstanding website  will make the task much easier   we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader it isguite ciear that  without good products  effective business organization and practicable aim it  will be hotat nowadays mar'
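
Note that `str.strip(string.punctuation)` removes punctuation only from the two ends of a token. The sketch below, on a made-up token list, shows what it does and does not remove. It also shows `str.translate` as one possible alternative (an assumption of this example, not something the guide uses) for removing punctuation everywhere in a token.

```python
import string

# Made-up tokens covering a few punctuation patterns.
tokens = ["subject:", ",", "don't", "(free)", "e-mail", "$100!!"]

# strip() only trims punctuation characters from the start and end.
stripped = [token.strip(string.punctuation) for token in tokens]
print(stripped)

# A translation table that deletes every punctuation character, wherever it is.
remove_punct = str.maketrans("", "", string.punctuation)
removed_everywhere = [token.translate(remove_punct) for token in tokens]
print(removed_everywhere)
```

With `strip`, "don't" and "e-mail" keep their inner characters, and a token that was pure punctuation becomes an empty string.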
    

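Word Cloud for "Spam" Emails

Let us build the first word cloud. The first line of code generates the word cloud on the "final_text_spam" corpus, while the second to fifth lines display it.

    wordcloud_spam = WordCloud(background_color="white").generate(final_text_spam)

    # Lines 2 - 5
    plt.figure(figsize=(20, 20))
    plt.imshow(wordcloud_spam, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Output: (word cloud image of the spam corpus)

The word cloud displayed above looks good, but some words are larger than others. This is because the size of a word in the cloud reflects its frequency in the corpus: the more often a word appears, the larger it is drawn. Several parameters can be adjusted to change how the word cloud looks. In a Jupyter notebook, the "?WordCloud" command lists them.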
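
As a rough illustration of the link between frequency and size, the sketch below counts words in a made-up sentence and expresses each count relative to the most frequent word. It is a conceptual sketch only: the font sizes the wordcloud library actually picks also depend on settings such as relative_scaling and on how the words fit into the layout, none of which is modeled here.

```python
from collections import Counter

# Made-up text for illustration.
text = "free money free offer click free money now"

counts = Counter(text.split())

# Weight of each word relative to the most frequent one (which gets 1.0).
top_count = counts.most_common(1)[0][1]
relative = {word: n / top_count for word, n in counts.most_common()}

for word, weight in relative.items():
    print(f"{word}: {weight:.2f}")
```

In a cloud built from this text, "free" would be the most prominent word and "money" the next.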

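In this guide, we will use the following parameters:

background_color: the background color of the word cloud image. The default is "black".
max_font_size: the maximum font size for the largest word. If None, the height of the image is used.
max_words: the maximum number of words to display. The default is 200.
stopwords: the words to exclude when building the word cloud. If None, the built-in STOPWORDS list is used.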

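Let us modify the earlier word cloud to include these parameters. The first line of code below starts from the existing stopword list. In the word cloud built earlier, words such as "subject", "will", "us", "enron", and "re" were frequent but did not provide much insight. The second line adds words specific to our data to the stopword set.

The third line generates the word cloud on the "final_text_spam" corpus. Note that we have changed some optional parameters, namely max_font_size, max_words, and background_color, to better visualize the word cloud.

The fourth to seventh lines plot the word cloud. The argument interpolation='bilinear' makes the displayed image look smoother.

    stopwords = set(STOPWORDS)
    stopwords.update(["subject", "re", "vince", "kaminski", "enron", "cc", "will", "s", "1", "e", "t"])

    wordcloud_spam = WordCloud(stopwords=stopwords, background_color="white", max_font_size=50, max_words=100).generate(final_text_spam)

    # Lines 4 to 7
    plt.figure(figsize=(15, 15))
    plt.imshow(wordcloud_spam, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Output: (word cloud image of the spam corpus with the custom stopwords removed)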

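As the image above shows, the stopwords are no longer displayed. We also observe that words such as new, account, company, program, and mail are the most prominent in the word cloud. Next, we will look at another technique: extracting the most popular words as a frequency table. In our case, we will extract the 30 most common words. The lines of code below perform this task and print the top 30 words along with their counts.

    # Drop the stopwords, then count the remaining words
    filtered_words_spam = [word for word in final_text_spam.split() if word not in stopwords]
    counted_words_spam = collections.Counter(filtered_words_spam)

    # Keep the 30 most common words and their counts
    word_count_spam = {}
    for word, count in counted_words_spam.most_common(30):
        word_count_spam[word] = count

    for word, count in word_count_spam.items():
        print('Word: {0}, count: {1}'.format(word, count))

Output: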

    Word: business, count: 844
    Word: company, count: 805
    Word: email, count: 804
    Word: information, count: 740
    Word: 5, count: 687
    Word: money, count: 662
    Word: 2, count: 613
    Word: free, count: 606
    Word: 3, count: 604
    Word: mail, count: 586
    Word: one, count: 581
    Word: please, count: 581
    Word: now, count: 575
    Word: 000, count: 560
    Word: us, count: 537
    Word: click, count: 531
    Word: time, count: 521
    Word: new, count: 504
    Word: make, count: 496
    Word: may, count: 489
    Word: website, count: 465
    Word: adobe, count: 462
    Word: 0, count: 450
    Word: software, count: 438
    Word: message, count: 418
    Word: 10, count: 405
    Word: list, count: 392
    Word: report, count: 391
    Word: 2005, count: 374
    Word: want, count: 364
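
Several entries in the table above, such as "5", "000", and "2005", are numbers rather than words. One possible refinement, sketched here on a made-up token list and not part of the original workflow, is to keep only alphabetic tokens before counting.

```python
from collections import Counter

# Made-up tokens mixing words and numbers.
words = ["business", "5", "money", "000", "free",
         "2005", "free", "5", "money", "free"]

# str.isalpha() is True only for tokens made entirely of letters.
alpha_only = [word for word in words if word.isalpha()]

top_words = Counter(alpha_only).most_common(2)
print(top_words)
```

Be aware that `isalpha()` also drops mixed tokens such as "e-mail" or "mp3", so whether this filter is appropriate depends on the data.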
    

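Conclusion

In this guide, you learned how to build a word cloud and which parameters can be changed to improve its appearance. You also learned how to extract the most frequent words while using a stopword list to identify and remove noise.

We applied these techniques to the "spam" emails in the dataset. Similar steps can be followed to create word clouds for the "ham" emails or for the entire text. The important words can then be used for decision making or as features for model building.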
