
Python: Word Frequency Counting for Chinese Text Files + Chinese Word Cloud

Source: 化拓教育网

1. Word frequency count:

import jieba

# Read the whole novel into one string (the file is expected to be UTF-8).
with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters (punctuation, particles)
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for word, count in items[:15]:  # print the 15 most frequent words
    print("{0:<10}{1:>5}".format(word, count))

The result is:

曹操 946
孔明 737
将军 622
玄德 585
却说 534
关公 509
荆州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
张飞 358
如此 320
不能 318
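
The counting loop above can also be written with collections.Counter from the standard library. The following is only a sketch of that variant, not the original author's code: the helper name top_words and its excludes parameter are illustrative assumptions. It takes an already segmented word list (for example the result of jieba.lcut), so it runs without the novel file, and it can drop non-name words such as 却说 that show up in the list above.

```python
from collections import Counter


def top_words(words, n=15, excludes=()):
    """Return the n most frequent multi-character words as (word, count) pairs.

    `words` is an already segmented list, e.g. the output of jieba.lcut(txt).
    `excludes` is a hypothetical stop-word collection that is dropped entirely.
    """
    excluded = set(excludes)
    counts = Counter(
        w for w in words
        if len(w) > 1 and w not in excluded  # skip single characters and stop words
    )
    return counts.most_common(n)


# Tiny hand-made sample standing in for jieba.lcut(txt).
sample = ["曹操", "曰", "孔明", "曹操", "却说", "孔明", "曹操", "，"]
for word, count in top_words(sample, n=3, excludes={"却说"}):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common(n) returns at most n pairs, so short inputs do not raise an IndexError.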

As a further improvement, I only want to count how often each character appears. The code is as follows:

import jieba

with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()
names = {'曹操','孔明','刘备','关羽','张飞','吕布','赵云','孙权','周瑜','袁绍','黄忠','魏延'}
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single characters, then merge each character's aliases
    # into one canonical name before counting.
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Of the 40 most frequent words, print only the character names.
for word, count in items[:40]:
    if word in names:
        print("{0:<10}{1:>5}".format(word, count))

The output is:

曹操 1358
孔明 1265
刘备 1251
关羽 783
张飞 358
吕布 300
赵云 278
孙权 257
周瑜 217
袁绍 191
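
The alias handling in the listing above can also be expressed as a lookup table, which is easier to extend than a chain of elif branches. This is a minimal sketch under stated assumptions: ALIASES, NAMES, and count_characters are hypothetical names introduced here, and the alias pairs are copied from the listing. Unlike the original, it counts only words that resolve to a known name instead of filtering the 40 most frequent words afterwards, so rarer names such as 黄忠 would also be reported.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif chain in the listing above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
NAMES = {"曹操", "孔明", "刘备", "关羽", "张飞", "吕布",
         "赵云", "孙权", "周瑜", "袁绍", "黄忠", "魏延"}


def count_characters(words, aliases=ALIASES, names=NAMES):
    """Count appearances per character after folding aliases together."""
    counts = Counter()
    for word in words:
        if len(word) == 1:  # same single-character filter as the original
            continue
        canonical = aliases.get(word, word)  # unknown words map to themselves
        if canonical in names:
            counts[canonical] += 1
    return counts.most_common()  # (name, count) pairs, most frequent first


# Hand-made sample standing in for jieba.lcut(txt).
sample = ["玄德", "曰", "关公", "云长", "刘备", "丞相", "荆州"]
for name, count in count_characters(sample):
    print("{0:<10}{1:>5}".format(name, count))
```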

Going one step further, generate a word cloud:

import os

import jieba
import wordcloud


def getText(file):
    # Read the file and return its words joined by spaces, so that
    # WordCloud.generate() treats each segmented word as a separate token.
    with open(file, 'r', encoding='UTF-8') as f:
        txt = f.read()
    return ' '.join(jieba.lcut(txt))


filename = input()  # base name of the text file, without ".txt"
txt = getText(filename + '.txt')
wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
wordclouds.to_file('{}.png'.format(filename))

# Open the image with the default viewer (this shell call works on Windows).
os.system('{}.png'.format(filename))

The character names can be normalized further; see the code in the second part.

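By default, the wordcloud library renders Chinese text as garbled characters. The usual fix is to pass a font file that contains Chinese glyphs through the font_path argument of WordCloud.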
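
One way to address the garbled output is sketched below. It counts the segmented words first and then draws the cloud from those counts with an explicitly chosen font. The helper names build_frequencies and render_cloud are assumptions made for this illustration, and the font file is whatever Chinese-capable font is installed on your machine; no particular path is implied by the original post.

```python
from collections import Counter


def build_frequencies(words, min_len=2):
    """Map each word of at least min_len characters to its count."""
    return dict(Counter(w for w in words if len(w) >= min_len))


def render_cloud(frequencies, font_path, out_path):
    """Draw a word cloud from a {word: count} dict using the given font file."""
    import wordcloud  # imported here so build_frequencies works without it

    cloud = wordcloud.WordCloud(
        font_path=font_path,  # must point to a font that contains Chinese glyphs
        width=1000, height=800, margin=2,
    )
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
```

A call could look like render_cloud(build_frequencies(jieba.lcut(txt)), "msyh.ttc", "threekingdoms3.png"), where "msyh.ttc" is only a placeholder for a font file available locally. Feeding in merged name counts from the second part instead would give a cloud of characters only.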

 


Reposted from: https://www.cnblogs.com/116970u/p/11611821.html
