kNakajima's Blog

A blog where I write up what I learn about technology.

Solving Chapter 4 of the NLP 100 Exercise (言語処理100本ノック) in Python

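I worked through Chapter 4 of the NLP 100 Exercise.
Chapter 4 covers morphological analysis with MeCab.
Treat these as sample answers. If you have questions or spot mistakes, please leave a comment.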

NLP 100 Exercise (言語処理100本ノック), Chapter 4

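Preparation

Use MeCab to run morphological analysis on the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result to a file named neko.txt.mecab. Use this file to implement programs that answer the following problems.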

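import MeCab

# ChaSen-style output: surface, reading, base form, part of speech, ... (tab-separated)
mecab = MeCab.Tagger("-Ochasen")

# Open the output in 'w' mode so that re-running the script overwrites the file
# instead of appending a second copy of the analysis.
with open('neko.txt', 'r') as neko, open('neko.txt.mecab', 'w') as neko_mecab:
    text = neko.read()
    neko_mecab.write(mecab.parse(text))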

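30. Reading the morphological analysis results

Implement a program that reads the morphological analysis results (neko.txt.mecab). Store each morpheme in a mapping type with the keys surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1), and represent each sentence as a list of morphemes (mapping types). Use the program written here in the remaining problems of Chapter 4.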

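# Flat list of morpheme dicts for the whole text
mecab_list = []

with open('neko.txt.mecab', 'r') as neko_mecab:
    for line in neko_mecab:
        # ChaSen format: surface, reading, base form, part of speech, ... (tab-separated)
        fields = line.rstrip('\n').split('\t')

        # Skip the EOS marker and any line that is not a full morpheme record
        if fields[0] == 'EOS' or len(fields) < 4:
            continue

        # The part of speech looks like '名詞-一般'; the first two levels are pos and pos1
        hinshi = fields[3].split('-')
        mecab_list.append({
            'surface': fields[0],
            'base':    fields[2],
            'pos':     hinshi[0],
            # Always set pos1 so that later lookups never raise KeyError
            'pos1':    hinshi[1] if len(hinshi) > 1 else '',
        })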
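
The problem statement asks for each sentence to be a list of morphemes, while the code above builds one flat list for the whole text. One possible way to bridge the two is sketched below. It assumes that a sentence ends at the full stop 。, and the function name split_into_sentences and the hand-written sample data are only for illustration (they are not real MeCab output).

```python
def split_into_sentences(morphemes, terminator='。'):
    """Group a flat list of morpheme dicts into sentences (lists of dicts)."""
    sentences = []
    current = []
    for morpheme in morphemes:
        current.append(morpheme)
        if morpheme['surface'] == terminator:
            sentences.append(current)
            current = []
    if current:  # keep a trailing fragment that has no terminator
        sentences.append(current)
    return sentences


# Hand-written sample standing in for mecab_list (illustrative values only)
sample = [
    {'surface': '吾輩', 'base': '吾輩', 'pos': '名詞', 'pos1': '代名詞'},
    {'surface': 'は', 'base': 'は', 'pos': '助詞', 'pos1': '係助詞'},
    {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '一般'},
    {'surface': '。', 'base': '。', 'pos': '記号', 'pos1': '句点'},
    {'surface': '名前', 'base': '名前', 'pos': '名詞', 'pos1': '一般'},
    {'surface': 'は', 'base': 'は', 'pos': '助詞', 'pos1': '係助詞'},
    {'surface': 'ない', 'base': 'ない', 'pos': '形容詞', 'pos1': '自立'},
    {'surface': '。', 'base': '。', 'pos': '記号', 'pos1': '句点'},
]

sentences = split_into_sentences(sample)
for sentence in sentences:
    print(''.join(m['surface'] for m in sentence))
```

Because the result is a list of lists, the later problems would need one extra loop over sentences, which is why the flat list is convenient for the rest of this post.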

31. Verbs

Extract the surface forms of all verbs.

verb_list = []

for word in mecab_list:
    if word['pos'] == '動詞':
        verb_list.append(word['surface'])

32. Base forms of verbs

Extract the base forms of all verbs.

verb_prot_list = []

for word in mecab_list:
    if word['pos'] == '動詞':
        verb_prot_list.append(word['base'])
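
Problems 31 and 32 differ only in which key is collected, so the two loops can be folded into one small helper. The following is a minimal sketch: the name extract_by_pos and the sample morphemes are made up for illustration, and the dict layout is assumed to match the one built in problem 30.

```python
def extract_by_pos(morphemes, pos, key):
    """Return the value stored under `key` for every morpheme whose part of speech is `pos`."""
    return [m[key] for m in morphemes if m['pos'] == pos]


# Hand-written sample standing in for mecab_list (illustrative values only)
sample = [
    {'surface': '生れ', 'base': '生れる', 'pos': '動詞', 'pos1': '自立'},
    {'surface': 'た', 'base': 'た', 'pos': '助動詞', 'pos1': ''},
    {'surface': '見当', 'base': '見当', 'pos': '名詞', 'pos1': 'サ変接続'},
    {'surface': 'つか', 'base': 'つく', 'pos': '動詞', 'pos1': '自立'},
]

verb_surfaces = extract_by_pos(sample, '動詞', 'surface')  # problem 31
verb_bases = extract_by_pos(sample, '動詞', 'base')        # problem 32
print(verb_surfaces)
print(verb_bases)
```

If duplicates are not wanted, wrapping either result in set() gives the distinct forms.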

33. Sahen nouns

Extract all nouns of the sahen-connection type (サ変接続).

sahen_meishi_list = []

for word in mecab_list:
    # .get() avoids a KeyError for any morpheme stored without a 'pos1' key
    if word['pos'] == '名詞' and word.get('pos1') == 'サ変接続':
        sahen_meishi_list.append(word['base'])

34. "A no B" (AのB)

Extract noun phrases in which two nouns are joined by の.

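rentai_list = []
# Start at 1 and stop one short of the end so that i-1 and i+1 are always valid indices
# (i-1 would wrap around to the last element at i == 0, and i+1 would raise IndexError at the end)
for i in range(1, len(mecab_list) - 1):
    if mecab_list[i]['surface'] == 'の':
        if mecab_list[i-1]['pos'] == '名詞' and mecab_list[i+1]['pos'] == '名詞':
            word = mecab_list[i-1]['surface'] + mecab_list[i]['surface'] + mecab_list[i+1]['surface']
            rentai_list.append(word)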
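
The same search can be written without index arithmetic by sliding a window of three morphemes over the list with zip. This is only a sketch of an alternative: the function name noun_no_noun and the sample data are invented for illustration.

```python
def noun_no_noun(morphemes):
    """Collect 'A の B' phrases where A and B are both nouns."""
    phrases = []
    # zip stops at the shortest input, so every triple is a valid (previous, middle, next) window
    for prev, mid, nxt in zip(morphemes, morphemes[1:], morphemes[2:]):
        if mid['surface'] == 'の' and prev['pos'] == '名詞' and nxt['pos'] == '名詞':
            phrases.append(prev['surface'] + mid['surface'] + nxt['surface'])
    return phrases


def m(surface, pos):
    """Build a minimal morpheme dict for the sample below."""
    return {'surface': surface, 'pos': pos}


# Illustrative sample; the leading and trailing の exercise the boundaries
sample = [
    m('の', '助詞'), m('彼', '名詞'), m('の', '助詞'), m('掌', '名詞'),
    m('の', '助詞'), m('上', '名詞'), m('の', '助詞'),
]

phrases = noun_no_noun(sample)
print(phrases)
```

A の at the very start or end of the list never becomes the middle of a window, so no boundary checks are needed.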

35. Noun sequences

Extract sequences of consecutive nouns using longest match.

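rensetsu_meishi_list = []

word = ''  # concatenated surfaces of the current run of nouns
meishi_count = 0
for i in range(len(mecab_list)):
    if mecab_list[i]['pos'] == '名詞':
        word += mecab_list[i]['surface']
        meishi_count += 1
    else:
        # A run just ended; keep it only if it contained at least two nouns
        if meishi_count > 1:
            rensetsu_meishi_list.append(word)

        word = ''
        meishi_count = 0

# Flush a run that reaches the end of the text
if meishi_count > 1:
    rensetsu_meishi_list.append(word)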
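
Longest-match extraction of noun runs is a natural fit for itertools.groupby, which splits a sequence into maximal runs that share the same key. The sketch below assumes the same dict layout as above; the function name noun_runs and the sample data are made up for illustration.

```python
from itertools import groupby


def noun_runs(morphemes, min_length=2):
    """Return the concatenated surface of every maximal run of at least min_length nouns."""
    runs = []
    for is_noun, group in groupby(morphemes, key=lambda m: m['pos'] == '名詞'):
        group = list(group)
        if is_noun and len(group) >= min_length:
            runs.append(''.join(m['surface'] for m in group))
    return runs


def m(surface, pos):
    """Build a minimal morpheme dict for the sample below."""
    return {'surface': surface, 'pos': pos}


# Illustrative sample: one single noun, then two runs, the second ending the list
sample = [
    m('種族', '名詞'), m('は', '助詞'),
    m('人間', '名詞'), m('中', '名詞'), m('で', '助詞'),
    m('一番', '名詞'), m('獰悪', '名詞'),
]

runs = noun_runs(sample)
print(runs)
```

Because groupby yields the final run like any other, a sequence of nouns at the very end of the text is not lost.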

36. Word frequency

Find the words that appear in the text and their frequencies, and sort them in descending order of frequency.

word_count = {}

for word in mecab_list:
    if word['surface'] in word_count:
        word_count[word['surface']] += 1
    else:
        word_count[word['surface']] = 1
     
word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
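
The counting loop and the sort can also be expressed with collections.Counter from the standard library, whose most_common() method returns (word, count) pairs in descending order of count. A minimal sketch, with an invented function name and sample data:

```python
from collections import Counter


def count_surfaces(morphemes):
    """Return (surface, count) pairs sorted from most to least frequent."""
    return Counter(m['surface'] for m in morphemes).most_common()


# Illustrative sample: only the 'surface' key matters here
sample = [{'surface': s} for s in ['の', 'は', 'の', '猫', 'の', 'は']]

counts = count_surfaces(sample)
print(counts)
```

The result has the same shape as the sorted word_count list above, so it can be used in the plotting code that follows.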

37. Top 10 most frequent words

Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart).

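import matplotlib.pyplot as plt
import japanize_matplotlib  # lets matplotlib render Japanese tick labels

label = []
number = []

# word_count is sorted by frequency, so its first 10 entries are the top 10
for word in word_count[:10]:
    label.append(word[0])
    number.append(word[1])

# Use len(label) rather than a fixed 10 so that short texts do not break the plot
plt.bar(range(len(label)), number, tick_label=label)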
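
The data preparation for the bar chart can be separated from the plotting, which makes it easy to check without opening a figure. The sketch below assumes the input is a list of (word, count) pairs already sorted by count, as word_count is after problem 36; the function name top_n and the sample values are hypothetical.

```python
def top_n(sorted_counts, n=10):
    """Split the first n (word, count) pairs into a list of labels and a list of counts."""
    top = sorted_counts[:n]  # slicing is safe even when fewer than n entries exist
    labels = [word for word, _ in top]
    numbers = [count for _, count in top]
    return labels, numbers


# Illustrative sample in the same shape as word_count
sample_counts = [('の', 5), ('は', 3), ('猫', 2), ('名前', 1)]

labels, numbers = top_n(sample_counts, n=3)
print(labels, numbers)
```

The two lists can then be passed to plt.bar as tick_label and height respectively.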

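38. Histogram

Draw a histogram of word frequencies (a bar chart with frequency on the horizontal axis and the number of distinct words having that frequency on the vertical axis).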

import matplotlib.pyplot as plt
import japanize_matplotlib

number = []
for word in word_count:
    number.append(word[1])
    
plt.hist(number, bins=100)
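
What the histogram shows is, for each frequency, how many distinct words occur exactly that many times. With bins=100 several frequencies can share one bar, so it may help to compute the exact numbers as well. A minimal sketch using collections.Counter, with an invented function name and sample data:

```python
from collections import Counter


def frequency_histogram(sorted_counts):
    """Map each occurrence count to the number of distinct words that have it."""
    return Counter(count for _, count in sorted_counts)


# Illustrative sample in the same shape as word_count
sample_counts = [('の', 5), ('は', 3), ('猫', 1), ('名前', 1), ('吾輩', 1)]

histogram = frequency_histogram(sample_counts)
for frequency in sorted(histogram):
    print(frequency, histogram[frequency])
```

In this sample three words occur once, while the frequencies 3 and 5 each belong to a single word.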

39. Zipf's law

Plot a log-log graph with the frequency rank of each word on the horizontal axis and its frequency on the vertical axis.

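import matplotlib.pyplot as plt
import japanize_matplotlib

# word_count is sorted by frequency (descending), so position + 1 is the rank
number = []
for word in word_count:
    number.append(word[1])

rank = range(1, len(number) + 1)

# Plot frequency against rank with both axes on a log scale
# (a histogram with log=True would only log-scale the bin counts)
plt.plot(rank, number)
plt.xscale('log')
plt.yscale('log')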
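
Zipf's law says that frequency is roughly inversely proportional to rank, which appears as a straight line with a slope near -1 on a log-log plot. One way to check this numerically is to fit a least-squares line to log(frequency) against log(rank). The sketch below uses only the standard library; the function name zipf_slope is made up, and the input is synthetic data that follows the law exactly, not the counts from neko.txt.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be sorted in descending order and contain at least two values.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic frequencies following f = 1200 / rank exactly
ideal = [1200 / rank for rank in range(1, 51)]

slope = zipf_slope(ideal)
print(round(slope, 6))
```

Passing the number list built above would give the slope for the actual text.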