[NLP] 5. Semantic Network Analysis

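Social network analysis is a technique that represents the objects of analysis and the relationships among them as a network structure and describes that structure quantitatively.
- It is effective for analyzing relationships between entities such as people, places, and goods, and is commonly applied to things like friendship relations and power supply networks.

Semantic network analysis applies the techniques of social network analysis to the relationships between words in a text.
Words that appear together within a given range are treated as connected, and these connections are what the analysis examines.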

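1. Quantifying Lexical Co-occurrence Frequency

Co-occurrence means that two or more words appear together within a given range or distance.
Analyzing co-occurrence relations between words yields abstracted information from documents or sentences, such as whether two words have similar meanings.
Co-occurrence frequency can be quantified, for example as a probability, by counting the words that appear together within a specified range called a window.
- For example, if a word is followed by an incorrect word, it can be corrected to a word with a high co-occurrence frequency.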
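As a minimal sketch of window-based counting, the snippet below counts how often two distinct words fall within a fixed number of positions of each other. The helper name `cooccurrence_counts`, the whitespace tokenization, and the window size of 2 are illustrative assumptions rather than part of the walkthrough.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered pairs of distinct tokens at most `window` positions apart."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens so each pair is counted once
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

# Whitespace splitting stands in for a real tokenizer (assumption)
tokens = "I love data science and deep learning".split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[("I", "love")])     # adjacent words, inside the window
print(counts[("I", "science")])  # three positions apart, outside the window
```

A larger window links more distant words, which makes the resulting network denser; the bigram approach used in the rest of this post corresponds to a window of 1 with word order preserved.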

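The co-occurrence frequency matrix can be filled in entry by entry, but counting bigrams is a more convenient way to build it.

# word_tokenize needs the Punkt tokenizer data (install it with nltk.download)
from nltk import ConditionalFreqDist, bigrams, word_tokenize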

sentences = ['I love data science and deep learning','I love science','I know this code']
tokens = [word_tokenize(x) for x in sentences]
bgrams = [bigrams(x) for x in tokens]

token = []
for i in bgrams:
  token += ([x for x in i])

cfd = ConditionalFreqDist(token)
cfd.conditions()
----------------------------------------------
['I', 'love', 'data', 'science', 'and', 'deep', 'know', 'this']

print(cfd['I'])
print(cfd['I']['love']) # number of times 'I' is followed by 'love'
print(cfd['I'].most_common(1)) # word that most often follows 'I'
----------------------------------------------------------
<FreqDist with 2 samples and 3 outcomes>
2
[('love', 2)]

import numpy as np

freq_matrix = []
for i in cfd.keys():
  temp = []
  for j in cfd.keys():
    temp.append(cfd[i][j])
  freq_matrix.append(temp)

freq_matrix = np.array(freq_matrix)

print(cfd.keys())
print(freq_matrix)
-----------------------------------
dict_keys(['I', 'love', 'data', 'science', 'and', 'deep', 'know', 'this'])
[[0 2 0 0 0 0 1 0]
 [0 0 1 1 0 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0]]
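Note that `cfd.keys()` contains only words that start at least one bigram, so 'learning' and 'code', which appear only at the end of a sentence, get no row or column in the matrix above (this is why the 'deep' row is all zeros). One possible way to keep every word is sketched below; whitespace splitting in place of `word_tokenize` and the names `pair_counts`, `vocab`, and `full_matrix` are assumptions made for illustration.

```python
from collections import Counter

sentences = ['I love data science and deep learning', 'I love science', 'I know this code']
tokenized = [s.split() for s in sentences]  # stand-in for word_tokenize (assumption)

# Count ordered adjacent pairs (bigrams) across all sentences
pair_counts = Counter()
for tokens in tokenized:
    pair_counts.update(zip(tokens, tokens[1:]))

# Vocabulary in first-seen order, including sentence-final words
vocab = list(dict.fromkeys(word for tokens in tokenized for word in tokens))

# Square matrix over the full vocabulary: rows are first words, columns are second words
full_matrix = [[pair_counts[(a, b)] for b in vocab] for a in vocab]

print(vocab)
print(full_matrix[vocab.index('deep')][vocab.index('learning')])
```

With the full vocabulary the matrix is 10 x 10 instead of 8 x 8, and the 'deep' to 'learning' and 'this' to 'code' bigrams are no longer lost.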

The co-occurrence frequency matrix `freq_matrix` can be displayed as a color-graded DataFrame.

import pandas as pd

df=pd.DataFrame(freq_matrix,index=cfd.keys(),columns=cfd.keys())
df.style.background_gradient(cmap='coolwarm')

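The co-occurrence frequency matrix can also be regarded as an adjacency matrix.

Let's use the networkx library to convert the DataFrame above into a graph.
- When a NumPy array is used instead, the node labels must be assigned separately.
- `from_pandas_adjacency` builds an undirected graph by default, so the word order within each bigram is not preserved.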

import networkx as nx

G = nx.from_pandas_adjacency(df)

print(G.nodes())
print(G.edges())
------------------------------
['I', 'love', 'data', 'science', 'and', 'deep', 'know', 'this']
[('I', 'love'), ('I', 'know'), ('love', 'data'), ('love', 'science'), ('data', 'science'), ('science', 'and'), ('and', 'deep'), ('know', 'this')]

Inspecting an individual edge shows that the co-occurrence frequency of the two words is stored as the edge weight.

print(G.edges()[('I','love')])
print(G.edges()[('I','know')])
------------------------------
{'weight': 2}
{'weight': 1}

nx.draw(G,with_labels=True)
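For the NumPy route mentioned earlier, a minimal sketch might look like the following. The three-word matrix is a hypothetical excerpt of the example data, and `G_np` is an illustrative name; `nx.from_numpy_array` numbers the nodes 0, 1, 2, ... and `nx.relabel_nodes` then maps those integers to words.

```python
import numpy as np
import networkx as nx

labels = ['I', 'love', 'know']  # hypothetical three-word excerpt of the vocabulary
freq = np.array([[0, 2, 1],
                 [0, 0, 0],
                 [0, 0, 0]])

# Nodes are created as integer indices because the array carries no labels
G_np = nx.from_numpy_array(freq)

# Map each integer index to its word
G_np = nx.relabel_nodes(G_np, dict(enumerate(labels)))

print(G_np.nodes())
print(G_np['I']['love'])
```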

Co-occurrence frequencies can also be turned into co-occurrence probabilities.

from nltk.probability import ConditionalProbDist, MLEProbDist

cpd=ConditionalProbDist(cfd,MLEProbDist)
cpd.conditions()
------------------------------------------------------------
['I', 'love', 'data', 'science', 'and', 'deep', 'know', 'this']

prob_matrix=[]

for i in cpd.keys():
  prob_matrix.append([cpd[i].prob(j) for j in cpd.keys()])

prob_matrix=np.array(prob_matrix)

print(cpd.keys())
print(prob_matrix)
----------------------------------------------------------
dict_keys(['I', 'love', 'data', 'science', 'and', 'deep', 'know', 'this'])
[[0.         0.66666667 0.         0.         0.         0.
  0.33333333 0.        ]
 [0.         0.         0.5        0.5        0.         0.
  0.         0.        ]
 [0.         0.         0.         1.         0.         0.
  0.         0.        ]
 [0.         0.         0.         0.         1.         0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         1.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         1.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.        ]]

df = pd.DataFrame(prob_matrix,index=cpd.keys(),columns=cpd.keys())
df.style.background_gradient(cmap='coolwarm')

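The probability matrix can also be regarded as an adjacency matrix.
It produces the same graph structure as the frequency matrix, but using probabilities as edge weights can give inaccurate results. One likely reason is that these conditional probabilities are directional (the probability of 'love' after 'I' generally differs from the probability of 'I' after 'love'), while an undirected graph keeps a single weight per word pair.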

prob_G=nx.from_pandas_adjacency(df)

print(prob_G.nodes())
print(prob_G.edges())
-----------------------------------
['I', 'love', 'data', 'science', 'and', 'deep', 'know', 'this']
[('I', 'love'), ('I', 'know'), ('love', 'data'), ('love', 'science'), ('data', 'science'), ('science', 'and'), ('and', 'deep'), ('know', 'this')]

print(G.edges()[('I','love')])
print(G.edges()[('I','know')])

print(prob_G.edges()[('I','love')])
print(prob_G.edges()[('I','know')])
-----------------------------------
{'weight': 2}
{'weight': 1}
{'weight': 0.6666666666666666}
{'weight': 0.3333333333333333}

nx.draw(prob_G,with_labels=True)
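One way to keep the direction of the conditional probabilities is to build a directed graph by passing `create_using=nx.DiGraph`. The sketch below uses a hypothetical three-word excerpt of the probability matrix; the names `prob_df`, `directed`, and `undirected` are illustrative.

```python
import networkx as nx
import pandas as pd

words = ['I', 'love', 'know']  # hypothetical three-word excerpt of the vocabulary
prob_df = pd.DataFrame(
    [[0.0, 2 / 3, 1 / 3],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]],
    index=words, columns=words)

# Entry (i, j) becomes an edge from i to j only
directed = nx.from_pandas_adjacency(prob_df, create_using=nx.DiGraph)

# Default behaviour: a single undirected edge per word pair
undirected = nx.from_pandas_adjacency(prob_df)

print(directed.has_edge('I', 'love'), directed.has_edge('love', 'I'))
print(undirected.has_edge('love', 'I'))
```

In the directed graph the weight on the edge from 'I' to 'love' reads unambiguously as the probability of 'love' following 'I'. Some centrality measures behave differently on directed graphs, so whether this is preferable depends on the analysis.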

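2. Centrality Indices

The property that receives the most attention in network analysis is centrality.
Centrality is a measure of how central a node's position is within the whole network.
- Analyzing it reveals things such as how connected or how important a node is.
Centrality measures are divided, by the characteristic they capture, into degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality.

Degree Centrality

Degree centrality is the most basic and intuitive measure of centrality.
A word that co-occurs with many other words in a text has high degree centrality.
Because the raw number of connections depends on the size of the network and is hard to compare across networks, it is standardized in various ways.
- A common choice is (number of nodes directly connected to node i) / (number of nodes directly or indirectly connected to node i).
- Here, directly connected nodes are those that share an edge with node i, while indirectly connected nodes share no edge with it but can be reached through other nodes and edges.
- In a connected network with N nodes the denominator equals N - 1, which is the normalization `nx.degree_centrality` uses.
The formulas for degree centrality are as follows.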

$$ \text{In Degree}_{jk} = \sum\limits_{i=1}^N Z_{ijk} = Z_{\cdot jk} $$

$$ \text{Out Degree}_{ik} = \sum\limits_{j=1}^N Z_{ijk} = Z_{i \cdot k} $$

$$ C_i = \sum\limits_{j=1}^N (Z_{ij} + Z_{ji}) / \sum\limits_{i=1}^N \sum\limits_{j=1}^N (Z_{ij}) \quad\quad \text{where } 0 \leq C_i \leq 1$$

nx.degree_centrality(G)
-----------------------
{'I': 0.2857142857142857,
 'love': 0.42857142857142855,
 'data': 0.2857142857142857,
 'science': 0.42857142857142855,
 'and': 0.2857142857142857,
 'deep': 0.14285714285714285,
 'know': 0.2857142857142857,
 'this': 0.14285714285714285}
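As a worked check, the values above can be reproduced by hand as degree / (N - 1). The sketch below hard-codes the edge list printed earlier for `G`; the names `neighbors` and `manual_degree_centrality` are illustrative.

```python
# Edge list of the example graph G, copied from the output of G.edges()
edges = [('I', 'love'), ('I', 'know'), ('love', 'data'), ('love', 'science'),
         ('data', 'science'), ('science', 'and'), ('and', 'deep'), ('know', 'this')]

# Collect each node's set of direct neighbors
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

n = len(neighbors)  # 8 nodes

# Degree divided by the largest possible degree, N - 1
manual_degree_centrality = {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

print(manual_degree_centrality['love'])  # 3 neighbors out of 7 possible
print(manual_degree_centrality['deep'])  # 1 neighbor out of 7 possible
```

Note that edge weights play no role here: 'I' and 'love' co-occur twice, but the edge still counts as a single connection.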

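Eigenvector Centrality

Eigenvector centrality weights each connection by the importance of the word at the other end.
A word that is connected to many important words has high eigenvector centrality.
Because each node's score is a component of an eigenvector and depends on the scores of its neighboring nodes, it is not easy to compute by hand.

The formula for eigenvector centrality is as follows.

$$ P_i = \sum\limits_{j=1}^{N-1} P_j Z_{ji}, \quad\quad 0 \leq P_i \leq 1 $$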

nx.eigenvector_centrality(G,weight='weight')
--------------------------------------------
{'I': 0.5055042648573065,
 'love': 0.6195557831651917,
 'data': 0.3570359388519657,
 'science': 0.39841035839294925,
 'and': 0.15933837227495715,
 'deep': 0.055886131430398216,
 'know': 0.2021657335029144,
 'this': 0.0709058113463014}
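Since each score depends on the neighbors' scores, a common way to solve for them is power iteration: start from equal scores, repeatedly replace each score with a weighted sum over neighbors, and rescale. The following is a rough sketch under that assumption, with the weighted edge list hard-coded from the example and a fixed 200 iterations instead of a convergence test.

```python
import math

# Weighted edges of the example graph (weight = co-occurrence frequency)
weighted_edges = [('I', 'love', 2), ('I', 'know', 1), ('love', 'data', 1),
                  ('love', 'science', 1), ('data', 'science', 1),
                  ('science', 'and', 1), ('and', 'deep', 1), ('know', 'this', 1)]

adj = {}
for u, v, w in weighted_edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

scores = {node: 1.0 for node in adj}
for _ in range(200):
    # Keep the old score and add the weighted neighbor scores, i.e. multiply by (A + I);
    # the shift leaves the leading eigenvector unchanged and helps convergence
    new = {node: scores[node] + sum(w * scores[nbr] for nbr, w in adj[node].items())
           for node in adj}
    # Rescale to unit Euclidean length so the values stay bounded
    norm = math.sqrt(sum(value * value for value in new.values()))
    scores = {node: value / norm for node, value in new.items()}

print(max(scores, key=scores.get))
print(round(scores['love'], 4))
```

The top-ranked word is 'love', in line with the networkx output above.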

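Closeness Centrality

Closeness centrality measures how close a word is to the other words.
Unlike degree centrality, which considers only directly connected nodes, closeness centrality measures the distances to all nodes that are connected directly or indirectly.

The formula for closeness centrality is as follows, where $L_{X,A}$ is the length of the shortest path between nodes $X$ and $A$ and $N$ is the number of nodes.

$$ C_C(A) = \frac{1}{\frac{1}{N-1} \sum\limits_{X \neq A} L_{X,A}} = \frac{N-1}{\sum\limits_{X \neq A} L_{X,A}}$$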

nx.closeness_centrality(G,distance='weight')
--------------------------------------------
{'I': 0.35,
 'love': 0.4375,
 'data': 0.3684210526315789,
 'science': 0.4117647058823529,
 'and': 0.3333333333333333,
 'deep': 0.25925925925925924,
 'know': 0.2916666666666667,
 'this': 0.23333333333333334}
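One caveat: `distance='weight'` treats the co-occurrence frequency as an edge length, so a pair that co-occurs more often is considered farther apart. A possible alternative is to use the inverse frequency as the length. The sketch below reimplements the formula for a single node with Dijkstra's algorithm to compare the two choices; the helper names are illustrative and a connected graph is assumed.

```python
import heapq

weighted_edges = [('I', 'love', 2), ('I', 'know', 1), ('love', 'data', 1),
                  ('love', 'science', 1), ('data', 'science', 1),
                  ('science', 'and', 1), ('and', 'deep', 1), ('know', 'this', 1)]

def shortest_distances(adj, source):
    """Dijkstra's algorithm: shortest path length from source to every node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, length in adj[u].items():
            candidate = d + length
            if candidate < dist.get(v, float('inf')):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist

def closeness(edges, node, edge_length):
    """(N - 1) / sum of shortest path lengths from node, for a connected graph."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = edge_length(w)
        adj.setdefault(v, {})[u] = edge_length(w)
    dist = shortest_distances(adj, node)
    return (len(adj) - 1) / sum(d for other, d in dist.items() if other != node)

frequency_as_length = closeness(weighted_edges, 'I', lambda w: w)    # 7 / 20
inverse_as_length = closeness(weighted_edges, 'I', lambda w: 1 / w)  # 7 / 12.5

print(frequency_as_length, inverse_as_length)
```

With the frequency as the length, 'I' scores 0.35 as in the output above; with the inverse frequency it scores higher, because its strong link to 'love' now shortens its paths instead of lengthening them.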

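Betweenness Centrality

Betweenness centrality measures how much a word helps to hold the network of words together.
A word with high betweenness centrality plays a large role in linking the meanings of other words even if its frequency is low, so removing it makes communication difficult.
It is measured by how often a given node appears on the shortest paths between all pairs of nodes, and is standardized by dividing by its maximum value, $ (N-1) \times (N-2)/2$.

The formula for betweenness centrality is as follows, where $g_{ij}$ is the number of shortest paths between nodes $i$ and $j$, and $g_{imj}$ is the number of those paths that pass through node $P_m$.

$$ C'_B(P_m) = \frac{\sum\limits_{i}^N \sum\limits_{j}^N \frac{g_{imj}}{g_{ij}}}{(\frac{N^2 - 3N + 2}{2})}, \quad\quad \text{where } i<j, \quad i \neq j $$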

nx.betweenness_centrality(G)
----------------------------
{'I': 0.47619047619047616,
 'love': 0.5714285714285714,
 'data': 0.0,
 'science': 0.47619047619047616,
 'and': 0.2857142857142857,
 'deep': 0.0,
 'know': 0.2857142857142857,
 'this': 0.0}
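The formula can be checked directly on the small example graph. The sketch below counts shortest paths with breadth-first search and sums the fraction passing through a node over all pairs; it ignores edge weights (as `nx.betweenness_centrality(G)` does by default), assumes a connected graph, and uses illustrative helper names.

```python
from collections import deque
from itertools import combinations

edges = [('I', 'love'), ('I', 'know'), ('love', 'data'), ('love', 'science'),
         ('data', 'science'), ('science', 'and'), ('and', 'deep'), ('know', 'this')]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def bfs_paths(source):
    """Return hop distances and shortest-path counts from source to every node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return dist, sigma

paths = {s: bfs_paths(s) for s in adj}
n = len(adj)
max_pairs = (n - 1) * (n - 2) / 2  # normalization constant from the text

def betweenness(m):
    dist_m, sigma_m = paths[m]
    total = 0.0
    for i, j in combinations([x for x in adj if x != m], 2):
        dist_i, sigma_i = paths[i]
        # m is on a shortest i-j path exactly when the distances add up
        if dist_i[m] + dist_m[j] == dist_i[j]:
            total += sigma_i[m] * sigma_m[j] / sigma_i[j]
    return total / max_pairs

print(betweenness('love'))  # 12 of the 21 pairs route through 'love'
print(betweenness('data'))  # no shortest path needs 'data'
```

This brute-force version is fine for a handful of nodes; for larger graphs networkx relies on a much faster algorithm.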

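PageRank

PageRank is a method that assigns weights to documents with a hyperlink structure, such as the World Wide Web, according to their relative importance.
- The algorithm can be applied to any collection of items that are linked by mutual citation and reference.
PageRank is based on the observation that more important pages receive links from a larger number of other sites.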

nx.pagerank(G)
--------------
{'I': 0.1536831077679558,
 'love': 0.19501225218917406,
 'data': 0.10481873412175656,
 'science': 0.15751225722745082,
 'and': 0.12417333539164832,
 'deep': 0.07152392879557615,
 'know': 0.1224741813421488,
 'this': 0.07080220316428934}
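As a rough sketch of the idea, PageRank can be computed by repeatedly letting every node pass its current score to its neighbors in proportion to edge weight, mixed with a small uniform share. The code below assumes the usual damping factor of 0.85 (also the networkx default) and a fixed 100 iterations; the edge list is hard-coded from the example and the variable names are illustrative.

```python
weighted_edges = [('I', 'love', 2), ('I', 'know', 1), ('love', 'data', 1),
                  ('love', 'science', 1), ('data', 'science', 1),
                  ('science', 'and', 1), ('and', 'deep', 1), ('know', 'this', 1)]

adj = {}
for u, v, w in weighted_edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

damping = 0.85  # assumed damping factor
n = len(adj)

# Total edge weight leaving each node, used to split its score among neighbors
strength = {u: sum(nbrs.values()) for u, nbrs in adj.items()}

rank = {u: 1 / n for u in adj}
for _ in range(100):
    rank = {
        v: (1 - damping) / n
           + damping * sum(rank[u] * w / strength[u] for u, w in adj[v].items())
        for v in adj
    }

print(max(rank, key=rank.get))
print(round(sum(rank.values()), 6))
```

The scores sum to 1 and 'love' ranks highest, consistent with the networkx output above. Every node in this example has at least one neighbor, so the sketch does not handle nodes without outgoing edges.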

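3. Visualization

def get_node_size(node_values):
  # Min-max scale centrality scores to node sizes between 0 and 1000
  nsize=np.array(list(node_values),dtype=float)
  span=nsize.max()-nsize.min()
  if span==0:  # all scores equal: avoid dividing by zero, use one uniform size
    return np.full_like(nsize,500.0)

  return 1000*(nsize-nsize.min())/span

import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-white')  # this style was named 'seaborn-white' before Matplotlib 3.6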

dc=nx.degree_centrality(G).values()
ec=nx.eigenvector_centrality(G,weight='weight').values()
cc=nx.closeness_centrality(G,distance='weight').values()
bc=nx.betweenness_centrality(G).values()
pr=nx.pagerank(G).values()

plt.figure(figsize=(14,20))
plt.axis('off')

plt.subplot(321)
plt.title('Normal',fontsize=16)
nx.draw_networkx(G,font_size=16,alpha=0.7,cmap=plt.cm.Blues)

plt.subplot(322)
plt.title('Degree Centrality',fontsize=16)
nx.draw_networkx(G,font_size=16,
                  node_color=list(dc),node_size=get_node_size(dc),
                  alpha=0.7,cmap=plt.cm.Blues)

plt.subplot(323)
plt.title('EigenVector Centrality',fontsize=16)
nx.draw_networkx(G,font_size=16,
                  node_color=list(ec),node_size=get_node_size(ec),
                  alpha=0.7,cmap=plt.cm.Blues)

plt.subplot(324)
plt.title('Closeness Centrality',fontsize=16)
nx.draw_networkx(G,font_size=16,
                  node_color=list(cc),node_size=get_node_size(cc),
                  alpha=0.7,cmap=plt.cm.Blues)

plt.subplot(325)
plt.title('Betweenness Centrality',fontsize=16)
nx.draw_networkx(G,font_size=16,
                  node_color=list(bc),node_size=get_node_size(bc),
                  alpha=0.7,cmap=plt.cm.Blues)

plt.subplot(326)
plt.title('PageRank',fontsize=16)
nx.draw_networkx(G,font_size=16,
                  node_color=list(pr),node_size=get_node_size(pr),
                  alpha=0.7,cmap=plt.cm.Blues)

plt.show()
