MS Learn - Train and evaluate classification models 3 - Multiclass Classification Model

April 13, 2025


Multiclass Classification Model

Now let's move beyond binary classification and look at multiclass models, which predict a response that can fall into several categories.

OVR, OVO

Multiclass classification can be approached in two ways.

OVR: One vs Rest

OVR stands for One vs Rest: each classifier separates one class from all the remaining classes.

square or not
circle or not
triangle or not

With $N$ classes, $N$ models are trained.

OVO: One vs One

OVO stands for One vs One: each classifier separates one class from one other class.

square or circle
square or triangle
triangle or circle

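With $N$ classes, $\dfrac{N(N-1)}{2}$ models are trained.

| Approach | Description | Advantage | Disadvantage |
| --- | --- | --- | --- |
| OVR (One-vs-Rest) | Each class vs. all the rest | Few models | Vulnerable to imbalanced data |
| OVO (One-vs-One) | Each pair of classes compared | Can be more accurate | Many models, slow |

Machine learning libraries such as Scikit-Learn make multiclass classification easy to implement. Estimators (e.g. LogisticRegression) handle the multiclass case internally, so the code looks almost the same as for binary classification. Which strategy is used depends on the estimator: some use OVR, some use OVO, and some let you choose.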
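To make the model counts concrete, here is a small sketch that only enumerates the binary problems each strategy would train for the three shapes listed above. The `shapes` list and the variable names are illustrative, and no models are fitted.

```python
from itertools import combinations

shapes = ["square", "circle", "triangle"]  # illustrative class names

# OVR: one binary problem per class (that class vs everything else)
ovr_tasks = [f"{shape} or not" for shape in shapes]

# OVO: one binary problem per unordered pair of classes
ovo_tasks = [f"{a} or {b}" for a, b in combinations(shapes, 2)]

n = len(shapes)
print(len(ovr_tasks), ovr_tasks)  # N tasks
print(len(ovo_tasks), ovo_tasks)  # N(N-1)/2 tasks
```

With three classes both strategies happen to need three models. The gap only shows up as $N$ grows: ten classes would mean 10 OVR models but 45 OVO models.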

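Exercise

We will perform multiclass classification of penguin species based on the penguins' physical features.

The dataset has four numeric feature columns (CulmenLength, CulmenDepth, FlipperLength, BodyMass) and a Species column, which holds the penguin species and serves as the label.

Handling missing values

First, count the missing values in each column.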

# Count the number of null values for each column
penguins.isnull().sum()

CulmenLength     2
CulmenDepth      2
FlipperLength    2
BodyMass         2
Species          0
dtype: int64

Every column except the response, Species, has two missing values. Let's filter the data to see which rows contain them.

# Show rows containing nulls
penguins[penguins.isnull().any(axis=1)]

Now drop the rows that contain missing values.

# Drop rows containing NaN values
penguins=penguins.dropna()

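EDA, Feature Analysis

EDA stands for Exploratory Data Analysis: the process of understanding the characteristics, patterns, and outliers of the data. Typical activities include:

Checking summary statistics (mean, median, std, etc.)
Visualization (Histogram, Box Plot, Scatter Plot)
Outlier detection
Correlation analysis

Here we again draw box plots, one per feature, grouped by species.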

from matplotlib import pyplot as plt
%matplotlib inline

penguin_features = ['CulmenLength','CulmenDepth','FlipperLength','BodyMass']
penguin_label = 'Species'
for col in penguin_features:
    penguins.boxplot(column=col, by=penguin_label, figsize=(6,6))
    plt.title(col)
plt.show()

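In these box plots, the Culmen Depth (bill depth) distribution of species 1 differs clearly from that of species 0 and 2, so it is a good feature for separating species 1 from the other two.

Preparing the data

Now let's split the data into a training set and a test set.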

from sklearn.model_selection import train_test_split

# Separate features and labels
penguins_X, penguins_y = penguins[penguin_features].values, penguins[penguin_label].values

# Split data 70%-30% into training set and test set
x_penguin_train, x_penguin_test, y_penguin_train, y_penguin_test = train_test_split(penguins_X, penguins_y,
                                                                                    test_size=0.30,
                                                                                    random_state=0,
                                                                                    stratify=penguins_y)

print ('Training Set: %d, Test Set: %d \n' % (x_penguin_train.shape[0], x_penguin_test.shape[0]))

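The part to pay attention to here is stratify=penguins_y in train_test_split.

Stratification means stratified sampling: the samples are drawn so that class proportions are preserved.

The three penguin species have different numbers of rows, so the train and test sets are built to keep the same per-species proportions as the original data.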
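The effect of stratify can be checked on synthetic labels. The sketch below is a minimal illustration, assuming labels 0, 1, 2 with the class counts 151, 123, and 68; the `features` list and the `class_shares` helper are made up for this example.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the cleaned penguin data
labels = [0] * 151 + [1] * 123 + [2] * 68
features = [[i] for i in range(len(labels))]  # dummy single-column features

def class_shares(y):
    """Return the fraction of samples per class."""
    counts = Counter(y)
    return {cls: counts[cls] / len(y) for cls in sorted(counts)}

# Same split settings as in the post, with stratification
_, _, _, y_test_strat = train_test_split(
    features, labels, test_size=0.30, random_state=0, stratify=labels)

print("original:", class_shares(labels))
print("test set:", class_shares(y_test_strat))
```

The per-class shares in the test set should stay close to the original ones. Without stratify the split is purely random, so a small class can end up over- or under-represented.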

species = penguins['Species'].unique()

for s in species:
    print(s, len(penguins[penguins['Species'] == s]))

0 151
1 123
2 68

Training

from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.1

# train a logistic regression model on the training set
multi_model = LogisticRegression(C=1/reg, solver='lbfgs', multi_class='auto', max_iter=10000).fit(x_penguin_train, y_penguin_train)
print (multi_model)

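Scikit-Learn's LogisticRegression supports multiclass classification internally, so the binary classification code can be reused with almost no changes.

lbfgs: L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno), a fast and stable optimization algorithm that is well suited to small datasets
auto: chooses automatically between the two multiclass modes LogisticRegression supports, ovr and multinomial; with lbfgs and more than two classes it selects multinomial. Recent scikit-learn releases deprecate the multi_class parameter, and omitting it gives the same default behavior.

Evaluating the model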

penguin_predictions = multi_model.predict(x_penguin_test)
print('Predicted labels: ', penguin_predictions[:15])
print('Actual labels   : ', y_penguin_test[:15])

As in the binary case, predict returns the model's predicted labels.

classification_report likewise gives an overall view of classification performance.

              precision    recall  f1-score   support

           0       0.96      0.98      0.97        45
           1       1.00      1.00      1.00        37
           2       0.95      0.90      0.93        21

    accuracy                           0.97       103
   macro avg       0.97      0.96      0.96       103
weighted avg       0.97      0.97      0.97       103

Accuracy is already 0.97 here, so the model performs well on this test set. The weakest figure is the recall of class 2, at 0.90.
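The call that produces such a report is not shown in this post. As a minimal sketch on toy labels (not the penguin predictions), it would look roughly like this:

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only; these are not the penguin results
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 0]

# Text report with per-class precision, recall, f1-score and support
print(classification_report(y_true, y_pred))

# output_dict=True returns the same numbers as a nested dict
report = classification_report(y_true, y_pred, output_dict=True)
print(report["2"]["recall"])
```

For the exercise itself, the arguments would presumably be y_penguin_test and penguin_predictions.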

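Each of these values can also be computed individually with sklearn.metrics.

The accuracy_score, precision_score, and recall_score functions return them. Because this is multiclass classification, precision and recall also take an average argument that specifies whether to use the macro or the weighted average.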

from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Overall Accuracy:",accuracy_score(y_penguin_test, penguin_predictions))
print("Overall Precision:",precision_score(y_penguin_test, penguin_predictions, average='macro'))
print("Overall Recall:",recall_score(y_penguin_test, penguin_predictions, average='macro'))

Overall Accuracy: 0.970873786407767
Overall Precision: 0.9688405797101449
Overall Recall: 0.9608465608465608

Unlike binary classification, where the confusion matrix is $2 \times 2$, here it is an $N \times N$ matrix. Rows are the actual classes and columns are the predicted classes.

from sklearn.metrics import confusion_matrix

# Print the confusion matrix
mcm = confusion_matrix(y_penguin_test, penguin_predictions)
print(mcm)

[[44  0  1]
 [ 0 37  0]
 [ 2  0 19]]
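The overall scores printed earlier can be recovered from this matrix by hand. The sketch below uses plain Python and hard-codes the matrix from the post; the variable names are illustrative.

```python
# Confusion matrix from the post: rows are actual classes, columns are predicted classes
mcm = [[44, 0, 1],
       [0, 37, 0],
       [2, 0, 19]]

n_classes = len(mcm)
total = sum(sum(row) for row in mcm)

# Actual count per class (row sums) and predicted count per class (column sums)
support = [sum(row) for row in mcm]
predicted = [sum(mcm[r][c] for r in range(n_classes)) for c in range(n_classes)]

# Per-class precision and recall use the diagonal (correct predictions)
precision = [mcm[i][i] / predicted[i] for i in range(n_classes)]
recall = [mcm[i][i] / support[i] for i in range(n_classes)]

accuracy = sum(mcm[i][i] for i in range(n_classes)) / total
macro_precision = sum(precision) / n_classes  # unweighted mean over classes
macro_recall = sum(recall) / n_classes
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total  # weighted by support

print(accuracy, macro_precision, macro_recall, weighted_precision)
```

The macro average treats every class equally, while the weighted average gives larger classes more influence. That is why the two differ slightly when class sizes are uneven.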

This matrix can be visualized as a heatmap.

import numpy as np
import matplotlib.pyplot as plt
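%matplotlib inline

# Class names in label order 0, 1, 2. Assumed from the MS Learn penguins exercise,
# since this post never defines them; adjust if your label encoding differs.
penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']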

plt.imshow(mcm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(penguin_classes))
plt.xticks(tick_marks, penguin_classes, rotation=45)
plt.yticks(tick_marks, penguin_classes)
plt.xlabel("Predicted Species")
plt.ylabel("Actual Species")
plt.show()

imshow renders matrix data as an image.
colorbar draws the color scale at the right of the plot.
np.arange returns an ndarray containing the integers from $0$ to $n-1$.

ROC Curve

In multiclass classification there is no single ROC curve.

Instead, you can compute one curve per class with the OVR (One vs Rest) approach and draw them all on the same chart.

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Get class probability scores
penguin_prob = multi_model.predict_proba(x_penguin_test)

# Get ROC metrics for each class
fpr = {}
tpr = {}
thresh ={}
for i in range(len(penguin_classes)):    
    fpr[i], tpr[i], thresh[i] = roc_curve(y_penguin_test, penguin_prob[:,i], pos_label=i)
    
# Plot the ROC chart
plt.plot(fpr[0], tpr[0], linestyle='--',color='orange', label=penguin_classes[0] + ' vs Rest')
plt.plot(fpr[1], tpr[1], linestyle='--',color='green', label=penguin_classes[1] + ' vs Rest')
plt.plot(fpr[2], tpr[2], linestyle='--',color='blue', label=penguin_classes[2] + ' vs Rest')
plt.title('Multiclass ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive rate')
plt.legend(loc='best')
plt.show()

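As in the binary case, the ROC curve must be built from the class probabilities returned by predict_proba, not from the hard labels returned by predict. These probabilities are what gets passed to roc_curve.

The scores we looked at earlier were all close to $1$, and accordingly each ROC curve hugs the top-left corner of the chart.

This means the AUC is close to $1$.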
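To see what a per-class OVR AUC measures, here is a dependency-free sketch on made-up probabilities. The `ovr_auc` helper is hypothetical and uses the pairwise-ranking definition of AUC. It is meant to illustrate the idea behind a macro-averaged OVR AUC, not to reproduce scikit-learn's implementation.

```python
def ovr_auc(y_true, scores, positive):
    """AUC for `positive` vs rest: the probability that a random positive
    sample gets a higher score than a random negative one (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and class probabilities (one row per sample, one column per class)
y_true = [0, 0, 1, 1, 2, 2]
probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.2, 0.5],
    [0.7, 0.1, 0.2],
]

# One AUC per class, each treating that class as positive and the rest as negative
aucs = [ovr_auc(y_true, [row[c] for row in probs], c) for c in range(3)]
macro_auc = sum(aucs) / len(aucs)
print(aucs, macro_auc)
```

Class 0 scores below 1 here because one sample of class 2 receives a higher class-0 probability than a true class-0 sample.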

Improving the model

Let's replace LogisticRegression with a Support Vector Machine and preprocess the features.

StandardScaler rescales each feature to have mean $0$ and standard deviation $1$, and the model is built with the SVC (Support Vector Classification) algorithm.
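The standardization step itself is simple arithmetic: $z = (x - \mu) / \sigma$. A minimal sketch with made-up values for a single feature, using only the standard library (StandardScaler uses the population standard deviation, which is what `pstdev` computes):

```python
from statistics import mean, pstdev

values = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values for one feature column

mu = mean(values)
sigma = pstdev(values)  # population standard deviation

# Subtract the mean and divide by the standard deviation
scaled = [(v - mu) / sigma for v in values]

print(scaled)
print(mean(scaled), pstdev(scaled))  # approximately 0 and 1
```

Scaling matters for SVC because features with large numeric ranges, such as BodyMass, would otherwise dominate the distance computations.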

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Define preprocessing for numeric columns (scale them)
feature_columns = [0,1,2,3]
feature_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
    ])

# Create preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('preprocess', feature_transformer, feature_columns)])

# Create training pipeline (preprocessing followed by the SVC classifier)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', SVC(probability=True))])


# fit the pipeline to train a support vector classifier on the training set
multi_model = pipeline.fit(x_penguin_train, y_penguin_train)
print (multi_model)

Finally, the evaluation code:

# Get predictions from test data
penguin_predictions = multi_model.predict(x_penguin_test)
penguin_prob = multi_model.predict_proba(x_penguin_test)

# Overall metrics
print("Overall Accuracy:", accuracy_score(y_penguin_test, penguin_predictions))
print("Overall Precision:", precision_score(y_penguin_test, penguin_predictions, average='macro'))
print("Overall Recall:", recall_score(y_penguin_test, penguin_predictions, average='macro'))
print('Average AUC:', roc_auc_score(y_penguin_test,penguin_prob, multi_class='ovr'))

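# Confusion matrix (recomputed so it reflects the SVC predictions, not the earlier model)
mcm = confusion_matrix(y_penguin_test, penguin_predictions)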
plt.imshow(mcm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(penguin_classes))
plt.xticks(tick_marks, penguin_classes, rotation=45)
plt.yticks(tick_marks, penguin_classes)
plt.xlabel("Predicted Species")
plt.ylabel("Actual Species")
plt.show()
