TensorFlow | 간단한 Deep learning 구현 (Keras)

19 Oct 2022 in Machine Learning / Tf

Tensorflow를 이용하여 모델을 생성하고 학습시키는 방법

What is TensorFlow
What is Keras
Tensor Flow 딥러닝 모델 구현
Tensor Flow + Keras를 활용하여 비선형회귀 구현
Reference

What is TensorFlow

머신러닝(딥러닝)을 위한 오픈소스 라이브러리

구글이 개발
데이터 처리, 딥러닝 모델 생성 및 학습을 위한 도구
페이스북이 개발한 딥러닝 프레임워크인 pytorch도 많이 쓰는 추세. pytorch vs.tensorflow

What is Keras

Tensor Flow의 패키지로 제공되는 고수준 API, 딥러닝 모델을 간단하고 빠르게 구현 가능

Tensor Flow 딥러닝 모델 구현

모델 클래스 객체 생성
```
 tf.keras.models.Sequential() 
```

모델의 layer 구성

 tf.keras.layers.Dense(units, activation)

units: 레이어 안의 Node 수
activation: 적용할 activiation 함수 (ex. relu, sigmoid, softmax)

모델 구성 예시 (1) - 리스트 형태

  model = tf.keras.models.Sequential([
      tf.keras.layers.Dense(20, input_dim = 2, activation='relu'),
      tf.keras.layers.Dense(20, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid')
  ])

모델 구성 예시 (2) - add()를 통해 layer 추가

  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Dense(16, input_dim=2, activation='relu'))
  model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

모델 학습 방식 설정
```
 model.compile(optimizer, loss)
```
- optimizer: 모델 학습 최적화 방식
- loss: 손실 함수 설정
- 예시
```
 model.compile(optimizer='adam', loss='mse')
```
모델 학습
```
 model.fit(train_x, train_y)
```
모델 평가
```
 model.evaluate(test_x, test_y)
```

모델 예측

 pred = model.predict(test_x) # 예측값 반환

Tensor Flow + Keras를 활용하여 비선형회귀 구현

import tensorflow as tf
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

np.random.seed(100)
tf.random.set_seed(100)


# 비선형 데이터 샘플 생성
x_data = np.linspace(0, 10, 100) 
y_data = 1.5 * x_data**2 -12 * x_data + np.random.randn(*x_data.shape)*2 + 0.5

# print(x_data)
# print(x_data.shape) # (100, )

# 모델 정의
model = tf.keras.models.Sequential([
(    tf.keras.layers.Dense(20, input_dim = 1, activation='relu'),
)    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss = 'mean_squared_error', optimizer = 'adam')

history = model.fit(x_data, y_data, epochs=500, verbose=1)

predictions = model.predict(x_data)

데이터(파란점)과 선형회귀(빨간선) 시각화

Reference

https://wikidocs.net/156950

TensorFlow | 상수, 시퀀스, 변수 텐서

19 Oct 2022 in Machine Learning / Tf

텐서플로에서 텐서(Tensor) 데이터를 생성하는 기본 방법

Tensor 생성하기
Tensor 연산

import tensorflow as tf import os os.environ[‘TF_CPP_MIN_LOG_LEVEL’] = ‘2’

Tensor 생성하기

Constant Tensor

t1 = tf.constant(5, dtype=tf.int8, shape=(1, 1))
'''
[[5]]
'''

t2 = tf.zeros((3, 5), dtype=tf.int16)
'''
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
'''

t3 = tf.ones((4, 3), dtype=tf.int8)
'''
[[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]]
'''

Sequence Tensor

seq_t1 = tf.range(1.5, 11, 4.5)
'''
[ 1.5  6.  10.5]
'''

seq_t2 = tf.range(2.5, 21, 4.5)
'''
[ 2.5  7.  11.5 16.  20.5]
'''

Variable Tensor

var_tensor = tf.Variable(100)
'''
100
'''

W = tf.Variable(tf.ones((2, 2)), name='W')
'''
[[1. 1.]
 [1. 1.]]
'''

b = tf.Variable(tf.zeros(2,), name='b')
'''
[0. 0.]
'''

Tensor 연산

a = tf.constant(10, dtype = tf.int32)
b = tf.constant(3, dtype = tf.int32)

add = tf.add(a, b) # tensor 덧셈
sub = tf.subtract(a, b) # tensor 뺄셈
mul = tf.multiply(a, b) # tensor 곱
div = tf.truediv(a, b) # tensor 나눗셈 (a/b)

Deep Learning | 역전파(BackPropagation)

19 Oct 2022 in Machine Learning / Basic

딥러닝에서 역전파는 어떻게 이루어지나, 구현

역전파
python 구현
Reference

역전파

python 구현

import math

def sigmoid(x) :
    return 1 / (1 + math.exp(-x))

def getParameters(X, y) :
    
    f = len(X[0])
    w = [1] * f
    values = []
    
    while True :

        wPrime = [0] * f    # 초기 가중치 w에 더해지는 값 (역전파로 구해짐)
        
        for i in range(len(y)) :
            r = 0
            for j in range(f) :
                r = r + X[i][j]*w[j]
            
            v = sigmoid(r)
            
            # w를 업데이트하기 위한 wPrime을 역전파를 이용해 구하는 식
            for j in range(f) :
                wPrime[j] += -((v - y[i]) * v * (1-v) * X[i][j])
        
        flag = False
        
        for i in range(f) :
            if abs(wPrime[i]) >= 0.001:
                flag = True
                break
        
        if flag == False :
            break
        
        for j in range(f) :
            w[j] = w[j] + wPrime[j]
    
    return w

def main():
    
    # [예제 1]
    X = [(1, 0, 0), (1, 0, 1), (0, 0, 1)]
    y = [0, 1, 1]

    # [예제 2]
    # X = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
    # y = [0, 0, 0, 1, 0, 1, 1, 1]

    print(getParameters(X, y))

if __name__ == "__main__":
    main()

Reference

<텐서플로> 딥러닝 코드 샘플

19 Oct 2022 in Archive

코드 샘플

import numpy as np
import tensorflow as tf
from visual import *

import logging, os
logging.disable(logging.WARNING)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

# 데이터를 전처리하는 함수

def sequences_shaping(sequences, dimension):
    
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        results[i, word_indices] = 1.0 
    
    return results


def OPT_model(word_num):
    
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(100, input_dim=word_num, activation='relu'),
        tf.keras.layers.Dense(100, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    
    return model

def main():
    
    word_num = 100
    data_num = 25000
    
    # Keras에 내장되어 있는 imdb 데이터 세트를 불러오고 전처리합니다.
    
    (train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words = word_num)
    
    train_data = sequences_shaping(train_data, dimension = word_num)
    test_data = sequences_shaping(test_data, dimension = word_num)
    
    adagrad_model = OPT_model(word_num)  # Adagrad를 사용할 모델입니다.
    rmsprop_model = OPT_model(word_num)  # RMSProp을 사용할 모델입니다.
    adam_model = OPT_model(word_num)     # Adam을 사용할 모델입니다.
    
    adagrad_opt = tf.keras.optimizers.Adagrad(lr=0.01, epsilon=0.00001, decay=0.4)
    adagrad_model.compile(optimizer=adagrad_opt, loss='binary_crossentropy', metrics=['accuracy', 'binary_crossentropy'])
    
    rmsprop_opt = tf.keras.optimizers.RMSprop(lr=0.001) 
    rmsprop_model.compile(optimizer=rmsprop_opt, loss='binary_crossentropy', metrics=['accuracy', 'binary_crossentropy'])
    
    adam_opt = tf.keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999)
    adam_model.compile(optimizer=adam_opt, loss='binary_crossentropy', metrics=['accuracy', 'binary_crossentropy'])
    
    adagrad_model.summary()
    rmsprop_model.summary()
    adam_model.summary()


    adagrad_history = adagrad_model.fit(train_data, train_labels, epochs=20, batch_size=500, validation_data=(test_data, test_labels), verbose=0)
    print('\n')
    rmsprop_history = rmsprop_model.fit(train_data, train_labels, epochs=20, batch_size=500, validation_data=(test_data, test_labels), verbose=0)
    print('\n')
    adam_history = adam_model.fit(train_data, train_labels, epochs=20, batch_size=500, validation_data=(test_data, test_labels), verbose=0)
    
    scores_adagrad = adagrad_model.evaluate(test_data, test_labels)
    scores_rmsprop = rmsprop_model.evaluate(test_data, test_labels)
    scores_adam = adam_model.evaluate(test_data, test_labels)
    
    print(adagrad_history.history)

    print('\nscores_adagrad: ', scores_adagrad[-1])
    print('scores_rmsprop: ', scores_rmsprop[-1])
    print('scores_adam: ', scores_adam[-1])
    
    Visulaize([('Adagrad', adagrad_history),('RMSprop', rmsprop_history),('Adam', adam_history)])
    
    return adagrad_history, rmsprop_history, adam_history
    
if __name__ == "__main__":
    main()

L1, L2 정규화

ratio : 가중치에 L1 정규화를 적용하는 비율 (0.001 ~0.005)

tf.keras.layers.Dense(kernel_regularizer = tf.keras.regularizers.l2(ratio))

drop out

prob: 드롭 아웃 적용 확률 0.1~0.5

tf.keras.layers.Dropout(prob)

Batch Normalization

tf.keras.layers.BatchNormalization()

tf.keras.layers.Activation()

Numpy | random 함수 사용법, 예시

18 Oct 2022 in Python for 데이터 분석

넘파이 라이브러리를 활용해 난수 생성하기 (랜덤한 정수, 0~1사이, 정규분포)

import numpy as np

numpy.random 쓰는 방법을 자꾸 까먹어서,,, 적어 놔야지.

랜덤한 int 생성하기

numpy.random.randint를 사용하여 임의의 정수 array를 만들 수 있다.

random.randint(low, high, size, dtype)

[low, high) 범위의 int 생성, discrete uniform distribution¹ 기반
필수 파라미터: low
파라미터 high를 명시하지 않은 경우 [0, low) 범위의 int 생성
파라미터 size를 명시하지 않은 경우 single int 생성

randint() 예시

import numpy as np

np.random.randint(5) # 0~4까지 랜덤 정수 1개 생성
### >>> 0

np.random.randint(5, size=4) # 0~4까지 랜덤 정수, 길이가 4인 array 생성
### >>> array([4, 1, 4, 3])

np.random.randint(5, size=(2, 3)) # 0~4까지 랜덤 정수, 크기가 (2, 3)인 2차원 array 생성
### >>> array([[2, 1, 3],
### >>>        [4, 1, 0]])

np.random.randint(1, 11, size=6) # 1~10까지 랜덤 정수, 길이가 6인 array 생성
### >>> array([3, 5, 2, 8, 6, 8])

0 ~ 1 사이의 랜덤한 난수 생성하기

numpy.random.rand를 활용하여 0~1 범위의 랜덤한 실수 array 생성

random.rand(d0, d1, …, dn)

[0, 1) 범위의 난수 생성
numpy.zeros 혹은 numpy.ones를 사용해서 0과 1로 초기화된 특정 크기의 array를 만드는 것처럼, 랜덤한 샘플 array를 생성할 때 자주 쓰인다.
파라미터로 dimension, 즉 array 크기만 지정

rand() 예시

import numpy as np

np.random.rand(3) # 1x3 array
### >>> array([0.88843017, 0.08603825, 0.574518  ])

np.random.rand(2, 4) # 2x4 array
### >>> array([[0.97885576, 0.78156378, 0.74111103, 0.38431599],
### >>>       [0.51179429, 0.43498217, 0.4356467 , 0.86757286]])

표준 정규 분포를 따르는 난수 생성하기

numpy.random.randn를 활용하여 표준 정규분포 (standard normal distribution)을 따르는 난수 생성

random.randn(d0, d1, … , dn)

standard normal distribution: 평균이 0, 표준 편차가 1인 표준 정규 분포를 따름
파라미터로 dimension 만 지정

import numpy as np

np.random.randn()
### >>> 2.1923875335537315  # random

np.random.randn(2, 3)
### >>> array([[ 0.67294205, -0.65654402, -1.21734529],
### >>>       [-1.32495713, -1.01700438, -0.13663216]])

정규 분포를 따르는 난수 생성하기

numpy.random.normal를 활용하여 평균과 표준 편차를 지정한 정규 분포에서의 난수 생성

random.normal(loc, scale, size)

평균이 loc, 표준 편차가 scale인 정규 분포를 따름
deep learning에서 가중치 초기화를 조절할 때 사용할 수 있음 (Xavier 초기화², He 초기화³)

import numpy as np

s = np.random.normal(0, 0.1, 5)
### >>> array([-0.00212207,  0.1124257 , -0.09070432,  0.01982677, -0.17008441])

Reference

https://numpy.org/doc/stable/reference/random/generator.html
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
https://reniew.github.io/13/

이산균등분포, 모든 숫자가 나올 확률이 동등한 분포 (ex. 주사위) ↩︎
노드 개수에 따라 표준 편차를 다르게 하여 가중치 초기화를 하는 방법 ↩︎
Xavier 초기화 방법과 마찬가지로 노드 개수에 따라 표준 편차를 조절, relu형 함수(선형 activation function)에 사용 ↩︎

Basic | 지도학습 분류 알고리즘 - Decision Tree

17 Oct 2022 in Machine Learning / Basic

의사결정나무를 이용하여 binary classifier 만들기

Decision Tree
언제 Decision Tree를 사용하는가.
불순도/불확실성
Decision Tree 학습 과정
Decision Tree 특징
Scikit-learn 모델 활용
Reference

Decision Tree

의사결정트리

예시
OUTLOOK, Temperature, HUMIDITY, WINDY 네 가지 attribute에 따라 테니스 경기 Play/Don’t Play에 대한 기록이 아래와 같다.

출처: nulpointerexception.com

그렇다면 각 attribute의 값이 Sunny, Cool, Normal, True 일 때 테니스 경기를 할까 말까? 위의 테이블 데이터를 바탕으로 decision tree를 만들면 아래와 같다.

출처: nulpointerexception.com

출처: imgur, Humdity 부분이 가지가 반대로 되어 있는 점 참고

마치 스무고개처럼 동작하는데, 위의 의사결정 트리에 따르면 Sunny, Cool, Normal, True의 경우 테니스 경기를 play할 것이라고 예측할 수 있다.

언제 Decision Tree를 사용하는가.

attribute에 대한 값이 pair 형태일 때 (ex. Wind: Strong / Weak)
예측해야 하는 타겟 값이 명확할 때 (Play?: Yes / No)
훈련용 데이터에 손실이 있거나 에러가 있을 수 있을 때 (Decision Tree는 다수결로 동작하기 때문에 데이터 손실이나 에러에 상대적으로 강함)

=> 물론 분류(classification) 뿐만 아니라 회귀(regression)에도 Decision Tree를 사용할 수 있다.

불순도/불확실성

Decision Tree가 만들어지면 node를 따라 내려가면서 값을 예측하는 건 이제 알겠고, 그럼 Decision Tree는 어떻게 만드는 걸까? 나무가지가 나눠지는 분기 기준을 어떻게 세우는 걸까?

불순도(impurity)가 최대한 감소하는 방향으로 학습을 진행하여 Decision Tree를 만든다.

불순도를 나타내는 대표적인 지표에는 Entropy와 Gini Index가 있다. 계산 방법은 아래와 같고, Tree의 한 node 범주로 분류되는 데이터들이 얼마나 비슷한지에 대한 값이라고 생각하면 된다. 예를 들면, Tennis에서 Outlook-Overcast 루트로 분류되는 4개의 데이터(위의 데이터 테이블 참고)가 모두 Yes 값이므로 불순도가 0이다. (Entropy: 0) 반면 Outlook-Sunny 루트로 분류되는 5개의 데이터의 경우 Yes가 2개 No가 3개이므로 불순도가 높다. (Entropy: 0.97)

Entropy
\(Entropy(A)=-\sum _{ k=1 }^{ m }{ { p }_{ k }\log _{ 2 }{ { (p }_{ k }) } }\)

Gini Index (지니 계수)
\(G.I(A)=\sum _{ i=1 }^{ d }{ { \left( { R }_{ i }\left( 1-\sum _{ k=1 }^{ m }{ { p }_{ ik }^{ 2 } } \right) \right) } }\)

두 지표의 가장 큰 차이점: 불순도가 가장 낮을 때 (모든 데이터 값이 동일할 때)는 Entropy: 0, Gini Index: 0이지만, 불순도가 가장 높을 때 (예를 들어 이진분류에서 반반씩 섞여 있는 경우) Entropy: 1, Gini Index: 0.5이다.

Decision Tree 학습 과정

재귀적 분기(Recursive Partitioning)과 가지치기(Prunning)이 이루어진다.

임의로 attribute를 하나 골라 분할한다고(분기) 가정한 후, 위의 지표값 중 하나를 선택하여 계산한다.
(분기 전의 불순도 - 분기 후의 불순도) = 정보 획득(Information Gain)을 계산한다.
정보 획득이 가장 큰 변수(attribute)와 그 지점(value 혹은 범위)를 선택하여 분기
불순도가 0이 되거나 leaf node의 데이터 개수가 특정 개 미만이 될때까지 반복한다. 재귀적 분기(Recursive Partitioning)
tree의 깊이가 너무 깊어지면 overfitting이 발생할 수 있으므로, 가지치기(prunning)을 통해 여러 분기를 통합시킬 수 있다.

대충 다양한 기준으로 나눠보고 최대한 균일하게 나뉘는 기준을 선택한다는 뜻, brute force와 느낌이 비슷하다.

Decision Tree 특징

계산복잡성 대비 높은 에측 성능
변수 단위 설명력 높음 (왜 이런 결과가 나왔는지 tree 시각적 확인 가능)
decision boundary가 데이터 축에 수직이기 때문에 특정 데이터에만 잘 작동할 가능성이 높다.
이를 해결하기 위해 random forest¹ 모델을 활용할 수 있다.

Scikit-learn 모델 활용

load_iris 데이터셋 활용
scikit-learn의 Dicision Tree Classifier 클래스 활용

파이썬 코드

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from elice_utils import EliceUtils
elice_utils = EliceUtils()

def load_data():
    
    X, y = load_iris(return_X_y = True) 
    
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=100)
    
    return train_X, test_X, train_y, test_y

def main():
    
    train_X, test_X, train_y, test_y = load_data()
    
    clf = DecisionTreeClassifier() # 분류기(classifier) 정의
    
    clf.fit(train_X, train_y) # 분류기 학습
    
    pred = clf.predict(test_X) # test_x 값에 대한 예측값 생성

    print('정확도 : {0:.4f}'.format(accuracy_score(test_y, pred)))
    
    return pred
    
if __name__ == "__main__":
    main()

Reference

https://ratsgo.github.io/machine%20learning/2017/03/26/tree/
https://nulpointerexception.com/2017/12/16/a-tutorial-to-understand-decision-tree-id3-learning-algorithm/

Decision Tree를 여러개 만들어 결과를 종합하여 예측 성능을 높이는 기법 (앙상블 - Bagging) ↩︎

Basic | Multiple Linear, Polynomial Regression

13 Oct 2022 in Machine Learning / Basic

다중 선형회귀와 다항 회귀 그리고 과적합 방지

Multiple Linear Regression
Polynomial Regression
- 다항 회귀 원리
- 다항 회귀 특징
Scikit-learn을 이용한 다중회귀 구현
과적합 방지
- 1. 교차 검증 (Cross Validation)
- 2. 정규화 (Regularization)
Reference

Multiple Linear Regression

다중 선형 회귀

여러 개의 X (input) -> Y (output) 예측
직선 하나
직선 방정식: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_iX_i\)
애초에 데이터들의 관계가 선형적이지 않다면? (아래와 같이)

출처: elice

-> 다항회귀의 필요성

Polynomial Regression

다항 회귀

1차 함수 선형식으로 표현하기 어려운 분포의 데이터를 위한 회귀 (곡선)
다항 회귀식: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2^2 + ... + \beta_iX_i^i\)

다항 회귀 원리

기존 입력값 \(X_i\)를 전처리한 새로운 변수를 추가시켜 선형 회귀 모델로 예측
선형회귀모델을 사용하는 것은 동일, X의 전처리가 중요함
예를 들어, \(Y = \beta_0 + \beta_1X_1 + \beta_2X_1^2\) 에서 전처리된 새로운 변수 \(X_1^2\)를 \(X_2\)로 치환하면
\(Y = \beta_0 + \beta_1X_1 + \beta_2X_2\) 의 다중 선형회귀 식과 같다.

다항 회귀 특징

일차 함수식으로 표현할 수 없는 복잡한 데이터 분포에도 적용 가능
과적합¹ 현상 발생 가능성

Scikit-learn을 이용한 다중회귀 구현

fetch_california_housing 데이터를 이용해서 다중 회귀 분석을 구현해보고 모델의 점수를 매겨보자.

fetch_california_housing: scikit learn 라이브러리에서 지원하는 데이터셋 중 하나, California housing price 데이터

파이썬 구현 코드

import matplotlib.pyplot as plt

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# fetch_california_housing 데이터 가져오기
from sklearn.datasets import fetch_california_housing

def load_data():
    
    X, y  = fetch_california_housing(return_X_y = True)

    print("데이터의 입력값(X)의 개수 :", X.shape[1])
    
    # train-test 데이터 나누기 (8:2)
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=123)
    
    return train_X, test_X, train_y, test_y

# regression 모델 생성, 학습
def Multi_Regression(train_X,train_y):
    
    multilinear = LinearRegression()

    multilinear.fit(train_X, train_y)

    return multilinear


def main():
    
    train_X, test_X, train_y, test_y = load_data()
    
    # 모델 생성, 학습
    multilinear = Multi_Regression(train_X,train_y) 
    
    # 학습된 모델로 테스트 데이터에 대해 예측하기
    predicted = multilinear.predict(test_X) 
    
    model_score = multilinear.score(test_X, test_y)
    
    print("\n> 모델 평가 점수 :", model_score)
     
    beta_0 = multilinear.intercept_
    beta_i_list = multilinear.coef_
    
    print("\n> beta_0 : ",beta_0)
    print("> beta_i_list : ",beta_i_list)
    
    return predicted, beta_0, beta_i_list, model_score
    
if __name__ == "__main__":
    main()

실행 결과

과적합 방지

1. 교차 검증 (Cross Validation)

train / test / valiation 으로 데이터 나누기
k-fold 교차 검증을 많이 사용함
출처: Scikit-learn

k-fold validation 방식

훈련 데이터를 계속 변경하여 학습시킴 (특정 데이터에 지나치게 과적합 되는 것을 방지)
데이터 셋을 K개로 나눔
K중 한 개를 검증용, 나머지를 훈련용으로 사용
K개 모델의 평균 성능이 최종 모델 성능
test data는 무조건 고정, 학습에 사용하지 않음

2. 정규화 (Regularization)

모델의 복잡성을 줄여 일반화를 증가시키자. 파라미터 \(\beta\) 의 개수를 줄이자

선형 회귀를 위한 정규화 방법

L1 정규화 (Lasso)
L2 정규화 (Ridge)

Reference

https://www.geeksforgeeks.org/multiple-linear-regression-with-scikit-learn/
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

과도하게 학습 데이터에 맞춰져 일반화 능력이 떨어지는 상태, 학습 데이터에 대해서는 오차가 감소하지만 실제 데이터에 대해서는 오차가 증가하게 된다. ↩︎

Basic | K-fold Validation

13 Oct 2022 in Machine Learning / Basic

과적합 방지를 위한 k-fold validation, scikit-learn을 이용한 구현

K-fold Validation
Scikit-learn의 KFold 사용법
Scikit-learn을 이용한 k-fold 구현
Reference

K-fold Validation

교차 검증 방법 중 하나

출처: Scikit-learn

k-fold validation 방식

훈련 데이터를 계속 변경하여 학습시킴 (특정 데이터에 지나치게 과적합 되는 것을 방지)
데이터 셋을 K개로 나눔
K중 한 개를 검증용, 나머지를 훈련용으로 사용
K개 모델의 평균 성능이 최종 모델 성능
test data는 무조건 고정, 학습에 사용하지 않음

Scikit-learn의 KFold 사용법

>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = KFold(n_splits=2)
>>> kf.get_n_splits(X)
2
>>> print(kf)
KFold(n_splits=2, random_state=None, shuffle=False)

>>> for train_index, test_index in kf.split(X): # split을 이용해 반환된 인덱스 가져오기
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]

Scikit-learn을 이용한 k-fold 구현

fetch_california_housing 데이터를 이용해서 다중 회귀 분석을 구현해보고 모델의 점수를 매겨보자.

fetch_california_housing: scikit learn 라이브러리에서 지원하는 데이터셋 중 하나, California housing price 데이터
먼저 train-test data를 split 한 후
train data에 대해 scikit-learn의 KFold 모듈을 이용해 train-val로 다시 분리
n_splits = 5

파이썬 구현 코드

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

# sklearn의 KFold 모듈 불러오기
from sklearn.model_selection import KFold

def load_data():
    X, y = fetch_california_housing(return_X_y=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=100)
    
    return train_X, test_X, train_y, test_y
    
def kfold_regression(train_X, train_y):
    
    # 반복문 내에서 횟수를 표시하기 위한 변수 설정하기
    n_iter = 0
    # 각 fold 마다 모델 검증 점수를 저장하기 위한 빈 리스트 생성하기
    model_scores = []
    # K-fold에서 몇 개로 분리할지
    n_splits = 5

    kfold = KFold(n_splits) # 5개로 분리
    
    for train_idx, val_idx in kfold.split(train_X):
        
        X_train, X_val =  train_X[train_idx], train_X[val_idx]
        y_train, y_val =  train_y[train_idx], train_y[val_idx]
        
        # 동일한 가중치 사용을 위해 각 fold 마다 모델 초기화 하기
        model = LinearRegression()
        # 모델 학습
        model.fit(X_train, y_train)
        
        # 각 Iter 별 모델 평가 점수 측정
        score = model.score(X_val, y_val)
        
        # 학습용 데이터의 크기를 저장합니다.
        train_size = X_train.shape[0]
        val_size = X_val.shape[0]
    
        print("Iter : {0} Cross-Validation Accuracy : {1}, Train Data 크기 : {2}, Validation Data 크기 : {3}"
              .format(n_iter, score, train_size, val_size))
    
        n_iter += 1
        
        # 전체 모델 점수를 저장하는 리스트에 추가하기
        model_scores.append(score)
        
    return kfold, model, model_scores
        
def main():
    
    # 학습용 데이터와 테스트 데이터 불러오기
    train_X, test_X, train_y, test_y = load_data()
    # KFold 교차 검증을 통한 학습 결과와 회귀 모델을 반환하는 함수 호출하기
    kfold, model, model_scores = kfold_regression(train_X, train_y)
    # 전체 성능 점수의 평균 점수 출력
    print("\n> 평균 검증 모델 점수 : ", np.mean(model_scores))
    
if __name__ == "__main__":
    main()

실행 결과

Reference

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

Basic | CSV 파일 읽기

12 Oct 2022 in Python for 데이터 분석

파이썬에서 csv 파일 읽는 법

import csv

csvreader = csv.reader(open("test.txt"))

for line in csvreader:
    print(line)

Basic | K-means 클러스터링 알고리즘

12 Oct 2022 in Machine Learning / Basic

비지도학습의 한 종류인 K-means 알고리즘

비지도학습
숨겨진 구조 (hidden structure)
Hard Clustering
- Finding K
차원축소
- PCA (Principal Componenet Analysis)
클러스터링

비지도학습

ML
지도학습
regression
classification
비지도학습
차원축소
클러스터링
강화학습

숨겨진 구조 (hidden structure)

Hard Clustering
- 데이터 포인트들은 비슷한 것들끼리 뭉쳐있다.
- ex. 강아지 그룹이거나 고양이 그룹이거나, 둘 중 하나.
- 대표 알고리즘: K-means, Hierarchical Clustering
Soft Clustering
- 한 개의 데이터 포인트는 숨겨진 클러스터들의 결합이다.
- ex. 책 장르 분류 (과학 60% + 역사 40 %)
- 대표 알고리즘: Gaussian Mixture Models (EM, GMM), Soft K-means

Hard Clustering

목표: 비슷한 데이터 포인트끼리 모은다.
k: 클러스터 개수, 찾아야 하는 군집 개수

Finding K

눈으로 확인
모델이 데이터를 얼마나 잘 설명하는가
- k 개수를 변화시켜가면서

데이터의 특성
분석 결과로 얻고자 하는 것 (특정 라벨로 분류, or 분석용인지)

차원축소

데이터들의 여러 특성들을 2차원, 3차원에 나타내기 위해 차원 축소

PCA (Principal Componenet Analysis)

주성분 분석

고차원의 데이터를 저차원으로 줄이기 위해 (ex. 시각화)
- 차원이 축소되면서 손실되는 데이터를 줄이는 것이 PCA의 목적
데이터 정제 (자주 사용)
- noise 제거

클러스터링

주어진 데이터를 비슷한 그룹(클러스터)으로 묶는 알고리즘

몇 개의 그룹으로 묶을 건지 (k 값)은 사람이 줌

K-means

중심 (Centroid)
각 클러스터의 “중심”을 의미
- 각 데이터 포인트들의 x, y이 평균값
중심과의 거리 (Distance)
중심과 데이터 포인트와의 거리
- 보통 norm으로 계산: ||x-c||

Step 1

중심 위치 잡기 (랜덤)
각각의 데이터 포인트들과 임의로 설정한 중심들에 대해 distance 계산
distance를 비교하여 이 데이터포인트가 어느 클러스터에 속할지 결정
A 클러스터의 중심과의 거리가 가깝냐, B 클러스터의 중심과의 거리가 가깝냐

Step 2

한바퀴 다 돌고 클러스터가 정해지면 중심점을 다시 계산
중심점 계산법: 해당 클러스터 내 데ㅣ터 포인트 위치의 무게중심값 (또는 평균)

Step 3

Step 1과 2를 반복하면 중심점 업데이트
어떠한 데이터 포인트의 할당이 변하지 않을 때까지. (중심점을 옮겼는데도 클러스터링이 같게 나올 때)

K-means 구현

wine.csv 데이터를 이용하여 와인들을 clustering 해보기
파일은 아래 텍스트로 첨부

import sklearn.decomposition
import sklearn.cluster
import matplotlib.pyplot as plt
import numpy as np

def main():
    X, attributes = input_data() # 파일에서 데이터 읽어오기
    X = normalize(X) # x값 정규화 (0~1 사이)
    pca, pca_array = run_PCA(X, 2) # 13개의 attributes를 2개로 축소
    labels = kmeans(pca_array, 3, [0, 1, 2]) # k=3, 3개의 군집으로 나누기
    labels = labels.astype(int)
    
    visualize(pca_array, labels) # 시각화
    print(labels)

def input_data():
    X = []
    attributes = []
    
    with open('wine.csv') as fp:
        for line in fp:
            X.append([float(x) for x in line.strip().split(',')])
    
    with open('attributes.txt') as fp:
        attributes = [x.strip() for x in fp.readlines()]

    return np.array(X), attributes

def run_PCA(X, num_components):
    pca = sklearn.decomposition.PCA(n_components=num_components)
    pca.fit(X)
    pca_array = pca.transform(X)

    return pca, pca_array

def kmeans(X, num_clusters, initial_centroid_indices):
    import time
    N = len(X)
    centroids = X[initial_centroid_indices]
    labels = np.zeros(N)
    
    while True:
        is_changed = False # 라벨이 바뀌었는지 

        for i in range(N):
            distances = []
            for centroid in centroids:
                distances.append(distance(X[i], centroid))
                
            if labels[i] != np.argmin(distances):
                is_changed = True
            labels[i] = np.argmin(distances) # 클러스터 0, 1, 2 중 하나

        # print(labels)
        
        ### 새 중심점 계산
        for k in range(num_clusters):
            x = X[labels == k][:, 0]
            y = X[labels == k][:, 1]

            x = np.mean(x)
            y = np.mean(y)
            centroids[k] = [x, y]

        if not is_changed:
            break

    return labels

### 유클리드 거리 norm
def distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

### 정규화
# 0~1사이 값으로 선형 
def normalize(X):
    for dim in range(len(X[0])):
        X[:, dim] -= np.min(X[:, dim])
        X[:, dim] /= np.max(X[:, dim])
    return X


def visualize(X, labels):
#     plt.style.use("ggplot")
    plt.figure(figsize=(10, 6))
    plt.scatter(X[:, 0], X[:, 1], c=labels)
    

if __name__ == '__main__':
    main()

3개의 클러스터로 나눠짐

attributes.csv

Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280/OD315 of diluted wines
Proline

wine.csv

23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0
2,1.76,2.45,15.2,112.0,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450.0
39,1.87,2.45,14.6,96.0,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290.0
06,2.15,2.61,17.6,121.0,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295.0
83,1.64,2.17,14.0,97.0,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045.0
86,1.35,2.27,16.0,98.0,2.98,3.15,0.22,1.85,7.22,1.01,3.55,1045.0
1,2.16,2.3,18.0,105.0,2.95,3.32,0.22,2.38,5.75,1.25,3.17,1510.0
12,1.48,2.32,16.8,95.0,2.2,2.43,0.26,1.57,5.0,1.17,2.82,1280.0
75,1.73,2.41,16.0,89.0,2.6,2.76,0.29,1.81,5.6,1.15,2.9,1320.0
75,1.73,2.39,11.4,91.0,3.1,3.69,0.43,2.81,5.4,1.25,2.73,1150.0
38,1.87,2.38,12.0,102.0,3.3,3.64,0.29,2.96,7.5,1.2,3.0,1547.0
63,1.81,2.7,17.2,112.0,2.85,2.91,0.3,1.46,7.3,1.28,2.88,1310.0
3,1.92,2.72,20.0,120.0,2.8,3.14,0.33,1.97,6.2,1.07,2.65,1280.0
83,1.57,2.62,20.0,115.0,2.95,3.4,0.4,1.72,6.6,1.13,2.57,1130.0
19,1.59,2.48,16.5,108.0,3.3,3.93,0.32,1.86,8.7,1.23,2.82,1680.0
64,3.1,2.56,15.2,116.0,2.7,3.03,0.17,1.66,5.1,0.96,3.36,845.0
06,1.63,2.28,16.0,126.0,3.0,3.17,0.24,2.1,5.65,1.09,3.71,780.0
93,3.8,2.65,18.6,102.0,2.41,2.41,0.25,1.98,4.5,1.03,3.52,770.0
71,1.86,2.36,16.6,101.0,2.61,2.88,0.27,1.69,3.8,1.11,4.0,1035.0
85,1.6,2.52,17.8,95.0,2.48,2.37,0.26,1.46,3.93,1.09,3.63,1015.0
5,1.81,2.61,20.0,96.0,2.53,2.61,0.28,1.66,3.52,1.12,3.82,845.0
05,2.05,3.22,25.0,124.0,2.63,2.68,0.47,1.92,3.58,1.13,3.2,830.0
39,1.77,2.62,16.1,93.0,2.85,2.94,0.34,1.45,4.8,0.92,3.22,1195.0
3,1.72,2.14,17.0,94.0,2.4,2.19,0.27,1.35,3.95,1.02,2.77,1285.0
87,1.9,2.8,19.4,107.0,2.95,2.97,0.37,1.76,4.5,1.25,3.4,915.0
02,1.68,2.21,16.0,96.0,2.65,2.33,0.26,1.98,4.7,1.04,3.59,1035.0
73,1.5,2.7,22.5,101.0,3.0,3.25,0.29,2.38,5.7,1.19,2.71,1285.0
58,1.66,2.36,19.1,106.0,2.86,3.19,0.22,1.95,6.9,1.09,2.88,1515.0
68,1.83,2.36,17.2,104.0,2.42,2.69,0.42,1.97,3.84,1.23,2.87,990.0
76,1.53,2.7,19.5,132.0,2.95,2.74,0.5,1.35,5.4,1.25,3.0,1235.0
51,1.8,2.65,19.0,110.0,2.35,2.53,0.29,1.54,4.2,1.1,2.87,1095.0
48,1.81,2.41,20.5,100.0,2.7,2.98,0.26,1.86,5.1,1.04,3.47,920.0
28,1.64,2.84,15.5,110.0,2.6,2.68,0.34,1.36,4.6,1.09,2.78,880.0
05,1.65,2.55,18.0,98.0,2.45,2.43,0.29,1.44,4.25,1.12,2.51,1105.0
07,1.5,2.1,15.5,98.0,2.4,2.64,0.28,1.37,3.7,1.18,2.69,1020.0
22,3.99,2.51,13.2,128.0,3.0,3.04,0.2,2.08,5.1,0.89,3.53,760.0
56,1.71,2.31,16.2,117.0,3.15,3.29,0.34,2.34,6.13,0.95,3.38,795.0
41,3.84,2.12,18.8,90.0,2.45,2.68,0.27,1.48,4.28,0.91,3.0,1035.0
88,1.89,2.59,15.0,101.0,3.25,3.56,0.17,1.7,5.43,0.88,3.56,1095.0
24,3.98,2.29,17.5,103.0,2.64,2.63,0.32,1.66,4.36,0.82,3.0,680.0
05,1.77,2.1,17.0,107.0,3.0,3.0,0.28,2.03,5.04,0.88,3.35,885.0
21,4.04,2.44,18.9,111.0,2.85,2.65,0.3,1.25,5.24,0.87,3.33,1080.0
38,3.59,2.28,16.0,102.0,3.25,3.17,0.27,2.19,4.9,1.04,3.44,1065.0
9,1.68,2.12,16.0,101.0,3.1,3.39,0.21,2.14,6.1,0.91,3.33,985.0
1,2.02,2.4,18.8,103.0,2.75,2.92,0.32,2.38,6.2,1.07,2.75,1060.0
94,1.73,2.27,17.4,108.0,2.88,3.54,0.32,2.08,8.9,1.12,3.1,1260.0
05,1.73,2.04,12.4,92.0,2.72,3.27,0.17,2.91,7.2,1.12,2.91,1150.0
83,1.65,2.6,17.2,94.0,2.45,2.99,0.22,2.29,5.6,1.24,3.37,1265.0
82,1.75,2.42,14.0,111.0,3.88,3.74,0.32,1.87,7.05,1.01,3.26,1190.0
77,1.9,2.68,17.1,115.0,3.0,2.79,0.39,1.68,6.3,1.13,2.93,1375.0
74,1.67,2.25,16.4,118.0,2.6,2.9,0.21,1.62,5.85,0.92,3.2,1060.0
56,1.73,2.46,20.5,116.0,2.96,2.78,0.2,2.45,6.25,0.98,3.03,1120.0
22,1.7,2.3,16.3,118.0,3.2,3.0,0.26,2.03,6.38,0.94,3.31,970.0
29,1.97,2.68,16.8,102.0,3.0,3.23,0.31,1.66,6.0,1.07,2.84,1270.0
72,1.43,2.5,16.7,108.0,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285.0
37,0.94,1.36,10.6,88.0,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520.0
33,1.1,2.28,16.0,101.0,2.05,1.09,0.63,0.41,3.27,1.25,1.67,680.0
64,1.36,2.02,16.8,100.0,2.02,1.41,0.53,0.62,5.75,0.98,1.59,450.0
67,1.25,1.92,18.0,94.0,2.1,1.79,0.32,0.73,3.8,1.23,2.46,630.0
37,1.13,2.16,19.0,87.0,3.5,3.1,0.19,1.87,4.45,1.22,2.87,420.0
17,1.45,2.53,19.0,104.0,1.89,1.75,0.45,1.03,2.95,1.45,2.23,355.0
37,1.21,2.56,18.1,98.0,2.42,2.65,0.37,2.08,4.6,1.19,2.3,678.0
11,1.01,1.7,15.0,78.0,2.98,3.18,0.26,2.28,5.3,1.12,3.18,502.0
37,1.17,1.92,19.6,78.0,2.11,2.0,0.27,1.04,4.68,1.12,3.48,510.0
34,0.94,2.36,17.0,110.0,2.53,1.3,0.55,0.42,3.17,1.02,1.93,750.0
21,1.19,1.75,16.8,151.0,1.85,1.28,0.14,2.5,2.85,1.28,3.07,718.0
29,1.61,2.21,20.4,103.0,1.1,1.02,0.37,1.46,3.05,0.906,1.82,870.0
86,1.51,2.67,25.0,86.0,2.95,2.86,0.21,1.87,3.38,1.36,3.16,410.0
49,1.66,2.24,24.0,87.0,1.88,1.84,0.27,1.03,3.74,0.98,2.78,472.0
99,1.67,2.6,30.0,139.0,3.3,2.89,0.21,1.96,3.35,1.31,3.5,985.0
96,1.09,2.3,21.0,101.0,3.38,2.14,0.13,1.65,3.21,0.99,3.13,886.0
66,1.88,1.92,16.0,97.0,1.61,1.57,0.34,1.15,3.8,1.23,2.14,428.0
03,0.9,1.71,16.0,86.0,1.95,2.03,0.24,1.46,4.6,1.19,2.48,392.0
84,2.89,2.23,18.0,112.0,1.72,1.32,0.43,0.95,2.65,0.96,2.52,500.0
33,0.99,1.95,14.8,136.0,1.9,1.85,0.35,2.76,3.4,1.06,2.31,750.0
7,3.87,2.4,23.0,101.0,2.83,2.55,0.43,1.95,2.57,1.19,3.13,463.0
0,0.92,2.0,19.0,86.0,2.42,2.26,0.3,1.43,2.5,1.38,3.12,278.0
72,1.81,2.2,18.8,86.0,2.2,2.53,0.26,1.77,3.9,1.16,3.14,714.0
08,1.13,2.51,24.0,78.0,2.0,1.58,0.4,1.4,2.2,1.31,2.72,630.0
05,3.86,2.32,22.5,85.0,1.65,1.59,0.61,1.62,4.8,0.84,2.01,515.0
84,0.89,2.58,18.0,94.0,2.2,2.21,0.22,2.35,3.05,0.79,3.08,520.0
67,0.98,2.24,18.0,99.0,2.2,1.94,0.3,1.46,2.62,1.23,3.16,450.0
16,1.61,2.31,22.8,90.0,1.78,1.69,0.43,1.56,2.45,1.33,2.26,495.0
65,1.67,2.62,26.0,88.0,1.92,1.61,0.4,1.34,2.6,1.36,3.21,562.0
64,2.06,2.46,21.6,84.0,1.95,1.69,0.48,1.35,2.8,1.0,2.75,680.0
08,1.33,2.3,23.6,70.0,2.2,1.59,0.42,1.38,1.74,1.07,3.21,625.0
08,1.83,2.32,18.5,81.0,1.6,1.5,0.52,1.64,2.4,1.08,2.27,480.0
0,1.51,2.42,22.0,86.0,1.45,1.25,0.5,1.63,3.6,1.05,2.65,450.0
69,1.53,2.26,20.7,80.0,1.38,1.46,0.58,1.62,3.05,0.96,2.06,495.0
29,2.83,2.22,18.0,88.0,2.45,2.25,0.25,1.99,2.15,1.15,3.3,290.0
62,1.99,2.28,18.0,98.0,3.02,2.26,0.17,1.35,3.25,1.16,2.96,345.0
47,1.52,2.2,19.0,162.0,2.5,2.27,0.32,3.28,2.6,1.16,2.63,937.0
81,2.12,2.74,21.5,134.0,1.6,0.99,0.14,1.56,2.5,0.95,2.26,625.0
29,1.41,1.98,16.0,85.0,2.55,2.5,0.29,1.77,2.9,1.23,2.74,428.0
37,1.07,2.1,18.5,88.0,3.52,3.75,0.24,1.95,4.5,1.04,2.77,660.0
29,3.17,2.21,18.0,88.0,2.85,2.99,0.45,2.81,2.3,1.42,2.83,406.0
08,2.08,1.7,17.5,97.0,2.23,2.17,0.26,1.4,3.3,1.27,2.96,710.0
6,1.34,1.9,18.5,88.0,1.45,1.36,0.29,1.35,2.45,1.04,2.77,562.0
34,2.45,2.46,21.0,98.0,2.56,2.11,0.34,1.31,2.8,0.8,3.38,438.0
82,1.72,1.88,19.5,86.0,2.5,1.64,0.37,1.42,2.06,0.94,2.44,415.0
51,1.73,1.98,20.5,85.0,2.2,1.92,0.32,1.48,2.94,1.04,3.57,672.0
42,2.55,2.27,22.0,90.0,1.68,1.84,0.66,1.42,2.7,0.86,3.3,315.0
25,1.73,2.12,19.0,80.0,1.65,2.03,0.37,1.63,3.4,1.0,3.17,510.0
72,1.75,2.28,22.5,84.0,1.38,1.76,0.48,1.63,3.3,0.88,2.42,488.0
22,1.29,1.94,19.0,92.0,2.36,2.04,0.39,2.08,2.7,0.86,3.02,312.0
61,1.35,2.7,20.0,94.0,2.74,2.92,0.29,2.49,2.65,0.96,3.26,680.0
46,3.74,1.82,19.5,107.0,3.18,2.58,0.24,3.58,2.9,0.75,2.81,562.0
52,2.43,2.17,21.0,88.0,2.55,2.27,0.26,1.22,2.0,0.9,2.78,325.0
76,2.68,2.92,20.0,103.0,1.75,2.03,0.6,1.05,3.8,1.23,2.5,607.0
41,0.74,2.5,21.0,88.0,2.48,2.01,0.42,1.44,3.08,1.1,2.31,434.0
08,1.39,2.5,22.5,84.0,2.56,2.29,0.43,1.04,2.9,0.93,3.19,385.0
03,1.51,2.2,21.5,85.0,2.46,2.17,0.52,2.01,1.9,1.71,2.87,407.0
82,1.47,1.99,20.8,86.0,1.98,1.6,0.3,1.53,1.95,0.95,3.33,495.0
42,1.61,2.19,22.5,108.0,2.0,2.09,0.34,1.61,2.06,1.06,2.96,345.0
77,3.43,1.98,16.0,80.0,1.63,1.25,0.43,0.83,3.4,0.7,2.12,372.0
0,3.43,2.0,19.0,87.0,2.0,1.64,0.37,1.87,1.28,0.93,3.05,564.0
45,2.4,2.42,20.0,96.0,2.9,2.79,0.32,1.83,3.25,0.8,3.39,625.0
56,2.05,3.23,28.5,119.0,3.18,5.08,0.47,1.87,6.0,0.93,3.69,465.0
42,4.43,2.73,26.5,102.0,2.2,2.13,0.43,1.71,2.08,0.92,3.12,365.0
05,5.8,2.13,21.5,86.0,2.62,2.65,0.3,2.01,2.6,0.73,3.1,380.0
87,4.31,2.39,21.0,82.0,2.86,3.03,0.21,2.91,2.8,0.75,3.64,380.0
07,2.16,2.17,21.0,85.0,2.6,2.65,0.37,1.35,2.76,0.86,3.28,378.0
43,1.53,2.29,21.5,86.0,2.74,3.15,0.39,1.77,3.94,0.69,2.84,352.0
79,2.13,2.78,28.5,92.0,2.13,2.24,0.58,1.76,3.0,0.97,2.44,466.0
37,1.63,2.3,24.5,88.0,2.22,2.45,0.4,1.9,2.12,0.89,2.78,342.0
04,4.3,2.38,22.0,80.0,2.1,1.75,0.42,1.35,2.6,0.79,2.57,580.0
86,1.35,2.32,18.0,122.0,1.51,1.25,0.21,0.94,4.1,0.76,1.29,630.0
88,2.99,2.4,20.0,104.0,1.3,1.22,0.24,0.83,5.4,0.74,1.42,530.0
81,2.31,2.4,24.0,98.0,1.15,1.09,0.27,0.83,5.7,0.66,1.36,560.0
7,3.55,2.36,21.5,106.0,1.7,1.2,0.17,0.84,5.0,0.78,1.29,600.0
51,1.24,2.25,17.5,85.0,2.0,0.58,0.6,1.25,5.45,0.75,1.51,650.0
6,2.46,2.2,18.5,94.0,1.62,0.66,0.63,0.94,7.1,0.73,1.58,695.0
25,4.72,2.54,21.0,89.0,1.38,0.47,0.53,0.8,3.85,0.75,1.27,720.0
53,5.51,2.64,25.0,96.0,1.79,0.6,0.63,1.1,5.0,0.82,1.69,515.0
49,3.59,2.19,19.5,88.0,1.62,0.48,0.58,0.88,5.7,0.81,1.82,580.0
84,2.96,2.61,24.0,101.0,2.32,0.6,0.53,0.81,4.92,0.89,2.15,590.0
93,2.81,2.7,21.0,96.0,1.54,0.5,0.53,0.75,4.6,0.77,2.31,600.0
36,2.56,2.35,20.0,89.0,1.4,0.5,0.37,0.64,5.6,0.7,2.47,780.0
52,3.17,2.72,23.5,97.0,1.55,0.52,0.5,0.55,4.35,0.89,2.06,520.0
62,4.95,2.35,20.0,92.0,2.0,0.8,0.47,1.02,4.4,0.91,2.05,550.0
25,3.88,2.2,18.5,112.0,1.38,0.78,0.29,1.14,8.21,0.65,2.0,855.0
16,3.57,2.15,21.0,102.0,1.5,0.55,0.43,1.3,4.0,0.6,1.68,830.0
88,5.04,2.23,20.0,80.0,0.98,0.34,0.4,0.68,4.9,0.58,1.33,415.0
87,4.61,2.48,21.5,86.0,1.7,0.65,0.47,0.86,7.65,0.54,1.86,625.0
32,3.24,2.38,21.5,92.0,1.93,0.76,0.45,1.25,8.42,0.55,1.62,650.0
08,3.9,2.36,21.5,113.0,1.41,1.39,0.34,1.14,9.4,0.57,1.33,550.0
5,3.12,2.62,24.0,123.0,1.4,1.57,0.22,1.25,8.6,0.59,1.3,500.0
79,2.67,2.48,22.0,112.0,1.48,1.36,0.24,1.26,10.8,0.48,1.47,480.0
11,1.9,2.75,25.5,116.0,2.2,1.28,0.26,1.56,7.1,0.61,1.33,425.0
23,3.3,2.28,18.5,98.0,1.8,0.83,0.61,1.87,10.52,0.56,1.51,675.0
58,1.29,2.1,20.0,103.0,1.48,0.58,0.53,1.4,7.6,0.58,1.55,640.0
17,5.19,2.32,22.0,93.0,1.74,0.63,0.61,1.55,7.9,0.6,1.48,725.0
84,4.12,2.38,19.5,89.0,1.8,0.83,0.48,1.56,9.01,0.57,1.64,480.0
45,3.03,2.64,27.0,97.0,1.9,0.58,0.63,1.14,7.5,0.67,1.73,880.0
34,1.68,2.7,25.0,98.0,2.8,1.31,0.53,2.7,13.0,0.57,1.96,660.0
48,1.67,2.64,22.5,89.0,2.6,1.1,0.52,2.29,11.75,0.57,1.78,620.0
36,3.83,2.38,21.0,88.0,2.3,0.92,0.5,1.04,7.65,0.56,1.58,520.0
69,3.26,2.54,20.0,107.0,1.83,0.56,0.5,0.8,5.88,0.96,1.82,680.0
85,3.27,2.58,22.0,106.0,1.65,0.6,0.6,0.96,5.58,0.87,2.11,570.0
96,3.45,2.35,18.5,106.0,1.39,0.7,0.4,0.94,5.28,0.68,1.75,675.0
78,2.76,2.3,22.0,90.0,1.35,0.68,0.41,1.03,9.58,0.7,1.68,615.0
73,4.36,2.26,22.5,88.0,1.28,0.47,0.52,1.15,6.62,0.78,1.75,520.0
45,3.7,2.6,23.0,111.0,1.7,0.92,0.43,1.46,10.68,0.85,1.56,695.0
82,3.37,2.3,19.5,88.0,1.48,0.66,0.4,0.97,10.26,0.72,1.75,685.0
58,2.58,2.69,24.5,105.0,1.55,0.84,0.39,1.54,8.66,0.74,1.8,750.0
4,4.6,2.86,25.0,112.0,1.98,0.96,0.27,1.11,8.5,0.67,1.92,630.0
2,3.03,2.32,19.0,96.0,1.25,0.49,0.4,0.73,5.5,0.66,1.83,510.0
77,2.39,2.28,19.5,86.0,1.39,0.51,0.48,0.64,9.899999,0.57,1.63,470.0
16,2.51,2.48,20.0,91.0,1.68,0.7,0.44,1.24,9.7,0.62,1.71,660.0
71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.7,0.64,1.74,740.0
4,3.91,2.48,23.0,102.0,1.8,0.75,0.43,1.41,7.3,0.7,1.56,750.0
27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.2,0.59,1.56,835.0
17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.3,0.6,1.62,840.0
13,4.1,2.74,24.5,96.0,2.05,0.76,0.56,1.35,9.2,0.61,1.6,560.0

Scikit-learn 활용

load_iris 데이터셋 활용

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


def load_data():
    
    iris = load_iris()
    irisDF = pd.DataFrame(data = iris.data, columns = iris.feature_names)
    irisDF['target'] = iris.target

    return irisDF
    
def k_means_clus(irisDF):
    
    kmeans = KMeans(init="random", n_clusters=3, random_state=100)
    
    kmeans.fit(irisDF)
    
    irisDF['cluster'] = kmeans.labels_
    
    # 군집화 결과를 보기 위해 groupby 함수를 사용해보겠습니다.
    iris_result = irisDF.groupby(['target','cluster'])['sepal length (cm)'].count()
    print(iris_result)
    
    return iris_result, irisDF

# 군집화 결과 시각화하기
def Visualize(irisDF):
    
    pca = PCA(n_components=2)
    pca_transformed = pca.fit_transform(irisDF)

    irisDF['pca_x'] = pca_transformed[:,0]
    irisDF['pca_y'] = pca_transformed[:,1]

    # 군집된 값이 0, 1, 2 인 경우, 인덱스 추출
    idx_0 = irisDF[irisDF['cluster'] == 0].index
    idx_1 = irisDF[irisDF['cluster'] == 1].index
    idx_2 = irisDF[irisDF['cluster'] == 2].index
    
    # 각 군집 인덱스의 pca_x, pca_y 값 추출 및 시각화
    fig, ax = plt.subplots()
    
    ax.scatter(x=irisDF.loc[idx_0, 'pca_x'], y= irisDF.loc[idx_0, 'pca_y'], marker = 'o')
    ax.scatter(x=irisDF.loc[idx_1, 'pca_x'], y= irisDF.loc[idx_1, 'pca_y'], marker = 's')
    ax.scatter(x=irisDF.loc[idx_2, 'pca_x'], y= irisDF.loc[idx_2, 'pca_y'], marker = '^')
    ax.set_title('K-menas')
    ax.set_xlabel('PCA1')
    ax.set_ylabel('PCA2')
    
    fig.savefig("plot.png")


def main():
    
    irisDF = load_data()
    
    iris_result, irisDF = k_means_clus(irisDF)
    
    Visualize(irisDF)
    
if __name__ == "__main__":
    main()

What is TensorFlow

What is Keras

Tensor Flow 딥러닝 모델 구현

Tensor Flow + Keras를 활용하여 비선형회귀 구현

Reference

Tensor 생성하기

Constant Tensor

Sequence Tensor

Variable Tensor

Tensor 연산

역전파

python 구현

Reference

L1, L2 정규화

drop out

Batch Normalization

랜덤한 int 생성하기

randint() 예시

0 ~ 1 사이의 랜덤한 난수 생성하기

rand() 예시

표준 정규 분포를 따르는 난수 생성하기

정규 분포를 따르는 난수 생성하기

Reference

Decision Tree

언제 Decision Tree를 사용하는가.

불순도/불확실성

Decision Tree 학습 과정

Decision Tree 특징

Scikit-learn 모델 활용

Reference

Multiple Linear Regression

Polynomial Regression

다항 회귀 원리

다항 회귀 특징

Scikit-learn을 이용한 다중회귀 구현

과적합 방지

1. 교차 검증 (Cross Validation)

2. 정규화 (Regularization)

Reference

K-fold Validation

Scikit-learn의 KFold 사용법

Scikit-learn을 이용한 k-fold 구현

Reference

비지도학습

숨겨진 구조 (hidden structure)

Hard Clustering

Finding K

차원축소

PCA (Principal Componenet Analysis)

클러스터링

K-means

K-means 구현

Scikit-learn 활용

Pagination

Templates (for web app):

Error