Study Notes (TensorFlow, LeNet)

KloudHyun 2023. 11. 1. 18:01

📌 Classifying Mask Wearing

➡️ Data Source

→ https://www.kaggle.com/datasets/ashishjangra27/face-mask-12k-images-dataset

➡️ Finding files with the ls command

→ The ls command lists the files in the current directory.

ls
>>>
2023-11-01  오후 01:37    <DIR>          .
2023-10-30  오후 04:04    <DIR>          ..
2023-10-31  오후 09:59    <DIR>          .ipynb_checkpoints
2023-10-31  오후 09:57         3,172,365 1. Beginning of Deeplearning.ipynb
2023-10-31  오후 09:56         1,298,479 2. Deep Learning from scratch.ipynb
2023-11-01  오후 01:37           167,739 3. Dive to cnn.ipynb
2023-11-01  오후 01:59    <DIR>          data
2023-10-31  오후 02:57        37,918,144 MNIST_CNN_model.h5

➡️ Dealing with zip files

→ The zipfile module can extract a zip archive.

import zipfile

content_zip = zipfile.ZipFile("./data/archive.zip")
content_zip.extractall("./data")

content_zip.close()
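
→ As a side note, ZipFile also works as a context manager, which closes the archive automatically. The sketch below is self-contained: it builds a tiny throwaway archive in a temporary folder (the file names and the extract_zip helper are made up for illustration) and extracts it the same way.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    # The with-block closes the archive even if extraction raises.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Demo on a throwaway archive (hypothetical file names, not the Kaggle data).
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("Train/WithMask/a.png", b"fake image bytes")

extracted = extract_zip(demo_zip, workdir / "out")
print(extracted)
```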

➡️ Organizing Data files

→ os.listdir shows the folders and files under a given path.

import os
import glob

path = "./data/Face Mask Dataset/"
os.listdir(path)
>>>>
['Test', 'Train', 'Validation']

path = "./data/Face Mask Dataset/"
os.listdir(path+"/"+'Train')
>>>>
['WithMask', 'WithoutMask']

path = "./data/Face Mask Dataset/"
dataset = {
	"image_path" : [], 
	"mask_status" : [], 
	"where" : []
}
for where in os.listdir(path):
    for status in os.listdir(path+"/"+where):
        for image in glob.glob(path+where+"/"+status+"/"+"*.png"):
            dataset["image_path"].append(image)
            dataset["mask_status"].append(status)
            dataset["where"].append(where)

import pandas as pd

dataset = pd.DataFrame(dataset) 
dataset.head()
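
→ The same three-level walk (split / status / image) can also be written with pathlib. A minimal sketch, assuming the folder layout shown above; collect_images is a name made up here, and the demo runs on a tiny fake tree instead of the real dataset.

```python
import tempfile
from pathlib import Path


def collect_images(root, pattern="*.png"):
    """Return one record per image found under root/<where>/<status>/."""
    records = []
    for image in sorted(Path(root).glob(f"*/*/{pattern}")):
        records.append({
            "image_path": str(image),
            "mask_status": image.parent.name,   # e.g. WithMask
            "where": image.parent.parent.name,  # e.g. Train
        })
    return records


# Demo on a tiny fake tree (empty files stand in for the images).
root = Path(tempfile.mkdtemp())
for where, status, name in [("Train", "WithMask", "1.png"),
                            ("Test", "WithoutMask", "2.png")]:
    folder = root / where / status
    folder.mkdir(parents=True)
    (folder / name).touch()

records = collect_images(root)
for record in records:
    print(record["where"], record["mask_status"])
```

Passing records to pd.DataFrame would give the same three columns as the loop above.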

🔻Checking the class ratio

→ Passing a column name to DataFrame's value_counts returns the count of each value in that column.

dataset.value_counts("mask_status")
>>>>
mask_status
WithoutMask    5909
WithMask       5883

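import matplotlib.pyplot as plt
import seaborn as sns

# value_counts sorts by count (descending), so index by label rather than by position
counts = dataset.value_counts("mask_status")
print("With Mask:", counts["WithMask"])
print("Without Mask:", counts["WithoutMask"])

sns.countplot(x=dataset['mask_status'])
plt.show()
>>>>
With Mask: 5883
Without Mask: 5909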
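
→ To see the ratio directly instead of the raw counts, value_counts also accepts normalize=True. A small sketch on a stand-in DataFrame (named demo here) that only reproduces the two class counts reported by value_counts above, not the real image paths.

```python
import pandas as pd

# Stand-in for the real dataset: only the label column, with the counts from above.
demo = pd.DataFrame({"mask_status": ["WithMask"] * 5883 + ["WithoutMask"] * 5909})

# normalize=True returns fractions that sum to 1 instead of counts.
ratio = demo["mask_status"].value_counts(normalize=True)
print(ratio.round(3))
```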

🔻Checking random images

import cv2
import numpy as np

plt.figure(figsize=(15, 10))

for i in range(9):
    # pick a random row of dataset (0 is a valid index, so start from 0)
    idx = np.random.randint(0, len(dataset))

    # draw in a 3x3 grid using subplot
    plt.subplot(3, 3, i + 1)

    # load the image with cv2 using the path at the random index (via .loc);
    # cv2 reads BGR, so convert to RGB before handing it to matplotlib
    image = cv2.imread(dataset.loc[idx, "image_path"])
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    # title of each image: here, the mask status
    plt.title(dataset.loc[idx, "mask_status"], size=15)

    # hide the x and y tick values (by passing empty lists)
    plt.xticks([])
    plt.yticks([])

plt.show()

🔻Arrange Data

→ Only the train data will be used from here on, so train_df is rebuilt with a fresh index.

train_df = dataset[dataset["where"]=="Train"]
test_df = dataset[dataset["where"]=="Test"]
valid_df = dataset[dataset["where"]=="Validation"]
train_df = train_df.reset_index().drop("index", axis=1)
train_df.head()

🔻Check Data ratio

→ Checking each dataset shows that they all have nearly the same class ratio.

# one figure holding three side-by-side subplots
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.countplot(x = train_df['mask_status'])
plt.title("Training Dataset", size=10)

plt.subplot(1, 3, 2)
sns.countplot(x = test_df['mask_status'])
plt.title("Test Dataset", size=10)

plt.subplot(1, 3, 3)
sns.countplot(x = valid_df['mask_status'])
plt.title("Validation Dataset", size=10)

plt.show()

🔻Preprocessing image data

→ Read each image in grayscale, then resize it

→ Append the result to a list

from tqdm.notebook import tqdm
data = []
image_size = 150

# tqdm shows a progress bar while looping over the training images
for i in tqdm(range(len(train_df))):
    # read the image in grayscale
    img_array = cv2.imread(train_df['image_path'][i], cv2.IMREAD_GRAYSCALE)

    # resize the image
    new_image_array = cv2.resize(img_array, (image_size, image_size))

    # label: 1 for WithMask, 0 for WithoutMask
    if train_df["mask_status"][i] == "WithMask":
        data.append([new_image_array, 1])
    else:
        data.append([new_image_array, 0])

Example entry of data: each list item holds the resized image array and its label (1 in this example).

➡️ Why shuffle the dataset

→ The data was appended in order, so it is a good idea to shuffle it once.

np.random.shuffle(data)
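
→ If the images and labels already live in two separate arrays, shuffling each one independently would break the pairing. One common alternative, sketched here with dummy data, is to draw a single random permutation and index both arrays with it. The seed and array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(13)  # arbitrary seed, only for reproducibility

# Dummy data: 6 "images" of 2x2 pixels; image i is filled with the value i,
# and its label is i % 2, so the pairing is easy to verify afterwards.
images = np.stack([np.full((2, 2), i) for i in range(6)])
labels = np.arange(6) % 2

# One permutation applied to both arrays keeps every image with its label.
order = rng.permutation(len(images))
shuffled_images = images[order]
shuffled_labels = labels[order]

print(order)
```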

🔻Checking the resized images

fig, ax = plt.subplots(2, 3, figsize=(10, 6))

for row in range(2):
    for col in range(3):
        image_index = row*100+col
        
        ax[row, col].axis("off")
        ax[row, col].imshow(data[image_index][0], cmap="gray")
        
        if data[image_index][1] == 0:
            ax[row, col].set_title("Without Mask")
        else:
            ax[row, col].set_title("With Mask")

🔻Data Setting (Arrange)

→ Append the pixel data to X and the labels to y

X = [] ; y= []

for image in data:
    X.append(image[0])
    y.append(image[1])

X=np.array(X)
y=np.array(y)

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state = 13)
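
→ train_test_split can also keep the class ratio identical in both parts through its stratify argument. This is an optional variation, not what was run above; the sketch uses dummy arrays with made-up names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples, 60 of class 0 and 40 of class 1.
X_demo = np.arange(100).reshape(100, 1)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

print(y_tr.mean(), y_va.mean())
```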

🔻Modeling

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # Conv2D with 32 channels and a 5x5 kernel (commonly used, along with 3x3)
    # strides of 1x1 (slide one pixel at a time), with padding
    layers.Conv2D(32, kernel_size=(5, 5), strides=(1, 1), padding="same", activation="relu", input_shape=(150, 150, 1)),

    # the MaxPooling size and the strides size are often set to the same value
    layers.MaxPooling2D(pool_size = (2, 2), strides=(2, 2)),

    # the feature map shrank after MaxPooling, so increase the channels to find more features
    layers.Conv2D(64, (2, 2), activation="relu", padding="same"),

    # same as above
    layers.MaxPooling2D(pool_size = (2, 2), strides=(2, 2)),

    # Dropout at a 25% rate to prevent overfitting
    layers.Dropout(0.25),

    layers.Flatten(),
    layers.Dense(1000, activation='relu'),

    # final output layer
    layers.Dense(1, activation="sigmoid")
])

model.compile(
    optimizer="adam", loss=tf.keras.losses.BinaryCrossentropy(),
    metrics =["accuracy"]
)
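
→ To see why the Dense(1000) layer dominates this model, the output shapes and parameter counts can be worked out by hand. The helper names below are made up for this note; the formulas assume padding="same" with stride 1 for the convolutions and the default padding="valid" for the pooling layers, as in the model above.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def pool_out(size, pool=2, stride=2):
    """Output length of a 'valid' pooling window along one axis."""
    return (size - pool) // stride + 1


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


side = 150                              # input is 150 x 150 x 1
conv1 = conv_params(5, 5, 1, 32)        # 'same' padding keeps 150 x 150
side = pool_out(side)                   # 75 x 75 x 32
conv2 = conv_params(2, 2, 32, 64)       # 'same' padding keeps 75 x 75
side = pool_out(side)                   # 37 x 37 x 64
flat = side * side * 64                 # Flatten
dense1 = dense_params(flat, 1000)
dense2 = dense_params(1000, 1)

total = conv1 + conv2 + dense1 + dense2
print(side, flat, total)
```

Comparing total with the figure printed by model.summary() is a quick sanity check on the architecture.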

🔻Fit

# X_train = X_train.reshape(len(X_train), X_train.shape[1], X_train.shape[2], 1)
X_train = X_train.reshape(-1, 150, 150, 1)
X_val = X_val.reshape(-1, 150, 150, 1)

hist = model.fit(X_train, y_train, epochs = 4, batch_size = 32)
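
→ Conv2D expects a channel axis, which the grayscale arrays lack. A small numpy sketch with a dummy array showing what reshape(-1, 150, 150, 1) does: -1 lets numpy infer the number of images, and the result has the same shape as appending a new last axis.

```python
import numpy as np

# Dummy stand-in for X_train: 4 grayscale images of 150 x 150 pixels.
batch = np.zeros((4, 150, 150), dtype=np.uint8)

with_channel = batch.reshape(-1, 150, 150, 1)   # -1 is inferred as 4
same_thing = batch[..., np.newaxis]             # equivalent: append a length-1 axis

print(with_channel.shape)
```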

🔻Evaluate

→ Evaluate performance

model.evaluate(X_val, y_val)
>>>>
63/63 [==============================] - 3s 54ms/step - loss: 0.1177 - accuracy: 0.9740
[0.11772098392248154, 0.9739999771118164]

🔻Predict

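from sklearn.metrics import classification_report, confusion_matrix

# sigmoid outputs above 0.5 are treated as class 1 (WithMask)
prediction = (model.predict(X_val) > 0.5).astype("int32")

print(classification_report(y_val, prediction))
print(confusion_matrix(y_val, prediction))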
>>>>
63/63 [==============================] - 3s 47ms/step
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       982
           1       0.97      0.98      0.97      1018

    accuracy                           0.97      2000
   macro avg       0.97      0.97      0.97      2000
weighted avg       0.97      0.97      0.97      2000

[[950  32]
 [ 20 998]]
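
→ The numbers in the report can be re-derived from the confusion matrix. In scikit-learn's layout, rows are the true classes and columns the predicted ones, so [[950, 32], [20, 998]] reads as TN, FP / FN, TP with class 1 (WithMask) as the positive class.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
tn, fp = 950, 32
fn, tp = 20, 998

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of everything predicted as 1, how much was right
recall_1 = tp / (tp + fn)      # of everything truly 1, how much was found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
wrong = fp + fn                # number of misclassified validation images

print(round(accuracy, 3), round(precision_1, 2), round(recall_1, 2), wrong)
```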

🔻 Showing the misclassified images

→ Check the data that was predicted incorrectly

→ Some photos don't show the face properly... and some seem to show people wearing unusual(?) masks.

wrong_result = []

for n in range(0, len(y_val)):
    # among the predictions, record the image index of those that differ from the true value
    if prediction[n] != y_val[n]:
        wrong_result.append(n)
        
len(wrong_result)
>>>>
52
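
→ The loop above can also be written without a Python loop. A sketch with dummy arrays: model.predict returns shape (N, 1), so the predictions are flattened first to line up with the 1-D labels.

```python
import numpy as np

# Dummy stand-ins: sigmoid outputs with shape (N, 1) and 1-D true labels.
probs = np.array([[0.9], [0.2], [0.7], [0.4]])
y_true = np.array([1, 1, 0, 0])

pred = (probs > 0.5).astype("int32").ravel()   # shape (N, 1) -> (N,)
wrong_idx = np.flatnonzero(pred != y_true)     # indices where prediction and label differ

print(wrong_idx.tolist())
```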

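import random

# random.sample draws 6 distinct indices (random.choices could repeat an image)
samples = random.sample(wrong_result, k=6)

plt.figure(figsize=(14, 12))
for idx, n in enumerate(samples):
    plt.subplot(2, 3, idx + 1)
    plt.imshow(X_val[n].reshape(150, 150), cmap="gray", interpolation='nearest')
    # the title shows the predicted label (1 = WithMask, 0 = WithoutMask)
    plt.title(prediction[n])
    plt.axis("off")
plt.show()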