[DL] Understanding the MNIST Dataset

부파이워 2023. 10. 20. 11:06

MNIST, often called the "hello, world" of deep learning, is a dataset of grayscale handwritten digit images (28x28 pixels) used for classifying digits into 10 categories, 0 through 9.

It consists of 60,000 training images and 10,000 test images, assembled by the U.S. National Institute of Standards and Technology (NIST) in the 1980s.

Keras already provides a loader for the MNIST dataset (the data file is downloaded and cached locally on first use), and it can be called as follows.

from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

At this point, a deep learning newbie like me gets curious: why is the data unpacked in the form (train_images, train_labels), (test_images, test_labels)?

I found the answer with a trick I learned recently: printing the source code of a library function.

import inspect

code = inspect.getsource(mnist.load_data)

print(code)
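As a side note, here is a minimal sketch of how inspect.getsource behaves in general, using only the standard library. The choice of json.dumps and len is my own for illustration and is not part of the Keras example: getsource works on functions written in Python, while C built-ins have no Python source to show and raise TypeError.

```python
import inspect
import json

# json.dumps is written in Python, so its source text can be retrieved.
source = inspect.getsource(json.dumps)
first_line = source.splitlines()[0]
print(first_line)

# len is a C built-in with no Python source file; getsource raises TypeError.
try:
    inspect.getsource(len)
    builtin_has_source = True
except TypeError:
    builtin_has_source = False
print(builtin_has_source)
```

This is why the trick works on mnist.load_data: it is an ordinary Python function.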

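Running inspect.getsource on mnist.load_data prints the source below. Seeing the form of the return statement, it finally clicked: the function returns two (images, labels) tuples, one for training and one for test.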

@keras_export("keras.datasets.mnist.load_data")
def load_data(path="mnist.npz"):
    """Loads the MNIST dataset.

    This is a dataset of 60,000 28x28 grayscale images of the 10 digits,
    along with a test set of 10,000 images.
    More info can be found at the
    [MNIST homepage](http://yann.lecun.com/exdb/mnist/).

    Args:
      path: path where to cache the dataset locally
        (relative to `~/.keras/datasets`).

    Returns:
      Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.

    **x_train**: uint8 NumPy array of grayscale image data with shapes
      `(60000, 28, 28)`, containing the training data. Pixel values range
      from 0 to 255.

    **y_train**: uint8 NumPy array of digit labels (integers in range 0-9)
      with shape `(60000,)` for the training data.

    **x_test**: uint8 NumPy array of grayscale image data with shapes
      (10000, 28, 28), containing the test data. Pixel values range
      from 0 to 255.

    **y_test**: uint8 NumPy array of digit labels (integers in range 0-9)
      with shape `(10000,)` for the test data.

    Example:

    ```python
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    assert x_train.shape == (60000, 28, 28)
    assert x_test.shape == (10000, 28, 28)
    assert y_train.shape == (60000,)
    assert y_test.shape == (10000,)
    ```

    License:
      Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset,
      which is a derivative work from original NIST datasets.
      MNIST dataset is made available under the terms of the
      [Creative Commons Attribution-Share Alike 3.0 license.](
      https://creativecommons.org/licenses/by-sa/3.0/)
    """
    origin_folder = (
        "https://storage.googleapis.com/tensorflow/tf-keras-datasets/"
    )
    path = get_file(
        path,
        origin=origin_folder + "mnist.npz",
        file_hash="731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1",  # noqa: E501
    )
    with np.load(path, allow_pickle=True) as f:
        x_train, y_train = f["x_train"], f["y_train"]
        x_test, y_test = f["x_test"], f["y_test"]

        return (x_train, y_train), (x_test, y_test)
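The return statement at the end of the listing is what makes the nested unpacking work. Below is a minimal sketch of the same pattern with tiny Python lists; load_toy_data and its values are hypothetical stand-ins invented for illustration, not part of Keras.

```python
def load_toy_data():
    """Hypothetical stand-in for mnist.load_data() using tiny Python lists."""
    x_train, y_train = [[0, 0], [255, 0], [0, 255]], [0, 1, 2]
    x_test, y_test = [[255, 255]], [3]
    # Same structure as load_data: a tuple holding two (inputs, labels) pairs.
    return (x_train, y_train), (x_test, y_test)


result = load_toy_data()
print(type(result).__name__, len(result))

# The nested pattern on the left mirrors the nested tuple that is returned.
(train_images, train_labels), (test_images, test_labels) = load_toy_data()
print(len(train_images), len(train_labels))
print(len(test_images), len(test_labels))

# Equivalent two-step form: take each pair first, then split it.
train_pair, test_pair = load_toy_data()
images_again, labels_again = train_pair
```

Either form gives the same four objects. The one-line version simply names all of them at once.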

Inspecting the training and test data gives the following.

train_images.shape  # a uint8 array of 60,000 samples, each 28x28 pixels

(60000, 28, 28)

len(train_labels)

60000

test_images.shape

(10000, 28, 28)

len(test_labels)

10000
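To make the shapes above more concrete, here is a small arithmetic sketch. It only uses the shapes and the uint8 dtype reported above; the describe helper is a hypothetical name of my own. It works out how many pixels each image has and roughly how many bytes the raw pixel data takes, since uint8 stores one pixel per byte.

```python
import math

# Shapes reported for the MNIST image arrays.
train_shape = (60000, 28, 28)
test_shape = (10000, 28, 28)

BYTES_PER_PIXEL = 1  # uint8 stores each pixel value (0-255) in one byte


def describe(shape):
    """Split a (samples, height, width) shape into per-image and total sizes."""
    n_samples, height, width = shape
    pixels_per_image = height * width
    total_bytes = math.prod(shape) * BYTES_PER_PIXEL
    return n_samples, pixels_per_image, total_bytes


train_info = describe(train_shape)
test_info = describe(test_shape)
print(train_info)  # (60000, 784, 47040000)
print(test_info)   # (10000, 784, 7840000)
```

So the first axis counts samples, and each sample is a 28x28 grid of 784 pixel values.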

* Reference: Deep Learning with Python, 2nd Edition, by the creator of Keras (Korean edition: 케라스 창시자에게 배우는 딥러닝 개정 2판)