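Fodder for LLMs

Hello! I am a medical student who does machine learning and programming. I want to use this blog to share information and to understand things better by writing them up.

Setting up a TensorFlow environment with an RTX 4090, Windows, WSL2, and Docker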

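When asking ChatGPT resolves most errors, is there any point in writing a technical blog?

This time I was working on the environment setup in the title, got stuck just a little, and asking ChatGPT did not solve it, so I am leaving this article as a record. Everything apart from the part where I got stuck is something ChatGPT will tell you, so I cover it only loosely.

The articles I write have no readers and quietly sink out of sight, and yet they become one drop in the learning of a vast language model, nudge its parameters ever so slightly, and give a push to people struggling with environment setup.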

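Conclusion

If you just blindly pull the latest TensorFlow Docker image, the GPU is not recognized from inside the Docker environment.

You need to check the NVIDIA driver, CUDA, and the TensorFlow version that matches them, and then find a suitable Docker image.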

docker pull tensorflow/tensorflow:2.13.0rc2-gpu

This is the image to use. Nothing more to it.

Overview

Set up the following, in this order:

  • WSL
  • Docker Desktop
  • NVIDIA driver
  • NVIDIA Container Toolkit
  • TensorFlow Docker image

From WSL2 to installing Docker Desktop

WSL2

WSL is the thing that lets you use a Linux environment very easily on a Windows PC. Open Windows PowerShell and run the following, then set a username and password when prompted.

wsl --install

After that, open Windows PowerShell and run the following to enter the WSL environment.

wsl

Installing Docker Desktop

Install it like any ordinary piece of software.

In Docker Desktop, go to Settings → Resources → WSL integration, check "Enable integration with my default WSL distro", and select Ubuntu.

To verify, enter WSL and try running a docker command.

wsl
docker run hello-world

If the output looks like this, you are fine.

PS C:\Users\AAA> wsl
AAA@AAA-home:/mnt/c/Users/AAA$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c1ec31eb5944: Download complete
Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Installing the NVIDIA driver

Same as installing any ordinary software. In some cases it is already installed without you having noticed.
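Since the driver may or may not already be there, a quick check can help. The following is a minimal sketch, not part of my original setup: it calls `nvidia-smi` with its `--query-gpu` option and parses the CSV it prints. The function names `parse_gpu_rows` and `query_gpus` are made up for this example.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "name,driver_version,memory.total"


def parse_gpu_rows(text):
    """Parse the CSV printed by
    `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, driver, memory = (field.strip() for field in fields)
        rows.append({"name": name, "driver_version": driver, "memory_total": memory})
    return rows


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []
```

An empty list from `query_gpus()` suggests the driver is not installed or not visible from where you ran it; a non-empty list tells you the driver version to carry into the later version-matching step.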

Installing the NVIDIA Container Toolkit

The nice thing about Docker is that you do not have to fuss with matching CUDA versions, but in exchange you need to use the NVIDIA Container Toolkit. The official installation guide is easy to follow.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html

Enter WSL and run the following.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker

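TensorFlow Docker image

If you pull the latest image by default, the GPU is not recognized. All the GPU and CUDA version dependencies have to be sorted out at this step. The article linked below does not use Docker, but it is very thorough.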

medium.com

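In short, it seems TensorFlow does not yet support the latest CUDA. As a result, with the latest image nvidia-smi works, but TensorFlow does not recognize the GPU. Decide things in this order: your GPU → the CUDA and NVIDIA driver versions that go with it → the TensorFlow version. Once that is settled, look for a suitable image in the tag list on Docker Hub.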
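The ordering above can be sketched as a small lookup. This is only an illustration under stated assumptions. The CUDA and cuDNN numbers are what I recall from TensorFlow's tested build configurations table, so verify them there. The rule "image CUDA must not exceed the CUDA version shown in the nvidia-smi header" is a deliberately conservative rule of thumb, not the full compatibility story. `TF_BUILD_CUDA` and `usable_tf_releases` are names invented for this example.

```python
# CUDA / cuDNN versions each TensorFlow release was built against.
# Assumption: recalled from TensorFlow's "tested build configurations"
# table; check the official table before relying on these values.
TF_BUILD_CUDA = {
    "2.12": ("11.8", "8.6"),
    "2.13": ("11.8", "8.6"),
    "2.14": ("11.8", "8.7"),
    "2.15": ("12.2", "8.9"),
}


def version_tuple(text):
    """Turn '11.8' into (11, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def usable_tf_releases(driver_cuda):
    """List TF releases whose build CUDA version is not newer than
    the CUDA version the driver reports (conservative rule of thumb)."""
    limit = version_tuple(driver_cuda)
    return [
        release
        for release, (cuda, _cudnn) in TF_BUILD_CUDA.items()
        if version_tuple(cuda) <= limit
    ]
```

For example, `usable_tf_releases("12.0")` leaves out 2.15 because that release was built against CUDA 12.2. This narrows the candidates; it does not guarantee that a given image will see the GPU, so the check inside the container is still needed.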

Dig through the tags here: https://hub.docker.com/r/tensorflow/tensorflow/tags
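The tag list is long, so it can help to filter it once you have copied the tag names out. A rough sketch, assuming the tags follow the `<version>-gpu` naming seen on that page; `gpu_tags_for` is a hypothetical helper, and it works on a list of strings you supply rather than fetching anything.

```python
import re

# Matches plain GPU tags such as "2.13.0-gpu" or "2.13.0rc2-gpu",
# but not "latest-gpu" or "2.13.0-gpu-jupyter".
TAG_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)(rc\d+)?-gpu$")


def gpu_tags_for(tags, release):
    """Return the plain `-gpu` tags belonging to one major.minor release."""
    matches = []
    for tag in tags:
        found = TAG_PATTERN.match(tag)
        if found and f"{found.group(1)}.{found.group(2)}" == release:
            matches.append(tag)
    return matches
```

Called with the tags that appear in this article (`latest-gpu`, `latest-gpu-jupyter`, `2.15.0rc0-gpu`, `2.13.0rc2-gpu`) and the release `"2.13"`, it keeps only `2.13.0rc2-gpu`.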

In my case, the following image worked:

docker pull tensorflow/tensorflow:2.13.0rc2-gpu

Verification. Run the following from PowerShell:

wsl
nvidia-smi
docker pull tensorflow/tensorflow:2.13.0rc2-gpu
docker run -it --gpus all 9e81a648bca3

The pull and run steps looked like this:
AAA@AAA-home:/mnt/c/Users/AAA$ docker pull tensorflow/tensorflow:2.13.0rc2-gpu
2.13.0rc2-gpu: Pulling from tensorflow/tensorflow
20d547ab5eb5: Download complete
98936854e6e0: Download complete
f4df3073462b: Download complete
56e0351b9876: Download complete
ad96c12fed87: Download complete
45d1be9ec31b: Download complete
9bbe44e6478e: Download complete
b08eef4b90c8: Download complete
36cf074838a5: Download complete
aa315b7808f0: Download complete
ed0785269bb8: Download complete
6580bd400b80: Download complete
7882189dc767: Download complete
d9fc3f86fea2: Download complete
c3bec927edaa: Download complete
ee594d3d1ae9: Download complete
ece84004a3cd: Download complete
f6694f81f1cc: Download complete
Digest: sha256:9e81a648bca3501a73a4101c4a850209e4292088d2f389512e58e7ee12be0825
Status: Downloaded newer image for tensorflow/tensorflow:2.13.0rc2-gpu
docker.io/tensorflow/tensorflow:2.13.0rc2-gpu
AAA@AAA-home:/mnt/c/Users/AAA$ docker images
REPOSITORY                       TAG                       IMAGE ID       CREATED         SIZE
tensorflow/tensorflow            latest-gpu-jupyter        a671d13af415   3 weeks ago     11.8GB
tensorflow/tensorflow            latest-gpu                1f16fbd9be8b   3 weeks ago     11.3GB
nvidia/cuda                      12.2.0-base-ubuntu22.04   ecdf8549dd5f   12 months ago   341MB
tensorflow/tensorflow            2.15.0rc0-gpu             4a0ac12d0ffd   12 months ago   11.2GB
tensorflow/tensorflow            2.13.0rc2-gpu             9e81a648bca3   17 months ago   9.97GB
nvcr.io/nvidia/k8s/cuda-sample   nbody                     59261e419d6d   2 years ago     456MB
AAA@AAA-home:/mnt/c/Users/AAA$ docker run -it --gpus all 9e81a648bca3

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

/sbin/ldconfig.real: /lib/x86_64-linux-gnu/libcudadebugger.so.1 is not a symbolic link

/sbin/ldconfig.real: /lib/x86_64-linux-gnu/libcuda.so.1 is not a symbolic link
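The session above starts the container by image ID, which differs per pull and is easy to mistype. As a sketch of an alternative, the same command can be assembled from the tag instead. `docker_run_command` is a name made up for this example, and the `--rm` flag (remove the container on exit) is my addition here, not something used in the session above.

```python
import shlex


def docker_run_command(image, gpus="all", remove=True):
    """Build the argv for an interactive GPU-enabled `docker run`."""
    argv = ["docker", "run", "-it"]
    if remove:
        argv.append("--rm")  # assumption: throwaway container, deleted on exit
    argv += ["--gpus", gpus, image]
    return argv


# Print the command line for the image used in this article.
print(shlex.join(docker_run_command("tensorflow/tensorflow:2.13.0rc2-gpu")))
```

The returned list can also be handed to `subprocess.run` directly if you prefer to launch the container from a script.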

Looks good. Finally, check whether TensorFlow can see the GPU.

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The result:

root@a7221f2c9a83:/# python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2024-11-19 08:07:25.094835: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-19 08:07:25.114735: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-19 08:07:25.821671: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:07:25.824830: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:07:25.825000: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
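The one-liner prints an empty list when the GPU is not visible, which is easy to miss among the log lines. A slightly more explicit version might look like the sketch below. It uses the same `tf.config.list_physical_devices('GPU')` call; the function names `summarize_gpus` and `check_tensorflow_gpu` are invented for this example.

```python
def summarize_gpus(devices):
    """Turn a list of devices (or device names) into a one-line verdict."""
    names = [getattr(device, "name", str(device)) for device in devices]
    if not names:
        return "No GPU visible to TensorFlow"
    return f"{len(names)} GPU(s) visible: " + ", ".join(names)


def check_tensorflow_gpu():
    """Import TensorFlow lazily so the message is clear when it is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed in this environment"
    return summarize_gpus(tf.config.list_physical_devices("GPU"))
```

Inside the container, `print(check_tensorflow_gpu())` gives a single line to look for. With the latest image, the expectation from this article would be the "No GPU visible" message.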

Good enough for now. As a last smoke test, I trained a small model on random data:

root@a7221f2c9a83:/# python3 -c "
import tensorflow as tf
from tensorflow.keras import layers, models;

model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(100,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
]);
model.compile(optimizer='adam', loss='categorical_crossentropy');

x = tf.random.normal((1000, 100));
y = tf.keras.utils.to_categorical(tf.random.uniform((1000,), maxval=10, dtype=tf.int32), num_classes=10);

model.fit(x, y, epochs=5, batch_size=32)
"
2024-11-19 08:10:54.763876: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-19 08:10:54.783711: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-19 08:10:55.494032: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.497499: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.497752: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.499740: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.500037: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.500308: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.595095: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.595572: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.595611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1726] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2024-11-19 08:10:55.596648: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.596686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21458 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
Epoch 1/5
2024-11-19 08:10:56.331115: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2024-11-19 08:10:56.343800: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe6c15e07e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-11-19 08:10:56.343827: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2024-11-19 08:10:56.346762: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-11-19 08:10:56.353786: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
2024-11-19 08:10:56.412279: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
32/32 [==============================] - 1s 3ms/step - loss: 2.4236
Epoch 2/5
32/32 [==============================] - 0s 3ms/step - loss: 2.1865
Epoch 3/5
32/32 [==============================] - 0s 3ms/step - loss: 2.0418
Epoch 4/5
32/32 [==============================] - 0s 2ms/step - loss: 1.9089
Epoch 5/5
32/32 [==============================] - 0s 2ms/step - loss: 1.7685
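Two numbers in this log can be sanity-checked by hand. The sketch below is an illustration of my own: it computes the steps per epoch for 1000 samples at batch size 32, and the cross-entropy a model would get by guessing all 10 classes uniformly. The helper names are hypothetical.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches per epoch; the last batch may be partial."""
    return math.ceil(num_samples / batch_size)


def uniform_guess_loss(num_classes):
    """Categorical cross-entropy of a uniform prediction: -ln(1/K) = ln(K)."""
    return math.log(num_classes)


print(steps_per_epoch(1000, 32))         # 32, matching the "32/32" progress bar
print(round(uniform_guess_loss(10), 4))  # 2.3026
```

The first-epoch loss of 2.4236 sits close to that uniform baseline, and since the labels are random, the later drop to 1.7685 only shows that the model is fitting noise. That is all this test needs to show: training steps are running on the GPU and the weights are being updated.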