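When asking ChatGPT solves most errors, is there any point in writing a technical blog?
This time I was working through the environment setup described in the title, got stuck just a little, and ChatGPT could not solve it, so I am leaving this article as a record. Everything other than the part where I got stuck is something ChatGPT will tell you, so I will keep it brief.
No one will read the articles I write, and they will be quietly buried and fade away. Even so, they become a single drop in the learning of a vast language model, nudge its parameters ever so slightly, and give a push to people struggling with environment setup.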
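Conclusion
Mindlessly pulling the latest TensorFlow Docker image is not enough for the Docker environment to recognize the GPU.
You need to check your NVIDIA driver, CUDA, and the matching TensorFlow version, and then find a suitable Docker image.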
docker pull tensorflow/tensorflow:2.13.0rc2-gpu
Use this image (it worked in my environment). That's it.
Overview
From WSL2 to installing Docker Desktop
WSL2
WSL lets you use a Linux environment very easily on a Windows PC. Open Windows PowerShell and run the following. Set a username and password when prompted.
wsl --install
After that, opening Windows PowerShell and running the following drops you into the WSL environment.
wsl
Installing Docker Desktop
Install it like any other software.
In Docker Desktop, go to Settings → Resources → WSL integration, check "Enable integration with my default WSL distro", and select Ubuntu.
Check: enter WSL and try running a docker command.
wsl
docker run hello-world
If the output looks like this, you are fine.
PS C:\Users\AAA> wsl
AAA@AAA-home:/mnt/c/Users/AAA$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c1ec31eb5944: Download complete
Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
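If you would rather script this check than read the output by eye, a minimal sketch could look like the following. The function names and the timeout are my own choices for illustration and do not come from Docker; the only thing taken from the log above is the "Hello from Docker!" greeting.

```python
import shutil
import subprocess

# Greeting printed by the hello-world image, as seen in the log above.
SUCCESS_MARKER = "Hello from Docker!"


def docker_output_ok(output: str) -> bool:
    """Return True if the hello-world greeting appears in the output."""
    return SUCCESS_MARKER in output


def run_hello_world(timeout_s: int = 120) -> bool:
    """Run `docker run --rm hello-world` and check what it printed.

    Returns False when the docker CLI is not on PATH. Inside WSL that often
    (not always) means the WSL integration is not enabled for this distro.
    """
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "run", "--rm", "hello-world"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0 and docker_output_ok(result.stdout)


if __name__ == "__main__":
    print("docker OK" if run_hello_world() else "docker check failed")
```

Run it from inside WSL with python3. The `--rm` flag is only there so the test container does not pile up in `docker ps -a`.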
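Installing the NVIDIA Driver
Same as installing any ordinary software. In some cases it is already installed without you noticing.
Installing the NVIDIA Container Toolkit
The nice thing about Docker is that you do not have to fiddle with matching CUDA versions yourself, but in exchange you need to use the NVIDIA Container Toolkit. The official installation guide is easy to follow.
Enter WSL and run the following.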
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
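The TensorFlow Docker image
If you just pull the latest image by default, the GPU is not recognized. The GPU and CUDA version dependencies have to be sorted out at this step. This one does not use Docker, but it is very thorough ↓
In short, TensorFlow does not seem to support the latest CUDA yet. So with the latest image, nvidia-smi works, but TensorFlow does not recognize the GPU. Decide things in this order: your GPU → the CUDA and NVIDIA driver versions it supports → the TensorFlow version. Once that is settled, look for a suitable image in the tag list on Docker Hub.
Search for it the hard way ↓ https://hub.docker.com/r/tensorflow/tensorflow/tags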
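The "driver → CUDA → TensorFlow version" order can be sketched roughly as below. This is only an illustration under assumptions: the sample nvidia-smi header line in the comments is made up, the function names are mine, and the small table is copied by hand from TensorFlow's published tested build configurations, so please re-check it against the official list before relying on it. The "CUDA Version" shown by nvidia-smi is the highest CUDA version the installed driver supports, which is why it is used as the upper bound here.

```python
import re
from typing import List, Optional, Tuple

# TensorFlow release -> CUDA version it was built and tested against.
# Hand-copied from the official "tested build configurations" table;
# treat as an assumption and verify before use.
TESTED_CUDA = {
    "2.12": (11, 8),
    "2.13": (11, 8),
    "2.14": (11, 8),
    "2.15": (12, 2),
}


def parse_max_cuda(nvidia_smi_text: str) -> Optional[Tuple[int, int]]:
    """Extract 'CUDA Version: X.Y' from the nvidia-smi header, if present."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def candidate_tf_versions(max_cuda: Tuple[int, int]) -> List[str]:
    """List TensorFlow releases whose CUDA build fits under the driver limit.

    Newest release first. This only narrows the search; the real test is
    still whether TensorFlow lists the GPU inside the container.
    """
    fits = [v for v, cuda in TESTED_CUDA.items() if cuda <= max_cuda]
    return sorted(fits, key=lambda v: tuple(int(p) for p in v.split(".")), reverse=True)


if __name__ == "__main__":
    # Hypothetical header line, shaped like real nvidia-smi output.
    sample = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
    limit = parse_max_cuda(sample)
    if limit is not None:
        print(limit, candidate_tf_versions(limit))
```

With the resulting candidates in hand, the remaining work is still to find a matching `-gpu` tag on the Docker Hub page and try it.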
In my case,
docker pull tensorflow/tensorflow:2.13.0rc2-gpu
worked.
Check.
wsl
nvidia-smi
docker pull tensorflow/tensorflow:2.13.0rc2-gpu
docker run -it --gpus all 9e81a648bca3
AAA@AAA-home:/mnt/c/Users/AAA$ docker pull tensorflow/tensorflow:2.13.0rc2-gpu
2.13.0rc2-gpu: Pulling from tensorflow/tensorflow
20d547ab5eb5: Download complete
98936854e6e0: Download complete
f4df3073462b: Download complete
56e0351b9876: Download complete
ad96c12fed87: Download complete
45d1be9ec31b: Download complete
9bbe44e6478e: Download complete
b08eef4b90c8: Download complete
36cf074838a5: Download complete
aa315b7808f0: Download complete
ed0785269bb8: Download complete
6580bd400b80: Download complete
7882189dc767: Download complete
d9fc3f86fea2: Download complete
c3bec927edaa: Download complete
ee594d3d1ae9: Download complete
ece84004a3cd: Download complete
f6694f81f1cc: Download complete
Digest: sha256:9e81a648bca3501a73a4101c4a850209e4292088d2f389512e58e7ee12be0825
Status: Downloaded newer image for tensorflow/tensorflow:2.13.0rc2-gpu
docker.io/tensorflow/tensorflow:2.13.0rc2-gpu
AAA@AAA-home:/mnt/c/Users/kazuharu$ docker images
REPOSITORY                      TAG                       IMAGE ID       CREATED         SIZE
tensorflow/tensorflow           latest-gpu-jupyter        a671d13af415   3 weeks ago     11.8GB
tensorflow/tensorflow           latest-gpu                1f16fbd9be8b   3 weeks ago     11.3GB
nvidia/cuda                     12.2.0-base-ubuntu22.04   ecdf8549dd5f   12 months ago   341MB
tensorflow/tensorflow           2.15.0rc0-gpu             4a0ac12d0ffd   12 months ago   11.2GB
tensorflow/tensorflow           2.13.0rc2-gpu             9e81a648bca3   17 months ago   9.97GB
nvcr.io/nvidia/k8s/cuda-sample  nbody                     59261e419d6d   2 years ago     456MB
AAA@AAA-home:/mnt/c/Users/AAA$ docker run -it --gpus all 9e81a648bca3

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

/sbin/ldconfig.real: /lib/x86_64-linux-gnu/libcudadebugger.so.1 is not a symbolic link
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/libcuda.so.1 is not a symbolic link
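The image ID used above (9e81a648bca3) is what `docker images` showed on my machine, so referring to the image by tag is probably easier to copy. As a rough sketch, the helper below builds a one-shot, non-interactive variant of the same check: it starts the container with `--gpus all`, asks TensorFlow for its GPU list, and exits. The function name is hypothetical, and the script only prints the command rather than running it.

```python
import shlex
from typing import List

# Tag that worked in this article; swap in whichever tag you settled on.
IMAGE = "tensorflow/tensorflow:2.13.0rc2-gpu"

# Same one-liner that is later run by hand inside the container.
CHECK_SNIPPET = (
    "import tensorflow as tf; "
    "print(tf.config.list_physical_devices('GPU'))"
)


def build_gpu_check_command(image: str = IMAGE) -> List[str]:
    """Build a `docker run` argv that prints TensorFlow's visible GPUs."""
    return [
        "docker", "run", "--rm",   # remove the container once it exits
        "--gpus", "all",           # expose every GPU to the container
        image,
        "python3", "-c", CHECK_SNIPPET,
    ]


if __name__ == "__main__":
    # Print a shell-safe version to paste into the WSL terminal.
    print(shlex.join(build_gpu_check_command()))
```

An empty list in the output would mean TensorFlow did not find the GPU with that image, which is the symptom described earlier for the latest image.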
Looks good. Finally, check whether TensorFlow can see the GPU.
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
The result:
root@a7221f2c9a83:/# python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2024-11-19 08:07:25.094835: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-19 08:07:25.114735: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-19 08:07:25.821671: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:07:25.824830: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:07:25.825000: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
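When the list comes back empty instead, it helps to see which CUDA and cuDNN versions the TensorFlow binary in the image was built against. Here is a small sketch using `tf.sysconfig.get_build_info()`. The `gpu_report` name and the shape of the returned dictionary are my own, and the import is guarded so the script also runs (and just reports that TensorFlow is missing) outside the container.

```python
import importlib.util


def gpu_report() -> dict:
    """Collect TensorFlow's build-time CUDA info and the GPUs it can see."""
    report = {"tensorflow_installed": False, "build": {}, "gpus": []}
    if importlib.util.find_spec("tensorflow") is None:
        return report  # e.g. when run on the host instead of in the container

    import tensorflow as tf

    report["tensorflow_installed"] = True
    report["tf_version"] = tf.__version__

    # Versions baked into this TensorFlow binary at build time.
    info = tf.sysconfig.get_build_info()
    report["build"] = {
        key: info.get(key)
        for key in ("is_cuda_build", "cuda_version", "cudnn_version")
    }

    # Same call as the one-liner above, reduced to device names.
    report["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Comparing the reported `cuda_version` with the "CUDA Version" that nvidia-smi prints is one way to spot the kind of mismatch this article is about.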
Good enough for now. A quick training run on random data also goes through on the GPU:
root@a7221f2c9a83:/# python3 -c "
import tensorflow as tf
from tensorflow.keras import layers, models;
model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(100,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
]);
model.compile(optimizer='adam', loss='categorical_crossentropy');
x = tf.random.normal((1000, 100));
y = tf.keras.utils.to_categorical(tf.random.uniform((1000,), maxval=10, dtype=tf.int32), num_classes=10);
model.fit(x, y, epochs=5, batch_size=32)
"
2024-11-19 08:10:54.763876: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-19 08:10:54.783711: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-19 08:10:55.494032: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.497499: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.497752: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.499740: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.500037: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.500308: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.595095: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.595572: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.595611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1726] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-19 08:10:55.596648: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-11-19 08:10:55.596686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21458 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
Epoch 1/5
2024-11-19 08:10:56.331115: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2024-11-19 08:10:56.343800: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe6c15e07e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-11-19 08:10:56.343827: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2024-11-19 08:10:56.346762: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-11-19 08:10:56.353786: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
2024-11-19 08:10:56.412279: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
32/32 [==============================] - 1s 3ms/step - loss: 2.4236
Epoch 2/5
32/32 [==============================] - 0s 3ms/step - loss: 2.1865
Epoch 3/5
32/32 [==============================] - 0s 3ms/step - loss: 2.0418
Epoch 4/5
32/32 [==============================] - 0s 2ms/step - loss: 1.9089
Epoch 5/5
32/32 [==============================] - 0s 2ms/step - loss: 1.7685
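The log above is noisy, so if you only want to know where a computation actually ran, a shorter check might be to run one matrix multiplication and look at the device of the result. This is a sketch under my own naming (`matmul_device` is not a TensorFlow function), with the import guarded so it returns None where TensorFlow is not installed.

```python
import importlib.util
from typing import Optional


def matmul_device(n: int = 256) -> Optional[str]:
    """Multiply two random n x n matrices and return the result's device.

    Returns None when TensorFlow is not installed in this environment.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None

    import tensorflow as tf

    a = tf.random.normal((n, n))
    b = tf.random.normal((n, n))
    c = tf.matmul(a, b)
    # Eager tensors carry the name of the device that holds them.
    return c.device


if __name__ == "__main__":
    device = matmul_device()
    if device is None:
        print("TensorFlow is not installed here")
    else:
        print("matmul result lives on:", device)
```

Inside the container from this article I would expect the printed device name to end in GPU:0, matching the "Created device" line in the log, but I have not pinned the exact string here.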