
I. Running/debugging torch distributed training in PyCharm

The overall process is fairly simple; you can also refer to Section II below, "Running/debugging deepspeed distributed training in PyCharm".
The key steps are:

Symlink the distributed package
By analyzing the command used to launch distributed training, we first need to find the torch.distributed.launch file and symlink it into our PyCharm project directory. Why use a symlink instead of copying? Because a symlink does not change the file's real path, so launch.py can import the packages it needs without any modification.

On Ubuntu, create the symlink with the following command:

ln -s /yourpython/lib/python3.6/site-packages/torch/distributed/ /yourprogram/
The command above links the parent directory distributed rather than launch.py itself, because this makes it obvious that launch.py is reached through a symlink and keeps it from being confused with other files in the project.
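
If you are not sure where the distributed package lives in your environment, a quick way to locate it (a minimal sketch; run it with the interpreter of the environment you actually train with) is:

python -c "import torch.distributed, os; print(os.path.dirname(torch.distributed.__file__))"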

Set the PyCharm run configuration
Open PyCharm and go to Run -> Edit Configurations to open the run/debug configuration dialog.

You only need to set Script path to the path of launch.py, and Parameters to the arguments for launch.py. Following the command-line invocation, set them as follows:

--nproc_per_node=4
tools/train.py --cfg xxx.yaml
With the steps above you can run distributed training inside PyCharm. That said, when debugging a model it is usually better to modify train.py and debug in single-GPU mode. This is not because distributed mode cannot be debugged; it is simply that the data flow is easier to follow on a single GPU, which shortens debugging time.
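
For reference, the PyCharm configuration above reproduces what you would otherwise run on the command line from the project root (xxx.yaml is the placeholder config file from the original command):

python -m torch.distributed.launch --nproc_per_node=4 tools/train.py --cfg xxx.yaml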

II. Running/debugging deepspeed distributed training in PyCharm

1. PyCharm version

I am using version 2020.1.

2. Environment

(1) First, the server needs the environment set up, with a copy of the code, the corresponding model, and a conda virtual environment configured.
(2) Locally you only need to copy the code and data; the model does not need to be copied.
(3) Code: here I am reproducing large language model pre-training with https://github.com/hiyouga/LLaMA-Factory; other codebases work the same way.
The key piece is the launch script; other codebases are analogous, as long as they are launched with deepspeed.
pretrain.sh

deepspeed  --master_port=9901 src/train_bash.py \
    --deepspeed ./ds_config.json \
    --stage pt \
    --do_train \
    --model_name_or_path ../Yi-34B  \
    --dataset input_test \
    --finetuning_type full \
    --lora_target q_proj,v_proj \
    --output_dir Yi-34B_output_test \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 300 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --preprocessing_num_workers 20 \
    --plot_loss \
    --bf16

3. Local setup: installing deepspeed

Compress the deepspeed package from the server's virtual environment and copy it to the local machine.
On my server the virtual environment's package lives at /home/centos/anaconda3/envs/factory/lib/python3.10/site-packages/deepspeed/.
Compress the deepspeed package into a zip file, copy it to the local project directory D:\code\LLaMA-Factory, and unzip it there, giving D:\code\LLaMA-Factory\deepspeed.
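
A minimal sketch of this step on the server (the paths are the ones used in this article; adjust them to your own environment):

# locate the installed deepspeed package
python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))"
# pack it up so it can be copied to the local Windows machine
cd /home/centos/anaconda3/envs/factory/lib/python3.10/site-packages/
zip -r deepspeed.zip deepspeed/
# then copy deepspeed.zip to D:\code\LLaMA-Factory\ and unzip it there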

4. Remote setup: symlink

Check the launcher script with vim /home/centos/anaconda3/envs/factory/bin/deepspeed; it shows that the file actually used is deepspeed.launcher.runner.
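
For reference, the pip-generated deepspeed entry script is just a thin wrapper; printing it is how you can tell the real entry point is deepspeed.launcher.runner (the output shown in the comments is illustrative and varies with pip versions):

cat /home/centos/anaconda3/envs/factory/bin/deepspeed
# typically prints something along the lines of:
#   #!/home/centos/anaconda3/envs/factory/bin/python
#   from deepspeed.launcher.runner import main
#   if __name__ == '__main__':
#       sys.exit(main())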

By analyzing the command used to launch distributed training, we first need to find this deepspeed.launcher.runner file and symlink it into our PyCharm project directory. Why use a symlink instead of copying? Because a symlink does not change the file's real path, so runner.py can import the packages it needs without any modification.

On CentOS, create the symlink with the following command:

ln -s /home/centos/anaconda3/envs/factory/lib/python3.10/site-packages/deepspeed/  /data/liulei/cpt/LLaMA-Factory/

To remove the symlink, use:

unlink /data/liulei/cpt/LLaMA-Factory/deepspeed

5. PyCharm configuration

Configure the local code to use the remote server's Python interpreter:

(1) Go in through Settings

(2) Add a new interpreter

(3) Add a remote interpreter over SSH: enter the IP and username, then the password

(4) Select the interpreter on the remote server and map the remote code directory to the local code

(5) Configure the debug entry command

(6) Configure the script, parameters, Python interpreter, and code paths

Entry script: D:\code\LLaMA-Factory\deepspeed\launcher\runner.py
Parameters: these are the arguments from the pretrain.sh launch script above, but with the deepspeed command removed, the line-continuation backslashes removed, and the implicit default value True written out for the boolean flags.


--master_port=9901  src/train_bash.py      --deepspeed ./ds_config.json      --stage pt     --do_train True      --model_name_or_path ../Yi-34B-Chat      --dataset input_test      --finetuning_type full      --lora_target q_proj,v_proj      --output_dir path_to_pt_checkp1      --overwrite_cache True      --per_device_train_batch_size 1      --gradient_accumulation_steps 1      --lr_scheduler_type cosine      --logging_steps 1     --save_steps 100      --learning_rate 5e-5      --num_train_epochs 1.0      --plot_loss True      --fp16

Once this is done you can start debugging. You will then find that it still errors out when actually run; a few places in the code need to be modified.

(7) Modify the corresponding local code

Problem 1:

ssh://centos@18:22/home/centos/anaconda3/envs/factory/bin/python -u /home/centos/.pycharm_helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 0.0.0.0 --port 34567 --file /data/liulei/cpt/LLaMA-Factory/deepspeed/launcher/runner.py --master_port=9901 src/train_bash.py --deepspeed ./ds_config.json --stage pt --do_train True --model_name_or_path ../Yi-34B-Chat --dataset input_test --finetuning_type full --lora_target q_proj,v_proj --output_dir path_to_pt_checkp1 --overwrite_cache True --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --logging_steps 1 --save_steps 100 --learning_rate 5e-5 --num_train_epochs 1.0 --plot_loss True --fp16
/home/centos/.pycharm_helpers/pydev/pydevd.py:1806: DeprecationWarning: currentThread() is deprecated, use current_thread() instead
  dummy_thread = threading.currentThread()
pydev debugger: process 232478 is connecting
Connected to pydev debugger (build 201.6668.115)
Traceback (most recent call last):
  File "/home/centos/.pycharm_helpers/pydev/pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/centos/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/data/liulei/cpt/LLaMA-Factory/deepspeed/launcher/runner.py", line 24, in <module>
    from .multinode_runner import PDSHRunner, OpenMPIRunner, MVAPICHRunner, SlurmRunner, MPICHRunner, IMPIRunner
ImportError: attempted relative import with no known parent package

Solution:

Modify the local file D:\code\LLaMA-Factory\deepspeed\launcher\runner.py, which maps to the remote file /data/liulei/cpt/LLaMA-Factory/deepspeed/launcher/runner.py.

Before the change:

from .multinode_runner import PDSHRunner, OpenMPIRunner, MVAPICHRunner, SlurmRunner, MPICHRunner, IMPIRunner
from .constants import PDSH_LAUNCHER, OPENMPI_LAUNCHER, MVAPICH_LAUNCHER, SLURM_LAUNCHER, MPICH_LAUNCHER, IMPI_LAUNCHER
from ..constants import TORCH_DISTRIBUTED_DEFAULT_PORT
from ..nebula.constants import NEBULA_EXPORT_ENVS
from ..utils import logger

from ..autotuning import Autotuner
from deepspeed.accelerator import get_accelerator


After the change:
from deepspeed.launcher.multinode_runner import PDSHRunner, OpenMPIRunner, MVAPICHRunner, SlurmRunner, MPICHRunner, IMPIRunner
from deepspeed.launcher.constants import PDSH_LAUNCHER, OPENMPI_LAUNCHER, MVAPICH_LAUNCHER, SLURM_LAUNCHER, MPICH_LAUNCHER, IMPI_LAUNCHER
from deepspeed.constants import TORCH_DISTRIBUTED_DEFAULT_PORT
from deepspeed.nebula.constants import NEBULA_EXPORT_ENVS
from deepspeed.utils import logger

from deepspeed.autotuning import Autotuner
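
If you prefer not to edit the imports by hand, and you have sed available locally (for example via Git Bash or WSL), a rough one-liner sketch that performs the same rewrite on the local copy is:

# turn the relative imports in runner.py into absolute deepspeed imports
sed -i 's/^from \.\./from deepspeed./; s/^from \./from deepspeed.launcher./' deepspeed/launcher/runner.py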

Then re-run the debugger, and you will find that it now runs happily from your local machine.

Installing nginx on CentOS

yum -y install nginx 

On CentOS, Nginx's default installation (configuration) directory is /etc/nginx.

To modify the Nginx configuration, open the nginx.conf file in that directory with an editor such as vi or nano.

Example (enter on the command line):

vim /etc/nginx/nginx.conf

Starting, stopping, and restarting the Nginx service


systemctl start nginx   # start Nginx
systemctl stop nginx    # stop Nginx
systemctl restart nginx # restart Nginx

Nginx logs

/var/log/nginx/error.log 
/var/log/nginx/access.log

Configuring WebSocket

vim /etc/nginx/nginx.conf
The configuration file is as follows:

# For more information on configuration, see:
#   * Official English Documentation: http://nginx.org/en/docs/
#   * Official Russian Documentation: http://nginx.org/ru/docs/

user root;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

# Load dynamic modules. See /usr/share/doc/nginx/README.dynamic.
include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    # Load modular configuration files from the /etc/nginx/conf.d directory.
    # See http://nginx.org/en/docs/ngx_core_module.html#include
    # for more information.
    include /etc/nginx/conf.d/*.conf;


map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

upstream wsbackend{
    server 192.168.17.188:9005;
    server 192.168.17.188:9006;
    keepalive 1000;
}

    server {
        listen       8009 default_server;
        #listen       [::]:80 default_server;
        server_name  localhost;
        root         /usr/share/nginx/html;

        # Load configuration files for the default server block.
        include /etc/nginx/default.d/*.conf;

        location / {
            proxy_pass http://wsbackend;
            proxy_http_version 1.1;
            proxy_read_timeout 3600s;   # timeout waiting for the upstream response
            # enable WebSocket support on this connection
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }

        error_page 404 /404.html;
            location = /40x.html {
        }

        error_page 500 502 503 504 /50x.html;
            location = /50x.html {
        }
    }

# Settings for a TLS enabled server.
#
#    server {
#        listen       443 ssl http2 default_server;
#        listen       [::]:443 ssl http2 default_server;
#        server_name  _;
#        root         /usr/share/nginx/html;
#
#        ssl_certificate "/etc/pki/nginx/server.crt";
#        ssl_certificate_key "/etc/pki/nginx/private/server.key";
#        ssl_session_cache shared:SSL:1m;
#        ssl_session_timeout  10m;
#        ssl_ciphers PROFILE=SYSTEM;
#        ssl_prefer_server_ciphers on;
#
#        # Load configuration files for the default server block.
#        include /etc/nginx/default.d/*.conf;
#
#        location / {
#        }
#
#        error_page 404 /404.html;
#            location = /40x.html {
#        }
#
#        error_page 500 502 503 504 /50x.html;
#            location = /50x.html {
#        }
#    }

}

The important part is these two lines: when a WebSocket connection comes in, they perform a connection upgrade that turns the HTTP connection into a WebSocket connection.

proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";

proxy_read_timeout sets how long to wait for the upstream server's response once the connection is established; if not configured, it defaults to 60s.
proxy_http_version 1.1 makes the proxy speak HTTP/1.1 to the upstream.
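
After editing the configuration, you can validate and reload nginx, then check the upgrade handshake with curl (a sketch; port 8009 comes from the server block above, and a real WebSocket client would send its own Sec-WebSocket-Key):

nginx -t                     # validate the configuration
systemctl reload nginx       # reload without dropping connections
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
  -H "Sec-WebSocket-Version: 13" \
  http://127.0.0.1:8009/
# a working upstream should answer with HTTP/1.1 101 Switching Protocols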

Problems encountered:

2023/12/18 10:59:30 [crit] 626773#0: *1 connect() to :9006 failed (13: Permission denied) while connecting to upstream, client: , server: localhost, request: "GET / HTTP/1.1", upstream: "http://192:9006/", host: ":8009

Solution:
1. Change the user directive at the top of nginx.conf to: user root;
2. Disable SELinux
Temporarily (no reboot required):

setenforce 0 
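
If you want the change to survive a reboot, you can also disable SELinux permanently in /etc/selinux/config (takes effect after the next reboot); a minimal sketch:

sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
grep '^SELINUX=' /etc/selinux/config   # confirm the setting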

References:
https://www.jianshu.com/p/6205c8769e3c
https://blog.csdn.net/lazycheerup/article/details/117323466

I. Download the installation package

The installation environment here is offline, so the CUDA installer has to be downloaded in advance; download the version matching your system from the official archive: https://developer.nvidia.com/cuda-toolkit-archive


Driver and CUDA version compatibility:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
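
Before choosing a CUDA version, check the installed driver version (and the highest CUDA version it supports) with nvidia-smi; the values in the comment are only an example:

nvidia-smi
# the header line shows something like: Driver Version: 515.65.01   CUDA Version: 11.7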

Problem (CentOS 7)

Using built-in stream user interface
-> Detected 32 CPUs online; setting concurrency level to 32.
-> The file '/tmp/.X0-lock' exists and appears to contain the process ID '2647' of a running X server.
ERROR: You appear to be running an X server; please exit X before installing.  For further details, please see the section INSTALLING THE NVIDIA DRIVER in the README available on the Linux driver download page at www.nvidia.com.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Solution:
systemctl stop gdm.service

Problem (CentOS 8)

-> Detected 128 CPUs online; setting concurrency level to 32.
-> Tagging shared libraries with chcon -t textrel_shlib_t.
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Solution:
The GPU is in use; stop the programs that are using it.

Problem (CentOS 8)

Using built-in stream user interface
-> Detected 128 CPUs online; setting concurrency level to 32.
-> Tagging shared libraries with chcon -t textrel_shlib_t.
ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Solution:
The GPU is in use; find and stop the processes using it with:
sudo lsof /dev/nvidia*
kill -9 pid
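
A compact way to do the same thing in one line (use with care: it kills every process holding a GPU device):

sudo lsof -t /dev/nvidia* | sort -u | xargs -r sudo kill -9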

Problem:

sh ./cuda_11.6.0_510.39.01_linux.run
Extraction failed.
Ensure there is enough space in /tmp and that the installation package is not corrupt
Signal caught, cleaning up

The extraction tool (tar) is not installed; install it:
yum install tar

Problem (CentOS)

nvidia-smi shows the GPUs, but torch.cuda.is_available() fails with the following error:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling

Error 802: system not yet initialized 

Solution on CentOS 8

Note: the fabric manager version must match the driver version.

wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel8/x86_64/nvidia-fabric-manager-515.65.01-1.x86_64.rpm

sudo yum install nvidia-fabric-manager-515.65.01-1.x86_64.rpm

systemctl enable nvidia-fabricmanager

systemctl restart nvidia-fabricmanager

systemctl status nvidia-fabricmanager
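
Once nvidia-fabricmanager is running, you can confirm that PyTorch sees the GPUs again:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"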

II. Switching CUDA versions

Normally, download from the official archive: https://developer.nvidia.com/cuda-toolkit-archive

Note: during installation, do not install the CUDA driver.

After installation, switch the symlink:

rm -rf /usr/local/cuda                        # remove the previously created symlink
sudo ln -s /usr/local/cuda-11.3/ /usr/local/cuda/
nvcc --version                                # check the current CUDA version

If that still does not work, change the environment variables directly:

vim ~/.bashrc

# then add

export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.3/lib64
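
Reload the shell configuration and confirm that nvcc now resolves through the /usr/local/cuda symlink:

source ~/.bashrc
which nvcc       # should point to /usr/local/cuda/bin/nvcc
nvcc --version   # should report the version that /usr/local/cuda links to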

I have run into one case where even that did not help:

which nvcc showed that it still pointed to the old location, which means the PATH was never actually updated.
Check the environment variable:

echo $PATH
## prints: /home/centos/anaconda3/bin:/home/centos/anaconda3/condabin:/home/centos/.local/bin:/home/centos/bin:/usr/local/cuda-12.2/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/cuda/bin

## the hard-coded /usr/local/cuda-12.2/bin entry means the PATH has not changed; rewrite it so only the /usr/local/cuda symlink is used:

export PATH=/home/centos/anaconda3/bin:/home/centos/anaconda3/condabin:/home/centos/.local/bin:/home/centos/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/cuda/bin

I. Create the virtual environment

1.1 Create the virtual environment exp_detect_1:

conda create -n exp_detect_1 python=3.7

Or directly activate the existing virtual environment exp_detect_1:
conda activate exp_detect_1
(Note: the command to exit the environment is conda deactivate.)

II. Package the virtual environment

1.3 Pack and compress the virtual environment exp_detect_1
Go to the envs directory:
cd /home/liu/miniconda3/envs/
Pack and compress:
tar -czvf exp_detect_1.tar.gz ./exp_detect_1

III. Upload and extract

Extract exp_detect_1.tar.gz into the ~/miniconda3/envs environment directory:
cd ~/miniconda3/envs
tar -xzvf exp_detect_1.tar.gz
Activate the conda virtual environment exp_detect_1:
conda activate exp_detect_1
python xxx

OK, now run your source code with the migrated environment.

IV. Notes

Patch 1 for "How to quickly migrate a training environment"

I. Known issues
Two problems come up when following the document "怎么快速迁移训练环境.pdf" (How to quickly migrate a training environment):
1. When the virtual environment is copied directly, some file paths are not updated.
2. The cudatoolkit package is not copied.

II. The corresponding solutions are as follows:
1. Copied virtual environment: some file paths are not updated
The files that need to be modified live in /home/anaconda3/envs/pretrain/bin (note that the conda install location may differ between servers).


Files to modify: the scripts in this directory that you actually use, for example pip and deepspeed (depending on your needs).

How to modify: change the first line (the shebang) of each file so that it points to the environment's actual interpreter path on the current machine.
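
A sketch of that fix for the pip script (the old path in the comment is illustrative; substitute the env's real location on your machine):

# inspect the interpreter path recorded on the script's first line
head -1 /home/anaconda3/envs/pretrain/bin/pip
# e.g. #!/old/server/anaconda3/envs/pretrain/bin/python
# point the shebang at this machine's env python instead
sed -i '1s|.*|#!/home/anaconda3/envs/pretrain/bin/python|' /home/anaconda3/envs/pretrain/bin/pip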

2. The cudatoolkit package was not copied
If the environment's cudatoolkit was installed via conda install, for example:
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
then cudatoolkit is actually not installed inside the copied virtual environment, but in the shared conda pkgs directory (/home/anaconda3/pkgs/). Note: several cudatoolkit packages may coexist in this directory; choose which one to copy as follows.

On the source machine, activate the virtual environment to be copied:
conda activate pretrain6
Check the build id of cudatoolkit:
conda list cudatoolkit


Go to the pkgs directory and find the matching cudatoolkit package:
cd /home/anaconda3/pkgs/
ls -hl | grep "cudatoolkit"

Pack and compress it:
zip -r cudatoolkit-11.3.1-h9edb442_10.zip cudatoolkit-11.3.1-h9edb442_10
Copy it to the corresponding location on the target machine (/home/anaconda3/pkgs/) and extract it:
unzip cudatoolkit-11.3.1-h9edb442_10.zip

III. The two ways of installing torch, and the recommended one
Official installation documentation:
https://pytorch.org/get-started/previous-versions/#installing-previous-versions-of-pytorch
1. The two installation methods on Linux
The official documentation offers two installation methods:
The first is via conda, for example:
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
The second is via pip, for example:
pip install torch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1

The difference between the two:
installing via conda also installs cudatoolkit, while installing via pip does not.

2. Recommended method: conda install
Installing with conda install matches the versions of pytorch, torchvision, torchaudio, and cudatoolkit against each other; if the versions do not correspond, the installation fails.

pip install cannot install cudatoolkit; it only installs pytorch, torchvision, and torchaudio and relies on the host machine's CUDA, which can lead to a mismatch between CUDA and torch.
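
A quick way to see which CUDA your torch build expects and whether it can reach the GPU (regardless of how it was installed):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"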

Possible problems

1. OpenSSL

OpenSSL 3.0's legacy provider failed to load. This is a fatal error by default, but cryptography supports running without legacy algorithms by setting the environment variable CRYPTOGRAPHY_OPENSSL_NO_LEGACY. If you did not expect this error, you have likely made a mistake with your OpenSSL configuration. 

Fix:
In the virtual environment, simply run: export CRYPTOGRAPHY_OPENSSL_NO_LEGACY=1
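
To make the variable persistent for this conda environment instead of exporting it in every shell, you can use a conda activation hook (a sketch; run it with the environment activated):

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export CRYPTOGRAPHY_OPENSSL_NO_LEGACY=1' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh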

I. Installing Docker

Our servers are basically all running CentOS, so the following is the installation procedure for CentOS.
There are three installation methods in total.
Official documentation:
https://docs.docker.com/engine/install/centos/
For an online installation you can use a mirror inside China; see this tutorial:
https://www.runoob.com/docker/centos-docker-install.html
The installation command is:

curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun

Verify that the installation succeeded:

docker --version

II. Dependencies for calling the GPU from Docker

Note: to use the GPU, the server you deploy to must also have the following four dependencies installed; without them containers cannot use the GPU.
Four dependency packages need to be installed:

yum install -y nvidia-container-toolkit
yum -y install nvidia-container-runtime
yum -y install libnvidia-container-tools 
yum -y install libnvidia-container1

They can also be downloaded for offline installation:

Download with the following commands (on a machine with Internet access):
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum -y install nvidia-container-toolkit --downloadonly --downloaddir ./
yum -y install nvidia-container-runtime --downloadonly --downloaddir ./
yum -y install libnvidia-container-tools --downloadonly --downloaddir ./
yum -y install libnvidia-container1 --downloadonly --downloaddir ./
(Reference: https://juejin.cn/post/7066566268379201544)
Install:
sudo rpm -ivh *.rpm

III. Starting Docker

Restart:

systemctl daemon-reload
systemctl restart docker

IV. Docker image export and import

Export

docker save [OPTIONS] IMAGE [IMAGE...]
-o: write the output to a file.

# save the image runoob/ubuntu:v3 as the file my_ubuntu_v3.tar
docker save -o my_ubuntu_v3.tar runoob/ubuntu:v3

Import

docker load --input ./cuda102_runtime.tar

V. Fix for Docker filling up the root partition

By default Docker stores its data in /var/lib/docker on the root partition.
Solution: stop Docker, move /var/lib/docker, create a symlink in its place, and finally restart Docker.

mv /var/lib/docker /data/docker
ln -s /data/docker /var/lib/docker

systemctl daemon-reload
systemctl restart docker
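
After the restart you can confirm that the data really lives on the new partition:

ls -ld /var/lib/docker   # should show the symlink to /data/docker
df -h /data/docker       # should show the data partition, not the root filesystem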

VI. Downloading Docker images

Pick the Docker image you need.
Docker Hub:
https://hub.docker.com
Ubuntu images with CUDA:
https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=16.04

Key points to note:

1. Programs that run on CentOS can generally be executed directly inside an Ubuntu image.
2. If you want an image that can use the GPU directly, download one that includes CUDA and cuDNN.
3. For serving you can use either the runtime image 11.2.2-runtime-ubuntu20.04 or the development image 11.2.2-devel-ubuntu20.04; the only difference is size.
4. Which image to download depends on the driver version of the server you deploy to. For example, if the target machine's driver corresponds to CUDA 11.2, the downloaded image should ideally match; if the image's CUDA version is higher than 11.2, the container may fail to start or may be unable to use the GPU. Images with a CUDA version lower than 11.2 should work, but I have not tried them.

5. After downloading the image, start a container with a shared directory, for example:
docker run -it --shm-size="80g" -p 5012:5012 -v /data/liulei/:/data/liulei --gpus all nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 /bin/bash
You can copy the conda virtual environment directory into the shared directory and then test it inside the container by calling the environment's Python with its absolute path; you can also check torch.cuda.is_available() first, whose True/False result tells you whether the GPU is usable, as shown in the sketch below.
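
A minimal check inside the container (the env path below is illustrative; use the absolute path of the conda environment you copied into the shared directory):

# run the copied environment's python directly and ask whether the GPU is visible
/data/liulei/envs/factory/bin/python -c "import torch; print(torch.cuda.is_available())"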


VII. Images

List images:

docker images

Rename (tag) an image:

docker tag  imageid   name:version

VIII. Containers

Create a container

docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

Common OPTIONS (some take one dash, some take two):
--name="new-name": assign a name to the container
-d: run the container in the background and print the container ID (detached/daemon mode)
-i: run the container in interactive mode, usually used together with -t
-t: allocate a pseudo-TTY for the container, usually used together with -i, i.e. start an interactive container with a terminal waiting in the foreground
-P: map ports randomly (uppercase P)
-p: map a specific port (lowercase p)
For example, -p 8080:80 maps the container's port 80 to port 8080 on the host.


1. Create a container without using the GPU
docker run -it nvidia/cuda:10.2-cudnn8-runtime-ubuntu16.04 /bin/bash
2. With the GPU
docker run -it -d --shm-size="80g" -p 5000:5000 -v /home/centos/:/home/centos --gpus all belleagi/belle:v1.0 /bin/bash 


For example:
docker run -it --shm-size="80g" -p 5012:5012 -v /data/liulei/:/data/liulei --gpus all nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 /bin/bash 

With host networking:
docker run --network=host -it --shm-size="80g" -p 5012:5012 -v /data/liulei/:/data/liulei --gpus all nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 /bin/bash

2. List containers

Command: docker ps [OPTIONS]
OPTIONS:
-a: list all currently running containers plus those that have exited
-l: show the most recently created container
-n: show the n most recently created containers
-q: quiet mode, show only container IDs

docker ps                 # currently running containers
docker ps -a              # all containers, including those not running
docker ps -a | grep ""    # filter by keyword

3. Enter a container

There are two ways: docker attach and docker exec.
docker attach container_id   (Note: exiting the container after attaching with this command stops the container.)
docker exec -it container_id /bin/bash

container_id: see the output of the docker ps -a command.
docker attach 44fc0f0582d9  

4. Exit a container

exit       # if you entered via docker run, exit stops the container
ctrl+p+q   # detach with ctrl+p+q and the container keeps running

5. Start and stop containers

1. Start a stopped container
docker start <container ID or name>

2. Restart a container
docker restart <container ID or name>

3. Stop a container
docker stop <container ID or name>

4. Force-stop a container
docker kill <container ID or name>

5. Remove a stopped container
docker rm <container ID or name>
Note: rmi removes images; rm removes containers.

6. Force-remove a container that is still running
docker rm -f <container ID or name>



IX. Copying files between the host and a container

Host -> container:
docker cp /opt/software/temp/test/test.txt f7b37b56fb98:/home
Container -> host:
docker cp f7b37b56fb98:/home  /opt/software/temp/test/test.txt
Here f7b37b56fb98 is the container_id.