A case of the ansible command hanging inside a container
1. Background
I have recently been working on Kylin V10 and found that running "ansible localhost -m setup" inside a container on this system hangs indefinitely. A colleague and I tracked the problem down as described below and found a fix.

2. Environment

2.1. System information

    # cat /etc/*-release
    Kylin Linux Advanced Server release V10 (Tercel)
    NAME="Kylin Linux Advanced Server"
    VERSION="V10 (Tercel)"
    ID="kylin"
    VERSION_ID="V10"
    PRETTY_NAME="Kylin Linux Advanced Server V10 (Tercel)"
    ANSI_COLOR="0;31"
    Kylin Linux Advanced Server release V10 (Tercel)

2.2. Kernel information

    # uname -a
    Linux 4.19.90-17.ky10.aarch64 #1 SMP Sun Jun 28 14:27:40 CST 2020 aarch64 aarch64 aarch64 GNU/Linux

2.3. Docker information

    # docker info
    Containers: 1
     Running: 1
     Paused: 0
     Stopped: 0
    Images: 1
    Server Version: 18.09.9
    Storage Driver: overlay2
     Backing Filesystem: xfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs

2.4. Ansible information

    # ansible --version
    ansible 2.6.2
      config file = None
      configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python2.7/site-packages/ansible
      executable location = /usr/bin/ansible
      python version = 2.7.16 (default, Jul 9 2020, 06:35:45) [GCC 7.3.0]

3. Analysis and troubleshooting

While troubleshooting we found that "ansible localhost -m setup" hangs, whereas replacing localhost with an inventory file that specifies a custom IP plus account and password runs normally.

We then set export ANSIBLE_DEBUG=True to get debug output and saw that the run stalls at the following point:

    82 1606185861.10586: transferring module to remote /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py
    82 1606185861.10840: done transferring module to remote
    82 1606185861.10894: _low_level_execute_command(): starting
    82 1606185861.10924: _low_level_execute_command(): executing: /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/ /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py && sleep 0'
    82 1606185861.10940: in local.exec_command()
    82 1606185861.10957: opening command with Popen()
    82 1606185861.11488: done running command with Popen()
    82 1606185861.11523: getting output with communicate()
    82 1606185861.11918: done communicating
    82 1606185861.11936: done with local.exec_command()
    82 1606185861.11961: _low_level_execute_command() done: rc=0, stdout=, stderr=
    82 1606185861.11977: _low_level_execute_command(): starting
    82 1606185861.12019: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py && sleep 0'
    82 1606185861.12038: in local.exec_command()
    82 1606185861.12055: opening command with Popen()
    82 1606185861.12599: done running command with Popen()
    82 1606185861.12631: getting output with communicate()

The last line shows ansible waiting for the output of the setup module, so we went to the physical host to look at the ansible processes:

    # ps -ef |grep ansible
    root 672540 672016 99 10:44 pts/0 00:03:06 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py
    root 673881 672428 51 10:47 pts/0 00:00:02 /usr/bin/python /usr/local/bin/ansible localhost -m setup
    root 673893 673881 33 10:47 pts/0 00:00:00 /usr/bin/python /usr/local/bin/ansible localhost -m setup
    root 673908 673893 0 10:47 pts/0 00:00:00 /bin/sh -c /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py && sleep 0'
    root 673909 673908 0 10:47 pts/0 00:00:00 /bin/sh -c /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py && sleep 0
    root 673910 673909 23 10:47 pts/0 00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py
    root 673914 673910 99 10:47 pts/0 00:00:01 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py
    root 673971 443741 0 10:47 pts/1 00:00:00 grep ansible

Process 673914 is using 99% CPU, so we traced it with strace:

    # strace -p 673914
    close(216995106) = -1 EBADF (Bad file descriptor)
    close(216995107) = -1 EBADF (Bad file descriptor)
    close(216995108) = -1 EBADF (Bad file descriptor)
    close(216995109) = -1 EBADF (Bad file descriptor)
    close(216995110) = -1 EBADF (Bad file descriptor)
    close(216995111) = -1 EBADF (Bad file descriptor)
    close(216995112) = -1 EBADF (Bad file descriptor)
    close(216995113) = -1 EBADF (Bad file descriptor)
    close(216995114) = -1 EBADF (Bad file descriptor)
    close(216995115) = -1 EBADF (Bad file descriptor)
    close(216995116) = -1 EBADF (Bad file descriptor)
    close(216995117) = -1 EBADF (Bad file descriptor)
    close(216995118) = -1 EBADF (Bad file descriptor)
    close(216995119) = -1 EBADF (Bad file descriptor)
    close(216995120) = -1 EBADF (Bad file descriptor)
    close(216995121) = -1 EBADF (Bad file descriptor)
    close(216995122) = -1 EBADF (Bad file descriptor)
    close(216995123) = -1 EBADF (Bad file descriptor)
    close(216995124) = -1 EBADF (Bad file descriptor)
    close(216995125) = -1 EBADF (Bad file descriptor)
    close(216995126) = -1 EBADF (Bad file descriptor)
    close(216995127) = -1 EBADF (Bad file descriptor)
    close(216995128) = -1 EBADF (Bad file descriptor)
    close(216995129) = -1 EBADF (Bad file descriptor)
    close(216995130) = -1 EBADF (Bad file descriptor)
    close(216995131) = -1 EBADF (Bad file descriptor)
    close(216995132) = -1 EBADF (Bad file descriptor)

The terminal keeps printing lines like these. At first glance it looked like a file descriptor leak. Note, though, that every call targets a consecutive descriptor number in the hundreds of millions and fails with EBADF, which suggests the process is walking through descriptor numbers that were never open.

Searching for "docker Bad file descriptor" led to the issue "Spawning PTY processes is many times slower on Docker 18.09", where several contributors had worked out that the process stalls when the container's nofile limit is too high, and that there is no problem when the container is started with a lower nofile. Running ulimit -n inside the container confirmed that the default value is very high:

    > ulimit -n
    1073741816

Searching further for "docker nofile limit" turned up "Docker: How to increase number of open files limit", which explains that the nofile value inside the container can be set when starting it with docker run.

So we added --ulimit nofile=65535 and started the container again. ulimit -n inside the container now reported the smaller value, and "ansible localhost -m setup" no longer hung.

4. References

https://github.com/pexpect/ptyprocess/issues/50
https://github.com/docker/for-linux/issues/502
https://github.com/moby/moby/issues/38814
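
To put a number on the diagnosis in section 3, the following Python 3 sketch reads the soft nofile limit of the current process and times a small batch of close() calls that fail with EBADF, then extrapolates to a sweep over the whole limit. It is only an illustration under assumptions: the function names and the starting descriptor 1_000_000 are made up for this example, and the post does not identify which code performs the sweep (one common source is code that closes all inherited descriptors by looping up to the limit, as Python 2.7's subprocess does when close_fds=True).

```python
import errno
import os
import resource
import time


def soft_nofile_limit():
    """Return the soft RLIMIT_NOFILE of the current process (what ulimit -n shows)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def time_failed_closes(first_fd, count):
    """Time `count` close() calls starting at descriptor `first_fd`.

    The descriptors are assumed to be unused, so each call is expected to
    fail with EBADF, like the calls seen in the strace output.
    """
    started = time.perf_counter()
    for fd in range(first_fd, first_fd + count):
        try:
            os.close(fd)
        except OSError as exc:
            if exc.errno != errno.EBADF:
                raise
    return time.perf_counter() - started


def estimate_sweep_seconds(limit, sample_seconds, sample_count):
    """Extrapolate a sample timing to a sweep over `limit` descriptors."""
    return limit * sample_seconds / sample_count


if __name__ == "__main__":
    limit = soft_nofile_limit()
    sample_count = 100_000
    # 1_000_000 is an arbitrary, hypothetical starting point chosen to be
    # far above any descriptor this script is likely to have open.
    elapsed = time_failed_closes(1_000_000, sample_count)
    print("soft nofile limit:", limit)
    if limit == resource.RLIM_INFINITY:
        print("limit is unlimited; a full sweep cannot be estimated")
    else:
        estimate = estimate_sweep_seconds(limit, elapsed, sample_count)
        print("estimated full sweep: %.1f s" % estimate)
```

Running the script once in a container with the default limit (1073741816 in this post) and once in a container started with --ulimit nofile=65535 should show the estimate dropping by several orders of magnitude. The absolute figures depend on the machine and are much larger while strace is attached.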