Installing the NVIDIA graphics driver on CentOS 8 (pitfalls and fixes along the way)


2024-07-14 07:52 | Source: compiled from the web

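I recently installed the NVIDIA graphics driver on CentOS 8 and ran into a series of problems. I am writing them down in the hope that we can solve them and learn together. Without further ado, here are the details.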

1 First confirm the kernel version and distribution release, then the graphics card model

(1) uname -a

Linux localhost.localdomain 4.18.0-147.el8.x86_64 #1 SMP Wed Dec 4 21:51:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

(2) cat /etc/redhat-release

CentOS Linux release 8.1.1911 (Core)

(3) lspci | grep -i nvidia

04:00.0 3D controller: NVIDIA Corporation GK208M [GeForce GT 730M] (rev a1)

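2 Download the matching driver from the official site

Official Advanced Driver Search | NVIDIA

https://www.nvidia.cn/Download/index.aspx?lang=cn

After clicking Search, the result shows version 418.113, released 2019.11.5. The page also lists the release highlights, the supported products, and additional information.

Linux x64 (AMD64/EM64T) Display Driver
Version: 418.113
Release date: 2019.11.5
Operating system: Linux 64-bit
Language: Chinese (Simplified)
File size: 104.78 MB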

3 Start the installation

chmod +x NVIDIA-Linux-x86_64-418.113.run # add execute permission

As root, switch to text mode with init 3, then run the installer:

./NVIDIA-Linux-x86_64-418.113.run

3.1

This is where the first error appeared:

ERROR: The Nouveau kernel driver is currently in use by your system.  This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.

nouveau is the open-source driver for NVIDIA cards that many Linux distributions ship by default. It conflicts with the proprietary NVIDIA driver, so it must be disabled before installing. After selecting 'OK', the installer offers to create a modprobe configuration file that disables nouveau. Select 'Yes', and it generates the following files:

/usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf

/etc/modprobe.d/nvidia-installer-disable-nouveau.conf

Both files have the same content:

# generated by nvidia-installer
blacklist nouveau
options nouveau modeset=0

Running the installer again then produces this warning:

WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.  Please be sure you have rebooted your system since these files were written. If you have rebooted, then Nouveau may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration file.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.

The key point is that this did not actually disable nouveau: the installer still reported the first error shown above. Check with the following command (all commands here are run as root):

[root@localhost ***]# lsmod | grep nouveau
nouveau              2215936  2
mxm_wmi                16384  1 nouveau
i2c_algo_bit           16384  2 i915,nouveau
drm_kms_helper        217088  2 i915,nouveau
ttm                   110592  1 nouveau
drm                   524288  15 drm_kms_helper,i915,ttm,nouveau
wmi                    32768  3 wmi_bmof,mxm_wmi,nouveau
video                  45056  3 thinkpad_acpi,i915,nouveau
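Note that lsmod | grep nouveau also matches modules that merely list nouveau as a user. A minimal sketch of a stricter check is shown here; it assumes a module listing in the /proc/modules layout (module name in the first column), and nouveau_loaded is a hypothetical helper name, not part of any tool.

```shell
# Succeed (exit 0) only if a module named exactly "nouveau" appears in the
# first column of a module listing. Defaults to /proc/modules.
# nouveau_loaded is a hypothetical helper name used for illustration.
nouveau_loaded() {
    local listing="${1:-/proc/modules}"
    awk '$1 == "nouveau" { found = 1 } END { exit !found }' "$listing"
}

# Example use on a live system:
#   if nouveau_loaded; then echo "nouveau is still loaded"; fi
```

Matching on the first column keeps dependent modules such as mxm_wmi from producing a false positive.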

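The correct procedure is as follows.

(1) Disable nouveau on the kernel command line in GRUB.

vim /etc/default/grub

Append rd.driver.blacklist=nouveau nouveau.modeset=0 to the value of "GRUB_CMDLINE_LINUX".

Then regenerate the GRUB configuration (on UEFI systems the output file is typically /boot/efi/EFI/centos/grub.cfg instead):

grub2-mkconfig -o /boot/grub2/grub.cfg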
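The manual edit of GRUB_CMDLINE_LINUX can also be scripted. The sketch below is one possible way to do it, assuming GNU sed, a single-line GRUB_CMDLINE_LINUX="..." entry in double quotes, and arguments that contain no |, & or backslash characters. add_grub_args is a hypothetical helper name.

```shell
# Append each argument to GRUB_CMDLINE_LINUX="..." in the given file,
# skipping arguments that are already present (safe to run repeatedly).
# add_grub_args is a hypothetical helper; it assumes GNU sed (-i).
add_grub_args() {
    local file="$1" arg current
    shift
    for arg in "$@"; do
        # Current value between the double quotes (empty if the line is absent).
        current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")
        case " $current " in
            *" $arg "*) ;;  # already present, nothing to do
            *) sed -i "s|^\(GRUB_CMDLINE_LINUX=\".*\)\"\$|\1 $arg\"|" "$file" ;;
        esac
    done
}

# Example use (as root), followed by the grub2-mkconfig step:
#   add_grub_args /etc/default/grub rd.driver.blacklist=nouveau nouveau.modeset=0
```

Because the function checks for each argument first, running it twice leaves the file unchanged the second time.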

(2) Append a blacklist entry to the end of /usr/lib/modprobe.d/dist-blacklist.conf or /etc/modprobe.d/blacklist.conf. The original content of /usr/lib/modprobe.d/dist-blacklist.conf is shown below.

vim /usr/lib/modprobe.d/dist-blacklist.conf

#
# Listing a module here prevents the hotplug scripts from loading it.
# Usually that'd be so that some other driver will bind it instead,
# no matter which driver happens to get probed first.  Sometimes user
# mode tools can also control driver binding.
#
# Syntax: see modprobe.conf(5).
#

# watchdog drivers
blacklist i8xx_tco

# framebuffer drivers
blacklist aty128fb
blacklist atyfb
blacklist radeonfb
blacklist i810fb
blacklist cirrusfb
blacklist intelfb
blacklist kyrofb
blacklist i2c-matroxfb
blacklist hgafb
blacklist nvidiafb

Add this line at the end:

blacklist nouveau

Save the file.
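The append can be made repeatable so that re-running it does not add duplicate lines. This is a minimal sketch; ensure_blacklisted is a hypothetical helper name, and the target file is passed in as a parameter.

```shell
# Append "blacklist <module>" to a modprobe config file unless an identical
# line already exists. Creates the file if it is missing.
# ensure_blacklisted is a hypothetical helper name used for illustration.
ensure_blacklisted() {
    local file="$1" module="$2"
    grep -qxF "blacklist $module" "$file" 2>/dev/null ||
        printf 'blacklist %s\n' "$module" >> "$file"
}

# Example use (as root):
#   ensure_blacklisted /etc/modprobe.d/blacklist.conf nouveau
```

Of the two files named in step (2), /etc/modprobe.d/blacklist.conf is probably the safer target, since files under /usr/lib/modprobe.d/ belong to packages and may be overwritten by updates.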

(3) Back up the existing initramfs image, which still includes nouveau

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak

(4) Rebuild the initramfs with dracut

dracut -v /boot/initramfs-$(uname -r).img $(uname -r)

(5) Reboot, then run lsmod | grep nouveau to confirm that nouveau is no longer loaded (the command should print nothing).
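Before rebooting, it may be worth checking that the rebuilt initramfs from step (4) no longer contains the nouveau module. The sketch below filters a file listing read from standard input; it assumes the listing comes from lsinitrd (shipped with dracut) and that module files are named nouveau.ko with an optional compression suffix. listing_has_nouveau is a hypothetical helper name.

```shell
# Read an initramfs file listing on stdin and succeed if it mentions the
# nouveau kernel module (nouveau.ko, possibly compressed, e.g. nouveau.ko.xz).
# listing_has_nouveau is a hypothetical helper name used for illustration.
listing_has_nouveau() {
    grep -Eq '/nouveau\.ko(\.[a-z0-9]+)?$'
}

# Example use on a real system (assumes lsinitrd is installed):
#   if lsinitrd "/boot/initramfs-$(uname -r).img" | listing_has_nouveau; then
#       echo "nouveau is still inside the initramfs"
#   fi
```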

Run the installer again: ./NVIDIA-Linux-x86_64-418.113.run

3.2

The second error:

ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option. ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

The installer cannot find the kernel source tree.

Check whether kernel-headers and kernel-devel are installed, and at which versions:

[root@localhost ***]# dnf list kernel-headers
Installed Packages
kernel-headers.x86_64               4.18.0-305.19.1.el8_4                @baseos

[root@localhost ***]# dnf list kernel-headers
Installed Packages
kernel-headers.x86_64               4.18.0-193.28.1.el8_2                @BaseOS
Available Packages
kernel-headers.x86_64               4.18.0-305.19.1.el8_4                BaseOS

[root@localhost ***]# dnf list kernel-devel
kernel-devel.x86_64                4.18.0-305.19.1.el8_4                 @baseos

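kernel-headers is installed at version 4.18.0-193.28.1.el8_2, and kernel-devel is not installed yet. Fixing this error requires two things: first, both kernel-headers and kernel-devel must be installed; second, both must match the kernel version reported by uname -a.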
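The version match can be checked mechanically rather than by eye. The following sketch compares the running kernel release against a list of installed kernel-devel builds read from standard input; devel_matches_kernel is a hypothetical helper name, and the rpm query format in the comment is an assumption about how the list would be produced.

```shell
# Succeed if the list of installed builds on stdin (one per line, in the
# form VERSION-RELEASE.ARCH) contains the kernel release given as $1.
# devel_matches_kernel is a hypothetical helper name used for illustration.
devel_matches_kernel() {
    local running="$1"
    grep -qxF "$running"
}

# Example use on a real system:
#   rpm -q kernel-devel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' |
#       devel_matches_kernel "$(uname -r)" || echo "kernel-devel does not match"
```

When the repository still carries the build for the running kernel, dnf install "kernel-devel-$(uname -r)" requests exactly that build instead of the newest one.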

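Running dnf install kernel-devel installed 4.18.0-305.19.1.el8_4, the newest build of the 4.18.0 series in the repository, rather than the build of the running kernel. Installing the packages for the running kernel would also work, but I could not find kernel-headers and kernel-devel for that version online. That left upgrading the kernel as the only option. I ran dnf distro-sync (similar in effect to yum update) to bring every package, the kernel included, up to the latest version, rebooted, and checked the versions again: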

[root@localhost ***]# uname -a
Linux localhost.localdomain 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Wed Sep 15 15:39:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ***]# dnf list kernel-headers
kernel-headers.x86_64               4.18.0-305.19.1.el8_4                @baseos
[root@localhost ***]# dnf list kernel-devel
Installed Packages
kernel-devel.x86_64                4.18.0-305.19.1.el8_4                 @baseos

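After the upgrade the kernel is 4.18.0-305.19.1.el8_4. It is the default boot kernel, the boot menu has a new entry for it, and the kernel-headers and kernel-devel versions now match what uname -a reports.

Note that in my case running dnf install kernel-4.18.0-305.19.1.el8_4 directly was not enough: the kernel package was installed, but no boot entry was generated, so the new kernel could not be booted. Use dnf distro-sync instead.

kernel-headers download page: https://pkgs.org/download/kernel-headers

kernel-devel download page: https://pkgs.org/download/kernel-devel

Continue the installation: ./NVIDIA-Linux-x86_64-418.113.run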

3.3 The third error

ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.

Checking to see whether the nvidia kernel module was successfully built

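Building the kernel modules failed. Two fixes are commonly suggested online: downgrading the kernel (reference 3) or downloading the latest driver (reference 4).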

3.3.1

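After much thought, I did not find a dnf command that rolls an upgraded system back. There is dnf downgrade, but it targets individual packages. In the end I tried it by hand.

The kernel packages after the upgrade: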

[root@localhost ***]# dnf --showduplicates list kernel | expand
Installed Packages
kernel.x86_64                  4.18.0-147.el8                          @anaconda
kernel.x86_64                  4.18.0-305.19.1.el8_4                   @baseos
Available Packages
kernel.x86_64                  4.18.0-305.3.1.el8                      baseos
kernel.x86_64                  4.18.0-305.7.1.el8_4                    baseos
kernel.x86_64                  4.18.0-305.10.2.el8_4                   baseos
kernel.x86_64                  4.18.0-305.12.1.el8_4                   baseos
kernel.x86_64                  4.18.0-305.17.1.el8_4                   baseos
kernel.x86_64                  4.18.0-305.19.1.el8_4                   baseos

4.18.0-147.el8 is the previous kernel version, and 4.18.0-305.19.1.el8_4 is the version after the upgrade.

[root@localhost ***]# dnf list kernel*
Installed Packages
kernel.x86_64                            4.18.0-147.el8                @anaconda
kernel.x86_64                            4.18.0-305.19.1.el8_4         @baseos
kernel-core.x86_64                       4.18.0-147.el8                @anaconda
kernel-core.x86_64                       4.18.0-305.19.1.el8_4         @baseos
kernel-devel.x86_64                      4.18.0-305.19.1.el8_4         @baseos
kernel-headers.x86_64                    4.18.0-305.19.1.el8_4         @baseos
kernel-modules.x86_64                    4.18.0-147.el8                @anaconda
kernel-modules.x86_64                    4.18.0-305.19.1.el8_4         @baseos
kernel-tools.x86_64                      4.18.0-305.19.1.el8_4         @baseos
kernel-tools-libs.x86_64                 4.18.0-305.19.1.el8_4         @baseos
Available Packages
kernel-abi-stablelists.noarch            4.18.0-305.19.1.el8_4         baseos
kernel-cross-headers.x86_64              4.18.0-305.19.1.el8_4         baseos
kernel-debug.x86_64                      4.18.0-305.19.1.el8_4         baseos
kernel-debug-core.x86_64                 4.18.0-305.19.1.el8_4         baseos
kernel-debug-devel.x86_64                4.18.0-305.19.1.el8_4         baseos
kernel-debug-modules.x86_64              4.18.0-305.19.1.el8_4         baseos
kernel-debug-modules-extra.x86_64        4.18.0-305.19.1.el8_4         baseos
kernel-doc.noarch                        4.18.0-305.19.1.el8_4         baseos
kernel-modules-extra.x86_64              4.18.0-305.19.1.el8_4         baseos
kernel-rpm-macros.noarch                 125-1.el8                     appstream
kernelshark.x86_64                       2.7-9.el8                     appstream

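Both the old and the new versions of kernel-core and kernel-modules are installed side by side. Time to experiment.

Step 1: dnf remove kernel-4.18.0-305.19.1.el8_4, then reboot. The boot entry for 4.18.0-305.19.1.el8_4 was still there and still bootable, and uname -a still reported 4.18.0-305.19.1.el8_4.

Step 2: delete this file under /boot/loader/entries/:

3dfed34393c14fd091784d3c4f08ca02-4.18.0-305.19.1.el8_4.x86_64.conf

and this file under /boot:

vmlinuz-4.18.0-305.19.1.el8_4.x86_64

Finally, run grubby --set-default /boot/vmlinuz-4.18.0-147.el8.x86_64 to make 4.18.0-147.el8.x86_64 the default boot kernel.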
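To confirm which kernel will boot by default, the image path that grubby reports can be reduced to a release string and compared with uname -r. This is a small sketch; release_from_vmlinuz is a hypothetical helper name, and it assumes the path ends in vmlinuz-<release> as in the command above.

```shell
# Strip the directory and the "vmlinuz-" prefix from a kernel image path,
# leaving the release string in the same form that uname -r prints.
# release_from_vmlinuz is a hypothetical helper name used for illustration.
release_from_vmlinuz() {
    printf '%s\n' "${1##*/vmlinuz-}"
}

# Example use on a real system:
#   default=$(release_from_vmlinuz "$(grubby --default-kernel)")
#   [ "$default" = "$(uname -r)" ] || echo "default kernel differs: $default"
```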

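After this "clever" maneuver the system was back to its original state, so I ran the driver installer again.

The problem was still there.

In fact there is no need to downgrade at all: the previous kernel can simply be selected from the boot menu, although the rest of the software stays at the newer, synced versions.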

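3.3.2 Download the latest driver from the official site

Version 418.113 was released on 2019.11.5, well after the 4.18.0 kernel base (upstream Linux 4.18 came out in August 2018), and its release highlights include the line "Fixed kernel module build problems with Linux kernel 5.4.0 release candidates". So this driver is clearly a fairly recent one. I also downloaded version 418.56 from another source and got the same error.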

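3.3.3 Upgrade again

With no other option, I upgraded again with dnf distro-sync, and a new problem appeared: only the package kernel-4.18.0-305.19.1.el8_4 was installed. There was no boot entry for it, no matching file under /boot/loader/entries, and no matching vmlinuz file under /boot.

The cause is that the downgrade in 3.3.1 deleted only the kernel files and left the related kernel packages installed. The correct procedure is:

Step 1: dnf remove kernel-4.18.0-305.19.1.el8_4

Step 2: dnf remove kernel-core-4.18.0-305.19.1.el8_4

dnf asks for confirmation before removing dependent packages such as kernel-modules. Confirm.

Step 3: remove the other 4.18.0-305.19.1.el8_4 packages, such as kernel-tools and kernel-tools-libs. Do not remove kernel-headers; it was only upgraded in place.

These three steps automatically remove the boot entry under /boot/loader/entries/ and the vmlinuz-4.18.0-305.19.1.el8_4.x86_64 binary under /boot. After a reboot, dnf distro-sync can bring the system back up to the latest version.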
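The three removal steps can be prepared by listing the packages that belong to one kernel version while leaving kernel-headers out. The sketch below only filters a list of package names read from standard input and removes nothing itself; kernel_pkgs_for_version is a hypothetical helper name.

```shell
# From a list of installed package names on stdin (one per line), print the
# ones that belong to kernel version $1, leaving kernel-headers out.
# kernel_pkgs_for_version is a hypothetical helper name used for illustration.
kernel_pkgs_for_version() {
    grep -F -- "-$1" | grep -v '^kernel-headers-'
}

# Example use on a real system, to review the list before removing anything:
#   rpm -qa 'kernel*' | kernel_pkgs_for_version 4.18.0-305.19.1.el8_4
```

Reviewing the printed list before passing it to dnf remove keeps the step deliberate; the running kernel should never be on it.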

3.3.4 Check the log

ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.

Checking to see whether the nvidia kernel module was successfully built

Using: nvidia-installer ncurses v6 user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Tagging shared libraries with chcon -t textrel_shlib_t.
-> Installing NVIDIA driver version 470.74.
-> Performing CC sanity check with CC="/usr/bin/cc".
-> Performing CC check.
-> Kernel source path: '/lib/modules/4.18.0-305.19.1.el8_4.x86_64/source'
-> Kernel output path: '/lib/modules/4.18.0-305.19.1.el8_4.x86_64/build'
-> Performing Compiler check.
-> Performing Dom0 check.
-> Performing Xen check.
-> Performing PREEMPT_RT check.
-> Performing vgpu_kvm check.
-> Cleaning kernel module build directory.
   executing: 'cd ./kernel; /usr/bin/make -k -j4 clean NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/source" SYSOUT="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/build"'...
   rm -f -r conftest
   make[1]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[2]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[2]: Leaving directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[1]: Leaving directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
-> Building kernel modules
   executing: 'cd ./kernel; /usr/bin/make -k -j4  NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/source" SYSOUT="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/build"'...
   make[1]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[2]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   /usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64/Makefile:984: *** "Cannot generate ORC metadata for CONFIG_UNWINDER_ORC=y, please install libelf-dev, libelf-devel or elfutils-libelf-devel".  Stop.
   make[2]: Leaving directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[1]: *** [Makefile:157: sub-make] Error 2
   make[1]: Target 'modules' not remade because of errors.
   make[1]: Leaving directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make: *** [Makefile:80: modules] Error 2
-> ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.

The log says that libelf-dev, libelf-devel, or elfutils-libelf-devel is missing. In the end I installed only the third one:
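The installer log is long, and the useful line is usually the first make or compiler error in it. A minimal sketch for pulling that line out follows; first_build_error is a hypothetical helper name, and the two patterns (a lowercase "error:" from the compiler and "***" from make) are an assumption based on the log excerpts in this article.

```shell
# Print the first line of a build log that looks like a make failure ("***")
# or a compiler error ("error:"), prefixed with its line number.
# first_build_error is a hypothetical helper name used for illustration.
first_build_error() {
    grep -n -m1 -E 'error:|\*\*\*' "$1"
}

# Example use on a real system:
#   first_build_error /var/log/nvidia-installer.log
```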

[root@localhost ***]# dnf install elfutils-libelf-devel
================================================================================
 Package                    Arch        Version               Repository   Size
================================================================================
Installing:
 elfutils-libelf-devel      x86_64      0.182-3.el8           baseos       59 k
Installing dependencies:
 zlib-devel                 x86_64      1.2.11-17.el8         baseos       58 k

Transaction Summary
================================================================================
Install  2 Packages

Total download size: 116 k
Installed size: 171 k
Is this ok [y/N]:

Confirm, reboot, and run the installer again.

3.4 The fourth problem

3.4.1

ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.

ERROR: The nvidia kernel module was not created. ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Go straight to the log:

Using: nvidia-installer ncurses v6 user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Tagging shared libraries with chcon -t textrel_shlib_t.
-> Installing NVIDIA driver version 418.113.
-> Performing CC sanity check with CC="/usr/bin/cc".
-> Kernel source path: '/lib/modules/4.18.0-305.19.1.el8_4.x86_64/source'
-> Kernel output path: '/lib/modules/4.18.0-305.19.1.el8_4.x86_64/build'
-> Performing Compiler check.
-> Performing Dom0 check.
-> Performing Xen check.
-> Performing PREEMPT_RT check.
-> Performing vgpu_kvm check.
-> Cleaning kernel module build directory.
   executing: 'cd ./kernel; /usr/bin/make -k -j4 clean NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/source" SYSOUT="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/build"'...
   rm -f -r conftest
   make[1]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[2]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[2]: Leaving directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[1]: Leaving directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
-> Building kernel modules
   executing: 'cd ./kernel; /usr/bin/make -k -j4  NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/source" SYSOUT="/lib/modules/4.18.0-305.19.1.el8_4.x86_64/build"'...
   make[1]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
   make[2]: Entering directory '/usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64'
     SYMLINK /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv-kernel.o
     SYMLINK /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia-modeset/nv-modeset-kernel.o
    CONFTEST: INIT_WORK
    CONFTEST: remap_pfn_range
    CONFTEST: follow_pfn
    CONFTEST: hash__remap_4k_pfn
    CONFTEST: vmap
    CONFTEST: set_pages_uc
    CONFTEST: list_is_first
    CONFTEST: set_memory_uc
    CONFTEST: set_memory_array_uc
    CONFTEST: change_page_attr
    CONFTEST: pci_get_class
    CONFTEST: pci_choose_state
    CONFTEST: vm_insert_page
    CONFTEST: acpi_device_id
    CONFTEST: acquire_console_sem
    CONFTEST: console_lock
    CONFTEST: kmem_cache_create
    CONFTEST: on_each_cpu
    CONFTEST: smp_call_function
    CONFTEST: acpi_evaluate_integer
    CONFTEST: ioremap_cache
    CONFTEST: ioremap_wc
    CONFTEST: acpi_walk_namespace
    CONFTEST: pci_domain_nr
    CONFTEST: pci_dma_mapping_error
    CONFTEST: sg_alloc_table
    CONFTEST: sg_init_table
    CONFTEST: pci_get_domain_bus_and_slot
    CONFTEST: get_num_physpages
    CONFTEST: efi_enabled
    CONFTEST: proc_create_data
    CONFTEST: pde_data
    CONFTEST: proc_remove
    CONFTEST: pm_vt_switch_required
    CONFTEST: xen_ioemu_inject_msi
    CONFTEST: phys_to_dma
    CONFTEST: get_dma_ops
    CONFTEST: write_cr4
    CONFTEST: of_get_property
    CONFTEST: of_find_node_by_phandle
    CONFTEST: of_node_to_nid
    CONFTEST: pnv_pci_get_npu_dev
    CONFTEST: of_get_ibm_chip_id
    CONFTEST: for_each_online_node
    CONFTEST: node_end_pfn
    CONFTEST: pci_bus_address
    CONFTEST: pci_stop_and_remove_bus_device
    CONFTEST: pci_remove_bus_device
    CONFTEST: request_threaded_irq
    CONFTEST: register_cpu_notifier
    CONFTEST: cpuhp_setup_state
    CONFTEST: dma_map_resource
    CONFTEST: backlight_device_register
    CONFTEST: register_acpi_notifier
    CONFTEST: timer_setup
    CONFTEST: pci_enable_msix_range
    CONFTEST: compound_order
    CONFTEST: do_gettimeofday
    CONFTEST: dma_direct_map_resource
    CONFTEST: vmf_insert_pfn
    CONFTEST: remap_page_range
    CONFTEST: address_space_init_once
    CONFTEST: kbasename
    CONFTEST: fatal_signal_pending
    CONFTEST: list_cut_position
    CONFTEST: vzalloc
    CONFTEST: wait_on_bit_lock_argument_count
    CONFTEST: bitmap_clear
    CONFTEST: usleep_range
    CONFTEST: radix_tree_empty
    CONFTEST: radix_tree_replace_slot
    CONFTEST: pnv_npu2_init_context
    CONFTEST: drm_dev_unref
    CONFTEST: drm_reinit_primary_mode_group
    CONFTEST: get_user_pages_remote
    CONFTEST: get_user_pages
    CONFTEST: drm_gem_object_lookup
    CONFTEST: drm_atomic_state_ref_counting
    CONFTEST: drm_driver_has_gem_prime_res_obj
    CONFTEST: drm_atomic_helper_connector_dpms
    CONFTEST: drm_connector_funcs_have_mode_in_name
    CONFTEST: drm_framebuffer_get
    CONFTEST: drm_gem_object_get
    CONFTEST: drm_dev_put
    CONFTEST: is_export_symbol_gpl_of_node_to_nid
    CONFTEST: is_export_symbol_present_swiotlb_map_sg_attrs
    CONFTEST: is_export_symbol_present_swiotlb_dma_ops
    CONFTEST: i2c_adapter
    CONFTEST: pm_message_t
    CONFTEST: irq_handler_t
    CONFTEST: acpi_device_ops
    CONFTEST: acpi_op_remove
    CONFTEST: outer_flush_all
    CONFTEST: proc_dir_entry
    CONFTEST: scatterlist
    CONFTEST: sg_table
    CONFTEST: file_operations
    CONFTEST: vm_operations_struct
    CONFTEST: atomic_long_type
    CONFTEST: file_inode
    CONFTEST: task_struct
    CONFTEST: kuid_t
    CONFTEST: dma_ops
    CONFTEST: swiotlb_dma_ops
    CONFTEST: dma_map_ops
    CONFTEST: noncoherent_swiotlb_dma_ops
    CONFTEST: vm_fault_present
    CONFTEST: vm_fault_has_address
    CONFTEST: backlight_properties_type
    CONFTEST: vmbus_channel_has_ringbuffer_page
    CONFTEST: kmem_cache_has_kobj_remove_work
    CONFTEST: sysfs_slab_unlink
    CONFTEST: fault_flags
    CONFTEST: atomic64_type
    CONFTEST: address_space
    CONFTEST: backing_dev_info
    CONFTEST: mm_context_t
    CONFTEST: vm_ops_fault_removed_vma_arg
    CONFTEST: node_states_n_memory
    CONFTEST: drm_bus_present
    CONFTEST: drm_bus_has_bus_type
    CONFTEST: drm_bus_has_get_irq
    CONFTEST: drm_bus_has_get_name
    CONFTEST: drm_driver_has_legacy_dev_list
    CONFTEST: drm_driver_has_set_busid
    CONFTEST: drm_crtc_state_has_connectors_changed
    CONFTEST: drm_init_function_args
    CONFTEST: drm_mode_connector_list_update_has_merge_type_bits_arg
    CONFTEST: drm_helper_mode_fill_fb_struct
    CONFTEST: drm_master_drop_has_from_release_arg
    CONFTEST: drm_driver_unload_has_int_return_type
    CONFTEST: kref_has_refcount_of_type_refcount_t
    CONFTEST: drm_atomic_helper_crtc_destroy_state_has_crtc_arg
    CONFTEST: drm_crtc_helper_funcs_has_atomic_enable
    CONFTEST: drm_mode_object_find_has_file_priv_arg
    CONFTEST: dma_buf_owner
    CONFTEST: drm_connector_list_iter
    CONFTEST: drm_atomic_helper_swap_state_has_stall_arg
    CONFTEST: drm_driver_prime_flag_present
    CONFTEST: dom0_kernel_present
    CONFTEST: nvidia_vgpu_hyperv_available
    CONFTEST: nvidia_vgpu_kvm_build
    CONFTEST: nvidia_grid_build
    CONFTEST: drm_available
    CONFTEST: drm_atomic_available
    CONFTEST: is_export_symbol_gpl_refcount_inc
    CONFTEST: is_export_symbol_gpl_refcount_dec_and_test
     CC [M]  /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv-frontend.o
     CC [M]  /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv-instance.o
     CC [M]  /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv.o
     CC [M]  /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv-acpi.o
   /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv.c: In function 'nvidia_probe':
   /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv.c:4129:5: error: implicit declaration of function 'vga_tryget'; did you mean 'vga_get'? [-Werror=implicit-function-declaration]
        vga_tryget(VGA_DEFAULT_DEVICE, VGA_RSRC_LEGACY_MASK);
        ^~~~~~~~~~
        vga_get
     CC [M]  /tmp/selfgz9906/NVIDIA-Linux-x86_64-418.113/kernel/nvidia/nv-chrdev.o

Further errors follow, such as drm/drmP.h: No such file or directory, 'NULL' undeclared, field 'base' has incomplete type, and so on. For the first error, look at the local vgaarb.h header: cat /usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64/include/linux/vgaarb.h

#ifndef LINUX_VGA_H
#define LINUX_VGA_H

#include

/* Legacy VGA regions */
#define VGA_RSRC_NONE           0x00
#define VGA_RSRC_LEGACY_IO     0x01
#define VGA_RSRC_LEGACY_MEM    0x02
#define VGA_RSRC_LEGACY_MASK   (VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM)
/* Non-legacy access */
#define VGA_RSRC_NORMAL_IO     0x04
#define VGA_RSRC_NORMAL_MEM    0x08

/* Passing that instead of a pci_dev to use the system "default"
 * device, that is the one used by vgacon. Archs will probably
 * have to provide their own vga_default_device();
 */
#define VGA_DEFAULT_DEVICE     (NULL)

struct pci_dev;

/* For use by clients */

/**
 *     vga_set_legacy_decoding
 *
 *     @pdev: pci device of the VGA card
 *     @decodes: bit mask of what legacy regions the card decodes
 *
 *     Indicates to the arbiter if the card decodes legacy VGA IOs,
 *     legacy VGA Memory, both, or none. All cards default to both,
 *     the card driver (fbdev for example) should tell the arbiter
 *     if it has disabled legacy decoding, so the card can be left
 *     out of the arbitration process (and can be safe to take
 *     interrupts at any time.
 */
#if defined(CONFIG_VGA_ARB)
extern void vga_set_legacy_decoding(struct pci_dev *pdev,
                    unsigned int decodes);
#else
static inline void vga_set_legacy_decoding(struct pci_dev *pdev,
                       unsigned int decodes) { };
#endif

#if defined(CONFIG_VGA_ARB)
extern int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible);
#else
static inline int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible) { return 0; }
#endif

/**
 * vga_get_interruptible
 * @pdev: pci device of the VGA card or NULL for the system default
 * @rsrc: bit mask of resources to acquire and lock
 *
 * Shortcut to vga_get with interruptible set to true.
 *
 * On success, release the VGA resource again with vga_put().
 */
static inline int vga_get_interruptible(struct pci_dev *pdev,
                    unsigned int rsrc)
{
       return vga_get(pdev, rsrc, 1);
}

/**
 * vga_get_uninterruptible - shortcut to vga_get()
 * @pdev: pci device of the VGA card or NULL for the system default
 * @rsrc: bit mask of resources to acquire and lock
 *
 * Shortcut to vga_get with interruptible set to false.
 *
 * On success, release the VGA resource again with vga_put().
 */
static inline int vga_get_uninterruptible(struct pci_dev *pdev,
                      unsigned int rsrc)
{
       return vga_get(pdev, rsrc, 0);
}

#if defined(CONFIG_VGA_ARB)
extern void vga_put(struct pci_dev *pdev, unsigned int rsrc);
#else
#define vga_put(pdev, rsrc)
#endif

#ifdef CONFIG_VGA_ARB
extern struct pci_dev *vga_default_device(void);
extern void vga_set_default_device(struct pci_dev *pdev);
extern int vga_remove_vgacon(struct pci_dev *pdev);
#else
static inline struct pci_dev *vga_default_device(void) { return NULL; };
static inline void vga_set_default_device(struct pci_dev *pdev) { };
static inline int vga_remove_vgacon(struct pci_dev *pdev) { return 0; };
#endif

/*
 * Architectures should define this if they have several
 * independent PCI domains that can afford concurrent VGA
 * decoding
 */
#ifndef __ARCH_HAS_VGA_CONFLICT
static inline int vga_conflicts(struct pci_dev *p1, struct pci_dev *p2)
{
       return 1;
}
#endif

#if defined(CONFIG_VGA_ARB)
int vga_client_register(struct pci_dev *pdev, void *cookie,
            void (*irq_set_state)(void *cookie, bool state),
            unsigned int (*set_vga_decode)(void *cookie, bool state));
#else
static inline int vga_client_register(struct pci_dev *pdev, void *cookie,
                      void (*irq_set_state)(void *cookie, bool state),
                      unsigned int (*set_vga_decode)(void *cookie, bool state))
{
    return 0;
}
#endif

#endif /* LINUX_VGA_H */

The local vgaarb.h declares vga_get but not vga_tryget, which is why the build fails.
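Instead of reading the whole header, the same check can be done with a whole-word search. This is a minimal sketch; header_declares is a hypothetical helper name, and a plain text match is only an approximation of "declared" (it would also match a symbol that appears only in a comment).

```shell
# Succeed if the symbol $2 appears as a whole word in the header file $1.
# With -w, vga_get does not match inside vga_get_interruptible.
# header_declares is a hypothetical helper name used for illustration.
header_declares() {
    grep -qwF -- "$2" "$1"
}

# Example use on a real system:
#   hdr="/usr/src/kernels/$(uname -r)/include/linux/vgaarb.h"
#   header_declares "$hdr" vga_tryget || echo "vga_tryget is not declared"
```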

3.4.2

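I tried another version, 418.56, and checked the log after it failed:

error: "NV_BUILD_MODULE_INSTANCES" is not defined, along with the same vga_get / vga_tryget error

3.4.3 Look for other versions on the official site

I chose the latest legacy-GPU driver, 390.144, released 2021.7.20. During installation the prompt "Install NVIDIA's 32-bit compatibility libraries?" appears. I selected 'yes', and the installation completed successfully.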

[root@localhost ***]# nvidia-smi
Mon Sep 27 21:07:30 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.144                Driver Version: 390.144                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730M     Off  | 00000000:04:00.0 N/A |                  N/A |
| N/A   44C    P8    N/A /  N/A |      0MiB /  2004MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

Version 390.144 installs successfully but 418.113 does not, possibly because that driver version does not match the kernel version. I would welcome an explanation from someone who knows more.

3.4.4

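Although 390.144 installed successfully, it is probably too old for the CUDA version in use, which reported "too old". I decided to uninstall it and install 470.74 instead. The installation went the same way as for 390.144. Here is the output with 470.74: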

[root@localhost ~]$ nvidia-smi
Wed Sep 29 21:40:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 N/A |                  N/A |
| N/A   44C    P8    N/A /  N/A |      3MiB /  2004MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

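The output shows CUDA version 11.4. This is the highest CUDA version the driver supports, not the version of an installed CUDA toolkit.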
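For scripted checks, the driver version can be read out of the banner instead of by eye. The sketch below parses banner text supplied on standard input; driver_version_from_banner is a hypothetical helper name, and it assumes the "Driver Version: X" wording seen in the two outputs above.

```shell
# Read nvidia-smi's banner on stdin and print the number that follows
# "Driver Version:" (first match only).
# driver_version_from_banner is a hypothetical helper name for illustration.
driver_version_from_banner() {
    sed -n 's/.*Driver Version: *\([0-9.][0-9.]*\).*/\1/p' | head -n 1
}

# Example use on a real system:
#   nvidia-smi | driver_version_from_banner
```

Where it is available, nvidia-smi --query-gpu=driver_version --format=csv,noheader prints the same value directly and is less fragile than parsing the table.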

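4 Postscript

Uninstalling the NVIDIA driver:

./NVIDIA-Linux-x86_64-470.74.run --uninstall

If nvidia-smi then reports that the command cannot be found, the uninstall succeeded.

Updating the driver:

./NVIDIA-Linux-x86_64-470.74.run --update

For the correspondence between NVIDIA driver versions and CUDA versions, see Release Notes :: CUDA Toolkit Documentation.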

References:

1 Disabling the nouveau driver on CentOS 7. https://blog.csdn.net/qq_37296212/article/details/114265216

2 ERROR: Unable to find the kernel source tree for the currently running kernel – CentOS / RHEL / AlmaLinux. https://linuxconfig.org/error-unable-to-find-the-kernel-source-tree-for-the-currently-running-kernel-centos-rhel

3 CentOS 7.5 NVIDIA driver: unable to find the kernel source tree for current running kernel; nvidia-smi has faild. https://blog.csdn.net/HaixWang/article/details/90408538

4 Solved: ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nv_ (一个处女座的程序猿, CSDN blog)

5 NVIDIA driver installation error: An error occurred while performing the step: "Building kernel modules" (muli, CSDN blog)

6 CentOS 7 kernel version downgrade (weixin_33842328's blog, CSDN blog)

8 redhat - Nvidia driver installation on RHEL 8 - Stack Overflow


