Alternatively, you can follow the steps below. The code is tested with Python 3.9, CUDA 11.3, and PyTorch 1.10.1. Additional dependencies include: h5py kornia torch torchvision omegaconf torchmetrics==0.10.3 ...
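After installing, it can help to confirm that the environment matches the tested versions. The following is a minimal sketch, not part of the repository: the `check_environment` helper and the `EXPECTED` table are hypothetical names, and the table covers only the packages named above (the dependency list continues beyond them). It assumes the distribution names match the package names as written.

```python
import sys
from importlib import metadata

# Versions taken from the text above; None means no version is pinned there.
# Hypothetical table: extend it with the remaining dependencies of the project.
EXPECTED = {
    "h5py": None,
    "kornia": None,
    "torch": "1.10.1",
    "torchvision": None,
    "omegaconf": None,
    "torchmetrics": "0.10.3",
}


def check_environment(expected=EXPECTED):
    """Return a {package: status} report for the given expected versions."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build tags such as "+cu113" when comparing versions.
        base = installed.split("+")[0]
        if wanted is None or base == wanted:
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"mismatch (installed {installed}, tested with {wanted})"
    return report


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    print(f"python {major}.{minor} (tested with 3.9)")
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

A "mismatch" entry does not necessarily mean the code will fail; it only flags a version other than the one the code was tested with.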
Recent Multimodal Large Language Models (MLLMs) perform remarkably well on vision-language tasks such as image captioning and question answering, but they lack an essential perception ability, i.e., object ...