Awesome-Multimodal-Large-Language-Models
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
https://github.com/DAMO-NLP-SG/VideoLLaMA2
AskVideos-VideoCLIP
https://github.com/AskYoutubeAI/AskVideos-VideoCLIP
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
https://github.com/OpenGVLab/InternVideo
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
https://github.com/OpenGVLab/InternVideo2
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid
ViCLIP: A Video-Text Representation Learning Model Trained on InternVid
https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo1/Pretrain/ViCLIP