Image Captioning with CLIP and GPT-2 — A multimodal deep learning model that integrates a CLIP vision encoder with a GPT-2 text decoder for image-to-text generation. Trained on the Flickr30k dataset (30K+ images), it achieves a BLEU-4 score of 5.28% and a CIDEr score of 37.76%, outperforming a CNN+LSTM baseline by 12%.
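The core idea behind coupling the two models can be sketched as a prefix projection: the CLIP image embedding is mapped to a short sequence of pseudo-token embeddings that are prepended to the caption tokens before the GPT-2 decoder runs (the approach popularized by ClipCap-style models). The sketch below assumes CLIP ViT-B/32 features (512-d) and GPT-2 small embeddings (768-d); the weight initialization, prefix length, and function names are illustrative and not taken from this repository.

```python
import numpy as np

# Assumed dimensions: CLIP ViT-B/32 image features are 512-d;
# GPT-2 (small) token embeddings are 768-d. PREFIX_LEN is a
# hyperparameter chosen here for illustration.
CLIP_DIM, GPT2_DIM, PREFIX_LEN = 512, 768, 10

rng = np.random.default_rng(0)
# Hypothetical learned projection weights (randomly initialized here;
# in training they would be optimized with the captioning loss).
W = rng.standard_normal((CLIP_DIM, GPT2_DIM * PREFIX_LEN)) * 0.02
b = np.zeros(GPT2_DIM * PREFIX_LEN)

def project_prefix(image_emb):
    """Map CLIP image embeddings to GPT-2 prefix embeddings.

    image_emb: (batch, CLIP_DIM) array of CLIP image features.
    Returns:   (batch, PREFIX_LEN, GPT2_DIM) pseudo-token embeddings
               to prepend to the caption token embeddings.
    """
    out = np.tanh(image_emb @ W + b)
    return out.reshape(-1, PREFIX_LEN, GPT2_DIM)

# One forward pass on a dummy batch of two "images".
image_emb = rng.standard_normal((2, CLIP_DIM))
prefix = project_prefix(image_emb)
print(prefix.shape)  # (2, 10, 768)
```

At generation time, these prefix embeddings would be concatenated with the embeddings of tokens decoded so far, and GPT-2 would autoregressively produce the caption conditioned on the image.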
Stars: 1 · Forks: 0 · Watchers: 1 · Open Issues: 0
Commits: 16