Found 6 repositories (showing 6)
Use CLIP to represent video for Retrieval Task
yangbang18
(PRCV'2022) CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Abhishekjha111
Context

When evaluating computer vision projects, training and test data are essential: the data represents the challenge a proposed system is meant to solve. It is desirable to have a large database with wide variation representing that challenge, e.g. detecting and recognizing traffic lights (TLs) in an urban environment. A survey of existing work makes it clear that evaluation is currently limited primarily to small local datasets gathered by the authors themselves rather than to publicly available datasets. These local datasets are often small and contain little variation. This makes it nearly impossible to compare the work and results of different authors, and it also becomes hard to identify the current state of the field. To provide a common basis for future comparison of traffic light recognition (TLR) research, an extensive public database has been collected based on footage from US roads. The database consists of continuous test and training video sequences, totaling 43,007 frames and 113,888 annotated traffic lights. The sequences were captured by a stereo camera mounted on the roof of a vehicle driving at both night and day under varying light and weather conditions. Only the left camera view is used in this database, so the stereo feature is currently unused.

Content

The database was collected in San Diego, California, USA. It provides four daytime and two nighttime sequences primarily used for testing, totaling 23 minutes and 25 seconds of driving in Pacific Beach and La Jolla, San Diego. The stereo image pairs were acquired with a Point Grey Bumblebee XB3 (BBX3-13S2C-60), which contains three lenses, each capturing images at a resolution of 1280 x 960 with a field of view (FoV) of 66°. The left camera view is used for all test sequences and training clips. The training clips consist of 13 daytime clips and 5 nighttime clips.
Annotations

The annotation.zip contains two types of annotation for each sequence and clip. The first annotation type marks the entire TL area and the state the TL is in. This annotation file is called frameAnnotationsBOX and is generated from the second annotation file by enlarging all annotations larger than 4x4. The second type marks only the area of the traffic light that is lit and the state it is in; this file is called frameAnnotationsBULB. The annotations are stored one per line, together with additional information such as the class tag and the file path to the individual image file. With this structure, the annotations are stored in a csv file, whose layout is exemplified in the listing below:

Filename;Annotation tag;Upper left corner X;Upper left corner Y;Lower right corner X;Lower right corner Y;Origin file;Origin frame number;Origin track;Origin track frame number

Acknowledgements

When using this dataset, we would appreciate it if you cite the following papers:

Jensen MB, Philipsen MP, Møgelmose A, Moeslund TB, Trivedi MM. Vision for Looking at Traffic Lights: Issues, Survey, and Perspectives. IEEE Transactions on Intelligent Transportation Systems. 2016 Feb 3;17(7):1800-1815. DOI: 10.1109/TITS.2015.2509509

Philipsen MP, Jensen MB, Møgelmose A, Moeslund TB, Trivedi MM. Traffic Light Detection: A Learning Algorithm and Evaluations on Challenging Dataset. In: 2015 IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), pp. 2341-2345. IEEE, 2015.
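The semicolon-delimited annotation format above can be read with a short Python sketch. The sample row and all of its values below are made up purely for illustration (they are not taken from the dataset), and the sketch assumes the files carry no header line:

```python
import csv
import io

# Column names as listed in the dataset description.
FIELDS = [
    "Filename", "Annotation tag",
    "Upper left corner X", "Upper left corner Y",
    "Lower right corner X", "Lower right corner Y",
    "Origin file", "Origin frame number",
    "Origin track", "Origin track frame number",
]

def parse_annotations(text):
    """Parse frameAnnotationsBOX/BULB-style semicolon-delimited text
    into a list of dicts, one per annotated traffic light."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter=";")
    rows = []
    for row in reader:
        # Convert the four bounding-box corner columns to integers.
        for key in FIELDS[2:6]:
            row[key] = int(row[key])
        rows.append(row)
    return rows

# Hypothetical example row (filename, tag, and coordinates are invented).
sample = "dayClip1--00000.png;go;622;371;633;394;dayClip1.avi;0;0;0"
boxes = parse_annotations(sample)
```

Reading with `csv` and an explicit `fieldnames` list keeps each value attached to its column name, which is less error-prone than indexing a split string by position.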
saumyasinha023
NLP application which converts text to sign language translations in form of video and graphical representation clips.
VushakolaBhavani
The purpose of this script is to detect a specific gesture within a video sequence. It accomplishes this by comparing the gesture representation (an image or a short video clip) with each frame of the test video. If the gesture is detected in a frame, the script overlays the word "DETECTED" in bright green on the top right corner of the frame.
sufyn
The purpose of this script is to detect a specific gesture within a video sequence. It accomplishes this by comparing the gesture representation (an image or a short video clip) with each frame of the test video. If the gesture is detected in a frame, the script overlays the word "DETECTED" in bright green on the top right corner of the frame.