    Autonomous AI Video Analysis 2.0 | GPT-4V Turbo x Whisper

    In today's video, I share the upgrades I have made to the video analysis script. The upgraded version takes a video as input and generates a description as output, now drawing on the audio track as well as the frames. Let's go through these enhancements and see how they improve the results.

    Upgrade Details

    In the previous version, the script followed a simple pipeline. It took an MP4 video as input, extracted frames from it, and generated a description based solely on the visual content. The output was a voice-over for the video in MP3 format.
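
    The frame-extraction step can be sketched as follows. The source does not say how frames were chosen, so this assumes evenly spaced sampling; the function name and the two-second default are hypothetical.

```python
def sample_frame_indices(total_frames: int, fps: float, seconds_between: float = 2.0) -> list[int]:
    """Return the indices of frames spaced roughly `seconds_between` apart.

    Assumption: even spacing. The original script may pick frames differently.
    """
    if total_frames <= 0 or fps <= 0 or seconds_between <= 0:
        return []
    # Number of frames to skip between two sampled frames (at least one).
    step = max(1, round(fps * seconds_between))
    return list(range(0, total_frames, step))


if __name__ == "__main__":
    # A 10-second clip at 30 fps, sampled every 2 seconds.
    print(sample_frame_indices(300, 30.0, 2.0))
```

    With OpenCV, for example, the frame count and frame rate could be read from a `cv2.VideoCapture` through `CAP_PROP_FRAME_COUNT` and `CAP_PROP_FPS`. Sampling less often keeps the number of images sent to the vision model small for longer videos.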

    The upgraded version adds new functionality. First, we extract the audio track from the video as an MP3 file. Then we use the Whisper API to transcribe or translate the audio into text. Finally, we combine the visual description and the audio transcription to build a more complete picture of the video content.
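
    A minimal sketch of the transcribe-or-translate step, assuming the official `openai` Python client (v1 or later) and the `whisper-1` model. The function name `audio_to_text` and its parameters are hypothetical; the client is passed in, so the sketch does not depend on how the original script creates it.

```python
from pathlib import Path


def audio_to_text(client, audio_path, translate=False, model="whisper-1"):
    """Send an audio file to Whisper and return the resulting text.

    `client` is assumed to be an `openai.OpenAI()` instance.
    translate=False keeps the spoken language; translate=True returns English.
    """
    # Both endpoints take the same arguments; only the output language differs.
    endpoint = client.audio.translations if translate else client.audio.transcriptions
    with Path(audio_path).open("rb") as audio_file:
        result = endpoint.create(model=model, file=audio_file)
    return result.text
```

    A call might look like `audio_to_text(OpenAI(), "audio.mp3")`, with `translate=True` for videos that are not in English.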

    To make this description accessible, we pass it to a Text-To-Speech (TTS) API to produce a spoken report. There is also an option to print the description as plain text. Because the description now draws on both picture and sound, it can capture speech and narration that the frames alone would miss.
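
    The spoken-report step might look like the sketch below. It assumes the `openai` Python client (v1 or later), the `tts-1` model, and the `alloy` voice; none of these are confirmed by the source. The 4096-character chunk size reflects the input limit of the speech endpoint as I understand it, and both helper names are hypothetical.

```python
def split_for_tts(text: str, max_chars: int = 4096) -> list[str]:
    """Split text on whitespace into chunks of at most `max_chars` characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word  # a single word longer than max_chars is kept whole
    if current:
        chunks.append(current)
    return chunks


def speak_report(client, text, out_prefix="report", model="tts-1", voice="alloy"):
    """Write one MP3 file per chunk and return the list of file paths.

    `client` is assumed to be an `openai.OpenAI()` instance.
    """
    paths = []
    for i, chunk in enumerate(split_for_tts(text)):
        response = client.audio.speech.create(model=model, voice=voice, input=chunk)
        path = f"{out_prefix}_{i}.mp3"
        with open(path, "wb") as f:
            f.write(response.content)  # raw audio bytes returned by the API
        paths.append(path)
    return paths
```

    For the plain-text option, printing the description before calling `speak_report` is enough.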

    Code Enhancements

    To implement these upgrades, we added three essential functions to the script. First, the "extract_audio" function extracts the audio from the video and saves it as an MP3 file. Next, the "transcribe_audio" function uses the Whisper API to transcribe the audio into text. Lastly, the "get_rewritten_description" function turns the video description and the audio transcription into a revised description.
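
    As an illustration of the first function, here is one way "extract_audio" could be written. It assumes the `ffmpeg` command-line tool is installed and on the PATH; the original script may well use a different library. The helper `build_extract_audio_command` is a hypothetical name added to keep the command easy to inspect.

```python
import subprocess


def build_extract_audio_command(video_path: str, audio_path: str) -> list[str]:
    """Build the ffmpeg argument list that copies only the audio track."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # input video
        "-vn",             # drop the video stream, keep audio only
        audio_path,        # output format is inferred from the .mp3 extension
    ]


def extract_audio(video_path: str, audio_path: str = "audio.mp3") -> str:
    """Extract the audio track to an MP3 file and return its path."""
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_extract_audio_command(video_path, audio_path), check=True)
    return audio_path
```

    The resulting MP3 path can then be handed to the transcription step.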

    Two details of the code are worth noting. The "combine text" function merges the video description and the audio transcription into a single text. The "get_rewritten_description" function then builds a prompt that includes the video duration and a target word count, which keeps the description at a suitable length.
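
    The merge and prompt steps could be sketched like this. The section labels, the prompt wording, the helper names `target_word_count` and `build_rewrite_prompt`, and the rate of two words per second are all assumptions for illustration; the source only states that the prompt includes the video duration and a word count.

```python
def combine_text(video_description: str, audio_transcript: str) -> str:
    """Merge the two sources into one labeled block of text."""
    return (
        "VISUAL DESCRIPTION:\n" + video_description.strip()
        + "\n\nAUDIO TRANSCRIPT:\n" + audio_transcript.strip()
    )


def target_word_count(duration_seconds: float, words_per_second: float = 2.0) -> int:
    """Estimate how many words fit in a voice-over of the given duration."""
    return max(1, int(duration_seconds * words_per_second))


def build_rewrite_prompt(combined_text: str, duration_seconds: float,
                         words_per_second: float = 2.0) -> str:
    """Build a rewrite prompt that states the duration and the word budget."""
    words = target_word_count(duration_seconds, words_per_second)
    return (
        f"The video is {duration_seconds:.0f} seconds long. "
        f"Rewrite the material below as one description of at most {words} words.\n\n"
        f"{combined_text}"
    )


if __name__ == "__main__":
    merged = combine_text("A robot walks across a lab.", "Narrator: meet our new robot.")
    print(build_rewrite_prompt(merged, duration_seconds=30))
```

    Tying the word budget to the duration is what lets a spoken report finish at about the same time as the video. Changing `words_per_second` is one way to make descriptions shorter or longer.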

    For a more detailed walkthrough of the code, refer to the video.

    Testing the Upgraded Script

    To test the upgrades, we analyzed several video clips recommended by subscribers. One of them was about Boston Dynamics robots. After running the script, we obtained a description that drew on both the video frames and the audio transcript. The revised description was a bit long, but its length can be adjusted through the prompt.

    We then tested the script on other videos, such as news segments and informative presentations. In each case, the script generated a revised description that accurately captured the content, and the timing of the descriptions aligned well with the videos.

    Summary and Keywords

    In summary, the Autonomous AI Video Analysis 2.0 script, powered by GPT-4V Turbo and the Whisper API, offers improved video analysis capabilities. By combining a visual description with an audio transcription, it gives a fuller account of the video content. The upgraded script generates voice-over reports and can also print the description as text. With further fine-tuning, it can be adapted for various applications.

    Keywords: Autonomous AI, Video Analysis, GPT-4V Turbo, Whisper API, description, voice-over, audio transcription, TTS API, transcript, video frames, revised description, code enhancements, testing, video clips, Boston Dynamics, news segments, informative presentations.

    FAQ


    1. Can this upgraded script handle videos of any length? The script can handle videos of various lengths, but it's necessary to consider the word count and duration to ensure concise and accurate descriptions.

    2. Can I adjust the length of the generated descriptions? Yes, the code allows for adjustments in the description length. By modifying the prompt, you can control the word count and achieve desired results.

    3. What APIs are utilized in this script? The upgraded script uses GPT-4V Turbo to describe the video frames, the Whisper API for audio transcription, and the Text-to-Speech (TTS) API for generating voice-over reports.

    4. Is the script compatible with different video formats? The script is compatible with the MP4 video format, which is commonly used. However, slight modifications may be required to handle other formats.

    5. Can this script be customized for specific applications? Yes, the code can be fine-tuned to cater to different applications by adjusting prompts, word count preferences, and other parameters.

