VisualAudio Flow
Description (with an LLM?) of: what is in the video chunk (str), what objects are present (str), how the objects are moving (str), how the camera is moving (str), what emotions are shown (list[enum]), whether there were multiple changes of view (bool). This also tells us how to read the footage: is the camera shaking, is it professional or spontaneous?
Segmentation into chunks of 10 seconds. Send all chunks to LLaVA-NeXT OV 0.5B, then send the response to a small LLM which parses it into JSON.
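A minimal sketch of the 10-second segmentation step, assuming ffmpeg is on PATH; paths and the function name are illustrative, and stream copy cuts on keyframes, so chunk boundaries are only approximately 10 s:

```python
import subprocess
from pathlib import Path

def split_into_chunks(video_path: str, out_dir: str, chunk_seconds: int = 10) -> list[Path]:
    """Split a video into fixed-length segments with ffmpeg's segment muxer."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-c", "copy", "-map", "0",        # stream copy: fast, but cuts land on keyframes
            "-f", "segment",
            "-segment_time", str(chunk_seconds),
            "-reset_timestamps", "1",
            f"{out_dir}/chunk_%04d.mp4",
        ],
        check=True,
    )
    return sorted(Path(out_dir).glob("chunk_*.mp4"))
```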
Save the parsed result to a list with: (response: {concatenated LLM description + emotions + multiple view changes + is shaking + is professional}, metadata: {start_timestamp, emotions, multiple view changes, is shaking, is professional})
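One possible shape for the parsed per-chunk record; the field names and the emotion label set are assumptions, not a fixed spec:

```python
from dataclasses import dataclass, field
from enum import Enum


class Emotion(str, Enum):
    # Example label set; the real enum should match whatever the prompt asks for.
    JOY = "joy"
    SADNESS = "sadness"
    ANGER = "anger"
    FEAR = "fear"
    SURPRISE = "surprise"
    NEUTRAL = "neutral"


@dataclass
class VideoChunkDescription:
    """JSON parsed by the small LLM from the LLaVA response for one 10-second chunk."""
    description: str             # what is in the chunk
    objects: str                 # what objects are present
    object_motion: str           # how the objects are moving
    camera_motion: str           # how the camera is moving
    emotions: list[Emotion]      # emotions shown in the chunk
    multiple_view_changes: bool  # were there multiple changes of view
    is_shaking: bool             # is the camera shaking
    is_professional: bool        # professional footage vs. spontaneous


@dataclass
class VideoChunkRecord:
    """Item stored in the list: text for embedding plus chunk metadata."""
    response: str                # concatenated description + flags, as described above
    start_timestamp: float       # chunk start time in seconds
    emotions: list[Emotion] = field(default_factory=list)
    multiple_view_changes: bool = False
    is_shaking: bool = False
    is_professional: bool = False
```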
Transcription with speaker IDs, plus loud and quiet timestamp ranges.
Transform the video to audio. Build a spectrogram so we know when it's loud and when it's quiet. Send the whole audio to WhisperX - the response should be timestamped.
Save this to a list with: (response, metadata: {start_timestamp, is_loud, is_quiet})
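A rough sketch of the audio side, assuming the audio track has already been extracted (e.g. with ffmpeg) and that WhisperX is installed; thresholds, frame size, and model size are placeholders:

```python
import librosa
import numpy as np
import whisperx  # assumption: whisperx is installed; calls follow its README


def loud_quiet_ranges(audio_path: str, frame_s: float = 0.5,
                      loud_db: float = -20.0, quiet_db: float = -45.0):
    """Classify coarse frames of the audio as loud / quiet using RMS energy."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    hop = int(frame_s * sr)
    rms = librosa.feature.rms(y=y, frame_length=hop, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    loud = [(i * frame_s, (i + 1) * frame_s) for i, v in enumerate(db) if v > loud_db]
    quiet = [(i * frame_s, (i + 1) * frame_s) for i, v in enumerate(db) if v < quiet_db]
    return loud, quiet


def transcribe(audio_path: str, device: str = "cuda"):
    """Timestamped transcription of the whole audio with WhisperX."""
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)
    # Speaker IDs would come from WhisperX's diarization pipeline (needs a HF token),
    # which assigns a speaker label per segment; omitted here for brevity.
    return result["segments"]  # each segment has "start", "end", "text"
```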
- We can spot dialogs on the video timeline (marked in green)
- Spot themes of dialogs on the video timeline - they can overlap each other (sub-themes)
- "Dense highlights" - based on multiple view changes and emotion changes between chunks (see the sketch after this list)
- EVEN MORE :D
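A hedged sketch of the "dense highlights" idea, reusing the per-chunk record shape sketched earlier (so the field names are assumptions):

```python
def dense_highlights(chunks: list[VideoChunkRecord]) -> list[dict]:
    """Flag chunks with many view changes or an emotion shift vs. the previous chunk."""
    highlights = []
    for prev, cur in zip(chunks, chunks[1:]):
        emotion_shift = set(cur.emotions) != set(prev.emotions)
        if cur.multiple_view_changes or emotion_shift:
            reasons = []
            if cur.multiple_view_changes:
                reasons.append("multiple view changes")
            if emotion_shift:
                reasons.append("emotion change vs. previous chunk")
            highlights.append({
                "description": f"chunk starting at {cur.start_timestamp:.0f}s",
                "reason": ", ".join(reasons),
            })
    return highlights
```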
Overlay both outputs. Save to a FAISS vector store: each 10-second chunk is a document, and the metadata is the start time of the chunk.
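A minimal indexing sketch with raw FAISS plus sentence-transformers as the embedder; the embedding model choice and the parallel metadata list are assumptions:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer  # assumption: any text embedder works

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def build_index(docs: list[str], metadatas: list[dict]):
    """One document per 10-second chunk; metadata (start time, flags) kept in a parallel list."""
    vectors = embedder.encode(docs, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # cosine similarity via normalized inner product
    index.add(np.asarray(vectors, dtype="float32"))
    return index, metadatas


def search(index, metadatas, query: str, k: int = 5):
    """Return the top-k chunk metadatas with their similarity scores."""
    q = embedder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(metadatas[i], float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]
```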
Now respond with the overlaid outputs formatted as described below:
{
  "general_summarization": "general_summarization",
  "dialogs": [
    {
      "start_time": 0.05,
      "end_time": 0.25,
      "subject": "subject_1",
      "speakers": ["speaker_1", "speaker_2"]
    }
  ],
  "themes": [
    {
      "description": "theme_1",
      "start_time": 0.5,
      "end_time": 5
    },
    {
      "description": "theme_2",
      "start_time": 2,
      "end_time": 4
    }
  ],
  "highlights": [
    {
      "description": "highlight_1",
      "reason": "reason_1"
    }
  ]
}

After that, let's allow the user to query for specific data using the following flow:
- user question
- LLM generates a query for RAG
- LLM gathers results from the vector store
- LLM asks itself whether the results answer the user's question
- if not, return to step 2 with tips on what should additionally be retrieved, and also provide the data collected so far
- respond to the user
graph LR
A[User question] --> B[LLM generates query for vector store]
B --> C[Query vector store]
C --> D{Does it answer the user's question?}
D -- No --> E[LLM tells what is missing in the response]
E --> B
D -- Yes --> F[Respond to user]
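A sketch of that self-correcting loop; `llm(prompt) -> str` is a hypothetical stand-in for the chat model, and `search` refers to the FAISS helper sketched above:

```python
def answer_with_retries(question: str, index, metadatas, llm, max_rounds: int = 3) -> str:
    """Self-correcting RAG loop mirroring the diagram above."""
    collected: list[str] = []
    hint = ""
    for _ in range(max_rounds):
        # 1. LLM turns the user question (plus any hint) into a retrieval query.
        query = llm(f"Write a short search query for a video-chunk index.\n"
                    f"Question: {question}\nMissing so far: {hint}")
        # 2. Gather results from the vector store.
        hits = search(index, metadatas, query, k=5)
        collected.extend(str(h) for h in hits)
        # 3. LLM judges whether the collected context answers the question.
        verdict = llm(f"Question: {question}\nContext: {collected}\n"
                      f"Answer 'YES' if the context is enough, otherwise describe what is missing.")
        if verdict.strip().upper().startswith("YES"):
            break
        hint = verdict  # 4. Not enough: loop again with tips about what is still missing.
    # 5. Respond to the user from everything collected so far.
    return llm(f"Answer the question using only this context.\n"
               f"Question: {question}\nContext: {collected}")
```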
After uploading a video, the UI allows asking questions right after the audio and video analysis starts (we show the progress of the audio and video analysis live - during analysis we will save "snapshots" of the vector store). The user can type a question for the summary and receive an answer based on the current state of the vector store.
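One possible way to take the vector-store "snapshots" mentioned above, assuming the raw-FAISS index from the earlier sketch and JSON-serializable metadata; paths and naming are illustrative:

```python
import json
import time
from pathlib import Path

import faiss


def snapshot_index(index, metadatas: list[dict], out_dir: str = "snapshots") -> str:
    """Persist the current index state so the UI can answer queries mid-analysis."""
    Path(out_dir).mkdir(exist_ok=True)
    stamp = int(time.time())
    faiss.write_index(index, f"{out_dir}/index_{stamp}.faiss")
    with open(f"{out_dir}/metadata_{stamp}.json", "w") as f:
        json.dump(metadatas, f)  # assumption: metadata entries are JSON-serializable
    return f"{out_dir}/index_{stamp}.faiss"
```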