ECVA | European Computer Vision Association

Multi-Modal Video Dialog State Tracking in the Wild

Adnen Abdessaied*, Lei Shi, Andreas Bulling

Abstract

"We present a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two [...]: (1) They either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect [...] real-world in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. [...] multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to [...] fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). The model achieves new state-of-the-art results on five challenging benchmarks."
