This dataset contains behavioural outcome data from the audio-visual simultaneity judgement (SJ) and emotion recognition (ER) tasks described in the paper: "An RCT study showing few weeks of music lessons enhance audio-visual temporal processing".

In the SJ task, participants judged with a key press whether the presented auditory and visual cues were synchronised. The task included two types of audio-visual cues: a flash paired with a beep, and a face paired with a voice. In the ER task, participants made emotional judgements about dynamic facial expression stimuli, classifying each as joy, sadness, fear, anger, disgust, surprise, or neutral with a speeded mouse click on the target emotion. Three levels of emotional intensity (low, medium, and high) were included for every emotion except neutral.

Participants were screened before recruitment so that only non-musician adults with normal or corrected-to-normal vision and hearing were included. The study used a parallel-group RCT design. Blinding was not included, as the design required participants' active involvement in certain conditions, and the experimenter, who also served as the trainer, had to know and run the sessions. However, the experimenter had no control over group allocation: participants were randomly assigned to their group at the beginning of the study.
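For illustration, below is a minimal sketch of how a balanced random allocation for a parallel-group design could be performed in Python. The function name, group labels, and participant IDs are hypothetical placeholders and are not taken from the dataset or the paper:

    import random

    def balanced_allocation(participant_ids, groups=("training", "control"), seed=None):
        """Randomly assign participants to parallel groups of (near-)equal size.

        Hypothetical illustration only: the group labels and participant IDs
        are placeholders, not fields from the actual dataset.
        """
        rng = random.Random(seed)
        # Build a balanced list of group labels, one per participant.
        labels = [groups[i % len(groups)] for i in range(len(participant_ids))]
        # Shuffle the labels so each participant's assignment is random.
        rng.shuffle(labels)
        return dict(zip(participant_ids, labels))

    # Example: allocate 20 participants, with a fixed seed for reproducibility.
    allocation = balanced_allocation(list(range(1, 21)), seed=1)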