Questions Regarding the Audio Data in the Dataset #3

badhorselgy · 2024-12-04T06:24:33Z

Thank you for making the BatVision dataset and the audio-only baseline open source. We sincerely appreciate your contributions to both the acoustic and open-source communities. However, we have run into a few points of confusion while using the BatVision dataset and the UNetSoundOnly baseline, and would appreciate your help in clarifying them.

Audio data

Taking the data file 2019.08.26/audio/raw_long/180336.309017_left.npy from BatVisionv1 as an example, along with its corresponding camera images and depth maps, what is the meaning of the approximately 980 initial sampling points in the audio data that seem to be nearly blank?

The paper mentions: "Designed for smaller spaces, audio recordings were cut at 72.5ms, including echoes from objects at a 12m distance." However, in the code, we only found a section that processes depth data, and the audio data input to the network appears to be around 0.1 seconds in length. How should the code be modified to correctly reproduce the audio input representation as described in the BatVision paper?

Additionally, does the provided audio data include the direct path signal emitted by the JBL Flip4 Bluetooth speaker? Since the speaker's sound units seem to be located at both ends of a cylindrical body, we are unable to determine whether the first path in the spectrum is a reflection from the corridor walls or the direct path between the speaker's sound unit and the microphone.

Sensor Parameters
Would you mind sharing the coordinate transformation parameters between the camera and the microphone?

Finally, thank you once again for making the dataset and baseline open source, which greatly aids future researchers. We look forward to your response.

AmandineBtto · 2024-12-04T11:20:53Z

Hello @badhorselgy

Thank you for your interest in the BatVision dataset and for your questions. We truly appreciate your feedback and the detailed observations you've shared. Below are responses to your questions:

1. Audio Data

1.1 Initial blank sampling points
You are correct that in the BatVision v1 (BV1) audio data, the first ~980 samples appear nearly blank. These initial samples represent a brief period of silence before the chirp is emitted. This was intentional, as it allowed us to apply jittering to the window position, augmenting the audio data by introducing up to a 30% shift in the input window (cf. Dataloader link below).
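As a rough illustration of the jittering described above, the sketch below draws a random start offset of up to 30% of the window length and slices the input window from the raw recording. The function name `jittered_window`, the reading of "30% shift" as a fraction of the window length, and the uniform sampling are assumptions made for illustration; the original Dataloader may implement this differently.

```python
import numpy as np


def jittered_window(waveform, window_len, max_shift_frac=0.3, rng=None):
    """Return a randomly shifted window of `window_len` samples and its start index.

    Assumption: the shift is drawn uniformly from [0, max_shift_frac * window_len],
    capped so the window never runs past the end of the recording.
    """
    waveform = np.asarray(waveform)
    rng = np.random.default_rng() if rng is None else rng
    slack = waveform.shape[-1] - window_len
    if slack < 0:
        raise ValueError("window_len is longer than the recording")
    max_shift = min(int(max_shift_frac * window_len), slack)
    start = int(rng.integers(0, max_shift + 1))  # upper bound is exclusive
    return waveform[..., start:start + window_len], start
```

The leading silence gives the window room to move without cutting into the chirp, which is why the blank samples are useful for this kind of augmentation.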

1.2 Audio length
The raw BV1 audio recordings are 0.1 seconds in length, but the intended input was 72.5 ms. Unfortunately, the current BatVision loader code omits this cut. You can correct this by:

Using the original loader from previous work on BV1: https://github.com/SaschaHornauer/Batvision/blob/master/Dataloaders.py
Alternatively, adapting lines 53 to 68 in BatvisionV2_Dataset.py to trim the audio length to 72.5 ms.
We apologize for this oversight.
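A minimal sketch of the trimming step, assuming the waveform is a NumPy array with time on the last axis. The helper name `trim_audio` and its parameters are hypothetical and not part of the BatVision code. The sampling rate is not stated in this thread, so it is a required argument and must be taken from the dataset itself.

```python
import numpy as np


def trim_audio(waveform, sample_rate, duration_s=0.0725, start=0):
    """Keep `duration_s` seconds of audio beginning at sample index `start`.

    `sample_rate` is in Hz and must match the recording; it is an input here
    because this sketch does not know the dataset's actual rate.
    """
    waveform = np.asarray(waveform)
    n_samples = int(round(duration_s * sample_rate))
    if start < 0 or start + n_samples > waveform.shape[-1]:
        raise ValueError("requested window exceeds the recording length")
    return waveform[..., start:start + n_samples]
```

For example, with a purely hypothetical rate of 44100 Hz, a 0.1 s recording (4410 samples) would be cut to 3197 samples. The same slice could be applied inside the existing loader before the spectrogram or waveform is handed to the network.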

1.3 JBL direct path
The provided audio data includes the entire signal emitted by the JBL Flip4 Bluetooth speaker, encompassing both the direct path and subsequent reflections. The first peak in the waveform corresponds to the direct chirp signal as recorded by each microphone. Given the speaker’s cylindrical design, distinguishing between direct and reflected paths in the spectrum may be complex, but typically, the earliest peak represents the direct path.
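To inspect where the earliest arrival sits in a given recording, one simple option is to look for the first sample whose magnitude exceeds a fraction of the recording's peak. The sketch below does this; the function name `first_onset` and the 10% relative threshold are arbitrary choices for illustration, not something taken from the BatVision code, and a noisy recording may need a different threshold or a smoothed envelope.

```python
import numpy as np


def first_onset(waveform, rel_threshold=0.1):
    """Return the index of the first sample with |x| >= rel_threshold * max|x|.

    Returns None for an all-zero signal. The threshold is a heuristic
    assumption, not a calibrated value.
    """
    envelope = np.abs(np.asarray(waveform, dtype=float))
    peak = envelope.max()
    if peak == 0:
        return None
    above = np.flatnonzero(envelope >= rel_threshold * peak)
    return int(above[0])
```

Running this on the left and right channel of the same recording and comparing the two indices gives a quick check of whether the first arrival is consistent across both microphones.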

2. Sensors

Unfortunately, we no longer have access to the BatVision v1 robot, and the precise coordinate transformation parameters between the camera and microphone are unavailable. However, here are some details that may help:

The microphones are set 23 cm apart horizontally.
Based on the ZED camera’s dimensions and figure 3, the vertical distance between the microphones and their respective cameras is estimated to be 5.5 cm.
From figure 3, the height difference between the left and right cameras and microphones appears to be approximately 3 cm, roughly equivalent to the ZED camera height.

Throughout our work, we assumed the cameras and microphones were co-located for simplicity. We hope these estimates provide a helpful starting point for your calculations.
Please note that the BV2 setup differs from BV1, resulting in different data (different length, less near-blank data at the beginning, etc.).
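One quantity that follows directly from the 23 cm microphone spacing quoted above is the largest possible arrival-time difference between the two channels, which occurs for a sound arriving along the line joining the microphones. The sketch below computes it; the function names and the speed of sound of 343 m/s (air at roughly 20 °C) are assumptions for illustration, and the sampling rate must again be supplied by the reader.

```python
def max_interaural_delay_s(mic_spacing_m=0.23, speed_of_sound=343.0):
    """Upper bound on the left/right arrival-time difference, in seconds."""
    return mic_spacing_m / speed_of_sound


def max_interaural_delay_samples(sample_rate, mic_spacing_m=0.23,
                                 speed_of_sound=343.0):
    """Same bound expressed in samples at the given sampling rate (Hz)."""
    return max_interaural_delay_s(mic_spacing_m, speed_of_sound) * sample_rate
```

With these assumptions the bound is about 0.67 ms, so a left/right onset difference much larger than this would suggest that the two channels are not picking up the same wavefront.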

Please let us know if you need further clarification. We sincerely hope this information helps you move forward with your research.
