CALM: Cognitive Assessment using Light-insensitive Model

Additional Material

Akhil Pilakkatt Meethal, Anita Paas, Nerea Urrestilla and David St-Onge, 2024

The following results are from additional experiments performed as part of the study in our paper CALM: Cognitive Assessment using Light-insensitive Model.

There were two levels of mental workload (rest and high workload) and two levels of lighting (light [210 lux] and dark [1 lux]) in this experimental setting. Mental workload level identification is therefore a binary classification problem. This experiment focused on the impact of ambient light and on how multimodal data can mitigate this sensitivity. In the rest condition, participants looked at a point on the wall and relaxed while their pupil and heart rate data were recorded. In the high workload condition, participants performed a 2-back task. During this experiment, pupil data was recorded with the Pupil Labs glasses, and heart rate data was recorded with the Biopac MP35. These devices are shown below.
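The 2-back target rule used in the high workload condition can be sketched as follows: a stimulus is a target when it matches the stimulus shown two positions earlier. The letter sequence below is illustrative, not the sequence used in the study.

```python
# Minimal sketch of 2-back target detection. A stimulus at position i is a
# target when it equals the stimulus at position i - 2.
def two_back_targets(sequence):
    """Return the indices of all 2-back targets in the stimulus sequence."""
    return [i for i in range(2, len(sequence)) if sequence[i] == sequence[i - 2]]

# Illustrative stimulus stream: 'A' at index 2 matches 'A' at index 0, etc.
targets = two_back_targets(list("ABABCDCD"))
# targets == [2, 3, 6, 7]
```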

[Figure: Pupil Labs glasses (left) and Biopac MP35 (right)]

Cognitive load classification results

The results from the binary cognitive load classification task are reported in the table below.

Sensors     Train  Test   Accuracy (%)
Pupil       Light  Light  71.87±0.27
Pupil       Light  Dark   59.38±0.32
Pupil       All    Light  75.10±0.26
Pupil       All    Dark   62.53±0.30
Pupil       All    All    71.09±0.22
HRV+Pupil   Light  Light  81.25±0.17
HRV+Pupil   Light  Dark   78.87±0.23
HRV+Pupil   All    Light  94.74±0.21
HRV+Pupil   All    Dark   88.98±0.23
HRV+Pupil   All    All    92.20±0.18

The performance of pupillometry-only models decreased significantly (by more than 12 percentage points) when trained on light conditions and tested on dark conditions. Despite using features such as IPA to reduce sensitivity to light, the impact of lighting remains. As expected, training on a mix of light and dark conditions yields better performance because it mitigates the distribution shift at test time. Using multimodal inputs clearly enhances the classifier's performance: in the All/All setting, accuracy improves by 21 percentage points (71.09 to 92.20). Although the multimodal settings improve on the pupillometry-only case when the testing condition is dark, the performance remains lower than when the training set contains data from all light conditions. This implies that the distribution shift due to lighting still affects this task, even with multimodal features.
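The evaluation protocol can be sketched as follows: a Random Forest on engineered features trained under one lighting condition and tested under another. This is a rough illustration, not the study's actual pipeline; the feature matrix and the simulated lighting shift are synthetic.

```python
# Sketch of the cross-lighting evaluation: train a Random Forest on one
# lighting condition, test on another. All data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_split(n, shift):
    """Synthetic features (columns 0-2 stand in for pupil features, 3-5 for
    HRV features) with a lighting-dependent shift on the pupil columns."""
    X = rng.normal(size=(n, 6))
    y = rng.integers(0, 2, size=n)      # 0 = rest, 1 = high workload
    X += 0.8 * y[:, None]               # workload signal in all features
    X[:, :3] += shift                   # lighting shift on pupil features only
    return X, y

X_light, y_light = make_split(200, shift=0.0)
X_dark, y_dark = make_split(200, shift=1.5)

# Train on light only, test on dark: the distribution shift hurts accuracy.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_light, y_light)
acc_dark = accuracy_score(y_dark, clf.predict(X_dark))

# Train on a mix of both conditions (the "All" rows in the table above).
X_all = np.vstack([X_light[:100], X_dark[:100]])
y_all = np.concatenate([y_light[:100], y_dark[:100]])
clf_all = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_all, y_all)
acc_all = accuracy_score(y_dark[100:], clf_all.predict(X_dark[100:]))
```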

Feature level changes due to light conditions

We selected the most common features from both pupillometry and HRV and compared their distributions using violin plots under light and dark conditions. The following figure shows the distribution under light conditions on the left and dark conditions on the right of each violin plot.

[Violin plots, top row: Mean Pupil Diameter, IPA, PDRoC; bottom row: RMSSD, SDNN, Mean RR Interval]
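The three HRV features in the bottom row (RMSSD, SDNN, Mean RR Interval) can be computed from a window of RR intervals as follows; the sample interval values are illustrative.

```python
# Standard time-domain HRV features from a window of RR intervals (ms).
import numpy as np

def hrv_features(rr_ms):
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)  # successive RR-interval differences
    return {
        "rmssd": np.sqrt(np.mean(diff ** 2)),  # root mean square of successive differences
        "sdnn": np.std(rr, ddof=1),            # standard deviation of RR intervals
        "mean_rr": np.mean(rr),                # mean RR interval
    }

feats = hrv_features([800, 810, 790, 805, 795, 815])
# feats["mean_rr"] == 802.5 ms for this illustrative window
```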

Statistical test results

The following are the results of statistical tests conducted to analyze the significant differences between cognitive workload conditions for each feature. As this dataset has only two states, rest and high workload, the t-tests assess whether each feature can differentiate between these states. The tests follow the same settings as in the main paper. The results are summarized in the following table:

Feature                   P-Value of Pairwise T-Tests
LFHF Ratio                0.973
RMSSD                     0.043
pNN50                     0.316
SDNN                      0.663
HF                        <0.005
Mean RR Interval          <0.005
Median RR Interval        <0.005
resp_rate                 0.850
Mean Pupil Diameter       0.849
Variance Pupil Diameter   0.192
IPA Pupil                 0.240
pupil_slope               0.156
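The per-feature test can be sketched as a paired t-test over per-participant values in the two states. The data below is synthetic, and the specific call (scipy.stats.ttest_rel) is an assumption about the implementation; the actual tests follow the settings in the main paper.

```python
# Paired t-test for one feature (e.g. Mean RR Interval) between the rest
# and high-workload states, one value per participant per state.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants = 20

rest = rng.normal(loc=850.0, scale=40.0, size=n_participants)
# RR intervals typically shorten under load (heart rate increases).
workload = rest - rng.normal(loc=50.0, scale=15.0, size=n_participants)

t_stat, p_value = stats.ttest_rel(rest, workload)
# A small p-value indicates the feature separates the two workload states.
```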

Ablation studies

The following ablation studies examine the role of important pre-processing steps and design choices. Several aspects of the data processing pipeline are often taken for granted in the literature and in the available libraries for HRV and pupil data. Here, we discuss the impact of some key processing steps on classifier design and performance. The Polar H10 is used as the HRV device for the studies discussed in this section, and the RF classifier with engineered features is used for these experiments.

Impact of window size for feature estimation

The recommended window length for HRV features is 60 seconds. When the window length is shorter, the computed features are typically less meaningful, especially the frequency-domain features. We compared the results from a 30-second window to a 60-second window, as shown in the table below. Accuracy drops significantly for HRV features when the window size is 30 seconds (by 13.1 percentage points). As our feature importance analysis from the RF classifier highlights the significance of frequency-domain features, this drop in performance is expected without a precise estimation of those features. The pupillometry features show a smaller drop of 10.4 percentage points, indicating that they remain relatively reliable with a 30-second window. The drop in the multimodal setting is the largest, at 20.8 percentage points. These results support the recommended window size of 60 seconds.

Sensors     Accuracy (30 s Window)  Accuracy (60 s Window)
Pupil       66.67±0.22              77.07±0.40
HRV         69.05±0.24              82.15±0.28
HRV+Pupil   71.42±0.18              92.26±0.19
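The effect of window length can be illustrated by cutting the same RR-interval stream into 30 s and 60 s windows before computing a feature such as RMSSD. Shorter windows contain fewer beats, which is what makes frequency-domain estimates unreliable. The stream below is synthetic.

```python
# Windowed feature extraction from a stream of RR intervals (ms).
import numpy as np

def window_features(rr_ms, window_s):
    """Split an RR-interval stream into windows of window_s seconds and
    return one RMSSD value per window."""
    t = np.cumsum(rr_ms) / 1000.0  # beat times in seconds
    feats = []
    start = 0.0
    while start + window_s <= t[-1]:
        rr = rr_ms[(t >= start) & (t < start + window_s)]
        if len(rr) > 1:
            feats.append(np.sqrt(np.mean(np.diff(rr) ** 2)))  # RMSSD
        start += window_s
    return feats

rng = np.random.default_rng(2)
rr_stream = rng.normal(800.0, 30.0, size=600)  # roughly 8 minutes of beats

feats_30 = window_features(rr_stream, 30)  # more, but noisier, samples
feats_60 = window_features(rr_stream, 60)  # fewer, more stable samples
```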

Impact of noise filtering on pupillometry data

To cope with the noise in the pupil diameter data, we used filtering and smoothing with a Butterworth low-pass filter of order 5 and a cut-off of 4 Hz. Butterworth low-pass filters are commonly used in the pre-processing step for noise removal from pupillometry data. Low-pass filtering removes high-frequency noise in the data, while smoothing reduces the impact of short-term fluctuations. The table below shows the importance of filtering in our pre-processing pipeline. Without this filtering, the accuracy drops by more than 19 percentage points. Thus, a significant fraction of the noise is removed by the Butterworth low-pass filter.

Settings                             Accuracy (%)
Without Butterworth Low-pass Filter  57.81
With Butterworth Low-pass Filter     77.05
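The filtering step can be sketched with scipy.signal. The order-5, 4 Hz cut-off Butterworth design follows the text, while the 120 Hz sampling rate and the test signal are assumptions for illustration.

```python
# 5th-order Butterworth low-pass filter (4 Hz cut-off) applied to a noisy
# pupil-diameter trace with zero-phase filtering.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 120.0  # assumed pupil sampling rate (Hz)
b, a = butter(N=5, Wn=4.0, btype="low", fs=fs)

t = np.arange(0, 10, 1.0 / fs)
clean = 3.0 + 0.2 * np.sin(2 * np.pi * 0.5 * t)       # slow pupil dynamics (mm)
noisy = clean + 0.1 * np.sin(2 * np.pi * 30.0 * t)    # high-frequency noise

filtered = filtfilt(b, a, noisy)  # filtfilt avoids introducing a time lag
# The 30 Hz component is attenuated; the 0.5 Hz pupil signal is preserved.
```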