CALM: Cognitive Assessment using Light-insensitive Model

Additional Material

Akhil Pilakkatt Meethal, Anita Paas, Nerea Urrestilla and David St-Onge, 2024

The following results are from additional experiments performed as part of the study in our paper CALM: Cognitive Assessment using Light-insensitive Model.

There were two levels of mental workload (rest and high workload) and two levels of lighting (light [210 lux] and dark [1 lux]) in this experimental setting. Mental workload level identification is therefore a binary classification problem. This experiment focused on the impact of ambient light and on how multimodal data can mitigate this sensitivity. In the rest condition, participants looked at a point on the wall and relaxed while their pupil and heart rate data were recorded. In the high workload condition, participants performed a 2-back task. During this experiment, pupil data was recorded with the Pupil Labs glasses, and heart rate data was recorded with the Biopac MP35. These devices are shown below.
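The 2-back target rule used in the high workload condition can be sketched as follows: a stimulus is a target when it matches the stimulus shown two positions earlier. The letter sequence below is illustrative, not the sequence used in the study.

```python
# Minimal sketch of 2-back target detection. A stimulus at position i is a
# target when it equals the stimulus at position i - 2.
def two_back_targets(sequence):
    """Return the indices of all 2-back targets in the stimulus sequence."""
    return [i for i in range(2, len(sequence)) if sequence[i] == sequence[i - 2]]

# Illustrative stimulus stream: 'A' at index 2 matches 'A' at index 0, etc.
targets = two_back_targets(list("ABABCDCD"))
# targets == [2, 3, 6, 7]
```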

[Figure: Pupil Labs glasses (left) and Biopac MP35 (right)]

Cognitive load classification results

The results from the binary cognitive load classification task are reported in the table below.

Sensors     Train  Test   Accuracy (%)
Pupil       Light  Light  71.87±0.27
Pupil       Light  Dark   59.38±0.32
Pupil       All    Light  75.10±0.26
Pupil       All    Dark   62.53±0.30
Pupil       All    All    71.09±0.22
HRV+Pupil   Light  Light  81.25±0.17
HRV+Pupil   Light  Dark   78.87±0.23
HRV+Pupil   All    Light  94.74±0.21
HRV+Pupil   All    Dark   88.98±0.23
HRV+Pupil   All    All    92.20±0.18

The performance of pupillometry-only models decreased significantly (by more than 12 percentage points) when trained on light conditions and tested on dark conditions. Despite using features such as IPA to reduce sensitivity to light, the impact of lighting remains. As expected, training on a mix of light and dark conditions yields better performance because it mitigates the distribution shift at test time. Using multimodal inputs clearly enhances the classifier's performance: in the All/All setting, accuracy improves by 21 percentage points (71.09 to 92.20). Although the multimodal settings improve on the pupillometry-only case when the testing condition is dark, the performance remains lower than when the training set contains data from all light conditions. This implies that the distribution shift due to lighting still affects this task, even with multimodal features.
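The evaluation protocol can be sketched as follows: a Random Forest on engineered features trained under one lighting condition and tested under another. This is a rough illustration, not the study's actual pipeline; the feature matrix and the simulated lighting shift are synthetic.

```python
# Sketch of the cross-lighting evaluation: train a Random Forest on one
# lighting condition, test on another. All data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_split(n, shift):
    """Synthetic features (columns 0-2 stand in for pupil features, 3-5 for
    HRV features) with a lighting-dependent shift on the pupil columns."""
    X = rng.normal(size=(n, 6))
    y = rng.integers(0, 2, size=n)      # 0 = rest, 1 = high workload
    X += 0.8 * y[:, None]               # workload signal in all features
    X[:, :3] += shift                   # lighting shift on pupil features only
    return X, y

X_light, y_light = make_split(200, shift=0.0)
X_dark, y_dark = make_split(200, shift=1.5)

# Train on light only, test on dark: the distribution shift hurts accuracy.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_light, y_light)
acc_dark = accuracy_score(y_dark, clf.predict(X_dark))

# Train on a mix of both conditions (the "All" rows in the table above).
X_all = np.vstack([X_light[:100], X_dark[:100]])
y_all = np.concatenate([y_light[:100], y_dark[:100]])
clf_all = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_all, y_all)
acc_all = accuracy_score(y_dark[100:], clf_all.predict(X_dark[100:]))
```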

Feature level changes due to light conditions

We selected the most common features from both pupillometry and HRV and compared their distributions using violin plots under light and dark conditions. The following figure shows the distribution under light conditions on the left and dark conditions on the right of each violin plot.

[Violin plots, top row: Mean Pupil Diameter, IPA, PDRoC; bottom row: RMSSD, SDNN, Mean RR Interval]
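The three HRV features in the bottom row (RMSSD, SDNN, Mean RR Interval) can be computed from a window of RR intervals as follows; the sample interval values are illustrative.

```python
# Standard time-domain HRV features from a window of RR intervals (ms).
import numpy as np

def hrv_features(rr_ms):
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)  # successive RR-interval differences
    return {
        "rmssd": np.sqrt(np.mean(diff ** 2)),  # root mean square of successive differences
        "sdnn": np.std(rr, ddof=1),            # standard deviation of RR intervals
        "mean_rr": np.mean(rr),                # mean RR interval
    }

feats = hrv_features([800, 810, 790, 805, 795, 815])
# feats["mean_rr"] == 802.5 ms for this illustrative window
```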

Statistical test results

The following are the results of statistical tests conducted to analyze the significant differences between cognitive workload conditions for each feature. As this dataset has only two states, rest and high workload, the t-tests assess whether each feature can differentiate between these states. The tests follow the same settings as in the main paper. The results are summarized in the following table:

Feature                   P-Value of Pairwise T-Tests
LFHF Ratio                0.973
RMSSD                     0.043
pNN50                     0.316
SDNN                      0.663
HF                        <0.005
Mean RR Interval          <0.005
Median RR Interval        <0.005
resp_rate                 0.850
Mean Pupil Diameter       0.849
Variance Pupil Diameter   0.192
IPA Pupil                 0.240
pupil_slope               0.156
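The per-feature test can be sketched as a paired t-test over per-participant values in the two states. The data below is synthetic, and the specific call (scipy.stats.ttest_rel) is an assumption about the implementation; the actual tests follow the settings in the main paper.

```python
# Paired t-test for one feature (e.g. Mean RR Interval) between the rest
# and high-workload states, one value per participant per state.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants = 20

rest = rng.normal(loc=850.0, scale=40.0, size=n_participants)
# RR intervals typically shorten under load (heart rate increases).
workload = rest - rng.normal(loc=50.0, scale=15.0, size=n_participants)

t_stat, p_value = stats.ttest_rel(rest, workload)
# A small p-value indicates the feature separates the two workload states.
```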

Ablation studies

The following ablation studies examine the role of important pre-processing steps and design choices. Several aspects of the data processing pipeline are often taken for granted in the literature and in the available libraries for HRV and pupil data. Here, we discuss the impact of some key processing steps on classifier design and performance. The Polar H10 is used as the HRV device for the studies discussed in this section, and the RF classifier with engineered features is used for these experiments.

Impact of window size for feature estimation

The recommended window length for HRV features is 60 seconds. When the window length is shorter, the computed features are typically less meaningful, especially the frequency-domain features. We compared the results from a 30-second window to a 60-second window, as shown in the table below. Accuracy drops significantly for HRV features when the window size is 30 seconds (by 13.1 percentage points). As our feature importance analysis from the RF classifier highlights the significance of frequency-domain features, this drop in performance is expected without a precise estimation of those features. The pupillometry features show a smaller drop of 10.4 percentage points, indicating that they remain relatively reliable with a 30-second window. The drop in the multimodal setting is the largest, at 20.8 percentage points. These results support the recommended window size of 60 seconds.

Sensors     Accuracy (30 s Window)  Accuracy (60 s Window)
Pupil       66.67±0.22              77.07±0.40
HRV         69.05±0.24              82.15±0.28
HRV+Pupil   71.42±0.18              92.26±0.19
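The effect of window length can be illustrated by cutting the same RR-interval stream into 30 s and 60 s windows before computing a feature such as RMSSD. Shorter windows contain fewer beats, which is what makes frequency-domain estimates unreliable. The stream below is synthetic.

```python
# Windowed feature extraction from a stream of RR intervals (ms).
import numpy as np

def window_features(rr_ms, window_s):
    """Split an RR-interval stream into windows of window_s seconds and
    return one RMSSD value per window."""
    t = np.cumsum(rr_ms) / 1000.0  # beat times in seconds
    feats = []
    start = 0.0
    while start + window_s <= t[-1]:
        rr = rr_ms[(t >= start) & (t < start + window_s)]
        if len(rr) > 1:
            feats.append(np.sqrt(np.mean(np.diff(rr) ** 2)))  # RMSSD
        start += window_s
    return feats

rng = np.random.default_rng(2)
rr_stream = rng.normal(800.0, 30.0, size=600)  # roughly 8 minutes of beats

feats_30 = window_features(rr_stream, 30)  # more, but noisier, samples
feats_60 = window_features(rr_stream, 60)  # fewer, more stable samples
```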

Impact of noise filtering on pupillometry data

To cope with the noise in the pupil diameter data, we used filtering and smoothing with a Butterworth low-pass filter of order 5 and a cut-off of 4 Hz. Butterworth low-pass filters are commonly used in the pre-processing step for noise removal from pupillometry data. Low-pass filtering removes high-frequency noise in the data, while smoothing reduces the impact of short-term fluctuations. The table below shows the importance of filtering in our pre-processing pipeline. Without this filtering, the accuracy drops by more than 19 percentage points. Thus, a significant fraction of the noise is removed by the Butterworth low-pass filter.

Settings                             Accuracy (%)
Without Butterworth Low-pass Filter  57.81
With Butterworth Low-pass Filter     77.05
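The filtering step can be sketched with scipy.signal. The order-5, 4 Hz cut-off Butterworth design follows the text, while the 120 Hz sampling rate and the test signal are assumptions for illustration.

```python
# 5th-order Butterworth low-pass filter (4 Hz cut-off) applied to a noisy
# pupil-diameter trace with zero-phase filtering.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 120.0  # assumed pupil sampling rate (Hz)
b, a = butter(N=5, Wn=4.0, btype="low", fs=fs)

t = np.arange(0, 10, 1.0 / fs)
clean = 3.0 + 0.2 * np.sin(2 * np.pi * 0.5 * t)       # slow pupil dynamics (mm)
noisy = clean + 0.1 * np.sin(2 * np.pi * 30.0 * t)    # high-frequency noise

filtered = filtfilt(b, a, noisy)  # filtfilt avoids introducing a time lag
# The 30 Hz component is attenuated; the 0.5 Hz pupil signal is preserved.
```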