Abstract

In this paper, a new application of the optical mouse sensor is presented. The optical mouse sensor is used as the main low-cost infrared vision system of a newly proposed head-mounted human-computer interaction (HCI) device controlled by eye movements. The default optical mouse sensor lens and illumination source are replaced in order to widen its field of view and capture entire eye images. A complementary 8-bit microcontroller acquires and processes these images with two optimized algorithms that detect forced eye blinks and pupil displacements, which are translated into computer pointer actions. This proposal introduces an inexpensive and approachable plug-and-play (PnP) device for people with severe disability in the upper extremities, neck, and head. The presented pointing device performs standard computer mouse actions with no extra software required. It uses the human interface device (HID) standard class of the universal serial bus (USB), making it compatible with most computer platforms. This new device is aimed at improving the comfort and portability of current commercial devices, with simple installation and calibration. Several performance tests were carried out with different volunteer users, obtaining an average pupil detection error of 0.34 pixels and a successful detection of 82.6% of all mouse events requested by means of pupil tracking.

1. Introduction

Nowadays, the most widely used human-computer interaction is a graphic pointer displaced across the screen of a display peripheral. The screen pointer is usually controlled using standard devices such as computer mice, touchpads, joysticks, pens, or tactile screen panels. All of these require some physical interaction, moving extremities of the user's body such as fingers, hands, or forearms, to operate properly. Several alternatives have been proposed to allow people with mobility impairments in the upper extremities to control the computer pointer. These are mainly based on the detection and measurement of remaining body motions such as facial gestures [1, 2], mouth movement [3, 4], head movements [5–8], eye tracking [9, 10], sticker tracking [11, 12], breath [13, 14], tongue displacements [15], or a combination of them [16]. Nevertheless, there are people with such severe disability that they cannot move any extremity, neck, or head and are only able to interact with computer devices using their eyes, eyebrows, tongue, mouth, breath, or brain activity [17–19]. Commercial devices to achieve the inclusion of these people into information and communications technologies (ICTs) are available, although most of them have several drawbacks related to software compatibility, limited computer interaction, and complex configuration and calibration; or they are highly intrusive and not affordable due to their complexity and low prospective market.

Focusing on eye-controlled devices, they can be classified into two main groups: remote and head-mounted. The remote devices are mainly based on applying different eye gaze algorithms to the image acquired by a fixed high-resolution camera, usually attached to the computer screen [2, 5, 6, 9, 20]. Examples of remote pupil gaze devices on the market are [21–24], which are expensive (> $6,000) due to the cost of the high-resolution camera, the lenses, and the dedicated illumination system. The head-mounted devices are mostly based on a custom structure holding a small intrusive low-resolution camera in front of the user's face [10, 25–27]. They are less expensive but not low-cost [28–30] (> $2,000) because of the camera's quality and the customised software.

Several head-mounted eye-tracking devices have been proposed in the literature. In [25], three cameras and an external computer are used to simultaneously track the eye and the relative orientation of the head-mounted structure. It uses very complex hardware to detect the pupil by a simple gray level segmentation of infrared images. In [26], two mini cameras and two medium-sized processing boards are used to simultaneously track the pupil and the user's view, allowing an absolute pointer control on the computer screen. Although it proposes to use low-cost off-the-shelf components and a set of open-source software, the installation is complicated because of the multiple parts. Recently, Cáceres et al. [10] proposed to use a head-mounted commercial eye tracker along with a modified webcam to develop an inexpensive eye pointer, obtaining a successful user evaluation. However, it requires a precise installation of infrared lights at the corners of the screen. A new low-cost alternative of eye gaze head-mounted device with a near-eye display is reported in [27]. This proposal can reach an eye gaze accuracy of 0.53° but needs a near-eye display, degrading the quality of the user's view and reducing its field of view.

In this work, a new proposal of an inexpensive human-computer eye-controlled pointing device based on a low-cost optical mouse sensor is presented (Figure 1). The main contribution of this proposal is the device cost (~$20), compared with the most affordable current commercial eye-tracking devices—e.g., Pupil Labs Pupil Core, $1,850 [31]; Tobii PCEye Mini, $1,149 [32]; or Tobii Pro Glasses II, $10,000 [33]. This is possible because the new proposal uses the same components as a common commercial optical mouse. Likewise, it avoids high-resolution cameras and the need for additional hardware, unlike [21–30]. It is a lightweight (41 g) wearable plug-and-play (PnP) device with no additional software required and fully compatible with most platforms (Windows, MacOS, Linux, Android, etc.), using the human interface device (HID) standard class of the universal serial bus (USB).

The optical mouse sensors, which were originally designed as the computer mouse's main sensor, are widely used in many scenarios to estimate the relative displacement of the surface under the sensor [34–38]. In [38], such a sensor was used to estimate the translation of the eye by analyzing the scleral surface. Horizontal and vertical angular resolutions of 27.8 and 18.2 counts per degree were obtained; however, eye blinks preclude its usage as an eye gaze tracker. Furthermore, several research works proposed the use of the sensor's open-access internal image to develop alternative inexpensive applications such as counterfeit coin detectors [39], oxygen and pH quantifiers [40], yarn diameter meters [41], and rotary encoders [42, 43]. Our previous work [44] analyzed the capabilities of these sensors for eye tracking, detecting the pupil as the darkest part of the acquired infrared images. A custom valley detection method and the well-known integrodifferential operator proposed by Daugman [45, 46] were tested to detect the pupil. Results showed that both algorithms allow tracking the pupil with an average error of 0.58 pixels; however, the valley detection method requires 90% less memory and is faster since it does not use trigonometric functions.

Based on the successful results from our previous work [44], this work proposes the use of the optical mouse sensor as an inexpensive imaging system to detect eye blinks and movements and validates its suitability to perform basic pointer actions properly. To do so, a basic 8-bit microcontroller is used to capture and translate pupil displacements and eye blinks into common computer mouse actions (pointer displacements, clicks, and double clicks). The custom pupil detection method proposed in our previous work has been enhanced to increase its robustness: new restrictions were defined and adjusted to their optimal values for the proposed setup. Additionally, a procedure to detect forced eye blinks has been implemented and used as a complementary interaction input.

2. Sensor

2.1. Optical Mouse Sensor

The optical flow sensor used in this work is the ADNS-3080 from Avago Technologies [47]. This sensor is known for its use as an optical mouse displacement sensor and has many features of interest for our approach. It is a low-cost compact sensor based on a complementary metal-oxide-semiconductor (CMOS) matrix of grayscale pixels of 6-bit intensity which is highly sensitive to near-infrared wavelengths, regularly used for pupil location and iris recognition [48, 49]. The CMOS matrix is complemented with a digital signal processor (DSP) integrated on the same chip (Figure 2). Although it implements a proprietary optical flow algorithm for displacement measurements, it includes an extra feature called PIXEL_GRAB which gives access to the current surface image (frame). The frame is frozen and can be read at any asynchronous rate, pixel by pixel, through a standard serial peripheral interface (SPI) bus. Hence, the image acquisition over SPI, the storage of a very small array of pixels (900 bytes), and the pupil tracking processing can be implemented in a low-performance microcontroller to keep the device as inexpensive as possible.
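As an illustration of how compact this acquisition can be, the following minimal C sketch outlines a frame grab over SPI on a small microcontroller. The register addresses, the burst-read protocol, and the helper function names are assumptions of this sketch and must be taken from the ADNS-3080 datasheet; only the 900-pixel frame size and the low bit depth follow from the text.

```c
/* Sketch: grabbing one frozen frame from the ADNS-3080 over SPI.
 * Register addresses, timing, and the burst protocol below are
 * illustrative assumptions; consult the sensor datasheet.
 */
#include <stdint.h>

#define FRAME_PIXELS      900          /* 30 x 30 pixels, one byte each  */
#define REG_FRAME_CAPTURE 0x13         /* hypothetical register address  */
#define REG_PIXEL_BURST   0x40         /* hypothetical register address  */

/* Platform-specific SPI primitives, provided elsewhere in the firmware. */
extern void    spi_write(uint8_t reg, uint8_t value);
extern uint8_t spi_read(uint8_t reg);
extern void    delay_us(uint16_t us);

static uint8_t frame[FRAME_PIXELS];    /* fits easily in the 3.7 KB RAM  */

void grab_frame(void)
{
    /* Freeze the current image inside the sensor (PIXEL_GRAB feature). */
    spi_write(REG_FRAME_CAPTURE, 0x83);
    delay_us(1500);

    /* Read the frozen frame pixel by pixel at any asynchronous rate. */
    for (uint16_t i = 0; i < FRAME_PIXELS; i++) {
        frame[i] = spi_read(REG_PIXEL_BURST) & 0x3F;   /* 6-bit intensity */
    }
}
```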

The optical sensor was originally developed to acquire small variations in the roughness of the surface at a very short focal distance and over a reduced area. In general, the sensor package is combined with a polycarbonate plastic convex lens system (Figure 3(a)) to provide adequate infrared illumination and a field of view suitable for proper operation. Using the default lens and the recommended working height of 2.4 mm defined by the manufacturer, the sensor can measure two-dimensional displacements with resolutions up to 800 counts per inch (cpi). With this configuration, the working area captured by the sensor is very small (1.82 mm²) [43].

2.2. Lens and Light Source

To achieve the goal of this work, both the default optics and the illumination source have been replaced in order to increase the capture area and working distance. The lens used is the CAY46 low-cost plastic aspheric lens from Laser Components [50], which has a suitable focal distance of 4.6 mm. This lens allows capturing essentially all pupil movements at a working distance between 40 and 100 mm, obtaining a pupil diameter between 5 and 15 pixels [44]; therefore, it is suitable for a head-mounted device as planned.

The external light source has been replaced by the higher radiant intensity SFH 4350 near-infrared (NIR) light-emitting diode (LED) from OSRAM [51] in order to obtain uniform light across the field of view and highlight the pupil as the darkest part of the image. It has a 3 mm (T1) transparent plastic package, a wavelength peak of 850 nm, and a 26° viewing angle. Its typical operation generates a radiant flux of 50 mW at a forward current of 100 mA. According to the EN 62471:2008 [52] European Standard, the resulting irradiance at the eye surface is 7.76 W·m⁻² and the radiance is 899.13 kW·m⁻²·sr⁻¹, which are below the limits of the retinal thermal hazard and the eye radiation hazard, respectively. Therefore, the proposed infrared emitter setup should be safe.

Figure 3 shows the sensor with two different lens configurations: the default lens kit (ADNS-2120-001 [53] from Avago Technologies) and the CAY46 lens. A chess template of 1.5 mm² squares was captured using both configurations. In the first case (Figure 3(a)), the default light source and the default working height of 2.4 mm were used. In the modified setup (Figure 3(b)), the light source was the SFH 4350 LED with an optimal focusing distance between the sensor and the template of 60 mm. Thus, the capture area of the sensor increases from 1.82 mm² to 21.6 mm², which is considered suitable to capture pupil displacements when staring at the computer screen.

3. Pupil Tracking

There are several well-known pupil detection algorithms in the literature. One of the most widely used over the years is the integrodifferential operator proposed by Daugman [45, 46], which is extensively applied for pupil localization in iris recognition applications [48, 49]. It assumes that the pupil has circular contours and operates as a circular edge detector. Another well-known method to locate the pupil and the iris was introduced by Wildes [54], based on searching for ellipses in an edge-filtered image using the Hough transform [55–57]. Moreover, there are traditional robust pupil detection methods that combine different image processing techniques such as edge detectors, morphologic operations, contour extraction, thresholding, limbus ellipse fitting, etc. Some of the most important are Starburst [58], Swirski [59], Pupil Labs [60], SET [61], ExCuSe [62], and ElSe [63]. A deep discussion of these methods for head-mounted eye trackers can be found in [64]. These robust algorithms are not suitable for our proposed device due to its limited resources. On the one hand, the acquisition sensor provides a very low image resolution (30 × 30 pixels) and low grayscale depth (6 bits), whereas most of the morphological search algorithms require better captures. On the other hand, the processing hardware limitations include a very small RAM size (3.7 Kbytes), low processor performance (48 MHz), and slow image acquisition (71.2 ms). Executing floating-point trigonometric and morphological operations is not feasible while maintaining an acceptable frame rate for our application. For instance, the most recent robust pupil detection method, PuReST [65], takes an average execution time of 5.56 ms on an Intel® Core™ i5-4590 CPU @ 3.30 GHz. Scaling only by processor frequency down to ours (48 MHz), the execution time would increase proportionally to 382.25 ms (2.6 Hz), which is already unsuitable for fast pupil gaze tracking.

To detect the pupil, this work proposes an adaptation of the custom optimized pupil detection algorithm introduced in our previous work [44], which achieves a detection error similar to that of the integrodifferential method while reducing memory usage and execution time. Considering that the pupil is the darkest part of the image, the algorithm uses a simple valley location to detect the pupil centroid. Therefore, it can analyze the image row by row independently, without any auxiliary image buffer or nested pixel-level search. The algorithm is based on three image processing steps: (1) specular highlight removal, where the image is filtered to smooth the reflections of the NIR LED; (2) intensity valley location, where a possible pupil valley is found for each row under specific constraints; and (3) pupil centroid estimation, where the valleys are used to determine whether the center of the pupil exists in the image. The following subsections present the changes with respect to the previous algorithm and detail the three main sequential steps of the proposed system. The last subsection describes the optimal algorithm hyperparameter values used in this work.

3.1. Algorithm Upgrade

The algorithm presented in [44] has been enhanced in its last two steps in order to increase its robustness. In the valley detection step, the previous approach detected the valley limits (valley boundaries) when the difference between consecutive pixels decreased or increased by more than the left and right thresholds, respectively, and the valley length exceeded a minimum value. In the new approach, the valley limits are obtained by checking that the valley height lies between a minimum and a maximum threshold, and that the valley length lies within a valid range. In addition, the intensity of the valley pixels must be non-decreasing when moving from the minimum towards the limits, in order to avoid internal intensity peaks. Furthermore, three extra conditions have been defined: a minimum consecutive-pixel intensity rise, an intensity margin with respect to the absolute minimum, and MNP. The minimum rise is the smallest intensity difference required between two consecutive pixels of the valley; this condition guarantees that the valley has an abrupt rising edge. The MNP is the minimum number of pixels inside the valley whose difference with respect to the absolute minimum stays below the intensity margin. This condition helps reject valleys with poor concavity and guarantees that most of the valley pixels belong to the pupil intensity range.

In the pupil centroid estimation step, instead of selecting the valleys whose limits are near the maximum of a window accumulation of 3 [44], the proposed method selects the group containing the maximum number of consecutive valleys in which the difference between consecutive limit points (valley boundaries), both left and right, is less than a threshold, and the maximum difference among all of them is less than a second threshold. The first condition allows the detection of a continuous smooth circular pupil edge, discarding horizontal valley-limit peaks or vertical discontinuities. The second condition ensures a valid distortion of the pupil due to its spherical displacement. Additionally, in this new approach, the number of valleys selected has to lie within the valid vertical pupil length range. Once a group of valleys is selected, the pupil centroid is calculated in the same way as in the previous approach, averaging the coordinates of the limits of the selected valleys. If more than one group of valleys satisfies the conditions, the group with the darkest averaged intensity is chosen.

3.2. Specular Highlight Removal

In the acquired images, the eye's conjunctiva mucous membrane produces light reflections from the NIR LED. If these reflections fall within the pupil area, the detection of the pupil center could be inaccurate. To smooth the highlighted pixels in the current image (Equation (1)), a filter is applied.

First, an intensity threshold is defined (Equation (2)) to discard pixels with values below 80% of the maximum.

Then, the following spatial median filter is applied, starting at each highlighted pixel, as shown in Equation (3).
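The following minimal C sketch illustrates the idea of these two steps on a small frame: pixels brighter than 80% of the frame maximum are treated as highlights and replaced by a local median. The 3 × 3 neighbourhood and the in-place filtering are assumptions of this sketch; the exact filter is given by Equations (1)–(3).

```c
/* Sketch: specular highlight removal on a 30x30 6-bit frame. Pixels
 * brighter than 80% of the frame maximum are replaced by the median of
 * their 3x3 neighbourhood (the neighbourhood size is an assumption).
 */
#include <stdint.h>

#define W 30
#define H 30

static uint8_t median9(uint8_t v[9])
{
    /* Simple insertion sort, then take the middle element. */
    for (int i = 1; i < 9; i++) {
        uint8_t key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; j--; }
        v[j + 1] = key;
    }
    return v[4];
}

void remove_highlights(uint8_t img[H][W])
{
    uint8_t max = 0;
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++)
            if (img[r][c] > max) max = img[r][c];

    uint8_t thr = (uint8_t)((max * 4u) / 5u);   /* 80% of the maximum */

    for (int r = 1; r < H - 1; r++) {
        for (int c = 1; c < W - 1; c++) {
            if (img[r][c] >= thr) {              /* highlighted pixel   */
                uint8_t n[9];
                int k = 0;
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++)
                        n[k++] = img[r + dr][c + dc];
                img[r][c] = median9(n);          /* smooth the highlight */
            }
        }
    }
}
```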

Figure 4 shows two common cases of the filter results: (a) when the reflected light is outside the pupil area and (b) when the reflected light lies on the pupil edge. In the second case, the result is not as smooth as expected because the filtered pixels average over both pupil and iris areas.

3.3. Intensity Valleys Location

For each row, an intensity valley search is performed to find the best valley that could be part of the pupil. First, for each row, the positions of the local minimum intensity pixels are located by searching the intensity local peaks (Equation (4)); this yields the column positions in the row at which a right or left pixel intensity slope starts. Then, the column position of the absolute minimum is calculated using Equation (5).

Starting from the absolute minimum pixel, the next step is to locate the left and right limits (valley boundaries) between which a possible valley is comprised, using the conditions defined in Equations (6) and (7), respectively. The left and right limits must be the furthest pixels from the minimum which satisfy: (a) the pixel intensity always increases from the minimum towards the limits; (b) at least one of these increments must be greater than the minimum rise threshold; and (c) the valley height (both left and right) must lie between the minimum and maximum height thresholds.

Then, to check that the valley belongs to the pupil region, it is required to have at least MNP pixels within its region (between the left and right limits) whose intensity, with respect to the minimum, is below the intensity margin (Equation (8)).

Finally, the valley length must lie within a valid pupil diameter range defined by the minimum and maximum length thresholds (Equation (9)).

If no valley is found under the proposed restrictions, the row is automatically rejected as a pupil portion. Figure 5 shows an example result of an intensity valley location for the 15th row of the image shown in Figure 6. A valley is found from column 12 to 19. The threshold values used in this work were calculated according to the results obtained in Table 1.
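A compact C sketch of this per-row valley search is shown below. Since the individual threshold symbols are defined in the original equations, the sketch uses descriptive placeholder names with illustrative values; the actual values are derived from Table 1.

```c
/* Sketch: per-row valley search. Threshold names and values below are
 * placeholders; the paper derives the actual values from Table 1. The
 * real method keeps the furthest limits that still satisfy all the
 * constraints, whereas this sketch simply grows and then validates.
 */
#include <stdint.h>
#include <stdbool.h>

#define W 30

#define MIN_RISE          3   /* at least one abrupt increment per side   */
#define MIN_HEIGHT        4   /* valley height, lower bound               */
#define MAX_HEIGHT       40   /* valley height, upper bound               */
#define MIN_LEN           4   /* valley length, lower bound (pixels)      */
#define MAX_LEN          20   /* valley length, upper bound (pixels)      */
#define NEAR_MIN_MARGIN   2   /* "close to the minimum" intensity margin  */
#define MNP               3   /* min pixels whose value is near the min   */

typedef struct { uint8_t left, right; bool valid; } valley_t;

valley_t find_row_valley(const uint8_t row[W])
{
    valley_t v = { 0, 0, false };

    /* Absolute minimum of the row (Equation (5)). */
    uint8_t cmin = 0;
    for (uint8_t c = 1; c < W; c++)
        if (row[c] < row[cmin]) cmin = c;

    /* Grow left/right while intensity never decreases away from the minimum. */
    uint8_t l = cmin, r = cmin;
    bool rise_l = false, rise_r = false;
    while (l > 0 && row[l - 1] >= row[l]) {
        if (row[l - 1] - row[l] >= MIN_RISE) rise_l = true;
        l--;
    }
    while (r < W - 1 && row[r + 1] >= row[r]) {
        if (row[r + 1] - row[r] >= MIN_RISE) rise_r = true;
        r++;
    }

    uint8_t hl  = row[l] - row[cmin];   /* left valley height  */
    uint8_t hr  = row[r] - row[cmin];   /* right valley height */
    uint8_t len = r - l + 1;            /* valley length       */

    /* Count pixels whose intensity is close to the absolute minimum (MNP). */
    uint8_t near_min = 0;
    for (uint8_t c = l; c <= r; c++)
        if (row[c] - row[cmin] <= NEAR_MIN_MARGIN) near_min++;

    if (rise_l && rise_r &&
        hl >= MIN_HEIGHT && hl <= MAX_HEIGHT &&
        hr >= MIN_HEIGHT && hr <= MAX_HEIGHT &&
        len >= MIN_LEN && len <= MAX_LEN &&
        near_min >= MNP) {
        v.left = l; v.right = r; v.valid = true;
    }
    return v;
}
```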

3.4. Pupil Centroid Estimation

Once the valley limits are detected, it is necessary to find pupil region candidates. For each row with a valley found, the algorithm searches for a possible pupil region finishing at a lower row which satisfies the restrictions of Equation (10). The region must comprise a valid set of adjacent valleys (consecutive rows in the image) with a length within the valid pupil height range, and the difference between adjacent valley limits (consecutive rows), both left and right, has to be less than a threshold. Finally, the maximum difference between the limit points, both left and right, has to be equal to or less than a second threshold.

If there is a pupil candidate for a row, its parameters are calculated using Equation (11). The left bounding coordinate is the average of all left limits of the comprised valleys, and the right bounding coordinate is the average of all right limits. Then, the centroid of the pupil is calculated by averaging the right and left bounding coordinates and by averaging the upper and lower valley rows. Finally, the average intensity of all pixels inside the pupil bounding box is calculated.

If there is more than one candidate, these results are stored in vectors. Then, since the pupil is always the darkest part of the image, the final pupil region selected is the one with the lowest average intensity (Equation (12)).
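The grouping and centroid computation can be sketched in C as follows. The threshold names and values are placeholders, and for brevity the sketch keeps the first valid group rather than also bounding the maximum limit difference and comparing the average intensity of all candidates, which the full method does.

```c
/* Sketch: grouping per-row valleys into a pupil candidate and computing
 * its centroid. Threshold names/values are placeholders; the real values
 * come from Table 1 and the manual analysis described in the paper.
 */
#include <stdint.h>
#include <stdbool.h>

#define H 30

#define ADJ_LIMIT_DIFF   2   /* max left/right limit jump between rows   */
#define MIN_PUPIL_ROWS   4   /* min number of consecutive valley rows    */
#define MAX_PUPIL_ROWS  20   /* max number of consecutive valley rows    */

typedef struct { uint8_t left, right; bool valid; } valley_t;
typedef struct { float cx, cy; bool found; } centroid_t;

static uint8_t abs_diff(uint8_t a, uint8_t b) { return (a > b) ? a - b : b - a; }

centroid_t estimate_centroid(const valley_t v[H])
{
    centroid_t best = { 0.0f, 0.0f, false };
    uint8_t r = 0;

    while (r < H) {
        if (!v[r].valid) { r++; continue; }

        /* Extend a run of adjacent valleys with smoothly varying limits. */
        uint8_t start = r, end = r;
        while (end + 1 < H && v[end + 1].valid &&
               abs_diff(v[end + 1].left,  v[end].left)  <= ADJ_LIMIT_DIFF &&
               abs_diff(v[end + 1].right, v[end].right) <= ADJ_LIMIT_DIFF)
            end++;

        uint8_t rows = end - start + 1;
        if (rows >= MIN_PUPIL_ROWS && rows <= MAX_PUPIL_ROWS) {
            uint16_t suml = 0, sumr = 0;
            for (uint8_t i = start; i <= end; i++) {
                suml += v[i].left;
                sumr += v[i].right;
            }
            /* Centroid: average of limit columns and of the top/bottom rows. */
            best.cx = (suml + sumr) / (2.0f * rows);
            best.cy = (start + end) / 2.0f;
            best.found = true;
            /* The full method also keeps the darkest candidate group. */
            break;
        }
        r = end + 1;
    }
    return best;
}
```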

Figure 6 shows an example of pupil centroid estimation using the proposed algorithm. The square and circular marks are the valley limit points obtained in the previous step. In this example, there is only one pupil candidate, so the final pupil region is comprised between the detected upper and lower rows, from which the pupil centroid is estimated. The threshold values used in this work were calculated according to the results obtained in Table 1.

3.5. Hyperparameter Estimation

In order to obtain the optimal hyperparameters of the proposed system, a manual human analysis of a set of 10 pupil images from each of 8 different users (Table 2) was carried out. The images were acquired with the optical mouse sensor at a fixed distance of 60 mm from the eye. The pupil images were analyzed by manual human inspection, obtaining the limits (boundaries) of all valleys that contain a portion of pupil. Then, these valley data were used to obtain four valley characteristics (Table 1) and to calculate the optimal hyperparameter values (thresholds). The four valley characteristics are:
(1) The absolute intensity difference between consecutive pixels, whose average and standard deviation are used to calculate the minimum rise threshold.
(2) The valley height, whose minimum and average were used to obtain the lower and upper height thresholds, respectively.
(3) The valley length, whose minimum and maximum were used to obtain the valley length thresholds; the latter includes an offset of 30% for short-distance sensor placement. Also, the MNP threshold is calculated as half of the average valley length, meaning that at least 50% of the valley pixels must have an intensity below the margin.
(4) The pixel intensity difference with respect to the absolute minimum of the valley, whose average and standard deviation are used to calculate the intensity margin, taking into account that all valleys have at least one pixel at this minimum intensity.

After analysing a total of 80 images of 8 different pupils, the valley length and height standard deviations remain low, considering that the pupil pixel intensity can differ between users and that the valley length is tightly related to the circular shape of the pupil. The pixel intensity difference with respect to the absolute minimum of the valley shows more dispersion relative to its average but remains within a feasible detection range.

The threshold used in the pupil centroid estimation step is calculated using the valley length limits obtained in Table 1. After manual inspection, the maximum right and left pupil edge variation between consecutive rows is 2 pixels; the adjacent-limit threshold is therefore set to this value.

4. Eye Blink Detection

Two types of eye blink are considered in this study: the natural and the forced blink. The natural blink occurs when the user closes and opens the eyes quickly and unconsciously. These blinks can disturb the pupil tracking estimation and have to be disregarded. In a forced blink, the user performs a controlled slow blink: closing the eyes, waiting some time, and finally opening the eyes. Many researchers propose the use of forced blinks as an HCI input source [2, 5, 6, 66–74]. Most of them apply image processing techniques to an eye region image stream, such as optical flow methods [2, 66], template matching [67, 68], eye feature extraction [69], facial landmarks [70], or multiple Gabor response waves [71]. When the eye images are acquired from a head-mounted device with reflected NIR light, the simplest techniques can be applied with very accurate results [72–74]. This scenario prevents false positive detections caused by head movements or inadequate lighting conditions. In [72], the eye blink is detected by means of difference images. This method requires an open-eye reference image, as in [73], where the detection of eye blinks using a simple histogram comparison of the eye region is proposed, obtaining a successful blink detection ratio of 87%.

In this work, the eye blinks are detected without additional image processing, determining whether the eyes are open or closed according to whether or not the pupil position is obtained, as proposed in [74]. Then, the time during which the eyes remain closed indicates the blink type and, in the case of a forced blink, the action to perform. Although the accuracy of this method depends highly on the pupil detection success, it does not require any additional processing time, which is a key factor for the proposed low-performance system.

Figure 7 shows the nonblocking procedure used to detect forced eye blinks by means of open-close-open eye sequences and to perform actions depending on their duration. The variable run indicates whether the eye is closed. The local system time, updated every millisecond, is used to calculate the elapsed time without blocking the main thread (pupil detection and mouse events). The time t_action_n indicates how long the eye must remain closed to execute action n; note that t_action_n must be considerably longer than a natural blink. A detection hysteresis time is also defined and, finally, to validate a forced blink, a minimum elapsed open time (t_open) is required after the close-open sequence.
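A minimal nonblocking C sketch of this timing logic is shown below. The timing constants are placeholders (the paper defines t_action_n, the hysteresis time, and t_open in Figure 7), and trigger_action() stands in for the corresponding mouse event.

```c
/* Sketch: nonblocking forced-blink detection driven by whether the pupil
 * was found in the current frame. Timing constants are placeholders.
 */
#include <stdint.h>
#include <stdbool.h>

#define T_ACTION_1_MS  1500u   /* eye closed this long -> action 1 (placeholder) */
#define T_ACTION_2_MS  3000u   /* eye closed this long -> action 2 (placeholder) */
#define T_OPEN_MS       300u   /* eye must reopen at least this long (placeholder) */

extern uint32_t millis(void);           /* local system time, 1 ms resolution */
extern void     trigger_action(uint8_t n);

void blink_task(bool pupil_found)
{
    static bool     closed = false;     /* "run" flag: eye currently closed */
    static uint32_t t_closed = 0, t_reopened = 0;
    static uint8_t  pending = 0;

    uint32_t now = millis();

    if (!pupil_found && !closed) {              /* eye just closed */
        closed = true;
        t_closed = now;
    } else if (pupil_found && closed) {         /* eye just reopened */
        closed = false;
        uint32_t dt = now - t_closed;
        if (dt >= T_ACTION_2_MS)      pending = 2;
        else if (dt >= T_ACTION_1_MS) pending = 1;
        else                          pending = 0;   /* natural blink: ignore */
        t_reopened = now;
    } else if (pupil_found && pending &&
               now - t_reopened >= T_OPEN_MS) {
        /* The eye stayed open long enough: commit the forced-blink action. */
        trigger_action(pending);
        pending = 0;
    }
}
```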

When the eye is closed, the eyelashes fill a significant area of the image, which could generate false pupil detections for specific users. Large eyelashes are also the main disturbing source when the user looks down. The threshold constraints defined in the valley pupil location algorithm avoid these false positives. Figure 8 shows a forced eye blink image and its pupil location result, where the yellow crosses are the absolute minimums for each row. In this case, there is no set of rows with a minimum valley that satisfies the fixed thresholds; hence, the algorithm result is successful.

Figure 9 shows an example of a sequence of images captured during a natural blink, where the dashed red lines are the detected valleys, the solid red lines are the detected pupil region, and the turquoise blue asterisk indicates the located pupil centroid. While a natural blink occurs, there can be a critical moment when the pupil is partially hidden by the top eyelid, creating a pupil location outlier (Figure 9). Due to the fast natural blinking speed and the low image acquisition rate (7.96 fps), this drawback can be filtered out by updating the location only when at least two consecutive frames have the same result. In one of the frames, a slight error in the centroid estimation was produced due to the eyelid obstruction, whereas in the next, the pupil is fully hidden and there is no pupil location. According to the proposed filter, these two location results were rejected; therefore, the previously estimated location was preserved.
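The two-consecutive-frame filter can be sketched as follows; the agreement tolerance in pixels is an assumption of this sketch.

```c
/* Sketch: rejecting blink outliers by accepting a new pupil position only
 * when two consecutive frames agree. The agreement tolerance is assumed.
 */
#include <stdbool.h>
#include <math.h>

#define AGREE_TOL 1.5f   /* two results closer than this "agree" (assumed) */

typedef struct { float x, y; bool found; } result_t;

static result_t prev     = { 0, 0, false };
static result_t accepted = { 0, 0, false };

result_t filter_position(result_t cur)
{
    if (cur.found && prev.found &&
        fabsf(cur.x - prev.x) <= AGREE_TOL &&
        fabsf(cur.y - prev.y) <= AGREE_TOL) {
        accepted = cur;          /* two consecutive consistent detections */
    }
    /* Otherwise keep the last accepted position (e.g., during a blink). */
    prev = cur;
    return accepted;
}
```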

5. Pointing Device

5.1. Electronic Board

Figure 10 shows the single electronic board developed for the proposed pointing device, including all the electronic parts embedded in a small printed circuit board (PCB). On the bottom, there is a low-cost high-performance 8-bit microcontroller PIC18F46J50 from Microchip [75] and its 3.3 V low-dropout regulator (LDO); on the top, which is the side facing the user's eye, there are the optical flow sensor with its external 24 MHz ceramic resonator, a near-infrared LED as the sensor light source, a 20 MHz low-profile external crystal for the microcontroller clock, and a surface-mounted device (SMD) pushbutton. There are also two connectors, one for microcontroller debugging and flash programming and the other for USB communication and powering. Finally, the electronic board has five SMD LEDs duplicated on both sides in order to notify the device status and help both user and assistant during the initial adjustment procedure.

The ADNS-3080 is connected by SPI using a master synchronous serial port (MSSP) of the microcontroller. An entire frame can be transmitted and stored in the 3.7 Kbytes of internal RAM in 71.2 ms. Then, the microcontroller executes the pupil tracking algorithm over the received frame in an average time of 42.6 ms. Finally, the internal USB 2.0 full-speed hardware peripheral is used to translate the pupil position into user-computer actions through the USB HID standard communication class [76]. All these procedures are repeated in a closed loop, reaching a sampling rate of 7.96 fps. Although the main clock source is a 20 MHz external quartz crystal resonator, the microcontroller is configured to operate with an internal phase-locked loop (PLL) that allows reaching the 48 MHz clock required by the USB peripheral. Finally, the SFH 4350 near-infrared emitter light source is connected to one of the PWM modules of the microcontroller, allowing an accurate brightness adjustment.
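The overall firmware loop can be summarized with the following C sketch. The function names and the HID report helper are illustrative placeholders, not the actual firmware API; the timing figures quoted above (71.2 ms acquisition, 42.6 ms processing, 7.96 fps loop rate) are those reported in the text.

```c
/* Sketch: main acquisition/processing loop of the pointing device.
 * All extern functions below are illustrative placeholders.
 */
#include <stdint.h>
#include <stdbool.h>

typedef struct { float x, y; bool found; } pupil_t;

extern void    grab_frame(uint8_t *frame);            /* SPI read, ~71.2 ms   */
extern pupil_t locate_pupil(const uint8_t *frame);    /* valley search, ~42.6 ms */
extern void    update_emulation(pupil_t p, int8_t *dx, int8_t *dy,
                                uint8_t *buttons);    /* regions, combos, blinks */
extern void    usb_hid_mouse_send(int8_t dx, int8_t dy, uint8_t buttons);

static uint8_t frame[900];

void main_loop(void)
{
    for (;;) {                                 /* closed loop, ~7.96 fps */
        int8_t  dx = 0, dy = 0;
        uint8_t buttons = 0;

        grab_frame(frame);                     /* freeze and read one image */
        pupil_t p = locate_pupil(frame);       /* valley-based detection    */
        update_emulation(p, &dx, &dy, &buttons);
        usb_hid_mouse_send(dx, dy, buttons);   /* relative HID mouse report */
    }
}
```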

The board power consumption in normal operation mode, acquiring images, processing, and generating mouse events, was 555.42 mW (583.05 mW with all feedback LEDs on). The main consumption source was a dropping resistor (291 mW) which limits the SFH 4350 NIR light power to 39.1 mW (with the PWM at 100%). The power consumption of the PIC18F46J50 and the ADNS-3080 was 129.42 mW and 14.67 mW, respectively. Compared with a recent power-efficient eye-tracking solution presented in [77], this new proposal is within its estimated power range (from 70 mW to 5.25 W, using a Raspberry Pi 3 in low-power mode operating at only 1 kHz). More complex image processing requirements increase the power consumption, as in the system proposed in [78], based on a high-performance 32-bit microcontroller, which consumes 703.65 mW while acquiring QVGA images at full rate.

5.2. Frame Design

The proposed device is a self-contained head-mounted device, so that the sensor is fixed in front of the user within a short distance to allow pupil tracking. Since the sensor is fixed to the head, possible vibrations coming from the head, neck, or body are avoided. As a result, once the sensor is positioned, it always captures the same region of the eye. Additionally, a self-contained system is also more comfortable, since it avoids installing additional peripherals such as cameras, acquisition boards, or light sources at different locations with many wires.

Figure 11 shows the frame design proposed in this work. It is composed of a stick that holds a small box with the electronics in front of the user's face. This stick is placed on the right side of the head and is held using headbands joined over the right ear, which help to keep the structure in place. The structure also contains some adjustable parts to facilitate the sensor placement. It has up to 5 degrees of freedom (DOF): height, length, and horizontal rotation of the stick, and azimuth and elevation of the box. This approach has only 10 plastic parts and a very simple assembly, aiming to make the device as inexpensive as possible.

Figure 12 shows a close view of the box (36.3 × 15.2 × 28.8 mm) where the electronic board is placed. There are holes in the frame to let the light of the front and rear LEDs through. Their peripheral position and closeness to the eye help the user perceive the LED notifications without stopping to stare at the screen. On the right side of the box, a moving cylindrical plastic part has been designed to actuate the electronic board's pushbutton. Finally, a slot within the stick is used to hide the USB cable coming from the box.

5.3. Human-Computer Interaction

The operating process of the device comprises three main states (Figure 13): the initial adjustment, where the user has to correctly place and fix the device on their head; the mouse emulation, where the user's eye movements are translated into computer pointer actions; and the paused state, where all interaction capabilities are disabled so the gaze can move freely. Figure 13 also presents the flow diagram of the transitions between these states.

Once the device is connected and automatically detected by the operating system, the pointer is blocked at the center of the screen and the device awaits the initial adjustment. In this stage, the user has to sit properly, wearing the device, at a working distance of around 60 cm from the screen and with the eyes horizontally aligned with the pointer. The device then waits until it detects a pupil in the center of the acquired images (±4 pixels offset on both axes) with a valid diameter (between 4 and 16 pixels). The four peripheral LEDs offer positioning feedback to facilitate the manual centering of the sensor box. If the pupil is detected off-center in one direction, for example to the left, the red LED of that direction lights up. The blinking frequency depends on the detected pupil size; if there is no blinking, the pupil size is within the valid range. Once the pupil stays inside the expected area with the expected size for 5 s, the current position of the pupil is stored as the reference position and the green LED blinks, indicating that the adjustment stage is finished and giving way to the mouse emulation stage. This initial manual adjustment may require the help of an assistant in the case of a user with impaired mobility. At any time, the device can be restarted by holding the pushbutton for 6 s or by forcing no pupil detection for more than 6 s.
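A minimal C sketch of this adjustment check is given below. The ±4 pixel offset, the 4–16 pixel diameter range, and the 5 s stabilization time follow the text; the 30 × 30 image centre and the structure names are assumptions of this sketch.

```c
/* Sketch: initial adjustment check. Returns true once the pupil has
 * stayed centred with a valid diameter for the required time.
 */
#include <stdint.h>
#include <stdbool.h>

#define CX 15                  /* image centre, assuming a 30x30 frame */
#define CY 15
#define MAX_OFFSET      4      /* +/- 4 pixels on both axes            */
#define MIN_DIAMETER    4
#define MAX_DIAMETER   16
#define STABLE_TIME_MS 5000u

extern uint32_t millis(void);

typedef struct { int16_t x, y, diameter; bool found; } pupil_obs_t;

static uint32_t t_ok_since = 0;
static bool     ok_running = false;

bool adjustment_done(pupil_obs_t p, pupil_obs_t *ref)
{
    bool ok = p.found &&
              p.x >= CX - MAX_OFFSET && p.x <= CX + MAX_OFFSET &&
              p.y >= CY - MAX_OFFSET && p.y <= CY + MAX_OFFSET &&
              p.diameter >= MIN_DIAMETER && p.diameter <= MAX_DIAMETER;

    if (!ok) { ok_running = false; return false; }

    if (!ok_running) { ok_running = true; t_ok_since = millis(); }

    if (millis() - t_ok_since >= STABLE_TIME_MS) {
        *ref = p;              /* store the reference pupil position */
        return true;
    }
    return false;
}
```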

In the mouse emulation stage, the user interaction is carried out by generating HID mouse events. Pupil displacements and forced eye blink actions are translated into X and Y relative movements and left and right clicks. The proposed system achieved a small horizontal and vertical pupil detection variation range of 10.84 and 8.51 pixels when looking at a 19" 4:3 TFT computer screen at a distance of 60 cm. This is a significant limitation for using this system as a conventional eye tracker; thus, a relative computer mouse displacement scheme must be implemented. In this work, five pupil regions are defined (left, right, up, down, and center), see Equation (13) and Figure 14. These regions are enough for a relative mouse emulation and increase the detection robustness. Figure 14 shows a representation of the regions in an image captured by the optical sensor while the user is looking at the left edge of the screen (the pupil is inside the left region). To force the activation of a region, the user has to keep the pupil inside that region for two consecutive frames. The user can activate the left, right, up, and down regions by looking at the edges of the screen, and the central region by looking at the center. When the position of the pupil is calculated, Equation (13) identifies in which region it lies.

An ellipse with a major axis of 2.8 pixels and a minor axis of 1.9 pixels, centred at the reference position, defines the central region, considering that the horizontal pupil movements are larger than the vertical ones. Two linear functions are used to delimit the left, right, up, and down regions.
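The region classification of Equation (13) can be approximated with the following C sketch. The central ellipse axes (2.8 and 1.9 pixels) follow the text, but the diagonal split used for the four outer regions and the interpretation of the axes as semi-axes are assumptions of this sketch, since the exact linear boundary functions are not reproduced here.

```c
/* Sketch: classifying the pupil position into one of the five regions.
 * The outer-region split and the sign convention are assumptions.
 */
typedef enum { R_CENTER, R_LEFT, R_RIGHT, R_UP, R_DOWN } region_t;

#define A_X 2.8f   /* horizontal axis of the central ellipse (pixels) */
#define B_Y 1.9f   /* vertical axis of the central ellipse (pixels)   */

region_t classify(float x, float y, float xref, float yref)
{
    float dx = x - xref;
    float dy = y - yref;

    /* Inside the central ellipse -> central region. */
    float ex = dx / A_X;
    float ey = dy / B_Y;
    if (ex * ex + ey * ey <= 1.0f)
        return R_CENTER;

    /* Outer regions, split along the (scaled) diagonals; the sign
     * convention depends on the sensor mounting orientation. */
    if (ex * ex >= ey * ey)
        return (dx < 0.0f) ? R_LEFT : R_RIGHT;
    else
        return (dy < 0.0f) ? R_UP : R_DOWN;
}
```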

One of the main problems of eye movement-based interaction is to combine nonintentional user gaze, looking naturally at an item or target on the display screen, with the generation of pointing actions. This problem, called the Midas Touch problem, has been addressed in several research works [79, 80]. Most of them use flags (pushbuttons, long dwell times, etc.) to trigger an eye-controlled action. In [80], the authors propose a fast disengaging gaze control method to overcome the effects of moving the gaze point to look at the result of the input. In this work, the mouse events are controlled by moving the gaze through predefined sequences of regions, called combos, which the user can trigger at any moment during the emulation stage. Table 3 shows the list of possible combinations and their description. Apart from LCc, a combo must start when the gaze remains for at least 1 s in the central region. The regions visited in a combo must remain active for less than 800 ms each; this forces the user to perform a fast sequence of gaze regions. Finally, a combo ends and commits a mouse event by remaining again in the central region for at least 1 s. Although these dwell times can be fatiguing, they ensure that any combo is intended by the user. Thus, the system runs the mouse emulation algorithm (Figure 15) while the user can look naturally at the screen without disengaging the emulation.
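The combo timing rules can be sketched as a small state machine, as follows. The 1 s dwell and 800 ms step limits follow the text; the combo sequences themselves (Table 3) are not reproduced here, so commit_combo() simply receives the recorded region sequence.

```c
/* Sketch: validating a combo as a timed sequence of gaze regions.
 * The mapping from a recorded sequence to a mouse event (Table 3)
 * is left to commit_combo(), a placeholder.
 */
#include <stdint.h>
#include <stdbool.h>

typedef enum { R_CENTER, R_LEFT, R_RIGHT, R_UP, R_DOWN } region_t;

#define T_DWELL_MS  1000u   /* time in the central region to open/commit   */
#define T_STEP_MS    800u   /* max time allowed in each intermediate region */
#define MAX_STEPS       4

extern uint32_t millis(void);
extern void     commit_combo(const region_t *seq, uint8_t len);

void combo_task(region_t cur)
{
    static region_t last = R_CENTER;
    static uint32_t t_enter = 0;
    static bool     armed = false;             /* 1 s in the center already done */
    static region_t steps[MAX_STEPS];
    static uint8_t  nsteps = 0;

    uint32_t now = millis();

    if (cur != last) {                         /* region transition */
        uint32_t dwell = now - t_enter;

        if (last == R_CENTER && dwell >= T_DWELL_MS) {
            armed = true;                      /* a combo may start now */
            nsteps = 0;
        } else if (armed && last != R_CENTER) {
            if (dwell < T_STEP_MS && nsteps < MAX_STEPS)
                steps[nsteps++] = last;        /* record the visited region */
            else
                armed = false;                 /* too slow or too long: abort */
        }
        last = cur;
        t_enter = now;
    } else if (armed && cur == R_CENTER && nsteps > 0 &&
               now - t_enter >= T_DWELL_MS) {
        commit_combo(steps, nsteps);           /* e.g., {R_LEFT} -> move left */
        armed = false;
        nsteps = 0;
    }
}
```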

Figure 15 shows the flow diagram of the algorithm that transforms the displacement combos into USB HID mouse pointer displacements. When a displacement combo is performed, the pointer starts moving in that direction at an exponentially increasing speed. The current pointer position is updated at every sampling period and the pixel variation is calculated using Equation (14).

The factor defaults to 0.008, although it can be modified using the device pushbutton to adjust the cursor speed to the screen size and resolution. With a common screen resolution and the default factor, the pointer takes 6.91 s to cross the screen horizontally. The variation is added to or subtracted from the respective axis depending on the movement direction. The mouse pointer can also move diagonally by activating both displacement combos (not necessarily sequentially). Then, the number of pixels to update on both axes in each sample is normalized accordingly.

In this approach, the user can stop the mouse pointer movement along one axis by performing the opposite combo, whereas a forced blink stops all active movements at once. If the mouse pointer is still, the forced blink performs a left click.

6. Results and Discussion

The proposed pointing device was tested by 8 volunteers aged between 23 and 38 who had no impaired mobility and did not wear glasses or contact lenses. Three volunteers were blue-eyed, and five were brown-eyed. They were seated in front of a 19" 4:3 TFT computer screen at a distance of 60 cm. During each experiment, the users themselves performed the initial adjustment stage following the actions explained in the previous section. Table 2 shows a sensor image capture of each user after the initial adjustment and Table 4 summarizes the results of this stage. The average time required to set up the device before emulation was 33 s. The average pupil diameter was barely 5.5 pixels because all users tended to place the optical sensor far from the eye to minimize obstruction and maximize the view angle of the screen. The obtained reference positions, as expected, were within the restricted region of ±4 pixels offset with respect to the center of the image.

Once the initial adjustment was done, two different user tests were performed. In the first test, the user was asked to perform a fixed sequence of pupil movements, looking at the left, right, up, and down edges of the screen (screen frame) as well as at its center, in order to validate the detection of the five defined regions. An average recording time of 27.7 s was required for each user with an image acquisition rate of 11.84 fps.

Figure 16 shows, for each user, the pupil position error during the movement sequence. In order to evaluate the detection improvement, the pupil was located, frame by frame, with three different algorithms which could be implemented in our proposed system: the one proposed in this work, our previous approach [44], and the integrodifferential operator [46]. Also, for each frame, a manual pupil location was done and used as a reference. The Euclidean distance between the reference points and the algorithms' results was considered the error in pixels.

In the case of the proposed algorithm, the average median error (red lines inside the boxes) for all users was 0.34 pixels, compared with 0.41 pixels and 1.39 pixels for the previous and integrodifferential algorithms, respectively. The average distribution between the 25% (lower quartile) and 75% (upper quartile) of the values (box height) was between 0.21 and 0.50 pixels for the proposed algorithm. The previous approach showed a slightly wider distribution (between 0.27 and 0.59 pixels), still better than the integrodifferential operator (between 1.09 and 1.68 pixels). The integrodifferential operator is very sensitive to the contrast between the eyelids and the sclera, which depicts an arc shape. Also, it always returns a pupil position candidate (maximum confidence), thus requiring additional image processing to detect whether there is a blink and to discard false positive detections during blinks.

As the results in Figure 16 show, the proposed algorithm has average minimum and maximum values within the ±1.5 IQR (interquartile range), the whiskers, of 0.01 and 0.93 pixels, respectively, better than the previous approach (between 0.03 and 1.02 pixels). The error of the valley detection algorithms does not depend on the user and, overall, remains below 1.5 pixels. However, extreme outlier errors appeared for some users during blinks because of eyelash or eyelid obstruction while closing the eye. In the previous approach, these outliers occurred in 3.01% of the cases, generating a maximum error of 14.58 pixels, whereas in the newly proposed algorithm, they were reduced to 0.25% with a maximum error of 9.14 pixels. This significant improvement is due to the new set of hyperparameters and thresholds introduced in the algorithm. Furthermore, the natural eye blink problem was overcome in the emulation process by applying a filter that rejects sudden pupil position changes, as explained in the eye blink detection section (Figure 9).

Figure 17 shows three sensor images of a volunteer's eye: wearing glasses, wearing contact lenses, and wearing neither. The pupil location was obtained manually and with the proposed method, and the difference (error) is shown in the image titles. The yellow crosses are the absolute minimums for each row, the red lines are the detected valleys, and the turquoise blue asterisk indicates the pupil center. As the images show, the glass lenses degrade the image quality and produce strong highlights from the NIR LED; it is not possible to obtain a stable pupil detection under these conditions. On the contrary, the contact lenses did not affect the image quality, and the pupil position detection was as good as the detection without them. Only an insignificant misalignment of 0.15 pixels was obtained in the images shown, and, whether wearing contact lenses or not, the volunteer could operate the device properly.

Figure 18 shows the first-test pupil tracking results of user 8 and the respective detected regions. The manual pupil tracking (analyzing images by a human) and the estimated pupil tracking (using the proposed algorithm) are shown. The results are very similar and the pixel error was constrained within the expected margins. Likewise, the estimated region was calculated, frame by frame with no filters applied, by the region function (Equation (13)). Figure 18(a) shows the regions estimated over time and their success, considering the correct region as the region calculated with the reference pupil position (obtained manually). As the results show, the detection of the different regions was successful: the user looked once at each edge region, always starting from the central one. However, when crossing between regions, some pupil location errors were detected. In the case of user 8, there were 4 frames that generated an incorrect region (red circles in Figure 18), which were considered unsuccessful region detections of that type. In that case, the detection success for the individual regions was 95.2%, 96.8%, 96.5%, 95.8%, and 100%.

Table 5 reports the success of the region detection for each user. In general, all displacements were detected correctly at the limit of the movements (when the user looked at the frame of the screen). The main error source appeared when the pupil crossed region borders, as shown in Figure 18 (red circles). Due to that error, the region success highly depends on the elapsed time in each region, so it was necessary to keep the time spent in each region as similar as possible. As the results in Table 5 show, the regions were successfully detected in 94.7% of the analyzed cases corresponding to the eye left, right, up, and down orientations. There were no significant differences between users with blue and brown eyes. The highest success was obtained when moving the eye to the right edge, with 97.1%, whereas the worst detection results were obtained in the displacements to the opposite side (left edge), with 91.8% success. This was mainly because the sensor was placed to the right of the eye, capturing less displacement to the left, and, in addition, the left side had worse infrared illumination. When the user was looking at the upper edge, the region was correctly detected in 92.0% of the cases. This displacement has the shortest eye movement and is highly sensitive to unexpected head motion variations. When looking at the bottom edge, the average successful region detection was 96.8%, although it was complex to detect for users with large eyebrows. Finally, the central region was successfully detected in 95.8% of the cases. This is the smallest region area, so its detection strongly depends on keeping the head motionless in order to maintain the initial reference position. Due to the small displacement range of the pupil, the central region could not be enlarged, since it would affect the result of the other regions. Unlike the edge regions, where the user has the screen's frame as a reference point to look at, the central region does not have a reference mark and, after a certain time, the user's gaze fluctuates, generating false detections. This weakness could be overcome by updating the reference position every time the user's gaze remains inside the central region.

Finally, in the second test, different combo actions were carried out (see Table 3), one after the other, by looking at the edges (the frame) and the center of the screen as requested. Each user was asked to perform a sequence of 7 combos 5 times, where each combo was tested consecutively twice. It is important to mention that the volunteers were untrained users and had not tested the device before. The experiments were done in the laboratory but simulated a real working scenario: an office desktop with untrained users. The results with users outside the laboratory should be similar, except for the initial adjustment, which requires a certain level of experience. The proposed device is robust against illumination changes due to its direct external NIR LED, which avoids any indoor illumination adjustment.

Despite the initial problems in performing combos during the first contact, by the end of the test most of the volunteers felt comfortable, performing each combo successfully and repeating it at most twice in the worst case. As the results in Table 6 show, the best combo detection was the forced blink, which was successful in 97.5% of the cases. The combos for vertical and horizontal displacements had a similar success rate, with an average correct detection of 84.1%. These were easier to perform than the right-click and double-click combos, since the latter require moving between opposite regions in a very short time. Consequently, these combos had the worst success results, with 70.0% and 75.0%, respectively.

Working in a real scenario, the main weak point of the proposed system is the sensitivity to the reference position obtained in the initial adjustment. The detection of each region depends on this reference point and, due to the low pupil displacement range, a small misalignment seriously harms the overall performance. In this work, the tests were all done by healthy users, and the main problem lay in keeping the head still. When using the sensor for the first time, head tics were detected because of the initial tension. Furthermore, after using the system for a while, volunteers tended to lower the head slightly and, in some cases, they moved the head unconsciously while they were concentrating on performing mouse events. Although the proposed device is designed for people with severe disabilities who cannot move their head, it is reasonable to think that the slippage problem could still be present because of head tics or gravity. In the case of product commercialization, this problem must be addressed, for example, with the method proposed in [81]. It could also be solved by attaching a motion sensor to the headset, such as an accelerometer, to track and correct the reference mismatches. Likewise, this critical reference point could be dynamically updated by tracking the maximum pupil displacements and readjusting the regions, taking care of the increase in computational cost and the impact on system performance.

Another point to improve is the clamping system of the head-mounted device. The device is very lightweight (41 g), comfortable, and minimally intrusive, but it has problems with users with long straight hair (it slides forward). In some cases, due to a small forward movement of the device, the pupil moved out of the camera field and unexpected pointer pauses were triggered; hence, the initial adjustment procedure had to be repeated. A solution could be a third headband around the crown to improve the grip on the head.

The experimental results combining user interaction and natural gaze (the Midas Touch problem [79]) showed difficulties in generating combos. Usually, users failed in the first trial because they did not understand how to perform mouse events. It is understandable that looking at multiple extreme eye gaze points (screen edges) within a controlled dwell time could be upsetting; however, with this solution no false positive mouse events were triggered during natural eye gaze. Also, because the combo sequences are very different from each other, no events (combos) were triggered by mistake.

Although the users had interaction difficulties during the first contact, all of them could perform the proposed combos (in some cases after two or three tries). In the case of the forced eye blink combo, once the user felt comfortable with the blink time, 100% success was obtained, meaning that the eye blinks were always detected properly and the difficult part was getting used to the dwell time. Taking advantage of this robustness, an interaction method based on fast engaging/disengaging mouse control could be implemented [80]. This could improve usability, making the interaction more natural by translating screen-edge eye gazes directly into relative mouse displacements; however, the main concern is how to look simultaneously at a certain screen position (to trigger an event) and at its result.

7. Conclusions

A new implementation of an inexpensive eye-controlled human-computer interface device using an optical mouse sensor has been presented. The device takes advantage of the image acquisition capabilities of the ADNS-3080 low-cost optical mouse sensor, originally designed to operate as a displacement sensor, for pupil detection and tracking. Its default configuration of lenses and illumination, suited to work at a short focal distance of 2.4 mm, has been replaced with a low-cost CAY46 plastic aspheric lens and an external NIR LED with a wavelength peak of 850 nm in order to obtain sharp images of the eye and the pupil. This proposal takes full advantage of the infrared responsivity of optical mouse sensors to detect the pupil as the darkest part of the images.

An optimized algorithm to locate the pupil centroid in the acquired low-resolution sensor images (30 × 30 pixels) has been detailed. Although it has detection performance similar to state-of-the-art algorithms, it can be easily integrated in a low-cost microcontroller without any external memory. Moreover, a procedure to detect forced eye blinks has been implemented in order to perform different mouse actions depending on the time that the eye remains closed (no pupil detection), while rejecting natural blinks.

The proposed pointing device has been fully implemented and evaluated in terms of head-mounted structure, electronics design, and human-computer interaction and operation. It was tested on 8 volunteer users, detecting and locating the pupil successfully for all of them with an average error of 0.34 pixels. The detection of the 5 defined image regions was successful in 94.7% of the cases. The combined sequences of pupil actions (combo actions) were also successfully generated by all 8 users. In the combo sequences for pointer displacements, 84.1% were successful on the first attempt. In the case of forced blink actions such as left-clicking, 97.5% of the cases were successful. The pupil movement combinations for right- and double-click gave the worst result, with an average success of 72.5%, since they require a fast sequence of opposite pupil locations.

The validation results obtained confirm that the low-cost optical mouse sensor is capable of detecting pupil displacements and can be applied as a pupil tracking sensor in low-cost human-computer interface devices. Although the usability of the proposed pointing device is still far from that of the current computer mouse, it could be a very interesting alternative as an affordable interface device for users with severe disability in the upper extremities.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This research was funded by Indra, Accessibility Chair, 2017. This research was also supported by the Government of Catalonia (Comissionat per a Universitats i Recerca, Departament d'Innovació, Universitats i Empresa) and the European Social Fund.