Out-of-bounds read in UNICHAR::UTF8ToUTF32 #4495

@hgarrereyn

Description

Hi, there is a potential bug in UNICHAR::UTF8ToUTF32 when it operates on invalid UTF-8 strings.

This bug was reproduced on f123423.

What crashes

  • Calling tesseract::UNICHAR::UTF8ToUTF32 with a null-terminated string that begins with a truncated multibyte prefix (e.g., "\xE8\0") triggers a stack-buffer-overflow (read) in UNICHAR::utf8_step() via UNICHAR::const_iterator::is_legal().
  • The validator reads past the end of the provided C string when it tries to examine the continuation bytes of the multibyte sequence. In the ASan report below, the faulting access is at frame offset 35, while 'bad' occupies [32, 34).

The function is documented as:

// Converts a utf-8 string to a vector of unicodes.
// Returns an empty vector if the input contains invalid UTF-8.

However, when the parser sees a UTF-8 multibyte lead byte (such as 0xE8 in the POC), it can step past the terminating null byte.
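
To make the mismatch concrete, here is a minimal sketch of how a lead byte announces a sequence length and how that length can exceed what the string holds. Both `declared_utf8_length` and `lead_byte_overruns` are hypothetical helpers written for this illustration; they are not Tesseract's `utf8_step` and may differ from it in which lead bytes they accept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not Tesseract's utf8_step): number of bytes that a
// UTF-8 lead byte announces, or 0 if the byte cannot start a sequence.
inline int declared_utf8_length(unsigned char lead) {
  if (lead < 0x80) return 1;                   // ASCII
  if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 2-byte lead (0xC0/0xC1 are overlong)
  if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 3-byte lead, e.g. 0xE8
  if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 4-byte lead
  return 0;                                    // continuation byte or invalid lead
}

// Hypothetical helper: true when the lead byte at s[0] announces more bytes
// than the null-terminated string holds before its terminator.
inline bool lead_byte_overruns(const char *s) {
  assert(s != nullptr);
  const std::size_t available = std::strlen(s);
  if (available == 0) return false;
  const int declared = declared_utf8_length(static_cast<unsigned char>(s[0]));
  return declared > 0 && static_cast<std::size_t>(declared) > available;
}
```

For the POC input, the lead byte 0xE8 announces 3 bytes while `strlen` reports 1, so any code that advances by the announced length without checking lands beyond the terminator.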

Suggested fix: either remove the documentation comment that says the function handles invalid UTF-8 strings, or fix the loop so that it checks that the remaining string is long enough for the length announced by each multibyte lead byte.
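
As a rough sketch of the second option, the decoder below checks the announced length against the remaining bytes before touching any continuation byte. It is not a patch against Tesseract: `checked_utf8_to_utf32` is a hypothetical standalone function, and it uses `char32_t` where Tesseract uses its own `char32` type. It only mirrors the documented contract (empty vector on invalid UTF-8).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the documented contract of UTF8ToUTF32: returns
// the decoded code points, or an empty vector if the input is invalid UTF-8.
// It never reads past the terminating null byte.
inline std::vector<char32_t> checked_utf8_to_utf32(const char *s) {
  assert(s != nullptr);
  const std::size_t n = std::strlen(s);
  std::vector<char32_t> out;
  out.reserve(n);
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len;
    char32_t cp;
    if (lead < 0x80) { len = 1; cp = lead; }
    else if (lead >= 0xC2 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; }
    else if (lead >= 0xE0 && lead <= 0xEF) { len = 3; cp = lead & 0x0F; }
    else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; }
    else return {};  // stray continuation byte or invalid lead byte
    // The key check: the announced length must fit in the remaining bytes.
    if (len > n - i) return {};
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return {};  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values above U+10FFFF.
    if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
        (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
      return {};
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}
```

With this shape, `checked_utf8_to_utf32("\xE8")` returns an empty vector because the 3-byte announcement does not fit in the single remaining byte. As with the documented contract, an empty result is ambiguous for an empty input string.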

POC

The following testcase demonstrates the bug:

testcase.cpp

#include <string>
#include <cstdio>
#include "/fuzz/install/include/tesseract/unichar.h"

int main() {
  // Truncated 3-byte UTF-8 sequence: 0xE8 at end of string
  const char bad[] = { (char)0xE8, 0 }; // null-terminated
  // This should return an empty vector per API contract on invalid UTF-8,
  // but currently triggers an out-of-bounds read in utf8_step/is_legal.
  auto v = tesseract::UNICHAR::UTF8ToUTF32(bad);
  std::printf("size=%zu\n", v.size());
  return 0;
}

stdout


stderr

=================================================================
==1==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fcaf3800023 at pc 0x7fcaf6492dfe bp 0x7ffc29e49bd0 sp 0x7ffc29e49bc8
READ of size 1 at 0x7fcaf3800023 thread T0
    #0 0x7fcaf6492dfd in tesseract::UNICHAR::UTF8ToUTF32(char const*) (/fuzz/install/lib/libtesseract.so.5.5+0x454dfd) (BuildId: 9260d12595240c308531871605d7841b015cab5a)
    #1 0x5584b63b953f in main /fuzz/testcase.cpp:10:12
    #2 0x7fcaf5843d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #3 0x7fcaf5843e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #4 0x5584b62de324 in _start (/fuzz/test+0x2c324) (BuildId: 2ffe63f6eef34d1cbb242766d390b1734e095d0e)

Address 0x7fcaf3800023 is located in stack of thread T0 at offset 35 in frame
    #0 0x5584b63b942f in main /fuzz/testcase.cpp:5

  This frame has 2 object(s):
    [32, 34) 'bad' (line 7) <== Memory access at offset 35 overflows this variable
    [48, 72) 'v' (line 10)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/fuzz/install/lib/libtesseract.so.5.5+0x454dfd) (BuildId: 9260d12595240c308531871605d7841b015cab5a) in tesseract::UNICHAR::UTF8ToUTF32(char const*)
Shadow bytes around the buggy address:
  0x7fcaf37ffd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fcaf37ffe00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fcaf37ffe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fcaf37fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fcaf37fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7fcaf3800000: f1 f1 f1 f1[02]f2 00 00 00 f3 f3 f3 f3 f3 f3 f3
  0x7fcaf3800080: f1 f1 f1 f1 f8 f8 f8 f8 f2 f2 f2 f2 00 f3 f3 f3
  0x7fcaf3800100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fcaf3800180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fcaf3800200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fcaf3800280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1==ABORTING

Steps to Reproduce

The crash was triaged with the following Dockerfile:

Dockerfile

# Ubuntu 22.04 with some packages pre-installed
FROM hgarrereyn/stitch_repro_base@sha256:3ae94cdb7bf2660f4941dc523fe48cd2555049f6fb7d17577f5efd32a40fdd2c

RUN git clone https://github.com/tesseract-ocr/tesseract /fuzz/src && \
    cd /fuzz/src && \
    git checkout f1234239ef77b8c0b68f59f2fc7b66f4b52a4a0a && \
    git submodule update --init --remote --recursive

ENV LD_LIBRARY_PATH=/fuzz/install/lib
ENV ASAN_OPTIONS=hard_rss_limit_mb=1024:detect_leaks=0

RUN echo '#!/bin/bash\nexec clang-17 -fsanitize=address -O0 "$@"' > /usr/local/bin/clang_wrapper && \
    chmod +x /usr/local/bin/clang_wrapper && \
    echo '#!/bin/bash\nexec clang++-17 -fsanitize=address -O0 "$@"' > /usr/local/bin/clang_wrapper++ && \
    chmod +x /usr/local/bin/clang_wrapper++

# Install build tools and dependencies
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    cmake \
    ninja-build \
    pkg-config \
    libleptonica-dev \
    libpng-dev \
    libjpeg-turbo8-dev \
    libtiff-dev \
    zlib1g-dev \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /fuzz
RUN cmake -S /fuzz/src -B /fuzz/build \
    -G Ninja \
    -DCMAKE_C_COMPILER=clang_wrapper \
    -DCMAKE_CXX_COMPILER=clang_wrapper++ \
    -DCMAKE_INSTALL_PREFIX=/fuzz/install \
    -DBUILD_SHARED_LIBS=ON \
    -DBUILD_TESTS=OFF \
    -DBUILD_TRAINING_TOOLS=OFF \
    -DOPENMP_BUILD=OFF \
    -DSW_BUILD=OFF \
    -DDISABLE_ARCHIVE=ON \
    -DDISABLE_CURL=ON \
    -DGRAPHICS_DISABLED=ON \
    -DENABLE_UNITY_BUILD=ON \
    -DENABLE_PRECOMPILED_HEADERS=OFF
RUN cmake --build /fuzz/build --target install -j

Build Command

clang++-17 -fsanitize=address -g -O0 -o /fuzz/test /fuzz/testcase.cpp -I/fuzz/install/include -L/fuzz/install/lib -Wl,-rpath,/fuzz/install/lib -ltesseract -llept -lpthread && /fuzz/test

Reproduce

  1. Copy Dockerfile and testcase.cpp into a local folder.
  2. Build the repro image:
docker build . -t repro --platform=linux/amd64
  3. Compile and run the testcase in the image:
docker run \
    -it --rm \
    --platform linux/amd64 \
    --mount type=bind,source="$(pwd)/testcase.cpp",target=/fuzz/testcase.cpp \
    repro \
    bash -c "clang++-17 -fsanitize=address -g -O0 -o /fuzz/test /fuzz/testcase.cpp -I/fuzz/install/include -L/fuzz/install/lib -Wl,-rpath,/fuzz/install/lib -ltesseract -llept -lpthread && /fuzz/test"


Additional Info

This testcase was discovered by STITCH, an autonomous fuzzing system. All reports are reviewed manually (by a human) before submission.
