Description
Hi, there is a potential bug in UNICHAR::UTF8ToUTF32 when operating on invalid utf8 strings.
This bug was reproduced on f123423.
What crashes
- Calling tesseract::UNICHAR::UTF8ToUTF32 with a null-terminated string that begins with a truncated multibyte prefix (e.g., "\xE8\0") triggers a stack-buffer-overflow (read) in UNICHAR::utf8_step() via UNICHAR::const_iterator::is_legal().
- The validator reads past the end of the provided C string when attempting to examine the rest of the multibyte sequence. In the ASan report below, the faulting read is at frame offset 35, while 'bad' occupies offsets [32, 34).
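To illustrate the second point, here is a minimal sketch of how an iterator that advances by the length implied by each lead byte, without checking how much input remains, ends up past the terminator. This is not tesseract's code: SequenceLengthFromLead and IndexAfterNaiveStepping are hypothetical names, and the sketch only computes the index it would land on, so it never reads out of bounds itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not tesseract's implementation): sequence length
// implied by the bit pattern of a UTF-8 lead byte, or 0 if the byte cannot
// start a sequence. It does not reject overlong leads such as 0xC0.
int SequenceLengthFromLead(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx (0xE8 lands here)
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // continuation byte or invalid lead
}

// Returns the index a naive iterator would stand on after stepping through
// the string by lead-implied lengths. The loop stops as soon as the index
// reaches or passes strlen(s), so this function stays in bounds.
std::size_t IndexAfterNaiveStepping(const char *s) {
  assert(s != nullptr);
  const std::size_t len = std::strlen(s);
  std::size_t i = 0;
  while (i < len) {
    const int step = SequenceLengthFromLead(static_cast<unsigned char>(s[i]));
    if (step == 0) break;  // illegal lead byte: stop
    i += static_cast<std::size_t>(step);
  }
  return i;
}
```

For the input "\xE8" this returns 3, although the array { 0xE8, 0 } only has valid indices 0 and 1. An iterator that is compared against the end position with inequality rather than less-than would therefore skip the end and keep reading.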
The function is documented as follows (tesseract/src/ccutil/unichar.cpp, lines 217 to 218 in bb7eb84):
// Converts a utf-8 string to a vector of unicodes.
// Returns an empty vector if the input contains invalid UTF-8.
Yet while iterating over the input, it can step past the terminating null byte when it sees a UTF-8 multibyte lead byte (such as 0xE8 in the POC).
The suggested fix is to either remove the documentation comment stating that the function handles invalid UTF-8, or fix the loop so that it checks that the remaining input is long enough for the length implied by each multibyte lead byte.
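As a rough illustration of the second option, the following sketch checks that every sequence fits inside the remaining input before touching its continuation bytes. It is an assumption about what such a check could look like, not a patch against tesseract: Utf8LeadLength and IsBoundedUtf8 are hypothetical names, and the check does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: length implied by a UTF-8 lead byte, 0 if the byte
// cannot start a sequence.
static int Utf8LeadLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

// Returns true only if every sequence in s[0, len) fits inside the buffer
// and all of its continuation bytes have the form 10xxxxxx.
// Never reads at or beyond s + len.
bool IsBoundedUtf8(const char *s, std::size_t len) {
  assert(s != nullptr || len == 0);
  std::size_t i = 0;
  while (i < len) {
    const int step = Utf8LeadLength(static_cast<unsigned char>(s[i]));
    // Reject illegal lead bytes and sequences that would run past the end.
    if (step == 0 || static_cast<std::size_t>(step) > len - i) return false;
    for (int k = 1; k < step; ++k) {
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
    }
    i += static_cast<std::size_t>(step);
  }
  return true;
}
```

With a check of this kind, the input from the POC (a lone 0xE8 with length 1) is rejected before any byte beyond the terminator is examined, which is what the documented contract implies.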
POC
The following testcase demonstrates the bug:
testcase.cpp
#include <string>
#include <cstdio>
#include "/fuzz/install/include/tesseract/unichar.h"

int main() {
  // Truncated 3-byte UTF-8 sequence: 0xE8 at end of string
  const char bad[] = { (char)0xE8, 0 }; // null-terminated
  // This should return an empty vector per API contract on invalid UTF-8,
  // but currently triggers an out-of-bounds read in utf8_step/is_legal.
  auto v = tesseract::UNICHAR::UTF8ToUTF32(bad);
  std::printf("size=%zu\n", v.size());
  return 0;
}
stdout
stderr
=================================================================
==1==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fcaf3800023 at pc 0x7fcaf6492dfe bp 0x7ffc29e49bd0 sp 0x7ffc29e49bc8
READ of size 1 at 0x7fcaf3800023 thread T0
#0 0x7fcaf6492dfd in tesseract::UNICHAR::UTF8ToUTF32(char const*) (/fuzz/install/lib/libtesseract.so.5.5+0x454dfd) (BuildId: 9260d12595240c308531871605d7841b015cab5a)
#1 0x5584b63b953f in main /fuzz/testcase.cpp:10:12
#2 0x7fcaf5843d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#3 0x7fcaf5843e3f in __libc_start_main csu/../csu/libc-start.c:392:3
#4 0x5584b62de324 in _start (/fuzz/test+0x2c324) (BuildId: 2ffe63f6eef34d1cbb242766d390b1734e095d0e)
Address 0x7fcaf3800023 is located in stack of thread T0 at offset 35 in frame
#0 0x5584b63b942f in main /fuzz/testcase.cpp:5
This frame has 2 object(s):
[32, 34) 'bad' (line 7) <== Memory access at offset 35 overflows this variable
[48, 72) 'v' (line 10)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/fuzz/install/lib/libtesseract.so.5.5+0x454dfd) (BuildId: 9260d12595240c308531871605d7841b015cab5a) in tesseract::UNICHAR::UTF8ToUTF32(char const*)
Shadow bytes around the buggy address:
0x7fcaf37ffd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7fcaf37ffe00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7fcaf37ffe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7fcaf37fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7fcaf37fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7fcaf3800000: f1 f1 f1 f1[02]f2 00 00 00 f3 f3 f3 f3 f3 f3 f3
0x7fcaf3800080: f1 f1 f1 f1 f8 f8 f8 f8 f2 f2 f2 f2 00 f3 f3 f3
0x7fcaf3800100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7fcaf3800180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7fcaf3800200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7fcaf3800280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==1==ABORTING
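The shadow row marked with => in the report can be read with a small sketch of the per-access check, following the publicly described ASan algorithm as far as I understand it. GranuleAllowsAccess is a hypothetical name, and the sketch only covers accesses that stay within a single 8-byte granule. The faulting address 0x7fcaf3800023 falls in the granule whose shadow byte is 02, meaning only the first two bytes (the two bytes of 'bad') are addressable.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the check for one access that stays within a single 8-byte
// granule. shadow is the shadow byte for that granule, offset_in_granule
// is (address & 7), size is the access size in bytes.
bool GranuleAllowsAccess(std::uint8_t shadow, unsigned offset_in_granule,
                         unsigned size) {
  assert(offset_in_granule < 8 && size >= 1 && offset_in_granule + size <= 8);
  if (shadow == 0) return true;  // all 8 bytes addressable
  // Values 1..7 mean "first k bytes addressable"; redzone markers such as
  // 0xf1..0xf3 become negative as signed bytes, so every access fails.
  const int k = static_cast<std::int8_t>(shadow);
  const int last_accessed = static_cast<int>(offset_in_granule + size - 1);
  return last_accessed < k;
}
```

For the reported read, (0x7fcaf3800023 & 7) is 3 and the shadow byte is 02, so the check fails, while reads at granule offsets 0 and 1 (bad[0] and bad[1]) pass.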
Steps to Reproduce
The crash was triaged with the following Dockerfile:
Dockerfile
# Ubuntu 22.04 with some packages pre-installed
FROM hgarrereyn/stitch_repro_base@sha256:3ae94cdb7bf2660f4941dc523fe48cd2555049f6fb7d17577f5efd32a40fdd2c
RUN git clone https://github.com/tesseract-ocr/tesseract /fuzz/src && \
cd /fuzz/src && \
git checkout f1234239ef77b8c0b68f59f2fc7b66f4b52a4a0a && \
git submodule update --init --remote --recursive
ENV LD_LIBRARY_PATH=/fuzz/install/lib
ENV ASAN_OPTIONS=hard_rss_limit_mb=1024:detect_leaks=0
RUN echo '#!/bin/bash\nexec clang-17 -fsanitize=address -O0 "$@"' > /usr/local/bin/clang_wrapper && \
chmod +x /usr/local/bin/clang_wrapper && \
echo '#!/bin/bash\nexec clang++-17 -fsanitize=address -O0 "$@"' > /usr/local/bin/clang_wrapper++ && \
chmod +x /usr/local/bin/clang_wrapper++
# Install build tools and dependencies
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
cmake \
ninja-build \
pkg-config \
libleptonica-dev \
libpng-dev \
libjpeg-turbo8-dev \
libtiff-dev \
zlib1g-dev \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /fuzz
RUN cmake -S /fuzz/src -B /fuzz/build \
-G Ninja \
-DCMAKE_C_COMPILER=clang_wrapper \
-DCMAKE_CXX_COMPILER=clang_wrapper++ \
-DCMAKE_INSTALL_PREFIX=/fuzz/install \
-DBUILD_SHARED_LIBS=ON \
-DBUILD_TESTS=OFF \
-DBUILD_TRAINING_TOOLS=OFF \
-DOPENMP_BUILD=OFF \
-DSW_BUILD=OFF \
-DDISABLE_ARCHIVE=ON \
-DDISABLE_CURL=ON \
-DGRAPHICS_DISABLED=ON \
-DENABLE_UNITY_BUILD=ON \
-DENABLE_PRECOMPILED_HEADERS=OFF
RUN cmake --build /fuzz/build --target install -j
Build Command
clang++-17 -fsanitize=address -g -O0 -o /fuzz/test /fuzz/testcase.cpp -I/fuzz/install/include -L/fuzz/install/lib -Wl,-rpath,/fuzz/install/lib -ltesseract -llept -lpthread && /fuzz/test
Reproduce
- Copy Dockerfile and testcase.cpp into a local folder.
- Build the repro image:
docker build . -t repro --platform=linux/amd64
- Compile and run the testcase in the image:
docker run \
-it --rm \
--platform linux/amd64 \
--mount type=bind,source="$(pwd)/testcase.cpp",target=/fuzz/testcase.cpp \
repro \
bash -c "clang++-17 -fsanitize=address -g -O0 -o /fuzz/test /fuzz/testcase.cpp -I/fuzz/install/include -L/fuzz/install/lib -Wl,-rpath,/fuzz/install/lib -ltesseract -llept -lpthread && /fuzz/test"
Additional Info
This testcase was discovered by STITCH, an autonomous fuzzing system. All reports are reviewed manually (by a human) before submission.