summaryrefslogtreecommitdiff
path: root/README.md
blob: 8b9a97e3864c0ea89ff266314577607c845258c6 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
[![CI](https://github.com/rusticstuff/simdutf8/actions/workflows/ci.yml/badge.svg)](https://github.com/rusticstuff/simdutf8/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/simdutf8.svg)](https://crates.io/crates/simdutf8)
[![docs.rs](https://docs.rs/simdutf8/badge.svg)](https://docs.rs/simdutf8)

# simdutf8 – High-speed UTF-8 validation

Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from
[simdjson](https://github.com/simdjson/simdjson). Originally ported to Rust by the developers of [simd-json.rs](https://simd-json.rs), but now heavily improved.

## Status
This library has been thoroughly tested with sample data as well as fuzzing and there are no known bugs.

## Features
* `basic` API for the fastest validation, optimized for valid UTF-8
* `compat` API as a fully compatible replacement for `std::str::from_utf8()`
* Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64
* 🆕 ARM64 (aarch64) SIMD is supported with Rust 1.59 (use feature `aarch64_neon`) and Nightly (no extra feature needed)
* 🆕 WASM (wasm32) SIMD is supported
* x86-64: Up to 23 times faster than the std library on valid non-ASCII, up to four times faster on ASCII
* aarch64: Up to eleven times faster than the std library on valid non-ASCII, up to four times faster on ASCII (Apple Silicon)
* Faster than the original simdjson implementation
* Selects the fastest implementation at runtime based on CPU support (on x86)
* Falls back to the excellent std implementation if SIMD extensions are not supported
* Written in pure Rust
* No dependencies
* No-std support

## Quick start
Add the dependency to your Cargo.toml file:
```toml
[dependencies]
simdutf8 = "0.1.4"
```
For ARM64 SIMD support on Rust 1.59:
```toml
[dependencies]
simdutf8 = { version = "0.1.4", features = ["aarch64_neon"] }
```

Use `simdutf8::basic::from_utf8()` as a drop-in replacement for `std::str::from_utf8()`.

```rust
use simdutf8::basic::from_utf8;

println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());
```

If you need detailed information on validation failures, use `simdutf8::compat::from_utf8()`
instead.

```rust
use simdutf8::compat::from_utf8;

let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));
```

## APIs

### Basic flavor
Use the `basic` API flavor for maximum speed. It is fastest on valid UTF-8, but only checks
for errors after processing the whole byte sequence and does not provide detailed information if the data
is not valid UTF-8. `simdutf8::basic::Utf8Error` is a zero-sized error struct.

### Compat flavor
The `compat` flavor is fully API-compatible with `std::str::from_utf8()`. In particular, `simdutf8::compat::from_utf8()`
returns a `simdutf8::compat::Utf8Error`, which has `valid_up_to()` and `error_len()` methods. The first is useful for
verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.

It also fails early: errors are checked on the fly as the string is processed and once
an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data.
This comes at a slight performance penalty compared to the `basic` API even if the input is valid UTF-8.

## Implementation selection

### X86
The fastest implementation is selected at runtime using the `std::is_x86_feature_detected!` macro, unless the CPU
targeted by the compiler supports the fastest available implementation.
So if you compile with `RUSTFLAGS="-C target-cpu=native"` on a recent x86-64 machine, the AVX 2 implementation is selected at
compile-time and runtime selection is disabled.

For no-std support (compiled with `--no-default-features`) the implementation is always selected at compile time based on
the targeted CPU. Use `RUSTFLAGS="-C target-feature=+avx2"` for the AVX 2 implementation or `RUSTFLAGS="-C target-feature=+sse4.2"`
for the SSE 4.2 implementation.

### ARM64
The SIMD implementation is only available on Rust Nightly and Rust 1.59 or later. On Rust Nightly it is now turned on
automatically.
To get the SIMD implementation with Rust 1.59 (and likely 1.60) the crate feature `aarch64_neon` needs to be enabled.
For Rust Nightly this will no longer be required (but does not hurt either). It is expected that the SIMD implementation
will be enabled automatically starting with Rust 1.61.

### WASM32
For wasm32 support, the implementation is selected at compile time based on the presence of the `simd128` target feature.
Use `RUSTFLAGS="-C target-feature=+simd128"` to enable the WASM SIMD implementation.  WASM, at
the time of this writing, doesn't have a way to detect SIMD through WASM itself.  Although this capability
is available in various WASM host environments (e.g., [wasm-feature-detect] in the web browser), there is no portable
way from within the library to detect this.

[wasm-feature-detect]: https://github.com/GoogleChromeLabs/wasm-feature-detect

#### Building/Targeting WASM
See [this document](./wasm32-development.md) for more details.

### Access to low-level functionality

If you want to be able to call a SIMD implementation directly, use the `public_imp` feature flag. The validation implementations are then accessible in the `simdutf8::{basic, compat}::imp` hierarchy. Traits
facilitating streaming validation are available there as well.

## Optimisation flags
Do not use [`opt-level = "z"`](https://doc.rust-lang.org/cargo/reference/profiles.html), which prevents inlining and makes
the code quite slow.

## Minimum Supported Rust Version (MSRV)
This crate's minimum supported Rust version is 1.38.0.

## Benchmarks
The benchmarks have been done with [criterion](https://bheisler.github.io/criterion.rs/book/index.html), the tables
are created with [critcmp](https://github.com/BurntSushi/critcmp). Source code and data are in the
[bench directory](https://github.com/rusticstuff/simdutf8/tree/main/bench).

The naming schema is id-charset/size. _0-empty_ is the empty byte slice, _x-error/66536_ is a 64KiB slice where the very
first character is invalid UTF-8. Library versions are simdutf8 v0.1.2 and simdjson v0.9.2. When comparing
with simdjson simdutf8 is compiled with `#inline(never)`.

Configurations:
* X86-64: PC with an AMD Ryzen 7 PRO 3700 CPU (Zen2) on Linux with Rust 1.52.0
* Aarch64: Macbook Air with an Apple M1 CPU (Apple Silicon) on macOS with Rust rustc 1.54.0-nightly (881c1ac40 2021-05-08).

### simdutf8 basic vs std library on x86-64 (AMD Zen2)
![image](https://user-images.githubusercontent.com/3736990/117568104-1c00f900-b0bf-11eb-938f-4c253d192480.png)
Simdutf8 is up to 23 times faster than the std library on valid non-ASCII, up to four times on pure ASCII.

### simdutf8 basic vs std library on aarch64 (Apple Silicon)
![image](https://user-images.githubusercontent.com/3736990/117568160-42bf2f80-b0bf-11eb-86a4-9aeee4cee87d.png)
Simdutf8 is up to to eleven times faster than the std library on valid non-ASCII, up to four times faster on
pure ASCII.

### simdutf8 basic vs simdjson on x86-64
![image](https://user-images.githubusercontent.com/3736990/117568231-80bc5380-b0bf-11eb-8e90-1dcc6d966ebd.png)
Simdutf8 is faster than simdjson on almost all inputs.

### simdutf8 basic vs simdutf8 compat UTF-8 on x86-64
![image](https://user-images.githubusercontent.com/3736990/117568270-af3a2e80-b0bf-11eb-8ec4-e5a0a4ad7210.png)
There is a small performance penalty to continuously checking the error status while processing data, but detecting
errors early provides a huge benefit for the _x-error/66536_ benchmark.

## Technical details
For inputs shorter than 64 bytes validation is delegated to `core::str::from_utf8()` except for the direct-access
functions in `simdutf8::{basic, compat}::imp`.

The SIMD implementation is mostly similar to the one in simdjson except that it is has additional optimizations
for the pure ASCII case. Also it uses prefetch with AVX 2 on x86 which leads to slightly better performance with
some Intel CPUs on synthetic benchmarks.

For the compat API, we need to check the error status vector on each 64-byte block instead of just aggregating it. If an
error is found, the last bytes of the previous block are checked for a cross-block continuation and then
`std::str::from_utf8()` is run to find the exact location of the error.

Care is taken that all functions are properly inlined up to the public interface.

## Thanks
* to the authors of simdjson for coming up with the high-performance SIMD implementation and in particular to Daniel Lemire
  for his feedback. It was very helpful.
* to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.


## License
This code is dual-licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) and the [MIT License](https://opensource.org/licenses/MIT).

It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under
the MIT license and Apache 2.0 license as well.

simdjson itself is distributed under the Apache License 2.0.

## References
John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021