Rusty Buffers
Rust’s ownership system makes it easy and safe to create a zero-copy parser that takes a slice of bytes as input and outputs some structure containing references to the original input. Rust ensures that such references exist only while the underlying slice cannot be mutated.
As a concrete example, say we have a &[u8] containing "3foo3bar3baz4quux" and want to parse it into vec!["foo", "bar", "baz", "quux"]. This is easily accomplished by defining a couple of nom parser combinators:
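// Imports assume a pre-4.0 nom release (the IResult::Done / Incomplete API).
#[macro_use]
extern crate nom;

use nom::{is_digit, rest};
use std::str::{self, FromStr};

// Zero or more length-prefixed strings, each borrowed from the input.
named!(strings<Vec<&str>>,
    many0!(map_res!(length_value!(ascii_num, rest), str::from_utf8))
);

// A run of ASCII digits parsed into a usize.
named!(ascii_num<usize>,
    map_res!(map_res!(take_while!(is_digit), str::from_utf8), usize::from_str)
);

fn main() {
    let input = b"3foo3bar3baz4quux";
    let expect = vec!["foo", "bar", "baz", "quux"];
    let output = strings(input).unwrap().1;
    assert_eq!(expect, output);
}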
In real-world use the input slice may contain only partial data, for example "3foo3bar3baz4q", in which case the parser will return IResult::Incomplete. Or it may contain multiple messages, e.g. "3foo3bar 3baz4quux", and the parser will return the parsed results plus the remaining bytes.
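To make these outcomes concrete without depending on nom, here is a minimal hand-written sketch of the same length-prefixed format. The Parsed enum and the parse_strings function are hypothetical stand-ins invented for this illustration; they mimic the shape of nom's result but are not its API, and details such as error handling are simplified.

```rust
use std::str;

/// Simplified, hypothetical stand-in for nom's IResult.
#[derive(Debug, PartialEq)]
enum Parsed<'a> {
    /// The parsed strings plus the unconsumed remainder of the input.
    Done(Vec<&'a str>, &'a [u8]),
    /// The input ended in the middle of a string.
    Incomplete,
}

/// Parses as many length-prefixed strings as possible. Every returned &str
/// borrows from `input`; no bytes are copied.
fn parse_strings(input: &[u8]) -> Parsed<'_> {
    let mut out = Vec::new();
    let mut rest = input;
    loop {
        // The length prefix is a run of ASCII digits.
        let digits = rest.iter().take_while(|b| b.is_ascii_digit()).count();
        let prefix = str::from_utf8(&rest[..digits]).ok();
        let len = match prefix.and_then(|s| s.parse::<usize>().ok()) {
            Some(len) => len,
            // No usable length prefix: stop and hand back what is left.
            None => return Parsed::Done(out, rest),
        };
        if rest.len() - digits < len {
            return Parsed::Incomplete;
        }
        let (body, tail) = rest[digits..].split_at(len);
        match str::from_utf8(body) {
            Ok(s) => out.push(s),
            Err(_) => return Parsed::Done(out, rest),
        }
        rest = tail;
    }
}

fn main() {
    // Complete input: everything is parsed and nothing remains.
    let full = parse_strings(b"3foo3bar3baz4quux");
    assert_eq!(full, Parsed::Done(vec!["foo", "bar", "baz", "quux"], &b""[..]));

    // Partial input: the last string is cut short.
    assert_eq!(parse_strings(b"3foo3bar3baz4q"), Parsed::Incomplete);

    // Two messages: parsing stops at the separator and returns the remainder.
    let two = parse_strings(b"3foo3bar 3baz4quux");
    assert_eq!(two, Parsed::Done(vec!["foo", "bar"], &b" 3baz4quux"[..]));
}
```

Because the strings in Done borrow from the input, the compiler will not allow the input to be modified or dropped while they are still in use.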
Buffers
If we’re reading data from the network into a fixed-size buffer which is passed to the parser then we must copy any partial or remaining bytes somewhere else before the next read overwrites them. When more data is received it can be appended to the existing data and passed to the parser again.
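As a baseline, the simplest way to satisfy this requirement is to copy every chunk into a growable buffer before parsing. The sketch below does exactly that. The names parse_records and read_all are hypothetical, the 5-byte chunk size is chosen only to force partial reads, and the stand-in parser assumes a single ASCII digit as the length prefix.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical stand-in for the real parser: returns every complete record
/// (one ASCII digit of length, then that many bytes) as an owned String,
/// plus the number of input bytes consumed.
fn parse_records(input: &[u8]) -> (Vec<String>, usize) {
    let mut records = Vec::new();
    let mut pos = 0;
    while pos < input.len() && input[pos].is_ascii_digit() {
        let len = (input[pos] - b'0') as usize;
        let end = pos + 1 + len;
        if end > input.len() {
            break; // partial record: wait for more data
        }
        records.push(String::from_utf8_lossy(&input[pos + 1..end]).into_owned());
        pos = end;
    }
    (records, pos)
}

/// Reads `reader` through a fixed-size buffer, copying every chunk into
/// `pending` before parsing it.
fn read_all<R: Read>(mut reader: R) -> io::Result<Vec<String>> {
    let mut chunk = [0u8; 5]; // deliberately tiny to force partial records
    let mut pending = Vec::new();
    let mut records = Vec::new();
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            return Ok(records);
        }
        // This copy happens on every read, even when nothing was pending.
        pending.extend_from_slice(&chunk[..n]);
        let (parsed, used) = parse_records(&pending);
        records.extend(parsed);
        // Drop the consumed bytes; any partial record stays for the next read.
        pending.drain(..used);
    }
}

fn main() {
    let reader = Cursor::new(&b"3foo3bar3baz4quux"[..]);
    let records = read_all(reader).expect("reading from memory should not fail");
    assert_eq!(records, ["foo", "bar", "baz", "quux"]);
}
```

This is correct but copies every byte at least once, including reads that contain only complete records.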
Copying is expensive, so we should parse directly from the input buffer whenever possible and only copy when there is existing data that the input must be appended to. Here is a Buffer type containing a Vec to store these partial or remaining bytes:
pub struct Buffer {
    vec: Vec<u8>
}

impl Buffer {
    pub fn new() -> Buffer {
        Buffer {
            vec: Vec::new(),
        }
    }

    pub fn buf<'a: 'b, 'b>(&'a mut self, more: &'b [u8]) -> Buf<'b> {
        if self.vec.is_empty() {
            Buf::Empty(&mut self.vec, more)
        } else {
            self.vec.extend_from_slice(more);
            Buf::Some(&mut self.vec)
        }
    }
}
The buf(..) method is called with a reference to the input buffer and returns a Buf that can be passed to the parser as a &[u8] via the Deref trait. The lifetimes (<'a: 'b, 'b>) are a bit gnarly because the compiler must be told that the returned Buf has the same lifetime as the input buffer, which may be shorter than the lifetime of the Buffer. The bound 'a: 'b reads as "'a outlives 'b", so the mutable borrow of the Buffer only needs to last as long as the borrow of the input.
When no partial or remaining bytes have been buffered, the Buf simply dereferences to the input buffer directly. However, when the internal buffer is not empty, the input buffer is appended to it and the Buf dereferences to that larger buffer.
use std::ops::Deref;

pub enum Buf<'a> {
    Empty(&'a mut Vec<u8>, &'a [u8]),
    Some(&'a mut Vec<u8>),
}

impl<'a> Buf<'a> {
    pub fn keep(&mut self, n: usize) {
        match *self {
            Buf::Empty(ref mut vec, more) => {
                let n = more.len() - n;
                vec.extend_from_slice(&more[n..]);
            },
            Buf::Some(ref mut vec) => {
                let n = vec.len() - n;
                vec.drain(..n);
            },
        }
    }
}

impl<'a> Deref for Buf<'a> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        match *self {
            Buf::Empty(_, more) => more,
            Buf::Some(ref vec) => &vec[..],
        }
    }
}
When parsing is complete, the keep(..) method of Buf is called with the number of bytes that have not been consumed. Those bytes are retained in the internal buffer for use later.
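To see both paths in action without a parser, the following sketch repeats the Buffer and Buf definitions from above in condensed form and drives them by hand. The byte counts passed to keep(..) are made up to simulate what a parser would report, and the pointer comparison is only there to show that the first call does not copy.

```rust
use std::ops::Deref;

pub struct Buffer {
    vec: Vec<u8>,
}

impl Buffer {
    pub fn new() -> Buffer {
        Buffer { vec: Vec::new() }
    }

    pub fn buf<'a: 'b, 'b>(&'a mut self, more: &'b [u8]) -> Buf<'b> {
        if self.vec.is_empty() {
            Buf::Empty(&mut self.vec, more)
        } else {
            self.vec.extend_from_slice(more);
            Buf::Some(&mut self.vec)
        }
    }
}

pub enum Buf<'a> {
    Empty(&'a mut Vec<u8>, &'a [u8]),
    Some(&'a mut Vec<u8>),
}

impl<'a> Buf<'a> {
    /// Retains the last `n` bytes for the next call to `Buffer::buf`.
    pub fn keep(&mut self, n: usize) {
        match *self {
            Buf::Empty(ref mut vec, more) => {
                let n = more.len() - n;
                vec.extend_from_slice(&more[n..]);
            }
            Buf::Some(ref mut vec) => {
                let n = vec.len() - n;
                vec.drain(..n);
            }
        }
    }
}

impl<'a> Deref for Buf<'a> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        match *self {
            Buf::Empty(_, more) => more,
            Buf::Some(ref vec) => &vec[..],
        }
    }
}

fn main() {
    let mut buffer = Buffer::new();

    // First read: nothing is buffered, so the Buf points straight at the input.
    let first = *b"3foo3b";
    {
        let mut buf = buffer.buf(&first);
        assert!(matches!(buf, Buf::Empty(..)));
        assert_eq!(buf.as_ptr(), first.as_ptr()); // same memory, no copy
        buf.keep(2); // simulate a parser that left "3b" unconsumed
    }

    // Second read: the leftover bytes force the copying path.
    let second = *b"ar";
    {
        let mut buf = buffer.buf(&second);
        assert!(matches!(buf, Buf::Some(_)));
        assert_eq!(&buf[..], &b"3bar"[..]);
        buf.keep(0); // simulate a parser that consumed everything
    }

    // Third read: the internal buffer is empty again, so no copy is made.
    let third = *b"4quux";
    let buf = buffer.buf(&third);
    assert!(matches!(buf, Buf::Empty(..)));
}
```

Note that each input array only needs to outlive its own Buf, not the Buffer, which is what the <'a: 'b, 'b> signature allows.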
Example
Here is an example parse function that uses Buffer:
use nom::IResult;

// Parse whatever is available and retain the unconsumed bytes for next time.
fn parse(buffer: &mut Buffer, b: &[u8]) -> Option<Vec<String>> {
    let mut buf = buffer.buf(b);
    let mut res = None;
    // If nothing can be parsed yet, keep the entire input.
    let mut len = buf.len();
    if let IResult::Done(rest, vec) = strings(&buf[..]) {
        // Copy the borrowed strings so that they can outlive buf.
        res = Some(vec.into_iter().map(str::to_owned).collect());
        len = rest.len();
    }
    buf.keep(len);
    res
}

#[test]
fn test_partial() {
    let mut buffer = Buffer::new();
    let input = b"3foo3bar3baz4q";
    let expect = vec!["foo", "bar", "baz", "quux"];
    let res = parse(&mut buffer, input);
    assert_eq!(None, res);
    let res = parse(&mut buffer, b"uux").unwrap();
    assert_eq!(expect, res);
}
Note that parse returns an optional Vec of String, not &str. The lifetime of the return value is longer than the lifetime of the Buf, so a copy is necessary. Additionally, the call to buf.keep(..) may shrink the buffer, invalidating any references to its contents.