I am looking for something similar to SHM (SHared Memory) SPSC queue setup offered by boost::lockfree::spsc_queue and boost::interprocess but without allocating strings and storing them flat i.e. next to each other for maximum efficiency.
If I understand correctly that setup stores strings offset in the queue and allocates memory for the string somewhere else in the SHM.
Queue design can be:
| size | string 1 | size | string 2 | size | string 3 | ...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SHM segment
in a circular buffer fashion. Idea:
struct Writer {
std::byte *shm;
void write(std::string_view str) {
// write size
const uint32_t sz = str.size();
std::memcpy(shm, &sz, sizeof(sz));
shm += sizeof(sz);
// write string
std::memcpy(shm, str.data(), sz);
shm += sz;
}
};
tl;dr
It is possible, although you'll have to deal with a bunch of extra edge-cases.
(see bottom of this post in case the link goes bad)
1. Assumptions
Given the information you provided i'm going to assume you want the following properties for the single-producer, single-consumer (spsc) queue:
(this is also the case for
boost::lockfree::spsc_queue)(like
boost::lockfree::spsc_queue)(like
boost::lockfree::spsc_queue, but not like most stuff inboost::interprocess(e.g.boost::interprocess::message_queueutilizes mutexes))2. Potential alternatives
Depending on your requirements there could be a few alternative options:
Fixed-length strings:
Like @John Zwinck suggested in his answer you could use a fixed-length string buffer.
The trade-off would be that your maximum string length is bounded by this size, and - depending on your expected variation of possible string sizes - might result in a lot of unused buffer space.
If you go this route i'd recommend you to use
boost::static_string- it's essentially astd::stringwith the dynamic allocation stuff removed and solely relying on its internal buffer.i.e.
boost::lockfree::spsc_queue<boost::static_string<N>>, whereNis the maximum size for the string values.Only store pointers in the queue and allocate the strings separately:
If you're already using
boost::interprocessyou could use aboost::interprocess::basic_stringwith aboost::interprocess::allocatorthat allocates the string separately in the same shared memory region.Here's a answer that contains a full example of this approach (it even uses
boost::lockfree::spsc_queue) (direct link to code example)The trade-off in this case is that the strings will be stored somewhere outside the spsc queue (but still within the same shared memory region).
If your strings are relatively long this might even be faster than storing the strings directly within the queue (the ring buffer can be alot smaller if it only needs to store pointers, and therefore would have a much better cache-locality).(cache locality won't help in this case - see this excellent comment from @Peter Cordes)
3. Design Considerations
3.1 Wrapped-around writes
A ringbuffer with fixed-size elements essentially splits the raw buffer into element-sized slots that are contiguous within the buffer, e.g.:
This automatically ensures that all objects within the buffer are contiguous. i.e. you never have to deal with a wrapped-around object:
If the element-size is not fixed however, we do have to handle this case somehow.
Assume for example an empty 8-byte ringbuffer with the read and write pointer on the 7th byte (the buffer is currently empty):
If we now attempt to write
"bar"into the buffer (prefixed by it's length), we would get a wrapped-around string:There are 2 ways to deal with this:
3.1.1 Making it part of the interface
The 1st option would be the easiest to implement, but it would be quite cumbersome to use for the consumer of the queue, because it needs to deal with 2 separate pointers + sizes that in combination represent the string. i.e. a potential interface for this could be:
Having to deal with 2 pointers and 2 sizes all the time for the odd case of a wrapped-around write is not the best solution imho, so i went with option 2:
3.1.2 Prevent wrap-arounds
The alternative option would be to prevent wrap-arounds from occuring in the first place.
A simple way to achieve this is to add a special "wrap-around" marker that tells the reader to immediately jump back to the beginning of the buffer (and therefore skip over all bytes after the wrap-around marker)
Example writing "bar": (
WArepresents a wrap-around marker)So once the reader tries to read the next element it'll encounter the wrap-around marker. This instructs it to directly go back to index 0, where the next element is located:
This technique allows all strings to be stored contiguously within the ring buffer - with the small trade-off that the end of the buffer might not be fully utilized and a couple extra branches in the code.
For this answer i chose the wrap-around marker approach.
3.2 What if there's no space for a wrap-around marker?
Another problem comes up once you want a string-size that's above 255 - at that point the size needs to be larger than 1 byte.
Assume we use a 2 byte-length and write "foo12" (length 5) into the ring buffer:
so far so good, but as soon as the read pointer catches up we have a problem:
there is only a single byte left to write before we need to wrap around, which is not enough to fit a 2-byte length!
So we would need to wrap-around the length on the next write (writing "foo" (length 3) into the ringbuffer):
There are three potential ways this could be resolved:
The downside to this approach is that it introduces a bunch more branches into the code (a simple memcpy for the size won't suffice anymore), it makes aligning strings more difficult and it goes against the design we chose in 3.1.
This would allow us to place a wrap-around marker and prevent wrapped-around string sizes: The downside with this approach is that it's rather difficult to differentiate between a wrap-around marker and an actual string-size. We also would need to probe the first byte of each length first to check if it's a wrap-around marker before reading the full length integer.
3.3 Destructive interference
Depending on how fast your reader is in comparison to your writer you might have a bit of destructive interference within the ring buffer.
If for example your L1 cache line size is 128 bytes, using 1-byte lengths for the strings in the ring-buffer, and only pushing length 1 strings, e.g.:
Then this would result in +/- 64 string entries being stored on the same cache line, which are continously written by the producer while they get read from the consumer => a lot of potential interference.
This can be prevented by padding the strings within the ring buffer to a multiple of your cache line size (in C++ available as
std::hardware_destructive_interference_size)i.e. strings padded to 4-bytes:
The trade-off here is of course that this will potentially waste a lot of space within the ring buffer.
So you'll have to profile how much padding you want for your string values.
The padding value you choose should be between those two values:
1 <= N <= std::hardware_destructive_interference_sizestd::hardware_destructive_interference_size=> worst space utilization, no potential interference4. Implementation
This is a full implementation of a wait-free, string only, spsc fifo queue - based on the design considerations listed above.
I've only implemented the bare minimum interface required, but you can easily create all the utility functions
boost::lockfree::spsc_queueprovides around those 3 core functions:godbolt
size_typeis the integral type that is used to store the length of the strings within the buffer, i.e. if you useunsigned chareach string could be at most 254 bytes in length (withunsigned short(assuming 2 bytes) it would be 65534, etc...) (the maximum value is used as a wrap-around marker)paddingis the alignment that's used for the string values. If it is set to 1 then strings will be packed as tightly as possible into the ring buffer (best space utilization). If you set it tostd::hardware_destructive_interference_sizethen there will be no interference between different string values in the ring buffer, at the cost of space utilization.Usage example: godbolt
boost::lockfree::spsc_queue::consume_alle.g. could be implemented like this (in terms of the 3 functions provided by this minimal implementation):Full implementation: godbolt