Best data store w/ full-text search for lots of small documents? (e.g. a Splunk-like system)

We are speccing out a system that will index and store zillions of Syslog messages. These are text messages with a few attributes (system name, date/time, message type, message body) that are typically 100 to 1500 bytes each.

We generate 2 to 10 GB of these messages per day, and need to retain at least 30 days of them.
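
(Back-of-the-envelope, from the numbers above: at 100 to 1500 bytes per message, 2 to 10 GB/day is roughly 1 to 100 million messages per day, and a 30-day window is 60 to 300 GB of raw text before any index overhead.)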

Splunk has a really great indexing and document-compression system.

What to use?

I thought of MongoDB, but it seems inappropriate for documents of this small size.

SQL Server is a possibility, but it seems like it may not be very efficient for this purpose.

Text files with Lucene? The Windows file system doesn't always like directories with zillions of files.

Suggestions?

Thanks!

There are 5 answers below.

BEST ANSWER

I thought of MongoDB, but it seems inappropriate for documents of this small size.

There's a company called Boxed Ice that actually builds a server monitoring system using MongoDB. I would argue that it's definitely appropriate.

These are text messages with a few attributes (system name, date/time, message type, message body) that are typically 100 to 1500 bytes each.

From a MongoDB perspective, we would say that you are storing lots of small documents with a few attributes. In a case like this, MongoDB has several benefits (see the sketch after the list):

  1. It can handle changing attributes seamlessly.
  2. It will flexibly handle different types.
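
A minimal sketch of what that looks like with the pymongo driver (the database, collection, and field names here are illustrative assumptions, not from the question):

    # Sketch: storing syslog messages as small MongoDB documents.
    # Assumes a local mongod and the pymongo driver; names are illustrative.
    from datetime import datetime, timezone

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    logs = client.syslog.messages

    # No fixed schema: a new attribute can appear on some messages
    # without any migration of the existing ones.
    logs.insert_one({
        "host": "web01",
        "ts": datetime.now(timezone.utc),
        "type": "auth.info",
        "body": "Accepted publickey for deploy from 10.0.0.5",
    })

    # Index the attributes you filter on so queries stay fast.
    logs.create_index([("host", 1), ("ts", -1)])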

We generate 2 to 10 GB of these messages per day, and need to retain at least 30 days of them.

This is well within the range of data that MongoDB can handle. There are several different methods of handling the 30-day retention period, and the right one will depend on your reporting needs. I would poke around on the groups for ideas here.
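
For example, one way to get the 30-day retention (a sketch continuing the snippet above, and assuming a MongoDB recent enough to have TTL indexes, i.e. 2.2+; capped collections are the other classic option for logs):

    # Sketch: have MongoDB expire each document ~30 days after its "ts".
    # Continues the pymongo snippet above; TTL indexes need MongoDB 2.2+.
    logs.create_index("ts", expireAfterSeconds=30 * 24 * 3600)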

Based on the people I've worked with, this type of insert-heavy logging is one of the places where Mongo tends to be a very good fit.

ANSWER

I would strongly consider using something like Lucene or Solr.

Lucene is built specifically for full-text search and provides a ton of additional features that you may find useful in your application. As a bonus, Solr is dead simple to set up and configure. (And it's super fast for searching.)
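
For instance, once a core exists, indexing a message is a single HTTP POST to Solr's JSON update handler (a sketch; the core name "syslog" and the field names are assumptions, and the core's schema must define them):

    # Sketch: index one syslog message into a Solr core named "syslog".
    # Assumes a local Solr with that core and matching schema fields.
    import requests

    doc = {
        "id": "web01-2015-06-01T12:00:00Z-0001",
        "host": "web01",
        "ts": "2015-06-01T12:00:00Z",
        "type": "auth.info",
        "body": "Accepted publickey for deploy from 10.0.0.5",
    }
    resp = requests.post(
        "http://localhost:8983/solr/syslog/update?commit=true",
        json=[doc],  # the update handler accepts a JSON array of docs
    )
    resp.raise_for_status()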

They do not keep a file per entry, so you shouldn't have to worry much about zillions of files.

None of the free database options specialize in full-text search. Don't try to force them to do what you want.

ANSWER

Graylog2 is an open-source log management tool built on top of MongoDB. I believe Loggly, a logging-as-a-service provider, also uses MongoDB as its backend store. So there are quite a few products using MongoDB for logging.

It should be possible to store the ngrams returned by a Lucene analyzer for better text searching. I am not sure about the feasibility, though, given the large number of documents. What is the primary reporting use case?
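
To make the ngram idea concrete, here is a rough sketch (plain Python standing in for a Lucene ngram analyzer) of the kind of terms it would emit for substring matching:

    # Sketch: character trigrams for a message body, roughly the terms a
    # Lucene ngram analyzer would produce to support substring search.
    def trigrams(text, n=3):
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(trigrams("kernel panic"))
    # ['ker', 'ern', 'rne', 'nel', 'el ', 'l p', ' pa', 'pan', 'ani', 'nic']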

ANSWER

It seems that you want something like a full-text search server for MongoDB, which would let you search on different attributes without losing performance. You may try MongoLantern: http://sourceforge.net/projects/mongolantern/. It's still in the alpha stage, but it gives very good results for me with 5M records.

Let me know whether this serves your purpose.

ANSWER

I think you should deploy your own (intranet-wide) stack of Grafana, Logstash and Elasticsearch.

Once set up, you have a flexible schema, retention, and a wonderful UI for your data with Grafana.
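
A minimal sketch of the ingest side with the official Python client (8.x-style API; the index naming and fields are assumptions, and in practice Logstash or a syslog input would do this for you):

    # Sketch: index one syslog message into a daily Elasticsearch index.
    # With daily indices, 30-day retention is just deleting old indices.
    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    now = datetime.now(timezone.utc)
    es.index(
        index=f"syslog-{now:%Y.%m.%d}",
        document={
            "host": "web01",
            "@timestamp": now.isoformat(),
            "type": "auth.info",
            "body": "Accepted publickey for deploy from 10.0.0.5",
        },
    )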