MIW

MIW is short for Mobile Investigation Workstation. MIW is a tool for the fast summarization and analysis of very large quantities of logs. It is written in C++11 and provides some small Python tooling.

MIW was built as an extremely efficient single-machine map-reduce alternative to Hadoop for processing hundreds of GB of logs. It is especially useful for on-premises log analysis, for generating condensed analytics out of raw logs, and for computing features useful in Machine Learning applications.

Though rather young in its development, MIW has already been integrated into several industrial projects and run over billions of logs, on machines ranging from laptops to HPC systems.

Main functionalities: at the moment, MIW implements the following features:

  • C++ map-reduce built as an extension of Metis, with support for multiple input files in the same task
  • auto-splitting of files based on available RAM
  • command line job launcher
  • configurable input log and job control formats in JSON
  • variety of outputs supported, from memory to JSON and CSV
  • variety of configuration options for preprocessing common log fields such as dates, time, URLs, ...
  • minimal Python job control utility

Dependencies:

  • C++11 compiler + autotools
  • protocol buffers for configuration and storage;
  • boost for network and tokenizer functionalities;
  • glog for logging events and debug;
  • gflags for command line parsing;
  • jsoncpp for JSON output;
  • gtest for unit testing (optional);
  • cppnetlib for preprocessing URIs;
  • snappy for log compression;
  • libcurl for connecting to external applications.

Implementation:

Authors

MIW is designed and implemented by Emmanuel Benazera around the C++ map-reduce library Metis, on behalf of SopraSteria cybersecurity.

Build

Below are instructions for Linux systems:

First, install the dependencies:

sudo apt-get install autotools-dev automake autoconf libtool pkg-config libprotobuf-dev protobuf-compiler python-protobuf libjsoncpp-dev libgoogle-glog-dev libgflags-dev libsnappy-dev libcurl4-openssl-dev libcppnetlib-dev python-simplejson

Then compile:

./autogen.sh
./configure
make

Documentation

Running the main command-line executable with the --help flag:

./app/miw --help

yields the list of options:

Flags from job.cc:
-appname (optional application name) type: string default: ""
-autosplit (whether to autosplit file based on available memory)
	   type: bool default: false
-compressed (whether to compress the original content) type: bool
	   default: false
-fnames (comma-separated input file names) type: string default: ""
-format_name (processing format name) type: string default: ""
-map_tasks (number of map tasks (default = auto)) type: int32 default: 0
-memory_factor (heuristic value for autosplit of very large files,
		representing the expected memory requirement ratio vs the size of the
		file, e.g. 10 times more memory than log volume) type: double default: 10
-merge_results (whether to merge results over multiple input files)
	       type: bool default: false
-ndisp (number of top records to show) type: int32 default: 5
-nprocs (number of cores (default = auto)) type: int32 default: 0
-ofname (output file name) type: string default: ""
-output_format (output format (json, csv)) type: string default: ""
-quiet (quietness) type: bool default: true
-reduce_tasks (number of reduce tasks (default = auto)) type: int32 default: 0
-skip_header (whether to skip first log line file as header) type: bool default: false
-store_content (whether to store the original content in the processed output) type: bool default: false
-tmp_save (whether to save temporary output of results after each file is processed) type: bool default: false
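
The -memory_factor description above suggests a simple sizing rule for -autosplit: multiply the file size by the factor to estimate the memory a job needs, then split the file into enough pieces that each one fits in the available RAM. The sketch below only illustrates that arithmetic. The function name `estimate_chunks` and the exact rounding are assumptions, and the actual computation in job.cc may differ.

```python
import math


def estimate_chunks(file_size_bytes, available_ram_bytes, memory_factor=10.0):
    """Estimate how many pieces a log file would be split into.

    Hypothetical helper: mirrors the -memory_factor description
    (expected memory = memory_factor * file size), not MIW's real code.
    """
    if file_size_bytes < 0:
        raise ValueError("file_size_bytes must be non-negative")
    if available_ram_bytes <= 0 or memory_factor <= 0:
        raise ValueError("available_ram_bytes and memory_factor must be positive")
    # Expected memory footprint of processing the whole file at once.
    needed_bytes = file_size_bytes * memory_factor
    # At least one piece, even for an empty or very small file.
    return max(1, math.ceil(needed_bytes / available_ram_bytes))


if __name__ == "__main__":
    gib = 2 ** 30
    # A 4 GiB log with the default factor of 10 on a 16 GiB machine.
    print(estimate_chunks(4 * gib, 16 * gib))
```

Under this assumption, raising -memory_factor produces more, smaller pieces, which trades some throughput for a lower risk of exhausting memory.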

Example with a sample of data from the repository:

./app/miw -fnames data/web_proxy_10lines.log -format_name miw/formats/proxy_format -output_format csv -ofname test.csv

should yield:

files=data/web_proxy_10lines.log
I1203 16:06:42.315526 21243 job.cc:122] files size=1
I1203 16:06:42.315587 21243 job.cc:127] Processing file=data/web_proxy_10lines.log

logs preprocessing: results (TOP 5 from 2 keys, 4 logs):
                         2012-11-30_23_NqO3SB - 2
			 2012-11-30_23_- - 2

Runtime in millisecond [4 cores]
	Sample:	       15 Map:	0	Reduce:	0	Merge:	0	Sum:	15	Real:	16
Number of Tasks of last Metis run
	Sample:	0  Map:	6     Reduce:	67
I1203 16:06:42.330492 21243 job.cc:180] MR duration=0 seconds	       

The above call uses an existing format file. However, the provided formats are unlikely to match your logs. To generate your own log format:

  • Write a JSON file modeled on the files in miw/formats
  • Convert it into a protocol buffer:
python format_json2pb.py yourformatfile.json yourformatfile.fmt
  • Use the format in a call: ./app/miw -fnames yourlogfile.log -format_name yourformatfile -output_format csv -ofname test.csv

Note that the .fmt extension is omitted from the value passed to -format_name.
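
When jobs are launched from a script, the command line from the steps above can be assembled programmatically. The following is a minimal sketch, not the Python job control utility shipped with MIW: `build_miw_command` is a hypothetical helper, and it only uses the -fnames, -format_name, -output_format and -ofname flags listed by --help. It joins multiple input files with commas and strips a trailing .fmt extension from the format name.

```python
import os
import shlex


def build_miw_command(log_files, format_file, output_file,
                      output_format="csv", miw_binary="./app/miw"):
    """Build the argument list for a miw call (hypothetical helper)."""
    if not log_files:
        raise ValueError("at least one log file is required")
    if output_format not in ("json", "csv"):
        raise ValueError("output_format must be 'json' or 'csv'")
    # -format_name expects the format path without its .fmt extension.
    base, ext = os.path.splitext(format_file)
    format_name = base if ext == ".fmt" else format_file
    return [
        miw_binary,
        "-fnames", ",".join(log_files),  # comma-separated input file names
        "-format_name", format_name,
        "-output_format", output_format,
        "-ofname", output_file,
    ]


if __name__ == "__main__":
    cmd = build_miw_command(["yourlogfile.log"], "yourformatfile.fmt", "test.csv")
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell quoting problems when file names contain spaces.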

Log Formats

The log formats are described in JSON; see the examples in miw/formats. Each format describes a log as a CSV-like file in which every column can be processed through a set of operations, from basic counts to aggregations and the preprocessing of times, dates and URLs.

At this point, the best way to learn the built-in possibilities is to study the JSON files in miw/formats.
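
To make the idea of per-column processing concrete, here is a small sketch of the simplest operation mentioned above, a count per key, applied to CSV-like log lines. It is only a conceptual illustration in plain Python: it does not use MIW's format schema or its map-reduce engine, and the function `top_keys`, the column layout and the sample rows are all made up for the example.

```python
import csv
import io
from collections import Counter


def top_keys(log_text, key_columns, ndisp=5, delimiter=","):
    """Count log lines per key and return the ndisp most frequent keys.

    The key is built by joining the selected columns with '_'.
    Hypothetical illustration, unrelated to MIW's internal code.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(log_text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        key = "_".join(row[i] for i in key_columns)
        counts[key] += 1
    return counts.most_common(ndisp)


if __name__ == "__main__":
    # Made-up sample: date, hour, HTTP method.
    sample = (
        "2012-11-30,23,GET\n"
        "2012-11-30,23,POST\n"
        "2012-11-30,23,GET\n"
        "2012-11-30,22,GET\n"
    )
    for key, count in top_keys(sample, key_columns=[0, 1, 2]):
        print(key, "-", count)
```

The `ndisp` parameter plays the same role as the -ndisp flag: it limits how many top records are shown.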

Run tests

There are examples of unit tests in tests/ut-mr-parsing.cc. Edit the file as needed to use your own formats and logs, then run:

make ut_mr_parsing
./ut_mr_parsing