Recently, I had to generate some box-and-whisker plots to present the results of my work on my master's thesis. A Google search revealed no complete solution that fully fit my expectations and needs. My colleague, Stefan, presented a solution on his blog, but that script still wasn't what I was looking for.
Naturally, I wanted a script that presents a set of results from an experiment. Stefan used quartiles and the median; I preferred the standard deviation and the average. Moreover, I wanted the script to be configurable, at least by passing the path to the directory where the result files are located.
And that's how my funky-fresh-and-supercool-Ruby-script was born. Let me guide you through what I think are its most interesting parts. If you only need to see the whole script, scroll down to the end of this post, where you'll find a link to download it :-).
I decided to let the user pass two arguments. The first is the directory (explained earlier); the second is optional and is a filename pattern (if it's not present, the default is "*.txt"). The script simply picks only those files from the given directory that match the specified filename pattern. Thanks to Ruby's magic, it's as simple as that:
# change to the given directory
Dir.chdir(directory)
# find the files matching the filename pattern
files = Dir.glob(filename_pattern)
if files.length == 0
  puts 'There are no files matching specified filename pattern'
  Process.exit!
end
...
files.each do |file|
...
end
Well, I know that Process.exit! may not be the nicest way to exit a script, but it works like a charm ;-).
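For readers who would rather not change the working directory, the argument handling and file selection described above could be sketched roughly like this. The helper names `parse_arguments` and `matching_files` are hypothetical (they are not taken from the actual script), and the `base:` keyword of `Dir.glob` assumes Ruby 2.5 or newer.

```ruby
# Hypothetical helper: splits the command-line arguments into
# [directory, filename_pattern], falling back to "*.txt" when no
# pattern is given.
def parse_arguments(argv)
  directory = argv[0]
  raise ArgumentError, 'usage: script.rb DIRECTORY [FILENAME_PATTERN]' if directory.nil?

  [directory, argv[1] || '*.txt']
end

# Hypothetical helper: returns the sorted names of the files in
# `directory` that match `pattern`, without calling Dir.chdir.
def matching_files(directory, pattern)
  Dir.glob(pattern, base: directory).sort
end
```

In a script this might be used as `directory, pattern = parse_arguments(ARGV)`, followed by `abort('There are no files matching specified filename pattern') if matching_files(directory, pattern).empty?`. `abort` writes the message to standard error and exits with a failure status, and unlike `Process.exit!` it still runs any `at_exit` handlers.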
The result files that this script processes have to be in a specific format:
iteration_number value
for every line. For example:
1 234.6
2 324.3
3 4.55
and so on.
My algorithm generates results in an unusual way: the result files may not all have the same number of lines. That's why the script counts the lines in every file and takes the minimum of those counts. Only that many iterations are processed, so we're sure that every iteration has the same number of values.
Next, the script calculates the mean and the standard deviation, and keeps track of the global minimum and maximum of the results, which are used and explained later.
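The reading and line-count check could look roughly like the sketch below. Both helper names are hypothetical and not taken from the actual script; the sketch assumes every non-blank line has the form "iteration_number value" and raises an error on anything else.

```ruby
# Hypothetical helper: reads one result file and returns its values
# (the second column) as floats, skipping blank lines.
def read_values(path)
  File.readlines(path).map(&:split).reject(&:empty?).map { |fields| Float(fields[1]) }
end

# Hypothetical helper: cuts every series down to the length of the
# shortest one, so each iteration has exactly one value per file.
def truncate_to_shortest(series)
  min_lines = series.map(&:length).min || 0
  series.map { |values| values.first(min_lines) }
end
```

After truncation, `series.transpose` yields one array per iteration, holding that iteration's value from every file.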
Two interesting things here:
1) Because of the earlier call to Dir.chdir(directory), we're still inside this directory, so invoking
output_file = File.new('output.dat', 'w')
will create the file inside this directory.
2) Thanks to Drew Olson's "5 things you can do with a Ruby array in one line", I came up with these lines:
sum = tab.inject { |s, item| s + item }
for summing up the array, and
variance = ((tab.map { |item| (item - average) ** 2 }).inject { |s, item| s + item }) / file_counter
for computing the variance. Nice, huh? :)
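Put together, the calculation for one iteration might look like the following sketch. The method name `average_and_deviation` is made up for illustration, and `Array#sum` assumes Ruby 2.4 or newer.

```ruby
# Returns [average, standard_deviation] for a non-empty array of numbers.
# The variance is divided by the number of values (population variance),
# matching the division by file_counter in the one-liner above.
def average_and_deviation(values)
  count = values.length.to_f
  average = values.sum / count
  variance = values.sum { |item| (item - average)**2 } / count
  [average, Math.sqrt(variance)]
end
```

For example, `average_and_deviation([2, 4, 4, 4, 5, 5, 7, 9])` returns `[5.0, 2.0]`.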
I decided to print out the last two values, the average and the standard deviation. I needed them to compare my algorithm's best results for different parameters. Comment those lines out or just delete them if you don't need them.
Now, to create a nice plot (I needed a logarithmic scale for the lowest, latest values), I wrote a simple piece of code that finds the two consecutive powers of ten surrounding a given value. OK, that's not so obvious, I know, but I think this code explains what I did:
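# Start with the range [1, 10) and shift it by factors of ten until
# first_smaller <= global_minimum < first_bigger.
# Using <= and >= keeps the loop from cycling forever when
# global_minimum is an exact power of ten; it must still be positive.
first_bigger = 10.0
first_smaller = 1.0
while !(first_bigger > global_minimum && first_smaller <= global_minimum)
  if global_minimum >= first_bigger
    first_bigger *= 10.0
    first_smaller *= 10.0
  else
    first_bigger /= 10.0
    first_smaller /= 10.0
  end
end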
As you can see, if the global minimum is, for example, 0.5, this code produces the range (0.1, 1), and if the global minimum is 45, it gives (10, 100). Currently I'm only using the first_smaller value, but first_bigger may also be useful for modifications, so I left it there.
Finally, the script produces the gnuplot script file, which looks like this inside:
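The same range can also be computed without a loop by taking the base-10 logarithm. This is only an alternative sketch, not what the script does; the method name `power_of_ten_range` is made up for illustration.

```ruby
# Hypothetical helper: returns [first_smaller, first_bigger], the two
# consecutive powers of ten with first_smaller <= value < first_bigger.
def power_of_ten_range(value)
  raise ArgumentError, 'value must be positive' unless value.positive?

  exponent = Math.log10(value).floor
  [10.0**exponent, 10.0**(exponent + 1)]
end
```

Here `power_of_ten_range(45)` returns `[10.0, 100.0]`. For values extremely close to an exact power of ten, the result depends on the precision of `Math.log10`, so treat this as a sketch rather than a drop-in replacement.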
set terminal png size 1280,1024
set output "output.png"
set boxwidth 0.2 absolute
set yrange [ #{first_smaller} : #{global_maximum.ceil} ]
set xrange [ 0 : #{min_lines + 10} ]
set log y
plot 'output.dat' every 5 using 1:3:2:6:5 with candlesticks lt 3 lw 2 notitle whiskerbars, '' using 1:4:4:4:4 with candlesticks lt -1 lw 2 notitle
Of course, every #{value} is replaced with the proper value. For an explanation of all these enigmatic gnuplot options (maybe except the output image resolution ;-)), I have to send you to the gnuplot documentation page.
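One way to produce such a file from Ruby is a heredoc with interpolation, sketched below. The method name `gnuplot_script` is hypothetical, and the squiggly heredoc (`<<~`) assumes Ruby 2.3 or newer. As a hint for reading the plot command: in gnuplot's candlesticks style the five columns after `using` are x, box_min, whisker_min, whisker_high and box_high, so `1:3:2:6:5` takes the box edges from columns 3 and 5 and the whiskers from columns 2 and 6, while `1:4:4:4:4` draws a zero-height box, that is, a horizontal line, at the value in column 4.

```ruby
# Hypothetical helper: builds the text of the gnuplot script shown above,
# filling in the axis ranges computed earlier.
def gnuplot_script(first_smaller, global_maximum, min_lines)
  <<~GNUPLOT
    set terminal png size 1280,1024
    set output "output.png"
    set boxwidth 0.2 absolute
    set yrange [ #{first_smaller} : #{global_maximum.ceil} ]
    set xrange [ 0 : #{min_lines + 10} ]
    set log y
    plot 'output.dat' every 5 using 1:3:2:6:5 with candlesticks lt 3 lw 2 notitle whiskerbars, '' using 1:4:4:4:4 with candlesticks lt -1 lw 2 notitle
  GNUPLOT
end
```

The result could then be saved with something like `File.write('gnuplot_script', gnuplot_script(0.1, 56.3, 100))`.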
And that's it! The script assumes that the user has read and write permissions for the given directory, so be sure you set them. It produces two files: output.dat and gnuplot_script. The format of the first one is:
iteration_number minimum_value average-standard_deviation standard_deviation average+standard_deviation maximum
for every line.
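As a small illustration, one line in that format could be assembled like this. The method name `data_row` is hypothetical; the column order simply follows the list above.

```ruby
# Hypothetical helper: formats one line of output.dat in the column
# order listed above, with the columns separated by single spaces.
def data_row(iteration, minimum, average, deviation, maximum)
  [iteration, minimum, average - deviation, deviation, average + deviation, maximum].join(' ')
end
```

For instance, `data_row(1, 1.0, 5.0, 2.0, 9.0)` returns `"1 1.0 3.0 2.0 7.0 9.0"`.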
To make gnuplot create the output.png file with the plot, simply go to the directory you passed to the script and type:
?> gnuplot gnuplot_script
You can download this script from here, and you can do whatever you want with it. The only thing I ask for is a note saying where you got it.
Voila! :)