BFFT Techblog December: H2O, when should I go to work?
12/01/2016


Topic: How to easily develop a prototype for a machine-learning application with H2O

 

Motivation
Every day on my way to work, I get the impression that the traffic has become even denser. This raises the question: is there an optimal time to leave for work early in the morning?

As a data science enthusiast, I figured this should be a relatively simple question to answer. Thanks to digitalization everywhere, many webcams can be found along motorways, just waiting to be evaluated more closely.

Preprocessing

For this task, the BayernInfo web page can be used for data collection [1]. The images of the webcam are updated approximately every 60 seconds. I therefore set up a cronjob for a period of two weeks; this job saves the current image to disk every 5 minutes between 7 and 10 a.m.
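Such a cronjob could look roughly like this; the target directory and the webcam URL are placeholders, not the ones actually used:

```
# crontab entry (sketch): every 5 minutes between 07:00 and 09:55,
# fetch the current webcam image and save it under a timestamped name
# (note: % must be escaped as \% inside a crontab)
*/5 7-9 * * * curl -s -o /data/webcam/$(date +\%Y\%m\%d-\%H\%M\%S).jpg http://example.org/webcam.jpg
```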

All further processing and analyses were carried out using R [2]. Using the imager package [3], we can easily process and display images. A sample image looks like this:

(Figure: sample webcam image)

Furthermore, we can keep only the left side of the image, which shows the direction towards Ingolstadt.

library(imager)

## convert to gray scale, drop the top 32 pixel rows
## and keep only the left 320 pixel columns
img <- imsub(grayscale(img), y > 32, x <= 320)
plot(img)

(Figure: cropped gray-scale image, direction Ingolstadt)

A very common procedure here, called sliding windows, is to decompose the image in question into sub-images [4][5][6].

So we divide each image into 16 sub-images. Below we see on the left a sub-image without any car on it and on the right a sub-image with one car in it.

imgTile1 <- imsub(img, y > 224 & y <= 336, x > 0 & x <= 80)
imgTile2 <- imsub(img, y > 224 & y <= 336, x > 80 & x <= 160)

layout(t(1:2))
plot(imgTile1)
plot(imgTile2)

(Figure: sub-image without a car, left; sub-image with one car, right)

In the following, we assign each sub-image to one of the categories 0, 1 and 2. Category 0 means that no vehicle is recognizable in the picture, category 1 that exactly one vehicle is recognizable, and category 2 that two or more vehicles can be identified.

This allows the number of vehicles to be determined for each sub-image. By adding up the vehicles on the 16 sub-images, we can estimate the approximate number of vehicles on the complete image. It should be noted that the perspective in the image is distorted: more vehicles can fit into a sub-image in the upper right than, for example, in the lower left, because the vehicles appear smaller there. It would therefore be better to divide the area in the upper right even more finely. For a first approximation, however, the presented division should be sufficient.

This procedure allows the task to be subdivided into several sub-tasks. The main advantage is that during prediction the sub-images can be processed in parallel. All images are converted to gray scale, since the color of a vehicle does not matter for its recognition.
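The decomposition into 16 sub-images can be sketched as follows. A synthetic 320×448 gray-scale image stands in here for the real webcam frame, and the tile size of 80×112 pixels is inferred from the coordinates in the examples above:

```r
library(imager)

## synthetic gray-scale image in place of the cropped webcam frame
img <- as.cimg(array(runif(320 * 448), dim = c(320, 448, 1, 1)))

## cut the image into a 4x4 grid of 80x112 pixel tiles
tile_w <- 80
tile_h <- 112
tiles <- list()
for (j in 0:3) {        ## tile rows (y direction)
  for (i in 0:3) {      ## tile columns (x direction)
    tiles[[length(tiles) + 1]] <-
      imsub(img,
            x > i * tile_w & x <= (i + 1) * tile_w,
            y > j * tile_h & y <= (j + 1) * tile_h)
  }
}

length(tiles)  ## 16 tiles, which can be classified independently
```

Each tile is then fed to the classifier on its own, which is what makes the parallel processing mentioned above possible.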

Training Phase
For training, the images of one day are taken from the two-week observation period. These images are labeled manually.

We place the labeled sub-images in the folders "labeled/0", "labeled/1" and "labeled/2". For example, we can then add the labeled sub-images with one car on them to the training data set as follows:

filelist1 <- list.files(paste0(getwd(), "/data/labeled/1"), full.names = TRUE)
train <- NULL
for (img in filelist1) {
  im.tmp <- load.image(img)
  train <- rbind(
    train,
    data.frame(label = "1", feature = matrix(im.tmp[, , 1, 1], nrow = 1)))
}

If images are loaded with the load.image function, they are stored in R as an array with four dimensions (x, y, z, c). Here, x and y represent the spatial dimensions, z the depth dimension (corresponding to time in a film), and c the color dimension [7]. Since we have a gray-scaled image, z = 1 and c = 1. For further processing, we convert the x×y matrix into a 1×(x·y) matrix.
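This reshaping step can be illustrated on a small synthetic gray-scale tile (the 80×112 tile size from above is assumed):

```r
library(imager)

## synthetic 80x112 gray-scale tile in place of a loaded image
im.tmp <- as.cimg(array(runif(80 * 112), dim = c(80, 112, 1, 1)))
dim(im.tmp)    ## 80 112 1 1 -- the (x, y, z, c) dimensions

## flatten the x-by-y pixel matrix into a single feature row
feature <- matrix(im.tmp[, , 1, 1], nrow = 1)
dim(feature)   ## 1 8960, i.e. 1 x (80 * 112)
```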

After that, we can train our model with H2O [8]. H2O is open-source software which offers algorithms for statistics, data mining and machine learning in the context of big data. H2O is implemented in Java. In addition, H2O provides REST APIs, so it is also possible to use H2O from JavaScript, R, Python, Excel/Tableau and Flow, a notebook-style user interface.

(Figure: H2O software stack)

Source: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html#h2o-software-stack

Furthermore, H2O can be used via Spark or Hadoop interfaces. H2O offers many different machine-learning models such as the Generalized Linear Model (GLM), the Gradient Boosting Machine (GBM) or the Random Forest (RF). Even though a deep learning approach is normally used for image recognition, a random forest often provides good results as well, so we should try it in this case:

library(h2o)

## start H2O; use more threads on a multicore server
h2o.init(nthreads = 1)

## import the training data set into H2O
train.hex <- as.h2o(train)

## train a random forest with 10-fold cross-validation to avoid
## overfitting; the first column is the target, the remaining
## columns are the features
model.rf <- h2o.randomForest(y = 1, x = 2:ncol(train.hex),
                             training_frame = train.hex,
                             ntrees = 200, nfolds = 10)

Let’s take a look at the model performance, which H2O delivers us right away:

library(caret)
confusionMatrix(as.table(as.matrix(
  model.rf@model$cross_validation_metrics@metrics$cm$table[1:3, 1:3])))

## Output: Confusion Matrix and Statistics
## 
##     0   1   2
## 0 440   2   3
## 1  46  24   8
## 2   5  10  54
## 
## Overall Statistics
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.8456, 0.9006)
##     No Information Rate : 0.8294          
##     P-Value [Acc > NIR] : 0.001362        
##                                           
##                   Kappa : 0.6486          
##  Mcnemar's Test P-Value : 6.364e-09       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.8961  0.66667  0.83077
## Specificity            0.9505  0.90288  0.97154
## Pos Pred Value         0.9888  0.30769  0.78261
## Neg Pred Value         0.6531  0.97665  0.97897
## Prevalence             0.8294  0.06081  0.10980
## Detection Rate         0.7432  0.04054  0.09122
## Detection Prevalence   0.7517  0.13176  0.11655
## Balanced Accuracy      0.9233  0.78477  0.90115

In the confusion matrix, the columns show the predicted and the rows the actual values for the categories 0 to 2. This means that in 440 cases 0 was predicted correctly, but in 46 cases 1 and in 5 cases 2 would have been the correct prediction; here the model was wrong. The calculated accuracy, which corresponds to the sum of the values on the diagonal divided by the number of all cases, is (440 + 24 + 54) / 592 = 87.5 percent, which does not look too bad.
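The reported accuracy can be reproduced directly from the confusion matrix above:

```r
## confusion matrix from the cross-validation
## (rows: actual category, columns: predicted category)
cm <- matrix(c(440,  2,  3,
                46, 24,  8,
                 5, 10, 54),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = c("0", "1", "2"),
                             predicted = c("0", "1", "2")))

## accuracy = correctly classified cases / all cases
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  ## 0.875
```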

Prediction Phase
Now we are ready to apply our trained model to the data of the remaining days:

## test.hex contains the sub-images of the remaining days,
## prepared in the same way as train.hex
pred <- h2o.predict(object = model.rf, newdata = test.hex)

## shut down the Java process
h2o.shutdown(prompt = FALSE)
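The step from per-tile predictions to a car count per frame is the summation described earlier. A minimal sketch with hypothetical predicted labels for the 16 tiles of one frame could look like this; note that category 2 means "two or more", so the sum is a lower bound:

```r
## hypothetical predicted categories for the 16 tiles of one frame
tile_pred <- factor(c(0, 1, 0, 2,
                      0, 0, 1, 0,
                      0, 0, 0, 1,
                      2, 0, 0, 0), levels = 0:2)

## estimated number of cars on the full frame:
## sum of the numeric category values over all 16 tiles
ncars <- sum(as.integer(as.character(tile_pred)))
ncars  ## 7
```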

For each day of the two-week period, we obtain the estimated number of vehicles between 07:00 and 10:00 o'clock in 5-minute steps.

##                      t ncars  wday
## 1: 2016-09-29 07:00:02     4 Thurs
## 2: 2016-09-29 07:05:02    10 Thurs
## 3: 2016-09-29 07:10:02     7 Thurs
## 4: 2016-09-29 07:15:02     8 Thurs
## 5: 2016-09-29 07:20:01     9 Thurs
## 6: 2016-09-29 07:25:01     3 Thurs

Let’s look at the number of estimated vehicles regardless of the weekday:

library(ggplot2)
p <- ggplot(data = data.cars, aes(t, ncars)) + 
  geom_point() + geom_smooth() + theme_light()
p

(Figure: estimated number of vehicles over the time of day, with smoothing curve)

Between 8 and 9 o’clock in the morning, the fewest vehicles appear to be on the road. But are there differences between the individual weekdays?

p + facet_wrap(~wday)

(Figure: estimated number of vehicles over the time of day, by weekday)

On Monday and Tuesday it seems advisable to go to work as late as possible, at around 9 o'clock. On the remaining weekdays one should rather get up and go between 8 and 9 o'clock.

Conclusion
It is clear that a sample covering an observation period of only two weeks cannot be representative of the morning traffic volume on a particular motorway section. The aim was rather to show how a prototype for a machine-learning application can be developed with H2O at relatively low effort.

H2O was used as the machine-learning framework because it offers wrappers for many languages such as R, Python or Scala. Internally, however, H2O is written in Java. Once an H2O cluster is set up, it can be accessed via a REST API. In addition, H2O integrates with Apache Spark. It therefore seems very suitable for integrating machine learning into a big data application.

[1] http://www.bayerninfo.de/webcams/webcams-all-muenchen-nuernberg

[2] https://www.r-project.org/

[3] https://cran.r-project.org/web/packages/imager/index.html

[4] https://www.coursera.org/learn/machine-learning/lecture/bQhq3/sliding-windows

[5] http://cs229.stanford.edu/proj2012/LiZahr-LearningToRecognizeObjectsInImages.pdf

[6] http://datalab.lu/blog/2012/04/22/machine-learning-for-identification-of-cars/

[7] https://cran.r-project.org/web/packages/imager/imager.pdf

[8] http://www.h2o.ai/

 


Author: Andreas W. (Entwicklung Konzepte & Tooling)
Contact: techblog@bfft.de 
Picture sources: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html#h2o-software-stack; Andreas W.

Topic January 2017: “Node how to Express yourself with Angular”

