Archive for the ‘OSCON07’ category

Machine Learning Made Easy with Perl (the day before)

July 24, 2007

Machine Learning Made Easy with Perl is the name of the session I am giving tomorrow afternoon at OSCON. I really worked hard on this one 🙂 It took me more time than I expected to make machine learning easy 😉 I do not want to spoil the surprise, but the talk is really packed, so if you are attending, do not close your eyes for a second: you might miss one of the pointers that could save your next machine learning project.

There is a small update to the session: I will only be covering “Exploratory financial data analysis using fuzzy clustering” and “Medical decision support systems using support vector machines”. Covering just two case studies leaves room for more in-depth information. Come and see what I mean 🙂

I hope to see many faces there. By the way, I will make the slides and the source code available one week after the talk.

Cheers,

Lino

OSCON 2007: 16 days away

July 7, 2007

Only 16 days separate us from OSCON, and I am still polishing the material for my session 😉 I asked my fellow PerlMonks for feedback on a preliminary version of the presentation’s outline, and as usual the comments were really useful. Based on them, I decided to cut the number of case studies from the three I originally planned down to two. I believe this will give me more time to clearly explain the techniques.

By the way, with this post, I am starting a series of posts in which I show some of the snippets I will be presenting. Here is the first one:

Description:

A common practice in machine learning is to preprocess the data before building a model. One popular preprocessing technique is data normalization. Normalization rescales each variable to a mean of zero and a standard deviation of 1, which helps achieve efficient and numerically precise computation.

In this snippet, I show how to do data normalization using the Perl Data Language (PDL). The input is a piddle (see the note below for a definition) in which each column represents a variable and each row represents a pattern. The output is a piddle in which each variable is normalized to have a mean of 0 and a standard deviation of 1, along with the mean and standard deviation of the input piddle.

What are Piddles?

They are a new data structure defined in the Perl Data Language. As indicated in RFC: Getting Started with PDL (the Perl Data Language):

Piddles are numerical arrays stored in column-major order (meaning that the fastest-varying dimension represents the columns, following computational convention, rather than the rows as mathematicians prefer). Even though piddles look like Perl arrays, they are not. Unlike Perl arrays, piddles are stored in consecutive memory locations, which facilitates passing them to the C and FORTRAN code that handles the element-by-element arithmetic. One more thing to note about piddles is that they are referenced with a leading $.
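
To make that concrete, here is a minimal sketch of my own (not part of the RFC) that creates a small piddle and inspects it:

#!/usr/bin/perl
use warnings;
use strict;

use PDL;

# a 2-column by 3-row piddle built from a Perl array of arrays
my $data = pdl( [ [ 1, 2 ],
                  [ 3, 4 ],
                  [ 5, 6 ] ] );

print $data, "\n";                      # prints the whole piddle
print join( ' ', $data->dims ), "\n";   # dimensions: 2 3
print $data->at( 0, 1 ), "\n";          # element at column 0, row 1: 3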

Code:


#!/usr/bin/perl
use warnings;
use strict;

use PDL;
use PDL::NiceSlice;

# ================================
# normalize
# ( $output_data, $mean_of_input, $stdev_of_input ) =
#     normalize( $input_data )
#
# processes $input_data so that $output_data
# has 0 mean and 1 stdev:
#
# $output_data = ( $input_data - $mean_of_input ) / $stdev_of_input
# ================================
sub normalize {
    my ( $input_data ) = @_;

    # per-variable statistics, computed over the pattern dimension
    my ( $mean, $stdev, $median, $min, $max, $adev )
        = $input_data->xchg( 0, 1 )->statsover();

    # avoid division by zero for constant variables
    my $idx = which( $stdev == 0 );
    $stdev( $idx ) .= 1e-10;

    my ( $number_of_dimensions, $number_of_patterns )
        = $input_data->dims();

    # replicate mean and stdev across all patterns, then normalize
    my $output_data
        = ( $input_data - $mean->dummy( 1, $number_of_patterns ) )
        / $stdev->dummy( 1, $number_of_patterns );

    return ( $output_data, $mean, $stdev );
}
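
And here is a small usage sketch of my own (toy numbers, appended to the same script) showing what a call to normalize looks like:

# toy data: 2 variables (columns) and 4 patterns (rows)
my $input = pdl( [ [ 1, 10 ],
                   [ 2, 20 ],
                   [ 3, 30 ],
                   [ 4, 40 ] ] );

my ( $normalized, $mean, $stdev ) = normalize( $input );

print "normalized:\n", $normalized, "\n";   # each column now has 0 mean and 1 stdev
print "mean:  ", $mean,  "\n";              # per-variable means: [2.5 25]
print "stdev: ", $stdev, "\n";              # per-variable standard deviations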

Machine Learning Made Easy with Perl

June 15, 2007

That is the title of a session I am giving on July 25, 2007 at OSCON. Here is the abstract:

Machine learning is concerned with the development of algorithms and techniques that allow computers to “learn” from large data sets. This talk presents an overview of a number of machine learning techniques and the main configuration issues the participants need to understand to successfully deploy machine learning applications. The talk also covers three case studies in which we will use Perl scripts to solve real life problems:

  1. Medical decision support systems using support vector machines
  2. Exploratory financial data analysis using fuzzy clustering
  3. Pattern recognition in weather data using neural networks

This talk offers an intensive presentation of machine learning terminology, best practices, standard process, and strategy. Participants will get to know the techniques, but more importantly, they will learn when and why to use them. The talk is appropriate for educators and programmers who want to use machine learning in their own problem domains.

I will be posting more details about the session as we get closer to OSCON.

Cheers!

Lino