Expanding the Boundaries of Local Similarity Analysis

Installation & Use


Mac
1. Download fastLSAv1.0(Mac).zip
2. Unzip the downloaded archive
3. Use g++ to compile the fastLSA program from the extracted fastLSA/ directory
$ g++ fastLSA.cpp pnorm.cpp lsaPack.cpp thread.cpp lsaParse.cpp -o fastLSA
Note: g++ is included with Apple's Xcode or is obtainable through popular package managers like Synaptic.


Linux
1. Download fastLSAv1.0(Linux).tar.gz
2. Extract fastLSAv1.0(Linux).tar.gz (for example, with tar -xzf)
3. Use g++ to compile the fastLSA program from the extracted fastLSA/ directory
$ g++ fastLSA.cpp pnorm.cpp lsaPack.cpp thread.cpp lsaParse.cpp -o fastLSA -pthread


Windows
1. Download the Windows source code fastLSAv1.0(Windows).zip
1a. Alternatively, compile a version using g++ for your version of Windows:
> g++ fastLSA.cpp pnorm.cpp lsaPack.cpp thread.cpp lsaParse.cpp -o fastLSA.exe
2. Run fastLSA.exe using the options below

Using fastLSA

Calling fastLSA without any arguments prints usage instructions:
$ ./fastLSA
Usage: ./fastLSA -i inputfile -o outputfile -d N -m f -a f -r N -t N
-i inputfile
-o outputfile
-d maximum time lag
-m minimum LSA value
-a alpha
-r distribution resolution
-t number of threads

Example:
$ ./fastLSA -i data.txt -o output.txt -d 5 -m 0.2 -a 0.0001 -r 1000000 -t 2
Windows:
> fastLSA.exe -i data.txt -o output.txt -d 5 -m 0.2 -a 0.0001 -r 1000000 -t 2

Argument overview
'-i' indicates the input file. No default.

'-o' indicates the output file. No default.

'-d' indicates the size of the lead-lag window in the unit of time steps. It bounds the lag by absolute value. For example, an argument of '-d 5' allows for lags from -5 to 5 inclusive. The default is 3.

'-m' indicates the minimum absolute LSA value to report. For example, an argument of '-m 0.5' will only return LSA values greater than 0.5 or less than -0.5. By default this is not activated.

'-a' indicates the significance level (alpha) for each test. fastLSA uses a p-value upper bound. If the upper bound is greater than the stated significance level, the pair is not reported. Default is 0.001.

'-r' indicates the resolution of the p-value calculation. p-values are calculated by a Riemann integral, and -r indicates the number of equally spaced steps per standard deviation unit. Larger values give finer p-values but use more RAM. Default is 100000.

'-t' is the maximum number of threads fastLSA is allowed to use while running. Default is 1.
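
The arguments above can also be assembled from a script. The following is a minimal Python sketch, not part of fastLSA: the helper name build_fastlsa_command and its keyword names are hypothetical, and the defaults simply mirror the ones listed above. It only builds and prints the command line.

```python
import shlex


def build_fastlsa_command(input_path, output_path, max_lag=3, min_lsa=None,
                          alpha=0.001, resolution=100000, threads=1,
                          executable="./fastLSA"):
    """Assemble a fastLSA argument list (hypothetical helper, not shipped with fastLSA)."""
    cmd = [
        executable,
        "-i", str(input_path),
        "-o", str(output_path),
        "-d", str(max_lag),
        "-a", str(alpha),
        "-r", str(resolution),
        "-t", str(threads),
    ]
    # '-m' is not activated by default, so only pass it when a minimum is requested.
    if min_lsa is not None:
        cmd += ["-m", str(min_lsa)]
    return cmd


cmd = build_fastlsa_command("data.txt", "output.txt", max_lag=5, min_lsa=0.2,
                            alpha=0.0001, resolution=1000000, threads=2)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the compiled fastLSA binary is in the working directory.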


fastLSA requires that all data be tab-delimited text files with regular samples. Each row is a time series. Each column is a time step. Time steps should be the same duration apart. If your data has missing values, you should consider interpolating. All time series must be of the same length, so if your data is not rectangular, every time series must be shortened to the length of the shortest one.
Example of an input file to fastLSA, showing the input format for multiple time series. There is one time series per line, with tab-delimited values for each time step. Information about each time series should be kept in a separate file.
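
The input format described above can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the sample numbers are made up, and missing values are assumed to have been interpolated already. It truncates every series to the shortest length and writes one tab-delimited series per line.

```python
def to_fastlsa_rows(series_list):
    """Truncate all series to the shortest length and format them as tab-delimited lines."""
    if not series_list:
        return []
    shortest = min(len(series) for series in series_list)
    return ["\t".join(str(value) for value in series[:shortest])
            for series in series_list]


def write_fastlsa_input(series_list, path):
    """Write one time series per line to the given path."""
    with open(path, "w", newline="\n") as handle:
        for line in to_fastlsa_rows(series_list):
            handle.write(line + "\n")


# Two made-up series of unequal length; the longer one is cut to three time steps.
rows = to_fastlsa_rows([[1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125]])
for row in rows:
    print(row)
```

Calling write_fastlsa_input(series_list, "data.txt") would then give a file suitable for the '-i' argument.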

When your analysis finishes, your output file will have the following header over data columns:
index1 index2 LSA lag p-valueBound

The index columns indicate the paired indices of significantly correlated time series. For n time series, they are indexed from 0 to n-1. The LSA column provides the LSA statistic for each pair. The lag column provides the lag at which the significant correlation was found.
The p-valueBound column provides the upper bound on the p-value for each significant pair. If your p-value bounds are all 0, it means that they are less than the inverse of your '-r' value (i.e., p < 1/r, where r is the value passed to '-r'). If you'd like to see a value greater than 0, you should increase '-r'. Note, however, that this will increase your RAM use.

Example of an output file from fastLSA. The file has five columns containing information on each significant pair of time series found. The first two, index1 and index2, are the indices of the respective time series; LSA is the LSA statistic value; lag is the relative lag of the two time series; and p-valueBound is the calculated upper bound on the p-value.
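
The output columns described above can be read back with a short script. The sketch below is illustrative only: the function name and the sample rows are made up, and it assumes whitespace-separated columns with integer indices and lags. A bound of 0 is reported as "below 1/r", following the note on the '-r' value.

```python
def parse_fastlsa_output(lines, resolution):
    """Parse fastLSA output lines into dictionaries, skipping the header line."""
    records = []
    for number, line in enumerate(lines):
        fields = line.split()
        if number == 0 or not fields:
            continue  # first line is the header; blank lines are ignored
        index1, index2, lsa, lag, bound = fields
        bound = float(bound)
        records.append({
            "index1": int(index1),
            "index2": int(index2),
            "lsa": float(lsa),
            "lag": int(lag),
            "p_bound": bound,
            # A zero bound only means the p-value is smaller than 1/resolution.
            "p_limit": 1.0 / resolution if bound == 0.0 else bound,
        })
    return records


# Made-up sample rows in the documented column order.
sample = [
    "index1\tindex2\tLSA\tlag\tp-valueBound",
    "0\t3\t0.82\t-1\t0",
    "1\t2\t-0.64\t2\t0.00004",
]
records = parse_fastlsa_output(sample, resolution=1000000)
for record in records:
    print(record["index1"], record["index2"], record["lag"], "p <", record["p_limit"])
```

For a real run, pass the lines of your output file and the same value you gave to '-r'.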
Some Extra Notes:
Depending on what you'd like to learn from your data, you may want to consider a transform. LSA requires that data be standardized (mean 0 and variance 1), and fastLSA does so automatically. LSA is sensitive to outliers, but depending on what you're trying to find, this can be a useful quality. For example, if you're interested in finding rare, co-occurring spikes (perhaps with a lag), then LSA's sensitivity to outliers is advantageous. However, if you're more concerned with correlation across both large- and small-valued data, then log transform your data first. This makes large values relatively smaller and small values relatively larger, all while preserving order.
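
To make the two transforms concrete, here is a minimal Python sketch. It is not fastLSA's own code: the function names are hypothetical, the standardization uses the population variance (fastLSA's exact convention is not stated here), and the offset of 1 in the log transform is an assumption added so that zero values stay defined.

```python
import math


def standardize(series):
    """Rescale a series to mean 0 and variance 1 (population variance assumed)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / n
    if variance == 0:
        raise ValueError("a constant series cannot be standardized")
    sd = math.sqrt(variance)
    return [(x - mean) / sd for x in series]


def log_transform(series, offset=1.0):
    """Log transform non-negative values; the offset is an assumption to handle zeros."""
    return [math.log(x + offset) for x in series]


z = standardize([1.0, 2.0, 3.0, 4.0])
logged = log_transform([0.0, 9.0, 99.0])
print(z)
print(logged)
```

Since fastLSA standardizes automatically, only the log transform would normally be applied before writing the input file; the standardize function is shown to illustrate what "mean 0 and variance 1" means.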

A common mistake for large data sets is to leave in some time series that have nearly constant values of zero. This can result in some surprisingly strong correlations, because two flat time series correlate very well. Such time series should be removed before analysis, unless their constant values genuinely hold a meaningful interpretation.
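
One way to screen out flat series before analysis is sketched below in Python. The function name and the min_range threshold are hypothetical choices; pick a threshold that suits the scale of your data. The original row numbers are returned as well, because removing rows shifts the indices that fastLSA reports.

```python
def drop_flat_series(series_list, min_range=1e-9):
    """Keep only series whose max-min range exceeds min_range.

    Returns the kept series and their original row indices.
    """
    kept = []
    kept_indices = []
    for index, series in enumerate(series_list):
        if max(series) - min(series) > min_range:
            kept.append(series)
            kept_indices.append(index)
    return kept, kept_indices


# Made-up data: rows 0 and 2 are constant and are dropped.
kept, kept_indices = drop_flat_series([[0, 0, 0], [1, 2, 3], [5.0, 5.0, 5.0]])
print(kept, kept_indices)
```

Keeping kept_indices lets you map index1 and index2 in the output back to rows of the unfiltered data.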

Working with Cytoscape

Cytoscape is very easy to use and is the perfect tool for visualizing fastLSA's output in a few easy steps:
1. Remove the first line header from the output file.
2. Open Cytoscape.
3. Click File -> Import -> Network from table.
4. Then load the output file. The source column should be column 1 and target column should be column 2.
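
Step 1 can be scripted rather than done by hand. A minimal Python sketch follows; the function name is hypothetical and the sample content is made up. It copies every line except the first from one open file to another.

```python
import io


def strip_header(src, dst):
    """Copy all lines from src to dst except the first (header) line."""
    next(src, None)  # discard the header; does nothing if the file is empty
    for line in src:
        dst.write(line)


# In-memory demonstration with a made-up output file.
source = io.StringIO("index1\tindex2\tLSA\tlag\tp-valueBound\n0\t3\t0.82\t-1\t0\n")
target = io.StringIO()
strip_header(source, target)
print(target.getvalue(), end="")
```

With real files, open the fastLSA output for reading and a new file (any name you like) for writing, pass both to strip_header, and import the new file into Cytoscape.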

If you've collected metadata for your time series, you'll likely want to see how it distributes across your network, as in the image below. To do this, index your metadata from 0 to n-1 for n time series, matching the indices used in the fastLSA output. This lets Cytoscape know which node represents each time series. You can then load the metadata into Cytoscape with the File -> Import -> Attribute from table command, and give colours, names, and shapes to the nodes and edges. See the Cytoscape documentation for more information.
[Image: example Cytoscape network with metadata shown on the nodes]
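
A metadata table indexed from 0 to n-1 can be generated with a few lines of Python. This is only a sketch: the function name, the column headers, and the sample labels are hypothetical, and it assumes your labels are listed in the same order as the rows of the fastLSA input file.

```python
def attribute_table(labels, header=("index", "label")):
    """Build tab-delimited attribute lines, numbering the labels from 0 to n-1."""
    lines = ["\t".join(header)]
    for index, label in enumerate(labels):
        lines.append(f"{index}\t{label}")
    return lines


# Made-up labels, one per time series, in input-file row order.
lines = attribute_table(["siteA", "siteB", "siteC"])
print("\n".join(lines))
```

Writing these lines to a text file gives a table whose first column matches the node identifiers in the imported network.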