fastLSA

Exanding the Boundaries of Local Similarity Analysis

What is fastLSA?


fastLSA is a new implementation of the Local Similarity Analysis (LSA) algorithm that approximates the detection of significant LSA results using an asymptotic upper bound. Making only the very standard assumptions (independent and identically distributed data, finite variance, asymptotic data size) fastLSA expands some of the boundaries imposed by previous implementations. Written in optimized C, and utilizing POSIX threads, fastLSA is a software package can calculate pairwise LSA statistics and p-value bounds for ten thousand time series with time-steps measuring in the thousands. Technically this is quadratic growth with time-series and linearly with their length. The table below shows some practical runtimes:
Sample running times for fastLSA on datasets of different sizes. All analysis was performed on a Mac Pro desktop computer running Mac OSX 10.6.8 with a 2 × 2.4 Ghz Quad-Core Intel Xeon processors and 16 GB of 1066 Mhz DDR3 RAM.
Time series Time points fastLSA (single thread) fastLSA (16 threads)
1,000 130 6 sec 1 sec
6,178 24 3.24 min 2.2 sec
14,105 390 58 min 7.5 min
100,000 100 - 54 min
1,000,000 30 - 2 days 3 hrs
1,000,000 100 - 7 days 23 hrs

What is LSA?


Local Similarity Analysis is a statistic developed by the Sun Lab at the University of Southern California, used for finding a special kind of local correlation between two time series seen below. Within a time-stream there are often instances of limited local agreement between time-series within a short window. Because this agreement is fleeting over a short period of time, a calculation of the agreement using a standard measure such as Pearson’s Correlation Coefficient (PCC) may not be significant. The LSA statistic captures this local and lagged variation enabling the detection of these interesting events.
Stacks Image 26
Stacks Image 214
Lagged-correlations between time series. An example of the kind of variation that the LSA statistic can detect.

LSA outputs values between -1 and +1; high positive values representing a strong coordinated correlation between two time-series (i.e. high-high, low-low) and low negative values representing strong anti-correlated behavior (i.e. high-low, low-high). Technically LSA does this by capturing the largest absolute-valued correlation with a pair of subsequences within a d-length lag window. In this way LSA subsumes the variation captured by PCC and includes others. If you are searching for correlations that may only occur for part of your time series or are interested in a lagged global trend then LSA may be the right method for you.

Furthermore, the output of LSA makes a great input into the graph visualization software, Cytoscape (http://www.cytoscape.org/). Using a force-directed layout it can enable the identification of strong clusters of correlation.
Stacks Image 19
A local similarity graph. Time series with significant LSA values can be visualized and clustered in the data visualization software Cytoscape. This example is from the Moving Pictures of the Human Microbiome dataset.
More details about fastLSA and the LSA statistic can be found in the BMC Genomics publication:

Durno et al., Expanding the boundaries of local similarity analysis. BMC Genomics 2013 14(Suppl 1):S3.

Please send comments and questions to shallam@mail.ubc.ca.