Home  |  About  |  RSS antonfagerberg.com

Scala streams - out of memory

01 Jan 2018

I was doing Advent of Code and was working on generating numbers based on a pattern (day 15). Let’s say I wanted to generate 0, 1, 2, ... to simplify this example.

The first thing that came to mind was building a stream. One way to do this in Scala is as follows:

Stream.iterate(0)(_ + 1)

or you can do it like this:

lazy val stream: Stream[Int] = 1 #:: stream.map(_ + 1)

The plan was then to lazily generate 40 million numbers, filter out then ones I wanted and count them. Turns out this didn’t work at all as memory started to run out (OutOfMemoryError) for two reasons which at first seemed counterintuitive.

First, if you look at the source code for Stream in Scala you find a lot warnings such as:

*  - One must be cautious of memoization; you can very quickly eat up large
*  amounts of memory if you're not careful.  The reason for this is that the
*  memoization of the `Stream` creates a structure much like
*  [[scala.collection.immutable.List]].  So long as something is holding on to
*  the head, the head holds on to the tail, and so it continues recursively.
*  If, on the other hand, there is nothing holding on to the head (e.g. we used
*  `def` to define the `Stream`) then once it is no longer being used directly,
*  it disappears.

The take method has more to say:

override def take(n: Int): Stream[A] = (
  // Note that the n == 1 condition appears redundant but is not.
  // It prevents "tail" from being referenced (and its head being evaluated)
  // when obtaining the last element of the result. Such are the challenges
  // of working with a lazy-but-not-really sequence.
  if (n <= 0 || isEmpty) Stream.empty
  else if (n == 1) cons(head, Stream.empty)
  else cons(head, tail take n-1)
)

Unaware of this, you (as I) may think that code like this should work just fine:

object MemoryTest extends App {
  val stream = Stream.iterate(0)(_ + 1)
  println(stream.dropWhile(_ < 1000000).head)
}

Running the code above with -Xmx10m will throw an java.lang.OutOfMemoryError: GC overhead limit exceeded exception. However, if you change the val to def it will work just fine since the stream’s head reference can be garbage collected.

object MemoryTest extends App {
  def stream = Stream.iterate(0)(_ + 1)
  println(stream.dropWhile(_ < 1000000).head)
}

The next tricky thing is that many methods, such as the common filter method, and unlike dropWhile, isn’t implemented in Stream but in TraversableLike.

With filter you have to worry about how many elements that are going to be memoized in between the elements which the predicate returns true for.

Running the following code with -Xmx10m will throw an OutOfMemoryError:

object MemoryTest extends App {
  def stream = Stream.iterate(0)(_ + 1)
  println(stream.filter(_ >= 1000000).head)
}

While this code will work just fine:

object MemoryTest extends App {
  def stream = Stream.iterate(0)(_ + 1)
  println(stream.filter(_ % 100 == 0).filter(_ >= 1000000).head)
}

With these two things in mind, you should be able to avoid the memory gotchas in Scala streams. In the end though, I realised that I didn’t need Stream but rather Iterator for my Advent of Code solution.

object MemoryTest extends App {
  val iterator = Iterator.iterate(0)(_ + 1)
  println(iterator.filter(_ >= 1000000).next())
}

If you’re not going to take advantage of the memoization, go with Iterator instead!