Skip to content

Latest commit

 

History

History
570 lines (472 loc) · 20.1 KB

working-with-linq.md

File metadata and controls

570 lines (472 loc) · 20.1 KB
title description keywords author manager ms.date ms.topic ms.prod ms.technology ms.devlang ms.assetid
Working with LINQ
Working with LINQ
.NET, .NET Core
BillWagner
wpickett
06/20/2016
article
.net-core
.net-core-technologies
dotnet
0db12548-82cb-4903-ac88-13103d70aa77

Working with LINQ

Introduction

This tutorial teaches you a number of features in .NET Core and the C# language. You’ll learn:

  • How to generate sequences with LINQ
  • How to write methods that can be easily used in LINQ queries.
  • How to distinguish between eager and lazy evaluation.

You'll learn these techniques by building an application that demonstrates one of the basic skills of any magician: the faro shuffle. Briefly, a faro shuffle is a technique where you split a card deck exactly in half, then the shuffle interleaves each one card from each half to rebuild the original deck.

Magicians use this technique because every card is in a known location after each shuffle, and the order is a repeating pattern.

For our purposes, it is a light hearted look at manipulating sequences of data. The application you'll build will construct a card deck, and then perform a sequence of shuffles, writing the sequence out each time. You'll also compare the updated order to the original order.

This tutorial has multiple steps. After each step, you can run the application and see the progress. You can also see the completed sample in our GitHub repository.

Prerequisites

You’ll need to setup your machine to run .NET core. You can find the installation instructions on the .NET Core page. You can run this application on Windows, Ubuntu Linux, OS X or in a Docker container. You’ll need to install your favorite code editor. The descriptions below use Visual Studio Code which is an open source, cross platform editor. However, you can use whatever tools you are comfortable with.

Create the Application

The first step is to create a new application. Open a command prompt and create a new directory for your application. Make that the current directory. Type the command "dotnet new" at the command prompt. This creates the starter files for a basic “Hello World” application.

If you've never used C# before, this tutorial explains the structure of a C# program. You can read that and then return here to learn more about LINQ.

Creating the Data Set

Let's start by creating a deck of cards. You'll do this using a LINQ query that has two sources (one for the four suits, one for the thirteen values). You'll combine those source into a 52 card deck.

Here's the query:

var startingDeck = from s in Suits()
                   from r in Ranks()
                   select new { Suit = s, Rank = r };

The multiple from clauses produce a SelectMany, which creates a single sequence from combining each element in the first sequence with each element in the second sequence. The order is important for our purposes. The first element in the first source sequence (Suits) is combined with every element in the second sequence (Values). This produces all thirteen cards of first suit. That process is repeated with each element in the first sequence (Suits). The end result is a deck of cards ordered by suits, followed by values.

Next, you'll need to build the Suits() and Ranks() methods. Let's start with a really simple set of iterator methods that generate the sequence as an enumerable of strings:

static IEnumerable<string> Suits()
{
    yield return "clubs";
    yield return "diamonds";
    yield return "hearts";
    yield return "spades";
}

static IEnumerable<string> Ranks()
{
    yield return "two";
    yield return "three";
    yield return "four";
    yield return "five";
    yield return "six";
    yield return "seven";
    yield return "eight";
    yield return "nine";
    yield return "ten";
    yield return "jack";
    yield return "queen";
    yield return "king";
    yield return "ace";
}

These two methods both utilize the yield return syntax to produce a sequence as they run. The compiler builds an object that implements IEnumerable<T> and generates the sequence of strings as they are requested.

Go ahead and run the sample you've built at this point. It will display all 52 cards in the deck. You may find it very helpful to run this sample under a debugger to observe how the Suits() and Values() methods execute. You can clearly see that each string in each sequence is generated only as it is needed.

Manipulating the Order

Next, let's build a utility method that can perform the shuffle. The first step is to split the deck in two. The Take() and Skip() methods that are part of the LINQ APIs provide that feature for us:

var top = startingDeck.Take(26);
var bottom = startingDeck.Skip(26);

The shuffle method doesn't exist in the standard library, so you'll have to write your own. This new method illustrates several techniques that you'll use with LINQ-based programs, so let's explain each part of the method in steps.

The signature for the method creates an extension method:

public static IEnumerable<T> InterleaveSequenceWith<T>
    (this IEnumerable<T> first, IEnumerable<T> second)

An extension method is a special purpose static method. You can see the addition of the this modifier on the first argument to the method. That means you call the method as though it were a member method of the type of the first argument.

Extension methods can be declared only inside static classes, so let's create a new static class called extensions for this functionality. You'll add more extension methods as you continue this tutorial, and those will be placed in the same class.

This method declaration also follows a standard idiom where the input and output types are IEnumerable<T>. That practice enables LINQ methods to be chained together to perform more complex queries.

using System.Collections.Generic;

namespace LinqFaroShuffle
{
    public static class Extensions
    {
        public static IEnumerable<T> InterleaveSequenceWith<T>
            (this IEnumerable<T> first, IEnumerable<T> second)
        {
            // implementation coming.
        }
    }
}

You will be enumerating both sequences at once, interleaving the elements, and creating one object. Writing a LINQ method that works with two sequences requires that you understand how IEnumerable works.

The IEnumerable interface has one method: GetEnumerator(). The object returned by GetEnumerator() has a method to move to the next element, and a property that retrieves the current element in the sequence. You will use those two members to enumerate the collection and return the elements. This Interleave method will be an iterator method, so instead of building a collection and returning the collection, you'll use the yield return syntax shown above.

Here's the implementation of that method:

public static IEnumerable<T> InterleaveSequenceWith<T>
    (this IEnumerable<T> first, IEnumerable<T> second)
{
    var firstIter = first.GetEnumerator();
    var secondIter = second.GetEnumerator();
    while (firstIter.MoveNext() && secondIter.MoveNext())
    {
        yield return firstIter.Current;
        yield return secondIter.Current;
    }
}

Now that you've written this method, go back to the Main method and shuffle the deck once:

public static void Main(string[] args)
{
    var startingDeck = from s in Suits()
                       from r in Ranks()
                       select new { Suit = s, Rank = r };
    foreach (var c in startingDeck)
        Console.WriteLine(c);
        
    var top = startingDeck.Take(26);
    var bottom = startingDeck.Skip(26);
    
    var shuffle = top.InterleaveSequenceWith(bottom);
    foreach (var c in shuffle)
        Console.WriteLine(c);
}

Comparisons

Let's see how many shuffles it takes to set the deck back to its original order. You'll need to write a method that determines if two sequences are equal. After you have that method, you'll need to place the code that shuffles the deck in a loop, and check to see when the deck is back in order.

Writing a method to determine if the two sequences are equal should be straightforward. It's a similar structure to the method you wrote to shuffle the deck. Only this time, instead of yield returning each element, you'll compare the matching elements of each sequence. When the entire sequence has been enumerated, if every element matches, the sequences are the same:

public static bool SequenceEquals<T>(this IEnumerable<T> first, IEnumerable<T> second)
{
    var firstIter = first.GetEnumerator();
    var secondIter = second.GetEnumerator();
    while (firstIter.MoveNext() && secondIter.MoveNext())
    {
        if (!firstIter.Current.Equals(secondIter.Current))
            return false;
    }
    return true;
}

This shows a second Linq idiom: terminal methods. They take a sequence as input (or in this case, two sequences), and return a single scalar value. These methods, when they are used, are always the final method of a query. (Hence the name).

You can see this in action when you use it to determine when the deck is back in its original order. Put the shuffle code inside a loop, and stop when the sequence is back in its original order by applying the SequenceEquals() method. You can see it would always be the final method in any query, because it returns a single value instead of a sequence:

var times = 0;
var shuffle = startingDeck;
do
{
    shuffle = shuffle.Take(26).InterleaveSequenceWith(shuffle.Skip(26));

    foreach (var c in shuffle)
        Console.WriteLine(c);

    Console.WriteLine();
    times++;
} while (!startingDeck.SequenceEquals(shuffle));
Console.WriteLine(times);

Run the sample, and see how the deck rearranges on each shuffle, until it returns to its original configuration after 8 iterations.

Optimizations

The sample you've built so far executes an in shuffle, where the top and bottom cards stay the same on each run. Let's make one change, and run an out shuffle, where all 52 cards change position. For an out shuffle, you interleave the deck so that the first card in the bottom half becomes the first card in the deck. That means the last card in the top half becomes the bottom card. That's just a one line change. Update the call to shuffle to change the order of the top and bottom halves of the deck:

shuffle = shuffle.Skip(26).InterleaveSequenceWith(shuffle.Take(26));

Run the program again, and you'll see that it takes 52 iterations for the deck to reorder itself. You'll also start to notice some serious performance degradations as the program continues to run.

There are a number of reasons for this. Let's tackle one of the major causes: inefficient use of lazy evaluation.

LINQ queries are evaluated lazily. The sequences are generated only as the elements are requested. Usually, that's a major benefit of LINQ. However, in a use such as this program, this causes exponential growth in execution time.

The original deck was generated using a LINQ query. Each shuffle is generated by performing three LINQ queries on the previous deck. All these are performed lazily. That also means they are performed again each time the sequence is requested. By the time you get to the 52nd iteration, you're regenerating the original deck many, many times. Let's write a log to demonstrate this behavior. Then, you'll fix it.

Here's a log method that can be appended to any query to mark that the query executed.

public static IEnumerable<T> LogQuery<T>(this IEnumerable<T> sequence, string tag)
{
    using (var writer = File.AppendText("debug.log"))
    {
        writer.WriteLine($"Executing Query {tag}");
    }
    return sequence;
}

Next, instrument the definition of each query with a log message:

public static void Main(string[] args)
{
var startingDeck = (from s in Suits().LogQuery("Suit Generation")
                    from r in Ranks().LogQuery("Rank Generation")
                    select new { Suit = s, Rank = r }).LogQuery("Starting Deck");
    foreach (var c in startingDeck)
        Console.WriteLine(c);
        
    Console.WriteLine();
    var times = 0;
    var shuffle = startingDeck;
    do
    {
        //shuffle = shuffle.Take(26).LogQuery("Top Half")
        //    .InterleaveSequenceWith(shuffle.Skip(26).LogQuery("Bottom Half")).LogQuery("Shuffle");

        shuffle = shuffle.Skip(26).LogQuery("Bottom Half")
            .InterleaveSequenceWith(shuffle.Take(26).LogQuery("Top Half")).LogQuery("Shuffle");

        foreach (var c in shuffle)
            Console.WriteLine(c);
        times++;
        Console.WriteLine(times);
    } while (!startingDeck.SequenceEquals(shuffle));
    Console.WriteLine(times);
}

Notice that you don't log every time you access a query. You log only when you create the original query. The program still takes a long time to run, but now you can see why. If you run out of patience running the outer shuffle with logging turned on, switch back to the inner shuffle. You'll still see the lazy evaluation effects. In one run, it executes 2592 queries, including all the value and suit generation.

There is an easy way to update this program to avoid all those executions. There are LINQ methods ToArray() and ToList() that cause the query to run, and store the results in an array or a list, respectively. You use these methods to cache the data results of a query rather than execute the source query again. Append the queries that generate the card decks with a call to ToArray() and run the query again:

public static void Main(string[] args)
{
var startingDeck = (from s in Suits().LogQuery("Suit Generation")
                    from v in Ranks().LogQuery("Rank Generation")
                    select new { Suit = s, Rank = r })
                    .LogQuery("Starting Deck")
                    .ToArray();
    foreach (var c in startingDeck)
        Console.WriteLine(c);
        
    Console.WriteLine();
    var times = 0;
    var shuffle = startingDeck;
    do
    {
        shuffle = shuffle.Take(26).LogQuery("Top Half")
            .InterleaveSequenceWith(shuffle.Skip(26).LogQuery("Bottom Half")).LogQuery("Shuffle").ToArray();

        //shuffle = shuffle.Skip(26).LogQuery("Bottom Half")
        //    .InterleaveSequenceWith(shuffle.Take(26).LogQuery("Top Half")).LogQuery("Shuffle");

        foreach (var c in shuffle)
            Console.WriteLine(c);
        times++;
        Console.WriteLine(times);
    } while (!startingDeck.SequenceEquals(shuffle));
    Console.WriteLine(times);
}

Run again, and the inner shuffle is down to 30 queries. Run again with the outer shuffle and you'll see similar improvements. (It now executes 162 queries).

Don't misinterpret this example by thinking that all queries should run eagerly. This example is designed to highlight the use cases where lazy evaluation can cause performance difficulties. That's because each new arrangement of the deck of cards is built from the previous arrangement. Using lazy evaluation means each new deck configuration is built from the original deck, even executing the code that built the startingDeck. That causes a large amount of extra work.

In practice, some algorithms run much better using eager evaluation, and others run much better using lazy evaluation. (In general, lazy evaluation is a much better choice when the data source is a separate process, like a database engine. In those cases, lazy evaluation enables more complex queries to execute only one round trip to the database process.) LINQ enables both lazy and eager evaluation. Measure, and pick the best choice.

Preparing for New Features

The code you've written for this sample is an example of creating a simple prototype that does the job. This is a great way to explore a problem space, and for many features, it may be the best permanent solution. You've leveraged anonymous types for the cards, and each card is represented by strings.

Anonymous Types have many productivity advantages. You don't need to define a class yourself to represent the storage. The compiler generates the type for you. The compiler generated type utilizes many of the best practices for simple data objects. It's immutable, meaning that none of its properties can be changed after it has been constructed. Anonymous types are internal to an assembly, so they aren't seen as part of the public API for that assembly. Anonymous types also contain an override of the ToString() method that returns a formatted string with each of the values.

Anonymous types also have disadvantages. They don't have accessible names, so you can't use them as return values or arguments. You'll notice that any methods above that used these anonymous types are generic methods. The override of ToString() may not be what you want as the application grows more features.

The sample also uses strings for the suit and the rank of each card. That's quite open ended. The C# type system can help us make better code, by leveraging enum types for those values.

Start with the suits. This is a perfect time to use an enum:

public enum Suit
{
    Clubs,
    Diamonds,
    Hearts,
    Spades
}

The Suits() method also changes type and implementation:

static IEnumerable<Suit> Suits()
{
    yield return Suit.Clubs;
    yield return Suit.Diamonds;
    yield return Suit.Hearts;
    yield return Suit.Spades;
}

Next, do the same change with the Rank of the cards:

public enum Rank
{
    Two,
    Three,
    Four,
    Five,
    Six,
    Seven,
    Eight,
    Nine,
    Ten,
    Jack,
    Queen,
    King,
    Ace
}

And the method that generates them:

static IEnumerable<Rank> Values()
{
    yield return Rank.Two;
    yield return Rank.Three;
    yield return Rank.Four;
    yield return Rank.Five;
    yield return Rank.Six;
    yield return Rank.Seven;
    yield return Rank.Eight;
    yield return Rank.Nine;
    yield return Rank.Ten;
    yield return Rank.Jack;
    yield return Rank.Queen;
    yield return Rank.King;
    yield return Rank.Ace;
}

As one final cleanup, let's make a type to represent the card, instead of relying on an anonymous type. Anonymous types are great for lightweight, local types, but in this example, the playing card is one of the main concepts. It should be a concrete type.

public class PlayingCard
{
    public Suit CardSuit { get; }
    public Rank CardRank { get; }
    
    public PlayingCard(Suit s, Rank r)
    {
        CardSuit = s;
        CardRank = r;
    }
    
    public override string ToString()
    {
        return $"{CardRank} of {CardSuit}";
    }
}

This type uses auto-implemented read-only properties which are set in the constructor, and then cannot be modified. It also makes use of the new string interpolation feature that makes it easier to format string output.

Update the query that generates the starting deck to use the new type:

var startingDeck = (from s in Suits().LogQuery("Suit Generation")
                    from r in Ranks().LogQuery("Value Generation")
                    select new PlayingCard(s, r))
                    .LogQuery("Starting Deck")
                    .ToArray();

Compile and run again. The output is a little cleaner, and the code is a bit more clear and can be extended more easily.

Conclusion

This sample should you some of the methods used in LINQ, how to create your own methods that will be easily used with LINQ enabled code. It also showed you the differences between lazy and eager evaluation, and the affect that decision can have on performance.

And, you learned a bit about one magician's technique. Magician's use the faro shuffle because they can control where every card moves in the deck. In some tricks, the magician has an audience member place a card on top of the deck, and shuffles a few times, knowing where that card goes. Other illusions require the deck set a certain way. A magician will set the deck prior to performing the trick. Then she will shuffle the deck 5 times using an inner shuffle. On stage, she can show what looks like a random deck, shuffle it 3 more times, and have the deck set exactly how she wants.