Unit 5. Data Manipulation
Revision Date: Oct 13, 2015 (Version 1.2)Pre-lesson Preparation: You should familiarize yourself with www.sorting-algorithms.com/ paying particular attention to the variety of algorithms and settings along the top of the page. For session 2, you should have the timedsorts.py code and data files (in the lesson folder) readily available for your students.
Summary
In this two-session lesson, students will explore algorithmic efficiency. They will understand the idea through discussion, manual analysis of simple algorithms, and data collection for implemented algorithms.
Outcomes
Students will be able to:
Overview
Session 1:
Session 2:
Student computer usage for this lesson is: required
Sorting:
Think-Pair-Share: Alternate Routes
Briefly discuss with your class the topic: what properties make for a good algorithm? What makes one algorithm better than another? Properties you may want to discuss if your students do not volunteer them:
A good analogy is purchasing a car, where people are concerned about:
Today's session will address the topic of efficiency.
Introduce the concept of algorithmic efficiency to your students by asking them if any can describe what algorithmic efficiency is, or what it means for an algorithm to be efficient. Briefly describe efficiency as how well an algorithm uses two resources, time and space (stored memory), to solve a problem. Some topics you may wish to discuss include:
Teacher note: This topic is more advanced, so you may wish to go more in depth or move on to the activity, as appropriate for your students.
A central idea of algorithms is that some algorithms will take more and more time as the size of their input increases. Time is not measured in seconds but rather the number of computational steps needed for the algorithm to finish operation on a given input. Great algorithms grow linearly, at the same rate as their input, meaning the time it takes to finish is directly proportional to the size of the problem they are solving (amount of input data). For instance, an algorithm that takes 10 steps for an input of size 10 and 1000 steps for an input of size 1000 is said to be linear in its input. However, most algorithms take longer as their input gets larger. For instance, an algorithm that takes only 25 steps for an input of size 5 may take 100 steps for an input of size 10, 10000 steps for an input of size 100, and one million steps for a size of only 1000 (it is taking quadratically more time as the input gets larger).
When we analyze algorithms, we often talk about the algorithm's computational complexity, which is the order of magnitude of the algorithm's running time. We almost always discuss the worst case complexity, since that is a bound on the resources required.
If an algorithm finishes with the same number of steps regardless of the size of its input, it is called constant time, which is O(1) in mathematical form (read aloud as "big-oh one"). Constant time algorithms are the fastest in terms of computational efficiency, and any algorithm that takes a constant number of steps is considered O(1). An algorithm that takes 10 steps for an input of size 10 and also takes 10 steps for an input of size 1000 is likely O(1). However, very, very few algorithms are constant time because most algorithms necessarily take longer as the size of their input increases.
An algorithm that can finish by looking at each piece of its input only once is called linear time or linear order, and is written mathematically as O(n), where n stands for "the size of the input." An algorithm that takes 10 steps for an input of size 10 and also takes 1000 steps for an input of size 1000 is likely O(n). Very few algorithms are linear order, especially if they must compare pieces in their input, such as sorting algorithms. The best sorting algorithms are somewhere between linear time and quadratic polynomial time, written as O(n^{2}), where n^{2} stands for "the size of the input, squared." Any algorithm that is O(n^{2}) typically must compare each piece of its input with every other piece of input at least once. An algorithm that takes 100 steps for input of size 10 and a million steps for input of size 1000 is likely O(n^{2}).
Most sorting algorithms are of an order between O(n) and O(n^{2}) known as linearithmic time, written as O(n log n), where log is the logarithmic function. In fact, O(n log n) is the fastest possible order for a comparison-based sorting algorithms. It is impossible for such algorithms to be O(n) since they must make at least some comparisons of their input data.
Using the simulation tools at http://www.sorting-algorithms.com/, students will investigate, compare, and contrast sorting algorithms. Notice the grid in the center of the page. Each column is a particular sorting algorithm, and each row is an ordering of horizontal bars (either random, nearly sorted, reversed order, or few unique). Each algorithm will sort the bars in a given cell from top to bottom in increasing order by length.
Ask your students to interact with the website by clicking the green start icons and observing how long it takes each algorithm to sort its bars relative to the other algorithms.
Some questions to have them discuss or record in their journal could include:
Make sure your students understand that the size and order of input data can affect how long an algorithm takes. You should direct or help your students discover that Bubble sort is a slow sorting algorithm that can be fairly fast for nearly sorted data. You may wish to discuss that Bubble sort is O(n^{2}) in the worst case, explaining why it takes so long for large input, but is O(n) in the best case, which is when input is already (or nearly) sorted. In contrast, Selection sort is O(n^{2}) in both the worst and best cases, and Merge sort is O(n log n) in both the worst and best cases. In general, most sorting algorithms that we would want to use are O(n log n), since O(n^{2}) is usually too slow. You may also want to mention that Bubble sort is considered one of the most inefficient sorting algorithms and that Quick sort’s worst performance is on already sorted data, so some Quick sort implementations shuffle the inputs before sorting to avoid that situation.
Watch one or more of the available movie clips that compare the performance of sorting algorithms:
Suggested list of videos (Many more are available):
Journal: Remind your students about the sorting algorithms from the previous session and have them answer the following questions:
The students will measure and analyze the effect of sorting set size on execution time for a given sorting algorithm using Python code. Using the timedsorts.py file in the lesson resources folder as a basis, the students will perform an experimental analysis to compare sorting algorithms by timing them on input data of different sizees. They will hypothesize, design and code their experiment, collect results, and write a report for homework.
The sorting functions available in the Python code include: quick sort, merge sort, selection sort, insertion sort, and bubble sort. For advanced students or classes may, you may wish to have them implement additional sorting algorithms.
The sample code includes helper functions to generate random data, to load data from a file, and to time sorting functions on the data. Example code for invoking these functions is included at the end of the file. You can remove this example code before sharing it with your students if you wish to emphasize the programming and critical thinking required to do this project.
Each student (or pair or group) needs their own copy of the Python code to modify for their experiments.
Students will compare sorting algorithms by timing them with Python code on input data of various sizes. Have your students (individually or in pairs) make a hypothesis about what will happen as the size of data input increases, answering the following questions:
Have your students write out a description of the steps they will take to perform the experiment.
Have students modify their Python sorting code to implement the experimental steps they outlined. Students must:
The data collection should be completed by the end of class, but students will continue to work on this activity by writing a report describing their results.
Assign as homework to write a short report about the findings, making sure to:
Students must complete a short research report on their sorting algorithm research procedure, results, and analysis of the results.
The teacher may decide to have the students choose how they want to organize the empirical analysis effort. Alternatively, scaffolding with a worksheet or checklist could be used to guide the students through the data collection and analysis tasks.
The following "Checks for Understanding" could be used to guide the students towards the three learning objectives:
Objective: SWBAT identify families of correct algorithms that have different efficiencies in their problem solving approach.
Objective: SWBAT demonstrate logical reasoning and metrics is used to describe an algorithm’s efficiency.
Objective: SWBAT to perform empirical analysis of sorting algorithms by running the algorithms on different inputs.
Students will complete a short research report on their sorting algorithm research procedure, results, and analysis of the results.