Home

# Final guide

Answers on the bottom. Also, you can bring a notecard (5in x 7in max) with notes or whatnot.

## Multi-agent systems

What are two principles for designing agent-based simulations?

### Ant intelligence

Which of the following features are in the “ant brood sorting” model?

• Scent-following
• Random walks
• Always pick up an object if ant finds one
• Decide stochastically to drop an object if holding one
• Drop the object only when encountering another object (if holding an object)
• Communicate directly with other ants

Why is it important that the chemical ants leave behind when they are carrying food slowly evaporates?

### NetLogo

What is a “turtle” in NetLogo?

What is a “patch” in NetLogo?

### Prisoner’s dilemma

True or false? “Defecting” means confessing to the crime so that you get off the hook but your partner goes to jail.

Describe the “Jesus” strategy.

Describe the “Lucifer” strategy.

Describe the “Unforgiving” strategy.

Describe the “Tit-for-tat” a.k.a. “Moses” strategy. What is the first choice in this strategy? What are all future choices?

Which of these three strategies generally performed best across random scenarios in the iterated prisoner’s dilemma?

If two “Tit-for-tat” strategies face each other in the iterated prisoner’s dilemma, we will see behavior equivalent to which non-Tit-for-tat strategy or strategies facing each other?

## Learning

Describe the difference between “supervised” and “unsupervised” learning.

### k-means clustering

k-means clustering is an unsupervised or supervised learning strategy?

What does the choice of $$k$$ represent?

### k-nearest neighbor

What does k-nearest neighbor allow us to do with a new, unknown data point?

k-nearest neighbor is an unsupervised or supervised learning strategy?

What does the choice of $$k$$ represent?

What problem may a very small value of $$k$$ cause?

What problem may a very large value of $$k$$ cause?

Is there one value for $$k$$ that works best for nearly all data sets? If so, what is it?

Give one benefit of k-nearest neighbor learning.

Give one drawback of k-nearest neighbor learning.

### Document classification

What is the goal of document classification?

Define true positive (TP). Define false positive (FP). Define false negative (FN). Define precision (in terms of TP and/or FP and/or FN). Define recall (in terms of TP and/or FP and/or FN). Define F-score (in terms of precision and/or recall).

Suppose we make our classification engine more cautious; that is, it is less likely overall to predict any category. Does precision go up or down or remain unchanged? Does recall go up or down or remain unchanged?

If the true categories for a document are [Applications, Robots, Interfaces, Speech] and the predicted categories are [Applications, Games, Speech], what are the true positive, false positive, and false negative counts?

If the true categories for document 1 are [Applications, Robots, Interfaces, Speech], for document 2 are [Games], for document 3 are [Ethics, Philosophy, Robots], and for document 4 are [Applications, Vision, ScienceFiction], then what are the precision, recall, and f-score if the predicted categories for document 1 are [Applications, Games, Speech], for document 2 are [Games], for document 3 are [MachineLearning, Vision], and for document 4 are [ScienceFiction]?

Give a possible result of stemming each of the following words: “linguistic,” “dog,” “jumping,” “philosophy.”

Describe a “binary feature vector” for a text document.

Why do you suppose tf-idf feature vectors perform better (in terms of F-score) than, say, frequency count feature vectors?

A common word (most documents have many occurrences of this word) would have a high or low tf-idf score?

What is a centroid (in terms of document vectors)?

What is one way a centroid approach to classification is better than k-nearest neighbor?

### Probability and Bayesian methods

Describe what $$P(A)$$ means (in words).

Describe what $$P(A,B)$$ means (in words).

Describe what $$P(A|B)$$ means (in words).

If events $$A$$ and $$B$$ are independent, and $$P(A) = 0.25$$, $$P(B) = 0.10$$, what is $$P(A,B)$$? What is $$P(B,A)$$?

Write Bayes’ theorem.

Suppose I know that $$P(B|A)=0.1, P(A)=0.9, P(B)=0.25$$, what is $$P(A|B)$$?

What is the outcome of computing $$\arg\max X (P(x))$$ where $$X$$ is an event? (If $$\arg\max X (P(X))$$ was a function, what would the output of the function be?) Describe in English.

Suppose I know that $$P(B|A)=0.1, P(A)=0.9, P(B|C)=0.2, P(C)=0.8$$, what is $$\arg\max X (P(X|B))$$?

As described in the naïve Bayesian classification notes, what does $$d_i$$ represent for some word $$i$$? What does $$d_{ci}$$ represent for some word $$i$$ and category $$c$$?

Why do we use logarithms for naïve Bayesian classification calculations?

What is the outcome of computing $$\arg\max x P(x)$$ where $$x$$ is an event? (If $$\arg\max x P(x)$$ was a function, what would the output of the function be?) Describe in English.

Give one benefit of naïve Bayesian classification (as compared to other classification methods like k-nearest neighbor).

Give one drawback of naïve Bayesian classification (as compared to other classification methods like support vector machines).

## The “Chinese room” argument

What is the essential goal of “strong AI?”

What is the most critical assumption in the Chinese room argument?

If you believe the Chinese room argument, can you also (reasonably) believe that passing the Turing test gives proof that a machine possesses a mind (i.e., can be said to truly understand things)?

### Multi-agent systems

What are two principles for designing agent-based simulations?

Here are seven:

1. agents not functions (not functional decomposition)
2. keep agents small in size
3. keep agents small in time (forgetful)
4. keep agents small in scope (local sensing and action)
5. decentralizd system control
6. support agent diversity
7. provide an entropy leak

#### Ant intelligence

• No: Scent-following
• Yes: Random walks
• No: Always pick up an object if ant finds one
• Yes: Decide stochastically to drop an object if holding one
• No: Drop the object only when encountering another object (if holding an object) — (every time step, regardless of whether another object is encountered, determine whether to drop the object if holding one)
• No: Communicate directly with other ants

Why is it important that the chemical ants leave behind when they are carrying food slowly evaporates?

The chemical must evaporate so that the ants do not continue trying to walk to a food source that no longer exists. The ants need to “forget” (one of the principles of agent design).

#### NetLogo

What is a “turtle” in NetLogo?

A turtle is a single agent that acts independently of all other turtles (agents).

What is a “patch” in NetLogo?

A patch is a square on the map; the patch can have a color or other properties like a chemical. Patches do not move but they can have actions (such as changing one of their properties).

#### Prisoner’s dilemma

True or false? “Defecting” means confessing to the crime so that you get off the hook but your partner goes to jail. True

Describe the “Jesus” strategy. Always cooperate

Describe the “Lucifer” strategy. Always defect

Describe the “Unforgiving” strategy. Cooperate until the partner defects, and then defect forever after.

Describe the “Tit-for-tat” a.k.a. “Moses” strategy. What is the first choice in this strategy? Tit-for-tat does the same action the partner did in the last time step; it just mirrors the partner. The first choice (if tit-for-tat goes first) is to cooperate.

Which of these three strategies generally performed best across random scenarios in the iterated prisoner’s dilemma? Tit-for-tat

If two “Tit-for-tat” strategies face each other in the iterated prisoner’s dilemma, we will see behavior equivalent to which non-Tit-for-tat strategy or strategies facing each other? Jesus strategies

### Learning

Describe the difference between “supervised” and “unsupervised” learning.

Supervised learning uses information about the truth when training. Unsupervised learning does not have the truth (ever) so obviously cannot use this information.

#### k-means clustering

k-means clustering is an unsupervised or supervised learning strategy? Unsupervised

What does the choice of $$k$$ represent? The number of clusters.

### k-nearest neighbor

What does k-nearest neighbor allow us to do with a new, unknown data point? Determine its category (class, label, tag, etc.).

k-nearest neighbor is an unsupervised or supervised learning strategy? Supervised

What does the choice of $$k$$ represent? The number of neighbors that get a “vote” during the classification stage.

What problem may a very small value of $$k$$ cause? Noise has too great an impact. The nearest neighbor will be chosen without considering the “larger” picture.

What problem may a very large value of $$k$$ cause? The more common category will be chosen more often than it should.

Is there one value for $$k$$ that works best for nearly all data sets? If so, what is it? There is not one best value; need to experiment with different values to find the best for your dataset.

Give one benefit of k-nearest neighbor learning. It is a very simple algorithm and can work quite well in some cases.

Give one drawback of k-nearest neighbor learning. It is very slow because it checks every item in the database. It also requires one to retain all the training examples in the database.

### Document classification

What is the goal of document classification? Determine the category (class) or categories for a new document.

Define true positive (TP). Define false positive (FP). Define false negative (FN). Define precision (in terms of TP and/or FP and/or FN). Define recall (in terms of TP and/or FP and/or FN). Define F-score (in terms of precision and/or recall).

True positive (tp): chosen categories that are true categories.

False positive (fp): chosen categories that are not true categories.

False negatives (fn): true categories that are not chosen.

Precision: $$tp/(tp+fp)$$.

Recall: $$tp/(tp+fn)$$.

F-score: $$2 * precision * recall / (precision + recall)$$.

Suppose we make our classification engine more cautious; that is, it is less likely overall to predict any category. Does precision go up or down or remain unchanged? Does recall go up or down or remain unchanged? Precision goes up because there are fewer $$fp$$. Recall goes down because there are more $$fn$$.

If the true categories for a document are [Applications, Robots, Interfaces, Speech] and the predicted categories are [Applications, Games, Speech], what are the true positive, false positive, and false negative counts? TP: 2, FP: 1, FN: 2

If the true categories for document 1 are [Applications, Robots, Interfaces, Speech], for document 2 are [Games], for document 3 are [Ethics, Philosophy, Robots], and for document 4 are [Applications, Vision, ScienceFiction], then what are the precision, recall, and f-score if the predicted categories for document 1 are [Applications, Games, Speech], for document 2 are [Games], for document 3 are [MachineLearning, Vision], and for document 4 are [ScienceFiction]?

For doc 1: TP: 2, FP: 1, FN: 2.

For doc 2: TP: 1, FP: 0, FN: 0.

For doc 3: TP: 0, FP: 2, FN: 3.

For doc 4: TP: 1, FP: 0, FN: 2.

Sums: TP: 4, FP: 3, FN: 7.

Thus, precision: $$4/(4+3)=0.57$$, recall: $$4/(4+7)=0.36$$, f-score: $$2*0.57*0.36/(0.57+0.36)=0.44$$.

Give a possible result of stemming each of the following words: “linguistic,” “dog,” “jumping,” “philosophy.” Maybe: “linguist” “dog” “jump” “philos”. “Dog” won’t be stemmed, for example.

Describe a “binary feature vector” for a text document. Each unique word is a “dimension,” and the value for that dimension is 1 or 0. It is 1 if the word is present in the document, 0 otherwise.

Why do you suppose tf-idf feature vectors perform better (in terms of F-score) than, say, frequency count feature vectors? The “contribution” of each word is considered rather than its count. This matters because some common words, like “the,” don’t contribute any information about the document’s category. We can detect the contribution of a word by considering how many other docs have the same word.

A common word (most documents have many occurrences of this word) would have a high or low tf-idf score? A low tf-idf score.

What is a centroid (in terms of document vectors)? A vector that is the average of the vectors of all documents in some category.

What is one way a centroid approach to classification is better than k-nearest neighbor? Only need to keep the centroids in memory, not every individual training document.

### Probability and Bayesian methods

Describe what $$P(A)$$ means (in words). The probability that some event $$A$$ occurs.

Describe what $$P(A,B)$$ means (in words). The probability that two events $$A$$ and $$B$$ occur together.

Describe what $$P(A|B)$$ means (in words). The probability that some event $$A$$ occurs given that we know or are assuming event $$B$$ also occurs.

If events $$A$$ and $$B$$ are independent, and $$P(A) = 0.25$$, $$P(B) = 0.10$$, what is $$P(A,B)$$? $0.25 * 0.10 = 0.025$ What is $$P(B,A)$$? Same.

Write Bayes’ theorem. $$P(B|A) = P(A|B)P(B)/P(A)$$ or equivalently $$P(A|B) = P(B|A)P(A)/P(B)$$.

Suppose I know that $$P(B|A)=0.1, P(A)=0.9, P(B)=0.25$$, what is $$P(A|B)$$? $$P(A|B)=0.1*0.9/0.25=0.36$$.

What is the outcome of computing $$\arg\max X (P(X))$$ where $$X$$ is an event? (If $$\arg\max X (P(x))$$ was a function, what would the output of the function be?) Describe in English. The outcome would be an event, not a probability.

Suppose I know that $$P(B|A)=0.1, P(A)=0.9, P(B|C)=0.2, P(C)=0.8$$, what is $$\arg\max X (P(X|B))$$? The answer is $$C$$ because $$P(B|A)P(A) < P(B|C)P(C)$$.

As described in the naïve Bayesian classification notes, what does $$d_i$$ represent for some word $$i$$? $$d_i$$ is the number of documents (in the whole database) that contain at least one occurrence of word $$i$$. What does $$d_{ci}$$ represent for some word $$i$$ and category $$c$$? $$d_{ci}$$ is the number of documents in category $$c$$ that contain at least one occurrence of word $$i$$.

Why do we use logarithms for naïve Bayesian classification calculations? To avoid “underflow” with our numbers; after multiplying a lot of probabilities (one probability per word), we get very small values.

Give one benefit of naïve Bayesian classification (as compared to other classification methods like k-nearest neighbor). It is fast.

Give one drawback of naïve Bayesian classification (as compared to other classification methods like support vector machines). It is not always highly accurate.

## The “Chinese room” argument

What is the essential goal of “strong AI?” Create a true “mind,” i.e., an intelligent thinking machine. It may even be conscious.

What is the most critical assumption in the Chinese room argument? That the person in the room does not understand Chinese.

If you believe the Chinese room argument, can you also (reasonably) believe that passing the Turing test gives proof that a machine possesses a mind (i.e., can be said to truly understand things)? No.