<p>The Maximum Fisher Information (MFI) method was introduced by Lord (Lord, 1980; Thissen & Mislevy, 2000), and it was the most widespread item selection strategy (ISS) in the early days of CAT.</p>
<p>Fisher information is a measure of the amount of information about the unknown ability <span class="math inline">\(\theta\)</span> carried by the response pattern (Davier et al., 2019).</p>
<div id="definition-1" class="section level4">
<h4>Definition</h4>
<p>According to Davier et al. (2019), we first define the score function as the first derivative of the log-likelihood function:</p>
<p><span class="math display">\[S(X;\theta)=\frac{\partial}{\partial\theta}\log f(X;\theta)\]</span></p>
<p>where <span class="math inline">\(f(X;\theta)\)</span> is the likelihood function, <span class="math inline">\(\theta\)</span> is the underlying latent trait, and <span class="math inline">\(X\)</span> represents the observed response pattern.</p>
<p>Fisher information is the second moment of this score function:</p>
<p><span class="math display">\[I(\theta)=E[S(X;\theta)^2]\]</span> where <span class="math inline">\(I(\theta)\)</span> is the Fisher information.</p>
<p>According to Davier et al. (2019), the Fisher information has three useful interpretations. First, it is the variance of the score function: since <span class="math inline">\(E[S(X;\theta)]=0\)</span>, we obtain <span class="math display">\[I(\theta)=E[S(X;\theta)^2]-E[S(X;\theta)]^2=Var[S(X;\theta)]\]</span> Second, it is the expectation of the negative second-order derivative of the log-likelihood, evaluated at the true value of the parameter: <span class="math display">\[I(\theta)=-E\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]\]</span></p>
<p>Third, Fisher information reflects the precision of the parameter estimate: the larger it is, the more precise the estimate, i.e., the more information the responses carry about <span class="math inline">\(\theta\)</span>.</p>
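<p>To make the definition concrete, here is a brief numerical sketch (not from Davier et al., 2019; the 2PL item, its parameter values, and the finite-difference check are illustrative assumptions). It verifies that the variance of the score and the expected negative second derivative of the log-likelihood both recover the closed-form 2PL item information <span class="math inline">\(a^2P(\theta)(1-P(\theta))\)</span>:</p>
<pre><code># Minimal check that Var[score] and -E[d^2 logL / d theta^2] agree with the
# closed-form Fisher information of a single (hypothetical) 2PL item.
import numpy as np

def p_2pl(theta, a=1.5, b=0.0):
    """Item response function of an illustrative 2PL item."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def log_lik(u, theta, a=1.5, b=0.0):
    """Bernoulli log-likelihood of response u (0 or 1) at ability theta."""
    p = p_2pl(theta, a, b)
    return u * np.log(p) + (1 - u) * np.log(1 - p)

theta0, a, b, h = 0.7, 1.5, 0.0, 1e-4
p = p_2pl(theta0, a, b)

# Score and curvature via central finite differences, for u = 0 and u = 1.
score = lambda u: (log_lik(u, theta0 + h) - log_lik(u, theta0 - h)) / (2 * h)
curv = lambda u: (log_lik(u, theta0 + h) - 2 * log_lik(u, theta0) + log_lik(u, theta0 - h)) / h**2

var_score = p * score(1) ** 2 + (1 - p) * score(0) ** 2  # E[S^2], since E[S] = 0
neg_exp_hess = -(p * curv(1) + (1 - p) * curv(0))        # -E[d^2 logL / d theta^2]
analytic = a ** 2 * p * (1 - p)                          # closed-form 2PL item information

print(var_score, neg_exp_hess, analytic)  # all three agree up to finite-difference error
</code></pre>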
</div>
<div id="application" class="section level4">
<h4>Application</h4>
<p>Item k’s Fisher information is given by <span class="math inline">\(I_k(\theta)=\frac{[P_k'(\theta)]^2}{P_k(\theta)Q_k(\theta)}\)</span> (Davier et al., 2019), where <span class="math inline">\(P_k(\theta)\)</span> is the item response function for item k specified by the selected IRT model, <span class="math inline">\(Q_k(\theta)=1-P_k(\theta)\)</span>, and <span class="math inline">\(P_k'(\theta)\)</span> is the first derivative of the item response function with respect to <span class="math inline">\(\theta\)</span>.</p>
<p>Assuming local independence, the test information <span class="math inline">\(I(\theta)\)</span> is additive over item information, that is, <span class="math inline">\(I(\theta)=\sum_k I_k(\theta)\)</span>.</p>
<p>For the three-parameter logistic (3PL) model, <span class="math inline">\(P_k(\theta)\)</span> is given by <span class="math display">\[P_k(\theta)=c_k+(1-c_k)\frac{e^{a_k(\theta-b_k)}}{1+e^{a_k(\theta-b_k)}}\]</span></p>
<p>where <span class="math inline">\(a_k\)</span>, <span class="math inline">\(b_k\)</span> and <span class="math inline">\(c_k\)</span> respectively denote the discrimination, difficulty, and guessing parameters of the kth item.</p>
<p>If the MFI method is used for item selection, then, given the current estimate of <span class="math inline">\(\theta\)</span>, the eligible item in the bank with the largest Fisher information is selected as the next item to be administered.</p>
<p>Since the asymptotic variance of <span class="math inline">\(\hat\theta^{ML}\)</span>, the maximum likelihood estimate of <span class="math inline">\(\theta\)</span>, is inversely proportional to the test information, the MFI method is widely regarded as a way to minimize the asymptotic variance of the <span class="math inline">\(\theta\)</span> estimate, that is, to asymptotically maximize measurement precision.</p>
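<p>As an illustration of how MFI item selection could be implemented, here is a hedged sketch (the item pool, its parameter values, and the function names are made up for this example and are not taken from the cited sources). It computes <span class="math inline">\(I_k(\theta)\)</span> for 3PL items and returns the eligible item with maximum information at the current ability estimate:</p>
<pre><code># Sketch of MFI item selection under the 3PL model with a hypothetical pool.
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL item response function P_k(theta)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information I_k(theta) = [P'(theta)]^2 / (P * Q) of a 3PL item."""
    p = p_3pl(theta, a, b, c)
    p_prime = a * (1 - c) * np.exp(-a * (theta - b)) / (1 + np.exp(-a * (theta - b))) ** 2
    return p_prime ** 2 / (p * (1 - p))

def select_next_item(theta_hat, pool, administered):
    """Return the index of the eligible item with maximum Fisher information."""
    best, best_info = None, -np.inf
    for k, (a, b, c) in enumerate(pool):
        if k in administered:
            continue
        info = item_information(theta_hat, a, b, c)
        if info > best_info:
            best, best_info = k, info
    return best

# Hypothetical pool of (a, b, c) triples and a current ability estimate.
pool = [(1.2, -0.5, 0.2), (0.8, 0.0, 0.25), (1.7, 0.3, 0.15), (1.0, 1.2, 0.2)]
print(select_next_item(theta_hat=0.4, pool=pool, administered={0}))
</code></pre>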
</div>
<div id="drawbacks" class="section level4">
<h4>Drawbacks</h4>
<p>Firstly, Fisher information does not naturally apply to cognitive diagnosis, as it is by definition based on a continuous variable. Moreover, in the early phases of CAT, the ability estimate may not yet be accurate. Maximizing information on the basis of an inaccurate and unstable estimate of <span class="math inline">\(\theta\)</span> has been described as “capitalization on chance” (van der Linden & Glas, 2000). Thus, using MFI in the early stages of a CAT program may not be ideal.</p>
<p>Secondly, the MFI method tends to pick items with large discrimination parameters and rarely uses items with smaller ones. This means that some of the items in the item pool may be underutilized. At the same time, the excessive exposure of a small number of highly discriminating items may be a critical threat to the security of the test (Chang, 2015; Chang & Ying, 1999).</p>
<p>In addition, the number of items from various content areas or sub-areas often needs to be balanced in order to maintain the face and content validity of the CAT (Cheng, Chang, & Yi, 2007; Yi & Chang, 2003).</p>
</div>
<div id="improvement" class="section level4">
<h4>Improvement</h4>
<p>The global information method was put forward by Chang and Ying (1996); it uses KL distance (information) rather than Fisher information in item selection. They demonstrated that global information is more robust against the instability of ability estimation in the early stage of CAT.</p>
</div>
</div>
<div id="kl-algorithm" class="section level3">
<h3>KL Algorithm</h3>
<p>Chang & Ying (1996) proposed the global information method, which utilizes the KL distance (information) instead of Fisher information in item selection. Being more robust, global information can be used to combat the instability of ability estimation in the early stage of CAT.</p>
<p>Fisher information is defined for a continuous variable; when the latent variable is discrete, the KL algorithm is preferred.</p>
<p>The Kullback–Leibler distance (KL-distance) is defined as a natural distance function from a “true” probability distribution, p, to a “target” probability distribution, q. It can be interpreted as the expected extra message length per datum due to using a code based on the wrong (target) distribution compared to using a code based on the true distribution.</p>
<div id="definition-2" class="section level4">
<h4>Definition</h4>
<p>For discrete (not necessarily finite) probability distributions, <span class="math inline">\(p=\{p_1,\dots,p_n\}\)</span> and <span class="math inline">\(q=\{q_1,\dots,q_n\}\)</span>, the KL-distance is defined to be</p>
<p><span class="math display">\[D_{KL}(P||Q)=\sum_{i=1}^{n} p_i \ln\frac{p_i}{q_i}\]</span></p>
<p>For continuous probability densities, <span class="math display">\[D_{KL}(P||Q)=\int_{-\infty}^{\infty} P(x)\ln\Big(\frac{P(x)}{Q(x)}\Big)\,dx\]</span></p>
<p>Xu et al.’s (2005) KL Algorithm:</p>
<p>According to Cover & Thomas (1991), KL information is a measure of “distance” between two probability distributions, which can be defined as: <span class="math display">\[d[f,g]=E_f\left[\log \frac{f(x)}{g(x)}\right]\]</span> where f(x) and g(x) are two probability distributions.</p>
<p>However, because d[f, g] and d[g, f] are not symmetric, KL information is not a true distance measure. The term KL distance is nevertheless used because of its interpretation: the larger d[f, g] is, the easier it is to statistically distinguish between the two probability distributions f(x) and g(x) (Henson & Douglas, 2005).</p>
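<p>A small numerical illustration (my own, using the discrete definition above rather than an example from the cited papers) of this asymmetry, for a fair versus a heavily biased coin:</p>
<pre><code># Discrete KL information d[f, g] = sum_x f(x) * log(f(x) / g(x)) and its asymmetry.
import numpy as np

def kl(f, g):
    """KL information between two discrete distributions given as probability arrays."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return float(np.sum(f * np.log(f / g)))

fair = [0.5, 0.5]      # fair coin
biased = [0.9, 0.1]    # heavily biased coin

print(kl(fair, biased))  # d[f, g] is about 0.51
print(kl(biased, fair))  # d[g, f] is about 0.37, so KL is not a symmetric distance
</code></pre>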
<h4>The KL Algorithm Based on Kullback–Leibler Information (Cheng, 2009)</h4>
<p>Suppose t items have been administered, and the available items in the pool form the set <span class="math inline">\(R^{(t)}\)</span> at this stage. Consider item h in <span class="math inline">\(R^{(t)}\)</span>. In cognitive diagnosis, the quantity of interest is the conditional distribution of person i’s item response <span class="math inline">\(U_{ih}\)</span> given his or her latent state, or cognitive profile, <span class="math inline">\(\alpha_i\)</span>. Following the notation of McGlohen and Chang (2008), <span class="math inline">\(\alpha_{i}=(\alpha_{i1},\alpha_{i2},...,\alpha_{ik},...,\alpha_{iK})'\)</span>.</p>
<p>Here <span class="math inline">\(\alpha_{ik}=0\)</span> indicates that the ith examinee has not mastered the kth attribute, and <span class="math inline">\(\alpha_{ik}=1\)</span> indicates mastery. An attribute is a task, cognitive process, or skill involved in answering an item.</p>
<p>Because the true state is unknown, a global measure of discrimination can be constructed on the basis of the KL distance between the distribution of <span class="math inline">\(U_{ih}\)</span> given the current estimate of person i’s latent cognitive state (i.e., <span class="math inline">\(f(U_{ih}|\hat \alpha_i^{(t)})\)</span>) and the distribution of <span class="math inline">\(U_{ih}\)</span> given other states.</p>
<p>Xu et al. (2003) proposed using the straight sum of the KL distances between <span class="math inline">\(f(U_{ih}|\hat \alpha_i^{(t)})\)</span> and all the <span class="math inline">\(f(U_{ih}|\alpha_c)\)</span>, c = 1, 2, …, <span class="math inline">\(2^K\)</span> (when there are K attributes, there are <span class="math inline">\(2^K\)</span> possible latent cognitive states): <span class="math display">\[KL_h(\hat \alpha_i^{(t)})=\sum_{c=1}^{2^K}D_h(\hat \alpha_i^{(t)}||\alpha_c)\]</span></p>
<p>Then the (t + 1)th item for the ith examinee is the item in <span class="math inline">\(R^{(t)}\)</span> that maximizes <span class="math inline">\(KL_h(\hat \alpha_i^{(t)})\)</span>. This is referred to as the KL algorithm. The items selected using this algorithm are, on average, the most powerful ones for distinguishing the current latent class estimate from all other possible latent classes.</p>
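<p>The following sketch illustrates the KL algorithm just described. The DINA-style response probabilities, the Q-matrix, and the slip and guessing values are illustrative assumptions of mine, not the models or parameters used in the cited studies:</p>
<pre><code># Select the next item by maximizing the sum of KL distances from the current
# latent-class estimate to all 2^K possible cognitive profiles.
from itertools import product
import numpy as np

def p_correct(alpha, q_row, slip=0.1, guess=0.2):
    """P(U = 1 | alpha) under a DINA-like rule: mastery of all required attributes."""
    mastered_all = all(a >= q for a, q in zip(alpha, q_row))
    return (1 - slip) if mastered_all else guess

def kl_item(alpha_hat, alpha_c, q_row):
    """KL distance D_h(alpha_hat || alpha_c) between the two Bernoulli response distributions."""
    p_hat, p_c = p_correct(alpha_hat, q_row), p_correct(alpha_c, q_row)
    return (p_hat * np.log(p_hat / p_c)
            + (1 - p_hat) * np.log((1 - p_hat) / (1 - p_c)))

def kl_index(alpha_hat, q_row, K):
    """KL_h(alpha_hat): sum of KL distances from alpha_hat to all 2^K latent classes."""
    return sum(kl_item(alpha_hat, alpha_c, q_row)
               for alpha_c in product([0, 1], repeat=K))

K = 3
q_matrix = [(1, 0, 0), (0, 1, 1), (1, 1, 0), (0, 0, 1)]  # hypothetical item pool
alpha_hat = (1, 0, 1)                                    # current profile estimate
administered = {0}

# The (t + 1)th item is the eligible item with the largest KL index.
kl_values = {h: kl_index(alpha_hat, q, K) for h, q in enumerate(q_matrix) if h not in administered}
print(max(kl_values, key=kl_values.get))
</code></pre>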
<p>KL information is also helpful for parameter estimation. For instance, if p(x) is unknown, a parametric family <span class="math inline">\(q(x|\theta)\)</span> can be constructed to approximate p(x). To estimate <span class="math inline">\(\theta\)</span>, draw N samples from p(x) and construct the log-likelihood function <span class="math display">\[L(\theta)=\sum_{i=1}^{N}\log q(x_i|\theta)\]</span></p>
<p>Then use MLE to estimate <span class="math inline">\(\theta\)</span>: maximizing <span class="math inline">\(L(\theta)\)</span> is (asymptotically) equivalent to minimizing the KL distance from p(x) to <span class="math inline">\(q(x|\theta)\)</span>.</p>
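<p>A tiny simulation (purely illustrative, with a made-up Bernoulli example; neither the distribution nor the code comes from the cited sources) showing that the MLE of <span class="math inline">\(\theta\)</span> lands where the KL distance from the true p(x) to <span class="math inline">\(q(x|\theta)\)</span> is smallest:</p>
<pre><code># Estimate a Bernoulli parameter by MLE and check that (approximately) the same
# value minimizes the KL distance from the true distribution p(x).
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.7
x = rng.binomial(1, true_p, size=1000)  # N samples from the "unknown" p(x)

theta_mle = x.mean()                    # MLE of theta for q(x | theta) = Bernoulli(theta)

def kl_bernoulli(p, theta):
    """KL distance from Bernoulli(p) to Bernoulli(theta)."""
    return p * np.log(p / theta) + (1 - p) * np.log((1 - p) / (1 - theta))

grid = np.linspace(0.01, 0.99, 99)
theta_min_kl = grid[np.argmin(kl_bernoulli(true_p, grid))]

print(theta_mle, theta_min_kl)          # both are close to 0.7
</code></pre>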
</div>
</div>
<div id="shannon-entropy" class="section level3">
<h3>Shannon entropy</h3>
<p>It is often necessary to quantify the uncertainty of a random variable, and Shannon entropy is a good candidate for measuring it. Cheng (2009) gave an example of Shannon entropy: a fair coin has an entropy of one unit, while an unfair coin has lower entropy because there is less uncertainty when guessing its outcome.</p>
<div id="definition-3" class="section level4">
<h4>Definition</h4>
<p>For a discrete random variable X that takes values among <span class="math inline">\(x_1,x_2,...,x_n\)</span>, the Shannon entropy is defined as</p>
<p><span class="math display">\[H(X)=-\sum_{i=1}^{n} p(x_i)\log_b p(x_i)\]</span></p>
<p>In this definition, <span class="math inline">\(p(x_i)\)</span> is the probability that X = <span class="math inline">\(x_i\)</span>. H(X) can also be written as H(P) or H(<span class="math inline">\(p_1,p_2,...,p_n\)</span>). From the formula, we can conclude that independent uncertainties are additive. b is the base of the logarithm, usually taken to be 2, e, or 10; the choice of b only changes the unit of entropy. For b = 2, the unit is the bit; for b = e, the nat; and for b = 10, the dit (digit).</p>
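<p>A minimal computation (mine, not from Cheng, 2009) of the entropy of the fair and unfair coins mentioned above, with b = 2 so that the unit is bits:</p>
<pre><code># Shannon entropy H(P) = -sum_i p_i * log_b(p_i) for a couple of coins.
import math

def shannon_entropy(probs, b=2):
    """Entropy of a discrete distribution, skipping zero-probability outcomes."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(shannon_entropy([0.9, 0.1]))  # unfair coin -> about 0.47 bits
</code></pre>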
</div>
</div>
<div id="properties" class="section level3">
<h3>Properties</h3>
<p>The choice of b does not influence the properties of Shannon entropy, so we do not need to worry about the value of b in this part.</p>