PERFORMANCE OF EXPLORATORY STRUCTURAL EQUATION MODEL (ESEM) IN DETECTING DIFFERENTIAL ITEM FUNCTIONING

The validity of a standardised test is called into question if an irrelevant construct accounts for the performance of examinees and is wrongly modeled as ability in the intended construct (test items). A test must measure examinees' ability precisely, irrespective of their sub-population on any demographic variable. This paper explored the effects of gender and school location as major covariates on the West African Examinations Council (WAEC) mathematics items among examinees (N=2,866) using Exploratory Structural Equation Modeling (ESEM). The results showed that the test is multidimensional (six factors), with fit indices of χ2 (940)=4882.024, p<0.05, CFI=0.962, TLI=0.930, RMSEA=0.038 (90 % CI=0.037-0.039), SRMR=0.030, Akaike information criterion (AIC)=147290.577, Bayesian information criterion (BIC)=149585.436, and sample-size adjusted BIC=148362.154. Also, 10 (20 %) of the WAEC items showed significant DIF with respect to gender, while 3 (6 %) of the items showed significant DIF with respect to school location. The observed DIF items alert test developers that DIF may differentially affect the performance of examinees with the same ability level, with severe implications for the examinees. Hence, accurate and unbiased assessment should be the basic principle of any test item measurement, and test developers need to verify psychometrically that items are free from bias.


Introduction
Ensuring test fairness and equity among examinees is very important. Examinees should be given equal opportunity to display what they know and to perform well in the tested area, regardless of demographic characteristics such as race, gender, location, ethnicity, religion, colour, or linguistic background. The measurement community has increasingly employed standardised testing as a significant tool for assessing examinees' outcomes, their precise ability, and their potential for future academic success. However, measurement experts have long been apprehensive about the likelihood that test items, whether cognitive or non-cognitive, might function differently (that is, favour a sub-group) across groups of examinees [1,2]. Thus, the validity of a standardised test is called into question if an irrelevant construct accounts for the performance of examinees and is wrongly modeled as ability in the intended construct (test items). A test must measure examinees' ability precisely, irrespective of their sub-population on any demographic variable. In educational testing, the interpretation of an examinee's test score depends on the extent of its statistical independence across and among different groups of examinees.
The West African Examinations Council (WAEC) is an examination board established by law to determine the examinations required in the public interest in the English-speaking West African countries, to conduct terminal examinations for grade 12 students, and to award certificates comparable to those of equivalent examining authorities internationally [3]. Since its establishment in 1952, the council has contributed to education in the Anglophone countries of West Africa (Ghana, Nigeria, Sierra Leone, Liberia, and The Gambia) through the examinations it has coordinated and the certificates it has issued. It is noteworthy that this public examining body must avoid including items that do not give examinees a comparable opportunity in its tests. It needs to assess all items in a given test to ensure that they provide all examinees an equal opportunity to demonstrate their innate ability (traits), regardless of their demographic variables.

Social Sciences
but rather allow the optimal number of factors to be determined based on several statistical and interpretability criteria. All this might suggest that EFA is less essential than, and inferior to, CFA. Nevertheless, [33] argued that CFA's acceptance, acknowledgment, and usefulness could be seen as a motivation to create more parsimonious measurement models. More often than not, these models and items include a certain level of systematic measurement error in the form of cross-loadings. [34] asserted that items are rarely pure indicators of their corresponding constructs; they are fallible; thus, at least some degree of construct-relevant association can be expected between items and non-target, yet conceptually related, constructs. When non-zero cross-loadings are present but cannot be expressed, such restrictive constraints (that is, items can only load on one factor) can inflate the associations between the factors, as the mis-specified cross-loadings can only be expressed through these factorial associations. Moreover, the goodness-of-fit of the models and the discriminant validity of the factors can also be undermined by these overly restrictive specifications [27]. To address these limitations, the Exploratory Structural Equation Modeling (ESEM) framework [23,28] was developed; it integrates the advantages of the less restrictive EFA (allowing cross-loadings) and the more advanced CFA (goodness-of-fit assessment), providing a combination of the best of both approaches that can adequately accommodate complex measurement models.

Exploratory Structural Equation Model (ESEM)
In the exploratory structural equation model [23,24,26], the response variables Y=(Y_1, ..., Y_n) are related to the predictor variables X=(X_1, ..., X_m) and the k latent variables η=(η_1, ..., η_k) by the measurement and structural equations

Y = Λη + CX + ε,
η = ΓX + ζ,

under the standard assumptions that ε and ζ are normally distributed residuals with means of zero and variance-covariance matrices θ and ψ, respectively. Λ is the factor loading matrix, while C and Γ are matrices of regression coefficients that relate the covariates to the items and to the latent variables.
Although all parameters can be estimated with the maximum likelihood estimation (MLE) method, the model is generally not identified unless additional constraints are imposed. In CFA analyses, the two typical approaches are to set the metric of each latent variable either by fixing its variance to 1.0 or by fixing one of its factor loadings, typically to 1.0. The ESEM approach differs from the typical CFA approach in that all factor loadings are estimated, subject to constraints that identify the model. More specifically, [30] suggested that estimation of the ESEM model comprises several steps. First, an SEM model is estimated using the ML estimator. The factor variance-covariance matrix is specified as an identity matrix (ψ=I), giving k(k+1)/2 restrictions. In the EFA loading matrix (Λ), all entries above the main diagonal of the first k rows (that is, the upper right-hand corner of Λ) are fixed to 0, providing the remaining k(k−1)/2 identifying restrictions. This initial, unrotated model provides starting values that can be rotated into an EFA model with k factors. The asymptotic distribution of all parameter estimates in this starting-value model is also obtained. Subsequently, the ESEM variance-covariance matrix is computed.
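The counting of these identifying restrictions can be illustrated with a short sketch (pure NumPy, not Mplus code); the function name and the 50-item, 6-factor configuration are only illustrative:

```python
import numpy as np

def esem_identification_pattern(n_items, k):
    """Free/fixed pattern of the unrotated ESEM starting-value model:
    in the first k rows of the loading matrix, entries above the main
    diagonal are fixed to 0; the factor covariance matrix is set to I."""
    free = np.ones((n_items, k), dtype=bool)
    for i in range(k):
        free[i, i + 1:] = False               # k(k-1)/2 zero loadings
    psi_restrictions = k * (k + 1) // 2       # psi = I
    lambda_restrictions = k * (k - 1) // 2    # upper-triangle zeros
    return free, psi_restrictions + lambda_restrictions

free, n_restrictions = esem_identification_pattern(n_items=50, k=6)
# For k = 6: 21 + 15 = 36 = k**2 identifying restrictions in total.
```

Together, the two sets of constraints supply the k² restrictions that resolve the rotational indeterminacy before the starting values are rotated into the final ESEM solution.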
Researchers such as [32,35,36] have argued that ESEM is a better and more efficient method for accommodating cross-factor loadings than conventional latent variable analysis, because it assesses the measurement model of constructs through exploratory factor analysis (EFA) in place of CFA. Generally, ESEM yields improved model fit as well as deflated inter-factor correlations, which in turn improve the discriminant validity of the factors and produce a more realistic representation of the data [31,33,34,37]. Hence, the relatively novel ESEM method is now adequately established in the literature across various studies within behavioural and social-science research [38][39][40][41][42]. Numerous studies have noted the impressive performance of ESEM compared to CFA in investigating the measurement structure of latent variables [27,37,43].

Based on the review above, it is expected that ESEM would perform well in detecting differential item functioning with respect to the demographic profile of the examinees. Consequently, there is a likelihood that the predictors (gender and school location) have unique and distinct effects on the test items that their impact on the latent variables cannot fully explain. Succinctly, bias is an issue to be addressed, since tests are used as gatekeepers for educational opportunities, and test items should be fair for every examinee. Test use is justified only if the measures produce valid outcome data for the different sub-populations presented with the same test. Personal experience has shown that examinees responding to WAEC mathematics items come from various regions in Nigeria and are taught by different mathematics teachers. Teachers from these regions employ different constructs and instructional pedagogies, and their areas of specialty, years of experience, and qualifications differ; therefore, the content and constructs to which students are exposed differ across regions. Yet these students are expected to write the same examination. Do public examining bodies, such as WAEC, take cognisance of this situation when developing their test items? The answer is an emphatic no.
Consequently, this differing content exposure and location may significantly interfere with students' performance on the test. It may also threaten equity and fairness and lead to decisions that do not reflect students' actual ability. This could affect the predictive and construct validity of the test, since item bias was not adequately handled at the level of test development (i. e., in assessing the psychometric properties of the items).
Therefore, this study aimed to establish the performance of exploratory structural equation modeling in detecting differential item functioning and to answer the following research questions: (i) What is the number of underlying factors in the WAEC mathematics items? (ii) Is there a significant effect of the gender covariate on the mathematics items? (iii) Is there a significant effect of the school location covariate on the mathematics items?

Materials and methods
Design, Population, and Sample
Institutional Review Board Statement: the study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the University of Johannesburg (protocol code Sem 2-2021-164). Informed consent was obtained from all participants involved in the study.
This paper employed a non-experimental design of the instrumentation research type. The study population was prospective K-12 students, selected randomly from Education District 1 (Agege, Alimosho, and Ifako/Ijaye) of Lagos State, Nigeria. A sample of 2,866 participants (1,233 males, 43 %; 1,633 females, 57 %), aged between 14 and 20 years, was obtained from 28 schools in the state.
Summary descriptive statistics were calculated (Table 1), followed by exploratory factor analysis using the WLSMV estimator and Geomin rotation, and subsequently confirmatory factor analysis (CFA). After the best model was found, differential item functioning (DIF) analysis was conducted using the exploratory structural equation model (ESEM) to determine whether examinees' responses to the items differed by gender and school location after controlling for theta (θ).
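The factor enumeration itself was done in Mplus with nested model comparisons. As a rough, simplified stand-in, the idea of asking how many factors the inter-item correlations support can be sketched in pure NumPy using the Kaiser eigenvalue criterion on simulated dichotomous data (the two-factor structure, item count, and loading values below are illustrative assumptions, not the WAEC data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate dichotomous responses to 12 items driven by 2 latent traits
# (an illustrative stand-in for the 50-item WAEC data, which is not public).
n = 2866
theta = rng.normal(size=(n, 2))
loadings = np.zeros((12, 2))
loadings[:6, 0] = 1.0   # items 1-6 load on trait 1
loadings[6:, 1] = 1.0   # items 7-12 load on trait 2
logits = theta @ loadings.T + rng.normal(scale=0.5, size=(n, 12))
responses = (logits > 0).astype(int)

# Kaiser criterion: eigenvalues of the inter-item correlation matrix
# greater than 1.0 give a rough estimate of the number of factors.
corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]
n_factors = int((eigenvalues > 1.0).sum())
```

With real test responses, one would instead work from tetrachoric correlations and compare nested models on CFI, TLI, RMSEA, and SRMR, as reported in the Results.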

Instrumentation
In Nigeria, WAEC is a high-stakes examination. It is a standardised examination prepared by experienced and seasoned item developers from various higher institutions together with test and measurement experts. The 2020 mathematics instrument contained 50 multiple-choice items with four options (A-D), drawn from common areas of the mathematics syllabus of the examining body (descriptive statistics are given in Table 1). Examinees were assured that their responses would be treated with the utmost secrecy. The test took a maximum of 1 hour 30 minutes to complete. The empirical reliability was calculated to be 0.85. The test was administered to 3,000 participants, who shaded their answers on optical mark reader (OMR) sheets; 2,866 OMR sheets were returned and used in the analysis. The data obtained were analysed using Mplus 7.4 [44] and estimated with the robust maximum likelihood estimator (MLR), which provides standard errors and tests of model fit that are robust to non-normality of the data.
(Original Research Article: full paper (2022), «EUREKA: Social and Humanities», Number 1)

Results
Participants' responses to the dichotomous test items (mathematics achievement test) were subjected to EFA, implemented in Mplus, to determine the optimal number of factors in the test data. To achieve this, a 1-factor model was first hypothesised to fit the test data; then a 2-factor model was hypothesised, and the fit indices of the two models were compared. If the 2-factor model fit the data better than the 1-factor model, the data were further calibrated under the hypothesis that 3 factors fit the test data, and the fit of the 2-factor and 3-factor models was compared in the same way. This process continued until the optimal number of factors underlying the test data was reached. Tables 2, 3 present the model fit summary information for the test data. Table 2 presents the numbers of hypothesised factors underlying the 2020 WAEC mathematics test, while Table 3 depicts the nested model fit when the hypothesised factors were compared. According to [45], as cited in [46], a Comparative Fit Index (CFI) of 0.90 indicates an acceptable level of fit, a value of 0.95 indicates good fit, and CFI=1 indicates perfect fit; a Tucker-Lewis index (TLI) of 0.90 is an acceptable threshold, and 0.95 indicates a good fit; a Standardised Root Mean Square Residual (SRMR) of 0.00 indicates perfect fit, and a value of 0.08 or less is considered an indicator of good model fit. The Root Mean Square Error of Approximation (RMSEA) is the benchmark for judging the overall fit of a model, with a threshold value of 0.05 or less as a sign of good model fit.
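The incremental indices quoted above are computed from the target model's and the baseline (null) model's chi-square statistics. A minimal sketch of the standard formulas follows; the numbers plugged in are hypothetical, for illustration only, not the WAEC results:

```python
import math

def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
    """CFI, TLI, and RMSEA from the target model (m) and the
    baseline/null model (b) chi-square statistics, using the
    standard formulas (RMSEA here with the n - 1 convention)."""
    d_m = max(chi2_m - df_m, 0.0)          # model non-centrality
    d_b = max(chi2_b - df_b, 0.0)          # baseline non-centrality
    cfi = 1.0 - d_m / max(d_b, d_m, 1e-12)
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)
    rmsea = math.sqrt(d_m / (df_m * (n - 1)))
    return cfi, tli, rmsea

# Hypothetical illustration: a model with chi2=100 on 40 df against a
# baseline with chi2=1000 on 50 df, for a sample of 500.
cfi, tli, rmsea = fit_indices(100.0, 40, 1000.0, 50, 500)
# cfi ≈ 0.937, tli ≈ 0.921, rmsea ≈ 0.055
```

Both CFI and TLI reward the target model for reducing misfit relative to the baseline, which is why they share the 0.90/0.95 thresholds cited above.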
More importantly, to establish the number and characteristics of the interpretable factors, exploratory structural equation modeling was employed. Fig. 1 presents the ESEM of the six factors explaining the variance observed in the examinees' performance on the test, with χ2 (940)=4882.024, p<0.05, CFI=0.962, TLI=0.930, RMSEA=0.038 (90 % CI=0.037-0.039), SRMR=0.030, AIC=147290.577, BIC=149585.436, and sample-size adjusted BIC=148362.154. Based on these fit indices, six factors are embedded in the test data, with each factor having more than three substantial item loadings of 0.32 and above. Fig. 2 presents the performance of ESEM in detecting gender bias in the mathematics test data; the results are given in Table 4. Table 4 shows the DIF assessment of the mathematics test items using ESEM with gender as a covariate to model its direct effects on the factors' indicators. The covariate had significant direct effects on 10 (20 %) of the 50 test items, namely item 3 (p=0.003), item 4 (p=0.006), item 6 (p=0.003), item 11 (p=0.000), item 14 (p=0.039), item 18 (p=0.000), item 21 (p=0.008), item 22 (p=0.000), item 26 (p=0.038), and item 50 (p=0.0037). This indicates that these items function differentially across the gender of the students. Fig. 3 shows the ESEM with school location as a covariate of the six factors that underlie the students' performance on the test. The 6-factor model explaining the students' performance on the mathematics test items remained viable, with χ2 (940)=4937.553, p<0.05, CFI=0.968, TLI=0.977, RMSEA=0.038 (90 % CI=0.037-0.039), SRMR=0.029. Thus, the extent to which school location impacts the characteristics that underlie students' performance on the mathematics test was evaluated; the results are presented in Table 5. Table 5 shows the DIF assessment of the mathematics test items using ESEM with school location as a covariate to model its direct effects on the factors' indicators.
School location showed significant direct effects on 3 (6 %) of the 50 test items, namely item 5 (p=0.000), item 12 (p=0.009), and item 22 (p=0.000). The implication is that these items function differentially across the school location of the students.
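The study flags DIF through significant direct covariate effects inside ESEM, controlling for θ. A simpler, classical cross-check of the same logic, "same ability, different success probability", is the Mantel-Haenszel procedure, which compares an item's odds of success across groups within ability strata. The sketch below uses simulated data with one deliberately biased item; the group labels, the 1-logit effect size, and the five-stratum split are illustrative assumptions, not the paper's method or data:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 2866
ability = rng.normal(size=n)
group = rng.integers(0, 2, size=n)   # 0 = reference, 1 = focal (illustrative)

# Two items of equal difficulty; item B carries a direct group effect
# of 1 logit against the focal group, i.e. uniform DIF.
p_a = 1.0 / (1.0 + np.exp(-ability))
p_b = 1.0 / (1.0 + np.exp(-(ability - 1.0 * group)))
item_a = (rng.random(n) < p_a).astype(int)
item_b = (rng.random(n) < p_b).astype(int)

def mantel_haenszel_or(item, group, ability, n_strata=5):
    """Mantel-Haenszel common odds ratio for one item, stratified on
    ability; values near 1 suggest no uniform DIF."""
    cuts = np.quantile(ability, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(ability, cuts)
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum((item == 1) & (group == 0) & m)  # reference, correct
        b = np.sum((item == 0) & (group == 0) & m)  # reference, incorrect
        c = np.sum((item == 1) & (group == 1) & m)  # focal, correct
        d = np.sum((item == 0) & (group == 1) & m)  # focal, incorrect
        t = np.sum(m)
        num += a * d / t
        den += b * c / t
    return num / den

or_a = mantel_haenszel_or(item_a, group, ability)  # no DIF: close to 1
or_b = mantel_haenszel_or(item_b, group, ability)  # uniform DIF: well above 1
```

In the ESEM approach, the estimated direct effect of the covariate on each item, over and above its effect through θ, plays the role of this stratified comparison.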

Discussions
The number of factors that underlie the WAEC mathematics test items was established using exploratory factor analysis, with all items having statistically significant loadings on their intended factors. The model fit information was compared, and the 6-factor model demonstrated excellent fit. The six factors were labeled Number and Numeration, Algebraic Processes, Introduction to Calculus, Statistics and Probability, Mensuration, and Trigonometry. Thus, the instrument is multidimensional, and more than one trait explains the observed variance in students' performance on the test items. This conclusion conforms with the findings of [10][11][12]47], who posited that standardised instruments developed for selection, placement, and scholarship awards might not be unidimensional, especially when the test items are drawn from various areas. For instance, the National Benchmark Test in South Africa consists of Academic Literacy, Quantitative Literacy, and Mathematics; the Graduate Management Admission Test consists of mathematics, verbal reasoning, quantitative reasoning, and English language; and the Joint Admissions and Matriculation Board examination consists of Mathematics, English, Physics, Chemistry, Biology, and so on. Also, this result lends credence to the findings of [5,6], which argued that there was no evidence of unidimensionality in the 2018 Osun State unified multiple-choice mathematics achievement test items. However, this result was in dissonance with the findings of [4,48], in which the unidimensionality of the test was met when comparing three methods for evaluating dimensionality.
Also, ESEM was used to confirm the appropriateness and viability of the six isolated factors; the fit indices were adequately acceptable, making the factors interpretable. ESEM proved a potent tool for determining and identifying items that function differently across sub-groups of students. A few items were flagged as operating differentially with respect to the gender and school location covariates. The findings align with the results of [21,22,25,34,37,49] that personal attributes, such as gender and school location, systematically affect examinees' performance on an item, though the methods in those studies differed from ESEM.
This study has implications for public examining bodies, test developers, and practitioners regarding the existence of DIF, which may inappropriately and differentially affect the performance of examinees with the same ability level in an examination. The implications of such a test are severe for the examinees. Hence, stakeholders in educational assessment need to test items psychometrically to ensure they are free from bias. The findings of this study can also serve as a scientific basis for drawing inferences, deducing conclusions, and making recommendations for improving the process of test development. The study's limitation is that the results might not generalise beyond its scope.
Further studies can establish item bias for other demographic profiles, such as linguistic background, race, ethnicity, etc., using methods other than ESEM. The scope can also be expanded to other regions of the country, such as the south-south, south-east, north-west, north-central, and north-east, to establish how these items function there. Another limitation is that ESEM is a relatively new technique, employed by few researchers to test item bias and the invariance of a measurement instrument across sub-populations (e. g., [30,39,42]). Given the peculiarity of the approach and the few studies that have used it, some caution is warranted. A recent computational simulation [50] suggests that ESEM has problems with convergence (e. g., the algorithm does not run), especially when sample sizes are small (less than 200) or the ratio of cases to variables is too small. ESEM is apt when there are considerable cross-loadings of items [50]. Where cross-loadings are close to zero and the factor structure is clear (high loadings of items on the relevant factors), ESEM may not be necessary. Hence, ESEM is an appealing method when a researcher has a large sample and substantive cross-loadings in the model cannot be ignored.

Conclusion
It is crucial for item writers and test developers of this public examining body to ensure that test items are valid, reliable, and free from bias. Factors that increase the validity of test scores need to be strengthened, while variables that lower the validity of score interpretation should be removed. Unwanted constructs embedded in the test that affect decisions should be eliminated from the scores, because decisions about students are made based on the outcomes of this test. However, it is possible for items in a test to function inappropriately for different sub-populations, and one must ask whether an observed difference is embedded in the construct being assessed or is a source of bias in test interpretation. Therefore, differential item functioning analysis should be performed on the items to ascertain that they function equally across sub-groups of students before administration. This would increase the credibility of the certificates awarded by the examining board.

Data Availability Statement
The dataset presented in this study is available on request. The data are not publicly available due to privacy reasons.