Some records being assigned entries in CR_SUBJECT #121

Open
pmusser opened this issue Nov 19, 2024 · 4 comments
Labels
bug Something isn't working product:ocw Issues related to the Open Courseware product

Comments

@pmusser
Contributor

pmusser commented Nov 19, 2024

Expected Behavior

When ocw_oer_export.csv generates, all entries should have a value in the CR_SUBJECT field.

Current Behavior

On the most recent export, 52 rows do not have a value in CR_SUBJECT.

Steps to Reproduce

  1. Most recent csv attached for review (ocw_oer_export (7).csv), along with screenshot of csv in OpenRefine:
    Image

Possible Solution

I had a casual look through the records to see if there was some common thread (e.g. publish time or obvious subject that isn't mapping) but nothing jumped out at me. If someone can give me an export either as a csv or even a comment attached to this issue containing the course name, URL, and OCW subject headings for the entries that haven't been mapped, I can give mappings for them.

pmusser added the bug label Nov 19, 2024
@pdpinch
Member

pdpinch commented Nov 19, 2024

@ibrahimjaved12 can you take a look at this?

I looked at a couple of courses (e.g. 11.255 and 15.760a) and couldn't find any commonalities.

pdpinch added the product:ocw label Nov 19, 2024
@ibrahimjaved12
Collaborator

ibrahimjaved12 commented Nov 28, 2024

@pdpinch It looks like there are two topics fields in the MIT Learn API, and we may not be using the right one.

  1. First topics field (represents MIT Learn topics for OCW courses):
"topics": [
            {
                "id": 720,
                "name": "Policy and Administration",
                "icon": "RiUserSearchLine",
                "parent": 712,
                "channel_url": "https://learn.mit.edu/c/topic/policy-and-administration/"
            },
            {
                "id": 698,
                "name": "Data Science, Analytics & Computer Technology",
                "icon": "RiLineChartLine",
                "parent": null,
                "channel_url": "https://learn.mit.edu/c/topic/data-science-analytics-computer-technology/"
            },
            {
                "id": 715,
                "name": "Economics",
                "icon": "RiUserSearchLine",
                "parent": 712,
                "channel_url": "https://learn.mit.edu/c/topic/economics/"
            },
        ],
  2. Second ocw_topics field:
"ocw_topics": [
            "Developmental Economics",
            "Economics",
            "Energy",
            "International Development",
            "Public Administration",
            "Public Policy",
            "Science and Technology Policy",
            "Security Studies",
            "Social Science",
            "Technology"
        ],

The Issue and fix

Currently, we are relying on the first topics field (MIT Learn topics), but I believe we should be using the second ocw_topics field.

So the courses themselves are fine; their OCW topics are simply being ignored. Can you please confirm this so I can go ahead and create a PR that uses ocw_topics instead?
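
To make the proposed change concrete, here is a minimal sketch of deriving subjects from ocw_topics rather than topics. The function name, the shape of the mapping (a plain dict), and the mapping values are assumptions for illustration only; the real entries live in the project's mapping file.

```python
# Hypothetical excerpt of an OCW topic -> OER subject mapping.
# The values below are illustrative, not taken from the real mapping file.
OCW_TOPIC_TO_OER_SUBJECT = {
    "Economics": "Economics",
    "Public Policy": "Political Science",
}


def subjects_from_ocw_topics(course, mapping):
    """Return sorted, de-duplicated OER subjects for a course's ocw_topics.

    `course` is assumed to be one course record from the MIT Learn API,
    where `ocw_topics` is a list of strings (unlike `topics`, which is a
    list of objects with `id`, `name`, and so on).
    """
    ocw_topics = course.get("ocw_topics") or []
    return sorted({mapping[t] for t in ocw_topics if t in mapping})


course = {
    "topics": [{"id": 715, "name": "Economics", "parent": 712}],
    "ocw_topics": ["Economics", "Public Policy", "Security Studies"],
}
subjects = subjects_from_ocw_topics(course, OCW_TOPIC_TO_OER_SUBJECT)
```

Topics without an entry in the mapping are skipped here; whether they should instead raise or be logged is a design choice left open.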

Some additional analysis

@pmusser I also performed an analysis of the OCW topics that might be useful.

In the flattened structure, there are 335 OCW topics, and the mapping file maps 328 OCW topics to an OER Subject.

These are the OCW Topics that are missing an OER Subject:

African American Studies (15 published courses)
Biomaterials (19 published courses)
Discrete Mathematics (39 published courses)
Italian (1 published course)
Nonfiction Prose (27 published courses)
Periodic Literature (8 published courses)
Project Management (29 published courses)
Quechua (0 published courses)
Russian (1 published course)
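
A comparison like this can be reproduced with a set difference. The sketch below is an assumption about how such a check might look, not the script actually used for the analysis; the topic list and mapping are tiny hypothetical samples.

```python
def unmapped_topics(all_topics, mapping):
    """Return the topics that have no entry in the mapping, sorted by name."""
    return sorted(set(all_topics) - set(mapping))


# Hypothetical sample data; the real inputs would be the flattened
# OCW topic list and the OCW topic -> OER subject mapping file.
all_topics = ["Economics", "Italian", "Project Management", "Public Policy"]
mapping = {"Economics": "Economics", "Public Policy": "Political Science"}

missing = unmapped_topics(all_topics, mapping)
```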

However, these are 2nd-level subtopics or 3rd-level specialities in the unflattened hierarchy (the OCW topics hierarchy follows the structure topic --> subtopic --> speciality). This means their OER subject would still be populated through their parent category. So in essence they should be populated, but I wanted to share this in case we would like to have more specific subjects for them.

OCW Topics Hierarchy for Missing Topics:

- Humanities:
  - Language:
    - Italian .........3rd level missing speciality
    - Quechua .........3rd level missing speciality
    - Russian .........3rd level missing speciality
  - Literature:
    - Periodic Literature .........3rd level missing speciality
    - Nonfiction Prose .........3rd level missing speciality
- Business:
  - Project Management .........2nd level missing subtopic
- Engineering: 
  - Biological Engineering:
    - Biomaterials .........3rd level missing speciality
- Mathematics:
  - Discrete Mathematics .........2nd level missing subtopic
- Society:
  - African American Studies .........2nd level missing subtopic
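
One way to make the parent fallback explicit, instead of relying on the parent topic also appearing in a course's flattened topic list, is to walk up the hierarchy until a mapped topic is found. This is only a sketch: the child-to-parent links come from the hierarchy above, while the function name and the mapping values are hypothetical.

```python
# Child -> parent links, taken from the hierarchy listed above.
PARENT = {
    "Italian": "Language",
    "Language": "Humanities",
    "Discrete Mathematics": "Mathematics",
}

# Hypothetical mapping values, for illustration only.
MAPPING = {"Language": "Languages", "Mathematics": "Mathematics"}


def resolve_subject(topic, mapping, parent):
    """Return the subject for `topic` or its nearest mapped ancestor, else None."""
    seen = set()  # guards against accidental cycles in the parent links
    current = topic
    while current is not None and current not in seen:
        if current in mapping:
            return mapping[current]
        seen.add(current)
        current = parent.get(current)
    return None


italian_subject = resolve_subject("Italian", MAPPING, PARENT)
```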

Analysis File

For a complete list of OCW courses and their mappings, please see this analysis file:
https://docs.google.com/spreadsheets/d/102Wdw1he5qVYh458XNTHwP5p9IVd0RBTRZx3piIgpTE/edit?gid=0#gid=0

All OCW Topics Hierarchy file

ocw_topics_hierarchy.txt

In any case, this is the list of the 52 courses that do not have a CR_SUBJECT currently:
https://docs.google.com/spreadsheets/d/1UoMRJUVtLHIWechYEXWa6GZNTpRcOIQWS23wK_5_GFk/edit?gid=632809784#gid=632809784

@pdpinch
Member

pdpinch commented Dec 4, 2024

@ibrahimjaved12 If I recall correctly, the initial mapping for this app was written by a member of the OCW team, using ocw_topics_hierarchy.txt and the "Subject Area" values from OER Commons Search.

Sometime later, the MIT Learn site established a new (simpler) topic hierarchy, which replaced the values in topics in the MIT Learn API.

The original OCW topics should be available in the MIT Learn API via "ocw_topics" above. Does that seem correct?

I'm not sure which topic hierarchy to use for OER mapping. The current plan is to keep both. Given that we already have a mapping from OCW Topics to OER Commons Subject Areas, the simpler solution is to complete that mapping. Can you identify which values have been overlooked, and I can refer them to the OCW team?

@ibrahimjaved12
Collaborator

@pdpinch

I created a new test CSV that uses values from ocw_topics both for CR_SUBJECT and as the fallback for CR_KEYWORDS. With that change, no mappings are missing for these fields in the mapping files; the existing mappings are sufficient. Using ocw_topics would resolve both of these issues.

However, there are missing values in the CSV for these two courses:

  1. Social Theory and Analysis (https://ocw.mit.edu/courses/21a-750j-social-theory-and-analysis-fall-2011) -- This should be removed from the search index, as we discussed earlier.
  2. Teaching La Princesse de Clèves (https://ocw.mit.edu/courses/res-21g-3001-teaching-la-princesse-de-cleves-fall-2023) -- This seems to be a course in progress that has yet to be published with actual data: https://ocw-studio.odl.mit.edu/sites/teaching-la-princesse-de-cleves

Since both CR_SUBJECT and CR_KEYWORDS are required fields for OER Commons, I propose:

  1. Using ocw_topics instead of topics.
  2. Using the fallback value "MIT Open Learning" so that required fields are never empty, as proposed here: CR_KEYWORD entries -- various issues #122 (comment)

Please note the current logic:

  • For CR_KEYWORDS, the function:
    • Checks for the course's keywords in fm_ocw_keywords_mapping based on Course_URL.
    • If none are found, as fallback it currently uses topics (instead of ocw_topics).
  • For CR_SUBJECT, the function:
    • Relies on the mapping file for conversion (but course topics are used, instead of ocw_topics).
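
Combining the current CR_KEYWORDS logic with the two proposals gives a fallback chain along the following lines. This is a sketch under stated assumptions, not the project's actual code: fm_ocw_keywords_mapping is assumed to be a dict keyed by course URL, and the function name and the `url` field on the course record are hypothetical.

```python
DEFAULT_VALUE = "MIT Open Learning"  # fallback value proposed in this thread


def cr_keywords(course, fm_ocw_keywords_mapping):
    """Pick CR_KEYWORDS for a course using a three-step fallback chain.

    1. Keywords mapped to the course URL, if any.
    2. Otherwise the course's ocw_topics (rather than topics).
    3. Otherwise the default value, so the required field is never empty.
    """
    keywords = fm_ocw_keywords_mapping.get(course.get("url"))
    if keywords:
        return list(keywords)
    ocw_topics = course.get("ocw_topics") or []
    if ocw_topics:
        return list(ocw_topics)
    return [DEFAULT_VALUE]


# Hypothetical records and mapping, for illustration only.
mapped = {"url": "https://example.org/course-a", "ocw_topics": ["Economics"]}
unmapped = {"url": "https://example.org/course-b", "ocw_topics": ["Energy"]}
empty = {"url": "https://example.org/course-c", "ocw_topics": []}
keyword_map = {"https://example.org/course-a": ["markets", "trade"]}

results = [cr_keywords(c, keyword_map) for c in (mapped, unmapped, empty)]
```

The same "use the default when the list is empty" step could be applied to CR_SUBJECT after the mapping lookup.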

Please let me know how you'd like to proceed.
