Google doc dates returned as unicode (e.g., \ue907) #2547

Open
nick-youngblut opened this issue Jan 13, 2025 · 2 comments

Example code:

import os
from google.oauth2 import service_account
from googleapiclient.discovery import build
from google.auth import default

def get_document_dates(doc_id, creds_file=None):
    scopes = ['https://www.googleapis.com/auth/documents.readonly']
    if creds_file and os.path.exists(creds_file):
        # Service-account key files are loaded via google.oauth2.service_account
        creds = service_account.Credentials.from_service_account_file(creds_file, scopes=scopes)
    else:
        # Fall back to Application Default Credentials
        creds, project = default(scopes=scopes)
    
    # Build the Docs API service
    service = build('docs', 'v1', credentials=creds)
    
    # Get the document
    document = service.documents().get(
        documentId=doc_id,
        fields='body'  
    ).execute()
    
    # Access the document's content
    content = document.get('body').get('content')
    
    # Process each element
    for element in content:
        if 'paragraph' in element:
            paragraph = element.get('paragraph')
            elements = paragraph.get('elements', [])
            
            for elem in elements:
                print(elem)

The first section of the doc:

[Screenshot of the doc's first section]

I want to parse the date (Jan 13, 2025) via the Python API.

The first few elements printed:

{'startIndex': 1, 'endIndex': 5, 'textRun': {'content': '\ue907 | ', 'textStyle': {}}}
{'startIndex': 5, 'endIndex': 6, 'richLink': {'richLinkId': 'kix.p3Xj3hkh7bXl', 'textStyle': {}, 'richLinkProperties': {'title': 'Asana Board New NGS Submissions', 'uri': 'https://www.google.com/calendar/event?eid=XXX'}}}
{'startIndex': 6, 'endIndex': 7, 'textRun': {'content': '\n', 'textStyle': {}}}
{'startIndex': 7, 'endIndex': 18, 'textRun': {'content': 'Attendees: ', 'textStyle': {}}}

The date is returned in the first element as \ue907. How can that be converted to a date?

Note: there is a richLinkId in the second element, but that is for a separate calendar element, and not the Jan 13, 2025 date element.
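
To make the structure of that output easier to see, here is a minimal sketch that labels each printed element by kind. The `elements` list is copied from the output above (with the redacted `uri` left as-is), and `summarize_element` is a hypothetical helper, not part of the Docs API client.

```python
# Paragraph elements copied from the printed output above.
elements = [
    {'startIndex': 1, 'endIndex': 5,
     'textRun': {'content': '\ue907 | ', 'textStyle': {}}},
    {'startIndex': 5, 'endIndex': 6,
     'richLink': {'richLinkId': 'kix.p3Xj3hkh7bXl', 'textStyle': {},
                  'richLinkProperties': {
                      'title': 'Asana Board New NGS Submissions',
                      'uri': 'https://www.google.com/calendar/event?eid=XXX'}}},
    {'startIndex': 6, 'endIndex': 7,
     'textRun': {'content': '\n', 'textStyle': {}}},
    {'startIndex': 7, 'endIndex': 18,
     'textRun': {'content': 'Attendees: ', 'textStyle': {}}},
]


def summarize_element(elem):
    """Hypothetical helper: return (kind, detail) for one paragraph element."""
    if 'textRun' in elem:
        return ('textRun', elem['textRun'].get('content', ''))
    if 'richLink' in elem:
        props = elem['richLink'].get('richLinkProperties', {})
        return ('richLink', props.get('title', ''))
    # Anything else is reported by its first non-index key.
    other = [key for key in elem if key not in ('startIndex', 'endIndex')]
    return (other[0] if other else 'unknown', '')


summaries = [summarize_element(elem) for elem in elements]
for kind, detail in summaries:
    # ascii() keeps non-printable characters such as \ue907 visible.
    print(kind, ascii(detail))
```

Only the calendar chip shows up as a `richLink`; the date appears nowhere except as the `\ue907` character inside the first `textRun`.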

More generally, why are date elements returned as Unicode characters instead of something easier to work with?


eseidohl commented Feb 6, 2025

I believe (though I cannot find it documented anywhere) that Docs uses Unicode Private Use Area characters to represent special elements such as chips and code blocks.
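
If that is the case, such placeholders can at least be detected in a `textRun`, even though the date itself cannot be recovered from them. The following is a rough sketch that relies only on Python's standard `unicodedata` module, where Private Use characters have the general category `Co`. The function names are hypothetical, and the assumption that `\ue907` stands in for a date chip comes from the issue above, not from any documentation.

```python
import unicodedata


def find_private_use_chars(text):
    """Return (index, code point) pairs for Private Use characters in text."""
    return [
        (index, f'U+{ord(char):04X}')
        for index, char in enumerate(text)
        if unicodedata.category(char) == 'Co'  # 'Co' = Other, private use
    ]


def strip_private_use_chars(text):
    """Return text with Private Use characters removed."""
    return ''.join(
        char for char in text if unicodedata.category(char) != 'Co'
    )


# Content of the first textRun reported in the issue.
content = '\ue907 | '
placeholders = find_private_use_chars(content)
cleaned = strip_private_use_chars(content)
print(placeholders)
print(repr(cleaned))
```

This only flags where a placeholder sits in the text. The code point carries no date value of its own, so the actual date would have to come from some other source.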


eseidohl commented Feb 6, 2025

While this issue is about Docs, it looks like, as of this writing, the feature is not available in Sheets: https://stackoverflow.com/questions/79331123/how-to-extract-both-name-and-link-from-google-sheets-smart-chip-place-using-ap
