[Bug] Unable to perform recursive chunking with PDFKnowledgeBase #2012

Open
mahendra867 opened this issue Feb 5, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@mahendra867

mahendra867 commented Feb 5, 2025

Description

Briefly describe the issue you’re experiencing or the bug you’ve found.

I tried to perform recursive chunking with the following code:

pdf_knowledge_base = PDFKnowledgeBase(
    # Raw string so backslashes in the Windows path are not read as escapes (e.g. "\a")
    path=r"D:\Projects\agentic_new_rag\pdfs",
    vector_db=PgVector(
        table_name="updated_rag009",
        schema='ai',
        db_url=db_url,
        search_type=SearchType.hybrid,
        vector_index=HNSW(),
        embedder=OpenAIEmbedder(
            api_key=OPENAI_API_KEY,
            id="text-embedding-ada-002",
            dimensions=1536,
            encoding_format="float",
        ),
    ),
    reader=PDFReader(chunk=False),  # Use a default reader
    chunking_strategy=RecursiveChunking(chunk_size=4000, overlap=800),
    documents=3,
)

Steps to Reproduce

List the steps needed to encounter this bug or issue.

Agent Configuration (if applicable)

Provide relevant agent configuration.

I am using an agentic RAG configuration.

Expected Behavior

What did you expect to happen?

Based on the code above, recursive chunking should run, so that each chunk begins with the last 800 characters of the previous chunk. That overlap was not added to the content of any document.
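The expected overlap can be illustrated with a minimal sketch in plain Python. This is not Agno's RecursiveChunking; `chunk_with_overlap` is a hypothetical helper that uses a simple sliding window, and the sample text is synthetic. It only shows what `chunk_size=4000` and `overlap=800` are expected to mean for consecutive chunks.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Each window after the first starts `overlap` characters before the
    end of the previous window, so those characters appear in both.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


# Synthetic 10,000-character sample (hypothetical data, not the PDF above).
sample = "".join(chr(ord("a") + i % 26) for i in range(10_000))
chunks = chunk_with_overlap(sample, chunk_size=4000, overlap=800)

# Every chunk after the first repeats the last 800 characters of its predecessor.
for previous, current in zip(chunks, chunks[1:]):
    assert current[:800] == previous[-800:]
```

In the result reported below, page 12 does not begin with the end of page 11, which is the symptom this check would catch.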

Actual Behavior

What actually happened instead?

Recursive chunking did not happen.

This is the result I got.

Content of PDF page 11:

DoDM 5200.45, January 17, 2025

SECTION 3: OCA 11
(1) Combatant Commands will send all OCA delegation requests to the Chief, Joint Staff
Security Office for endorsement and Joint Staff submission to the DDI(CL&S).
(2) The designated SAOs in the MILDEPs manage the delegation of OCAs within their
respective MILDEPs and provide an updated list of OCA positions to DDI(CL&S) annually.
i. OCA delegation requests will:
(1) Identify the official by position title and classification level requested.
(2) Include a description of why the official requires OCA and why the classification
level requested is necessary and appropriate. If the position’s next-level supervisor has OCA,
requests will include a justification why it is either inappropriate or impractical for that official to
exercise OCA.
(3) Be submitted and endorsed by an official at least one supervisory level above the
official to whom the request seeks delegation of OCA. See Figure 1 for an example of an OCA
delegation request memorandum.
Figure 1. Example of Request for OCA

j.
All DoD Component requests for these changes to OCAs, excluding those from the
MILDEPs, require submission of a memorandum from the Component SAO to the DDI(CL&S)
detailing the reason for the change, which includes:
(1) Move of the position delegated OCA due to reorganization or realignment.
(2) Position title change.

Content of PDF page 12:

DoDM 5200.45, January 17, 2025

SECTION 3: OCA 12
(3) Removal or downgrade of OCA delegation.
k. All DoD Components, including the MILDEPs, must verify the officials by position
delegated OCA to the DDI(CL&S) annually. See Figure 2 for an example of an OCA
verification submission. This verification will include:
(1) A list of OCAs by position title and the classification level delegated.
(2) The date each OCA received the initial or annual refresher training.
Figure 2. OCA Verification Example

  1. TRAINING REQUIREMENTS.
    a. Before exercising OCA, and annually thereafter, officials delegated OCA will certify in
    writing that they have received the required training. This acknowledgement can be
    accomplished through any method determined by the DoD Component (e.g., signing a certificate
    or form). Individuals who hold OCA for multiple positions must complete the certification for
    each position delegated OCA.
    b. Personnel assigned responsibility for creating, reviewing, and managing SCGs and
    submitting information to OCAs for original classification decisions require additional
    knowledge of the original classification decision process and will take the OCA training annually
    in addition to all other security training requirements. Completion of this training will be tracked
    by the activity security manager and will not give the individual the authority to make original
    classification decisions.
    c. The activity security manager, or other designated personnel, will ensure:
    (1) OCA training is conducted as required. Training may be completed through
    individualized Component-developed training or through the OCA training course available on
    the Defense Counterintelligence and Security Agency Center for Development of Security
    Excellence Website at https://www.cdse.edu/Training/eLearning/IF102/.
    (2) Copies of the training records are maintained in accordance with DoDI 5015.02.
    (3) Training records are available when requested by appropriate authorities.
    d. At a minimum, the OCA training will address:

Screenshots or Logs (if applicable)

Include any relevant screenshots or error logs that demonstrate the issue.

Environment

  • OS: (e.g. macOS, Windows 11)
  • Browser (if relevant): (e.g. Chrome 108, Firefox 107)
  • Agno Version: (e.g. v1.0.0)
  • External Dependency Versions: (e.g., yfinance 0.2.52)
  • Additional Environment Details: (e.g., Python 3.10)

Possible Solutions (optional)

Suggest any ideas you might have to fix or address the issue.

Additional Context

Add any other context or details about the problem here.
Please resolve this issue as quickly as possible.

@mahendra867 mahendra867 added the bug Something isn't working label Feb 5, 2025
@manthanguptaa
Contributor

Hey @mahendra867! I found a small detail that might be causing the issue. You set chunk=False in your PDFReader, which makes it skip the chunking process. If you set it back to True or remove the argument altogether, it should work as expected.
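The effect of that flag can be sketched with a toy reader. This is an illustration only, not Agno's PDFReader: `ToyReader`, `split_fixed`, and the sample pages are hypothetical names and data. It assumes that a reader with chunking disabled returns one document per page and never calls the chunking strategy, which matches the page-sized documents shown in the report.

```python
from typing import Callable, List


def split_fixed(text: str, size: int = 10) -> List[str]:
    """Hypothetical stand-in strategy: cut text into pieces of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


class ToyReader:
    """Toy reader whose `chunk` flag decides whether the strategy runs at all."""

    def __init__(self, chunk: bool = True,
                 strategy: Callable[[str], List[str]] = split_fixed) -> None:
        self.chunk = chunk
        self.strategy = strategy

    def read(self, pages: List[str]) -> List[str]:
        documents: List[str] = []
        for page in pages:
            if self.chunk:
                # Chunking enabled: the strategy splits each page.
                documents.extend(self.strategy(page))
            else:
                # Chunking disabled: the page is stored whole, strategy unused.
                documents.append(page)
        return documents


pages = ["a" * 25, "b" * 5]
whole_pages = ToyReader(chunk=False).read(pages)  # one document per page
chunked = ToyReader(chunk=True).read(pages)       # pages split by the strategy
```

With `chunk=False` the configured strategy has no effect, however it is parameterized.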

@mahendra867
Author

mahendra867 commented Feb 7, 2025

Hi @manthanguptaa, thanks for the response, but I still cannot create the chunked documents.

With chunk=True and RecursiveChunking(chunk_size=4000, overlap=800), reading a small 500 KB PDF runs for more than 20 minutes. It consumes so many resources that my PC eventually shuts down, and no documents, with or without overlapping content, are created in the PgVector database.

With the fixed chunking strategy, which is the PDFReader default, the documents are created quickly.

Could you please try this on your side? Take a small text-based PDF, set chunk=True in PDFReader, use RecursiveChunking(chunk_size=4000, overlap=800) with exactly these two parameters, run the code, and tell me the results.

These are the PDFKnowledgeBase parameters I used:

pdf_knowledge_base = PDFKnowledgeBase(
    # Raw string so backslashes in the Windows path are not read as escapes (e.g. "\a")
    path=r"D:\Projects\agentic_new_rag\pdfs",
    vector_db=PgVector(
        table_name="updated_rag6201",
        schema='aii',
        db_url=db_url,
        search_type=SearchType.hybrid,
        vector_index=HNSW(),
        embedder=OpenAIEmbedder(
            api_key=OPENAI_API_KEY,
            id="text-embedding-ada-002",
            dimensions=1536,
            encoding_format="float",
        ),
    ),
    reader=PDFReader(chunk=True),
    chunking_strategy=RecursiveChunking(chunk_size=4000, overlap=800),
    documents=3,
)
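A run that never finishes and exhausts memory is the kind of symptom an overlap-based splitter shows when its start index stops advancing. Whether that is what happens inside Agno's RecursiveChunking is not confirmed in this thread. The sketch below is a hypothetical, plain-Python splitter (`separator_aware_chunks` is not a library function) that cuts at the last separator in each window and guards against a non-advancing start, which is one way to rule that failure mode out when testing on a small text.

```python
from typing import List, Sequence


def separator_aware_chunks(text: str, chunk_size: int, overlap: int,
                           separators: Sequence[str] = ("\n", ". ")) -> List[str]:
    """Cut at the last separator inside each window and always move forward."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[str] = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            window = text[start:end]
            for sep in separators:
                cut = window.rfind(sep)
                if cut > 0:
                    end = start + cut + len(sep)  # keep the separator in this chunk
                    break
        chunks.append(text[start:end])
        if end >= n:
            break
        next_start = end - overlap
        # Guard: a short chunk (end - start <= overlap) would move the start
        # backwards or leave it in place, so the loop would never finish.
        # In that case the overlap is dropped for this one step.
        if next_start <= start:
            next_start = end
        start = next_start
    return chunks


# Synthetic text: a short first line, then a long run with no separators.
text = "x" * 100 + "\n" + "y" * 5000
chunks = separator_aware_chunks(text, chunk_size=4000, overlap=800)
```

Without the guard, the first chunk here ends after 101 characters, and stepping back 800 characters would put the next start before the beginning of the text.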


2 participants