Description
Is your feature request related to a problem?
I ran into an issue when reading my PDB file.
I downloaded the PDB file from RCSB using the biopandas package.
# download the files
from biopandas.pdb import PandasPdb
ppdb = PandasPdb().fetch_pdb("5N69")
ppdb.to_pdb(path='./dataset/examples/5N69.pdb', records=['ATOM', 'HETATM'])
When I read it in MDAnalysis, I found unexpected segments and segids. This makes it difficult to select atoms by chain (segid) and to operate on the Universe at the SegmentGroup level.
I first checked my MDAnalysis.Universe. It seems that MDAnalysis detected segids in the file and therefore used them to build the segments. However, the PDB file I downloaded should not contain any segid information, so I expected MDAnalysis to use the chain ID as the segid in the Universe (as described in the documentation).
Next, I opened my PDB file and found that the issue stems from values in the tempFactor column that have too many digits (see lines 12-14 in the picture below).

However, the current MDAnalysis version offers no direct way to correct segids affected by this format issue. The issue may occur often when other software processes the PDB file.
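The column problem described above can be illustrated with a minimal sketch in plain Python. It assumes the usual fixed-width PDB layout, with tempFactor in columns 61-66 and the segment identifier in columns 73-76 (an older convention that many tools still read). The helpers `format_atom_line` and `read_fields` and the coordinate values are hypothetical, and this is not how MDAnalysis or biopandas is implemented. A real parser may read a different column range.

```python
def format_atom_line(serial, name, res_name, chain_id, res_seq,
                     x, y, z, occupancy, temp_factor_text, element):
    """Build an ATOM record; temp_factor_text is written verbatim (hypothetical helper)."""
    # Columns 1-30: record name, serial, atom name, residue name, chain ID, residue number.
    head = f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain_id}{res_seq:>4}    "
    # Columns 31-60: x, y, z and occupancy; the tempFactor text starts at column 61.
    body = head + f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}" + temp_factor_text
    # Pad to column 76, then put the element symbol in columns 77-78.
    return body.ljust(76) + element.rjust(2)


def read_fields(line):
    """Slice the fixed-width columns relevant to this issue (assumed layout)."""
    return {
        "chainID": line[21],                 # column 22
        "tempFactor": line[60:66].strip(),   # columns 61-66
        "segid": line[72:76].strip(),        # columns 73-76
    }


# A well-formed record: tempFactor fits in its 6-character field.
good = format_atom_line(1, " N  ", "MET", "A", 1,
                        11.104, 6.134, -6.504, 1.00, f"{27.12:6.2f}", "N")
# A malformed record: tempFactor written with too many digits.
bad = format_atom_line(1, " N  ", "MET", "A", 1,
                       11.104, 6.134, -6.504, 1.00, "27.123456789012", "N")

print(read_fields(good))  # segid is empty
print(read_fields(bad))   # segid is '012', taken from the overflowing digits
```

In both records the chain ID column is intact, which is why falling back to the chain ID is a reasonable way to recover the segments.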
I suggest adding this feature to the main codebase so that users can decide which information is used to define segments when reading PDB files.
Describe the solution you'd like
The solution is currently available in the forked MDAnalysis repo: see changelog here
Since the current PDBParser already reads chain IDs correctly, the idea is simply to add a keyword argument called force_chainids_to_segids that forces the PDBParser to use the chain ID as the segid. The user can decide whether to use it. If force_chainids_to_segids=True, the segments in the Universe are based on chain IDs.
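The intended behaviour can be sketched in a few lines of plain Python. The function `resolve_segids` is a hypothetical stand-in, not the code in the fork or in MDAnalysis. It assumes the default behaviour described above, where segids found in the file are used and chain IDs are the fallback when no segid is present.

```python
def resolve_segids(chain_ids, segids, force_chainids_to_segids=False):
    """Pick the per-atom segment labels (hypothetical helper, not MDAnalysis code)."""
    if force_chainids_to_segids:
        # Proposed option: ignore whatever sits in the segid columns.
        return list(chain_ids)
    if any(s.strip() for s in segids):
        # Assumed default: segids found in the file take precedence.
        return list(segids)
    # Assumed fallback: no segid information, so use the chain IDs.
    return list(chain_ids)


chain_ids = ["A", "A", "B", "B"]
segids = ["", "012", "", "9"]  # spurious values from overflowing tempFactor fields

print(resolve_segids(chain_ids, segids))
print(resolve_segids(chain_ids, segids, force_chainids_to_segids=True))
```

With the flag set, the spurious segids are discarded and the segments follow the chains A and B.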
# read the universe in the future
u = mda.Universe(pdb_path, force_chainids_to_segids=True)
Describe alternatives you've considered
In the current version of MDAnalysis, the only way to select by chain is the chainID selection keyword, although this does not seem to let us operate on the SegmentGroup properly:
# for instance, to select chain A
u.select_atoms('chainID A')
# to operate on a segment (chain), you may need to select the atoms first rather than use the SegmentGroup directly
u_chainA = u.select_atoms('chainID A')
u_chainA