Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port shapeit5 to MacOS on Apple Silicon using SIMDE #68

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

pettyalex
Copy link

@pettyalex pettyalex commented Oct 28, 2023

There's demand for users to be able to run shapeit on Mac OS, with users who want that creating Github issues or asking on other discussion forums:

odelaneau/shapeit4#15
odelaneau/shapeit4#36
#22

Shapeit5 cannot run on modern (2020+) Macs even in a VM, because it uses AVX2. Apple's x86 support supports SSE1/2/3/4 but not AVX or AVX2: https://developer.apple.com/documentation/apple-silicon/about-the-rosetta-translation-environment.

This PR contains a minimal set of changes that will allow a native ARM build on modern Macs. It does this using https://github.com/simd-everywhere/simde, a header-only library that provides high-performance vector intrinsic translation between x86 and ARM. This would also allow Linux on ARM support to be added very easily. Because of AWS's very low pricing for ARM compute instances, ARM is already the cheapest way to run large computational loads on the cloud so Linux on ARM support would be helpful for anyone operating in AWS.

This PR also has some very small changes to support building on Apple clang. std::random_shuffle was removed in C++17, and will be removed in a future version of GCC. It's already gone in clang 15, which I used to develop and test this PR. As recommended, I've replaced it with std::shuffle: https://en.cppreference.com/w/cpp/algorithm/random_shuffle

@pettyalex pettyalex force-pushed the mac-build-with-simde branch from e2034c2 to fad421b Compare March 26, 2024 18:43
@srubinacci
Copy link
Collaborator

srubinacci commented Mar 26, 2024

Hi Alex,
Thanks for your this, it's actually really great and we are definitively interested in merging it. I've tried a previous version myself and it works very well.
Just waiting @odelaneau for merging with new updates. Also we need to be sure it works with the new make system @rwk-unil introduced recently.

If we approve this, we should add mac compilation in the github workflow.

@srubinacci srubinacci assigned odelaneau and unassigned odelaneau Mar 26, 2024
@srubinacci srubinacci requested a review from odelaneau March 26, 2024 20:35
@pettyalex
Copy link
Author

I'd be glad to add a mac build workflow to github actions, and clean up the make target a bit. I was mostly just validating that this will work until I heard you were interested in it.

Right now, as is, it suffixes the binaries with _mac which is pretty silly.

@pettyalex
Copy link
Author

pettyalex commented Mar 27, 2024

OK, I went ahead and simplified the makefile for Mac build, and also tested it on aarch64-linux-gnu. I noticed a few things:

  1. It doesn't look like libboost_serialization is actually used anywhere? I don't see it linked in any dynamic libs, and there's no includes into it.
  2. Any objections to targeting x86-64-v3 instead of -mavx2 -mfma? The baseline requirement is the same, but the compiler is able to generate many more instructions where they may be useful: https://en.wikipedia.org/wiki/X86-64
  3. Is there a reason why when linking it dynamically we need to explicitly link to all the dependencies of our dependencies, like -lbz2 -llzma -lcurl -lcrypto -ldeflate? Can't we only link -lboost_iostreams -lboost_program_options -lhts? If so, then the conditional for mac build support can be entirely eliminated.

Oh and for x86 this updates the -mtune to Skylake server, which should be a good baseline these days for older machines.

@pettyalex pettyalex force-pushed the mac-build-with-simde branch from bf39376 to 59b0ddd Compare March 27, 2024 05:10
@pettyalex pettyalex force-pushed the mac-build-with-simde branch from 59b0ddd to 2f78fb5 Compare March 27, 2024 05:10
@rwk-unil
Copy link
Contributor

@pettyalex thank you for your pull request and your work,
@srubinacci @odelaneau I have reviewed the changes, here are my comments :

  1. I like integrating SIMDE which will help port to other architectures
  2. Thank you for replacing random_shuffle as it is removed in C++17 and this will break builds depending on compiler. I fully agree with this change, we have been discussing this with @srubinacci already. (also the big problem with random_shuffle is that the source of randomness is implementation defined, often with std::rand used (see ref), making reproducibility an issue, even with same seed)
  3. As for the Makefile :

OK, I went ahead and simplified the makefile for Mac build, and also tested it on aarch64-linux-gnu. I noticed a few things:

  1. It doesn't look like libboost_serialization is actually used anywhere? I don't see it linked in any dynamic libs, and there's no includes into it.

It may be an artefact left over from experimentation, these were introduced when the Makefiles were updated (see git blame e.g., makefile) and may not be necessary anymore.

  1. Any objections to targeting x86-64-v3 instead of -mavx2 -mfma? The baseline requirement is the same, but the compiler is able to generate many more instructions where they may be useful: https://en.wikipedia.org/wiki/X86-64

I agree, AVX2/FMA requires Haswell (2013) or newer CPU anyways, so targeting x86-64-v3 would enable extra CPU features that are available anyways while simplifying the build command at the same time. I totally support this suggestion.

  1. Is there a reason why when linking it dynamically we need to explicitly link to all the dependencies of our dependencies, like -lbz2 -llzma -lcurl -lcrypto -ldeflate? Can't we only link -lboost_iostreams -lboost_program_options -lhts? If so, then the conditional for mac build support can be entirely eliminated.

When I simplified the Makefile, I did the minimal amount of changes possible that would get SHAPEIT5 to build in an Ubuntu 20.04 GitHub action, IIRQ without the flags it would fail the linking the executable. But I agree the whole Makefile could benefit from further cleanup. If your fork allows to build without the flags on the following action https://github.com/odelaneau/shapeit5/blob/main/.github/workflows/build.yml (and you make the necessary changes so that the static build also still works) we can think of removing the flags (which would be nice because as you said we can drop the conditional entirely).

Oh and for x86 this updates the -mtune to Skylake server, which should be a good baseline these days for older machines.

Not sure we would want to include this, it is very specific, people wanting to really optimise their build can change this on their own, but I wouldn't add it as default.

Cheers.
Rick

…used requirement for boost serialization. Remove unecessary extra dynamic linking flags for deps of our deps when linking dynamically.
… This will fix building in environments which override CXXFLAG
@pettyalex pettyalex force-pushed the mac-build-with-simde branch from 5de5c3c to 50c277d Compare March 28, 2024 20:11
@pettyalex
Copy link
Author

I have:

  1. Reverted the -mtune change entirely.
  2. Tested on both an M1 Mac and ubuntu 20.04 environment
  3. Added a Github Actions workflow to test on Mac OS, but I haven't tested it. Could we enable workflow runs for this PR? The github action to configure boost doesn't support Mac OS, and I feel like compiling boost on every action run feels a little wasteful... so I used boost & htslib from brew. If you don't like this, I can just compile both of those.

I've also cleaned up the makefile a smidge to make it easier to build inside of places like conda that expect to be able to set CXXFLAGS to change the include path.

@pettyalex
Copy link
Author

I believe this xcftools PR that fixes up its build on Apple Silicon would have to go through before that action will pass, though: odelaneau/xcftools#7

Also, while I was testing packaging this in conda, I noticed that your build rules do not include LDFLAGS or CXXFLAGS. I changed CXXFLAG to CXXFLAGS in the makefile, so now it will pick up the standard environment variables and build more easily in more places. I can revert this if you'd like.

Finally, I enabled lto, link-time optimization. If you don't want this, I can get that out of this PR.

@rwk-unil
Copy link
Contributor

rwk-unil commented Apr 2, 2024

I have:

  1. Reverted the -mtune change entirely.
  2. Tested on both an M1 Mac and ubuntu 20.04 environment
  3. Added a Github Actions workflow to test on Mac OS, but I haven't tested it. Could we enable workflow runs for this PR? The github action to configure boost doesn't support Mac OS, and I feel like compiling boost on every action run feels a little wasteful... so I used boost & htslib from brew. If you don't like this, I can just compile both of those.

I've also cleaned up the makefile a smidge to make it easier to build inside of places like conda that expect to be able to set CXXFLAGS to change the include path.

This seems great, I think the most important is that the Linux (Ubuntu) build succeeds with the new Makefiles, for the Mac OS build, I don't think it is necessary to have an action that builds every time, but anyways it is good to have, but it should not run on "push" and "PR" on the main branch because it will consume ressources. I think this should only be ran on major updates (new versions) manually.

I believe this xcftools PR that fixes up its build on Apple Silicon would have to go through before that action will pass, though: odelaneau/xcftools#7

Also, while I was testing packaging this in conda, I noticed that your build rules do not include LDFLAGS or CXXFLAGS. I changed CXXFLAG to CXXFLAGS in the makefile, so now it will pick up the standard environment variables and build more easily in more places. I can revert this if you'd like.

Finally, I enabled lto, link-time optimization. If you don't want this, I can get that out of this PR.

This has been bothering me too for quite some time, I totally support this change 👍 .

One other thing, we should ditch -lpthread in the libraries and use -pthread (see stackoverflow)

Also, for linking we should use the standard variables LDLIBS or LOADLIBES.

If these standard variables are used, we could remove the rules for building the .o object files and the binaries by relying on the Make built-in rules, this would further simplify the Makefile (but in no way a necessity, and might make the Makefile less clear).

  • The default rule for .o from .cc/.cpp/.C is : $(CXX) $(CPPFLAGS) $(CXXFLAGS) -c
  • The default rule for linking is : $(CC) $(LDFLAGS) $^ $(LOADLIBES) $(LDLIBS)

Reference : Make catalog of rules

Cheers.

@pettyalex
Copy link
Author

The build is working on Linux in its current state, I've been testing it. The real blocker at this point to getting this merge ready is the xcftools PR: odelaneau/xcftools#7

Once that's merged I can update the xcftools submodule to point at it.

I updated to -pthread instead.

for the Mac OS build, I don't think it is necessary to have an action that builds every time, but anyways it is good to have, but it should not run on "push" and "PR" on the main branch because it will consume ressources. I think this should only be ran on major updates (new versions) manually.

Given that they're free, github provided runners, is spending the extra resources to run CI on Mac a problem? If it takes longer to find/run on Mac it could be disabled in the future, but my observation has been that Github has plenty of mac runners.

@rwk-unil
Copy link
Contributor

rwk-unil commented Apr 3, 2024

for the Mac OS build, I don't think it is necessary to have an action that builds every time, but anyways it is good to have, but it should not run on "push" and "PR" on the main branch because it will consume ressources. I think this should only be ran on major updates (new versions) manually.

Given that they're free, github provided runners, is spending the extra resources to run CI on Mac a problem? If it takes longer to find/run on Mac it could be disabled in the future, but my observation has been that Github has plenty of mac runners.

They are free up to a point, free tier GitHub provides up to 2,000 minutes a month (all hosted projects combined) and Mac OS X runners have a 10x multiplier on running minutes. That's why I suggest running it only on releases. (Also it is more environmentally friendly). At the end the decision for this should go to @odelaneau as the project is hosted on his account.

Reference : https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions

@pettyalex
Copy link
Author

@rwk-unil I've validated that for open-source projects like SHAPEIT5, the 2000 minute limit does not apply. that's only for closed-source projects.

I'd love to get this in, I was working w/ SHAPEIT5 locally on my laptop today and noticed that this was still not in main.

@rwk-unil
Copy link
Contributor

Great ! @pettyalex Where did you find that the limit does not apply to open-source projects ? I could not find this in the GitHub doc. Thanks.

I'd recommend merging @srubinacci @odelaneau.
Just double check the commit reference on the XCFTools submodule.

Cheers.
Rick

# Disable this if building on x86 CPUs without AVX2 support.
UNAME_M := $(shell uname -m)
ifeq ($(UNAME_M),x86_64)
CXXFLAGS+= -march=x86-64-v3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I got this error when trying to compile this in an linux/amd64 instance in Drone:

#12 0.943 g++ -std=c++17 -O3 -march=x86-64-v3 -D__COMMIT_ID__=\"faa3224\" -D__COMMIT_DATE__=\"2024-07-18\" -c src/containers/bitmatrix.cpp -o obj/bitmatrix.o -Isrc -I../simde -I/opt/htslib/include -I/opt/boost/include
#12 0.950 cc1plus: error: bad value ('x86-64-v3') for '-march=' switch
#12 0.950 cc1plus: note: valid arguments to '-march=' switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 btver1 btver2 native; did you mean 'x86-64'?

Switching the flags back to -mavx2 -mfma worked

#COMPILATION RULES
all: desktop

$(BFILE): $(OFILE)
$(CXX) $(LDFLAG) $^ -o $@ $(DYN_LIBS)
$(CXX) $(LDFLAGS) $^ -o $@ $(DYN_LIBS)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For using non-standard htslib and boost paths, I needed to modify this to successfully compile and run the binaries, where HTSSRC_LIB=/opt/htslib/lib and BOOST_LIB=/opt/boost/lib:

$(BFILE): $(OFILE)
	$(CXX) $(LDFLAGS) $^ -o $@ $(DYN_LIBS) -Wl,-rpath=$(BOOST_LIB) -Wl,-rpath=$(HTSSRC_LIB)

I also needed to add LDFLAGS+= -L$(HTSSRC_LIB) -L$(BOOST_LIB)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants