sped up `Tokenizer::dump()` #5009

firewave · 2023-04-24T14:35:19Z

When scanning the `cli` folder with `DISABLE_VALUEFLOW=1`, `Tokenizer::dump()` consumes almost 25% of the total Ir count when an addon is specified. This is mainly caused by the use of `std::ostream`.

Encountered while profiling #4958.

firewave · 2023-05-04T09:17:40Z

It looks like the pointer address is just used as a unique ID for the objects in the dump. @danmar?

If that is the case (and as the pointer string representation is different on Windows and Linux) we could use something more light-weight, simple and portable instead. Or simply use just one representation.
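As a rough illustration of what a lighter and portable identifier could look like, the sketch below hands out sequential numbers instead of printing addresses. The class `IdRegistrySketch` and its member names are hypothetical and not part of cppcheck; this is only one possible approach under the assumption that the ID merely has to be unique within a dump.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper (not cppcheck code): maps each object to a small
// sequential ID so the output does not depend on how a platform prints pointers.
class IdRegistrySketch {
public:
    // Returns the same ID for the same object and a fresh one for a new object.
    std::string idFor(const void* obj) {
        if (obj == nullptr)
            return "0"; // reserved for "no object"
        const auto it = mIds.find(obj);
        if (it != mIds.end())
            return std::to_string(it->second);
        const std::size_t id = mIds.size() + 1;
        mIds.emplace(obj, id);
        return std::to_string(id);
    }

private:
    std::unordered_map<const void*, std::size_t> mIds;
};

// Usage example; the asserts document the expected behaviour.
inline void idRegistrySketchDemo() {
    IdRegistrySketch reg;
    int a = 0;
    int b = 0;
    assert(reg.idFor(&a) == "1");
    assert(reg.idFor(&b) == "2");
    assert(reg.idFor(&a) == "1"); // stable for the same object
}
```

The trade-off is an extra map lookup per object, so whether this is actually cheaper than formatting the address would have to be measured.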

firewave · 2023-08-23T14:26:34Z

Scanning with `DISABLE_VALUEFLOW=1` and `--addon=misra -I../lib -D__GNUC__` (the second pair of values is the percentage of the total Ir used by `Tokenizer::dump()` alone):

cli/filelister.cpp
Clang 15 125,212,971 -> 109,163,772 / 20.64% -> 8.87%

cli/cmdlineparser.cpp
Clang 15 1,105,405,465 -> 949,986,812 / 20.28% -> 8.40%

firewave · 2023-08-23T14:32:03Z

I know that the new code is horrible but that's what we have to pay for having overengineered, unusable garbage in the standard (also looking at you std::regex).

I am aware that we might also be able to use the {fmt} library, but we always try to avoid external dependencies. And we already have Boost as an optional one to work around other performance issues in the standard implementations...

danmar · 2023-08-28T08:40:21Z

> It looks like the pointer address is just used as a unique ID for the objects in the dump.

Yes it is a unique ID. In python it can be any arbitrary string, doesn't have to be numeric. But the premium addon expects that it's a hexadecimal value.

I am not against that you change it if you think there is faster approach.


firewave · 2023-08-28T08:48:35Z

> Yes it is a unique ID. In python it can be any arbitrary string, doesn't have to be numeric. But the premium addon expects that it's a hexadecimal value.

As you can see, the value differs between platforms and might not even look like a hexadecimal value (`0`, `0230FB33`; the latter, although it is an address, would actually be considered octal because of the leading `0`). Does premium actually work outside of GNU-like compilers?

So I would keep the current format for now and prepare another PR which generates consistent output across all platforms. You could then test that with premium as well. Also, if we run into issues, we have a separate commit to revert which only contains the changes to the identifier.

danmar · 2023-08-28T08:49:24Z

> I know that the new code is horrible but that's what we have to pay for having overengineered, unusable garbage in the standard (also looking at you std::regex).

Personally I don't think it's horrible. The ostream never made me feel excited anyway.

danmar · 2023-08-28T08:52:30Z

> Does premium actually work outside of GNU-like compilers?

Yes, it does. Whether or not there is a "0x" prefix does not matter: it always uses base 16 when converting the string, even if the string says 01234567.
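The premium addon's code is not shown in this thread, so the following is only a C++ illustration of the same idea: when the base is fixed to 16, a `0x` prefix and leading zeros make no difference to the parsed value. The function name `parseIdSketch` is made up for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical consumer-side helper: always interpret the ID as base 16.
// std::stoull with base 16 accepts an optional "0x"/"0X" prefix.
inline unsigned long long parseIdSketch(const std::string& id) {
    return std::stoull(id, nullptr, 16);
}

// Usage example; the asserts document the expected behaviour.
inline void parseIdSketchDemo() {
    assert(parseIdSketch("0x230fb33") == 0x230FB33ULL); // GNU-like style
    assert(parseIdSketch("0230FB33") == 0x230FB33ULL);  // zero-padded, upper case
    assert(parseIdSketch("01234567") == 0x1234567ULL);  // not treated as octal
    assert(parseIdSketch("0") == 0ULL);
}
```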

danmar · 2023-08-28T10:10:34Z

As far as I can see, there are many opportunities to "fold" newlines. If you do that, I don't have any negative opinion about this. It looks OK to merge then.


danmar · 2023-08-28T10:15:05Z

lib/utils.h

+            // a-f / A-F
+            c = 55 + temp;
+#if !defined(_WIN32) || defined(__MINGW32__)
+            c += 32; // add 32 for lowercase


there is no technical reason to use lower case as far as I know.

It is necessary to match the result of std::ostringstream. As mentioned before I will do the portable approach in another PR.

If we really want lowercase, I think that either `c = 'a' + temp;` or `c += 'a' - 'A';` would be more elegant.

I don't see why it's important to match `std::ostringstream` though. The only possible usage I can think of for this function is when generating the dump file or debug output.

People might depend on the format. Also I want that change separate so it can easily be reverted if necessary. I will also put it behind a define for a release or two.

> if we really want to lowercase, I think that either `c = 'a' + temp;` or `c += 'a' - 'A';` would be more elegant.

I think the - 10 is better as it highlights that we are dealing with a "base 16"/hex value as input.
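To make the digit arithmetic in this thread concrete, here is a minimal sketch of nibble-to-hex conversion. It is not the `ptr_to_string()` from `lib/utils.h`; the names `hexDigitSketch` and `hexStringSketch` are hypothetical, the output has no `0x` prefix or zero padding, and ASCII is assumed (so `'A' - 10` is 55 and `'a' - 'A'` is 32, the constants seen in the diff).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Converts one 4-bit value (0..15) to its hex digit. Assumes ASCII.
inline char hexDigitSketch(unsigned nibble, bool lowercase) {
    if (nibble < 10)
        return static_cast<char>('0' + nibble);
    // The "- 10" makes it visible that the input is a base-16 digit value.
    const char upper = static_cast<char>('A' + (nibble - 10));
    return lowercase ? static_cast<char>(upper + ('a' - 'A')) : upper;
}

// Formats an integer as hex without std::ostringstream.
inline std::string hexStringSketch(std::uintptr_t value, bool lowercase) {
    if (value == 0)
        return "0";
    std::string out;
    while (value != 0) {
        out += hexDigitSketch(static_cast<unsigned>(value & 0xF), lowercase);
        value >>= 4;
    }
    std::reverse(out.begin(), out.end()); // digits were produced lowest first
    return out;
}

// Usage example; the asserts document the expected behaviour.
inline void hexStringSketchDemo() {
    assert(hexDigitSketch(10, false) == 'A');
    assert(hexDigitSketch(15, true) == 'f');
    assert(hexStringSketch(0x230FB33, true) == "230fb33");
    assert(hexStringSketch(0x230FB33, false) == "230FB33");
}
```

A pointer could be passed in via `reinterpret_cast<std::uintptr_t>(ptr)`; matching the exact per-platform output of `std::ostringstream` would need additional prefix and padding handling that this sketch leaves out.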


danmar · 2023-08-28T10:24:04Z

lib/token.cpp

        line = tok->linenr();
        if (!xml) {
            ValueFlow::Value::ValueKind valueKind = values->front().valueKind;
            const bool same = std::all_of(values->begin(), values->end(), [&](const ValueFlow::Value& value) {
                return value.valueKind == valueKind;
            });
-            out << "  " << tok->str() << " ";
+            outs += "  ";


nit: we can use ' ' here

Probably - if it were a single character... but I guess that is just a typo.

It seems the code using a `char` is much slower, similar to the stream insertion case I mentioned in another comment. I will investigate, but that is out of scope for this PR.

Yes, it should have been a single character; that was a typo 👍
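The change under review replaces stream insertion with plain string appends. The sketch below shows both styles side by side, producing identical text; the function names are hypothetical and the snippet is not taken from `lib/token.cpp`. It makes no performance claim of its own; the measurements are the ones reported in this thread.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stream-based variant, in the style of the removed line.
inline std::string formatTokenStreamSketch(const std::string& tokStr) {
    std::ostringstream out;
    out << "  " << tokStr << " ";
    return out.str();
}

// String-based variant: the same text, built by appending to a std::string.
inline std::string formatTokenStringSketch(const std::string& tokStr) {
    std::string outs;
    outs += "  ";
    outs += tokStr;
    outs += " "; // string literal kept here; the thread reports a char seemed slower
    return outs;
}

// Usage example; the asserts document the expected behaviour.
inline void formatTokenSketchDemo() {
    assert(formatTokenStreamSketch("x") == "  x ");
    assert(formatTokenStringSketch("x") == "  x ");
}
```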

firewave · 2023-08-28T11:55:51Z

Regarding single character char vs. string literal in stream insertion I encountered something weird - a char is slower: llvm/llvm-project#65040


danmar · 2023-08-28T17:33:55Z

> Regarding single character char vs. string literal in stream insertion I encountered something weird - a char is slower:

ok

firewave force-pushed the tok-dump branch 6 times, most recently from c58378b to e4f46ac Compare April 28, 2023 17:51

firewave changed the title ~~optimized Tokenizer::dump()~~ sped up Tokenizer::dump() Apr 30, 2023

firewave force-pushed the tok-dump branch 4 times, most recently from fb4bbd7 to c3058d2 Compare August 23, 2023 14:13

firewave force-pushed the tok-dump branch 2 times, most recently from d95f72c to e184436 Compare August 23, 2023 14:28

firewave force-pushed the tok-dump branch 8 times, most recently from 66172f1 to 63da8d1 Compare August 25, 2023 21:14

firewave marked this pull request as ready for review August 25, 2023 21:41

danmar reviewed Aug 28, 2023


firewave added 2 commits August 28, 2023 12:20

avoid expensive std::ostringstream usage in ErrorLogger::toxml()

ed28863

CppCheck: avoid unnecessary flushing of dump files

da211c8

firewave force-pushed the tok-dump branch from 63da8d1 to d3b6b37 Compare August 28, 2023 10:20

danmar reviewed Aug 28, 2023


firewave force-pushed the tok-dump branch from d3b6b37 to 611090b Compare August 28, 2023 12:40

firewave added 7 commits August 28, 2023 15:41

utils.h: added ptr_to_string()

d30561f

avoid expensive std::ostringstream usage in ValueType::dump()

788e5ec

avoid expensive std::ostream usage in Tokenizer::dump()

a5bbaf8

changed `VariableMap::mVariableId` and `VariableMap::mVariableId_global` to `std::unordered_map`

19b36cf

avoid expensive std::ostream usage in Token::printValueFlow()

7cf2165

utils.h: added bool_to_string()

ab1145b

avoid expensive std::ostream usage in SymbolDatabase::printXml()

5b32407

firewave force-pushed the tok-dump branch from 611090b to 5b32407 Compare August 28, 2023 13:42

danmar merged commit 0fadf9e into danmar:main Aug 31, 2023

firewave deleted the tok-dump branch August 31, 2023 10:43
