C++ running slower than VB6



























This is my first post on Stack.



I usually develop in VB6 but have recently started doing more coding in C++, using the Dev-C++ IDE with the g++ compiler.



I'm having a problem with general program execution speeds.



This old VB6 code runs in 20 seconds.



DefLng A-Z

Private Sub Form_Load()

    Dim n(10000, 10) As Long
    Dim c(10000, 10) As Long

    For d = 1 To 1000000
        For dd = 1 To 10000
            n(dd, 1) = c(dd, 2) + c(dd, 3)
        Next
    Next

    MsgBox "Done"

End Sub


This C++ code takes 57 seconds...



#include <cstdlib>   // for system() and EXIT_SUCCESS

int main(int argc, char *argv[]) {

    // two ~400 KB arrays on the stack; note that c is never initialised
    long n[10000][10];
    long c[10000][10];

    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }

    system("PAUSE");
    return EXIT_SUCCESS;
}


Most of the coding I do is AI-related and very heavy on array usage. I've tried using int rather than long, and I've tried different machines; the C++ version always runs at least three times slower.



Am I being dumb? Can anyone explain what I'm doing wrong?



Cheers.










c++

asked Nov 13 '18 at 14:58 by Progger























  • Hum, interesting. Probably down to your C++ compiler optimisation settings. For example, are you using checked array bounds?

    – Bathsheba
    Nov 13 '18 at 15:04






  • After optimization your C++ program can be reduced to int main() { system("PAUSE"); } since all the rest has no observable effect. You are most likely benchmarking debug builds / builds without optimization.

    – Swordfish
    Nov 13 '18 at 15:05








  • Turn on optimizations (as you always should when you are anywhere near 'performance' domain with C++) and witness 0 execution time. It is also super unclear how you are actually measuring execution time, and my crystal ball tells me you are doing it incorrectly.

    – SergeyA
    Nov 13 '18 at 15:05








  • You are measuring nonsense; the code you posted takes zero seconds in C++ on any system: godbolt.org/z/gW3M1D. Read up on proper benchmarking, those results you present are meaningless.

    – Baum mit Augen
    Nov 13 '18 at 15:10






  • @Progger No, with proper optimizations the code does literally nothing. There won't even be any arrays or loops, as you can see from the generated assembler code when you follow Baum mit Augen's link.

    – Swordfish
    Nov 13 '18 at 15:25


















2 Answers






































Short answer



You need to look into your compiler optimization settings. This resource might help.



Takeaway: C++ allows you to use many tricks, some generic and some dependent on your architecture; when used properly, it will be superior to VB in terms of performance.



Long answer



Keep in mind this is highly dependent on your architecture, your compiler, and the compiler settings. You should configure your compiler to do more aggressive optimization.
You should also write optimized code that takes memory access into account, uses the CPU cache wisely, and so on, for example by iterating arrays in the order they are laid out in memory (see the sketch below).
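
As a small illustration of what "using the cache wisely" means (my sketch, not part of the benchmark below): iterating a 2D array row by row walks memory sequentially, while iterating column by column jumps around and is typically slower for large arrays.

#include <cstddef>

// Sketch: two ways to traverse the same 10000 x 10 array of longs.
// a[i][j] and a[i][j+1] sit next to each other in memory, so the first loop
// streams through the array; the second jumps 10 longs (80 bytes) per step.
int main()
{
    static long a[10000][10] = {};
    long s1 = 0, s2 = 0;

    for (std::size_t i = 0; i < 10000; i++)       // cache-friendly: row by row
        for (std::size_t j = 0; j < 10; j++)
            s1 += a[i][j];

    for (std::size_t j = 0; j < 10; j++)          // cache-unfriendly: column by column
        for (std::size_t i = 0; i < 10000; i++)
            s2 += a[i][j];

    return (int)(s1 + s2);                        // consume the sums so the loops stay
}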



I have done a test for you on an Ubuntu 16.04 virtual machine using one core of an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz. Using the code below, here are my times depending on the optimization level; the compiler I used is g++ 5.4.0.



I'm using optimization levels 0, 1, 2, 3 and s, and obtain 36 s (completely unoptimized), 23 s, and then... zero.




osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0 start..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1 start..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 0 micro seconds



Note that by using a more aggressive optimization level, the compiler will eliminate the code completely, because the values in n are never used by the program.
To force the compiler to generate the code anyway, use the volatile keyword when declaring n.
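
If you prefer not to change the declaration of n, another way to stop the dead-code elimination (a g++/Clang-specific sketch on my part, not the code timed below) is an empty asm statement used as an optimization barrier:

#include <cstdlib>

// Sketch for g++/Clang only: the empty asm tells the optimizer that the data
// pointed to by p may be read externally, so the stores feeding it (and the
// loops producing them) cannot be thrown away.
static void escape(void *p)
{
    asm volatile("" : : "g"(p) : "memory");
}

int main()
{
    static long n[10000][10] = {};   // static + {}: zero-initialised, off the stack
    static long c[10000][10] = {};

    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
        escape(n);   // keep each pass of the outer loop observable
    }
    return EXIT_SUCCESS;
}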



With the volatile added, you'll now have ~12 s with the most aggressive optimization (on my machine):




osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 12493927 micro seconds



The code I used for the test (based on your example):



#include <iostream>
#include <sys/time.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return now.tv_usec + (u64)now.tv_sec * 1000000;
}

int main()
{
    cout << "start" << endl;
    u64 t0 = timestamp();

    volatile long n[10000][10];
    long c[10000][10];

    for (long d = 1; d < 1000000; d++)
    {
        for (long dd = 1; dd < 10000; dd++)
        {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }

    u64 t1 = timestamp();

    cout << "..finished in " << (t1 - t0) << " micro seconds\n";

    return 0;
}
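
As an aside, on a C++11 compiler the gettimeofday() helper can be replaced by std::chrono, which is portable to Windows/Dev-C++ as well; a minimal sketch (not what produced the numbers above):

#include <chrono>
#include <iostream>

int main()
{
    auto t0 = std::chrono::steady_clock::now();

    // ... the code being measured goes here ...

    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << "..finished in " << us << " micro seconds\n";
    return 0;
}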


Multithreading



I have converted your code to use multithreading; using 2 threads I am able to reduce the time by half.



I am using the fact that, as it stands, the results are not used, so the inner loop does not depend on the outer one; in reality you should find another way to split the work so that the results do not overwrite one another.



#include <iostream>
#include <sys/time.h>
#include <omp.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return now.tv_usec + (u64)now.tv_sec * 1000000;
}

int main()
{
    omp_set_num_threads(2);
    #pragma omp parallel
    {
    }

    cout << "start" << endl;
    u64 t0 = timestamp();

    volatile long n[10000][10];
    long c[10000][10];

    for (long d = 1; d < 1000000; d++)
    {
        #pragma omp parallel for
        for (long dd = 1; dd < 10000; dd++)
        {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }

    u64 t1 = timestamp();
    cout << "..finished in " << (t1 - t0) << " micro seconds\n";
    return 0;
}



osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 6673741 micro seconds






answered Nov 13 '18 at 16:19 by Andrei Diea (edited Nov 14 '18 at 9:17)


























  • So it takes a 4.2 GHz i7 running Ubuntu to get 23 secs without volatile, and I'm using a 2 GHz dual-core XP 32-bit machine to get 12 seconds with VB6... that's Visual Basic 6... can't be right.

    – Progger
    Nov 13 '18 at 17:59








  • @Progger It takes 0 seconds without volatile in any reasonable benchmark setting, because your benchmark is flawed.

    – Baum mit Augen
    Nov 13 '18 at 21:15











  • Honestly, I'm not 100% sure the test is meaningful with volatile either, even when you fix the UB. (Downvoted for that btw, as that invalidates all results to begin with.)

    – Baum mit Augen
    Nov 13 '18 at 21:18











  • @BaummitAugen of course the code is not a correct benchmark, I have pointed that out above, since the compiler will optimize it out, but this is the "usecase" the OP has tested in VB. What I am trying to do is keep the code as close as possible to the original one and make a comparison to VB. Code is not correct for many reasons: arrays are not initialised, results are not used, all is allocated on stack, unaligned memory, etc etc, so I understand your point of view and respect it, but I tried to illustrate something else :)

    – Andrei Diea
    Nov 14 '18 at 8:54











  • @Progger C++ is a very powerful language and there are many things you can do to improve your code. Like everyone points out, this is not the best example code, but C++'s power lies in its ability to go low level and optimize for the specific architecture you are using. For example, I used multithreading on your code and execution drops to 6 seconds using 2 threads on 2 cores. C++ allows you to take advantage of multiple cores, specific vectorization instructions that some CPUs have (e.g. AVX for Intel), the L1 cache, etc., so once you learn more tricks you will be able to achieve the best performance!

    – Andrei Diea
    Nov 14 '18 at 9:00

































UPDATE: Recent C++ compilers give much better results compared to the VB6 compiler.



The VB6 code shown above does not reflect the real case, where the access index can vary and the calculations can be broken out into separate functions.
Further experiments show that VB6 pays a huge optimization penalty when the arrays it uses are passed in as function arguments (by reference).



I tried several C++ compilers and rewrote the benchmark code, adding some run-time-dependent behaviour to defeat the optimizer. Certainly
the code could be better; all suggestions are welcome. I used the -O3 option to maximize speed; multithreaded mode was not used.



RESULT:



g++ 8.1 gives the best overall result: 5 seconds with constant-index access (known to the compiler, the case shown in the OP) and 6 seconds with dynamic-index access.
VC 2017 takes only 2 seconds for constant-index access but 35 seconds for the dynamic case.



VB6 S1: the original version, with n and c as local variables. VB6 S2: the inner loops were moved into a function, with n and c passed in by reference.
Test conditions: Intel Xeon W3690 / Windows 10.



Here are the test results:



[Table: execution times in seconds]

The rewritten code:



VB6:



DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)

For dd = 0 To 10000
n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
Next

End Sub

Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
For dd = 0 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
End Sub


Private Sub Form_Load()

Dim n(10001, 10) As Long
Dim c(10001, 10) As Long

Dim t0 As Long
Dim t1 As Long
Dim t2 As Long

Dim id As Long
Dim ret As Long

t0 = GetTickCount

For d = 1 To 1000000
id = d And 7
Call cal_var(n, c, id) 'For VB S2

'For dd = 0 To 10000
' n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
'Next

Next

t1 = GetTickCount

For d = 1 To 1000000
Call cal_const(n, c) 'For VB S2

'For dd = 0 To 10000
' n(dd, 1) = c(dd, 2) + c(dd, 3)
'Next
Next

t2 = GetTickCount

For d = 0 To 10000
Sum = Sum + n(d, t0 And 7)
Next

MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " miliseconds"

End Sub


C++ Code:



#include <iostream>
#include <time.h>
#include <string.h>

using namespace std;

#define NUM_ITERATION 1000000
#define ARR_SIZE 10000

typedef long long_arr[ARR_SIZE][10];

void calc_var(long_arr &n, long_arr &c, long id) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
    }
}

void calc_const(long_arr &n, long_arr &c) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][0] = c[dd][1] + c[dd][2];
    }
}

int main()
{
    cout << "start ..." << endl;
    time_t t0 = time(NULL);

    long_arr n;
    long_arr c;

    memset(n, 0, sizeof(n));
    memset(c, t0 & 1, sizeof(c));

    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_var(n, c, (long)t0 & 7);
    }

    time_t t1 = time(NULL);

    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_const(n, c);
    }

    time_t t2 = time(NULL);

    long sum = 0;
    for (int i = 0; i < ARR_SIZE; i++) {
        sum += n[i][t0 & 7];
    }

    cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
    cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
    cout << "with final result : " << sum << endl;

    return 0;
}


The following original answer was based only on a test done with VC2008, for the VB6 S1 case.



ORIGINAL ANSWER:



I get the same result as Progger. I use the following code to prevent the compiler from optimizing the loops away entirely:



#include <string.h>   // memset

int main(int argc, char *argv[]) {

    long n[10000][10];
    long c[10000][10];
    long sum = 0;

    memset(n, 0, sizeof(n));
    memset(c, 0, sizeof(c));

    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }

    // use the results so the compiler cannot discard the loops entirely
    for (long dd = 1; dd < 10000; dd++) {
        sum += n[dd][1];
    }

    return sum;
}


I use Visual Studio C++ 2008 to compile the code. It turns out that the compiler generates a lot of address recomputation for the local arrays, mixed with imul (multiplication) instructions, which can be very costly. For example:



.text:00401065                 mov     edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx


Each indexed access takes one multiplication and one add instruction just to form the address.



EDIT 1: The problem is related to the performance of multidimensional array access in C++. More information can be found here:



Array Lookups




C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.
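
To make that idea concrete, here is a minimal sketch (my illustration, not from the linked article) of flattening the 10000 x 10 matrix into one contiguous block and computing the row offset once per row:

#include <vector>

int main()
{
    const long ROWS = 10000, COLS = 10;
    // one contiguous allocation instead of long[10000][10]
    std::vector<long> n(ROWS * COLS, 0), c(ROWS * COLS, 0);

    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < ROWS; dd++) {
            const long row = dd * COLS;            // row offset computed once
            n[row + 1] = c[row + 2] + c[row + 3];
        }
    }
    return (int)n[COLS + 1];                       // consume a result so the loops stay
}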




The VB version uses the traditional approach (an optimized flat 1D memory layout); the instructions used are only register adds (eax, edx, esi, ...), which are much faster.



.text:00401AD4                 mov     eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi


That answers the question of speed.
Advice: you should use existing libraries (for example uBLAS) if possible. More discussion can be found here.
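
For example, with Boost.uBLAS (assuming Boost is available; this is only a sketch of the element access, not a tuned benchmark) the inner loop of the question could be written as:

#include <boost/numeric/ublas/matrix.hpp>
namespace ublas = boost::numeric::ublas;

int main()
{
    // sizes match the question's arrays; zero_matrix gives a zero-filled source
    ublas::matrix<long> n(10000, 10);
    ublas::matrix<long> c = ublas::zero_matrix<long>(10000, 10);

    for (long dd = 1; dd < 10000; dd++)
        n(dd, 1) = c(dd, 2) + c(dd, 3);

    return 0;
}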



Here is the machine code of the C++ version:



.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main proc near ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514 = dword ptr -0C3514h
.text:00401000 var_C3510 = dword ptr -0C3510h
.text:00401000 var_C350C = dword ptr -0C350Ch
.text:00401000 var_C3500 = dword ptr -0C3500h
.text:00401000 var_C34FC = dword ptr -0C34FCh
.text:00401000 var_61A84 = dword ptr -61A84h
.text:00401000 var_61A7C = dword ptr -61A7Ch
.text:00401000 argc = dword ptr 8
.text:00401000 argv = dword ptr 0Ch
.text:00401000 envp = dword ptr 10h
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 mov eax, 0C3514h
.text:00401008 call __alloca_probe
.text:0040100D mov [ebp+var_61A84], 0
.text:00401017 mov [ebp+var_C350C], 1
.text:00401021 jmp short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023: ; CODE XREF: _main:loc_401097j
.text:00401023 mov eax, [ebp+var_C350C]
.text:00401029 add eax, 1
.text:0040102C mov [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032: ; CODE XREF: _main+21j
.text:00401032 cmp [ebp+var_C350C], 0F4240h
.text:0040103C jge short loc_401099
.text:0040103E mov [ebp+var_C3510], 1
.text:00401048 jmp short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A: ; CODE XREF: _main+95j
.text:0040104A mov ecx, [ebp+var_C3510]
.text:00401050 add ecx, 1
.text:00401053 mov [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059: ; CODE XREF: _main+48j
.text:00401059 cmp [ebp+var_C3510], 2710h
.text:00401063 jge short loc_401097
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
.text:00401095 jmp short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097: ; CODE XREF: _main+63j
.text:00401097 jmp short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099: ; CODE XREF: _main+3Cj
.text:00401099 mov [ebp+var_C3514], 1
.text:004010A3 jmp short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5: ; CODE XREF: _main+DCj
.text:004010A5 mov eax, [ebp+var_C3514]
.text:004010AB add eax, 1
.text:004010AE mov [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4: ; CODE XREF: _main+A3j
.text:004010B4 cmp [ebp+var_C3514], 2710h
.text:004010BE jge short loc_4010DE
.text:004010C0 mov ecx, [ebp+var_C3514]
.text:004010C6 imul ecx, 28h
.text:004010C9 mov edx, [ebp+var_61A84]
.text:004010CF add edx, [ebp+ecx+var_61A7C]
.text:004010D6 mov [ebp+var_61A84], edx
.text:004010DC jmp short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE: ; CODE XREF: _main+BEj
.text:004010DE mov eax, [ebp+var_61A84]
.text:004010E4 mov esp, ebp
.text:004010E6 pop ebp
.text:004010E7 retn
.text:004010E7 _main endp


The VB code is longer, so I post only the main function here:



.text:00401A81 loc_401A81:                             ; CODE XREF: .text:00401B18j
.text:00401A81 mov ecx, [ebp-68h]
.text:00401A84 mov eax, 0F4240h
.text:00401A89 cmp ecx, eax
.text:00401A8B jg loc_401B1D
.text:00401A91 mov edx, [ebp-18h]
.text:00401A94 mov edi, 1
.text:00401A99 add edx, 1
.text:00401A9C mov esi, edi
.text:00401A9E jo loc_401CEC
.text:00401AA4 mov [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7: ; CODE XREF: .text:00401B03j
.text:00401AA7 mov eax, 2710h
.text:00401AAC cmp esi, eax
.text:00401AAE jg short loc_401B05
.text:00401AB0 mov ebx, ds:__vbaGenerateBoundsError
.text:00401AB6 cmp esi, 2711h
.text:00401ABC jb short loc_401AC0
.text:00401ABE call ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0: ; CODE XREF: .text:00401ABCj
.text:00401AC0 cmp esi, 2711h
.text:00401AC6 jb short loc_401AD4
.text:00401AC8 call ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA cmp esi, 2711h
.text:00401AD0 jb short loc_401AD4
.text:00401AD2 call ebx
.text:00401AD4
.text:00401AD4 loc_401AD4: ; CODE XREF: .text:00401AC6j
.text:00401AD4 ; .text:00401AD0j
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
.text:00401AEC jo loc_401CEC
.text:00401AF2 add eax, esi
.text:00401AF4 mov [edx+esi*4+9C44h], ecx
.text:00401AFB jo loc_401CEC
.text:00401B01 mov esi, eax
.text:00401B03 jmp short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05: ; CODE XREF: .text:00401AAEj
.text:00401B05 mov ecx, [ebp-68h]
.text:00401B08 mov eax, 1
.text:00401B0D add eax, ecx
.text:00401B0F jo loc_401CEC
.text:00401B15 mov [ebp-68h], eax
.text:00401B18 jmp loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D: ; CODE XREF: .text:00401A8Bj
.text:00401B1D mov edi, 1
.text:00401B22 mov ebx, 2710h
.text:00401B27 mov esi, edi
.text:00401B29
.text:00401B29 loc_401B29: ; CODE XREF: .text:00401B5Fj
.text:00401B29 cmp esi, ebx
.text:00401B2B jg short loc_401B61
.text:00401B2D cmp esi, 2711h
.text:00401B33 jb short loc_401B3B
.text:00401B35 call ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B: ; CODE XREF: .text:00401B33j
.text:00401B3B mov ecx, [ebp-2Ch]
.text:00401B3E mov eax, [ebp-40h]
.text:00401B41 mov edx, [ecx+esi*4+9C44h]
.text:00401B48 add edx, eax
.text:00401B4A mov eax, edi
.text:00401B4C jo loc_401CEC
.text:00401B52 add eax, esi
.text:00401B54 mov [ebp-40h], edx
.text:00401B57 jo loc_401CEC
.text:00401B5D mov esi, eax
.text:00401B5F jmp short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61: ; CODE XREF: .text:00401B2Bj
.text:00401B61 mov ebx, ds:__vbaStrI4
.text:00401B67 mov ecx, 80020004h
.text:00401B6C mov [ebp-0B4h], ecx
.text:00401B72 mov [ebp-0A4h], ecx
.text:00401B78 mov [ebp-94h], ecx
.text:00401B7E mov ecx, [ebp-40h]
.text:00401B81 mov eax, 0Ah
.text:00401B86 push offset aDone ; "Done : "































  • Cheers, so the g++ is running loads slower... hmmm... I've run several tests since and VB6 wins hands down every time... also g++ won't allow array allocations over 10 million, VB6 no problem. So... VB6 is better than g++... nah! something's not right...

    – Progger
    Nov 13 '18 at 17:59











  • This code has UB and the loops are still at least dodgy. Not a valid benchmark.

    – Baum mit Augen
    Nov 13 '18 at 20:14











  • @progger : for the big array, you should use the dynamic allocation as discussed in this post: stackoverflow.com/questions/216259/….

    – tunglt
    Nov 13 '18 at 20:55











  • @Baum mit Augen: thanks for pointing out the UB; do you think the benchmark will be better if we fix it?

    – tunglt
    Nov 13 '18 at 21:03











  • It will be better, yes, in that the code would be valid and the valid code produces different assembly. But using loops like this is already dodgy in itself, especially given the inner loop does not depend on d. It raises the question of what exactly we are even trying to benchmark, what the desirable compiler optimizations would be, and where we would like to trick the compiler to actually get meaningful results.

    – Baum mit Augen
    Nov 13 '18 at 21:14











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53283791%2fc-running-slower-than-vb6%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









3














Short answer



You need to look into your compiler optimization settings. This resource might help



Takeaway: C++ allows you to use many tricks some generic and some dependent on your architecture, when used properly it will be superior to VB in terms of performance.



Long answer



Keep in mind this is highly dependent on your architecture and compiler, also compiler settings. You should configure your compiler to do a more agressive optimization.
Also you should write optimized code taking into account memory access, using CPU cache wisely etc.



I have done a test for you on an ubuntu 16.04 virtual machine using a core of Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz. Using the code bellow here are my times depending on the optimization level of the compiler I used g++ 5.4.0



I'm using optimization level 0,1,2,3,s and obtain 36s(completely unoptimized), 23s, and then.. zero.




osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0 start..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1 start..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 0 micro seconds



Note that by using a more agressive optimization level, the compiler will eliminate the code completely because the values in n are not used in the program.
To force the compiler to generate code use the volatile keyword when declaring n



With the volatile added now you'll have ~12s with the most agressive optimization (on my machine)




osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 12493927 micro seconds



The code I used for the test(based on your example)



#include <iostream>
#include <sys/time.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();

cout<<"..finished in "<< (t1-t0) << " micro secondsn";

return 0;
}


Multithreading



I have converted your code to use multithreading, using 2 threads I am able to reduce the time by half.



I am using the fact that as it is now, the results are not used so the inner for is not dependent on the outer one, in reality you should find another way to split the work so that the results do not overwrite one another.



#include <iostream>
#include <sys/time.h>
#include <omp.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
omp_set_num_threads(2);
#pragma omp parallel
{
}

cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
#pragma omp parallel for
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();
cout<<"..finished in "<< (t1-t0) << " micro secondsn";
return 0;
}



osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 6673741 micro seconds






share|improve this answer


























  • So it takes a 4.2 GHz I7 running ubuntu to get 23 secs without volatile, and I'm using a 2.GHz dual core XP 32 bit machine to get 12 seconds with VB6... that's Visual Basic 6... can't be right.

    – Progger
    Nov 13 '18 at 17:59








  • 1





    @Progger It takes 0 seconds without volatile in any reasonable benchmark setting, because your benchmark is flawed.

    – Baum mit Augen
    Nov 13 '18 at 21:15











  • Honestly, I'm not 100% sure the test is meaningful with volatile either, even when you fix the UB. (Downvoted for that btw, as that invalidates all results to begin with.)

    – Baum mit Augen
    Nov 13 '18 at 21:18











  • @BaummitAugen of course the code is not a correct benchmark, I have pointed that out above, since the compiler will optimize it out, but this is the "usecase" the OP has tested in VB. What I am trying to do is keep the code as close as possible to the original one and make a comparison to VB. Code is not correct for many reasons: arrays are not initialised, results are not used, all is allocated on stack, unaligned memory, etc etc, so I understand your point of view and respect it, but I tried to illustrate something else :)

    – Andrei Diea
    Nov 14 '18 at 8:54











  • @Progger C++ is a very powerful language and there are many things you can do to improve your code. Like everyone points out this is not the best example code, but C++'s power lays in its ability to go low level and optimize for the specific architecture you are using. For example, I used multithreading on your code and execution drops to 6 seconds using 2 threads on 2 cores. C++ allows you to take advantage of: multicores, specific vectorization instructions that some CPUs have (ex AVX for intel), L1 cache,etc so once you learnin more tricks you will be able to achive best performance!

    – Andrei Diea
    Nov 14 '18 at 9:00
















3














Short answer



You need to look into your compiler optimization settings. This resource might help



Takeaway: C++ allows you to use many tricks some generic and some dependent on your architecture, when used properly it will be superior to VB in terms of performance.



Long answer



Keep in mind this is highly dependent on your architecture and compiler, also compiler settings. You should configure your compiler to do a more agressive optimization.
Also you should write optimized code taking into account memory access, using CPU cache wisely etc.



I have done a test for you on an ubuntu 16.04 virtual machine using a core of Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz. Using the code bellow here are my times depending on the optimization level of the compiler I used g++ 5.4.0



I'm using optimization level 0,1,2,3,s and obtain 36s(completely unoptimized), 23s, and then.. zero.




osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0 start..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1 start..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 0 micro seconds



Note that by using a more agressive optimization level, the compiler will eliminate the code completely because the values in n are not used in the program.
To force the compiler to generate code use the volatile keyword when declaring n



With the volatile added now you'll have ~12s with the most agressive optimization (on my machine)




osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 12493927 micro seconds



The code I used for the test(based on your example)



#include <iostream>
#include <sys/time.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();

cout<<"..finished in "<< (t1-t0) << " micro secondsn";

return 0;
}


Multithreading



I have converted your code to use multithreading, using 2 threads I am able to reduce the time by half.



I am using the fact that as it is now, the results are not used so the inner for is not dependent on the outer one, in reality you should find another way to split the work so that the results do not overwrite one another.



#include <iostream>
#include <sys/time.h>
#include <omp.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
omp_set_num_threads(2);
#pragma omp parallel
{
}

cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
#pragma omp parallel for
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();
cout<<"..finished in "<< (t1-t0) << " micro secondsn";
return 0;
}



osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 6673741 micro seconds






share|improve this answer


























  • So it takes a 4.2 GHz I7 running ubuntu to get 23 secs without volatile, and I'm using a 2.GHz dual core XP 32 bit machine to get 12 seconds with VB6... that's Visual Basic 6... can't be right.

    – Progger
    Nov 13 '18 at 17:59








  • 1





    @Progger It takes 0 seconds without volatile in any reasonable benchmark setting, because your benchmark is flawed.

    – Baum mit Augen
    Nov 13 '18 at 21:15











  • Honestly, I'm not 100% sure the test is meaningful with volatile either, even when you fix the UB. (Downvoted for that btw, as that invalidates all results to begin with.)

    – Baum mit Augen
    Nov 13 '18 at 21:18











  • @BaummitAugen of course the code is not a correct benchmark, I have pointed that out above, since the compiler will optimize it out, but this is the "usecase" the OP has tested in VB. What I am trying to do is keep the code as close as possible to the original one and make a comparison to VB. Code is not correct for many reasons: arrays are not initialised, results are not used, all is allocated on stack, unaligned memory, etc etc, so I understand your point of view and respect it, but I tried to illustrate something else :)

    – Andrei Diea
    Nov 14 '18 at 8:54











  • @Progger C++ is a very powerful language and there are many things you can do to improve your code. Like everyone points out this is not the best example code, but C++'s power lays in its ability to go low level and optimize for the specific architecture you are using. For example, I used multithreading on your code and execution drops to 6 seconds using 2 threads on 2 cores. C++ allows you to take advantage of: multicores, specific vectorization instructions that some CPUs have (ex AVX for intel), L1 cache,etc so once you learnin more tricks you will be able to achive best performance!

    – Andrei Diea
    Nov 14 '18 at 9:00














3












3








3







Short answer



You need to look into your compiler optimization settings. This resource might help



Takeaway: C++ allows you to use many tricks some generic and some dependent on your architecture, when used properly it will be superior to VB in terms of performance.



Long answer



Keep in mind this is highly dependent on your architecture and compiler, also compiler settings. You should configure your compiler to do a more agressive optimization.
Also you should write optimized code taking into account memory access, using CPU cache wisely etc.



I have done a test for you on an ubuntu 16.04 virtual machine using a core of Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz. Using the code bellow here are my times depending on the optimization level of the compiler I used g++ 5.4.0



I'm using optimization level 0,1,2,3,s and obtain 36s(completely unoptimized), 23s, and then.. zero.




osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0 start..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1 start..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 0 micro seconds



Note that by using a more agressive optimization level, the compiler will eliminate the code completely because the values in n are not used in the program.
To force the compiler to generate code use the volatile keyword when declaring n



With the volatile added now you'll have ~12s with the most agressive optimization (on my machine)




osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 12493927 micro seconds



The code I used for the test(based on your example)



#include <iostream>
#include <sys/time.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();

cout<<"..finished in "<< (t1-t0) << " micro secondsn";

return 0;
}


Multithreading



I have converted your code to use multithreading, using 2 threads I am able to reduce the time by half.



I am using the fact that as it is now, the results are not used so the inner for is not dependent on the outer one, in reality you should find another way to split the work so that the results do not overwrite one another.



#include <iostream>
#include <sys/time.h>
#include <omp.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
omp_set_num_threads(2);
#pragma omp parallel
{
}

cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
#pragma omp parallel for
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();
cout<<"..finished in "<< (t1-t0) << " micro secondsn";
return 0;
}



osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 6673741 micro seconds






share|improve this answer















Short answer



You need to look into your compiler optimization settings. This resource might help



Takeaway: C++ allows you to use many tricks some generic and some dependent on your architecture, when used properly it will be superior to VB in terms of performance.



Long answer



Keep in mind this is highly dependent on your architecture and compiler, also compiler settings. You should configure your compiler to do a more agressive optimization.
Also you should write optimized code taking into account memory access, using CPU cache wisely etc.



I have done a test for you on an ubuntu 16.04 virtual machine using a core of Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz. Using the code bellow here are my times depending on the optimization level of the compiler I used g++ 5.4.0



I'm using optimization level 0,1,2,3,s and obtain 36s(completely unoptimized), 23s, and then.. zero.




osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0 start..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1 start..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 0 micro seconds



Note that by using a more agressive optimization level, the compiler will eliminate the code completely because the values in n are not used in the program.
To force the compiler to generate code use the volatile keyword when declaring n



With the volatile added now you'll have ~12s with the most agressive optimization (on my machine)




osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 12493927 micro seconds



The code I used for the test(based on your example)



#include <iostream>
#include <sys/time.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();

cout<<"..finished in "<< (t1-t0) << " micro secondsn";

return 0;
}


Multithreading



I have converted your code to use multithreading, using 2 threads I am able to reduce the time by half.



I am using the fact that as it is now, the results are not used so the inner for is not dependent on the outer one, in reality you should find another way to split the work so that the results do not overwrite one another.



#include <iostream>
#include <sys/time.h>
#include <omp.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
omp_set_num_threads(2);
#pragma omp parallel
{
}

cout<<"start"<<endl;
u64 t0 = timestamp();

volatile long n[10000][10];
long c[10000][10];

for(long d=1;d<1000000;d++)
{
#pragma omp parallel for
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}

u64 t1 = timestamp();
cout<<"..finished in "<< (t1-t0) << " micro secondsn";
return 0;
}



osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 6673741 micro seconds







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 14 '18 at 9:17

























answered Nov 13 '18 at 16:19









Andrei DieaAndrei Diea

565




565













  • So it takes a 4.2 GHz I7 running ubuntu to get 23 secs without volatile, and I'm using a 2.GHz dual core XP 32 bit machine to get 12 seconds with VB6... that's Visual Basic 6... can't be right.

    – Progger
    Nov 13 '18 at 17:59








  • 1





    @Progger It takes 0 seconds without volatile in any reasonable benchmark setting, because your benchmark is flawed.

    – Baum mit Augen
    Nov 13 '18 at 21:15











  • Honestly, I'm not 100% sure the test is meaningful with volatile either, even when you fix the UB. (Downvoted for that btw, as that invalidates all results to begin with.)

    – Baum mit Augen
    Nov 13 '18 at 21:18











  • @BaummitAugen of course the code is not a correct benchmark, I have pointed that out above, since the compiler will optimize it out, but this is the "usecase" the OP has tested in VB. What I am trying to do is keep the code as close as possible to the original one and make a comparison to VB. Code is not correct for many reasons: arrays are not initialised, results are not used, all is allocated on stack, unaligned memory, etc etc, so I understand your point of view and respect it, but I tried to illustrate something else :)

    – Andrei Diea
    Nov 14 '18 at 8:54











  • @Progger C++ is a very powerful language and there are many things you can do to improve your code. Like everyone points out this is not the best example code, but C++'s power lays in its ability to go low level and optimize for the specific architecture you are using. For example, I used multithreading on your code and execution drops to 6 seconds using 2 threads on 2 cores. C++ allows you to take advantage of: multicores, specific vectorization instructions that some CPUs have (ex AVX for intel), L1 cache,etc so once you learnin more tricks you will be able to achive best performance!

    – Andrei Diea
    Nov 14 '18 at 9:00



















  • So it takes a 4.2 GHz I7 running ubuntu to get 23 secs without volatile, and I'm using a 2.GHz dual core XP 32 bit machine to get 12 seconds with VB6... that's Visual Basic 6... can't be right.

    – Progger
    Nov 13 '18 at 17:59








  • 1





    @Progger It takes 0 seconds without volatile in any reasonable benchmark setting, because your benchmark is flawed.

    – Baum mit Augen
    Nov 13 '18 at 21:15











  • Honestly, I'm not 100% sure the test is meaningful with volatile either, even when you fix the UB. (Downvoted for that btw, as that invalidates all results to begin with.)

    – Baum mit Augen
    Nov 13 '18 at 21:18











0














UPDATE: Recent C++ compilers give much better results compared to the VB6 compiler.



The VB6 code shown above does not reflect a realistic case, where the access index can vary and the calculations can be broken out into separate functions.
Further experiments show that VB6 pays a huge optimization penalty when the arrays are passed to a function by reference.



I tried several C++ compilers and rewrote the benchmark code so that some run-time-dependent behaviour is added to defeat the optimizer. Certainly
the code could be better; all suggestions are welcome. I used the -O3 option to maximize speed; multithreading was not used.



RESULT:



g++ 8.1 gives the best overall result: 5 seconds with constant index access (known to the compiler, the case shown in the OP) and 6 seconds with dynamic index access.
VC 2017 takes only 2 seconds for constant index access but 35 seconds for the dynamic case.



VB6 S1: the original version, where n and c are local variables. VB6 S2: the inner loop was moved into a function, with n and c passed by reference.
Test machine: Intel Xeon W3690 / Windows 10.



Here are the test results:

[Table: execution times in seconds]

The rewritten code:



VB6:



DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)

For dd = 0 To 10000
n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
Next

End Sub

Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
For dd = 0 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
End Sub


Private Sub Form_Load()

Dim n(10001, 10) As Long
Dim c(10001, 10) As Long

Dim t0 As Long
Dim t1 As Long
Dim t2 As Long

Dim id As Long
Dim ret As Long

t0 = GetTickCount

For d = 1 To 1000000
id = d And 7
Call cal_var(n, c, id) 'For VB S2

'For dd = 0 To 10000
' n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
'Next

Next

t1 = GetTickCount

For d = 1 To 1000000
Call cal_const(n, c) 'For VB S2

'For dd = 0 To 10000
' n(dd, 1) = c(dd, 2) + c(dd, 3)
'Next
Next

t2 = GetTickCount

For d = 0 To 10000
Sum = Sum + n(d, t0 And 7)
Next

MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " milliseconds"

End Sub


C++ Code:



#include <iostream>
#include <time.h>
#include <string.h>

using namespace std;

#define NUM_ITERATION 1000000
#define ARR_SIZE 10000

typedef long long_arr[ARR_SIZE][10];

void calc_var(long_arr &n, long_arr &c, long id) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
    }
}

void calc_const(long_arr &n, long_arr &c) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][0] = c[dd][1] + c[dd][2];
    }
}

int main()
{
    cout << "start ..." << endl;
    time_t t0 = time(NULL);

    long_arr n;
    long_arr c;

    memset(n, 0, sizeof(n));
    memset(c, t0 & 1, sizeof(c));

    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_var(n, c, (long)t0 & 7);
    }

    time_t t1 = time(NULL);

    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_const(n, c);
    }

    time_t t2 = time(NULL);

    long sum = 0;
    for (int i = 0; i < ARR_SIZE; i++) {
        sum += n[i][t0 & 7];
    }

    cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
    cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
    cout << "with final result : " << sum << endl;

    return 0;
}
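(One caveat about the benchmark above: time(NULL) only has one-second resolution. A steady_clock-based helper, sketched below as my own drop-in rather than part of the original answer, gives millisecond readings comparable to VB6's GetTickCount.)

#include <chrono>

// Millisecond timer (my sketch, not part of the original answer); a drop-in
// replacement for the time(NULL) calls, comparable to VB6's GetTickCount.
static long long now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

// Usage:
//   long long t0 = now_ms();
//   /* ... work ... */
//   long long t1 = now_ms();
//   cout << "finished in " << (t1 - t0) << " ms" << endl;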


The original answer below was based only on a test done with VC2008, in the VB6 S1 case.



ORIGINAL ANSWER:



I get the same result as Progger. I used the following code to keep the compiler from optimizing the loops away entirely:



#include <string.h>   // for memset

int main(int argc, char *argv[]) {   // note: the original "char *argv" is not a valid main signature

    long n[10000][10];
    long c[10000][10];
    long sum = 0;

    memset(n, 0, sizeof(n));
    memset(c, 0, sizeof(c));

    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }

    for (long dd = 1; dd < 10000; dd++) {
        sum += n[dd][1];
    }

    return sum;
}


I used Visual Studio C++ 2008 to compile the code. It turns out the compiler reloads the local index variable and recomputes the address with an imul (multiplication) instruction for every access, which can be very costly, for example:



.text:00401065                 mov     edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx


Each array access costs one multiplication and one addition just to compute the address.
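(To make that concrete, here is a general strength-reduction sketch, not the answer's code: walking the rows with pointers turns each element access into a fixed-stride add instead of an imul-based address computation.)

// General strength-reduction sketch (not the answer's code).
int main() {
    static long n[10000][10] = {};   // static storage, so stack size is not an issue
    static long c[10000][10] = {};

    long*       np = &n[1][0];
    const long* cp = &c[1][0];
    for (long dd = 1; dd < 10000; ++dd, np += 10, cp += 10) {
        np[1] = cp[2] + cp[3];       // same work as n[dd][1] = c[dd][2] + c[dd][3]
    }
    return static_cast<int>(n[1][1]);
}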



EDIT1: The problem is related to multidimensional array access performance in C++. More information can be found here:



Array Lookups




C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.
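(As a concrete illustration of the flat 1D indexing the quoted article recommends, here is a general sketch, not code from the article or from this answer: one contiguous buffer indexed as row * cols + col.)

#include <vector>

// Flat 1D layout sketch (general technique): one contiguous buffer with manual
// row * cols + col indexing, so hot loops can hoist the row offset themselves.
struct Flat2D {
    std::vector<long> data;
    long rows, cols;
    Flat2D(long r, long c) : data(r * c, 0), rows(r), cols(c) {}
    long&       operator()(long r, long c)       { return data[r * cols + c]; }
    const long& operator()(long r, long c) const { return data[r * cols + c]; }
};

// Usage matching the OP's inner loop:
//   Flat2D n(10000, 10), t(10000, 10);
//   for (long dd = 1; dd < 10000; ++dd)
//       n(dd, 1) = t(dd, 2) + t(dd, 3);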




The VB version uses the traditional flat 1D memory layout, and the instructions used are just register adds (eax, edx, esi, ...), which are much faster.



.text:00401AD4                 mov     eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi


That explains the difference in speed.
Advice: use an existing library (for example uBLAS) if possible. More discussion can be found here.



Here is the machine code of the C++ version:



.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main proc near ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514 = dword ptr -0C3514h
.text:00401000 var_C3510 = dword ptr -0C3510h
.text:00401000 var_C350C = dword ptr -0C350Ch
.text:00401000 var_C3500 = dword ptr -0C3500h
.text:00401000 var_C34FC = dword ptr -0C34FCh
.text:00401000 var_61A84 = dword ptr -61A84h
.text:00401000 var_61A7C = dword ptr -61A7Ch
.text:00401000 argc = dword ptr 8
.text:00401000 argv = dword ptr 0Ch
.text:00401000 envp = dword ptr 10h
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 mov eax, 0C3514h
.text:00401008 call __alloca_probe
.text:0040100D mov [ebp+var_61A84], 0
.text:00401017 mov [ebp+var_C350C], 1
.text:00401021 jmp short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023: ; CODE XREF: _main:loc_401097j
.text:00401023 mov eax, [ebp+var_C350C]
.text:00401029 add eax, 1
.text:0040102C mov [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032: ; CODE XREF: _main+21j
.text:00401032 cmp [ebp+var_C350C], 0F4240h
.text:0040103C jge short loc_401099
.text:0040103E mov [ebp+var_C3510], 1
.text:00401048 jmp short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A: ; CODE XREF: _main+95j
.text:0040104A mov ecx, [ebp+var_C3510]
.text:00401050 add ecx, 1
.text:00401053 mov [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059: ; CODE XREF: _main+48j
.text:00401059 cmp [ebp+var_C3510], 2710h
.text:00401063 jge short loc_401097
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
.text:00401095 jmp short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097: ; CODE XREF: _main+63j
.text:00401097 jmp short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099: ; CODE XREF: _main+3Cj
.text:00401099 mov [ebp+var_C3514], 1
.text:004010A3 jmp short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5: ; CODE XREF: _main+DCj
.text:004010A5 mov eax, [ebp+var_C3514]
.text:004010AB add eax, 1
.text:004010AE mov [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4: ; CODE XREF: _main+A3j
.text:004010B4 cmp [ebp+var_C3514], 2710h
.text:004010BE jge short loc_4010DE
.text:004010C0 mov ecx, [ebp+var_C3514]
.text:004010C6 imul ecx, 28h
.text:004010C9 mov edx, [ebp+var_61A84]
.text:004010CF add edx, [ebp+ecx+var_61A7C]
.text:004010D6 mov [ebp+var_61A84], edx
.text:004010DC jmp short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE: ; CODE XREF: _main+BEj
.text:004010DE mov eax, [ebp+var_61A84]
.text:004010E4 mov esp, ebp
.text:004010E6 pop ebp
.text:004010E7 retn
.text:004010E7 _main endp


The VB code is longer, so I post only the main loop here:



.text:00401A81 loc_401A81:                             ; CODE XREF: .text:00401B18j
.text:00401A81 mov ecx, [ebp-68h]
.text:00401A84 mov eax, 0F4240h
.text:00401A89 cmp ecx, eax
.text:00401A8B jg loc_401B1D
.text:00401A91 mov edx, [ebp-18h]
.text:00401A94 mov edi, 1
.text:00401A99 add edx, 1
.text:00401A9C mov esi, edi
.text:00401A9E jo loc_401CEC
.text:00401AA4 mov [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7: ; CODE XREF: .text:00401B03j
.text:00401AA7 mov eax, 2710h
.text:00401AAC cmp esi, eax
.text:00401AAE jg short loc_401B05
.text:00401AB0 mov ebx, ds:__vbaGenerateBoundsError
.text:00401AB6 cmp esi, 2711h
.text:00401ABC jb short loc_401AC0
.text:00401ABE call ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0: ; CODE XREF: .text:00401ABCj
.text:00401AC0 cmp esi, 2711h
.text:00401AC6 jb short loc_401AD4
.text:00401AC8 call ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA cmp esi, 2711h
.text:00401AD0 jb short loc_401AD4
.text:00401AD2 call ebx
.text:00401AD4
.text:00401AD4 loc_401AD4: ; CODE XREF: .text:00401AC6j
.text:00401AD4 ; .text:00401AD0j
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
.text:00401AEC jo loc_401CEC
.text:00401AF2 add eax, esi
.text:00401AF4 mov [edx+esi*4+9C44h], ecx
.text:00401AFB jo loc_401CEC
.text:00401B01 mov esi, eax
.text:00401B03 jmp short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05: ; CODE XREF: .text:00401AAEj
.text:00401B05 mov ecx, [ebp-68h]
.text:00401B08 mov eax, 1
.text:00401B0D add eax, ecx
.text:00401B0F jo loc_401CEC
.text:00401B15 mov [ebp-68h], eax
.text:00401B18 jmp loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D: ; CODE XREF: .text:00401A8Bj
.text:00401B1D mov edi, 1
.text:00401B22 mov ebx, 2710h
.text:00401B27 mov esi, edi
.text:00401B29
.text:00401B29 loc_401B29: ; CODE XREF: .text:00401B5Fj
.text:00401B29 cmp esi, ebx
.text:00401B2B jg short loc_401B61
.text:00401B2D cmp esi, 2711h
.text:00401B33 jb short loc_401B3B
.text:00401B35 call ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B: ; CODE XREF: .text:00401B33j
.text:00401B3B mov ecx, [ebp-2Ch]
.text:00401B3E mov eax, [ebp-40h]
.text:00401B41 mov edx, [ecx+esi*4+9C44h]
.text:00401B48 add edx, eax
.text:00401B4A mov eax, edi
.text:00401B4C jo loc_401CEC
.text:00401B52 add eax, esi
.text:00401B54 mov [ebp-40h], edx
.text:00401B57 jo loc_401CEC
.text:00401B5D mov esi, eax
.text:00401B5F jmp short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61: ; CODE XREF: .text:00401B2Bj
.text:00401B61 mov ebx, ds:__vbaStrI4
.text:00401B67 mov ecx, 80020004h
.text:00401B6C mov [ebp-0B4h], ecx
.text:00401B72 mov [ebp-0A4h], ecx
.text:00401B78 mov [ebp-94h], ecx
.text:00401B7E mov ecx, [ebp-40h]
.text:00401B81 mov eax, 0Ah
.text:00401B86 push offset aDone ; "Done : "





– tunglt, answered Nov 13 '18 at 16:50, edited Nov 16 '18 at 9:03


























  • Cheers, so the g++ is running loads slower... hmmm... I've run several tests since and VB6 wins hands down every time... also g++ won't allow array allocations over 10 million, VB6 no problem. So... VB6 is better than g++... nah! Something's not right...

    – Progger
    Nov 13 '18 at 17:59











  • This code has UB and the loops are still at least dodgy. Not a valid benchmark.

    – Baum mit Augen
    Nov 13 '18 at 20:14











  • @Progger: for big arrays, you should use dynamic allocation as discussed in this post: stackoverflow.com/questions/216259/….

    – tunglt
    Nov 13 '18 at 20:55
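(For example, as a general sketch not taken from the linked post: putting the OP's tables on the heap with std::vector removes the stack-size limit.)

#include <array>
#include <vector>

// Heap-allocated tables (general sketch, not from the linked post): the size is
// limited by available memory rather than by the stack.
int main() {
    std::vector<std::array<long, 10>> n(10000), c(10000);   // zero-initialised on the heap
    for (long d = 1; d < 1000000; ++d)
        for (long dd = 1; dd < 10000; ++dd)
            n[dd][1] = c[dd][2] + c[dd][3];
    return static_cast<int>(n[1][1]);                        // use the result
}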











  • @Baum mit Augen: thanks for pointing out the UB; do you think the benchmark will be better if we fix it?

    – tunglt
    Nov 13 '18 at 21:03











  • It will be better, yes, in that the code would be valid, and the valid code produces different assembly. But using loops like this is already dodgy in itself, especially given that the inner loop does not depend on d. It raises the question of what exactly we are even trying to benchmark, which compiler optimizations would be desirable, and where we would want to trick the compiler so that we actually get meaningful results.

    – Baum mit Augen
    Nov 13 '18 at 21:14
















0














UPDATE : Recent c++ compilers give much better results compare to VB6 compiler.



The VB6 codes shown above does not reflect the real case, where access index can be varied and calculations can be breaked into embraced functions.
More experiences show that VB6 has a huge penaty for optimizing when arrays used were passed as function input (by reference).



I tried severals c++ compilers and rewritten the benchmark codes where some random behaviors was added to trick the optimization. Certainly,
the codes could be better, all suggestions are welcome. I used -O3 option for maximizing the speed, multithread mode was not used.



RESULT:



The g++ 8.1 gives the best final result with 5 seconds with const index access (compiler known it, case shown in OP), and 6 seconds with dynamic index access.
VC 2017 takes 2 seconds only for const index access but 35 seconds for dynamic case.



VB6 S1 : Original version, n and c are local variables. VB6 S2: inner loops was moved into a function, n and c are input variables passed by reference.
Test condition : Intel Xeon W3690 / Windows 10



Here are the test result :



Executions time in seconds
The rewritten codes :



VB6:



DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)

For dd = 0 To 10000
n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
Next

End Sub

Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
For dd = 0 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
End Sub


Private Sub Form_Load()

Dim n(10001, 10) As Long
Dim c(10001, 10) As Long

Dim t0 As Long
Dim t1 As Long
Dim t2 As Long

Dim id As Long
Dim ret As Long

t0 = GetTickCount

For d = 1 To 1000000
id = d And 7
Call cal_var(n, c, id) 'For VB S2

'For dd = 0 To 10000
' n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
'Next

Next

t1 = GetTickCount

For d = 1 To 1000000
Call cal_const(n, c) 'For VB S2

'For dd = 0 To 10000
' n(dd, 1) = c(dd, 2) + c(dd, 3)
'Next
Next

t2 = GetTickCount

For d = 0 To 10000
Sum = Sum + n(d, t0 And 7)
Next

MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " miliseconds"

End Sub


C++ Code:



#include <iostream>
#include <time.h>
#include <string.h>

using namespace std;

#define NUM_ITERATION 1000000
#define ARR_SIZE 10000

typedef long long_arr[ARR_SIZE][10];

void calc_var(long_arr &n, long_arr &c, long id) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
}
}

void calc_const(long_arr &n, long_arr &c) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][0] = c[dd][1] + c[dd][2];
}
}

int main()
{
cout << "start ..." << endl;
time_t t0 = time(NULL);

long_arr n;
long_arr c;

memset(n, 0, sizeof(n));
memset(c, t0 & 1, sizeof(c));

for (long d = 1; d < NUM_ITERATION; d++) {
calc_var(n, c, (long)t0 & 7);
}

time_t t1 = time(NULL);

for (long d = 1; d < NUM_ITERATION; d++) {
calc_const(n, c);
}

time_t t2 = time(NULL);

long sum = 0;
for (int i = 0; i < ARR_SIZE; i++) {
sum += n[i][t0 & 7];
}

cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
cout << "with final result : " << sum << endl;

return 0;
}


The following original answer was based only on test done in VC2008, VB6 S1 case.



ORIGINAL ANSWER:



I have the same result with Progger. I use the following code to avoid the optimization (code blank) made by the compiler :



int main(int argc, char *argv) {

long n[10000][10];
long c[10000][10];
long sum = 0;

memset(n, 0, sizeof(n) );
memset(c, 0, sizeof(c) );

for (long d=1;d<1000000;d++){
for (long dd=1;dd<10000;dd++){
n[dd][1]=c[dd][2]+c[dd][3];
}
}

for (long dd=1;dd<10000;dd++){
sum += n[dd][1];
}

return sum;
}


I use the visual studio C++ 2008 to compile the code. it turns out that the compiler uses a lot of adress reallocation with local variable, mixing with an imul (multiplication) instruction, that could be very costly, for example :



.text:00401065                 mov     edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx


Each access to an index take one multiplication and one add instruction.



EDIT1: The problem is related to multidimension array access performance in C++. More information can be found here:



Array Lookups




C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.




The VB version use a traditional way (1D optimized flat memory), intructions used are only registers add (eax, edx, esi ...), they are much faster.



.text:00401AD4                 mov     eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi


That can answer the question of speed.
Advice: you should use existed libraries (for example: uBlas) if possible. More discussion can be found here.



Here is the machine code of C++ version :



.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main proc near ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514 = dword ptr -0C3514h
.text:00401000 var_C3510 = dword ptr -0C3510h
.text:00401000 var_C350C = dword ptr -0C350Ch
.text:00401000 var_C3500 = dword ptr -0C3500h
.text:00401000 var_C34FC = dword ptr -0C34FCh
.text:00401000 var_61A84 = dword ptr -61A84h
.text:00401000 var_61A7C = dword ptr -61A7Ch
.text:00401000 argc = dword ptr 8
.text:00401000 argv = dword ptr 0Ch
.text:00401000 envp = dword ptr 10h
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 mov eax, 0C3514h
.text:00401008 call __alloca_probe
.text:0040100D mov [ebp+var_61A84], 0
.text:00401017 mov [ebp+var_C350C], 1
.text:00401021 jmp short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023: ; CODE XREF: _main:loc_401097j
.text:00401023 mov eax, [ebp+var_C350C]
.text:00401029 add eax, 1
.text:0040102C mov [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032: ; CODE XREF: _main+21j
.text:00401032 cmp [ebp+var_C350C], 0F4240h
.text:0040103C jge short loc_401099
.text:0040103E mov [ebp+var_C3510], 1
.text:00401048 jmp short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A: ; CODE XREF: _main+95j
.text:0040104A mov ecx, [ebp+var_C3510]
.text:00401050 add ecx, 1
.text:00401053 mov [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059: ; CODE XREF: _main+48j
.text:00401059 cmp [ebp+var_C3510], 2710h
.text:00401063 jge short loc_401097
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
.text:00401095 jmp short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097: ; CODE XREF: _main+63j
.text:00401097 jmp short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099: ; CODE XREF: _main+3Cj
.text:00401099 mov [ebp+var_C3514], 1
.text:004010A3 jmp short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5: ; CODE XREF: _main+DCj
.text:004010A5 mov eax, [ebp+var_C3514]
.text:004010AB add eax, 1
.text:004010AE mov [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4: ; CODE XREF: _main+A3j
.text:004010B4 cmp [ebp+var_C3514], 2710h
.text:004010BE jge short loc_4010DE
.text:004010C0 mov ecx, [ebp+var_C3514]
.text:004010C6 imul ecx, 28h
.text:004010C9 mov edx, [ebp+var_61A84]
.text:004010CF add edx, [ebp+ecx+var_61A7C]
.text:004010D6 mov [ebp+var_61A84], edx
.text:004010DC jmp short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE: ; CODE XREF: _main+BEj
.text:004010DE mov eax, [ebp+var_61A84]
.text:004010E4 mov esp, ebp
.text:004010E6 pop ebp
.text:004010E7 retn
.text:004010E7 _main endp


The VB code is longer, so I post here only the main function :



.text:00401A81 loc_401A81:                             ; CODE XREF: .text:00401B18j
.text:00401A81 mov ecx, [ebp-68h]
.text:00401A84 mov eax, 0F4240h
.text:00401A89 cmp ecx, eax
.text:00401A8B jg loc_401B1D
.text:00401A91 mov edx, [ebp-18h]
.text:00401A94 mov edi, 1
.text:00401A99 add edx, 1
.text:00401A9C mov esi, edi
.text:00401A9E jo loc_401CEC
.text:00401AA4 mov [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7: ; CODE XREF: .text:00401B03j
.text:00401AA7 mov eax, 2710h
.text:00401AAC cmp esi, eax
.text:00401AAE jg short loc_401B05
.text:00401AB0 mov ebx, ds:__vbaGenerateBoundsError
.text:00401AB6 cmp esi, 2711h
.text:00401ABC jb short loc_401AC0
.text:00401ABE call ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0: ; CODE XREF: .text:00401ABCj
.text:00401AC0 cmp esi, 2711h
.text:00401AC6 jb short loc_401AD4
.text:00401AC8 call ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA cmp esi, 2711h
.text:00401AD0 jb short loc_401AD4
.text:00401AD2 call ebx
.text:00401AD4
.text:00401AD4 loc_401AD4: ; CODE XREF: .text:00401AC6j
.text:00401AD4 ; .text:00401AD0j
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
.text:00401AEC jo loc_401CEC
.text:00401AF2 add eax, esi
.text:00401AF4 mov [edx+esi*4+9C44h], ecx
.text:00401AFB jo loc_401CEC
.text:00401B01 mov esi, eax
.text:00401B03 jmp short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05: ; CODE XREF: .text:00401AAEj
.text:00401B05 mov ecx, [ebp-68h]
.text:00401B08 mov eax, 1
.text:00401B0D add eax, ecx
.text:00401B0F jo loc_401CEC
.text:00401B15 mov [ebp-68h], eax
.text:00401B18 jmp loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D: ; CODE XREF: .text:00401A8Bj
.text:00401B1D mov edi, 1
.text:00401B22 mov ebx, 2710h
.text:00401B27 mov esi, edi
.text:00401B29
.text:00401B29 loc_401B29: ; CODE XREF: .text:00401B5Fj
.text:00401B29 cmp esi, ebx
.text:00401B2B jg short loc_401B61
.text:00401B2D cmp esi, 2711h
.text:00401B33 jb short loc_401B3B
.text:00401B35 call ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B: ; CODE XREF: .text:00401B33j
.text:00401B3B mov ecx, [ebp-2Ch]
.text:00401B3E mov eax, [ebp-40h]
.text:00401B41 mov edx, [ecx+esi*4+9C44h]
.text:00401B48 add edx, eax
.text:00401B4A mov eax, edi
.text:00401B4C jo loc_401CEC
.text:00401B52 add eax, esi
.text:00401B54 mov [ebp-40h], edx
.text:00401B57 jo loc_401CEC
.text:00401B5D mov esi, eax
.text:00401B5F jmp short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61: ; CODE XREF: .text:00401B2Bj
.text:00401B61 mov ebx, ds:__vbaStrI4
.text:00401B67 mov ecx, 80020004h
.text:00401B6C mov [ebp-0B4h], ecx
.text:00401B72 mov [ebp-0A4h], ecx
.text:00401B78 mov [ebp-94h], ecx
.text:00401B7E mov ecx, [ebp-40h]
.text:00401B81 mov eax, 0Ah
.text:00401B86 push offset aDone ; "Done : "





share|improve this answer


























  • Cheers, so the g++ is running loads slower... hmmm... I've run several tests since and VB6 wins hands down every time... also g++ won't allow array allocations over 10 million, VB6 no problem. So... VB6 is better than g++... nah! some things not right...

    – Progger
    Nov 13 '18 at 17:59











  • This code has UB and the loops are still at least dodgy. Not a valid benchmark.

    – Baum mit Augen
    Nov 13 '18 at 20:14











  • @progger : for the big array, you should use the dynamic allocation as discussed in this post: stackoverflow.com/questions/216259/….

    – tunglt
    Nov 13 '18 at 20:55











  • @Baum mit Augen: thanks to point out the UB, do you think the benchmark will be better if we fix it ?

    – tunglt
    Nov 13 '18 at 21:03











  • It will be better, yes, in that the code would be valid and the valid code produces different assembly. But using loops like this is already dodgy in itself, esp given the inner loop does not depend on d. It raises the question what exactly we are even trying to benchmark, and what exactly would be desirable compiler optimizations and where we would like to trick the compiler to actually give us meaningful results.

    – Baum mit Augen
    Nov 13 '18 at 21:14














0












0








0







UPDATE : Recent c++ compilers give much better results compare to VB6 compiler.



The VB6 codes shown above does not reflect the real case, where access index can be varied and calculations can be breaked into embraced functions.
More experiences show that VB6 has a huge penaty for optimizing when arrays used were passed as function input (by reference).



I tried severals c++ compilers and rewritten the benchmark codes where some random behaviors was added to trick the optimization. Certainly,
the codes could be better, all suggestions are welcome. I used -O3 option for maximizing the speed, multithread mode was not used.



RESULT:



The g++ 8.1 gives the best final result with 5 seconds with const index access (compiler known it, case shown in OP), and 6 seconds with dynamic index access.
VC 2017 takes 2 seconds only for const index access but 35 seconds for dynamic case.



VB6 S1 : Original version, n and c are local variables. VB6 S2: inner loops was moved into a function, n and c are input variables passed by reference.
Test condition : Intel Xeon W3690 / Windows 10



Here are the test result :



Executions time in seconds
The rewritten codes :



VB6:



DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)

For dd = 0 To 10000
n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
Next

End Sub

Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
For dd = 0 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
End Sub


Private Sub Form_Load()

Dim n(10001, 10) As Long
Dim c(10001, 10) As Long

Dim t0 As Long
Dim t1 As Long
Dim t2 As Long

Dim id As Long
Dim ret As Long

t0 = GetTickCount

For d = 1 To 1000000
id = d And 7
Call cal_var(n, c, id) 'For VB S2

'For dd = 0 To 10000
' n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
'Next

Next

t1 = GetTickCount

For d = 1 To 1000000
Call cal_const(n, c) 'For VB S2

'For dd = 0 To 10000
' n(dd, 1) = c(dd, 2) + c(dd, 3)
'Next
Next

t2 = GetTickCount

For d = 0 To 10000
Sum = Sum + n(d, t0 And 7)
Next

MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " miliseconds"

End Sub


C++ Code:



#include <iostream>
#include <time.h>
#include <string.h>

using namespace std;

#define NUM_ITERATION 1000000
#define ARR_SIZE 10000

typedef long long_arr[ARR_SIZE][10];

void calc_var(long_arr &n, long_arr &c, long id) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
}
}

void calc_const(long_arr &n, long_arr &c) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][0] = c[dd][1] + c[dd][2];
}
}

int main()
{
cout << "start ..." << endl;
time_t t0 = time(NULL);

long_arr n;
long_arr c;

memset(n, 0, sizeof(n));
memset(c, t0 & 1, sizeof(c));

for (long d = 1; d < NUM_ITERATION; d++) {
calc_var(n, c, (long)t0 & 7);
}

time_t t1 = time(NULL);

for (long d = 1; d < NUM_ITERATION; d++) {
calc_const(n, c);
}

time_t t2 = time(NULL);

long sum = 0;
for (int i = 0; i < ARR_SIZE; i++) {
sum += n[i][t0 & 7];
}

cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
cout << "with final result : " << sum << endl;

return 0;
}


The following original answer was based only on test done in VC2008, VB6 S1 case.



ORIGINAL ANSWER:



I have the same result with Progger. I use the following code to avoid the optimization (code blank) made by the compiler :



int main(int argc, char *argv) {

long n[10000][10];
long c[10000][10];
long sum = 0;

memset(n, 0, sizeof(n) );
memset(c, 0, sizeof(c) );

for (long d=1;d<1000000;d++){
for (long dd=1;dd<10000;dd++){
n[dd][1]=c[dd][2]+c[dd][3];
}
}

for (long dd=1;dd<10000;dd++){
sum += n[dd][1];
}

return sum;
}


I use the visual studio C++ 2008 to compile the code. it turns out that the compiler uses a lot of adress reallocation with local variable, mixing with an imul (multiplication) instruction, that could be very costly, for example :



.text:00401065                 mov     edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx


Each access to an index take one multiplication and one add instruction.



EDIT1: The problem is related to multidimension array access performance in C++. More information can be found here:



Array Lookups




C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.




The VB version use a traditional way (1D optimized flat memory), intructions used are only registers add (eax, edx, esi ...), they are much faster.



.text:00401AD4                 mov     eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi


That can answer the question of speed.
Advice: you should use existed libraries (for example: uBlas) if possible. More discussion can be found here.



Here is the machine code of C++ version :



.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main proc near ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514 = dword ptr -0C3514h
.text:00401000 var_C3510 = dword ptr -0C3510h
.text:00401000 var_C350C = dword ptr -0C350Ch
.text:00401000 var_C3500 = dword ptr -0C3500h
.text:00401000 var_C34FC = dword ptr -0C34FCh
.text:00401000 var_61A84 = dword ptr -61A84h
.text:00401000 var_61A7C = dword ptr -61A7Ch
.text:00401000 argc = dword ptr 8
.text:00401000 argv = dword ptr 0Ch
.text:00401000 envp = dword ptr 10h
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 mov eax, 0C3514h
.text:00401008 call __alloca_probe
.text:0040100D mov [ebp+var_61A84], 0
.text:00401017 mov [ebp+var_C350C], 1
.text:00401021 jmp short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023: ; CODE XREF: _main:loc_401097j
.text:00401023 mov eax, [ebp+var_C350C]
.text:00401029 add eax, 1
.text:0040102C mov [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032: ; CODE XREF: _main+21j
.text:00401032 cmp [ebp+var_C350C], 0F4240h
.text:0040103C jge short loc_401099
.text:0040103E mov [ebp+var_C3510], 1
.text:00401048 jmp short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A: ; CODE XREF: _main+95j
.text:0040104A mov ecx, [ebp+var_C3510]
.text:00401050 add ecx, 1
.text:00401053 mov [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059: ; CODE XREF: _main+48j
.text:00401059 cmp [ebp+var_C3510], 2710h
.text:00401063 jge short loc_401097
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
.text:00401095 jmp short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097: ; CODE XREF: _main+63j
.text:00401097 jmp short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099: ; CODE XREF: _main+3Cj
.text:00401099 mov [ebp+var_C3514], 1
.text:004010A3 jmp short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5: ; CODE XREF: _main+DCj
.text:004010A5 mov eax, [ebp+var_C3514]
.text:004010AB add eax, 1
.text:004010AE mov [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4: ; CODE XREF: _main+A3j
.text:004010B4 cmp [ebp+var_C3514], 2710h
.text:004010BE jge short loc_4010DE
.text:004010C0 mov ecx, [ebp+var_C3514]
.text:004010C6 imul ecx, 28h
.text:004010C9 mov edx, [ebp+var_61A84]
.text:004010CF add edx, [ebp+ecx+var_61A7C]
.text:004010D6 mov [ebp+var_61A84], edx
.text:004010DC jmp short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE: ; CODE XREF: _main+BEj
.text:004010DE mov eax, [ebp+var_61A84]
.text:004010E4 mov esp, ebp
.text:004010E6 pop ebp
.text:004010E7 retn
.text:004010E7 _main endp


The VB code is longer, so I post here only the main function :



.text:00401A81 loc_401A81:                             ; CODE XREF: .text:00401B18j
.text:00401A81 mov ecx, [ebp-68h]
.text:00401A84 mov eax, 0F4240h
.text:00401A89 cmp ecx, eax
.text:00401A8B jg loc_401B1D
.text:00401A91 mov edx, [ebp-18h]
.text:00401A94 mov edi, 1
.text:00401A99 add edx, 1
.text:00401A9C mov esi, edi
.text:00401A9E jo loc_401CEC
.text:00401AA4 mov [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7: ; CODE XREF: .text:00401B03j
.text:00401AA7 mov eax, 2710h
.text:00401AAC cmp esi, eax
.text:00401AAE jg short loc_401B05
.text:00401AB0 mov ebx, ds:__vbaGenerateBoundsError
.text:00401AB6 cmp esi, 2711h
.text:00401ABC jb short loc_401AC0
.text:00401ABE call ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0: ; CODE XREF: .text:00401ABCj
.text:00401AC0 cmp esi, 2711h
.text:00401AC6 jb short loc_401AD4
.text:00401AC8 call ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA cmp esi, 2711h
.text:00401AD0 jb short loc_401AD4
.text:00401AD2 call ebx
.text:00401AD4
.text:00401AD4 loc_401AD4: ; CODE XREF: .text:00401AC6j
.text:00401AD4 ; .text:00401AD0j
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
.text:00401AEC jo loc_401CEC
.text:00401AF2 add eax, esi
.text:00401AF4 mov [edx+esi*4+9C44h], ecx
.text:00401AFB jo loc_401CEC
.text:00401B01 mov esi, eax
.text:00401B03 jmp short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05: ; CODE XREF: .text:00401AAEj
.text:00401B05 mov ecx, [ebp-68h]
.text:00401B08 mov eax, 1
.text:00401B0D add eax, ecx
.text:00401B0F jo loc_401CEC
.text:00401B15 mov [ebp-68h], eax
.text:00401B18 jmp loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D: ; CODE XREF: .text:00401A8Bj
.text:00401B1D mov edi, 1
.text:00401B22 mov ebx, 2710h
.text:00401B27 mov esi, edi
.text:00401B29
.text:00401B29 loc_401B29: ; CODE XREF: .text:00401B5Fj
.text:00401B29 cmp esi, ebx
.text:00401B2B jg short loc_401B61
.text:00401B2D cmp esi, 2711h
.text:00401B33 jb short loc_401B3B
.text:00401B35 call ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B: ; CODE XREF: .text:00401B33j
.text:00401B3B mov ecx, [ebp-2Ch]
.text:00401B3E mov eax, [ebp-40h]
.text:00401B41 mov edx, [ecx+esi*4+9C44h]
.text:00401B48 add edx, eax
.text:00401B4A mov eax, edi
.text:00401B4C jo loc_401CEC
.text:00401B52 add eax, esi
.text:00401B54 mov [ebp-40h], edx
.text:00401B57 jo loc_401CEC
.text:00401B5D mov esi, eax
.text:00401B5F jmp short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61: ; CODE XREF: .text:00401B2Bj
.text:00401B61 mov ebx, ds:__vbaStrI4
.text:00401B67 mov ecx, 80020004h
.text:00401B6C mov [ebp-0B4h], ecx
.text:00401B72 mov [ebp-0A4h], ecx
.text:00401B78 mov [ebp-94h], ecx
.text:00401B7E mov ecx, [ebp-40h]
.text:00401B81 mov eax, 0Ah
.text:00401B86 push offset aDone ; "Done : "





share|improve this answer















UPDATE : Recent c++ compilers give much better results compare to VB6 compiler.



The VB6 codes shown above does not reflect the real case, where access index can be varied and calculations can be breaked into embraced functions.
More experiences show that VB6 has a huge penaty for optimizing when arrays used were passed as function input (by reference).



I tried severals c++ compilers and rewritten the benchmark codes where some random behaviors was added to trick the optimization. Certainly,
the codes could be better, all suggestions are welcome. I used -O3 option for maximizing the speed, multithread mode was not used.



RESULT:



The g++ 8.1 gives the best final result with 5 seconds with const index access (compiler known it, case shown in OP), and 6 seconds with dynamic index access.
VC 2017 takes 2 seconds only for const index access but 35 seconds for dynamic case.



VB6 S1 : Original version, n and c are local variables. VB6 S2: inner loops was moved into a function, n and c are input variables passed by reference.
Test condition : Intel Xeon W3690 / Windows 10



Here are the test result :



Executions time in seconds
The rewritten codes :



VB6:



DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)

For dd = 0 To 10000
n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
Next

End Sub

Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
For dd = 0 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
End Sub


Private Sub Form_Load()

Dim n(10001, 10) As Long
Dim c(10001, 10) As Long

Dim t0 As Long
Dim t1 As Long
Dim t2 As Long

Dim id As Long
Dim ret As Long

t0 = GetTickCount

For d = 1 To 1000000
id = d And 7
Call cal_var(n, c, id) 'For VB S2

'For dd = 0 To 10000
' n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
'Next

Next

t1 = GetTickCount

For d = 1 To 1000000
Call cal_const(n, c) 'For VB S2

'For dd = 0 To 10000
' n(dd, 1) = c(dd, 2) + c(dd, 3)
'Next
Next

t2 = GetTickCount

For d = 0 To 10000
Sum = Sum + n(d, t0 And 7)
Next

MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " miliseconds"

End Sub


C++ Code:



#include <iostream>
#include <time.h>
#include <string.h>

using namespace std;

#define NUM_ITERATION 1000000
#define ARR_SIZE 10000

typedef long long_arr[ARR_SIZE][10];

void calc_var(long_arr &n, long_arr &c, long id) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
}
}

void calc_const(long_arr &n, long_arr &c) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][0] = c[dd][1] + c[dd][2];
}
}

int main()
{
cout << "start ..." << endl;
time_t t0 = time(NULL);

long_arr n;
long_arr c;

memset(n, 0, sizeof(n));
memset(c, t0 & 1, sizeof(c));

for (long d = 1; d < NUM_ITERATION; d++) {
calc_var(n, c, (long)t0 & 7);
}

time_t t1 = time(NULL);

for (long d = 1; d < NUM_ITERATION; d++) {
calc_const(n, c);
}

time_t t2 = time(NULL);

long sum = 0;
for (int i = 0; i < ARR_SIZE; i++) {
sum += n[i][t0 & 7];
}

cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
cout << "with final result : " << sum << endl;

return 0;
}


The following original answer was based only on test done in VC2008, VB6 S1 case.



ORIGINAL ANSWER:



I have the same result with Progger. I use the following code to avoid the optimization (code blank) made by the compiler :



int main(int argc, char *argv) {

long n[10000][10];
long c[10000][10];
long sum = 0;

memset(n, 0, sizeof(n) );
memset(c, 0, sizeof(c) );

for (long d=1;d<1000000;d++){
for (long dd=1;dd<10000;dd++){
n[dd][1]=c[dd][2]+c[dd][3];
}
}

for (long dd=1;dd<10000;dd++){
sum += n[dd][1];
}

return sum;
}


I use the visual studio C++ 2008 to compile the code. it turns out that the compiler uses a lot of adress reallocation with local variable, mixing with an imul (multiplication) instruction, that could be very costly, for example :



.text:00401065                 mov     edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx


Each access to an index take one multiplication and one add instruction.



EDIT1: The problem is related to multidimension array access performance in C++. More information can be found here:



Array Lookups




C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.




The VB version use a traditional way (1D optimized flat memory), intructions used are only registers add (eax, edx, esi ...), they are much faster.



.text:00401AD4                 mov     eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi


That can answer the question of speed.
Advice: you should use existed libraries (for example: uBlas) if possible. More discussion can be found here.



Here is the machine code of C++ version :



.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main proc near ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514 = dword ptr -0C3514h
.text:00401000 var_C3510 = dword ptr -0C3510h
.text:00401000 var_C350C = dword ptr -0C350Ch
.text:00401000 var_C3500 = dword ptr -0C3500h
.text:00401000 var_C34FC = dword ptr -0C34FCh
.text:00401000 var_61A84 = dword ptr -61A84h
.text:00401000 var_61A7C = dword ptr -61A7Ch
.text:00401000 argc = dword ptr 8
.text:00401000 argv = dword ptr 0Ch
.text:00401000 envp = dword ptr 10h
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 mov eax, 0C3514h
.text:00401008 call __alloca_probe
.text:0040100D mov [ebp+var_61A84], 0
.text:00401017 mov [ebp+var_C350C], 1
.text:00401021 jmp short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023: ; CODE XREF: _main:loc_401097j
.text:00401023 mov eax, [ebp+var_C350C]
.text:00401029 add eax, 1
.text:0040102C mov [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032: ; CODE XREF: _main+21j
.text:00401032 cmp [ebp+var_C350C], 0F4240h
.text:0040103C jge short loc_401099
.text:0040103E mov [ebp+var_C3510], 1
.text:00401048 jmp short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A: ; CODE XREF: _main+95j
.text:0040104A mov ecx, [ebp+var_C3510]
.text:00401050 add ecx, 1
.text:00401053 mov [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059: ; CODE XREF: _main+48j
.text:00401059 cmp [ebp+var_C3510], 2710h
.text:00401063 jge short loc_401097
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
.text:00401095 jmp short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097: ; CODE XREF: _main+63j
.text:00401097 jmp short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099: ; CODE XREF: _main+3Cj
.text:00401099 mov [ebp+var_C3514], 1
.text:004010A3 jmp short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5: ; CODE XREF: _main+DCj
.text:004010A5 mov eax, [ebp+var_C3514]
.text:004010AB add eax, 1
.text:004010AE mov [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4: ; CODE XREF: _main+A3j
.text:004010B4 cmp [ebp+var_C3514], 2710h
.text:004010BE jge short loc_4010DE
.text:004010C0 mov ecx, [ebp+var_C3514]
.text:004010C6 imul ecx, 28h
.text:004010C9 mov edx, [ebp+var_61A84]
.text:004010CF add edx, [ebp+ecx+var_61A7C]
.text:004010D6 mov [ebp+var_61A84], edx
.text:004010DC jmp short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE: ; CODE XREF: _main+BEj
.text:004010DE mov eax, [ebp+var_61A84]
.text:004010E4 mov esp, ebp
.text:004010E6 pop ebp
.text:004010E7 retn
.text:004010E7 _main endp


The VB code is longer, so I post here only the main function :



.text:00401A81 loc_401A81:                             ; CODE XREF: .text:00401B18j
.text:00401A81 mov ecx, [ebp-68h]
.text:00401A84 mov eax, 0F4240h
.text:00401A89 cmp ecx, eax
.text:00401A8B jg loc_401B1D
.text:00401A91 mov edx, [ebp-18h]
.text:00401A94 mov edi, 1
.text:00401A99 add edx, 1
.text:00401A9C mov esi, edi
.text:00401A9E jo loc_401CEC
.text:00401AA4 mov [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7: ; CODE XREF: .text:00401B03j
.text:00401AA7 mov eax, 2710h
.text:00401AAC cmp esi, eax
.text:00401AAE jg short loc_401B05
.text:00401AB0 mov ebx, ds:__vbaGenerateBoundsError
.text:00401AB6 cmp esi, 2711h
.text:00401ABC jb short loc_401AC0
.text:00401ABE call ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0: ; CODE XREF: .text:00401ABCj
.text:00401AC0 cmp esi, 2711h
.text:00401AC6 jb short loc_401AD4
.text:00401AC8 call ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA cmp esi, 2711h
.text:00401AD0 jb short loc_401AD4
.text:00401AD2 call ebx
.text:00401AD4
.text:00401AD4 loc_401AD4: ; CODE XREF: .text:00401AC6j
.text:00401AD4 ; .text:00401AD0j
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
.text:00401AEC jo loc_401CEC
.text:00401AF2 add eax, esi
.text:00401AF4 mov [edx+esi*4+9C44h], ecx
.text:00401AFB jo loc_401CEC
.text:00401B01 mov esi, eax
.text:00401B03 jmp short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05: ; CODE XREF: .text:00401AAEj
.text:00401B05 mov ecx, [ebp-68h]
.text:00401B08 mov eax, 1
.text:00401B0D add eax, ecx
.text:00401B0F jo loc_401CEC
.text:00401B15 mov [ebp-68h], eax
.text:00401B18 jmp loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D: ; CODE XREF: .text:00401A8Bj
.text:00401B1D mov edi, 1
.text:00401B22 mov ebx, 2710h
.text:00401B27 mov esi, edi
.text:00401B29
.text:00401B29 loc_401B29: ; CODE XREF: .text:00401B5Fj
.text:00401B29 cmp esi, ebx
.text:00401B2B jg short loc_401B61
.text:00401B2D cmp esi, 2711h
.text:00401B33 jb short loc_401B3B
.text:00401B35 call ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B: ; CODE XREF: .text:00401B33j
.text:00401B3B mov ecx, [ebp-2Ch]
.text:00401B3E mov eax, [ebp-40h]
.text:00401B41 mov edx, [ecx+esi*4+9C44h]
.text:00401B48 add edx, eax
.text:00401B4A mov eax, edi
.text:00401B4C jo loc_401CEC
.text:00401B52 add eax, esi
.text:00401B54 mov [ebp-40h], edx
.text:00401B57 jo loc_401CEC
.text:00401B5D mov esi, eax
.text:00401B5F jmp short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61: ; CODE XREF: .text:00401B2Bj
.text:00401B61 mov ebx, ds:__vbaStrI4
.text:00401B67 mov ecx, 80020004h
.text:00401B6C mov [ebp-0B4h], ecx
.text:00401B72 mov [ebp-0A4h], ecx
.text:00401B78 mov [ebp-94h], ecx
.text:00401B7E mov ecx, [ebp-40h]
.text:00401B81 mov eax, 0Ah
.text:00401B86 push offset aDone ; "Done : "
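

For completeness, here is a reworked version of the benchmark — a sketch only, not code from the question — that avoids the undefined behaviour (correct main signature, initialized arrays) and uses the result so that an optimizing build (g++ -O2) cannot simply discard the loops. The checksum at the end exists purely for that purpose:

#include <array>
#include <cstdio>
#include <vector>

int main() {
    // Heap-allocated, zero-initialized storage instead of roughly 800 KB of
    // uninitialized locals on the stack.
    std::vector<std::array<long, 10> > n(10000);
    std::vector<std::array<long, 10> > c(10000);

    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }

    // Consume the result so the optimizer cannot drop the work as dead code.
    long sum = 0;
    for (long dd = 1; dd < 10000; dd++) {
        sum += n[dd][1];
    }
    std::printf("checksum = %ld\n", sum);
    return 0;
}

Even then, the inner loop does not depend on d, so an aggressive optimizer is still free to run it only once; the comments below explain why that makes this a questionable benchmark in the first place.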






share|improve this answer

edited Nov 16 '18 at 9:03

answered Nov 13 '18 at 16:50

tunglt

  • Cheers, so the g++ build is running loads slower... hmmm... I've run several tests since and VB6 wins hands down every time... also g++ won't allow array allocations over 10 million elements, while VB6 has no problem. So... VB6 is better than g++... nah! Something's not right...

    – Progger
    Nov 13 '18 at 17:59











  • This code has UB and the loops are still at least dodgy. Not a valid benchmark.

    – Baum mit Augen
    Nov 13 '18 at 20:14











  • @Progger: for the big array, you should use dynamic allocation, as discussed in this post: stackoverflow.com/questions/216259/… (see the sketch after this thread).

    – tunglt
    Nov 13 '18 at 20:55











  • @Baum mit Augen: thanks for pointing out the UB. Do you think the benchmark will be better if we fix it?

    – tunglt
    Nov 13 '18 at 21:03











  • It will be better, yes, in that the code would be valid, and the valid code produces different assembly. But using loops like this is already dodgy in itself, especially given that the inner loop does not depend on d. It raises the question of what exactly we are even trying to benchmark, which compiler optimizations would be desirable, and where we would want to trick the compiler so that it actually gives us meaningful results.

    – Baum mit Augen
    Nov 13 '18 at 21:14
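

On the point raised above about g++ refusing array allocations of more than 10 million elements: it is the stack-based declarations that fail, because local arrays of that size overflow the default stack, whereas heap storage has no such limit. A rough sketch — the sizes here are hypothetical, chosen only to exceed the 10-million-element mark:

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // 20 million longs: far too large for the stack, but fine on the heap.
    const std::size_t rows = 2000000;
    const std::size_t cols = 10;
    std::vector<long> big(rows * cols, 0);

    // Index the flat vector as big[r * cols + c]: one contiguous,
    // cache-friendly block instead of a vector of vectors.
    big[5 * cols + 2] = 42;
    std::printf("%ld\n", big[5 * cols + 2]);
    return 0;
}

A std::vector of std::vector<long> would also work, but the flat layout keeps the indexing close to what the VB6 code does with a two-dimensional Long array.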