TOT5 Workunit Stuck

Message boards : Closed Issues : TOT5 Workunit Stuck

marsinph
Joined: 13 Aug 19
Posts: 13
Credit: 2,205,884
RAC: 0
Message 3510 - Posted: 14 Aug 2019, 7:57:33 UTC - in response to Message 3483.  


Technical: the search space has been split into so-called segments. From each segment, one workunit is generated. The workunits can be made any size, but are currently set to run approximately two hours on my computer. Once the result of a workunit is returned to the server, it is picked up by the validator program. This program checks the result's validity, saves the result into the database and generates a new workunit to continue processing the segment.
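
The segment/workunit/validator cycle described above could look roughly like the following sketch. All names here (`Segment`, `Workunit`, `Result`, `make_workunit`, `validate_and_continue`) and the integer "position" model of a segment are assumptions made for illustration; they are not taken from the project's source code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical slice of the search space; `position` is how far it has been processed."""
    segment_id: int
    end: int
    position: int = 0


@dataclass
class Workunit:
    segment_id: int
    start: int
    stop: int


@dataclass
class Result:
    segment_id: int
    start: int
    reached: int  # position the client reports having reached


def make_workunit(seg: Segment, chunk: int) -> Optional[Workunit]:
    """Generate the next workunit for a segment, or None if the segment is finished."""
    if seg.position >= seg.end:
        return None
    return Workunit(seg.segment_id, seg.position, min(seg.position + chunk, seg.end))


def validate_and_continue(seg: Segment, wu: Workunit, result: Result,
                          database: List[Result], chunk: int) -> Optional[Workunit]:
    """Check a returned result, save it, and generate the follow-up workunit."""
    valid = (
        result.segment_id == wu.segment_id
        and result.start == wu.start
        and wu.start < result.reached <= wu.stop
    )
    if not valid:
        return None  # invalid results are neither saved nor continued in this sketch
    database.append(result)          # "saves the result into the database"
    seg.position = result.reached    # the segment continues where the result stopped
    return make_workunit(seg, chunk)
```

In this model a segment is processed by a chain of workunits, one at a time, so a single stuck workunit also stalls its segment until it is aborted or times out.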



Hello,
Set to run about two hours!?
I have four WUs on four different hosts (i7-2600K, overclocked to 4.2 GHz).
After 15-18 hours: 99.2% - 99.989%.
The slot directory changes.
Stderr.txt shows a checkpoint every 2-3 minutes!
But strangely, Task Manager shows very little CPU use.

Should I cancel? (host id 4149)

Best regards

<active_task>
    <project_master_url>https://boinc.tbrada.eu/</project_master_url>
    <result_name>tot5_51c_St9HdfLp97npeHT7T9omyZcVL_0</result_name>
    <checkpoint_cpu_time>3.884425</checkpoint_cpu_time>
    <checkpoint_elapsed_time>56160.656206</checkpoint_elapsed_time>
    <fraction_done>0.000000</fraction_done>
    <peak_working_set_size>4894720</peak_working_set_size>
    <peak_swap_size>2080768</peak_swap_size>
    <peak_disk_usage>34629</peak_disk_usage>
</active_task>
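
The mismatch between CPU time and elapsed time in this fragment can be quantified with a few lines of Python. This is only a sketch: the XML is an abbreviated copy of the fragment above, and the helper name `cpu_efficiency` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the <active_task> fragment posted above.
ACTIVE_TASK_XML = """<active_task>
    <project_master_url>https://boinc.tbrada.eu/</project_master_url>
    <result_name>tot5_51c_St9HdfLp97npeHT7T9omyZcVL_0</result_name>
    <checkpoint_cpu_time>3.884425</checkpoint_cpu_time>
    <checkpoint_elapsed_time>56160.656206</checkpoint_elapsed_time>
    <fraction_done>0.000000</fraction_done>
</active_task>"""


def cpu_efficiency(xml_text: str) -> float:
    """Return checkpoint CPU time divided by checkpoint elapsed (wall-clock) time."""
    task = ET.fromstring(xml_text)
    cpu = float(task.findtext("checkpoint_cpu_time"))
    elapsed = float(task.findtext("checkpoint_elapsed_time"))
    return cpu / elapsed if elapsed > 0 else 0.0


efficiency = cpu_efficiency(ACTIVE_TASK_XML)
elapsed_hours = 56160.656206 / 3600
print(f"elapsed: {elapsed_hours:.1f} h, CPU efficiency: {efficiency:.4%}")
```

With these numbers the task had been in its slot for roughly 15.6 hours while accumulating under four seconds of CPU time, which fits the observation that Task Manager shows almost no CPU use.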
ID: 3510 · Rating: 0
marsinph
Joined: 13 Aug 19
Posts: 13
Credit: 2,205,884
RAC: 0
Message 3511 - Posted: 14 Aug 2019, 15:08:00 UTC - in response to Message 3510.  

Additional info
On host 4152, WU 749394: running time 63,441 sec, CPU time 12.21 sec! 0.00000001% CPU usage. So I cancelled it.
Why block a core if the project and the WU hardly use it?

On host 4151, WU 749695: already running for one day and one hour, with a CPU time of 4 sec! It is at 99.987% and gains +0.001% about every 20 minutes.
But there are changes in the slot directory.
The checkpoint file also changes, but it is not readable text (only symbols).

The only valid one is from host 4150, WU 749580: running time 55,780 sec, CPU time 11,974 sec (20%).

We are very, very far away from the prediction of about two hours.

All my hosts have a power of 4 GFLOPS.
The project itself estimated a computing size of 40,000 GFLOP.
So for me it should take about 10,000 sec to finish, not 90,000!
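
The arithmetic behind this estimate, written out as a small sketch. It assumes the figures above are meant as 40,000 GFLOP of work, 4 GFLOPS of host speed and 90,000 seconds of observed run time (reading the dots as thousands separators); the function name is hypothetical.

```python
def estimated_runtime_seconds(work_gflop: float, speed_gflops: float) -> float:
    """Estimated run time = amount of work / sustained speed."""
    if speed_gflops <= 0:
        raise ValueError("speed must be positive")
    return work_gflop / speed_gflops


estimate = estimated_runtime_seconds(40_000, 4)  # figures quoted in the post
observed = 90_000                                # seconds, as quoted in the post
print(f"estimate: {estimate:.0f} s ({estimate / 3600:.2f} h)")
print(f"observed / estimate: {observed / estimate:.1f}x")
```

A benchmark figure such as 4 GFLOPS is only a rough guide to real application throughput, so some deviation is normal, but a ninefold overrun combined with almost no CPU time points to a stalled task rather than a slow one.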

All hosts run Win7 x64, with manually installed JRE, C++ and Visual Studio.

Who can explain ?

Best regards
ID: 3511 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3513 - Posted: 15 Aug 2019, 9:10:02 UTC

Please zip and upload the checkpoint, input and stderr files of the misbehaved workunits and then abort them.
Thank you for reporting.
If you have trouble uploading them to the Internet, you can use my email tomasbrod@azet.sk
Additionally, I will look at the database and see if there are any WUs with outrageous times.
The information that the CPU was not used very much is very important to me. It could be that the WU thinks it should be suspended. What happens when you restart BOINC and/or your computer? (Do it after the zip/upload.)
ID: 3513 · Rating: 0
marsinph
Joined: 13 Aug 19
Posts: 13
Credit: 2,205,884
RAC: 0
Message 3514 - Posted: 15 Aug 2019, 10:05:47 UTC - in response to Message 3513.  

Please zip and upload the checkpoint, input and stderr files of the misbehaved workunits and then abort them.
Thank you for reporting.
If you have trouble uploading them to the Internet, you can use my email tomasbrod@azet.sk
Additionally, I will look at the database and see if there are any WUs with outrageous times.
The information that the CPU was not used very much is very important to me. It could be that the WU thinks it should be suspended. What happens when you restart BOINC and/or your computer? (Do it after the zip/upload.)



Hello,
Thank you.
Mail sent to your address.
Best regards
ID: 3514 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3517 - Posted: 16 Aug 2019, 10:14:03 UTC

There have been a few reports of tasks running for much more than the target two hours.
ID: 3517 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3518 - Posted: 16 Aug 2019, 10:19:16 UTC - in response to Message 3514.  

Mail sent to your address.

I have not received any email.
Can you please upload it to Google, Yandex, Mediafire, Mega or others?
I will grant you credit for the failed tasks.
ID: 3518 · Rating: 0
marsinph
Joined: 13 Aug 19
Posts: 13
Credit: 2,205,884
RAC: 0
Message 3519 - Posted: 16 Aug 2019, 17:06:55 UTC - in response to Message 3518.  

Mail sent to your address.

I have not received any email.
Can you please upload it to Google, Yandex, Mediafire, Mega or others?
I will grant you credit for the failed tasks.



Of course, the mail I sent to the address you provided arrived in my private inbox here on the project!!!

Please send me a valid address by private message. Yandex very often blocks mail from the "west".

I am sending you my personal email address by private message, so you will be able to reply securely.

Best regards
ID: 3519 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3524 - Posted: 17 Aug 2019, 12:16:17 UTC

I am sorry that you are experiencing issues. I finally received the mail from ****@skynet.be. I do not have time to look at it immediately. The needed checkpoint files have been obtained, which is great. The files are needed to (hopefully) replicate the issue and, eventually, fix it.

... I think I am not a little cruncher and I think I have some experience. Of course, I will recommend my team (100th world) to not crunch at this time

Of course you are not required to crunch for this project. Nobody is. Including myself. If you do not want, then you can leave. You are of course welcome, if you choose to return when the issue is solved.

The workunits are estimated to run two hours on modern hardware. If they run longer, then something is not right. First try restarting BOINC and/or the computer; if that does not help, abort the workunits and set the project to suspend.
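
For those who prefer the command line over the BOINC Manager, the abort and suspend steps can be done with `boinccmd`, the command-line tool that ships with the BOINC client. The sketch below only builds the command lines and prints them; the helper names are made up, and the task name is the one posted earlier in this thread, used purely as an example.

```python
import shlex

PROJECT_URL = "https://boinc.tbrada.eu/"


def abort_task_cmd(task_name: str, project_url: str = PROJECT_URL) -> list:
    """Command line that aborts one task: boinccmd --task URL task_name abort."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def suspend_project_cmd(project_url: str = PROJECT_URL) -> list:
    """Command line that suspends a project: boinccmd --project URL suspend."""
    return ["boinccmd", "--project", project_url, "suspend"]


# Example task name taken from the <active_task> fragment earlier in the thread.
commands = [
    abort_task_cmd("tot5_51c_St9HdfLp97npeHT7T9omyZcVL_0"),
    suspend_project_cmd(),
]
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` on a machine where the BOINC client is running and `boinccmd` is on the PATH.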

As a reaction to the recent problems, the "tot5" application has been marked as beta to not waste computation resources.
Thank you.
ID: 3524 · Rating: 0
marsinph
Joined: 13 Aug 19
Posts: 13
Credit: 2,205,884
RAC: 0
Message 3527 - Posted: 17 Aug 2019, 18:13:39 UTC - in response to Message 3524.  

Hello Tomas Brada,
As you say, you finally received my reports. At least one of the three.
Thank you.
You could have answered me directly by email.
Thank you for not publishing the address in full, but some clever people now know it!!!
I wrote to you privately and you made it public!!!
And now this mailbox is flooded with spam!!!

By the way, you wrote about a two-hour run time on a "modern" configuration.
Your configuration: https://boinc.tbrada.eu/hosts_user.php?userid=1
is 25% more powerful:
https://www.cpubenchmark.net/high_end_cpus.html

Two hours for you, more than two days for me!!!

The only explanation is that your WUs are developed for the AMD/Linux architecture.
By the way, if I look at the statistics (top users, top computers), it seems to be so!

I (and my team) will come back when this is clarified!

Best regards. You have my private email address, so please use it, and in English.
ID: 3527 · Rating: 0
marsinph
Joined: 13 Aug 19
Posts: 13
Credit: 2,205,884
RAC: 0
Message 3533 - Posted: 20 Aug 2019, 17:52:24 UTC - in response to Message 3527.  

Already three days and no explanation!
No answer.
No response.
Nothing.

Nice!!!
Perhaps not the best way to attract crunchers!
Look at the number of crunchers and results on ODLK, ODLK1 and here: as good as nothing!
Of course, never an answer.

Do not forget that the best motivation of big crunchers is based on SETIBZH and the world challenge between teams.
The worst is that on ODLK1, the admin says he is on holiday, but does not say for how long! Nice consideration for crunchers!!!
ID: 3533 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3534 - Posted: 20 Aug 2019, 20:51:35 UTC - in response to Message 3533.  

I looked into the issue and found no answer.
Your result reports having run for only 31 seconds, but the log says it was running for many hours.
The checkpoint is even written correctly.
When I resumed the result from a checkpoint, there was no issue. Even the debugger confirmed that the search is progressing.
The same workunit (sent to another cruncher, because you aborted them as I told you) completed without issues in 2.5 hours on Windows 10.
I submitted the one workunit checkpoint that you sent for validation, and it validated OK. See.
ID: 3534 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3535 - Posted: 20 Aug 2019, 21:00:42 UTC
Last modified: 2 Sep 2019, 8:03:25 UTC

If you think you can solve the issue better/faster than me, please take a look at the source code on GitHub.
Edit: I do not mean to sound rude. Seriously, I do not know where the issue is.
ID: 3535 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3579 - Posted: 6 Sep 2019, 17:41:47 UTC

I will add a condition to terminate the task if the deadline is close or if the elapsed GFLOP count is far over the estimate. This will at least give credit for the good part of the task.
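
A condition of this kind might be sketched as follows. This is not the code that went into the application: the function name, the six-hour safety margin and the overrun factor of 5 are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def should_terminate(now: datetime, deadline: datetime,
                     elapsed_gflop: float, estimated_gflop: float,
                     safety_margin: timedelta = timedelta(hours=6),  # assumed value
                     overrun_factor: float = 5.0) -> bool:           # assumed value
    """Stop early if the deadline is close or the work done is far over the estimate."""
    deadline_close = deadline - now <= safety_margin
    over_budget = elapsed_gflop > overrun_factor * estimated_gflop
    return deadline_close or over_budget


UTC = timezone.utc
deadline = datetime(2019, 9, 21, tzinfo=UTC)

# Plenty of time left and within budget: keep running.
print(should_terminate(datetime(2019, 9, 14, tzinfo=UTC), deadline, 10_000, 40_000))
# One hour before the deadline: terminate and report the partial result.
print(should_terminate(datetime(2019, 9, 20, 23, tzinfo=UTC), deadline, 10_000, 40_000))
# Ten times the estimated work already spent: terminate.
print(should_terminate(datetime(2019, 9, 14, tzinfo=UTC), deadline, 400_000, 40_000))
```

Note that a check like this depends entirely on the deadline being correct: if the deadline passed in lies in the past, the task terminates on the very first check.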
ID: 3579 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3591 - Posted: 13 Sep 2019, 22:33:48 UTC

Added a few days ago, for Linux only.
ID: 3591 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3592 - Posted: 14 Sep 2019, 8:13:58 UTC
Last modified: 14 Sep 2019, 18:56:06 UTC

Thanks for testing the appversion 39 (tot5/5.08). Your computer 4332 is returning tasks with very short run times.
I would like to investigate why, but unfortunately I deleted the error log. Can you please resume computation on that host and finish a couple more short workunits?


n3eo: Hello Tomas,

I have processed another run of tasks on this host. The only indication I get is 'Message from task: Time limit reached!' in the BOINC log. Most tasks finish 'successfully' after about 1 minute, some quit with an error.

Success (still time limit reached): https://boinc.tbrada.eu/result.php?resultid=1068356
Error: https://boinc.tbrada.eu/result.php?resultid=1068372

Thanks,
Mario

It seems the real time limit is computed incorrectly.
For example, result 1068288 was sent on 14 Sep with a deadline of 21 Sep, but the app calculated that the deadline was 11 Sep and thus finished immediately.
This is a protection against tasks missing the deadline and being wasted.
The error of task 1068369 is just an error in the base code of BOINC.
ID: 3592 · Rating: 0
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 667
Credit: 432,784
RAC: 0
Message 3594 - Posted: 14 Sep 2019, 13:12:22 UTC

n3eo: I think the error was caused by your minimum buffer time being set too high.
ID: 3594 · Rating: 0

©2024 Tomáš Brada