Hey Zack,
I know you are super busy packing for your big move, but since tomorrow is the hard deadline for GSoC, I thought it's not a bad idea to give you a final update.
So most of last week I spent on the conflict that occurred between libcurl and libevent having their eyes on the same socket. I tried different ugly and uglier solutions but finally settled on this one which is not that bad:
http://stackoverflow.com/questions/12021217/how-to-ask-libcurl-not-to-listen...
After that, I'm passing all the timeline tests now. However, when I look at the client log, I see that some sequence numbers are missing in the recv lines, like this:
203.7169 [debug] recv: <3.81> receiving block 42 <d=0 p=9 f=DAT>
T:203.7170: ckt 3 <ntp 1 outq 0>: recv 42 <d=0 p=9 f=DAT>
and the next recv is like this:
203.7926 [debug] recv: <3.99> receiving block 60 <d=0 p=1 f=DAT>
T:203.7927: ckt 3 <ntp 1 outq 0>: recv 60 <d=0 p=1 f=DAT>
Is this for sure an error and a loss of data, or is there a normal situation in which such a thing can happen? For example, when a whole block can't be sent over one connection? (Otherwise, how do the timeline tests all pass?)
Furthermore, I merged the payload scraper and StegoTorus. This way, http_apache checks for the payload database and, if it's not there, calls the scraper (this requires Apache to be installed on the same machine). All of this is committed on GitHub.
I started writing a unit test that fetches a webpage both directly (using curl) and through StegoTorus and checks whether the results are the same. I thought curl was the best approximation of a browser available to us. But since I don't think I can finish that by tomorrow, I'm going to spend the remaining time documenting the 'uri transport protocol' and dealing with small issues.
After we are done with the evaluation, I'll go back to unit testing.
Thanks for your help.
Cheers, vmon
On Sun, Aug 19, 2012 at 8:25 PM, vmon <vmonmoonshine@gmail.com> wrote:
Hey Zack,
I know you are super busy packing for your big move, but since tomorrow is the hard deadline for GSoC, I thought it's not a bad idea to give you a final update.
I appreciate it. I am still in the middle of moving and don't have the time to go over your changes in detail, but I read over the diff between 'master' and 'f--payload-servers' and I like it for the most part. I don't think we want to _merge_ it just yet, but it's definitely something we can use. I like the URL dictionary concept and the use of query parameters, and it looks like you've got some good cleanups in here too.
I have a few specific change requests for you.

First: boost should not be used. I know it's convenient when you're writing the code, but it causes so many deployment headaches that it's best avoided. It *appears* that you're only using one thing, boost::filesystem::exists, to check whether a file exists before attempting to open it. That's an antipattern. The correct approach is: don't check first, just attempt to open the file. If it succeeds, the file existed. If it fails, then check whether errno == ENOENT to find out whether the problem is that the file doesn't exist. (If you needed boost for something else, please let me know and we'll figure out a replacement.)

Second: do not include the header <iostream>, ever. It is only *necessary* to include that header to get 'cin', 'cout', or 'cerr', and none of those should ever be used in this program. (For what you were doing you probably want to include <istream> instead.) (This will also mean that you don't need to add std::__ioinit to the global variables whitelist.)

Third: I appreciate your having fixed my sloppy fwrite()s, but please, whenever you print an error message as a direct consequence of an I/O operation failing, include strerror(errno). That is, instead of

    log_debug("Error writing data");

it should be

    log_warn("Error writing data: %s", strerror(errno));
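The open-then-check-errno pattern described above could be sketched like this (a hypothetical illustration only; try_read_payload_db, the path, and the fallback behavior are invented for the example, not the actual StegoTorus code, and a stderr message stands in for the project's log_warn):

```cpp
#include <cerrno>   // errno, ENOENT
#include <cstdio>   // std::fopen, std::fclose, std::fprintf
#include <cstring>  // std::strerror

// Instead of testing for existence first, attempt the open and then
// inspect errno to distinguish "does not exist" from other failures.
bool try_read_payload_db(const char *path)
{
  FILE *fp = std::fopen(path, "rb");
  if (!fp) {
    if (errno == ENOENT)
      return false;  // file doesn't exist; e.g. fall back to the scraper

    // Some other failure: report it, including strerror(errno).
    std::fprintf(stderr, "error opening %s: %s\n",
                 path, std::strerror(errno));
    return false;
  }
  // ... read the database here ...
  std::fclose(fp);
  return true;
}
```

This avoids the race inherent in check-then-open: the file's state between the existence check and the open can change, whereas the open itself is atomic.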
... I really gotta find time to write a style guide for this thing :-/
Also, could you please explain in detail the problem you were having with the DECLARE macros? If they are getting in your way, then we should change them, but I do not understand what the issue is.
So most of last week I spent on the conflict that occurred between libcurl and libevent having their eyes on the same socket. I tried different ugly and uglier solutions but finally settled on this one which is not that bad:
http://stackoverflow.com/questions/12021217/how-to-ask-libcurl-not-to-listen...
After that, I'm passing all the timeline tests now.
Great.
However, when I look at the client log, I see that some sequence numbers are missing in the recv lines, like this:
203.7169 [debug] recv: <3.81> receiving block 42 <d=0 p=9 f=DAT>
T:203.7170: ckt 3 <ntp 1 outq 0>: recv 42 <d=0 p=9 f=DAT>
and the next recv is like this:
203.7926 [debug] recv: <3.99> receiving block 60 <d=0 p=1 f=DAT>
T:203.7927: ckt 3 <ntp 1 outq 0>: recv 60 <d=0 p=1 f=DAT>
Is this for sure an error and a loss of data, or is there a normal situation in which such a thing can happen? For example, when a whole block can't be sent over one connection? (Otherwise, how do the timeline tests all pass?)
Blocks are cut to fit, so it never happens that "the whole block can't be sent over one connection". What _can_ happen is that blocks arrive in a different order than they were sent. Sometimes they get reordered to an extent that seems incredible, e.g. block 60 arriving immediately after block 42 as in your example (I suspect if you look at the complete log you'll see that what actually happened is that block 42 was *delayed* until after blocks 43 through 59 had already arrived).
That's perfectly normal by itself, but there is a bug. If block K+128 arrives while the receiver is still waiting for block K, the receiver will declare a protocol error and kill the entire circuit. That's by design; the bug is that the sender has no way of knowing that the receiver is still waiting for block K. This is what I'm supposed to be fixing right now (by introducing explicit acknowledgments).
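The receiver-side rule described above could be sketched roughly as follows (illustrative only: the window of 128 comes from the text, while the names, types, and buffering details are assumptions, not the actual StegoTorus implementation):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch of the reordering rule described in the text.
// 'next_expected' is block K; anything in [K, K+128) is buffered until
// the gap fills; a block at K+128 or beyond is a protocol error.
struct reorder_buffer
{
  static const uint32_t WINDOW = 128;
  uint32_t next_expected = 0;
  std::map<uint32_t, std::vector<uint8_t>> pending;

  // Returns false on the protocol error that kills the circuit.
  bool receive(uint32_t seq, std::vector<uint8_t> data)
  {
    if (seq < next_expected)
      return true;                  // duplicate of a delivered block; ignore
    if (seq >= next_expected + WINDOW)
      return false;                 // receiver declares a protocol error

    pending[seq] = std::move(data); // hold out-of-order blocks
    while (pending.count(next_expected)) {
      // ... deliver pending[next_expected] upstream here ...
      pending.erase(next_expected);
      ++next_expected;
    }
    return true;
  }
};
```

The bug described above is visible in this shape: nothing here tells the *sender* what next_expected is, so it cannot know when it is about to overrun the window; explicit acknowledgments would close that gap.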
Furthermore, I merged the payload scraper and StegoTorus. This way, http_apache checks for the payload database and, if it's not there, calls the scraper (this requires Apache to be installed on the same machine). All of this is committed on GitHub.
Sounds good. When we get a config file we'll want to be able to say that we expect to use one or the other, but for now this is a good approach.
Are we now exclusively using libcurl to generate requests or is there still legacy hand-generation code in there?
I started writing a unit test that fetches a webpage both directly (using curl) and through StegoTorus and checks whether the results are the same. I thought curl was the best approximation of a browser available to us. But since I don't think I can finish that by tomorrow, I'm going to spend the remaining time documenting the 'uri transport protocol' and dealing with small issues.
After we are done with the evaluation, I'll go back to unit testing.
Sounds good.
zw
Hello Zack,
I hope you have already settled down and your new home feels homey by now.
First, let me start with the thing that took me the longest (not) to "tackle":
Sometimes they get reordered to an extent that seems incredible, e.g. block 60 arriving immediately after block 42 as in your example
After talking to Nick, exchanging emails on the curl mailing list, and hacking and recompiling the curl code, guess what the problem finally turned out to be: my terminal was dropping part of stderr and selectively printing the log_debug output, hence showing only recv 42 and recv 60 but nothing in between. That was why everything was going smoothly while I was apparently "losing" dozens of blocks every so often. When I redirected stderr to a file, I got all the packets in order. So we are good here.
First: boost should not be used.
I'm using boost for recursively traversing the document root directory of the HTTP server in PayloadScraper. Since the damage was already done, I also ended up using its other functions, like checking for the existence of a file. If you know another option that does the same thing (iterating through all files in subdirectories recursively) in a hassle-free way and is portable to Windows, then I'm at your service to replace boost.
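For comparison, a boost-free recursive traversal can be done with POSIX dirent, though as written it is not portable to Windows (which would need FindFirstFile/FindNextFile instead). This is a hypothetical sketch; scan_dir is an invented name, not part of PayloadScraper:

```cpp
#include <dirent.h>    // opendir, readdir, closedir (POSIX)
#include <sys/stat.h>  // stat, S_ISDIR, S_ISREG
#include <string>
#include <vector>

// Collect the paths of all regular files under 'root', recursing into
// subdirectories. POSIX-only sketch of a boost::filesystem replacement.
static void scan_dir(const std::string &root, std::vector<std::string> &files)
{
  DIR *d = opendir(root.c_str());
  if (!d)
    return;  // real code would log strerror(errno) here

  struct dirent *ent;
  while ((ent = readdir(d)) != NULL) {
    std::string name = ent->d_name;
    if (name == "." || name == "..")
      continue;  // skip self and parent entries

    std::string path = root + "/" + name;
    struct stat st;
    if (stat(path.c_str(), &st) != 0)
      continue;  // entry vanished or is unreadable; skip it

    if (S_ISDIR(st.st_mode))
      scan_dir(path, files);      // recurse into subdirectory
    else if (S_ISREG(st.st_mode))
      files.push_back(path);      // record a regular file
  }
  closedir(d);
}
```

The Windows-portability concern raised above is real: this would need a separate code path (or a small wrapper) on Windows, which is exactly the convenience boost::filesystem provides.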
Second: do not include the header <iostream>
Got rid of it.
whenever you print an error message as a direct consequence of an I/O operation failing, include
Done.
the problem you were having with the DECLARE macros?
This is the issue: http_steg and http_apache_steg have a lot in common. In C++ fashion, one would just inherit http_apache_steg from http_steg and override only those functions that are different. But with the DECLARE macros there are two problems:
1. You have to re-declare all the functions, even those that are exactly the same, and I have a few of these:
    int http_apache_steg_t::receive(struct evbuffer *dest)
    {
      return http_steg_t::receive(dest);
    }
2. If I were to follow the procedure rigorously, even the workaround in #1 would not be possible, because it requires that http_steg be defined in an anonymous namespace, so I couldn't inherit another structure from it in another file; so I had to take http_steg out of its own namespace.
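The plain-inheritance alternative being argued for might look roughly like this (a hypothetical sketch using the class names from this thread; the real module layout, the DECLARE macros, and the actual virtual interface are not shown in these emails):

```cpp
#include <cstddef>

struct evbuffer;  // opaque libevent buffer type, forward-declared

// With ordinary C++ inheritance, the derived module overrides only
// what differs and inherits the rest; no forwarding stubs needed.
struct http_steg_t
{
  virtual ~http_steg_t() {}
  virtual int transmit(struct evbuffer *source) { return 0; /* ... */ }
  virtual int receive(struct evbuffer *dest)    { return 0; /* ... */ }
};

struct http_apache_steg_t : http_steg_t
{
  // Only the Apache-specific behavior is overridden; receive() is
  // inherited unchanged rather than redefined as a one-line stub.
  virtual int transmit(struct evbuffer *source) { return 1; /* ... */ }
};
```

For this to work, the base class has to be visible to the derived class's translation unit, which is exactly what an anonymous namespace prevents.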
So it's your call: if it's OK as is, fine; if not, we can get rid of the DECLAREs.
Cheers, vmon