41 Commits

SHA1 Message Date
bad4eac887 Moved collnum_t type to separate header file 2017-10-06 14:32:45 +02:00
804d838cd9 Msg13Request: use offsetof() instead of pointer calculations 2017-03-10 12:56:32 +01:00
2788f518c4 #include cleanup in Msg13.* 2016-11-11 15:51:30 +01:00
fd6e8cbb21 Remove more qa specific code 2016-07-26 15:19:01 +02:00
570838d7f6 Remove m_isCustomCrawl & logic surrounding it 2016-07-22 13:59:55 +02:00
4cd54ec39e Include RdbCache.h only in files where needed. Use fwd-decl elsewhere. 2016-07-18 16:22:06 +02:00
feafdf1dca Partial fix conversion from string literal to 'char *' for XmlDoc 2016-05-31 11:26:01 +02:00
8dab0db467 Remove extra semicolons 2016-05-19 18:37:26 +02:00
6116fc502f Remove always true m_forwardDownloadRequest variable 2016-05-10 16:15:32 +02:00
5e75b2c607 Remove unused variable & commented out code 2016-05-10 16:15:32 +02:00
8e559558e4 Marked a portion of global variables and functions as static
Found with Flexelint. Some of them were not used at all.
2016-04-27 11:50:07 +02:00
ef3b8a343b Removed explicit m_buf[0] from Msg13Request 2016-04-04 13:43:09 +02:00
ab0b9d03ea Standardize header guards 2016-03-08 22:16:02 +01:00
166f7a80a0 1-bit bit fields must be unsigned
Using int32_t or any signed type leads to undefined behaviour.
In gcc's case the possible values are actually 0 and -1
2016-03-01 14:51:00 +01:00
cd095e66bc Remove FORMAT_PROCOG & related codes. Remove more scraping code/google detection code 2016-01-05 12:17:17 +01:00
7fcc2ab4e1 in the sockets table page,
show url download requests that are queued up to prevent
hammering an ip. also show the first 500 bytes of the send buf
in the http server sockets table.
2015-08-25 09:34:45 -07:00
86800a0656 if a root/seed url has no outlinks, assumed banned. 2015-05-04 14:23:28 -07:00
1825f6bd27 retry download if was in the twitchy table
at start of download, and not using proxies at all.
2015-04-30 16:06:13 -07:00
6d8bb19962 checkpoint for auto proxy logic 2015-04-30 13:28:57 -07:00
6fc83566e2 more fixes 2015-02-02 14:06:38 -08:00
c15bd53e52 added support for supplying basic proxy authorization
to spider proxies. username:password@1.2.3.4:80
2015-02-02 13:23:38 -08:00
8e315504a2 fix empty rdbcache bug of not enough buf mem. 2014-11-27 13:17:00 -08:00
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
65800b65cf fix so diffbot doesn't timeout due
to large floater/proxy backoff crawl delay.
append &timeout=MAXCRAWLDELAY to diffbot api url.
2014-10-07 14:32:38 -07:00
c2f98a81b6 fix floater bug from reading hashtable off disk.
force use floaters if ! useRobots and is diffbot crawl.
2014-09-26 15:30:42 -07:00
6a28250e94 get qa test working after nyt bug fix 2014-08-06 16:00:25 -07:00
947be58f10 Merge branch 'diffbot-testing' into testing
Conflicts:
	HttpRequest.cpp
	Msg13.cpp
	XmlDoc.cpp
2014-08-05 17:19:53 -07:00
cc1ceaaac2 fix nyt.com cookie redir bug.
fixed bug when POSTing injection request with multipart/form-data.
2014-08-05 17:04:11 -07:00
05fcef9651 more vote infusion and squid proxy fixes. 2014-07-09 14:57:58 -07:00
ea90e7f755 more fixes for sectiondb markup code 2014-06-12 13:05:45 -07:00
7d452a766c completed squid proxy simulation code 2014-06-09 12:42:05 -07:00
965d992f98 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Msg13.cpp
2014-06-06 15:14:41 -07:00
3f2dcda4e1 got new floater/proxy logic compiling. 2014-06-06 15:11:51 -07:00
ce7294e9a9 more mem leak fixes for fake
bulk job empty http replies
2014-06-05 20:09:12 -07:00
ee5af6b30e more spider proxy fixes 2014-06-02 14:59:15 -07:00
ca450e6bbd using msg55 when done downloading through a proxy to record
stats for load balancing on host 
2014-06-02 13:48:33 -07:00
b6e5424e32 do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
2014-03-21 12:40:38 -07:00
0f3374e3f3 measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
2013-11-26 14:07:28 -08:00
e8065a0f0a enforce crawl delay perfectly. 2013-11-22 18:26:34 -08:00
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00