Commit Graph

91 Commits

SHA1 Message Date
ca2225fe30 localtime() -> localtime_r() 2016-08-09 16:24:54 +02:00
35d4becbbc Remove g_stats.addSpiderPoint and related spider statistics 2016-07-25 17:03:51 +02:00
57e741a115 Remove crawlbot page and 'bulk' download of titlerec in csv files 2016-07-22 13:59:55 +02:00
7093ea212f Fix conversion from string literal to 'char *' for PageBasic 2016-05-31 11:26:01 +02:00
d40ecb2f8e Replace INT32/INT64 and the like with PRId32 and the like. Add space before definition. 2016-05-20 09:18:32 +02:00
5347395a01 Move getMatchingUrlPattern & updateSiteListBuf to Spider.cpp.
Call updateSiteListBuf when siteListIsEmpty is not valid for insitelist (fixing a bug where a page is forced deleted even when sitelist is empty)
2016-05-19 15:22:19 +02:00
eeebeec2e6 Use logTrace. Remove redundant addMetaList method. 2016-05-10 16:15:34 +02:00
34a4d08cd1 Add Url::set overload for all false boolean parameter 2016-04-05 23:21:50 +02:00
5ccfde3e2a Remove titleRecVersion from Url.h/Url.cpp. The checks there are for versions from before gigablast was open-sourced 2016-04-05 23:21:50 +02:00
e9236a724a Make a proper headerfile for printFrontPageShell() in PageRoot.h 2016-02-23 16:02:04 +01:00
79978dabad Remove commented-out code. Encode HTML entities when viewing in HTML format to prevent breaking UI. 2016-02-16 12:35:45 +01:00
8463f36179 Split Spider.cpp into Spider, SpiderColl and SpiderLoop. More to come.. 2016-01-30 19:41:54 +01:00
0601043a51 valgrind: avoid testing undefined bytes in SpiderColl::m_siteListIsEmpty 2016-01-25 17:07:04 +01:00
e0fac73695 Got rid of default parameter values in Url.cpp set functions. Added support for removal of common tracking parameters in URLs. 2016-01-17 22:49:02 +01:00
0b98f2c337 Remove some unused methods/class. Minor restructuring of test files. 2015-12-10 12:12:50 +01:00
1cac2e282e fix core from adding a lot of sites
to the sitelist.
2015-03-07 20:57:17 -07:00
30a77dd422 checkpoint on massive spidering speed ups. 2015-02-11 17:55:28 -08:00
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
783ae1d4e7 print chrome on other pages 2014-09-23 20:59:48 -07:00
f21f2a282a gigabot advice updates 2014-09-06 21:05:11 -07:00
c392e6eade fix minserpscore parm. support TYPE_DOUBLE parms. 2014-09-01 20:37:17 -07:00
91ab558eb5 fix widget from initializing with serps
because 'cd' was a text node, "No Results Found..."
and did not have the getAttribute() function available
thus causing a javascript error.
2014-09-01 19:24:04 -07:00
58d8861a34 widget page updates 2014-09-01 17:04:08 -07:00
7f622bd416 fixes for cloud support. 2014-08-31 16:23:11 -07:00
253ed05074 gigabot advice blurb. 2014-08-30 22:14:53 -07:00
811c9c3c4d show latest spider status msgs in page basic status.
show coll name, not num, in spider queue.
2014-07-29 07:14:54 -07:00
2a7f40fd43 fix core from adding site like http://xyz.com/?ddd 2014-07-15 06:26:27 -07:00
24aa79bc85 seed urls after tag: directives. 2014-07-14 07:13:55 -07:00
d5805733e5 more api updates 2014-07-13 09:35:44 -07:00
61afdc0fb9 fix core from adding/deleting collection 2014-07-12 08:23:40 -07:00
5f26918910 lots of bug fixes. more qa fixes. 2014-07-11 08:00:30 -07:00
13e295b89d Merge branch 'diffbot-matt' into testing 2014-07-11 05:41:22 -07:00
2a570cc1ef got tag: support working in the sitelist and url filters 2014-07-10 20:41:59 -07:00
b393a1bbbe Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
2014-07-10 10:06:55 -07:00
0da6063983 bring tags back in site list / url filters. 2014-07-10 07:44:16 -07:00
6434e5cc04 Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
	Parms.h
2014-07-07 09:49:59 -07:00
e4f43848d5 fix http://abc.com in sitelist some more. 2014-07-06 08:31:27 -07:00
3f8bbed3ac added note 2014-07-04 09:22:09 -07:00
888db7787d make patterns in the site list that start with
http: or https: behave like the regex ^http:
so it requires the start of the url match exactly.
fixes pattern "http://xyz.com/".
2014-07-04 09:18:46 -07:00
a09d4cd723 Merge branch 'master' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Pages.cpp
	XmlDoc.cpp
	gb.conf
2014-06-20 09:35:39 -07:00
327c866327 widget scrolling more continuous 2014-06-20 07:59:19 -07:00
ec1b66aff5 Merge branch 'master' into diffbot-testing 2014-06-03 20:50:59 -07:00
f8073b5adc Merge branch 'diffbot-testing' into diffbot-matt 2014-06-03 14:59:31 -07:00
a1f1daad16 Merge branch 'master' into diffbot-matt
Conflicts:
	Spider.cpp
2014-06-03 11:41:46 -07:00
6dcbc10e92 spider proxy updates. 2014-06-03 11:38:44 -07:00
ba2329808b fix siteListIsEmpty bug causing spider to
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
c06f9fde36 gigablast now has a notion of version based on the request 2014-05-27 20:11:12 -07:00
b1cd0cac86 indexing spider replies now working.
use type:status to see them or
gbstatus:success or gbstatus:tcp or gbstatus:0.
2014-05-09 18:07:38 -07:00
3bf52f0f2d if "wpid" is supplied try to update sitelist
for that wpid. hopefully we can get the wp admin
tools to send a /search?wpid=xxxx&sites=xyz.com request so
we can start spidering those sites before they even see the
widget. also it is simpler than trying to update m_siteListBuf
each time someone does a query since those can be hundreds
a second.
2014-05-07 16:10:26 -07:00
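
Commit 888db7787d describes site list patterns that start with http: or https: behaving like the regex ^http:, so that the match must begin at the start of the URL. The sketch below illustrates that rule only. The function name siteListPatternMatches is hypothetical and not taken from the Gigablast sources, and the substring fallback for patterns without a scheme is an assumption made for the sake of contrast.

```cpp
#include <string>

// Hypothetical helper illustrating the rule from commit 888db7787d.
// It is NOT the actual Gigablast implementation.
bool siteListPatternMatches(const std::string &pattern, const std::string &url) {
    // An empty pattern matches nothing in this sketch.
    if (pattern.empty()) return false;

    // rfind(s, 0) == 0 is a portable "starts with" check.
    const bool anchored = pattern.rfind("http:", 0) == 0 ||
                          pattern.rfind("https:", 0) == 0;

    if (anchored) {
        // Behave like the regex ^pattern: the URL must begin with the pattern.
        return url.size() >= pattern.size() &&
               url.compare(0, pattern.size(), pattern) == 0;
    }

    // Assumption: patterns without a scheme may match anywhere in the URL.
    return url.find(pattern) != std::string::npos;
}
```

Under these assumptions the pattern "http://xyz.com/" matches "http://xyz.com/page" but not "http://abc.com/?u=http://xyz.com/", because in the second URL the pattern appears only in the query string rather than at the start.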