ca2225fe30
localtime() -> localtime_r()
2016-08-09 16:24:54 +02:00
35d4becbbc
Remove g_stats.addSpiderPoint and related spider statistics
2016-07-25 17:03:51 +02:00
57e741a115
Remove crawlbot page and 'bulk' download of titlerec in csv files
2016-07-22 13:59:55 +02:00
7093ea212f
Fix conversion from string literal to 'char *' for PageBasic
2016-05-31 11:26:01 +02:00
d40ecb2f8e
Replace INT32/INT64 and likes with PRId32 and likes. Add space before definition.
2016-05-20 09:18:32 +02:00
5347395a01
Move getMatchingUrlPattern & updateSiteListBuf to Spider.cpp.
Call updateSiteListBuf when siteListIsEmpty is not valid for insitelist (fixing a bug where a page is forced deleted even when sitelist is empty)
2016-05-19 15:22:19 +02:00
eeebeec2e6
Use logTrace. Remove redundant addMetaList method.
2016-05-10 16:15:34 +02:00
34a4d08cd1
Add Url::set overload for all false boolean parameter
2016-04-05 23:21:50 +02:00
5ccfde3e2a
Remove titleRecVersion from Url.h/Url.cpp. The checks there are for versions from before gigablast was open-sourced
2016-04-05 23:21:50 +02:00
e9236a724a
Make a proper headerfile for printFrontPageShell() in PageRoot.h
2016-02-23 16:02:04 +01:00
79978dabad
Remove commented-out code. Encode HTML entities when viewing in HTML format to prevent breaking UI.
2016-02-16 12:35:45 +01:00
8463f36179
Split Spider.cpp into Spider, SpiderColl and SpiderLoop. More to come..
2016-01-30 19:41:54 +01:00
0601043a51
valgrind: avoid testing undefined bytes in SpiderColl::m_siteListIsEmpty
2016-01-25 17:07:04 +01:00
e0fac73695
Got rid of default parameter values in Url.cpp set functions. Added support for removal of common tracking parameters in URLs.
2016-01-17 22:49:02 +01:00
0b98f2c337
Remove some unused methods/class. Minor restructuring of test files.
2015-12-10 12:12:50 +01:00
1cac2e282e
fix core from adding a lot of sites
to the sitelist.
2015-03-07 20:57:17 -07:00
30a77dd422
checkpoint on massive spidering speed ups.
2015-02-11 17:55:28 -08:00
931a1c4bc6
good checkpoint. quite a few fixes.
2014-11-17 18:13:36 -08:00
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
783ae1d4e7
print chrome on other pages
2014-09-23 20:59:48 -07:00
f21f2a282a
gigabot advice updates
2014-09-06 21:05:11 -07:00
c392e6eade
fix minserpscore parm. support TYPE_DOUBLE parms.
2014-09-01 20:37:17 -07:00
91ab558eb5
fix widget from initializing with serps
because 'cd' was a text node, "No Results Found..."
and did not have the getAttribute() function available
thus causing a javascript error.
2014-09-01 19:24:04 -07:00
58d8861a34
widget page updates
2014-09-01 17:04:08 -07:00
7f622bd416
fixes for cloud support.
2014-08-31 16:23:11 -07:00
253ed05074
gigabot advice blurb.
2014-08-30 22:14:53 -07:00
811c9c3c4d
show latest spider status msgs in page basic status.
show coll name, not num, in spider queue.
2014-07-29 07:14:54 -07:00
2a7f40fd43
fix core from adding site like http://xyz.com/?ddd
2014-07-15 06:26:27 -07:00
24aa79bc85
seed urls after tag: directives.
2014-07-14 07:13:55 -07:00
d5805733e5
more api updates
2014-07-13 09:35:44 -07:00
61afdc0fb9
fix core from adding/deleting collection
2014-07-12 08:23:40 -07:00
5f26918910
lots of bug fixes. more qa fixes.
2014-07-11 08:00:30 -07:00
13e295b89d
Merge branch 'diffbot-matt' into testing
2014-07-11 05:41:22 -07:00
2a570cc1ef
got tag: support working in the sitelist and url filters
2014-07-10 20:41:59 -07:00
b393a1bbbe
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
2014-07-10 10:06:55 -07:00
0da6063983
bring tags back in site list / url filters.
2014-07-10 07:44:16 -07:00
6434e5cc04
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
Parms.h
2014-07-07 09:49:59 -07:00
e4f43848d5
fix http://abc.com in sitelist some more.
2014-07-06 08:31:27 -07:00
3f8bbed3ac
added note
2014-07-04 09:22:09 -07:00
888db7787d
make patterns in the site list that start with
http: or https: behave like the regex ^http:
so it requires the start of the url match exactly.
fixes pattern "http://xyz.com/ ".
2014-07-04 09:18:46 -07:00
a09d4cd723
Merge branch 'master' into diffbot-matt
Conflicts:
Collectiondb.cpp
Pages.cpp
XmlDoc.cpp
gb.conf
2014-06-20 09:35:39 -07:00
327c866327
widget scrolling more continuous
2014-06-20 07:59:19 -07:00
ec1b66aff5
Merge branch 'master' into diffbot-testing
2014-06-03 20:50:59 -07:00
f8073b5adc
Merge branch 'diffbot-testing' into diffbot-matt
2014-06-03 14:59:31 -07:00
a1f1daad16
Merge branch 'master' into diffbot-matt
Conflicts:
Spider.cpp
2014-06-03 11:41:46 -07:00
6dcbc10e92
spider proxy updates.
2014-06-03 11:38:44 -07:00
ba2329808b
fix siteListIsEmpty bug causing spider to
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
c06f9fde36
gigablast now has a notion of version based on the request
2014-05-27 20:11:12 -07:00
b1cd0cac86
indexing spider replies now working.
use type:status to see them or
gbstatus:success or gbstatus:tcp or gbstatus:0.
2014-05-09 18:07:38 -07:00
3bf52f0f2d
if "wpid" is supplied try to update sitelist
for that wpid. hopefully we can get the wp admin
tools to send a /search?wpid=xxxx&sites=xyz.com request so
we can start spidering those sites before they even see the
widget. also it is simpler than trying to update m_siteListBuf
each time someone does a query since those can be hundreds
a second.
2014-05-07 16:10:26 -07:00